Forem: u11d

AWS CloudFront Cache Policies: Complete Guide

Maciej Łopalewski — Wed, 20 May 2026 11:30:24 +0000

A CloudFront cache policy controls two things: the cache key (which combination of URL, headers, cookies, and query strings makes a request unique) and the TTL (how long CloudFront keeps an object at the edge before re-checking the origin). Those two settings together determine your cache hit ratio.

The CloudFront console currently shows fifteen managed cache policies. Five are broadly useful on a typical self-managed distribution: CachingOptimized, CachingOptimizedForUncompressedObjects, CachingDisabled, UseOriginCacheControlHeaders, and UseOriginCacheControlHeaders-QueryStrings. One (Elemental-MediaPackage) is for AWS Elemental MediaPackage video origins. The remaining nine are Amplify-related — a standalone Amplify policy for Amplify origins plus eight Amplify-* policies that Amplify Hosting attaches to its own distributions automatically. You can also build a custom policy when none of the managed ones fit the shape of your traffic.

This is a follow-up to my earlier post on CloudFront's three policy types, going one level deeper on just cache policies.

What a cache policy controls

A cache policy has three categories of settings: policy information (name and description, just metadata), TTL settings (Minimum, Maximum, Default), and cache key settings (which headers, cookies, and query strings to include). Everything that affects caching behavior lives in the latter two.

The cache key is the unique identifier CloudFront uses to look up an object at an edge location. If two viewer requests produce the same cache key, the second one is a cache hit. If they differ — say, one request includes a ?ref=twitter query string and the other does not, and your policy includes query strings in the cache key — they get treated as separate objects, even when the response body is identical. Cache key shape is the single biggest lever for hit ratio.

The TTL settings work alongside Cache-Control and Expires headers from your origin to determine how long a cached object stays valid at the edge. They behave subtly differently from each other; more on that next.

How Minimum, Maximum, and Default TTL actually work

The three TTL settings are not redundant. Each one applies in a different scenario, and getting them confused is one of the more common ways to either over-cache stale content or hammer your origin unnecessarily.

Default TTL is used when your origin sends no Cache-Control or Expires header at all. CloudFront falls back to this value, subject to the Minimum TTL floor — if Minimum TTL is greater than Default TTL, CloudFront caches for at least the Minimum TTL. The default for the AWS managed CachingOptimized policy is 86,400 seconds (24 hours).

Maximum TTL caps the TTL when your origin does send Cache-Control or Expires. If your origin says Cache-Control: max-age=5184000 (60 days) and Maximum TTL is 31,536,000 (365 days), CloudFront honors the origin's 60 days. If your origin says max-age=63072000 (730 days), CloudFront caps it at 365 days. This setting only matters when you want to override an origin that's claiming overly aggressive cache durations.

One nuance worth knowing: if your origin sends both max-age and s-maxage, CloudFront uses s-maxage for its own caching decisions and lets browsers use max-age. This is how you get different cache durations at the edge versus the browser without writing two policies.

Minimum TTL is the floor. CloudFront keeps the object for at least this long, no matter what. This one comes with a sharp edge: if Minimum TTL is greater than 0, CloudFront ignores Cache-Control: no-cache, no-store, and private directives from your origin. The object gets cached anyway, for at least the Minimum TTL duration. Both CachingOptimized and CachingOptimizedForUncompressedObjects have Minimum TTL of 1 second, and the standalone Amplify policy has Minimum TTL of 2 seconds — meaning you cannot reliably stop caching with origin headers alone if you are using them.

If you set all three TTLs to 0, caching is effectively disabled — which is exactly what CachingDisabled does.

Cache key settings: headers, cookies, query strings, and compression

Each of the three viewer-data sources (headers, cookies, query strings) can be configured independently. Cookies and query strings have four modes: include none, include all, include specifically named ones, or include all-except-named-ones. Headers are the exception — they support only "none" or a specific list. There is no "all headers" option, because that would be unbounded and would fragment the cache catastrophically.

When you include a header, cookie, or query string in the cache key, CloudFront also automatically forwards it to the origin on cache misses. The cache key and the origin request are coupled by default; the only way to forward something to the origin without affecting the cache key is to add it to a separate origin request policy. This is the most common confusion with cache policies and the reason CloudFront split them apart in the first place.

A subtle but important detail: cache key matching uses header, cookie, and query string names, but the matching is on the full name+value. Specifying session_id in the cache key means every distinct value of session_id produces a different cache key — so if every visitor has a unique session ID, every request is a cache miss. This is why "include all cookies" rarely makes sense for general traffic.

The compression settings (EnableAcceptEncodingGzip, EnableAcceptEncodingBrotli) tell CloudFront to normalize the Accept-Encoding header before adding it to the cache key. With both enabled, the cache key sees one of br,gzip, gzip, or br (depending on what the viewer supports), or no Accept-Encoding at all when the viewer supports neither — in that last case CloudFront sends Accept-Encoding: identity to the origin instead. Without normalization, every browser variation of Accept-Encoding: gzip, deflate, br, zstd would produce a distinct cache key. Enable this when your origin returns compressed responses or when CloudFront edge compression is on. Leave it off otherwise.

One gotcha: if you enable Gzip or Brotli in the cache policy, do not also include Accept-Encoding in an origin request policy attached to the same behavior. CloudFront handles that header itself when compression is enabled, and adding it to the origin request policy has no effect.

The AWS managed cache policies

This is the comparison table for the policies you'll actually attach to a self-managed distribution by hand. The Amplify Hosting policies (Amplify-* and Amplify-*-V2) are intentionally excluded — see the Amplify section below for why.

Policy	TTL (min/default/max)	Cookies	Query strings	Headers in cache key	Compression
`CachingOptimized`	1s / 24h / 365d	None	None	None	Gzip + Brotli
`CachingOptimizedForUncompressedObjects`	1s / 24h / 365d	None	None	None	Off
`CachingDisabled`	0 / 0 / 0	None	None	None	Off
`UseOriginCacheControlHeaders`	0 / 0 / 365d	All	None	Host, Origin, method overrides	Gzip + Brotli
`UseOriginCacheControlHeaders-QueryStrings`	0 / 0 / 365d	All	All	Host, Origin, method overrides	Gzip + Brotli
`Elemental-MediaPackage`	0 / 24h / 365d	None	aws.manifestfilter, start, end, m	Origin	Gzip

CachingOptimized

ID: 658327ea-f89d-4fab-a63d-7e88639e58f6

The default choice for static content — S3 buckets, image assets, JS and CSS bundles, anything that does not change based on who is asking. The cache key is just the requested object plus the normalized Accept-Encoding. No headers, no cookies, no query strings. This produces the highest possible cache hit ratio for static content.

Watch the Minimum TTL of 1 second: even with Cache-Control: no-store on your origin, CloudFront holds the object for at least one second. That is usually fine, but if you are using this policy on something where origin-side cache busting needs to take effect immediately, swap to UseOriginCacheControlHeaders or CachingDisabled.

CachingOptimizedForUncompressedObjects

ID: b2884449-e4de-46a7-ac36-70bc7f1ddd6d

Identical to CachingOptimized except compression is off. Use this when your origin does not return Gzip or Brotli (raw binary files, video segments, pre-compressed media formats like MP4 or WebP) and you are not using CloudFront edge compression. With compression off, the Accept-Encoding header is excluded from the cache key entirely, which keeps things simple.

If you cannot decide between this and CachingOptimized, default to CachingOptimized. The compression setting only causes problems when your origin produces objects that do not benefit from compression — and even then, it is a minor inefficiency rather than a correctness bug.

CachingDisabled

ID: 4135ea2d-6df8-44a3-9df3-4b5a84be39ad

All TTLs at 0. Nothing in the cache key. Every request goes straight through to the origin. Use this for API behaviors, dynamic GET endpoints that should never be cached, real-time data, WebSockets, and anywhere caching would be incorrect rather than just suboptimal. (POST, PUT, and DELETE aren't cached by CloudFront in the first place — only GET, HEAD, and optionally OPTIONS are — so attaching CachingDisabled to those methods is more about being explicit than functionally necessary.)

UseOriginCacheControlHeaders, UseOriginCacheControlHeaders-QueryStrings, and Elemental-MediaPackage also have Minimum TTL of 0, so they too will respect Cache-Control: no-store from your origin. The difference is that CachingDisabled never caches anything regardless of origin headers, while the others cache when the origin tells them to.

UseOriginCacheControlHeaders

ID: 83da9c7e-98b4-4e11-a168-04f0df8e2c65

Defers caching duration to your origin's Cache-Control and Expires headers. If the origin says max-age=600, CloudFront caches for 10 minutes. If the origin says no-store, CloudFront does not cache at all (because Minimum TTL is 0).

This is the right choice for CMS-backed sites, mixed static-and-dynamic apps, and anywhere your application code already knows what should be cached and for how long. WordPress, Drupal, server-rendered Next.js, and most traditional web apps fit here.

The cache key includes all cookies plus Host, Origin, and three method-override headers (X-HTTP-Method-Override, X-HTTP-Method, X-Method-Override). The cookie inclusion is significant: if your application sets per-user session cookies on every response, you will fragment the cache per user. Either avoid setting cookies on cacheable responses, or stick with CachingOptimized for paths that should be shared across users.

UseOriginCacheControlHeaders-QueryStrings

ID: 4cc15a8a-d715-48a4-82b8-cc0b614638fe

Same as UseOriginCacheControlHeaders but with all query strings included in the cache key. Use this when your origin returns different responses based on query string values — search results, filtered listings, paginated content — and you want CloudFront to cache each variant separately.

The trade-off is cache fragmentation. URLs that include UTM parameters, click IDs, or other tracking junk get cached independently, which both lowers your hit ratio and bloats CloudFront's storage of your content. If you can strip those server-side or with a CloudFront Function before this policy applies, do.

Amplify (and the eight related Amplify policies)

ID: 2e54312d-136d-493c-8eb9-b001f22f67d2

Two distinct things share the Amplify name. The standalone Amplify policy is documented as a regular CloudFront managed policy designed for use with an Amplify web app origin — Min TTL 2 seconds, Max TTL 600 seconds (10 minutes), Default TTL 2 seconds, with Authorization, CloudFront-Viewer-Country, and Host in the cache key plus all cookies and all query strings. AWS doesn't warn against using it; it's just narrowly tuned for Amplify-shaped workloads. The 2-second Minimum TTL means even no-store responses get cached briefly, which is rarely what you want outside of that specific architecture.

The eight Amplify-* policies are something else entirely: Amplify-Default, Amplify-DefaultNoCookies, Amplify-ImageOptimization, Amplify-StaticContent, plus a -V2 variant of each. These are managed by Amplify Hosting itself — Amplify attaches them to the distributions it provisions and resets them on every deploy. AWS explicitly says "we don't recommend that you use these policies for your distributions." If you need similar cache key shapes for a non-Amplify app, copy the settings into a custom policy.

The V2 variants appear to be tied to the August 2024 Amplify Hosting caching overhaul, which raised default static asset cache duration from 2 seconds to 1 year and Maximum TTL from 10 minutes to 1 year. AWS hasn't documented the V2 policies in the public CloudFront developer guide as of this writing — they're visible in the console but the documentation only covers the four originals.

Elemental-MediaPackage

ID: 08627262-05a9-4f76-9ded-b50ca2e3a84f

For AWS Elemental MediaPackage origins specifically — HLS and DASH video streaming. The cache key includes the four query string parameters that MediaPackage actually uses for manifest filtering and time-shifted playback (aws.manifestfilter, start, end, m) plus the Origin header for CORS. Other query strings are excluded, which keeps the cache key tight even when player libraries append cache-busting noise.

If you are not using MediaPackage, do not use this policy. If you are, use it — it is tuned for the specific request shape MediaPackage produces.

How to choose a cache policy

Match your origin and content type to the policy that is already tuned for it.

Use case	Recommended policy
S3 static site, image bucket, asset CDN	`CachingOptimized`
Video files (MP4, WebP), pre-compressed binaries	`CachingOptimizedForUncompressedObjects`
API endpoints, dynamic GET responses, real-time data	`CachingDisabled`
WordPress, Drupal, server-rendered apps with origin Cache-Control	`UseOriginCacheControlHeaders`
Same as above, but query strings affect the response	`UseOriginCacheControlHeaders-QueryStrings`
AWS Elemental MediaPackage HLS/DASH origin	`Elemental-MediaPackage`
Custom requirements that do not match any of the above	Custom cache policy

For a single distribution serving multiple content types, you typically attach different policies to different cache behaviors. A common setup: CachingOptimized on the default behavior (static assets), CachingDisabled on /api/*, UseOriginCacheControlHeaders on /blog/* if blog pages set their own Cache-Control.

When to create a custom cache policy

Reach for a custom policy when you need to include something specific in the cache key that the managed policies do not cover. The most common reasons are device-type segmentation using CloudFront-Is-Mobile-Viewer (serving different markup to phones versus desktops), country-based variation using CloudFront-Viewer-Country (geo-targeted content where edge handling beats origin-side detection), language negotiation via Accept-Language, and tier-based content where a coarse-grained auth bucket determines which response to serve.

Here is a Terraform example for a custom policy that varies on viewer country and a coarse auth-tier cookie:

resource "aws_cloudfront_cache_policy" "country_and_tier" {
  name        = "country-and-tier"
  default_ttl = 3600       # 1 hour when origin sends no Cache-Control
  max_ttl     = 86400      # cap origin Cache-Control at 24 hours
  min_ttl     = 0          # respect origin no-store

  parameters_in_cache_key_and_forwarded_to_origin {
    # normalize Accept-Encoding so cache key collapses across browsers
    enable_accept_encoding_gzip   = true
    enable_accept_encoding_brotli = true

    headers_config {
      header_behavior = "whitelist"
      headers {
        items = ["CloudFront-Viewer-Country"]  # geo-target at the edge
      }
    }

    cookies_config {
      cookie_behavior = "whitelist"
      cookies {
        # coarse bucket like anonymous/free/premium — NOT a per-user session ID
        items = ["auth_tier"]
      }
    }

    query_strings_config {
      query_string_behavior = "none"
    }
  }
}

The cookie name matters here. auth_tier with values like anonymous, free, and premium produces three cache variants per country, which is reasonable. A per-user session token in the same slot would produce one cache variant per user, which is the cache-buster pattern the article warns against earlier.

A few things to know when going custom. CloudFront-generated headers like CloudFront-Viewer-Country are not sent to the origin by default — you need to either include them in the cache key (which automatically forwards them to the origin) or add them to an origin request policy if you want them at the origin without varying the cache. The cache compression toggle should match what your origin returns; turning it on when your origin does not compress can fragment the cache without benefit. And keep Minimum TTL at 0 unless you specifically want to override origin no-store directives, because the surprise-caching behavior burns a lot of debugging hours when you forget about it.

Common pitfalls

Cache hit ratio tanks after a policy change. Almost always means the cache key got too specific. Check whether the new policy includes cookies or query strings the previous one did not, and re-check whether your origin sets per-user cookies on responses that should be shared across users.

Origin keeps getting hit despite a long Default TTL. Default TTL only applies when the origin sends no Cache-Control or Expires header. If your origin sets Cache-Control: max-age=60, CloudFront uses 60 seconds, not the policy's Default TTL. Either change the origin or use Maximum TTL to cap origin-controlled durations.

no-cache from origin is being ignored. Several managed policies (CachingOptimized, CachingOptimizedForUncompressedObjects, and Amplify) have a Minimum TTL greater than 0, which overrides origin no-cache directives. If you need origin-side cache busting to take effect immediately, switch to CachingDisabled or build a custom policy with Minimum TTL = 0.

Forwarding CloudFront-Viewer-Country to origin does not work. For CloudFront-generated headers like CloudFront-Viewer-Country, make sure the header is explicitly enabled via a cache policy or an origin request policy — they aren't sent to your origin by default. If it's in the cache key, it's also forwarded to the origin automatically. If you only need it at the origin and don't want to vary the cache, put it in an origin request policy instead. (Note: a few CloudFront-* headers, including CloudFront-Viewer-Address, CloudFront-Viewer-ASN, and the TLS-related ones, can only be added via an origin request policy and not in a cache policy.)

Authorization header is not reaching the origin. CloudFront removes it by default. To forward it, either include Authorization in the cache key with a cache policy, or use an origin request policy that forwards all viewer headers (the managed Managed-AllViewer policy does this). You cannot forward only Authorization via an origin request policy. Be careful: putting Authorization in the cache key creates per-token cache variants and is usually inappropriate for broadly shared cacheable content.

What's next

If you came here looking for "which cache policy should I use," the short answer is CachingOptimized for static content, CachingDisabled for APIs, UseOriginCacheControlHeaders for content where your origin can express its own caching intent. Everything else is tuning at the margins.

The next post in this series goes deep on origin request policies — where the separation between cache key and origin forwarding really earns its keep — followed by response headers policies, which can replace a lot of security middleware with a single CloudFront config.

Secure Access to Private EKS Clusters Without Bastion Hosts Using SSM

Paweł Swiridow — Wed, 13 May 2026 13:00:00 +0000

Accessing Private EKS Clusters Without Losing Your Mind

Locking down your Kubernetes control plane is a basic requirement for any production environment. Exposing the EKS API server to the public internet is just asking for automated scanners to ruin your weekend. However, securing the endpoint creates an operational headache: how do you actually run kubectl when the API is sealed inside a private subnet?

The traditional answer was a bastion host. But managing SSH keys, rotating credentials, and maintaining yet another publicly exposed EC2 instance is tedious. We all know that a "temporary" bastion host spun up on a Friday afternoon will inevitably become a load-bearing production pillar by Monday.

Instead, we can use AWS Systems Manager (SSM) Session Manager. By leveraging the SSM agent already running on your EKS worker nodes, we can securely tunnel our local traffic directly to the private API endpoint without opening inbound ports or managing SSH keys.

The Mechanics of the SSM Tunnel

The flow is straightforward:

Your local machine initiates an SSM port forwarding session targeting a specific EKS worker node.
The SSM session is instructed to forward traffic to a remote host (the private EKS API endpoint URL) on port 443.
You update your kubeconfig to point to localhost on your chosen forwarded port.

Because the worker node is already in the VPC and authorized to talk to the EKS control plane, it acts as a highly secure, identity-aware proxy. Access is governed entirely by IAM, meaning you can audit every connection via CloudTrail.

Prerequisite: IAM Configuration

For this to work, your EKS worker nodes must have the SSM agent installed (the official EKS optimized AMIs have this by default) and the correct IAM permissions.

Here is a Terraform snippet demonstrating how to attach the necessary SSM policy to your existing EKS node IAM role.

# Assumes you already have an aws_iam_role defined for your worker nodes
# named 'eks_node_role'

resource "aws_iam_role_policy_attachment" "ssm_managed_instance_core" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
  role       = aws_iam_role.eks_node_role.name
}

# Optional but recommended: Restrict who can start sessions in IAM
resource "aws_iam_policy" "ssm_user_access" {
  name        = "EKS-SSM-Tunnel-Access"
  description = "Allows users to port forward to EKS nodes"
  policy      = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = "ssm:StartSession"
        Resource = [
          "arn:aws:ec2:*:*:instance/*",
          "arn:aws:ssm:*:*:document/AWS-StartPortForwardingSessionToRemoteHost"
        ]
        # In a real environment, restrict the instance resource via tags
        Condition = {
          StringEquals = {
            "ssm:resourceTag/eks:cluster-name" = "my-production-cluster"
          }
        }
      }
    ]
  })
}

This ensures your nodes can communicate with the SSM service and restricts which IAM users can actually initiate the tunnel.

Establishing the Tunnel

Once the nodes are registered in SSM, you need a script to extract a valid instance ID, locate the cluster API endpoint, and start the tunnel.

Here is a Bash script you can execute locally to handle the heavy lifting. It requires the AWS CLI and the Session Manager plugin to be installed on your workstation.

#!/bin/bash
set -euo pipefail

CLUSTER_NAME="my-production-cluster"
REGION="us-east-1"
LOCAL_PORT="8443"

# Fetch the private endpoint of the EKS cluster
echo "Fetching EKS endpoint for ${CLUSTER_NAME}..."
EKS_ENDPOINT=$(aws eks describe-cluster \
  --name "${CLUSTER_NAME}" \
  --region "${REGION}" \
  --query "cluster.endpoint" \
  --output text | sed 's/https:\/\///')

# Find an active worker node instance ID using tags
echo "Finding an active worker node..."
INSTANCE_ID=$(aws ec2 describe-instances \
  --region "${REGION}" \
  --filters "Name=tag:eks:cluster-name,Values=${CLUSTER_NAME}" "Name=instance-state-name,Values=running" \
  --query "Reservations[0].Instances[0].InstanceId" \
  --output text)

if [ "$INSTANCE_ID" == "None" ]; then
  echo "Error: No running worker nodes found."
  exit 1
fi

echo "Establishing SSM tunnel through ${INSTANCE_ID} to ${EKS_ENDPOINT}..."
echo "Leave this terminal open. Access EKS via https://localhost:${LOCAL_PORT}"

# Start the port forwarding session
aws ssm start-session \
  --region "${REGION}" \
  --target "${INSTANCE_ID}" \
  --document-name AWS-StartPortForwardingSessionToRemoteHost \
  --parameters "{\"host\":[\"${EKS_ENDPOINT}\"],\"portNumber\":[\"443\"],\"localPortNumber\":[\"${LOCAL_PORT}\"]}"

Run this script, and it will bind localhost:8443 to the private API endpoint.

Updating Kubeconfig

The final step is modifying your local Kubernetes configuration. You cannot simply run aws eks update-kubeconfig and call it a day, because that will configure the private AWS endpoint, which your machine still cannot route to directly.

You need to manually alter the server field for your cluster to point to the local port.

When you port-forward the EKS API server to your local machine, connecting to https://localhost:8443 introduces a new problem. The API server presents a TLS certificate minted for its internal AWS endpoint (e.g., 1234567890ABCDEF.yl4.us-east-1.eks.amazonaws.com), not localhost.

The quick, dirty fix is to add insecure-skip-tls-verify: true to your kubeconfig. But nothing screams "I definitely passed my SOC2 audit" quite like explicitly disabling TLS validation in production. It is the infrastructure equivalent of putting black tape over a check engine light.

Instead of turning off validation, we can instruct kubectl to connect via our local port but validate the TLS certificate against the actual EKS endpoint hostname. We do this by utilizing the tls-server-name parameter.

apiVersion: v1
clusters:
- cluster:
    server: https://localhost:8443
    # Validate the certificate against the real AWS endpoint
    tls-server-name: 1234567890ABCDEF.yl4.us-east-1.eks.amazonaws.com
  name: arn:aws:eks:us-east-1:123456789012:cluster/my-production-cluster
# ... contexts and users remain unchanged

Once saved, kubectl get pods will route securely through the SSM tunnel, across the worker node, and hit the control plane.

Wrap-Up

Relying on SSM port forwarding eliminates the need for VPNs, bastion hosts, and complex routing rules just to run operational commands against an isolated EKS cluster. By utilizing the existing IAM-integrated agent on your worker nodes, you shrink your external attack surface while maintaining strict audit trails for developer access.

Next.js Server Actions vs API Routes: Architecture, Performance, and Use Cases

Paweł Sobolewski — Tue, 05 May 2026 22:00:00 +0000

With the introduction of the App Router, Next.js added two powerful server-side primitives:

At first glance, they might seem interchangeable—both run on the server. But they're designed for different responsibilities, follow different execution models, and come with different constraints.

Understanding these differences is essential to avoid architectural pitfalls and performance issues.

What Are Server Actions?

According to the official documentation, Server Actions are a specific type of React Server Function used for mutations. They are invoked from the client by sending a POST request to the current page’s route (the same URL context in which the action was triggered), where Next.js internally routes the request to the corresponding server-side function based on unique server action identifier.

Here's an important clarification:

We only call them Server Actions when they're invoked from the client.

From the server's perspective, they're just regular asynchronous functions.

For example:

"use server"

export async function updateUser() {
    // mutation logic
}

Called from a Client Component → acts as a Server Action
Called from another server-side function (like server component) → just a normal function call

This distinction is crucial:

The network boundary only exists when invoked from the client,
The serialization and execution constraints only apply in that context.

Key characteristics

Must be asynchronous
Marked with the "use server" directive
Invoked via POST requests under the hood

Mutations + Rendering = Sequential Constraints

The official documentation states that React Server Functions are designed for server-side mutations. This design choice is directly tied to one of their most important limitations.

When a Server Action is invoked:

It mutates server-side state
It triggers a React Server Component (RSC) re-render
It produces a new UI snapshot
The client merges the updated result

This tight coupling between mutation and rendering means React must ensure each update applies to a consistent version of the UI tree.

This creates a critical constraint described in the docs:

The client currently dispatches and awaits them one at a time

What this means in practice

Even if multiple Server Actions are triggered “at once”:

useEffect(() => {
    actionA();
    actionB();
}, [])

They will execute sequentially, not in parallel:

actionA → re-render → actionB → re-render

This is not an implementation detail—it is a consequence of the architecture.

Running them in parallel would risk:

race conditions between mutations
inconsistent UI snapshots
stale data being rendered

So React prioritizes consistency over concurrency.

Future note

The documentation explicitly indicates that this behavior is current, which means it may change in the future. However, today:

You should design your system assuming Server Actions are sequential

What Are Route Handlers?

Route Handlers address a key limitation of Server Actions: the lack of true concurrent, independent calls.

Because Server Actions are tightly coupled with React's rendering lifecycle, they aren't suited for scenarios where multiple operations must run in parallel without UI-driven synchronization.

Route Handlers provide a decoupled HTTP-based execution model that restores full concurrency and independence between requests.

Definition

Route Handlers are standard HTTP endpoints defined in:

app/api/.../route.ts

Example:

import { NextResponse }from"next/server"

export async function POST(request: Request) {
    const body= await request.json();

    await db.cart.add(body.productId);

    return NextResponse.json({ success:true });
}

Key characteristics

Full HTTP interface
Support for all HTTP methods (GET, POST, etc.)
Publicly accessible endpoints
Independent of React rendering lifecycle
Fully parallelizable and concurrent by default

Good to know

In older versions of Next.js there were only Route Handlers, as Server Action concept came with React Server Components in Next.js 13. So they are improving specific use cases, not replacing them. You can still write the whole application using only Route Handlers and not writing a single Server Action at all.

Why Use Server Actions?

At this point, you might ask:

If Route Handlers provide full concurrency, HTTP control, and fewer constraints—why use Server Actions at all?

Server Actions aren't meant to replace Route Handlers. They solve a different class of problems: UI-driven mutations tightly integrated with React's rendering model, not transport-layer API design.

1. They eliminate the client–server API layer

With Route Handlers, every mutation typically requires:

defining an API endpoint
handling request/response manually
writing client-side fetch calls
managing loading and error states explicitly

Server Actions remove this entire layer:

<form action={createPost}>
    <input name="title"/>
    <button type="submit">Create</button>
</form>

No more:

manual fetch calls
endpoint definitions
request parsing boilerplate

The mutation becomes a direct extension of the UI.

2. They're first-class citizens of React Server Components

Server Actions are deeply integrated with the RSC model:

They trigger server component re-renders
They return updated UI state without manual client synchronization
They work seamlessly with Next.js caching and revalidation primitives

This creates a unique property:

A mutation can immediately reflect in the UI—without explicitly managing state on the client.

Route Handlers can't do this. They only return data.

3. They reduce client-side complexity

With Route Handlers, a typical mutation flow looks like:

await fetch("/api/posts", { method:"POST", body: ... });
setState(...);
handleLoading(...);
handleError(...);

With Server Actions:

no manual request lifecycle
no fetch boilerplate
no explicit client-state synchronization layer

This shifts complexity from the client to the framework boundary.

Conclusion

Server Actions are powerful but intentionally constrained.

They are:

React-integrated mutation primitives designed for UI-driven updates—not general-purpose concurrency.

Choosing between Server Actions and Route Handlers isn't about preference—it's about understanding whether your problem belongs to the UI layer or the transport/API layer.

Payload CMS Security Best Practices: Top 10 Threats & Mitigation Strategies in 2026

Michał Miler — Tue, 21 Apr 2026 08:20:00 +0000

Payload CMS is a powerful, developer-first headless CMS built on Node.js and TypeScript. It gives you complete control over authentication, access control, and API behavior - but with that flexibility comes responsibility for implementing robust security measures and following OWASP security best practices.

Security misconfigurations remain one of the leading causes of data breaches in modern web applications. According to IBM's Cost of a Data Breach Report, thousands of CMS-powered websites and APIs are compromised every year due to preventable issues like weak authentication, improper access control, and exposed admin panels.

From our experience working on production SaaS applications, eCommerce platforms, and multi-tenant systems at u11d, over 80% of Payload CMS projects lack proper implementation of critical security controls aligned with OWASP Top 10 risks - especially around authentication, authorization, API exposure, and infrastructure hardening.

In this comprehensive guide, we'll cover the most common Payload CMS security threats and practical, production-tested mitigation strategies you should implement to avoid costly vulnerabilities, data leaks, and security incidents.

Who This Guide is For:

Payload CMS developers building production applications and APIs
DevOps engineers securing Payload deployments on AWS, DigitalOcean, Vercel
Project managers and product owners overseeing headless CMS implementations
Security auditors reviewing Payload CMS implementations for compliance
Technical leads architecting secure headless CMS solutions with Next.js

What You'll Learn:

Critical security threats specific to Payload CMS (with OWASP mapping)
OWASP Top 10 aligned mitigation strategies for headless CMS
Production-ready implementation examples with TypeScript
Complete security checklist for production deployment
Infrastructure hardening techniques for Node.js applications
Real-world security incidents and lessons learned

I. Admin Account Compromise (Critical Priority)

The Security Risk

Admin accounts are the highest-value target in any CMS. In Payload CMS, administrators typically have unrestricted access to:

All content and collections
User management and permissions
System configuration
API access controls
Database operations

Attack Impact

If compromised, attackers can:

Modify or deface content
Inject malicious scripts (XSS)
Manipulate pricing or product data
Access sensitive user information
Delete or corrupt critical data

In real-world incidents, compromised admin access often leads to full platform takeover within minutes - especially in systems without audit logging or alerts.

The Solution: Multi-Layered Admin Protection

1. Enforce Modern Password Policies (NIST-Compliant)

Modern password policies prioritize length and uniqueness over complexity rules (NIST SP 800-63B). Best practices include:

Minimum 15+ characters (passphrases preferred over complexity)
Prevent password reuse (store hash history)
Block common and breached passwords (Have I Been Pwned API)
Encourage password managers
Avoid forced periodic password expiration (outdated practice)

Why This Matters:

Short, complex passwords (e.g., P@ssw0rd123) are far weaker than long passphrases (e.g., correct-horse-battery-staple-2025).

2. Enable Multi-Factor Authentication (MFA/2FA)

Critical: Payload CMS does not enforce 2FA by default for admin users. You must explicitly add this protection layer.

Recommended Solutions for Payload CMS:

Option A: TOTP-Based

payloadcms-tfa - Community plugin for Time-based OTP
payload-totp - Alternative TOTP implementation
Supports authenticator apps (Google Authenticator, Authy, 1Password)

Option B: Custom OTP Implementation

Email-based one-time codes
SMS-based codes (requires Twilio/similar)
Hardware tokens (YubiKey, FIDO2)

Option C: External Auth Providers

Auth.js (NextAuth) with 2FA providers
Keycloak with MFA policies
Zitadel with passkey support

Production Requirement

For production and SaaS systems, MFA for all admin users should be mandatory, not optional.

3. Enforce HTTPS Everywhere (TLS/SSL)

Never expose Payload admin panels over HTTP. This is a critical vulnerability that exposes:

Admin credentials during login
Session cookies
API tokens
All transmitted data

Recommended TLS Configuration:

TLS 1.3 preferred (TLS 1.2 minimum)
Strong cipher suites only
HSTS header with preload
Redirect all HTTP → HTTPS
Secure cookie flags (secure, httpOnly, sameSite)

Summary: Admin security is your first line of defense - weak authentication here leads to total system compromise.

II. Weak Authentication Strategy

The Risk

Payload provides flexible authentication, but that flexibility often leads to insecure defaults in real projects.

Common issues include:

Long-lived JWT tokens
Tokens stored in localStorage
No refresh token rotation
Mixing admin and public authentication flows

These mistakes significantly increase the risk of session hijacking and token theft.

The Solution

Secure Token Handling

Use short-lived access tokens
Implement refresh token rotation
Store tokens in HTTP-only cookies
Avoid localStorage for sensitive tokens

Consider External Identity Providers

For more advanced or scalable setups, integrate external auth systems:

Auth.js (NextAuth)
Better-Auth
Keycloak
Zitadel

These solutions provide:

OAuth & social login
Enterprise SSO
Centralized identity management
Advanced session control

Summary: A well-designed authentication layer reduces your attack surface and improves scalability.

III. Missing Access Control Rules

The Risk

Payload’s access control system is powerful - but optional. Many teams either skip it or implement overly permissive rules.

This can lead to:

Unauthorized data access
Privilege escalation
Exposure of sensitive fields via API

In many breaches, improper authorization - not authentication - is the root cause.

The Solution

Define Explicit Access Rules

Always define:

read
create
update
delete

For every collection.

Best practices:

Public content → read-only for anonymous users
Admin content → role-based restrictions
User data → owner-only access

Never rely on frontend restrictions — enforce everything server-side.

Summary: Authorization must be explicit and restrictive by default.

IV. Public API Exposure

The Risk

Payload automatically exposes REST and optionally GraphQL APIs, which can unintentionally leak data if not configured correctly.

Common risks:

Public access to internal collections
Exposure of sensitive fields
Endpoint enumeration and brute-force attacks

Attackers often scan APIs first - not your frontend.

The Solution

Limit API Surface

Disable GraphQL if unused
Restrict public endpoints
Use API gateways or reverse proxies

Protect Sensitive Fields

hidden: true
access: { read: () => false }

Add Rate Limiting

Implement at infrastructure level:

Cloudflare
AWS API Gateway
Reverse proxy throttling

Payload does not provide built-in rate limiting.

Summary: Reduce what is exposed - every public endpoint is a potential attack vector.

V. No Audit Logging

The Risk

Without audit logs, security incidents become invisible.

You won’t know:

Who changed what
When it happened
Whether malicious activity occurred

This makes incident response and compliance extremely difficult.

The Solution

Enable Versioning

Use Payload’s versioning for:

Pages
Products
Critical content

Centralize Logging

Track:

Login attempts
Failed logins
Content changes
Permission updates

Send logs to:

CloudWatch
Datadog
ELK stack

Summary: If you can’t see it, you can’t secure it.

VI. Database Security Misconfiguration

The Risk

Payload typically uses MongoDB or PostgreSQL. Misconfigured databases are a frequent source of major data breaches.

Risks include:

Public database exposure
Weak credentials
Lack of encryption
Lateral movement within infrastructure

The Solution

Never expose databases publicly
Use private VPC networking
Rotate credentials regularly
Use IAM-based authentication where possible
Encrypt data at rest and in transit

Summary: Infrastructure security is just as important as application security.

VII. Missing Content Validation (XSS Risk)

The Risk

Allowing rich text or HTML input without sanitization opens the door to stored XSS attacks.

Attackers can inject scripts that execute in:

Admin panel
Frontend applications
Other users’ browsers

The Solution

Sanitize HTML inputs
Use strict schema validation
Limit custom HTML fields
Escape output in frontend

Never trust user-generated content - even from “trusted” users.

Summary: Input validation is essential to prevent client-side attacks.

Final Thoughts: Security is a Feature, Not an Afterthought in Payload CMS

Payload CMS gives developers exceptional flexibility and control over authentication, authorization, and data access - but security must be explicitly designed and implemented from day one, not bolted on later.

Unlike managed SaaS CMS platforms (Contentful, Sanity, Hygraph), Payload assumes you understand authentication mechanisms, authorization patterns, and infrastructure security. That's powerful and flexible - but also a common source of critical vulnerabilities in production deployments.

Key Takeaways:

Payload CMS requires explicit security configuration - No secure-by-default settings
80% of projects have preventable security gaps - Based on real-world security audits
OWASP Top 10 alignment is critical - Authentication, access control, API security
Infrastructure security matters as much as application security - Database, network, TLS configuration
Security is continuous, not one-time - Regular audits, dependency updates, monitoring
Security impacts performance and UX - See our localization guide for secure field components
Secure scaling is possible - Our Connect211 case study shows 50+ domains secured

If you're running Payload CMS in production — especially for:

eCommerce platforms with payment processing
SaaS applications with sensitive user data
Fintech solutions requiring PCI-DSS compliance
Healthcare systems needing HIPAA compliance
Mobile app backends with millions of users
Multi-tenant platforms isolating customer data

Treat security as a first-class feature from the start, not a checkbox before launch.

Additional Resources

OWASP Top 10 - Web application security risks (updated 2021)
Payload CMS Authentication Documentation - Official authentication guide
NIST Password Guidelines - Modern password policy standards (SP 800-63B)
CIS Benchmarks - Infrastructure hardening guides for Linux, Docker, databases
Have I Been Pwned API - Password breach detection service
Payload Discord Community - Security discussions with Payload experts
Snyk Vulnerability Database - Node.js package vulnerabilities

Need Payload CMS Experts?

u11d specializes in Payload CMS development, migration, and deployment. We help you build secure, scalable Payload projects, migrate from legacy CMS platforms, and optimize your admin, API, and infrastructure for production. Get expert support for custom features, localization, and high-performance deployments.

Talk to Payload Experts

Auto-Translation in Payload CMS with Azure AI Translator: Complete 2026 Implementation Guide

Michał Miler — Tue, 14 Apr 2026 09:52:00 +0000

Manual translation in Payload CMS doesn't scale for production applications: it's slow, expensive, and error‑prone once you support multiple documents and locales. This comprehensive guide shows a production-tested automated approach using Azure AI Translator, Payload background jobs, and LRU caching to batch translations, preserve manual edits, and minimize API costs.

At u11d, we've implemented this solution across multiple high-traffic Payload CMS applications, processing millions of translations monthly. This guide covers everything from Azure setup to production deployment with real-world performance metrics.

Key features of this production-ready solution:

One-click batch translation for any Payload collection (100+ languages supported)
Async background jobs - No timeouts, scalable processing with Payload Jobs API
Intelligent caching - LRU cache to avoid duplicate API calls and reduce costs by 70%+
Preserves manual edits - Only empty fields are translated by default
Force mode - Re-translate all fields when needed (content updates, quality improvements)
Audit trail - Tracks translation metadata for compliance and debugging
Type-safe - Full TypeScript support with Payload's generated types
Production-tested - Handles millions of translations in real-world applications

Solution Overview

Architecture:

Translation Service: Azure + Cache
Payload Background Job: async, scalable
API Route Handler: secure job submission
Admin UI: button + modal for user control
Translation Metadata: audit what/when/force

This tutorial focuses on a single collection for clarity. For a full, type-safe, multi-collection solution, see the complete implementation on GitHub.

Prerequisites

Before implementing auto-translation, ensure you have:

Azure Account - Create free account ($200 credit included)
Azure Translator Resource - Setup guide
Payload Localization Enabled - See our localization best practices guide
Node.js 18+ - Required for Payload 3.0+
TypeScript 5+ - For type-safe implementation

Database Note: This tutorial uses SQLite for local development only. For production, configure DATABASE_URL to connect to PostgreSQL, MongoDB, or MySQL for better performance and reliability.

Step-by-Step Implementation

Step 1: Install Dependencies

First, install the Azure AI Translator SDK and the additional project dependencies:

npm install @azure-rest/ai-translation-text lru-cache radash

Step 2: Configure Environment Variables

Create or update your .env file:

# Azure Translator Configuration
AZURE_TRANSLATOR_KEY=your_azure_translator_key
AZURE_TRANSLATOR_REGION=your_azure_translator_region

Check .env.example to see what else can be configured.

Getting Azure Credentials:

Go to Azure Portal
Navigate to your Translator resource
Create Translator resource if it does not exist yet
Click "Keys and Endpoint" in the left menu
Copy KEY 1 and set AZURE_TRANSLATOR_KEY in .env
Copy Location/Region and set AZURE_TRANSLATOR_REGION in .env

Never commit API keys. Add .env.local to .gitignore, use secret managers or CI/CD secret stores, and rotate keys regularly.

Step 3: Create Translation Service

This service handles Azure API calls, batching, and caching:

// src/services/translationService.ts
import crypto from "crypto";
import TextTranslationClient, {
  isUnexpected,
} from "@azure-rest/ai-translation-text";
import { LRUCache } from "lru-cache";

export type TranslationEngine = "azure";

export interface TranslationResult {
  text: string;
  fromCache: boolean;
}

export interface BatchTranslationInput {
  id?: string;
  text: string;
  targetLocale: string;
}

export interface BatchTranslationResult {
  id?: string;
  text: string;
  translatedText: string;
  targetLocale: string;
  fromCache: boolean;
}

const CACHE_TTL_MS = 7 * 24 * 60 * 60 * 1000; // 7 days

const translationCache = new LRUCache<string, string>({
  max: 10000,
  ttl: CACHE_TTL_MS,
  ttlAutopurge: true,
});

function createTextHash(text: string): string {
  return crypto.createHash("sha256").update(text, "utf8").digest("hex");
}

function createCacheKey(locale: string, englishText: string): string {
  const hash = createTextHash(englishText);
  return `${locale}:${hash}`;
}

async function translateWithAzure(
  texts: string[],
  targetLocale: string,
): Promise<string[]> {
  const apiKey = process.env.AZURE_TRANSLATOR_KEY;
  const endpoint =
    process.env.AZURE_TRANSLATOR_ENDPOINT ||
    "<https://api.cognitive.microsofttranslator.com>";
  const region = process.env.AZURE_TRANSLATOR_REGION || "global";

  if (!apiKey) {
    throw new Error(
      "AZURE_TRANSLATOR_KEY is not configured in environment variables",
    );
  }

  const client = TextTranslationClient(endpoint, {
    key: apiKey,
    region: region,
  });

  const inputText = texts.map((text) => ({ text }));
  const translateResponse = await client.path("/translate").post({
    body: inputText,
    queryParameters: {
      to: targetLocale,
      from: "en",
    },
  });

  if (isUnexpected(translateResponse)) {
    throw new Error(
      `Azure translation failed: ${translateResponse.body.error?.message || "Unknown error"}`,
    );
  }

  return translateResponse.body.map((item) => item.translations[0].text);
}

export async function batchTranslate(
  inputs: BatchTranslationInput[],
): Promise<BatchTranslationResult[]> {
  const results: BatchTranslationResult[] = new Array(inputs.length);

  const inputsWithIndex = inputs.map((input, index) => ({ input, index }));

  const byLocale = inputsWithIndex.reduce<
    Record<string, typeof inputsWithIndex>
  >((acc, item) => {
    if (!acc[item.input.targetLocale]) {
      acc[item.input.targetLocale] = [];
    }
    acc[item.input.targetLocale].push(item);
    return acc;
  }, {});

  for (const [locale, localeInputs] of Object.entries(byLocale)) {
    const textsToTranslate: string[] = [];
    const itemsToTranslate: typeof inputsWithIndex = [];

    for (const item of localeInputs) {
      const cacheKey = createCacheKey(locale, item.input.text);
      const cached = translationCache.get(cacheKey);

      if (cached) {
        results[item.index] = {
          id: item.input.id,
          text: item.input.text,
          translatedText: cached,
          targetLocale: locale,
          fromCache: true,
        };
      } else {
        textsToTranslate.push(item.input.text);
        itemsToTranslate.push(item);
      }
    }

    if (textsToTranslate.length > 0) {
      const BATCH_SIZE_AZURE = 100;

      for (let i = 0; i < textsToTranslate.length; i += BATCH_SIZE_AZURE) {
        const batch = textsToTranslate.slice(i, i + BATCH_SIZE_AZURE);
        const batchItems = itemsToTranslate.slice(i, i + BATCH_SIZE_AZURE);

        const translations = await translateWithAzure(batch, locale);

        for (let j = 0; j < translations.length; j++) {
          const item = batchItems[j];
          const translatedText = translations[j];
          const cacheKey = createCacheKey(locale, item.input.text);

          translationCache.set(cacheKey, translatedText);

          results[item.index] = {
            id: item.input.id,
            text: item.input.text,
            translatedText,
            targetLocale: locale,
            fromCache: false,
          };
        }
      }
    }
  }

  return results;
}

export async function translate(
  text: string,
  targetLocale: string,
): Promise<TranslationResult> {
  const results = await batchTranslate([{ text, targetLocale }]);
  const result = results[0];

  return {
    text: result.translatedText,
    fromCache: result.fromCache,
  };
}

What This Does:

LRU Cache - Stores up to 10,000 translations in memory
Cache Key - SHA256 hash of source text + locale
7-Day TTL - Reasonable cache duration for production content
Batch Processing - Groups translations by locale for efficiency
Azure Integration - Sends up to 100 texts per request
Cache Hits - Returns cached translations instantly (< 1ms)
Cache Misses - Fetches from Azure and caches result
Type Safety - Full TypeScript interfaces for reliability

Caching reduces API calls, latency, and cost by reusing recent translations. Use the in-memory LRU cache for local development, but deploy a shared cache (for example, Redis) in production to persist entries across restarts, share results between instances, and scale capacity.

Step 4: Create Payload Background Job

Background jobs prevent timeout issues and allow progress tracking.

Payload's Jobs API runs translation work outside the request lifecycle so long-running batches won't block the admin UI or hit HTTP timeouts. Queued jobs can report progress, be retried on failure, and scale across worker processes, which makes them a reliable choice for large or repeated translation tasks. In this step we'll implement a job that fetches the English document, extracts fields to translate, batches calls to the translation service, and updates target locales in a safe, idempotent way.

// src/jobs/translateResource.ts
import { TaskConfig, TypedLocale } from "payload";
import { batchTranslate } from "../services/translationService";
import { retry } from "radash";
import { isValidLocale } from "@/locales";
import { Resource } from "@/payload-types";

interface FieldToTranslate {
  path: string;
  value: string;
  locale: string;
  id?: string;
}

function isEmpty(value: unknown): boolean {
  if (value === null || value === undefined) return true;
  if (typeof value === "string") return value.trim() === "";
  return false;
}

function shouldTranslate(
  force: boolean,
  sourceValue: string | null | undefined,
  targetValue: string | null | undefined,
): boolean {
  if (isEmpty(sourceValue)) return false;
  if (force) return true;
  return isEmpty(targetValue);
}

function extractFieldsToTranslate(
  englishDoc: Resource,
  targetDoc: Resource,
  force: boolean,
  targetLocale: TypedLocale,
): FieldToTranslate[] {
  const fieldsToTranslate: FieldToTranslate[] = [];

  // Extract title field
  if (shouldTranslate(force, englishDoc.title, targetDoc.title)) {
    fieldsToTranslate.push({
      path: "title",
      value: englishDoc.title,
      locale: targetLocale,
    });
  }

  // Extract list array fields
  if (Array.isArray(englishDoc.list)) {
    englishDoc.list.forEach((englishItem, index) => {
      if (!englishItem.id) return;

      const targetItem = Array.isArray(targetDoc.list)
        ? targetDoc.list.find((t) => t.id === englishItem.id)
        : null;

      if (shouldTranslate(force, englishItem.name, targetItem?.name)) {
        fieldsToTranslate.push({
          path: `list.${index}.name`,
          value: englishItem.name!,
          locale: targetLocale,
          id: englishItem.id,
        });
      }
    });
  }

  return fieldsToTranslate;
}

function buildUpdateData(
  englishDoc: Resource,
  targetDoc: Resource,
  translationsByPath: Record<string, string>,
): Partial<Resource> {
  const updateData: Partial<Resource> = {
    ...targetDoc,
  };

  // Update title
  if (translationsByPath["title"]) {
    updateData.title = translationsByPath["title"];
  } else if (isEmpty(targetDoc.title) && englishDoc.title) {
    updateData.title = englishDoc.title;
  }

  // Update list array
  if (Array.isArray(englishDoc.list)) {
    const targetItemsById = new Map(
      Array.isArray(targetDoc.list)
        ? targetDoc.list.map((item) => [item.id, item])
        : [],
    );

    updateData.list = englishDoc.list.map((englishItem, index) => {
      const targetItem = targetItemsById.get(englishItem.id);
      const mergedItem = {
        ...englishItem,
        ...(targetItem || {}),
        id: englishItem.id,
      };

      const namePath = `list.${index}.name`;
      if (translationsByPath[namePath]) {
        mergedItem.name = translationsByPath[namePath];
      } else if (isEmpty(targetItem?.name) && englishItem.name) {
        mergedItem.name = englishItem.name;
      }

      return mergedItem;
    });
  }

  return updateData;
}

async function executeBatchTranslation(
  fields: FieldToTranslate[],
): Promise<Record<string, string>> {
  const BATCH_SIZE_AZURE = 100;
  const translationsByPath: Record<string, string> = {};

  for (let i = 0; i < fields.length; i += BATCH_SIZE_AZURE) {
    const batch = fields.slice(i, i + BATCH_SIZE_AZURE);
    const translations = await retry({ times: 3, delay: 1000 }, () =>
      batchTranslate(
        batch.map((f) => ({
          id: f.path,
          text: f.value,
          targetLocale: f.locale,
        })),
      ),
    );
    translations.forEach((t) => {
      if (t?.id) translationsByPath[t.id] = t.translatedText;
    });
  }

  return translationsByPath;
}

export const translateResourceJob: TaskConfig<"translateResource"> = {
  slug: "translateResource",
  inputSchema: [
    {
      name: "documentId",
      type: "text",
      required: true,
    },
    {
      name: "locales",
      type: "array",
      required: true,
      fields: [
        {
          name: "locale",
          type: "text",
        },
      ],
    },
    {
      name: "force",
      type: "checkbox",
      defaultValue: false,
    },
  ],
  outputSchema: [
    {
      name: "success",
      type: "checkbox",
      required: true,
    },
    {
      name: "translated",
      type: "number",
      required: true,
    },
    {
      name: "details",
      type: "json",
    },
  ],
  handler: async ({ input, job, req }) => {
    const { payload } = req;
    const { documentId, force } = input;
    const locales = input.locales.map(({ locale }) => locale);

    console.log(
      `[Job ${job.id}] Translation started for resources:${documentId}`,
      { locales, force },
    );

    try {
      const englishDoc = await payload.findByID({
        collection: "resources",
        id: documentId,
        locale: "en",
        depth: 0,
      });

      if (!englishDoc) {
        throw new Error(`Resource not found: ${documentId}`);
      }

      const totalTranslatedCounts: Record<string, number> = {};

      for (const targetLocale of locales) {
        if (targetLocale === "en") continue;

        console.log(`[Job ${job.id}] Processing locale: ${targetLocale}`);

        if (!isValidLocale(targetLocale)) {
          console.warn(`[Job ${job.id}] Invalid locale: ${targetLocale}`);
          continue;
        }

        const targetDoc = await payload.findByID({
          collection: "resources",
          id: documentId,
          locale: targetLocale,
        });

        const fieldsToTranslate = extractFieldsToTranslate(
          englishDoc,
          targetDoc,
          force || false,
          targetLocale,
        );

        if (fieldsToTranslate.length === 0) {
          console.log(
            `[Job ${job.id}] No fields to translate for locale ${targetLocale}`,
          );
          continue;
        }

        console.log(
          `[Job ${job.id}] Translating ${fieldsToTranslate.length} fields for ${targetLocale}`,
        );

        const translationsByPath =
          await executeBatchTranslation(fieldsToTranslate);

        const updateData = buildUpdateData(
          englishDoc,
          targetDoc,
          translationsByPath,
        );

        updateData._translationMeta = {
          lastTranslatedAt: new Date().toISOString(),
          translatedBy: "auto",
        };

        await payload.update({
          collection: "resources",
          id: documentId,
          locale: targetLocale,
          data: updateData,
        });

        totalTranslatedCounts[targetLocale] = fieldsToTranslate.length;
      }

      const totalTranslated = Object.values(totalTranslatedCounts).reduce(
        (a, b) => a + b,
        0,
      );

      console.log(
        `[Job ${job.id}] Translation completed. Total fields translated: ${totalTranslated}`,
        totalTranslatedCounts,
      );

      return {
        output: {
          success: true,
          translated: totalTranslated,
          details: totalTranslatedCounts,
        },
      };
    } catch (error) {
      console.error(`[Job ${job.id}] Translation failed:`, error);
      throw error;
    }
  },
};

What This Does:

Simple & Focused - Handles only Resources collection for tutorial clarity
Selective Translation - Only translates empty fields by default
Force Mode - Option to re-translate all fields
Retry Logic - 3 attempts with 1-second delays
Structure Preservation - Maintains array order and IDs
Fallback - Uses English value if translation fails
Metadata Tracking - Records when and how translation occurred

Want to translate multiple collections? This tutorial simplifies the code for learning purposes. For a production-ready implementation with type-safe mappers, field extractors, and support for multiple collections, check out the complete codebase on GitHub. Don't forget to give it a star!

Step 5: Create API Route for Job Queueing

This endpoint triggers translation jobs. It validates the requesting admin session, sanitizes and checks the request body (documentId, locales, force), and queues the translateResource job with the provided input. The implementation below uses cookie-based Payload auth and returns the queued job's ID on success.

// src/app/api/translate/route.ts
import { NextRequest, NextResponse } from "next/server";
import { getPayload } from "payload";
import config from "@/payload.config";

export async function POST(request: NextRequest) {
  try {
    const payload = await getPayload({ config });

    const cookies = request.cookies;
    const payloadToken = cookies.get("payload-token");

    if (!payloadToken) {
      return NextResponse.json({ error: "Unauthorized" }, { status: 401 });
    }

    try {
      const { user } = await payload.auth({ headers: request.headers });
      if (!user) {
        return NextResponse.json({ error: "Unauthorized" }, { status: 401 });
      }
    } catch (authError) {
      return NextResponse.json({ error: "Unauthorized" }, { status: 401 });
    }

    const body = await request.json();
    const { documentId, locales, force } = body;

    if (!documentId || !locales) {
      return NextResponse.json(
        {
          error: "Missing required fields: documentId, locales",
        },
        { status: 400 },
      );
    }

    if (!Array.isArray(locales) || locales.length === 0) {
      return NextResponse.json(
        { error: "locales must be a non-empty array" },
        { status: 400 },
      );
    }

    const job = await payload.jobs.queue({
      task: "translateResource",
      input: {
        documentId,
        locales: locales.map((locale: string) => ({ locale })),
        force: force || false,
      },
      queue: "translation",
    });

    console.log(
      `Translation job queued: ${job.id} for resources:${documentId}`,
      { locales },
    );

    return NextResponse.json({
      success: true,
      jobId: job.id,
      message: "Translation job queued successfully",
    });
  } catch (error) {
    console.error("Failed to queue translation job:", error);
    return NextResponse.json(
      {
        error: "Internal server error",
        message: error instanceof Error ? error.message : "Unknown error",
      },
      { status: 500 },
    );
  }
}

Step 6: Create Admin UI Components

This step adds an admin-side UI — a translate button and modal — that lets editors pick target locales, toggle force mode, and queue translation jobs without leaving the editor.

Translate Button:

// src/components/TranslateButton.tsx
"use client";

import React, { useState } from "react";
import { toast, Button, useModal, useDocumentInfo } from "@payloadcms/ui";
import { LOCALES_WITHOUT_EN, TranslateModal } from "./TranslateModal";

export const TranslateButton: React.FC = () => {
  const { id } = useDocumentInfo();
  const [isLoading, setIsLoading] = useState(false);
  const { openModal } = useModal();

  const handleTranslateClick = async () => {
    if (!id) {
      toast.error("Document ID is required");
      return;
    }

    setIsLoading(true);

    try {
      if (LOCALES_WITHOUT_EN.length === 0) {
        toast.error("No target locales available for translation");
        return;
      }

      openModal("translate-modal");
    } catch (error) {
      console.error("Error opening translate modal:", error);
      toast.error("Failed to open translation modal");
    } finally {
      setIsLoading(false);
    }
  };

  if (!id) {
    return null;
  }

  return (
    <>
      <Button
        buttonStyle="primary"
        onClick={handleTranslateClick}
        disabled={isLoading}
      >
        {isLoading ? "Loading..." : "Translate (BETA)"}
      </Button>
      <TranslateModal />
    </>
  );
};

Translate Modal:

// src/components/TranslateModal.tsx
"use client";

import React, { useState } from "react";
import {
  Button,
  toast,
  Modal,
  useModal,
  useDocumentInfo,
} from "@payloadcms/ui";
import { LOCALES_WITHOUT_EN } from "@/locales";

export const TranslateModal: React.FC = () => {
  const { closeModal } = useModal();
  const { id } = useDocumentInfo();
  const [selectedLocales, setSelectedLocales] = useState<Set<string>>(
    new Set(),
  );
  const [force, setForce] = useState(false);
  const [isSubmitting, setIsSubmitting] = useState(false);

  const handleLocaleToggle = (locale: string) => {
    const newSelected = new Set(selectedLocales);
    if (newSelected.has(locale)) {
      newSelected.delete(locale);
    } else {
      newSelected.add(locale);
    }
    setSelectedLocales(newSelected);
  };

  const handleSelectAll = () => {
    if (selectedLocales.size === LOCALES_WITHOUT_EN.length) {
      setSelectedLocales(new Set());
    } else {
      setSelectedLocales(new Set(LOCALES_WITHOUT_EN));
    }
  };

  const handleSubmit = async () => {
    if (selectedLocales.size === 0) {
      toast.error("Please select at least one locale");
      return;
    }

    if (!id) {
      toast.error("Document ID is required");
      return;
    }

    setIsSubmitting(true);

    const body = {
      documentId: id.toString(),
      locales: Array.from(selectedLocales),
      force,
    };

    try {
      const response = await fetch("/api/translate", {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
        },
        credentials: "include",
        body: JSON.stringify(body),
      });

      if (!response.ok) {
        const error = await response.json();
        throw new Error(error.message || "Failed to start translation job");
      }

      const result = await response.json();

      toast.success(
        `Translation job #${result.jobId} queued! Check the Jobs section for progress.`,
      );

      closeModal("translate-modal");

      // Reset form
      setSelectedLocales(new Set());
      setForce(false);
    } catch (error) {
      console.error("Translation error:", error);
      toast.error(
        error instanceof Error
          ? error.message
          : "Failed to start translation job",
      );
    } finally {
      setIsSubmitting(false);
    }
  };

  return (
    <Modal slug="translate-modal">
      <div
        style={{
          display: "flex",
          alignItems: "center",
          justifyContent: "center",
          minHeight: "100vh",
          padding: "20px",
        }}
      >
        <div
          style={{
            background: "var(--theme-elevation-0)",
            borderRadius: "8px",
            padding: "32px",
            maxWidth: "600px",
            width: "100%",
            boxShadow: "0 0 16px rgba(0, 0, 0, 0.25)",
          }}
        >
          <h2 style={{ marginTop: "0", marginBottom: "24px" }}>
            Auto-Translate
          </h2>

          <div>
            <div
              style={{
                display: "flex",
                justifyContent: "space-between",
                alignItems: "center",
              }}
            >
              <label style={{ fontWeight: "bold" }}>Locales</label>
              <Button
                buttonStyle="secondary"
                size="small"
                onClick={handleSelectAll}
              >
                {selectedLocales.size === LOCALES_WITHOUT_EN.length
                  ? "Deselect All"
                  : "Select All"}
              </Button>
            </div>
            <div
              style={{
                border: "1px solid #ccc",
                borderRadius: "4px",
                padding: "12px",
                maxHeight: "200px",
                overflowY: "auto",
                gap: "8px",
                marginTop: "8px",
              }}
            >
              {LOCALES_WITHOUT_EN.map((locale) => (
                <div
                  key={locale}
                  style={{
                    display: "flex",
                    alignItems: "center",
                    marginBottom: "8px",
                  }}
                >
                  <input
                    type="checkbox"
                    id={`locale-${locale}`}
                    checked={selectedLocales.has(locale)}
                    onChange={() => handleLocaleToggle(locale)}
                    style={{ marginRight: "8px" }}
                  />
                  <label htmlFor={`locale-${locale}`}>{locale}</label>
                </div>
              ))}
            </div>
          </div>

          <div
            style={{ display: "flex", alignItems: "center", marginTop: "20px" }}
          >
            <input
              type="checkbox"
              id="force-translate"
              checked={force}
              onChange={(e) => setForce(e.target.checked)}
              style={{ marginRight: "16px" }}
            />
            <label htmlFor="force-translate">
              <strong>Force re-translate all fields</strong>
              <br />
              <span style={{ fontSize: "0.75em", color: "#666" }}>
                If unchecked, only empty fields will be translated
              </span>
            </label>
          </div>

          <div
            style={{
              display: "flex",
              justifyContent: "flex-end",
              gap: "10px",
              marginTop: "24px",
            }}
          >
            <Button
              buttonStyle="secondary"
              onClick={() => closeModal("translate-modal")}
              disabled={isSubmitting}
            >
              Cancel
            </Button>
            <Button onClick={handleSubmit} disabled={isSubmitting}>
              {isSubmitting ? "Starting..." : "Start Translation"}
            </Button>
          </div>
        </div>
      </div>
    </Modal>
  );
};

Step 7: Update Collection Configuration

Wire the translate button into the collection UI and add a _translationMeta group so automated translations and audit data are stored alongside your content.

// src/collections/Resources.ts
import type { CollectionConfig } from "payload";

export const Resources: CollectionConfig = {
  slug: "resources",
  admin: {
    useAsTitle: "title",
    components: {
      edit: {
        SaveButton: "@/components/TranslateButton",
      },
    },
  },
  fields: [
    {
      name: "title",
      type: "text",
      localized: true,
      required: true,
      admin: {
        components: {
          Field: "@/components/LocalizedTextField",
        },
      },
    },
    {
      name: "list",
      type: "array",
      fields: [
        {
          name: "name",
          type: "text",
          localized: true,
          admin: {
            components: {
              Field: "@/components/LocalizedTextField",
            },
          },
        },
      ],
    },
    {
      name: "_translationMeta",
      type: "group",
      admin: {
        description: "Metadata about automatic translations",
        condition: (_, siblingData) => !!siblingData._translationMeta,
      },
      fields: [
        {
          name: "lastTranslatedAt",
          type: "date",
          admin: {
            readOnly: true,
          },
        },
        {
          name: "translatedBy",
          type: "text",
          admin: {
            readOnly: true,
          },
        },
      ],
    },
  ],
};

Step 8: Update Payload Config

Register the translation job, enable localization settings, and ensure the job is included in Payload's jobs configuration so queued translations run reliably.

// src/payload.config.ts
import path from "path";
import { buildConfig } from "payload";
import { fileURLToPath } from "url";
import { sqliteAdapter } from "@payloadcms/db-sqlite";
import { lexicalEditor } from "@payloadcms/richtext-lexical";
import { Resources } from "./collections/Resources";
import { Users } from "./collections/Users";
import { translateResourceJob } from "./jobs/translateResource";
import { AVAILABLE_LOCALES } from "./locales";

const filename = fileURLToPath(import.meta.url);
const dirname = path.dirname(filename);

export default buildConfig({
  admin: {
    user: "users",
    importMap: {
      baseDir: path.resolve(dirname),
    },
  },

  // Localization config
  localization: {
    locales: AVAILABLE_LOCALES,
    defaultLocale: "en",
    fallback: true,
  },

  // Jobs config for auto-translation
  jobs: {
    tasks: [translateResourceJob],
  },

  collections: [Users, Resources],
  editor: lexicalEditor(),
  secret: process.env.PAYLOAD_SECRET || "dev-secret-change-me",
  typescript: {
    outputFile: path.resolve(dirname, "payload-types.ts"),
  },
  db: sqliteAdapter({
    client: {
      url: process.env.DATABASE_URL || "file:./payload.db",
    },
  }),
});

Step 9: Generate Import Map and Types

npm run generate:importmap
npm run generate:types

This registers your custom components and generates TypeScript types.

How It Works

Complete Translation Flow

User clicks "Translate" button in admin UI
Modal opens with locale selection and force mode option
User submits translation request
API route verifies authentication and validates parameters
Job is queued in Payload's background job system
Job handler fetches English document from database
For each target locale:
- Fetch existing locale document
- Identify fields needing translation
- Group by locale for batch processing
Translation service:
- Checks cache for each text
- Batches uncached texts (up to 100 per request)
- Calls Azure Translator API
- Stores results in cache
Job handler updates documents with translations
Metadata recorded (timestamp, auto-translated flag)
User sees success message with job ID

Monitoring and Debugging

View Job Status

Navigate to Admin → Jobs in Payload
Find your translation job by ID
View status: pending, processing, completed, or failed
Check output for translation counts

Console Logs

The job handler logs progress:

[Job abc123] Translation started for resources:65f7a...
[Job abc123] Processing locale: es
[Job abc123] Translating 15 fields for es
[Job abc123] Translation completed. Total: 15

Error Handling

Common errors and solutions:

Error	Cause	Solution
`AZURE_TRANSLATOR_KEY is not configured`	Missing env var	Add to `.env.local`
`Unauthorized`	Not logged into admin	Log in to Payload admin
`Document not found`	Invalid document ID	Verify document exists
`Translation failed: Rate limit`	Too many requests	Wait and retry

Advanced Customization

Adding More Locales

Update two places:

Locales file:

// src/locales.ts
export const AVAILABLE_LOCALES: string[] = ["en", "es", "fr", "de"];
export const LOCALES_WITHOUT_EN = AVAILABLE_LOCALES.filter(
  (value) => value !== "en",
);

Regenerate types:

npm run generate:types

The locales are automatically used by:

Payload's localization config (in payload.config.ts)
TranslateModal component (imports LOCALES_WITHOUT_EN)
Translation validation (in locales.ts)

Supporting More Field Types

Extend the job handler to support other localized fields:

// Add textarea support
if (englishDoc.description) {
  if (shouldTranslate(englishDoc.description, targetDoc.description)) {
    fieldsToTranslate.push({
      path: "description",
      value: englishDoc.description,
      locale: targetLocale,
    });
  }
}

// Add nested fields
if (englishDoc.metadata?.title) {
  if (shouldTranslate(englishDoc.metadata.title, targetDoc.metadata?.title)) {
    fieldsToTranslate.push({
      path: "metadata.title",
      value: englishDoc.metadata.title,
      locale: targetLocale,
    });
  }
}

Using Redis for Distributed Caching

For multi-server deployments, replace LRU cache with Redis:

import { createClient } from "redis";

const redis = createClient({
  url: process.env.REDIS_URL,
});

await redis.connect();

// Get from cache
const cached = await redis.get(cacheKey);

// Set in cache
await redis.set(cacheKey, translatedText, {
  EX: CACHE_TTL, // Expiration in seconds
});

Benefits:

Shared cache across server instances
Persistent cache across restarts
Higher capacity (GBs vs MBs)

Auto-Translate on Save (Delta Translation)

Use Payload hooks (for example beforeChange/afterChange) to detect which English fields were modified and queue translation jobs for only those deltas. This keeps translations in sync, reduces API usage, and avoids translating unchanged content. Consider UI indicators for in-progress translations, field-level locking, and debouncing to prevent rapid repeated jobs.

Multi-Provider Translation Support

Azure is a solid default, but it isn't the only option — some languages or content types can produce better results with DeepL, Google Cloud, OpenAI, or other services. Add a provider abstraction and selection rules so you can route specific locales or content categories to the best provider, implement fallbacks, and monitor costs/quotas to balance quality against price.

Production Checklist

Before deploying to production:

[ ] Environment variables configured in production
[ ] Azure Translator key has sufficient quota
[ ] Job queue working (test with small document)
[ ] Backup database before first bulk translation
[ ] Cache size appropriate for your data volume
[ ] Error monitoring set up (Sentry, etc.)
[ ] Rate limiting configured if needed
[ ] Cost alerts set in Azure portal
[ ] User permissions reviewed (who can translate?)

Troubleshooting Common Issues

Translations Not Appearing

Check job status: Admin → Jobs → Find your job ID
Review job logs: Look for errors in console output
Verify authentication: Ensure you're logged into Payload admin
Check document: Switch to target locale in admin UI
Validate locales: Ensure target locale exists in config

Cache Not Working

Verify LRU cache: Check import and initialization in service
Test cache hit: Translate same content twice, check logs for "fromCache: true"
Cache size limit: Increase max if needed (default: 10,000 entries)
Memory issues: Monitor Node.js heap usage with -max-old-space-size
Production caching: Consider Redis for multi-instance deployments

Azure API Errors

"Invalid API key": Double-check AZURE_TRANSLATOR_KEY in environment variables
"Rate limit exceeded": Wait 60 seconds or upgrade Azure plan (F0 → S1)
"Invalid language code": Verify locale codes match Azure supported languages
"Text too long": Azure limit is 50,000 characters per text - batch large content
"Authentication failed": Check AZURE_TRANSLATOR_REGION matches resource location

Performance Issues

Slow translations: Increase batch size or upgrade Azure tier
Job timeouts: Split large jobs into smaller batches
Memory leaks: Clear cache periodically or use Redis
Database locks: Add retry logic with exponential backoff
Network errors: Implement connection pooling and retry strategies

Conclusion: Production-Ready Azure AI Translation for Payload CMS 2026

You've successfully implemented an enterprise-grade, scalable auto-translation system for Payload CMS with:

Azure AI Translator - Enterprise-quality translation for 100+ languages with 99.9% uptime
LRU Cache (7-day TTL) - Minimize API costs with intelligent caching (70%+ cost reduction)
Payload Background Jobs - Reliable async processing without HTTP timeouts
Batch Translation (100 texts/request) - Optimal Azure API usage and performance
Selective Mode - Preserve manual edits, translate only empty fields by default
Force Mode - Re-translate all fields when needed for content updates
Metadata Tracking - Complete audit trail for compliance and debugging
Type-Safe Implementation - Full TypeScript support with zero runtime errors

Technical Stack Summary

Payload CMS: 3.79.0
Next.js: 15.4.11 (App Router + API Routes)
Azure AI Translator: @azure-rest/ai-translation-text v1.0.1
Caching: lru-cache v11.2.7 (7-day TTL)
Utilities: radash v12.1.1 (retry logic)
TypeScript: 5.7.2
React: 19.0.0

Real-World Performance Metrics

Based on our production implementations at u11d:

Translation Speed: 100 fields in ~3 seconds (with cache: <100ms)
API Cost: $10 per million characters (Azure S1 tier)
Cache Hit Rate: 70-85% for typical content updates
Uptime: 99.9% (Azure SLA)
Concurrent Jobs: 10+ simultaneous translations without issues
Database Impact: Minimal (batched updates, optimized queries)

Next Steps & Advanced Features

Consider these production enhancements:

Automatic translation triggers - Translate on document publish via hooks
Translation quality scoring - Flag low-confidence translations for review
Custom glossaries - Brand-specific term preservation (Azure feature)
Translation memory - Learn from manual corrections
Multi-provider support - Fallback to Google/DeepL/OpenAI
Batch document translation - Translate multiple documents in one job
Translation diff viewer - Compare original vs. translated side-by-side
Cost monitoring - Track Azure API usage and set budget alerts

Additional Resources for Payload CMS Translation

Azure AI Translator Documentation - Official Azure docs
Payload Jobs API Documentation - Background job implementation
Azure Supported Languages - 100+ language codes
Azure Custom Translator - Brand terminology glossaries
Azure Pricing Calculator - Estimate translation costs
LRU Cache Documentation - Caching implementation details
Payload Discord Community - Get help from Payload experts
Complete Code on GitHub - Full implementation

Need Payload CMS Experts?

Talk to Payload Experts

How to Show Default Locale Hints in Localized Array Fields in Payload CMS (2026 Guide)

Michał Miler — Mon, 13 Apr 2026 08:34:57 +0000

When building multilingual applications with Payload CMS and Next.js, content editors face a critical UX challenge: localized array fields appear completely empty in secondary locales, making translation workflows frustrating and error-prone. This comprehensive guide shows how to implement default locale hints in Payload CMS.

Real-World Scenario: The Empty Field Problem

Imagine you're building a multilingual eCommerce product catalog or SaaS application with Payload CMS. Your schema includes localized array fields like this:

{
  name: 'list',
  type: 'array',
  fields: [
    {
      name: 'name',
      type: 'text',
      localized: true,
    },
  ]
}

What happens:

English editor fills in all product names
Spanish editor switches locale to Spanish
All fields appear completely empty
Editor has no context about what needs translation

This creates several problems:

Editors waste time - They can't see what content exists in the default locale
Translation errors - No context leads to inconsistent translations
Workflow bottlenecks - Constant back-and-forth between locales
Poor UX - Frustrating editing experience

Why This Happens

This is not a bug in Payload CMS - it's by design. Understanding why helps you implement the correct solution.

How Payload Stores Localized Data

Payload CMS handles localization at the field level, not the document level. When you mark a field as localized: true:

{
  name: 'title',
  type: 'text',
  localized: true,
}

Payload stores it like this in the database:

{
  "title": {
    "en": "English Title",
    "es": "Título en Español"
  }
}

The Admin UI Only Shows Current Locale

The admin interface only renders the value for the active locale. If es is empty, you see an empty field - even though en has data.

Arrays Are Shared Across Locales

Arrays themselves are not localized - only the fields inside them:

{
  "list": [
    {
      "id": "1",
      "name": {
        "en": "Item One",
        "es": ""
      }
    }
  ]
}

The array structure is shared, but field values are locale-specific.

Fallback Only Works in the API

Payload's fallback: true config ensures empty locale values return the default locale via the API:

// payload.config.ts
localization: {
  locales: ['en', 'es'],
  defaultLocale: 'en',
  fallback: true, // <- Only affects API responses
}

This protects your frontend from showing empty content, but doesn't help editors in the admin UI.

The Solution Architecture

Our solution has three components:

Payload Config: API-level protection
API Route Handler + Auth + Cache: Verify auth & fetch via SDK
Custom Field Component: Show fallback in admin UI

Key Design Decisions:

Use GET requests (RESTful for read operations)
Verify authentication to prevent unauthorized access
Use Next.js API Route Handlers for parallel request handling
Implement in-memory caching to minimize database queries
Use Payload SDK (not REST API) for better performance and type safety
Single reusable component that works everywhere
Zero configuration - component reads context automatically

Step-by-Step Implementation

Step 1: Enable Required Configuration

First, ensure your Payload config has fallback enabled and Next.js has the experimental useCache flag enabled:

Payload Config:

// src/payload.config.ts
import { buildConfig } from "payload";

export default buildConfig({
  // ... other config

  localization: {
    locales: ["en", "es"],
    defaultLocale: "en",
    fallback: true, // <- Essential for API protection
  },

  // ... collections, etc.
});

Next.js Config:

// next.config.ts
import { withPayload } from "@payloadcms/next/withPayload";
import type { NextConfig } from "next";

const nextConfig: NextConfig = {
  cacheComponents: true,
  experimental: {
    useCache: true, // <- Required for "use cache" directive
  },
};

export default withPayload(nextConfig);

Why this matters:

Payload fallback prevents your production frontend from showing empty content when translations are missing
Next.js useCache flag enables the "use cache" directive for optimal performance

Step 2: Create API Route Handler with Next.js Native Caching

Create a Next.js API route that uses Payload SDK to fetch the default locale value. This endpoint uses GET (RESTful for read operations), includes authentication checks for security, and leverages Next.js native caching with the "use cache" directive for optimal performance:

// src/app/api/default-locale-value/route.ts
import { NextRequest, NextResponse } from "next/server";
import { unstable_cacheLife as cacheLife } from "next/cache";
import { CollectionSlug, getPayload } from "payload";
import config from "@/payload.config";
import { get } from "radash";

async function getDefaultLocaleValue(
  collectionSlug: CollectionSlug,
  documentId: string,
  fieldPath: string,
) {
  "use cache";
  cacheLife("minutes");

  const payload = await getPayload({ config });

  const doc = await payload.findByID({
    collection: collectionSlug,
    id: documentId,
    locale: "en",
    depth: 0,
  });

  if (!doc) {
    return null;
  }

  const pathParts = fieldPath.split(".");
  let value = doc;

  for (const part of pathParts) {
    if (value === null || value === undefined) {
      return null;
    }

    value = get(value, part);
  }

  return typeof value === "string" ? value : null;
}

export async function GET(request: NextRequest) {
  try {
    const searchParams = request.nextUrl.searchParams;
    const collectionSlug = searchParams.get("collectionSlug") as CollectionSlug;
    const documentId = searchParams.get("documentId");
    const fieldPath = searchParams.get("fieldPath");

    if (!collectionSlug || !documentId || !fieldPath) {
      return NextResponse.json(
        { error: "Missing required parameters" },
        { status: 400 },
      );
    }

    const cookies = request.cookies;
    const payloadToken = cookies.get("payload-token");

    if (!payloadToken) {
      return NextResponse.json({ error: "Unauthorized" }, { status: 401 });
    }

    const payload = await getPayload({ config });

    try {
      const { user } = await payload.auth({ headers: request.headers });
      if (!user) {
        return NextResponse.json({ error: "Unauthorized" }, { status: 401 });
      }

      const value = await getDefaultLocaleValue(
        collectionSlug,
        documentId,
        fieldPath,
      );

      return NextResponse.json({ value });
    } catch (authError) {
      return NextResponse.json({ error: "Unauthorized" }, { status: 401 });
    }
  } catch (error) {
    console.error("Failed to fetch English value:", error);
    return NextResponse.json(
      { error: "Internal server error" },
      { status: 500 },
    );
  }
}

Key Points:

Next.js "use cache" directive - Automatic function-level caching
cacheLife configuration - Cache duration set to 60 seconds
RESTful GET method - Semantically correct for read operations
Authentication required - Verifies user is logged into Payload admin
Cookie-based auth - Checks for payload-token and validates session
Uses getPayload() for direct database access
Handles nested paths like list.0.name automatically
Returns null gracefully on errors
Zero configuration caching - No manual cache management needed

Step 3: Create Localized Text Field Component

Create a custom field component that shows the English value as a reference:

// src/components/LocalizedTextField.tsx
"use client";

import {
  useField,
  useLocale,
  useDocumentInfo,
  TextInput,
} from "@payloadcms/ui";
import React, { CSSProperties, useEffect, useState } from "react";

const FALLBACK_STYLE: CSSProperties = {
  marginTop: "8px",
  padding: "4px 8px",
  backgroundColor: "#f5f5f5",
  borderRadius: "4px",
  fontSize: "12px",
};

interface LocalizedTextFieldProps {
  path: string;
}

export const LocalizedTextField: React.FC<LocalizedTextFieldProps> = ({
  path,
}) => {
  const locale = useLocale();
  const { id, collectionSlug } = useDocumentInfo();
  const { value, setValue } = useField<string>({ path });
  const [englishValue, setEnglishValue] = useState<string | null>(null);
  const [loading, setLoading] = useState(false);

  const isEnglish = locale.code === "en";

  useEffect(() => {
    if (isEnglish || !id || !collectionSlug) {
      setEnglishValue(null);
      return;
    }

    const fetchEnglishValue = async () => {
      setLoading(true);
      try {
        const params = new URLSearchParams({
          collectionSlug,
          documentId: id.toString(),
          fieldPath: path,
        });

        const response = await fetch(`/api/default-locale-value?${params}`, {
          method: "GET",
          credentials: "include", // Include cookies for authentication
        });

        if (!response.ok) {
          throw new Error("Failed to fetch English value");
        }

        const data = await response.json();
        setEnglishValue(data.value);
      } catch (error) {
        console.error("Failed to fetch English value:", error);
      } finally {
        setLoading(false);
      }
    };

    fetchEnglishValue();
  }, [id, isEnglish, path, collectionSlug]);

  return (
    <div>
      <TextInput
        path={path}
        value={value || ""}
        onChange={(e: React.ChangeEvent<HTMLInputElement>) =>
          setValue(e.target.value)
        }
      />

      {!isEnglish && englishValue && (
        <div style={FALLBACK_STYLE}>EN: {englishValue}</div>
      )}

      {loading && <div style={FALLBACK_STYLE}>Loading English value...</div>}
    </div>
  );
};

export default LocalizedTextField;

What This Component Does:

Reads current locale from Payload's context
Fetches English value via authenticated GET request
Includes credentials to pass authentication cookies
Shows reference below the input for editorial context
Handles loading states gracefully
Works automatically - no manual configuration needed
Benefits from caching - subsequent loads are instant

Step 4: Apply to Your Collection Fields

Update your collection to use the custom component:

// src/collections/Resources.ts
import type { CollectionConfig } from "payload";

export const Resources: CollectionConfig = {
  slug: "resources",
  admin: {
    useAsTitle: "title",
  },
  fields: [
    {
      name: "title",
      type: "text",
      localized: true,
      required: true,
      admin: {
        components: {
          Field: "@/components/LocalizedTextField",
        },
      },
    },
    {
      name: "list",
      type: "array",
      fields: [
        {
          name: "name",
          type: "text",
          localized: true,
          admin: {
            description: "Localized field with English fallback reference",
            components: {
              Field: "@/components/LocalizedTextField",
            },
          },
        },
      ],
    },
  ],
};

Notice:

Simple configuration - just reference the component path
No props needed - component reads everything from context
Works in top-level fields (title) and array fields (list.*.name)
Same component, no duplication

Step 5: Generate Import Map

Run Payload's import map generator to register your custom components:

npx payload generate:importmap

This updates Payload's admin UI to use your custom components.

For production deployments with multiple localized collections, consider implementing automated translation workflows to scale your multilingual content. Check our Auto-Translation with Azure AI guide for the complete implementation.

How It Works: Request Flow

Flow Diagram

Editor switches to Spanish locale
Component detects locale is not English
Component calls GET /api/default-locale-value with query params
Route verifies user authentication via Payload session
Route checks Next.js native cache layer
Cache miss: Executes getDefaultLocaleValue (cached function)
Fetches document with locale=en via Payload SDK
Navigates to field path (e.g., list.0.name)
Result cached automatically by Next.js for 60 seconds
Returns to component
Component displays English value below input field

Performance Benefits with Next.js 15 Caching:

When editing a document with 100 localized array fields:

First load: 100 API requests → cached by Next.js
Subsequent loads (within 60s): Instant cache hits
Automatic invalidation: Cache refreshes after TTL expires
Edge-optimized: Compatible with Vercel Edge Runtime

Example: Editing an Array Item

English Locale:

Title: [Product Overview]

Spanish Locale (before typing):

Title: [                    ]
EN: Product Overview

Spanish Locale (after typing):

Title: [Descripción del Producto]
EN: Product Overview

The English reference stays visible for context.

Conclusion: Production-Ready Payload CMS Localization UX for 2026

Localized fields in Payload CMS arrays work correctly by design, but the default admin UX doesn't support translation workflows well. By implementing this Next.js 15 + Payload CMS solution with native caching, you:

Solve the empty field problem for editors - No more context switching between locales
Use Next.js 15 native caching for optimal performance - Instant load times with use cache directive
Follow RESTful API design with GET requests - Semantically correct, cacheable endpoints
Implement authentication to secure your endpoints - Prevent unauthorized access to field data
Use Payload SDK for type-safe database access - Better performance than REST API
Enable automatic cache management - Zero-config caching with configurable TTL
Create a reusable component that works everywhere - Single component for all localized fields
Lay the foundation for translation automation - Ready for AI-powered workflows

Key Takeaways for Developers & AI Agents

Enable fallback: true in Payload config for API-level protection
Use GET requests for read operations (RESTful best practice)
Always verify authentication to prevent unauthorized data access
Leverage Next.js use cache directive for function-level caching
Use Payload SDK (not REST API) for better type safety and performance
Create reusable components that read context automatically via hooks
Include credentials in fetch requests for cookie-based authentication
Consider automation for large-scale translation - see Auto-Translation with Azure AI
Implement security best practices from our Payload CMS Security guide
Monitor performance with proper caching TTL and cache hit rates

Need Payload CMS Experts?

Talk to Payload Experts

Asset-Based Data Orchestration: Lessons from Building a Multi-State Social Data Platform

uninterrupted — Thu, 26 Mar 2026 07:21:25 +0000

Building reliable data platforms rarely fails because of scale alone.
More often, reliability collapses under heterogeneity: multiple data providers, inconsistent schemas, partial updates, and unclear ownership.

While building a multi-state social data platform ingesting resource data from dozens of organizations, we discovered that reliability is not a property of pipelines. It is a property of data artifacts and their relationships.

1. Why Reliability Becomes a Systems Problem at Scale

The accompanying essay frames trust as something earned through consistent system behavior under real-world pressure. What that framing leaves implicit, but what became unavoidable in practice, is that reliability stops being a property of individual components very early on. Once multiple organizations, jurisdictions, and publishing surfaces are involved, reliability becomes an emergent property of the entire system.

For us, this meant that no single pipeline, connector, or database could be made "reliable enough" in isolation. Failures were rarely total. They were partial, localized, and often silent. The engineering challenge was not preventing all failure, but designing the system so that failures were isolated, detectable, and explainable.

That requirement drove nearly every architectural decision that followed.

2. System Overview

Technically, the platform ingests social service resource data from independent organizations operating across multiple U.S. states. Each organization exposes data via a different source system (for example iCarol, WellSky, VisionLink, RTM), with varying schemas, update cadences, and quality guarantees.

At a high level, the system consists of:

Writers - per-tenant ingestion and transformation projects that:
- Fetch raw data from source systems via connector adapters
- Persist raw data into Snowflake source schemas
- Normalize and standardize data via DBT
- Apply enhancements such as geocoding or translations
Readers - publisher processes that:
- React to completed writer runs, either bulk or incremental
- Publish curated artifacts to OpenSearch, MongoDB, and optionally Postgres

Snowflake acts as the system of record for intermediate and normalized datasets. Dagster coordinates the execution and materialization of data assets. DBT is used explicitly for set-based transformations, not orchestration.

Scale is not extreme in raw volume, but complexity is high: dozens of tenants, hundreds of tables per tenant, and frequent partial updates.

High-Level Architecture

The following diagram illustrates the writer → Snowflake → DBT → asset orchestration → readers pattern.

The platform is designed around asset lineage and normalization contracts, not pipelines.

3. Source Heterogeneity as the Dominant Constraint

The hardest constraint was not throughput or storage. It was heterogeneity.

Each data provider differed along several axes simultaneously:

Schema shape: even when nominally "the same" entities existed, fields varied
Semantics: identical fields often meant subtly different things
Update cadence: some sources updated continuously, others weekly or ad hoc
Quality guarantees: missing fields, stale records, or partial exports were common

Early on, we underestimated how strongly these differences would dominate system design. Attempting to treat ingestion as a uniform, pipeline-shaped process led to brittle assumptions and cross-tenant coupling.

The system only became manageable once heterogeneity was treated as fundamental, not incidental.

4. Normalization (HSDS, SDOH or Equivalent) as an Architectural Contract

Normalization into an HSDS-like model was not implemented as a downstream convenience. It became an architectural contract.

All downstream consumers, internal and external, implicitly rely on the guarantees of the normalized model: stable fields, predictable relationships, and documented semantics. That meant normalization could not be "best effort" or delayed until the end of a pipeline.

In practice:

Raw source data is written verbatim into Snowflake source schemas
DBT ELT projects transform this raw data into standardized intermediate models
DBT STAGE projects apply tenant-specific adaptations while preserving the contract

This separation made it explicit where interpretation happens. If a field is wrong in the normalized model, the question becomes which contract was violated, not "what broke in the pipeline".

5. What We Got Wrong Initially

Several early assumptions did not survive contact with reality.

Pipeline-first thinking

We initially modeled work as long-running jobs. This obscured which intermediate datasets were durable, reusable, or safe to depend on. Debugging often meant rerunning more than necessary.

Manual validation

Data quality checks lived outside the orchestration layer. Engineers and analysts manually inspected outputs, which worked at small scale but failed under concurrency and time pressure.

Shared failure domains

Multiple tenants often shared execution paths. A failure in one tenant’s ingestion could block or delay others, even when their data was unrelated.

None of these issues were catastrophic individually. Together, they made reliability depend on human attention.

6. Transitioning to Asset-Based Data Orchestration with Dagster

The shift to asset-based orchestration was driven less by tooling preference and more by a change in mental model.

Instead of asking "what jobs should run?", we started asking:

What data artifacts must exist?
What do they depend on?
How fresh do they need to be?
What constitutes success or failure for this artifact?

Dagster assets provided a way to encode those questions directly.

A simplified example from a writer project shows how DBT models are treated as assets rather than opaque steps:

# writer-xyz/assets.py (excerpt)
from dagster_dbt import load_assets_from_dbt_project

dbt_elt_assets = load_assets_from_dbt_project(
    project_dir="dbt_elt",
    profiles_dir="dbt_elt",
)

This does not explain how DBT runs. It declares that the resulting tables are first-class assets with lineage and state.

Once assets replaced jobs as the primary abstraction, freshness, lineage, and partial recomputation became explicit rather than implicit.

7. Partitioning for Failure Isolation

Partitioning was critical for isolating failures.

We partitioned primarily along tenant and state boundaries, not time. This reflected operational reality: data issues almost always affected a single organization or region.

In Dagster terms, this meant:

Separate writer projects per tenant
Independent schedules and sensors
Asset materializations scoped to a tenant’s data domain

A failure in one writer no longer blocked publishing for others. More importantly, remediation could be targeted and auditable.

8. Data Quality Embedded in the Asset Graph

Data validation moved into the asset graph itself.

Instead of post-hoc checks, validations became explicit dependencies. If a validation asset failed, downstream assets simply did not materialize.

An example pattern used across writers:

@asset
def validate_staging_tables(staging_tables):
    assert staging_tables.count_missing_ids() == 0

This is intentionally simple. The key point is not the check itself, but that failure is structural. The system records that an expected artifact does not exist, rather than silently publishing bad data.

This shifted failure detection earlier and reduced the blast radius of errors.

9. Operational Outcomes

Day-to-day operations changed in several concrete ways:

On-call work shifted from rerunning pipelines to inspecting asset lineage
Partial backfills became routine rather than exceptional
Publishing delays were easier to attribute to specific upstream causes
New tenants could be added without increasing shared operational risk
None of this eliminated operational effort. It made that effort more focused and less reactive.

10. Open Trade-offs and Unresolved Questions

Some challenges remain unresolved:

Cross-tenant schema evolution still requires coordination and discipline
Observability across Snowflake, DBT, Dagster, and downstream stores is fragmented
Cost attribution at the asset level is still coarse-grained
Human review remains necessary for certain semantic validations

With more time, we would invest earlier in unified observability and more formal schema versioning.

11. Why These Lessons Matter Beyond This Platform

These lessons are not unique to this system.

Any platform operating in civic tech, govtech, or environmental data shares similar constraints: multiple data producers, uneven quality, and real-world consequences for failure.

The core takeaway is not "use asset-based orchestration", but treat data artifacts as obligations. Once that shift happens, many architectural decisions become clearer.

Reliability stops being something you hope for and becomes something you can reason about.

Final Thoughts

The biggest lesson from this platform was not about any particular tool. It was about how we model the system itself.

Once data artifacts became the core abstraction, many reliability problems became easier to reason about. Failures became visible, dependencies became explicit, and operational work shifted from firefighting pipelines to managing data contracts.

For data platforms operating across heterogeneous sources, this shift can be the difference between a system that merely runs - and one that can be trusted.

FAQ

What is asset-based data orchestration?

Asset-based data orchestration treats data artifacts (tables, datasets, models) as the primary units of orchestration instead of pipelines or jobs. Systems like Dagster allow teams to define dependencies between assets, enabling better lineage tracking, partial recomputation, and failure isolation.

Why is asset-based orchestration useful for complex data platforms?

In large systems with many data producers and consumers, failures are rarely binary. Asset-based orchestration makes dependencies explicit, allowing teams to:

isolate failures
recompute only affected datasets
track lineage across transformations
enforce data quality checks before publishing

How does Dagster differ from traditional pipeline orchestrators?

Traditional orchestrators schedule jobs or workflows. Dagster emphasizes data assets and lineage. This enables better visibility into:

what datasets exist
what produced them
what depends on them
whether they meet freshness or quality requirements

When should teams adopt asset-based orchestration?

Asset-based orchestration becomes particularly valuable when systems have:

many data sources
heterogeneous schemas
multiple downstream consumers
partial or incremental updates
strict reliability requirements

These conditions are common in multi-tenant data platforms and civic tech systems.

Does asset-based orchestration replace tools like DBT?

No. Asset-based orchestration complements transformation tools like DBT. In many architectures:

DBT performs set-based transformations
Dagster manages asset lineage, dependencies, scheduling, and validation

This separation keeps orchestration and transformation responsibilities clear.

Orchestrating Trust: Building Reliable Data Systems for Social Impact

Paweł Sławacki — Tue, 24 Mar 2026 08:54:40 +0000

How production-grade orchestration enables impact at scale

Systems built for social impact are often judged by their intent. The assumption is that good outcomes naturally follow from good goals, and that technical sophistication is secondary to mission. In practice, the opposite is often true. When information systems fail, the cost is not measured in lost revenue or delayed insights, but in missed opportunities for help, support, or timely intervention.

At scale, social impact is not a matter of aspiration. It is a matter of reliability.

Platforms that serve vulnerable populations operate under constraints that are both technical and ethical. Information must be accurate, current, and accessible. It must adapt as the world changes. It must withstand uneven demand and tolerate partial failure without collapsing. Most importantly, it must continue operating without constant human supervision, because manual intervention does not scale to moments of urgency.

These requirements are familiar to anyone who has built systems for finance, logistics, or large-scale commerce. What differs is the margin for error. In social contexts, latency is not an inconvenience. Inconsistency is not an annoyance. Failure is not an abstract metric. The system either delivers trustworthy information when it is needed, or it does not.

This is where data architecture quietly becomes social infrastructure.

The hidden fragility of well-intentioned systems

Many social platforms begin with pragmatic solutions. Data is collected from disparate sources, normalized through custom pipelines, and exposed through simple interfaces. Early success reinforces the approach: the system works, organizations adopt it, and impact grows.

Over time, however, complexity accumulates. Data sources evolve independently. Update cycles diverge. Quality varies across contributors. What once felt manageable starts to strain under its own assumptions.

In building social data platforms, we learned that fragility rarely appears all at once. It emerges gradually. Pipelines grow longer. Reprocessing becomes broader than necessary. Validation shifts from design to manual oversight. Eventually, the system still functions but confidence in its outputs begins to erode.

When correctness depends on human vigilance, availability depends on institutional memory. When updates become opaque, trust shifts away from architecture toward individual heroics. For systems intended to support people under real-world pressure, this is an unsustainable state.

The problem is not a lack of data or compute. It is a lack of structural guarantees.

From pipelines to obligations

Traditional data pipelines are designed around execution. They define a sequence of tasks that transform inputs into outputs. This model assumes that intermediate states are transient and that value resides primarily at the end of the flow.

In social data systems, this assumption does not hold.

Normalized datasets, enriched resources, derived aggregates-these are not disposable by-products. They are durable artefacts with meaning beyond a single run. They are reused, audited, compared over time, and relied upon by downstream organizations making decisions under uncertainty.

Once data outputs are treated as obligations rather than by-products, the role of orchestration changes fundamentally. The system’s responsibility is no longer to run jobs, but to ensure that specific states of data exist, remain current, and remain explainable.

This distinction matters because obligations persist. They require guarantees: freshness, lineage, and reproducibility. They require the system to know what it has produced, what it depends on, and what must change when assumptions shift.

In this framing, orchestration stops being an operational convenience and becomes a form of governance.

Reliability under real-world constraints

These principles became tangible while building Connect211, a modern search platform designed to support 211 organizations operating across multiple U.S. states. The platform aggregates resource data from independent organizations, each maintaining its own systems, taxonomies, and update rhythms.

What we learned early is that reliability in such an environment cannot be retrofitted. Data sources change independently. Failures are localized. Demand is uneven and often event-driven. Manual coordination quickly becomes the bottleneck.

Meeting these constraints required treating data artefacts as first-class citizens. Each normalized dataset, each enrichment step, each derived index represents a commitment: this information must exist, must be correct, and must be traceable back to its origins.

Asset-oriented orchestration provided a natural way to express these commitments. Instead of reasoning about execution order, the system reasons about data state. Instead of pushing data through pipelines, it ensures that required artefacts are materialized and kept current as upstream conditions change.

Dagster’s asset-based model aligned closely with this way of thinking. It allowed us to encode not only how data is processed, but what must be true for the system to be considered healthy. Orchestration became a mechanism for maintaining trust rather than merely coordinating tasks.

Automation without opacity

Automation is often presented as a universal solution to scale. In social systems, automation without structure can be as dangerous as manual fragility. When updates propagate automatically but their causes remain hidden, errors scale just as efficiently as value.

What distinguishes resilient systems is not the absence of automation, but the presence of clarity. Asset-based orchestration preserves the narrative of the data. Every artefact carries its provenance. Every update has a reason. When stakeholders ask why information changed, the answer is embedded in the structure of the system itself.

In environments where information influences real-world outcomes, this explainability underpins legitimacy. Trust is not established through assurances, but through the ability to demonstrate correctness when it matters.

Automation, in this sense, is not about removing humans from the loop. It is about ensuring that when humans intervene, they do so with understanding rather than guesswork.

Social impact as an emergent property

It is tempting to frame social impact in terms of outcomes alone. Did the platform help more people? Did it improve access? Did it reduce friction?

These questions are essential, but they are downstream. They describe effects, not causes.

At scale, social impact emerges from systems that behave predictably under stress. From platforms that continue operating as inputs change unexpectedly. From architectures that degrade gracefully rather than fail catastrophically. From data systems that are transparent by design rather than opaque by accident.

The same principles that govern production-grade platforms in commercial domains apply here, but with heightened stakes. Reliability is not an optimization. It is the foundation upon which impact rests.

In this light, orchestration is not a technical detail. It is part of the social contract embedded in the system-defining how obligations are met, how failures are contained, and how trust is maintained over time.

Beyond mission-driven engineering

There is a persistent tendency to treat social platforms as exceptional-worthy of different standards because their goals are noble. In practice, this often leads to underinvestment in architecture, justified by urgency or limited resources.

Our experience suggests the opposite conclusion. When tolerance for error is low and consequences are real, architectural rigor becomes more important, not less. Production-grade data systems are not at odds with social missions. They are prerequisites for sustaining them.

Asset-based orchestration, exemplified by tools like Dagster, provides a framework for expressing this rigor. It shifts focus from execution to responsibility, from pipelines to promises. It allows systems to scale not only in size, but in trustworthiness.

Social impact does not arise from technology alone. But without reliable systems, even the strongest intentions struggle to translate into lasting effect. When data platforms are designed as social infrastructure, reliability ceases to be a purely technical concern and becomes a public good.

FAQ: Production-Grade Orchestration for Social Impact Systems

What is asset-based orchestration in data engineering?

Asset-based orchestration is an architectural approach where data artefacts (datasets, models, indexes, aggregates) are treated as first-class citizens rather than by-products of pipeline runs.

Instead of defining execution steps, you define data states that must exist and their dependencies. The orchestration system ensures:

Correct dependency resolution
Incremental recomputation
Freshness guarantees
Lineage tracking
Failure isolation

This shifts orchestration from task coordination to state governance.

How is asset-based orchestration different from traditional pipelines?

Traditional pipelines are execution-oriented:

Step A → Step B → Step C
Outputs are transient
Reprocessing is often coarse-grained

Asset-based systems are state-oriented:

Explicit dependency graphs
Selective re-materialization
Persistent artefacts with lineage
Declarative data contracts

The key difference is that pipelines answer:

“What runs next?”

Asset-based orchestration answers:

“What must be true about the data?”

For systems operating under strict reliability constraints, that distinction is critical.

Why does data lineage matter in social impact systems?

In social systems, incorrect data can influence:

Access to services
Emergency response decisions
Resource allocation
Regulatory reporting

Lineage provides:

Auditability
Explainability
Reproducibility
Impact traceability

When stakeholders ask, “Why did this information change?”, lineage allows engineering teams to answer with certainty, not speculation.

What architectural risks do social data platforms typically face?

Common failure modes include:

Silent schema drift from independent data providers
Broad, expensive reprocessing triggered by minor upstream changes
Manual validation becoming a hidden operational dependency
Lack of observability into partial failures
Tight coupling between ingestion and serving layers

Without structural guarantees, reliability degrades gradually — often without obvious alerts.

How does orchestration contribute to data governance?

Orchestration becomes governance when it encodes:

Explicit data ownership boundaries
Dependency contracts
Freshness expectations
Failure domains
Version-aware updates

Rather than governance being a policy document, it becomes embedded in the system’s execution model.

This reduces reliance on institutional memory and tribal knowledge.

Is asset-based orchestration only relevant at large scale?

No. It becomes more visible at scale, but its benefits appear earlier:

Faster iteration cycles
Safer refactoring
More predictable deployments
Lower operational overhead
Clearer system reasoning

For mission-critical domains, reliability requirements often emerge before traffic scale does.

How does this relate to data observability and reliability engineering?

Asset-based orchestration complements:

Data observability (freshness, volume, schema monitoring)
Data reliability engineering practices
SLA/SLO enforcement
Incident response workflows

Because dependencies are explicit, blast radius and impact analysis become tractable. Observability signals can be tied directly to defined data obligations.

What role does automation play in maintaining trust?

Automation enables scale, but trust requires:

Transparency
Traceability
Deterministic recomputation
Controlled failure propagation

Well-structured orchestration ensures that automation is explainable, not opaque. Errors do not silently cascade across the system.

When should a team consider moving from pipelines to asset-oriented architecture?

Signals include:

Increasing reprocessing cost
Growing dependency complexity
Difficulty explaining data changes
Manual intervention becoming routine
Rising stakeholder sensitivity to correctness

If correctness is non-negotiable, state-aware orchestration becomes a strategic investment rather than an optimization.

About the authors / context

This article is based on our direct experience building production-grade data infrastructure for https://connect211.com, a modern search platform supporting 211 organizations across multiple U.S. states. The insights presented reflect real architectural decisions made while scaling a social impact system operating under strict reliability and data-quality constraints.

AWS Amplify vs Vercel: Complete Pricing Comparison for Next.js Applications

Michał Miler — Thu, 12 Mar 2026 15:55:59 +0000

When choosing a hosting platform for your Next.js application, understanding the true cost is crucial for making an informed decision. Both Vercel and AWS Amplify offer powerful deployment solutions, but their pricing models differ significantly. This comprehensive comparison analyzes the actual costs based on official pricing from both platforms, helping you determine which solution fits your budget and requirements.

The Pricing Challenge

Both platforms have strengths, but understanding their pricing structures is essential:

Vercel's Challenge: The free Hobby tier is strictly for personal, non-commercial use. Any business use requires the Pro plan at $20/month per team member, creating immediate baseline costs regardless of traffic.

AWS Amplify's Challenge: While there's no per-seat cost, the pricing structure involves multiple components (builds, bandwidth, storage, compute) that can be complex to predict without understanding your usage patterns.

Key Question: Which platform offers better value for your specific use case?

Vercel Pricing Structure

Hobby Plan (Free Forever)

Cost: $0/month

Included Resources:

100 GB Fast Data Transfer
1M Edge Requests
1M Function Invocations
4 hours Active CPU (Vercel Functions)
360 GB-hrs Provisioned Memory
1M ISR Reads, 200K ISR Writes
5K Image Transformations
Unlimited deployments

Restrictions: Personal and non-commercial use only

Pro Plan ($20/month per seat)

Base Cost: $20/month per team member

Included Resources:

$20 usage credit (can offset overages)
100 GB Fast Data Transfer
1M Edge Requests
1M Function Invocations
4 hours Active CPU (Vercel Functions)
360 GB-hrs Provisioned Memory
10 GB Fast Origin Transfer
1M ISR Reads, 200K ISR Writes
5K Image Transformations
1 GB Blob Storage
Team collaboration + free viewer seats
Advanced spend management

Overage Pricing: Usage beyond included amounts is charged separately (pricing varies by resource type)

Key Consideration: Team costs scale with size. A 5-person team = $100/month base cost before any usage overages.

AWS Amplify Pricing Structure

Free Tier (First 12 Months)

Cost: $0/month (for eligible usage)

Included Resources:

1,000 build minutes
15 GB data served
5 GB storage
500,000 requests
100 GB-hours SSR compute

Note: New AWS customers starting July 15, 2025 receive up to $200 in AWS Free Tier credits usable within 12 months

Pay-As-You-Go Pricing

Build & Deploy:

Standard builds: $0.01/minute
Enhanced builds: $0.025/minute
Turbo builds: $0.10/minute

Hosting:

Data transfer: $0.15/GB served
Storage: $0.023/GB/month
Requests: $0.30 per million

SSR Compute (if applicable):

$0.20 per GB-hour

Firewall (optional):

$15/month per Amplify app + WAF costs

Key Advantages:

No per-seat pricing (unlimited team members)
No base monthly fee (pay only for usage)
Multiple sites per project included
Public SSL certificates included

Cost Comparison Methodology

To provide accurate comparisons, we'll analyze costs across different scenarios based on:

Team Size: Solo developer vs. team of 5
Traffic Levels: Small (10K visitors/month), Medium (100K visitors/month), Large (500K visitors/month)
Deployment Frequency: 1 build/week (typical for production)
Average Page Size: 2MB (including assets)
Build Time: 5 minutes per build (standard Next.js app)

Comparison 1: Solo Developer with Small Traffic

Scenario: 1 developer, 10,000 visitors/month, 20 GB data transfer/month, 4 builds/month

Vercel Hobby (Free)

Base cost: $0
Data transfer: 20 GB (within 100 GB free)
Edge requests: ~30K (within 1M free)
Total: $0/month
Limitation: Cannot be used for commercial projects

Vercel Pro

Base cost: $20/month (1 seat)
Data transfer: 20 GB (within 100 GB included)
All other usage within included limits
Total: $20/month

AWS Amplify

Builds: 4 builds × 5 min × $0.01 = $0.20
Data transfer: 20 GB × $0.15 = $3.00
Storage: 1 GB × $0.023 = $0.02
Requests: 30K × $0.30/million = $0.01
Subtotal: $3.23/month
With Free Tier (first 12 months): $0/month (all usage covered)
After Free Tier: $3.23/month

Winner: AWS Amplify (96% savings) - After free tier: $3.23 vs $20

Comparison 2: Solo Developer with Medium Traffic

Scenario: 1 developer, 100,000 visitors/month, 200 GB data transfer/month, 4 builds/month

Vercel Pro

Base cost: $20/month (1 seat)
Data transfer: 200 GB
- Included: 100 GB
- Overage: 100 GB (~$15 estimated based on typical CDN costs)
Usage credit: $20 applies to overages
Total: ~$20-35/month (depending on exact overage rates)

AWS Amplify

Builds: 4 builds × 5 min × $0.01 = $0.20
Data transfer: 200 GB × $0.15 = $30.00
Storage: 5 GB × $0.023 = $0.12
Requests: 300K × $0.30/million = $0.09
Subtotal: $30.41/month
With Free Tier (first 12 months): ~$28/month (15 GB data + other limits covered)
After Free Tier: $30.41/month

Winner: Comparable (Vercel $20-35 vs Amplify $28-30)

Comparison 3: Team of 5 with Medium Traffic

Scenario: 5 developers, 100,000 visitors/month, 200 GB data transfer/month, 20 builds/month

Vercel Pro

Base cost: $20/month × 5 seats = $100/month
Data transfer: 200 GB
- Included: 100 GB
- Overage: ~100 GB (~$15)
Usage credit: $20 applies to overages
Total: ~$100-115/month

AWS Amplify

Builds: 20 builds × 5 min × $0.01 = $1.00
Data transfer: 200 GB × $0.15 = $30.00
Storage: 5 GB × $0.023 = $0.12
Requests: 300K × $0.30/million = $0.09
Subtotal: $31.21/month
With Free Tier (first 12 months): ~$29/month
After Free Tier: $31.21/month

Winner: AWS Amplify (73% savings) - $31 vs $100-115

Comparison 4: Team of 5 with Large Traffic

Scenario: 5 developers, 500,000 visitors/month, 1 TB (1,000 GB) data transfer/month, 20 builds/month

Vercel Pro

Base cost: $20/month × 5 seats = $100/month
Data transfer: 1,000 GB
- Included: 100 GB
- Overage: 900 GB (~$135 estimated)
Usage credit: $20 applies to overages
Total: ~$215/month

AWS Amplify

Builds: 20 builds × 5 min × $0.01 = $1.00
Data transfer: 1,000 GB × $0.15 = $150.00
Storage: 10 GB × $0.023 = $0.23
Requests: 1.5M × $0.30/million = $0.45
Subtotal: $151.68/month
With Free Tier (first 12 months): ~$149/month
After Free Tier: $151.68/month

Winner: AWS Amplify (29% savings) - $152 vs $215

Quick Comparison Table

Scenario	Vercel Pro	AWS Amplify	Savings	Notes
Solo, 10K visitors	$20	$3 ($0 first year)	85-100%	Can't use Vercel Hobby commercially
Solo, 100K visitors	$20-35	$28-30	0-20%	Comparable costs
Team of 5, 100K visitors	$100-115	$29-31	73%	Per-seat cost dominates
Team of 5, 500K visitors	$215	$149-152	29%	Bandwidth becomes factor

Additional Cost Factors

SSL Certificates

Vercel: Included (free, automatic)
Amplify: Included (free, automatic via AWS Certificate Manager)

Custom Domains

Vercel: Unlimited (free)
Amplify: Unlimited (free)

Preview Deployments

Vercel: Unlimited (included in all plans)
Amplify: Free (included)

Build Concurrency

Vercel Pro: No queues, faster builds
Amplify: First-come-first-served, can be slower during high demand

Analytics & Monitoring

Vercel: 10K Speed Insights events, 50K Web Analytics events (included)
Amplify: Use CloudWatch (pay-per-use, typically $2-10/month)

Image Optimization

Vercel: 5K transformations/month included
Amplify: Not included (use CloudFront + Lambda@Edge or third-party CDN)

Key Pricing Insights

When AWS Amplify Wins on Cost

Teams with 3+ developers: The per-seat pricing on Vercel ($20/month per developer) quickly adds up. Amplify has zero per-seat costs.
Moderate to high traffic: Once you exceed Vercel's included limits, bandwidth costs become significant. Amplify's $0.15/GB is competitive but avoid the per-seat baseline.
Multiple projects: No per-seat cost means you can host multiple projects without multiplying team costs.
Predictable traffic patterns: You can accurately forecast Amplify costs based on bandwidth and build frequency.

When Vercel Wins on Cost

Solo developers with low traffic: If you're within Hobby plan limits and it's a personal project, Vercel is free.
1-2 person teams with moderate traffic: The $20-40/month base cost might be comparable to Amplify once you factor in bandwidth.
Included features matter: If you heavily use Image Optimization, Speed Insights, or Web Analytics, the included quotas add value.

The Hidden Cost Factor: Developer Time

Vercel's Value: Zero-config deployments, instant previews, seamless Next.js integration can save hours of setup and troubleshooting. Critically, Vercel offers exclusive Next.js optimizations including:

Edge Functions with ultra-low latency execution at the edge
Incremental Static Regeneration (ISR) with better caching and invalidation
Next.js-specific features like Image Optimization, Font Optimization, and automatic code splitting
Edge Middleware for advanced routing and personalization
Native support for latest Next.js features (often available before other platforms)

Amplify's Trade-off: More configuration required, less polished DX, potential troubleshooting time. Amplify doesn't support some advanced Next.js features like Edge Runtime, Edge Middleware, or Vercel's optimized ISR implementation. However, costs are dramatically lower for teams.

ROI Calculation: If Amplify saves your 5-person team $85/month ($1,020/year), but requires 5 extra developer hours/year at $100/hour, Vercel might win on total cost. However, if setup is one-time and hosting cost savings compound monthly, Amplify wins. Consider whether your app needs Vercel's exclusive Next.js optimizations - if you rely heavily on Edge Functions, ISR, or cutting-edge Next.js features, Vercel may be the only viable option regardless of cost.

When to Choose Vercel

Choose Vercel if:

You need advanced Next.js features (Edge Functions, Edge Middleware, optimized ISR, cutting-edge Next.js support)
Performance at the edge is critical (ultra-low latency with edge compute)
You're a solo developer with a personal project (Hobby plan is free)
Team is 1-2 people and traffic is low-moderate (costs are comparable)
Developer experience is paramount (zero-config, instant previews, seamless integration)
You heavily use included features (Image Optimization, Analytics, Speed Insights)
Time-to-market is critical (faster setup, less configuration)
Budget accommodates premium pricing for better DX and exclusive features

When to Choose AWS Amplify

Choose AWS Amplify if:

Your app doesn't require Edge Functions or advanced Next.js features (standard SSR/SSG is sufficient)
Team has 3+ developers (saves $40-80+/month in per-seat fees)
Cost optimization is a priority (especially for small-medium businesses)
Traffic is moderate to high (bandwidth costs dominate, no seat fees)
You're already using AWS services (S3, Lambda, DynamoDB, etc.)
You want infrastructure control (access to underlying S3/CloudFront config)
Multiple projects/sites (no per-seat multiplication)

Pricing Summary by Scenario

Solo Developer

Low traffic (personal project): Vercel Hobby (free) wins
Low traffic (commercial): Amplify wins (85% savings: $3 vs $20)
Medium traffic: Comparable ($28-35 both platforms)
High traffic: Amplify wins slightly (lower baseline)

Small Team (2-3 developers)

Any traffic level: Amplify wins (30-70% savings)
Per-seat fees become significant: $40-60/month baseline vs $0

Medium-Large Team (5+ developers)

Any traffic level: Amplify wins significantly (70%+ savings)
Example: 5 devs, 100K visitors = $100+ (Vercel) vs $30 (Amplify)

Enterprise Scale

Both become expensive: Consider custom solutions (ECS, CloudFront + S3 direct)
Amplify still wins on per-seat costs: But contact both for enterprise pricing

The Real Cost Comparison

Factor	Vercel Pro	AWS Amplify	Winner
Base monthly cost (1 dev)	$20	$0	Amplify
Per additional developer	+$20	$0	Amplify
Bandwidth (per GB)	~$0.15*	$0.15	Tie
Build time	Free	$0.01/min	Vercel
Setup complexity	Very easy	Moderate	Vercel
Free tier generosity	Good (Hobby)	Excellent (12 months)	Amplify
Edge Functions	Included	Not available	Vercel
Edge Middleware	Included	Not available	Vercel
Image optimization	Included	Extra cost	Vercel
Analytics	Included	Extra cost	Vercel
Infrastructure control	Limited	Full	Amplify

*Estimated based on typical overage patterns

Bottom Line Recommendations

Choose Vercel if: You need Edge Functions, Edge Middleware, or cutting-edge Next.js features; value developer experience over cost; have a small team (1-2 people); or need zero-config deployment.

Choose Amplify if: Your app works with standard SSR/SSG (no edge compute required), you have a team of 3+ developers, cost optimization matters, or you want to avoid per-seat pricing that scales with team size.

For most businesses with development teams: Amplify offers 30-75% cost savings, primarily by eliminating per-seat fees. The trade-off is more configuration, slightly less polished developer experience, and lack of support for advanced Next.js features like Edge Functions and Edge Middleware. Evaluate your technical requirements first - if you need Vercel's exclusive optimizations, cost savings may be secondary.

Next Steps

Ready to try AWS Amplify and see the cost savings yourself? Check out our step-by-step deployment guide for Next.js on AWS Amplify.

Note: Pricing information accurate as of January 2026. Always verify current pricing on Vercel's pricing page and AWS Amplify's pricing page before making decisions.

Next.js 16 Partial Prerendering (PPR): The Best of Static and Dynamic Rendering

uninterrupted — Thu, 05 Mar 2026 08:21:46 +0000

Next.js has long been a leader in giving developers flexible, high-performance rendering strategies - Static Site Generation (SSG), Server-Side Rendering (SSR), Incremental Static Regeneration (ISR), and Client-Side Rendering (CSR) all play roles depending on your use case. With Next.js 14, Partial Prerendering (PPR) was introduced as an experimental feature, starting from Next.js 16, PPR has become a stable and becomes a foundational part of this landscape by letting you blend static pre-rendering and dynamic behavior in the same route, improving perceived performance and SEO without sacrificing real-time data needs.

What Is Partial Prerendering (PPR)?

Partial Prerendering is a rendering strategy that lets Next.js deliver a static HTML “shell” immediately, then stream in dynamic content as it becomes available - all in the same server response. Instead of choosing static or dynamic at the page level, PPR combines both on a per-component basis.

The static shell (layout, logo, product titles, navigation) is pre-generated at build time or cached ahead of requests.
Dynamic parts (cart state, recommendations, session data) load later and are streamed to the client using React’s <Suspense> boundaries.
Users see meaningful UI immediately, with dynamic content filling in seamlessly.

This leads to faster Time to First Byte (TTFB) and a smoother perceived experience, while still allowing data personalization and real-time updates.

How Partial Prerendering Works in Next.js 16

Cache Components: The Heart of PPR

In Next.js 16, PPR isn’t just an experimental flag - it’s integrated into the Cache Components system. Cache Components lets you opt-in to cache certain components or data instead of rendering them on every request, making pre-rendering more predictable and performant.

cacheComponents: true in next.config.ts enables PPR and component-level caching.
The new "use cache" directive lets you mark data-fetching functions or components as cacheable.
Anything not cacheable or depending on request-specific data can be wrapped in a <Suspense> fallback.

This approach gives you fine-grained control: static UI is reused across requests, while dynamic parts rehydrate or stream in only when needed.

Streaming and Suspense: The Engine Under the Hood

React’s <Suspense> plays a key role in PPR:

Wrap dynamic components in <Suspense> with a fallback UI (like a skeleton loader).
During rendering, Suspense tells Next.js where to halt pre-rendering and stream dynamic content later.
The server sends the pre-rendered shell immediately and then streams dynamic sections in parallel as they’re ready.

This strategy avoids blocks in page delivery and reduces the “white screen” effect often seen in traditional SSR or CSR approaches.

Why PPR Matters for E-commerce

For e-commerce platforms, performance directly impacts sales and SEO - factors also emphasized in the U11D articles about rendering strategies. A slow page can hurt conversions and search rankings. Partial Prerendering improves this by:

Faster initial loads: Users see the page skeleton instantly, improving engagement and performance metrics.
SEO advantages: Static content is available for crawler indexing immediately.
Dynamic personalization: Cart contents, recommendations, user-specific prices or availability can appear without blocking the initial render.
More flexible than SSG/SSR alone: You’re not limited to fully static or completely dynamic pages - the best of both lives together.

In e-commerce apps where product detail pages might be static for most users but show personalized recommendations or inventory status, PPR is a compelling hybrid.

How to Enable and Use PPR in Next.js 16

Enable Cache Components:

In your next.config.ts:

import type { NextConfig } from 'next';

const nextConfig: NextConfig = {
  cacheComponents: true, // enables Partial Prerendering
};

export default nextConfig;

Wrap dynamic UI:

Use React’s Suspense for dynamic content:

import { Suspense } from 'react';

export default function ProductPage() {
  return (
    <>
      <Header />
      <Suspense fallback={<div>Loading product...</div>}>
        <ProductDetails />
      </Suspense>
    </>
  );
}

Use use cache for predictable dynamic data:

Inside dynamic data functions, prefix with "use cache" to mark them cacheable:
```
'use cache';
const product = await getProduct(id);
```
This tells Next.js that caching is safe for that data.

How PPR Fits with Other Rendering Strategies

Partial Prerendering doesn’t replace SSR, SSG, ISR, or CSR - but coordinates with them. While U11D’s articles give a great overview of traditional strategies like SSG, SSR, and ISR, PPR adds a hybrid strategy that sits between:

SSG/ISR: Good for full static pages or pages that regenerate occasionally.
SSR: Ideal for real-time personalized content.
CSR: Great for highly interactive client-heavy UI.
PPR (Next.js 16): Combines static cached shells with streaming dynamic content.

Think of PPR as a component-level SSG + dynamic hybrid - fast to load like SSG, flexible like SSR.

Wrapping Up

Next.js 16 elevates Partial Prerendering from an experimental concept to a practical, integrated performance strategy through Cache Components and React Suspense. It’s especially powerful for complex, dynamic sites like e-commerce stores, where you want the benefits of static pre-rendering and dynamic personalization without sacrificing UX or SEO.

By serving instantly usable HTML and streaming dynamic parts in parallel, PPR bridges the traditional divide between static and dynamic rendering - helping your app feel faster while staying robust and scalable.

If you’re building with Next.js 16, definitely explore PPR alongside other strategies like SSR, ISR, and CSR to find the most performance-optimized combination for your routes.

Secure RDS Access Without Bastion Hosts: Using ECS Containers and SSM

Bartek Gałęzowski — Wed, 25 Feb 2026 08:00:00 +0000

The Problem

Accessing RDS databases in production environments presents a common security challenge. Direct internet access to databases is a significant security risk, so most organizations place their RDS instances in private subnets without public endpoints. But how do you access these databases for debugging, data analysis, or administrative tasks?

Traditional approaches include:

Bastion hosts: Requires maintaining and securing additional EC2 instances
VPN connections: Complex setup and ongoing maintenance overhead
SSH tunneling: Still requires a jump server with SSH access

There's a more elegant solution: leveraging your existing ECS containers as secure tunnels to your RDS databases using AWS Systems Manager (SSM) Session Manager.

The Solution

This script creates a secure tunnel from your local machine to an RDS database through a running ECS container, using SSM Session Manager for the connection. No SSH keys, no bastion hosts, no exposed ports—just IAM-based authentication and encrypted connections.

Prerequisites

Before using this script, ensure you have:

AWS CLI installed and configured with appropriate credentials
Session Manager Plugin for AWS CLI
ECS Task Role with permissions to use SSM:

 {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Action": [
           "ssmmessages:CreateControlChannel",
           "ssmmessages:CreateDataChannel",
           "ssmmessages:OpenControlChannel",
           "ssmmessages:OpenDataChannel"
         ],
         "Resource": "*"
       }
     ]

ECS Exec enabled on your service (can be enabled with aws ecs update-service --cluster <cluster> --service <service> --enable-execute-command --force-new-deployment)

How It Works

The script performs the following operations:

1. Task Discovery

TASK_ID=$(aws ecs list-tasks \
  --cluster "$CLUSTER" \
  --service-name "$SERVICE_NAME" \
  --desired-status RUNNING \
  --query 'taskArns[0]')

Finds the first running task in the specified ECS service.

2. Container Runtime ID Retrieval

CONTAINER_RUNTIME_ID=$(aws ecs describe-tasks \
  --cluster "$CLUSTER" \
  --tasks "$TASK_ID" \
  --query 'tasks[0].containers[?name==`'"$SERVICE_NAME"'`].runtimeId | [0]')

3. Port Forwarding Session

aws ssm start-session \
  --target "ecs:${CLUSTER}_${TASK_ID}_${CONTAINER_RUNTIME_ID}" \
  --document-name AWS-StartPortForwardingSessionToRemoteHost \
  --parameters '{"host":["$DB_HOST"],"portNumber":["5432"], "localPortNumber":["$LOCAL_PORT"]}'

Establishes a secure tunnel through the ECS container to the database.

Usage

#!/usr/bin/env bash

set -eu

usage() {
  echo "Usage: $0 <CLUSTER> <SERVICE_NAME> <DB_HOST> <LOCAL_PORT> <REGION>"
  echo "  cluster: ECS cluster name"
  echo "  service: ECS service name"
  echo "  db_host: Database host"
  echo "  local_port: Local port to forward"
  echo "  region:  AWS region (e.g., us-east-2)"
  exit 1
}

if [ $# -ne 5 ]; then
  echo "Error: Expected 5 arguments, got $#" >&2
  usage
fi

CLUSTER="$1"
SERVICE_NAME="$2"
DB_HOST="$3"
LOCAL_PORT="$4"
REGION="$5"

get_first_task_id() {
  local FIRST_TASK_ID
  FIRST_TASK_ID=$(aws ecs list-tasks \
    --cluster "$CLUSTER" \
    --service-name "$SERVICE_NAME" \
    --desired-status RUNNING \
    --output text \
    --query 'taskArns[0]' \
    --region "$REGION" | sed 's/.*\///')

  if [ "$FIRST_TASK_ID" = "None" ] || [ -z "$FIRST_TASK_ID" ]; then
    echo "Error: No running tasks found for service '$SERVICE_NAME' in cluster '$CLUSTER'" >&2
    return 1
  fi

  echo "$FIRST_TASK_ID"
}

TASK_ID=$(get_first_task_id)

CONTAINER_RUNTIME_ID=$(aws ecs describe-tasks \
  --output text \
  --cluster "$CLUSTER" \
  --tasks "$TASK_ID" \
  --region "$REGION" \
  --query 'tasks[0].containers[?name==`'"$SERVICE_NAME"'`].runtimeId | [0]')

TARGET="ecs:${CLUSTER}_${TASK_ID}_${CONTAINER_RUNTIME_ID}"

aws ssm start-session \
  --target "$TARGET" \
  --document-name AWS-StartPortForwardingSessionToRemoteHost \
  --parameters "{\"host\":[\"$DB_HOST\"],\"portNumber\":[\"5432\"], \"localPortNumber\":[\"$LOCAL_PORT\"]}" \
  --region "$REGION"

Save the script as ecs-db-tunnel.sh and make it executable:

chmod +x ecs-db-tunnel.sh

Run the script with the required parameters:

./ecs-db-tunnel.sh <CLUSTER> <SERVICE_NAME> <DB_HOST> <LOCAL_PORT> <REGION>

Example:

./ecs-db-tunnel.sh \
  production-cluster \
  api-service \
  mydb.c9akciq32.us-east-2.rds.amazonaws.com \
  5432 \
  us-east-2

Once connected, you can access the database from your local machine:

psql -h localhost -p 5432 -U dbuser -d mydb

or using other tools such as DBeaver.

Key Benefits

1. No Infrastructure Overhead

No need to maintain bastion hosts or VPN servers. You leverage existing ECS containers.

2. Enhanced Security

No SSH keys to manage or rotate
No exposed ports or public endpoints
IAM-based authentication and authorization
All traffic encrypted through SSM
Audit trail through CloudTrail

3. Zero Configuration

If your ECS service is already running with ECS Exec enabled, you're ready to go.

4. Cost Effective

No additional EC2 instances to run. Session Manager has no additional cost.

5. Temporary Access

Connection exists only while the script is running—perfect for adhoc administrative tasks.

Security Considerations

Network Security

This approach works because:

The ECS container is in the same VPC as the RDS instance
The container's security group allows outbound connections to the RDS security group
The RDS security group permits inbound connections from the ECS container's security group

Access Control

Control who can establish tunnels using IAM policies:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["ecs:ListTasks", "ecs:DescribeTasks", "ssm:StartSession"],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:RequestedRegion": "us-east-2"
        }
      }
    }
  ]
}

Troubleshooting

No running tasks found

Ensure the ECS service has at least one running task:

aws ecs list-tasks --cluster <cluster> --service-name <service> --desired-status RUNNING

Session Manager plugin not found

Install the Session Manager plugin following AWS documentation.

Connection refused

Verify:

RDS security group allows inbound from ECS container security group
Database endpoint is correct
ECS task has network connectivity to RDS

ECS Exec not enabled

Enable it on your service:

aws ecs update-service \
  --cluster <cluster> \
  --service <service> \
  --enable-execute-command \
  --force-new-deployment

Conclusion

Using ECS containers as secure tunnels to RDS databases is an elegant solution that leverages existing infrastructure while maintaining strong security posture. This approach eliminates the need for bastion hosts, reduces attack surface, and provides auditable, temporary access to private databases.

The script demonstrates that sometimes the best security solutions are those that work with your existing architecture rather than adding more complexity on top of it.

How to Deploy Next.js 16 SSG to AWS Amplify: Complete Guide

Michał Miler — Tue, 24 Feb 2026 07:02:03 +0000

Deploying Next.js Static Site Generation (SSG) applications to AWS Amplify can be challenging without the right configuration. While Vercel offers a seamless experience, AWS Amplify may provide a cost-effective alternative with up to 96% savings for smaller sites and 20-30% for medium-traffic applications. In this comprehensive guide, we'll walk through the exact steps to successfully deploy your Next.js 16 SSG site to AWS Amplify.

Why AWS Amplify for SSG?

AWS Amplify is particularly attractive for SSG deployments because:

Lower costs than Vercel, especially for commercial use (see our detailed cost comparison)
No per-seat pricing for team collaboration
Generous free tier for the first 12 months
Standard AWS infrastructure (S3 + CloudFront) with no vendor lock-in

Note on static export limitations: Static Site Generation pre-generates all pages at build time. This means no server-side rendering (SSR), no Incremental Static Regeneration (ISR), and no API routes. If you need these features, you'll need a different deployment strategy.

Prerequisites

Before we begin, ensure you have:

Node.js 22.x or later installed
Next.js 16 project using App Router
An AWS account (free tier eligible for 12 months)
Git repository (GitHub, GitLab, Bitbucket, or AWS CodeCommit)
Basic familiarity with Next.js App Router

Step 1: Configure Next.js 16 for SSG

Update `next.config.ts`

Create or modify your next.config.ts file to enable static export:

import type { NextConfig } from "next";

const nextConfig: NextConfig = {
  output: "export",
  images: {
    unoptimized: true,
  },
};

export default nextConfig;

Key Points:

output: 'export' disables the Next.js runtime and enforces a full static export. This ensures your site can be deployed as plain static HTML/CSS/JS.
images.unoptimized: true disables Next.js Image Optimization API (required for static export)

Handle Image Optimization

Since Next.js Image Optimization doesn't work with static exports, choose one of these options:

Option 1: Unoptimized images (simplest for testing)

import Image from "next/image";

<Image src="/images/photo.jpg" width={500} height={300} alt="Description" />;

Option 2: Standard HTML img tags

<img src="/images/photo.jpg" width={500} height={300} alt="Description" />

Option 3: Third-party CDN (recommended for production)

Use services like Cloudflare Images, Cloudinary, or imgix. See our guide on optimizing S3 images with Cloudflare Images.

const imageLoader = ({
  src,
  width,
  quality,
}: {
  src: string;
  width: number;
  quality?: number;
}) => {
  return `https://your-cdn.com/${src}?w=${width}&q=${quality || 75}`;
};

<Image
  loader={imageLoader}
  src="photo.jpg"
  width={500}
  height={300}
  alt="Description"
/>;

Verify SSG Compatibility

Ensure your project doesn't use these incompatible features:

Cannot use:

Route Handlers (API Routes) - require server runtime
Server-Side Rendering (SSR) - e.g., cache: 'no-store' in fetches
Incremental Static Regeneration (ISR) - revalidate property
Dynamic data without generateStaticParams — including any data fetched from “dynamic APIs” that depend on runtime values (see Next.js guide on dynamic rendering)

Must implement:

All dynamic routes need generateStaticParams to pre-generate pages
Use cache: 'force-cache' for data fetching at build time

Example for dynamic routes:

// app/blog/[slug]/page.tsx
export async function generateStaticParams() {
  return [{ slug: "first-post" }, { slug: "second-post" }];
}

export default async function BlogPost({
  params,
}: {
  params: Promise<{ slug: string }>;
}) {
  const { slug } = await params;
  return <h1>Blog Post: {slug}</h1>;
}

Test Locally

Verify your static export works:

npm run build
npx serve out

Visit http://localhost:3000 to ensure everything renders correctly.

Step 2: Deploy to AWS Amplify

Push to Git

git add .
git commit -m "Configure Next.js for SSG deployment"
git push origin main

Access AWS Amplify Console

Log into AWS Console
Search for "Amplify" in the services search bar
Click "Get Started" under "Amplify Hosting"

Connect Your Repository

Select your Git provider (GitHub, GitLab, Bitbucket, or AWS CodeCommit)
Authorize AWS Amplify to access your repositories
Choose the repository and branch to deploy
Click "Next"

Configure Build Settings

Amplify should auto-detect Next.js, but verify your amplify.yml:

version: 1
frontend:
  phases:
    preBuild:
      commands:
        - npm ci
    build:
      commands:
        - npm run build
  artifacts:
    baseDirectory: out
    files:
      - "**/*"
  cache:
    paths:
      - node_modules/**/*

Critical settings:

baseDirectory: out - Next.js static export directory
npm ci - Consistent, reproducible builds
Cache node_modules for faster builds

Add Environment Variables (Optional)

If your app needs environment variables, add them in "Advanced settings":

NEXT_PUBLIC_API_URL=https://api.example.com
NEXT_PUBLIC_ANALYTICS_ID=G-XXXXXXXXXX

Remember: Only NEXT_PUBLIC_* variables are accessible in the browser. Server-side variables are available during build time only.

Deploy

Review settings
Click "Save and deploy"
Monitor the build progress (typically 3-5 minutes)

Once complete, you'll get a URL like: https://[branch].[app-id].amplifyapp.com

Step 3: Configure Redirects & Routing

Fix 404 Pages

Configure redirects in Amplify console under "Rewrites and redirects":

Rule 1: Custom 404 page

Source: </^[^.]+$|\\.(?!(css|gif|ico|jpg|js|png|txt|svg|woff|woff2|ttf|map|json|webp)$)([^.]+$)/>
Target: /404.html
Status: 404

Rule 2: Client-side routing fallback

Source: /<*>
Target: /index.html
Status: 200 (Rewrite)

Important: Place Rule 2 last - Amplify processes rules in order.

Step 4: Handle Common Considerations

Environment Variables

Client Components (with "use client"):

Can only access NEXT_PUBLIC_* variables
Variables are embedded in the build output

Server Components (default):

Can access all environment variables at build time
Never expose secrets in NEXT_PUBLIC_* variables

Forms and Interactive Features

SSG sites need external services for dynamic functionality:

Forms: Formspree, Web3Forms, AWS Lambda
Search: FlexSearch, Fuse.js, Algolia, Meilisearch
Comments: Giscus, Utterances, Disqus
Authentication: Clerk, Auth0, AWS Cognito

Next.js 16 App Router Features

Works perfectly with SSG:

Metadata API
Route Groups
Loading and Error UI
Server Components (data fetched at build time)

Requires pre-generation:

Dynamic routes (use generateStaticParams)
Parallel and Intercepting Routes (all combinations must be pre-generated)

Troubleshooting Common Issues

Build Fails: "Output directory not found"

Fix: Verify amplify.yml has baseDirectory: out and next.config.ts has output: 'export'

Images Not Loading

Fix: Ensure images.unoptimized: true in config. Images should be in public/ directory and referenced without the public prefix.

Blank Page After Deployment

Fix: Check browser console for errors. Usually means:

Missing NEXT_PUBLIC_ prefix on environment variables
Client-side JavaScript error
Incorrect build output

Dynamic Routes Return 404

Fix:

Implement generateStaticParams for all dynamic routes
Configure redirect rules in Amplify console

Environment Variables Not Working

Fix:

Client-side variables must use NEXT_PUBLIC_ prefix
Set in Amplify Console under "Environment variables"
Rebuild after adding new variables

When NOT to Use This Approach

Consider alternatives if you need:

Incremental Static Regeneration (ISR): Requires Node.js server
Server-Side Rendering (SSR): Real-time, per-request rendering
API Routes: Need serverless functions
On-demand Revalidation: Dynamic content updates without rebuilds
Edge Middleware: Complex routing logic at the edge

For these use cases, consider Vercel, AWS Amplify with SSR configuration, or AWS App Runner/ECS.

Next Steps

After deployment:

Add custom domain: In Amplify console under "Domain management"
Set up monitoring: Use CloudWatch for build and deployment alerts
Configure preview deployments: For pull requests and feature branches
Optimize costs: Review bandwidth usage and optimize images
Test thoroughly: Verify all pages and features work as expected

Conclusion

AWS Amplify provides a cost-effective platform for deploying Next.js 16 SSG applications with proper configuration. The key is understanding SSG limitations, configuring Next.js correctly, and setting up Amplify's build and routing properly.

While the setup requires more configuration than Vercel, the cost savings (up to 96% for small sites) and lack of per-seat pricing make it an excellent choice for businesses looking to optimize their hosting costs without sacrificing performance or reliability.