<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Polliog</title>
    <description>The latest articles on Forem by Polliog (@polliog).</description>
    <link>https://forem.com/polliog</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3636724%2F651dcccf-8a2e-4be9-99d6-4cd231d9e889.jpeg</url>
      <title>Forem: Polliog</title>
      <link>https://forem.com/polliog</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/polliog"/>
    <language>en</language>
    <item>
      <title>Logtide 0.9.0: Custom Dashboards, Health Monitoring, and Log Parsing Pipelines</title>
      <dc:creator>Polliog</dc:creator>
      <pubDate>Sat, 11 Apr 2026 20:35:57 +0000</pubDate>
      <link>https://forem.com/polliog/logtide-090-custom-dashboards-health-monitoring-and-log-parsing-pipelines-3a8k</link>
      <guid>https://forem.com/polliog/logtide-090-custom-dashboards-health-monitoring-and-log-parsing-pipelines-3a8k</guid>
      <description>&lt;p&gt;Logtide 0.9.0 is out today. At the end of the 0.8.0 article we listed three things we wanted to tackle next: a customizable dashboard system to replace the fixed layout that had shipped since day one, proactive health monitoring so Logtide could tell you when something was down rather than waiting for a log to show up, and structured parsing pipelines for teams whose logs don't arrive pre-formatted. All three ship in this release.&lt;/p&gt;

&lt;p&gt;If you're new here: Logtide is an open-source log management and SIEM platform built for European SMBs. Privacy-first, self-hostable, GDPR-compliant. No Elastic cluster to babysit: just Docker Compose and the storage engine of your choice.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🌐 &lt;strong&gt;Cloud&lt;/strong&gt;: &lt;a href="https://logtide.dev" rel="noopener noreferrer"&gt;logtide.dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;💻 &lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/logtide-dev/logtide" rel="noopener noreferrer"&gt;logtide-dev/logtide&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📖 &lt;strong&gt;Docs&lt;/strong&gt;: &lt;a href="https://logtide.dev/docs" rel="noopener noreferrer"&gt;logtide.dev/docs&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What's New
&lt;/h2&gt;

&lt;h3&gt;
  
  
  📊 Custom Dashboards: 10 Panel Types, Drag-to-Resize, and YAML Export
&lt;/h3&gt;

&lt;p&gt;The fixed dashboard that shipped in 0.1.0 had a good run. It was a reasonable starting point (4 stat cards, log volume, top services, top error messages), but it served everyone the same view regardless of what they actually cared about. 0.9.0 replaces it with a fully composable dashboard system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dashboards are org-scoped&lt;/strong&gt; with an optional &lt;code&gt;is_personal&lt;/code&gt; flag for views you don't want to share with the whole team. The Default dashboard is auto-created per organization and protected from deletion. A header dropdown lets you switch, create, clone, import, and export dashboards without leaving the page.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;10 panel types&lt;/strong&gt; cover every data source in Logtide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Time series&lt;/em&gt; and &lt;em&gt;single stat&lt;/em&gt; for general log-based metrics&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Top-N table&lt;/em&gt; for ranking services, endpoints, or users by any dimension&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Live log stream&lt;/em&gt; for a real-time tail of filtered log output&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Alert status&lt;/em&gt; for a current-state view of your active alert rules&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Metric chart&lt;/em&gt; and &lt;em&gt;metric stat&lt;/em&gt; for OTLP metrics with avg/sum/min/max/count/last/p50/p95/p99 aggregations&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Trace latency&lt;/em&gt; for p50/p95/p99 directly from span data&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Detection events&lt;/em&gt; for SIEM incidents grouped by severity&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Monitor status&lt;/em&gt; for uptime percentage and response time from the new monitoring system (more on that below)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Layout&lt;/strong&gt; is a responsive 12-column grid. Panels snap to grid units when resized; drag the bottom-right handle. The grid collapses to 6 columns on tablet and 1 column on mobile; stored widths are always in the 12-column reference and scale proportionally, so a panel that takes up half the desktop doesn't become a sliver on a small screen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inline edit mode&lt;/strong&gt; keeps all pending changes in a local snapshot. Toggle edit, rearrange, resize, and configure as many panels as you want. Hit Save for a single atomic write, or Cancel to discard everything. There's no separate edit page.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;YAML import/export&lt;/strong&gt; lets you version-control dashboards alongside your infrastructure code. Import regenerates panel IDs and uses &lt;code&gt;JSON_SCHEMA&lt;/code&gt; validation to block prototype pollution from crafted inputs. The schema is versioned (&lt;code&gt;schema_version: 1&lt;/code&gt;) and ships with a migration framework in &lt;code&gt;@logtide/shared&lt;/code&gt;: each version defines a &lt;code&gt;MigrationFn&lt;/code&gt;, and &lt;code&gt;migrateDashboard&lt;/code&gt; walks the chain on every read. Future schema changes will be applied automatically.&lt;/p&gt;
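&lt;p&gt;As a rough sketch, an exported dashboard could look something like this (&lt;code&gt;schema_version&lt;/code&gt; is from the release notes; the other field names and the panel-type spelling are illustrative assumptions, not the actual export format):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;schema_version: 1          # versioned; migrateDashboard upgrades older files on read
name: API Health
panels:
  - type: time_series      # panel type name is illustrative
    title: Error volume
    position: { x: 0, y: 0, w: 6, h: 4 }   # positions in the 12-column reference grid
    config:
      projectId: my-project                # verified against the requesting org on fetch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;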

&lt;p&gt;&lt;strong&gt;Panel data fetching&lt;/strong&gt; is batched: a single &lt;code&gt;POST /:id/panels/data&lt;/code&gt; round-trip fetches all panel data via &lt;code&gt;Promise.allSettled&lt;/code&gt;. An error in one panel doesn't fail the rest of the dashboard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-org isolation&lt;/strong&gt; is enforced at the data layer: every panel fetch verifies that &lt;code&gt;config.projectId&lt;/code&gt; belongs to the requesting org. A crafted YAML import pointing at another org's project ID will return empty data, not that org's data.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;panel registry architecture&lt;/strong&gt; is worth a mention for contributors. Adding a new panel type touches exactly 6 files: shared types, backend Zod schema, backend fetcher, frontend panel component, frontend config form, and a single registry entry. The renderer, container, store, and routes never change.&lt;/p&gt;

&lt;p&gt;Existing users will see no visual change on first login: the auto-created Default dashboard replicates the previous fixed layout exactly.&lt;/p&gt;




&lt;h3&gt;
  
  
  🖥️ Service Health Monitoring and Public Status Pages
&lt;/h3&gt;

&lt;p&gt;Logtide has always been reactive: something breaks, logs appear, you find out. 0.9.0 adds the proactive layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three monitor types&lt;/strong&gt; cover the common cases. HTTP/HTTPS monitors are fully configurable: method, expected status code, custom headers, and a body assertion that accepts either a contains check or a regex. TCP monitors ping a host:port pair. Heartbeat monitors flip the model: instead of Logtide reaching out, your service sends a &lt;code&gt;POST /api/v1/monitors/:id/heartbeat&lt;/code&gt; on a schedule, and Logtide fires an incident when the expected ping doesn't arrive within the grace window.&lt;/p&gt;
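&lt;p&gt;A heartbeat sender can be as small as a cron entry. The URL shape follows the endpoint above; the host, monitor ID, and five-minute schedule here are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Every 5 minutes, tell Logtide the service is alive
# (replace the host and monitor ID with your own)
*/5 * * * * curl -fsS -X POST https://logtide.example.com/api/v1/monitors/&amp;lt;monitor-id&amp;gt;/heartbeat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;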

&lt;p&gt;&lt;strong&gt;Worker execution&lt;/strong&gt; follows the same BullMQ pattern used throughout the codebase. A worker picks up all due monitors every 30 seconds and runs them in batches of 20 concurrent checks via &lt;code&gt;Promise.allSettled&lt;/code&gt;. Results flow into the &lt;code&gt;monitor_results&lt;/code&gt; hypertable with 7-day compression and 30-day retention. A &lt;code&gt;monitor_uptime_daily&lt;/code&gt; continuous aggregate refreshed hourly powers the uptime percentage displays without hitting raw data on every page load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Incident creation&lt;/strong&gt; is automatic and integrated with the existing SIEM layer. When consecutive failures cross the configurable threshold, an incident is created with &lt;code&gt;source: 'monitor'&lt;/code&gt; and linked via &lt;code&gt;monitor_id&lt;/code&gt;. Notifications go through the same email and webhook channels already configured for alert rules; no separate notification setup is needed. Auto-resolution fires when the next check succeeds. An atomic &lt;code&gt;WHERE incident_id IS NULL&lt;/code&gt; guard prevents duplicate incidents under concurrent check runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Severity is configurable per monitor&lt;/strong&gt; (&lt;code&gt;critical&lt;/code&gt;, &lt;code&gt;high&lt;/code&gt;, &lt;code&gt;medium&lt;/code&gt;, &lt;code&gt;low&lt;/code&gt;, &lt;code&gt;informational&lt;/code&gt;) rather than hardcoded. A flaky dev endpoint and a production payment service don't need to page with the same urgency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Public status pages&lt;/strong&gt; (&lt;code&gt;/status/:projectSlug&lt;/code&gt;) are Uptime Kuma-inspired: a 45-day heartbeat bar grid, per-monitor uptime badge, overall status banner, and a light/dark mode toggle. Visibility is configured per project: disabled by default, with public, password-protected, and org-members-only options.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scheduled maintenances&lt;/strong&gt; let you define windows with start and end times. Active maintenances suppress monitor incident creation so a planned deployment doesn't trigger pages, and display a maintenance banner on the status page so your users know what's happening.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Manual status incidents&lt;/strong&gt; are independent from SIEM incidents. You can publish communications with an Investigating/Identified/Monitoring/Resolved progression and a full update timeline, useful for communicating with users about an outage regardless of whether it was auto-detected.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;monitoring dashboard&lt;/strong&gt; (&lt;code&gt;/dashboard/monitoring&lt;/code&gt;) has a project selector, create/edit/delete forms with client-side validation, a detail page with an uptime chart and recent checks list, and a one-click heartbeat URL copy for the heartbeat monitor type.&lt;/p&gt;




&lt;h3&gt;
  
  
  🔩 Log Parsing and Enrichment Pipelines
&lt;/h3&gt;

&lt;p&gt;Structured logging is a best practice, but not every log source you connect will cooperate. Nginx access logs, syslog output from legacy systems, plain text from third-party services: these arrive as unstructured strings. Previously you'd parse them in your collector config or accept that they'd be stored as blobs. 0.9.0 gives you a better option.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pipelines run as BullMQ background jobs&lt;/strong&gt; after ingestion acknowledgment. Ingestion latency is unchanged: logs are accepted and queued immediately, and parsing happens asynchronously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Five built-in parsers&lt;/strong&gt; cover the common formats: nginx (combined log format), apache (identical pattern), syslog (RFC 3164 and RFC 5424), logfmt, and JSON message body.&lt;/p&gt;
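&lt;p&gt;For a sense of what the parsers do, a logfmt line carries its structure inline, and the parser's job is to lift each &lt;code&gt;key=value&lt;/code&gt; pair into a queryable field (the sample line and field names below are invented for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Raw logfmt input
level=error service=checkout duration_ms=1500

# After parsing: three structured fields on the log record
#   level       = "error"
#   service     = "checkout"
#   duration_ms = 1500
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;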

&lt;p&gt;&lt;strong&gt;Custom grok patterns&lt;/strong&gt; use &lt;code&gt;%{PATTERN:field}&lt;/code&gt; and &lt;code&gt;%{PATTERN:field:type}&lt;/code&gt; syntax, with 22 named built-ins (IPV4, WORD, NOTSPACE, NUMBER, POSINT, DATA, GREEDYDATA, QUOTEDSTRING, METHOD, URIPATH, HTTPDATE, and more) and optional type coercion (&lt;code&gt;:int&lt;/code&gt;, &lt;code&gt;:float&lt;/code&gt;). If your log format is unusual enough that none of the built-in parsers cover it, grok handles the rest.&lt;/p&gt;
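&lt;p&gt;Putting the built-ins together, a pattern for a minimal access-log line might look like this (the pattern names are the documented built-ins; the sample line and output are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Sample line:
#   203.0.113.7 GET /api/orders 200
# Pattern:
%{IPV4:client_ip} %{METHOD:method} %{URIPATH:path} %{NUMBER:status:int}
# Extracts: client_ip="203.0.113.7", method="GET", path="/api/orders", status=200 (coerced to int)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;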

&lt;p&gt;&lt;strong&gt;GeoIP enrichment&lt;/strong&gt; uses the embedded MaxMind GeoLite2 database. Point it at any field containing an IP address and get country, city, coordinates, timezone, and ISP added to the log record automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scope is flexible&lt;/strong&gt;: a pipeline can target a specific project or apply org-wide. Project-specific pipelines take priority over org-wide ones when both match. An in-memory cache in &lt;code&gt;getForProject&lt;/code&gt; holds the resolved pipeline per project for 5 minutes, invalidated automatically on create/update/delete.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pipeline preview&lt;/strong&gt; lets you test any combination of steps against a sample log message before saving. The UI shows per-step extracted fields and the final merged result side by side, so you can iterate on the configuration without committing it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;YAML import/export&lt;/strong&gt; follows the same pattern as dashboards: &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;description&lt;/code&gt;, &lt;code&gt;enabled&lt;/code&gt;, and &lt;code&gt;steps&lt;/code&gt; fields; re-importing the same pipeline for the same scope performs an upsert rather than creating a duplicate.&lt;/p&gt;
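&lt;p&gt;As a sketch, a pipeline file using those fields might look like this; the top-level keys are the ones named above, while the step shapes are illustrative assumptions rather than the exact schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;name: nginx-access
description: Parse nginx access logs and geolocate clients
enabled: true
steps:
  - type: parser        # step shapes are illustrative
    parser: nginx
  - type: geoip
    field: client_ip    # any field containing an IP address
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;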

&lt;p&gt;The &lt;strong&gt;step builder&lt;/strong&gt; in the settings UI (&lt;code&gt;/dashboard/settings/pipelines&lt;/code&gt;) lets you add, reorder, and configure steps interactively, with per-type configuration forms for parser selection, grok pattern input, and GeoIP field targeting.&lt;/p&gt;




&lt;h3&gt;
  
  
  Everything Else Worth Knowing
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Monitoring in the sidebar&lt;/strong&gt;: the monitoring section appears under "Detect" alongside Alerts and Security. No extra navigation to find it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dashboard switcher in the header&lt;/strong&gt;: replaces the previous single fixed entry point with a dropdown that handles create, delete, import, and export without leaving the page.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;failureThreshold&lt;/code&gt; default aligned&lt;/strong&gt;: the frontend form default was 3; the backend default was 2. They now match.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Project slugs&lt;/strong&gt;: auto-generated from project name on creation, unique per org, backfilled for existing projects via migration. The status page route (&lt;code&gt;/status/:projectSlug&lt;/code&gt;) uses these.&lt;/p&gt;




&lt;h2&gt;
  
  
  Upgrading
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose pull
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Migrations run automatically on startup. No manual steps required.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The roadmap toward v1.0 has a few clear remaining pieces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Digest reports&lt;/strong&gt; (#154): scheduled email summaries of log volume, top errors, and active incidents, useful for teams that don't live in the dashboard&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Webhook receivers&lt;/strong&gt; (#154): accept inbound webhooks from external services (PagerDuty, GitHub, Stripe, etc.) and normalize them into Logtide log events without a collector in the middle&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;v1.0 is the Beta milestone. We're not jumping straight to a public Beta declaration; we want the announcement to mean something. These issue groups are the remaining distance.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Full Changelog&lt;/strong&gt;: &lt;a href="https://github.com/logtide-dev/logtide/compare/v0.8.0...v0.9.0" rel="noopener noreferrer"&gt;v0.8.0...v0.9.0&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're using Logtide, open an issue, start a discussion, or drop a ⭐ if it's been useful.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>devops</category>
      <category>discuss</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>TigerFS: A Filesystem Backed by PostgreSQL</title>
      <dc:creator>Polliog</dc:creator>
      <pubDate>Thu, 09 Apr 2026 13:32:32 +0000</pubDate>
      <link>https://forem.com/polliog/tigerfs-a-filesystem-backed-by-postgresql-50i</link>
      <guid>https://forem.com/polliog/tigerfs-a-filesystem-backed-by-postgresql-50i</guid>
      <description>&lt;p&gt;TigerFS is a filesystem backed by PostgreSQL, built by the Timescale team. It mounts a database as a local directory via FUSE on Linux and NFS on macOS. Every file is a real row. Every directory is a table. Writes are transactions. Multiple processes and machines can read and write concurrently with full ACID guarantees.&lt;/p&gt;

&lt;p&gt;There are two distinct ways to use it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install (Linux requires fuse3; macOS needs no extra dependencies)&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://install.tigerfs.io | sh

&lt;span class="c"&gt;# Mount any PostgreSQL database&lt;/span&gt;
tigerfs mount postgres://localhost/mydb /mnt/db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Mode 1: Data-First
&lt;/h2&gt;

&lt;p&gt;Mount any existing PostgreSQL database and explore it with standard UNIX tools. Every path resolves to optimized SQL that gets pushed down to the database.&lt;/p&gt;

&lt;h3&gt;
  
  
  Exploring
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; /mnt/db/                                          &lt;span class="c"&gt;# list tables&lt;/span&gt;
&lt;span class="nb"&gt;ls&lt;/span&gt; /mnt/db/users/                                    &lt;span class="c"&gt;# list rows by primary key&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; /mnt/db/users/123.json                           &lt;span class="c"&gt;# read a row as JSON&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; /mnt/db/users/123/email.txt                      &lt;span class="c"&gt;# read a single column&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; /mnt/db/users/.by/email/alice@example.com.json   &lt;span class="c"&gt;# lookup by indexed column&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Modifying
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'new@example.com'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /mnt/db/users/123/email.txt          &lt;span class="c"&gt;# update a column&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'{"email":"a@b.com","name":"A"}'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /mnt/db/users/123.json  &lt;span class="c"&gt;# PATCH via JSON&lt;/span&gt;
&lt;span class="nb"&gt;mkdir&lt;/span&gt; /mnt/db/users/456                                         &lt;span class="c"&gt;# insert a row&lt;/span&gt;
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; /mnt/db/users/456/                                        &lt;span class="c"&gt;# delete a row&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pipeline Queries
&lt;/h3&gt;

&lt;p&gt;Filters, ordering, and pagination can be chained directly in the path. TigerFS executes the whole chain as a single SQL query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Last 10 orders for customer 123, sorted by created_at, as JSON&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; /mnt/db/orders/.by/customer_id/123/.order/created_at/.last/10/.export/json

&lt;span class="c"&gt;# Shipped orders, specific columns only, as CSV&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; /mnt/db/orders/.filter/status/shipped/.columns/id,total,created_at/.export/csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Available segments (chainable in any order):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Segment&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.by/col/val&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Indexed filter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.filter/col/val&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Any column filter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.order/col&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Sort&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.columns/a,b,c&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Column projection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;.first/N&lt;/code&gt;, &lt;code&gt;.last/N&lt;/code&gt;, &lt;code&gt;.sample/N&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Pagination&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.export/json&lt;/code&gt;, &lt;code&gt;.export/csv&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Output format&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Bulk Ingest
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;data.csv &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /mnt/db/orders/.import/.append/csv    &lt;span class="c"&gt;# append rows&lt;/span&gt;
&lt;span class="nb"&gt;cat &lt;/span&gt;data.csv &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /mnt/db/orders/.import/.sync/csv      &lt;span class="c"&gt;# upsert by primary key&lt;/span&gt;
&lt;span class="nb"&gt;cat &lt;/span&gt;data.csv &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /mnt/db/orders/.import/.overwrite/csv &lt;span class="c"&gt;# replace the table&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Schema Management
&lt;/h3&gt;

&lt;p&gt;Tables, indexes, and views are managed through a staging pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; /mnt/db/.create/orders
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"CREATE TABLE orders (...)"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /mnt/db/.create/orders/sql
&lt;span class="nb"&gt;touch&lt;/span&gt; /mnt/db/.create/orders/.commit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Mode 2: File-First
&lt;/h2&gt;

&lt;p&gt;Create a new database and use it as a transactional shared workspace. Any tool that works with files works here: AI agents, &lt;code&gt;grep&lt;/code&gt;, &lt;code&gt;vim&lt;/code&gt;, shell scripts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Markdown Apps
&lt;/h3&gt;

&lt;p&gt;"Apps" define how TigerFS presents a table as a native file format. Writing &lt;code&gt;markdown&lt;/code&gt; to &lt;code&gt;.build/&lt;/code&gt; turns a table into a directory of &lt;code&gt;.md&lt;/code&gt; files where YAML frontmatter maps to columns and the document body maps to a text column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"markdown"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /mnt/db/.build/blog

&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /mnt/db/blog/hello-world.md &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
---
title: Hello World
author: alice
tags: [intro]
---

# Hello World

Welcome to my blog...
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# Standard tools work as expected&lt;/span&gt;
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="s2"&gt;"author: alice"&lt;/span&gt; /mnt/db/blog/&lt;span class="k"&gt;*&lt;/span&gt;.md
&lt;span class="nb"&gt;mkdir&lt;/span&gt; /mnt/db/blog/tutorials
&lt;span class="nb"&gt;mv&lt;/span&gt; /mnt/db/blog/hello-world.md /mnt/db/blog/tutorials/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Version History
&lt;/h3&gt;

&lt;p&gt;Add &lt;code&gt;history&lt;/code&gt; to the app definition and every edit is captured as a timestamped snapshot in a read-only &lt;code&gt;.history/&lt;/code&gt; directory. History uses TimescaleDB hypertables for compressed storage and tracks files across renames via stable row UUIDs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"markdown,history"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /mnt/db/.build/notes

&lt;span class="nb"&gt;ls&lt;/span&gt; /mnt/db/notes/.history/hello.md/
&lt;span class="c"&gt;# 2026-02-24T150000Z  2026-02-12T013000Z&lt;/span&gt;

&lt;span class="nb"&gt;cat&lt;/span&gt; /mnt/db/notes/.history/hello.md/2026-02-12T013000Z
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Multi-Agent Task Queue
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;mv&lt;/code&gt; between directories is an atomic database operation. Two agents cannot claim the same task because the underlying transaction will fail for one of them - no distributed lock manager, no coordination API needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"markdown,history"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /mnt/db/.build/tasks
&lt;span class="nb"&gt;mkdir&lt;/span&gt; /mnt/db/tasks/todo /mnt/db/tasks/doing /mnt/db/tasks/done

&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /mnt/db/tasks/todo/fix-auth-bug.md &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
---
priority: high
assigned_to:
---
The OAuth token refresh is failing for users with...
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# Agent claims the task - atomic database operation&lt;/span&gt;
&lt;span class="nb"&gt;mv&lt;/span&gt; /mnt/db/tasks/todo/fix-auth-bug.md /mnt/db/tasks/doing/fix-auth-bug.md

&lt;span class="c"&gt;# Agent marks it done&lt;/span&gt;
&lt;span class="nb"&gt;mv&lt;/span&gt; /mnt/db/tasks/doing/fix-auth-bug.md /mnt/db/tasks/done/fix-auth-bug.md

&lt;span class="c"&gt;# Check what is in progress&lt;/span&gt;
&lt;span class="nb"&gt;ls&lt;/span&gt; /mnt/db/tasks/doing/
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"assigned_to:"&lt;/span&gt; /mnt/db/tasks/doing/&lt;span class="k"&gt;*&lt;/span&gt;.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Shared Agent Workspace
&lt;/h3&gt;

&lt;p&gt;Multiple agents on different machines can read and write the same files concurrently. Changes are visible immediately with no pull, push, or merge step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Agent A writes findings&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /mnt/db/kb/auth-analysis.md &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
---
author: agent-a
---
OAuth 2.0 is the recommended approach because...
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# Agent B reads immediately, no sync needed&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; /mnt/db/kb/auth-analysis.md

&lt;span class="c"&gt;# Agent B updates the document&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /mnt/db/kb/auth-analysis.md &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
---
author: agent-a
reviewed-by: agent-b
status: approved
---
OAuth 2.0 is the recommended approach because... [approved with comments]
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# Full edit trail&lt;/span&gt;
&lt;span class="nb"&gt;ls&lt;/span&gt; /mnt/db/kb/.history/auth-analysis.md/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Cloud Backends
&lt;/h2&gt;

&lt;p&gt;TigerFS works with any PostgreSQL database via connection string. It also integrates with Timescale Cloud and Ghost through their CLIs - no passwords stored in config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;tigerfs mount postgres://user:pass@host/mydb /mnt/db

tigerfs mount tiger:abcde12345 /mnt/db   &lt;span class="c"&gt;# Timescale Cloud&lt;/span&gt;
tigerfs mount ghost:fghij67890 /mnt/db   &lt;span class="c"&gt;# Ghost&lt;/span&gt;

&lt;span class="c"&gt;# Fork a database for safe experimentation&lt;/span&gt;
tigerfs fork /mnt/db my-experiment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Why the File Interface
&lt;/h2&gt;

&lt;p&gt;The point of the filesystem abstraction is that every tool already speaks it. &lt;code&gt;grep&lt;/code&gt;, &lt;code&gt;awk&lt;/code&gt;, &lt;code&gt;jq&lt;/code&gt;, shell scripts, AI coding agents (Claude Code, Cursor, and others) all understand files without any SDK, schema definition, or client library to set up.&lt;/p&gt;

&lt;p&gt;For multi-agent coordination specifically:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;What you build on top&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Custom REST API&lt;/td&gt;
&lt;td&gt;Endpoints, auth, deployment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shared database directly&lt;/td&gt;
&lt;td&gt;SQL or ORM, schema definitions, client libraries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Git&lt;/td&gt;
&lt;td&gt;Pull/push/merge workflow, conflict resolution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S3&lt;/td&gt;
&lt;td&gt;Transactions and structured queries (S3 has neither)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TigerFS&lt;/td&gt;
&lt;td&gt;Mount and use standard tools&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The coordination logic - atomic task claims, version history, concurrent access - lives in PostgreSQL. The application doesn't implement it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Current Status
&lt;/h2&gt;

&lt;p&gt;TigerFS is at v0.5.0 and described as early-stage by the team, though the core design is stable. The data-first mode is functional today for any PostgreSQL database. Planned additions include support for tables without primary keys (read-only via ctid) and TimescaleDB hypertable time-based navigation.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/timescale/tigerfs" rel="noopener noreferrer"&gt;timescale/tigerfs&lt;/a&gt; - Docs: &lt;a href="https://tigerfs.io" rel="noopener noreferrer"&gt;tigerfs.io&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;tigerfs mount postgres://localhost/yourdb /mnt/db
&lt;span class="nb"&gt;ls&lt;/span&gt; /mnt/db/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>postgres</category>
      <category>database</category>
      <category>devtools</category>
      <category>ai</category>
    </item>
    <item>
      <title>I Replaced ElastiCache with Valkey on ECS (And Cut the Bill by 70%)</title>
      <dc:creator>Polliog</dc:creator>
      <pubDate>Wed, 08 Apr 2026 11:14:38 +0000</pubDate>
      <link>https://forem.com/aws-builders/i-replaced-elasticache-with-valkey-on-ecs-and-cut-the-bill-by-70-14ga</link>
      <guid>https://forem.com/aws-builders/i-replaced-elasticache-with-valkey-on-ecs-and-cut-the-bill-by-70-14ga</guid>
      <description>&lt;p&gt;ElastiCache is a genuinely good service. Managed failover, automated backups, CloudWatch integration out of the box. For teams that need Redis and don't want to operate it, it makes sense.&lt;/p&gt;

&lt;p&gt;The price, however, does not scale down. A &lt;code&gt;cache.t4g.small&lt;/code&gt; (2 vCPU, 1.37GB RAM) runs about $25/month in eu-west-1. A &lt;code&gt;cache.r7g.large&lt;/code&gt; (2 vCPU, 13.07GB RAM) is $175/month. Multi-AZ doubles those numbers. For a startup or side project with little to no revenue, that's a significant line item for what is often just a queue and a session cache.&lt;/p&gt;

&lt;p&gt;Valkey is a Redis-compatible open-source project under the Linux Foundation, backed by AWS, Google, and others. It forked from Redis 7.2.4 (BSD-3 license) and maintains full protocol compatibility. Every client library that works with Redis works with Valkey: &lt;code&gt;ioredis&lt;/code&gt;, &lt;code&gt;node-redis&lt;/code&gt;, BullMQ, Sidekiq, Celery. No code changes.&lt;/p&gt;

&lt;p&gt;Here's how to run it on ECS Fargate and what it actually costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;We're running Valkey as an ECS service on Fargate, backed by an EFS volume for persistence, inside a VPC with a security group that restricts access to the application services only.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VPC
├── Public Subnet
│   └── Application Load Balancer
├── Private Subnet A
│   ├── ECS Service: App (Fargate)
│   └── ECS Service: Valkey (Fargate)
└── Private Subnet B
    └── ECS Service: App (Fargate) - replica
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Valkey doesn't need to be publicly accessible. It lives in the private subnet and is reachable only from the application services in the same VPC.&lt;/p&gt;
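&lt;p&gt;The security group &lt;code&gt;aws_security_group.valkey_task&lt;/code&gt; referenced in the Terraform below isn't shown in full; a minimal sketch looks like this, assuming the application tasks run under a group named &lt;code&gt;aws_security_group.app&lt;/code&gt; (a placeholder name):&lt;/p&gt;

```hcl
resource "aws_security_group" "valkey_task" {
  name   = "${var.app_name}-valkey"
  vpc_id = var.vpc_id

  # Only the application tasks may reach Valkey on 6379
  ingress {
    from_port       = 6379
    to_port         = 6379
    protocol        = "tcp"
    security_groups = [aws_security_group.app.id]
  }

  # Outbound left open so the task can reach EFS and pull the image
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```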

&lt;h2&gt;
  
  
  Step 1: EFS Volume for Persistence
&lt;/h2&gt;

&lt;p&gt;Fargate tasks are ephemeral: without a persistent volume, every Valkey restart loses your data. EFS provides persistence without requiring you to manage EC2 instances.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# terraform/efs.tf&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_efs_file_system"&lt;/span&gt; &lt;span class="s2"&gt;"valkey"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;creation_token&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.app_name}-valkey"&lt;/span&gt;
  &lt;span class="nx"&gt;encrypted&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="nx"&gt;lifecycle_policy&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;transition_to_ia&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"AFTER_7_DAYS"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.app_name}-valkey"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_efs_mount_target"&lt;/span&gt; &lt;span class="s2"&gt;"valkey"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;for_each&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;toset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;private_subnet_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="nx"&gt;file_system_id&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_efs_file_system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;valkey&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;subnet_id&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;
  &lt;span class="nx"&gt;security_groups&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;efs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_security_group"&lt;/span&gt; &lt;span class="s2"&gt;"efs"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.app_name}-efs"&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_id&lt;/span&gt;

  &lt;span class="nx"&gt;ingress&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;from_port&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2049&lt;/span&gt;
    &lt;span class="nx"&gt;to_port&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2049&lt;/span&gt;
    &lt;span class="nx"&gt;protocol&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tcp"&lt;/span&gt;
    &lt;span class="nx"&gt;security_groups&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;valkey_task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Important: use an EFS Access Point, not &lt;code&gt;rootDirectory&lt;/code&gt; directly.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Specifying a &lt;code&gt;rootDirectory&lt;/code&gt; that doesn't physically exist on a fresh EFS filesystem causes the Fargate task to fail immediately with &lt;code&gt;ResourceInitializationError: failed to invoke EFS utils... directory does not exist&lt;/code&gt;. Fargate won't create the directory automatically.&lt;/p&gt;

&lt;p&gt;EFS Access Points handle this correctly - they create the directory with the right UNIX permissions if it doesn't exist yet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_efs_access_point"&lt;/span&gt; &lt;span class="s2"&gt;"valkey"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;file_system_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_efs_file_system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;valkey&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;

  &lt;span class="nx"&gt;posix_user&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;uid&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
    &lt;span class="nx"&gt;gid&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;root_directory&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"/valkey"&lt;/span&gt;
    &lt;span class="nx"&gt;creation_info&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;owner_uid&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
      &lt;span class="nx"&gt;owner_gid&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
      &lt;span class="nx"&gt;permissions&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"0755"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2: ECS Task Definition
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"family"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"valkey"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"networkMode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"awsvpc"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"requiresCompatibilities"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"FARGATE"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"cpu"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"512"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"memory"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1024"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"executionRoleArn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:iam::ACCOUNT_ID:role/ecsTaskExecutionRole"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"volumes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"valkey-data"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"efsVolumeConfiguration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"fileSystemId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fs-XXXXXXXXX"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"transitEncryption"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ENABLED"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"authorizationConfig"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"accessPointId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fsap-XXXXXXXXX"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"iam"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ENABLED"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"containerDefinitions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"valkey"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"image"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"valkey/valkey:8.0-alpine"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"portMappings"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"containerPort"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6379&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"protocol"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tcp"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"valkey-server"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"--save"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"60"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1000"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"--appendonly"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"yes"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"--appendfsync"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"everysec"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"--maxmemory"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"800mb"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"--maxmemory-policy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"allkeys-lru"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"--requirepass"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"VALKEY_PASSWORD_FROM_SECRETS_MANAGER"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"mountPoints"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"sourceVolume"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"valkey-data"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"containerPath"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/data"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"readOnly"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"logConfiguration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"logDriver"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"awslogs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"options"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"awslogs-group"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/ecs/valkey"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"awslogs-region"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"eu-west-1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"awslogs-stream-prefix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"valkey"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"healthCheck"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"CMD-SHELL"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"valkey-cli ping | grep PONG"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"interval"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"timeout"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"retries"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"startPeriod"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things worth noting in this config:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;--save 60 1000&lt;/code&gt; triggers an RDB snapshot every 60 seconds if at least 1000 keys changed. Combined with &lt;code&gt;--appendonly yes&lt;/code&gt;, you get both AOF and RDB persistence - the AOF gives you per-second durability, the RDB gives you faster restart times.&lt;/p&gt;
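&lt;p&gt;To make the snapshot rule concrete, here is a tiny illustrative model (not Valkey's actual implementation) of how the &lt;code&gt;--save &amp;lt;seconds&amp;gt; &amp;lt;changes&amp;gt;&lt;/code&gt; condition behaves - both thresholds must be met:&lt;/p&gt;

```typescript
// Illustrative model of the `--save <seconds> <changes>` rule with the
// values used above (60 seconds, 1000 changes): an RDB snapshot fires
// only once BOTH thresholds have been reached since the last save.
function shouldSnapshot(
  secondsSinceLastSave: number,
  changesSinceLastSave: number,
  minSeconds = 60,
  minChanges = 1000
): boolean {
  return secondsSinceLastSave >= minSeconds && changesSinceLastSave >= minChanges;
}
```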

&lt;p&gt;&lt;code&gt;--maxmemory-policy allkeys-lru&lt;/code&gt; means Valkey will evict the least recently used keys when it hits the memory limit. For a cache workload this is usually what you want. For a queue workload (BullMQ, Sidekiq) you should use &lt;code&gt;noeviction&lt;/code&gt; instead and alert on memory pressure.&lt;/p&gt;
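&lt;p&gt;Under &lt;code&gt;noeviction&lt;/code&gt; you have to watch memory yourself. A minimal sketch of the alerting side, assuming the raw text comes from &lt;code&gt;ioredis&lt;/code&gt;'s &lt;code&gt;valkey.info("memory")&lt;/code&gt; - the parser is a pure function, so it's shown standalone:&lt;/p&gt;

```typescript
// Parse the used_memory / maxmemory fields out of a raw INFO-memory
// reply and return the usage ratio (0..1), or null if no limit is set.
function memoryUsageRatio(info: string): number | null {
  const fields = new Map<string, string>();
  for (const line of info.split("\r\n")) {
    const idx = line.indexOf(":");
    // Skip comment lines ("# Memory") and blanks
    if (idx > 0 && !line.startsWith("#")) {
      fields.set(line.slice(0, idx), line.slice(idx + 1));
    }
  }
  const used = Number(fields.get("used_memory"));
  const max = Number(fields.get("maxmemory"));
  if (!Number.isFinite(used) || !Number.isFinite(max) || max === 0) {
    return null; // maxmemory 0 means "no limit configured"
  }
  return used / max;
}

// Example wiring (assumes an ioredis client named `valkey` and an
// alerting helper `alertOps` of your own):
//   const ratio = memoryUsageRatio(await valkey.info("memory"));
//   if (ratio !== null && ratio > 0.9) alertOps("Valkey memory above 90%");
```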

&lt;p&gt;The password comes from Secrets Manager via the &lt;code&gt;secrets&lt;/code&gt; field in the task definition rather than a hardcoded string. Note that once &lt;code&gt;--requirepass&lt;/code&gt; is set, the &lt;code&gt;healthCheck&lt;/code&gt; command must authenticate too (&lt;code&gt;valkey-cli -a "$VALKEY_PASSWORD" ping&lt;/code&gt;), since an unauthenticated &lt;code&gt;PING&lt;/code&gt; returns &lt;code&gt;NOAUTH&lt;/code&gt;. The example above is simplified for readability.&lt;/p&gt;
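&lt;p&gt;A sketch of what the non-simplified container fragment could look like (shown as a TypeScript object literal; the secret ARN is a placeholder). The &lt;code&gt;secrets&lt;/code&gt; field injects &lt;code&gt;VALKEY_PASSWORD&lt;/code&gt; as an environment variable, and because a plain &lt;code&gt;command&lt;/code&gt; array isn't run through a shell, the entrypoint is wrapped in &lt;code&gt;sh -c&lt;/code&gt; so the variable expands at start-up:&lt;/p&gt;

```typescript
// Hypothetical container-definition fragment: password injected from
// Secrets Manager instead of hardcoded in the command array.
const containerFragment = {
  secrets: [
    {
      name: "VALKEY_PASSWORD",
      // Placeholder ARN - substitute your own secret
      valueFrom: "arn:aws:secretsmanager:eu-west-1:ACCOUNT_ID:secret:valkey-password",
    },
  ],
  // Wrapped in `sh -c` so $VALKEY_PASSWORD is expanded by the shell
  // inside the container (other flags omitted for brevity).
  command: [
    "sh",
    "-c",
    'valkey-server --appendonly yes --requirepass "$VALKEY_PASSWORD"',
  ],
};
```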

&lt;h2&gt;
  
  
  Step 3: ECS Service and Service Discovery
&lt;/h2&gt;

&lt;p&gt;For the application to connect to Valkey, it needs a stable hostname. ECS Service Discovery provides this via AWS Cloud Map.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# terraform/service-discovery.tf&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_service_discovery_private_dns_namespace"&lt;/span&gt; &lt;span class="s2"&gt;"internal"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"internal.${var.app_name}"&lt;/span&gt;
  &lt;span class="nx"&gt;vpc&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_service_discovery_service"&lt;/span&gt; &lt;span class="s2"&gt;"valkey"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"valkey"&lt;/span&gt;

  &lt;span class="nx"&gt;dns_config&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;namespace_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_service_discovery_private_dns_namespace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;internal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;

    &lt;span class="nx"&gt;dns_records&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;ttl&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
      &lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"A"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nx"&gt;routing_policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"MULTIVALUE"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;health_check_custom_config&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;failure_threshold&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_ecs_service"&lt;/span&gt; &lt;span class="s2"&gt;"valkey"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"valkey"&lt;/span&gt;
  &lt;span class="nx"&gt;cluster&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_ecs_cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;task_definition&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_ecs_task_definition&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;valkey&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="nx"&gt;desired_count&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="nx"&gt;launch_type&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"FARGATE"&lt;/span&gt;

  &lt;span class="nx"&gt;network_configuration&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;subnets&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;private_subnet_ids&lt;/span&gt;
    &lt;span class="nx"&gt;security_groups&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;valkey_task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nx"&gt;assign_public_ip&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;service_registries&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;registry_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_service_discovery_service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;valkey&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;# Prevent ECS from cycling the task during deployments&lt;/span&gt;
  &lt;span class="c1"&gt;# when the app has active connections&lt;/span&gt;
  &lt;span class="nx"&gt;deployment_minimum_healthy_percent&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
  &lt;span class="nx"&gt;deployment_maximum_percent&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your application connects to &lt;code&gt;valkey.internal.your-app:6379&lt;/code&gt;. When the task is replaced (restart, deployment), the DNS record updates automatically within the TTL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Connecting from Node.js
&lt;/h2&gt;

&lt;p&gt;No code changes from a Redis setup. &lt;code&gt;ioredis&lt;/code&gt; works as-is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;Redis&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ioredis&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;valkey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;VALKEY_HOST&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;// valkey.internal.your-app&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;6379&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;VALKEY_PASSWORD&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="c1"&gt;// Retry on connection loss - important for task replacement events&lt;/span&gt;
  &lt;span class="na"&gt;retryStrategy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;times&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;times&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;maxRetriesPerRequest&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;enableOfflineQueue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;lazyConnect&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;valkey&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;error&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Valkey connection error:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;valkey&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;reconnecting&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Valkey reconnecting...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;BullMQ requires no changes either:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Worker&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;bullmq&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;valkey&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;./valkey&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;emailQueue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;emails&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;valkey&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;worker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Worker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;emails&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;job&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;sendEmail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;valkey&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Cost Comparison
&lt;/h2&gt;

&lt;p&gt;Here is what running Valkey on Fargate with 0.5 vCPU and 1GB RAM costs in eu-west-1.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The Fargate numbers below use on-demand pricing ($0.04048/vCPU-hour, $0.004445/GB-hour). If you're using a Compute Savings Plan (common for Fargate workloads), expect 20-40% lower compute costs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Monthly cost (on-demand)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fargate compute (0.5 vCPU)&lt;/td&gt;
&lt;td&gt;$14.77&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fargate memory (1GB)&lt;/td&gt;
&lt;td&gt;$3.24&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EFS storage (5GB used)&lt;/td&gt;
&lt;td&gt;$1.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EFS throughput&lt;/td&gt;
&lt;td&gt;$0.30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudWatch logs&lt;/td&gt;
&lt;td&gt;$0.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$20.30&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;vs. ElastiCache &lt;code&gt;cache.t4g.small&lt;/code&gt; (1.37GB RAM):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Monthly cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ElastiCache t4g.small&lt;/td&gt;
&lt;td&gt;$25.20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$25.20&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Single-node, the savings are modest (~20%). The real comparison is Multi-AZ ElastiCache, which is the production-grade option: a Multi-AZ &lt;code&gt;cache.t4g.small&lt;/code&gt; is $50.40/month. Against that, Fargate at ~$20 is a 60% reduction - and with a Savings Plan applied, closer to 70%.&lt;/p&gt;

&lt;p&gt;For a &lt;code&gt;cache.r7g.large&lt;/code&gt; workload (13GB RAM), the numbers shift further:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Option&lt;/th&gt;
&lt;th&gt;Monthly cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ElastiCache r7g.large&lt;/td&gt;
&lt;td&gt;$175.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ElastiCache r7g.large Multi-AZ&lt;/td&gt;
&lt;td&gt;$350.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fargate (2 vCPU / 13GB, on-demand)&lt;/td&gt;
&lt;td&gt;~$108.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fargate (2 vCPU / 13GB, Savings Plan)&lt;/td&gt;
&lt;td&gt;~$70.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
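&lt;p&gt;The Fargate rows above follow directly from the per-hour rates quoted earlier. A quick sketch (assuming ~730 hours per month; EFS and CloudWatch are excluded):&lt;/p&gt;

```typescript
// Rough Fargate cost estimate from the on-demand rates quoted above (eu-west-1).
// 730 hours/month is an approximation (365 * 24 / 12).
const VCPU_HOUR = 0.04048; // USD per vCPU-hour
const GB_HOUR = 0.004445; // USD per GB-hour
const HOURS_PER_MONTH = 730;

function fargateMonthlyCost(vcpu: number, memoryGb: number): number {
  return (vcpu * VCPU_HOUR + memoryGb * GB_HOUR) * HOURS_PER_MONTH;
}

// 0.5 vCPU / 1 GB -> ~$18 compute+memory (storage and logs come on top)
console.log(fargateMonthlyCost(0.5, 1).toFixed(2));
// 2 vCPU / 13 GB -> ~$101 before EFS and CloudWatch
console.log(fargateMonthlyCost(2, 13).toFixed(2));
```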

&lt;p&gt;The savings are real. So is the operational difference.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Give Up
&lt;/h2&gt;

&lt;p&gt;ElastiCache manages automatic failover, Multi-AZ replication, and rolling upgrades. With Fargate, you're responsible for all of that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No automatic failover.&lt;/strong&gt; If the Valkey Fargate task dies, ECS will restart it automatically (typically 30-90 seconds). During that window, connections fail. For a cache this is usually acceptable. For a job queue, your workers will hit connection errors and retry - still acceptable if you've configured retries correctly. For session storage, users get logged out. Decide based on your workload.&lt;/p&gt;
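&lt;p&gt;"Configured retries correctly" deserves a concrete number. BullMQ's built-in exponential backoff computes &lt;code&gt;delay * 2^(attemptsMade - 1)&lt;/code&gt;; the sketch below (pure arithmetic, with illustrative values) shows that five attempts with a 5-second base delay keep a job retrying for roughly 75 seconds - enough to ride out a 30-90 second ECS restart:&lt;/p&gt;

```typescript
// BullMQ's "exponential" backoff strategy: delay * 2^(attemptsMade - 1).
// Pure helper to inspect the retry schedule; values are illustrative.
function exponentialBackoffMs(baseDelayMs: number, attempt: number): number {
  return baseDelayMs * 2 ** (attempt - 1);
}

// With 5 attempts (1 initial + 4 retries) and a 5s base delay, the delays
// before retries 2..5 sum to ~75 seconds.
const schedule = [1, 2, 3, 4].map((attempt) => exponentialBackoffMs(5000, attempt));
console.log(schedule); // delays in ms before each retry
```

&lt;p&gt;The matching job options would be &lt;code&gt;{ attempts: 5, backoff: { type: "exponential", delay: 5000 } }&lt;/code&gt;.&lt;/p&gt;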

&lt;p&gt;&lt;strong&gt;Manual upgrades.&lt;/strong&gt; You control the Docker image tag. Update the task definition, trigger a new deployment. No automatic patch management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No Multi-AZ replication out of the box.&lt;/strong&gt; If you need a hot standby, you'll need to set up Valkey's built-in replication between two Fargate tasks and handle failover at the application level or with an intermediate proxy. This adds complexity that may not be worth it below a certain scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Persistence responsibility.&lt;/strong&gt; EFS gives you durable storage, but you're responsible for backup strategy. Set up AWS Backup for the EFS volume or use Valkey's &lt;code&gt;BGSAVE&lt;/code&gt; + S3 export for point-in-time backups.&lt;/p&gt;

&lt;h2&gt;
  
  
  When This Makes Sense
&lt;/h2&gt;

&lt;p&gt;This setup is the right call when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your workload is a queue, cache, or session store without strict HA requirements&lt;/li&gt;
&lt;li&gt;You're running a startup, side project, or internal tool where a 60-second outage is acceptable&lt;/li&gt;
&lt;li&gt;You're already paying for Fargate and EFS for other services&lt;/li&gt;
&lt;li&gt;The ElastiCache bill is a meaningful percentage of your monthly AWS spend&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's the wrong call when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need sub-second automatic failover&lt;/li&gt;
&lt;li&gt;You're storing session data where eviction causes user-visible disruption&lt;/li&gt;
&lt;li&gt;Your compliance requirements mandate managed services with AWS support coverage&lt;/li&gt;
&lt;li&gt;You're operating at a scale where the operational overhead of self-managed infrastructure costs more than the ElastiCache bill&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The break-even point for most teams is somewhere around $100-150/month in Redis costs. Below that, ElastiCache's convenience usually wins. Above it, self-hosted starts to look attractive even accounting for the operational investment.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Running Valkey on AWS in a different configuration? Different numbers for your region or workload? Happy to hear it in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>redis</category>
      <category>valkey</category>
      <category>ecs</category>
    </item>
    <item>
      <title>Your Node.js App Is Probably Killing Your PostgreSQL (Connection Pooling Explained)</title>
      <dc:creator>Polliog</dc:creator>
      <pubDate>Mon, 06 Apr 2026 20:50:47 +0000</pubDate>
      <link>https://forem.com/polliog/your-nodejs-app-is-probably-killing-your-postgresql-connection-pooling-explained-1db2</link>
      <guid>https://forem.com/polliog/your-nodejs-app-is-probably-killing-your-postgresql-connection-pooling-explained-1db2</guid>
      <description>&lt;p&gt;A few months ago I was looking at why a PostgreSQL instance was running at 94% memory on a server that, by all accounts, should have had plenty of headroom. The queries were fast, the data volume was modest, and CPU was barely touched.&lt;/p&gt;

&lt;p&gt;The culprit was 280 open connections.&lt;/p&gt;

&lt;p&gt;No single connection was doing anything particularly expensive. But each one carries a cost that most developers don't think about until they're in production staring at an OOM kill: PostgreSQL spawns a dedicated backend process per connection, and each process consumes roughly 5-10MB of RAM regardless of whether it's actively running a query.&lt;/p&gt;

&lt;p&gt;280 connections x 7MB average = 1.96GB. On a server with 4GB RAM and PostgreSQL's own memory settings (shared_buffers, work_mem), that leaves almost nothing for actual query execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Node.js Apps Over-Connect
&lt;/h2&gt;

&lt;p&gt;The problem is architectural. Node.js applications are typically deployed as multiple processes or containers: a web server, one or more background workers, maybe a separate process for scheduled jobs. Each runs its own connection pool. Each pool opens connections eagerly.&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;pg&lt;/code&gt; and a default pool size of 10, and 3 services each with 3 replicas:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;web server (3 replicas x 10 connections) = 30 connections
background worker (3 replicas x 10 connections) = 30 connections
job scheduler (3 replicas x 5 connections) = 15 connections
Total: 75 connections at idle
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add a traffic spike, pool expansion, and a few long-running queries holding connections open, and you're at 150+ before anything goes wrong with your code.&lt;/p&gt;

&lt;p&gt;PostgreSQL's default &lt;code&gt;max_connections&lt;/code&gt; is 100. Many managed databases (RDS, Supabase, Neon) set it lower for small instance sizes.&lt;/p&gt;
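&lt;p&gt;The arithmetic generalizes. A small sketch (the service list is illustrative) for tallying idle connections against &lt;code&gt;max_connections&lt;/code&gt;:&lt;/p&gt;

```typescript
// Tally eagerly opened pool connections across deployments and compare to
// the server's max_connections. Service list mirrors the example above.
interface Service {
  replicas: number;
  poolSize: number;
}

function totalIdleConnections(services: Service[]): number {
  return services.reduce((sum, s) => sum + s.replicas * s.poolSize, 0);
}

const services: Service[] = [
  { replicas: 3, poolSize: 10 }, // web server
  { replicas: 3, poolSize: 10 }, // background worker
  { replicas: 3, poolSize: 5 }, // job scheduler
];

const maxConnections = 100; // PostgreSQL default
console.log(`${totalIdleConnections(services)}/${maxConnections} slots used at idle`);
```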

&lt;h2&gt;
  
  
  What Happens When You Hit the Limit
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error: remaining connection slots are reserved for non-replication superuser connections
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or, worse, requests that queue indefinitely waiting for a connection that never frees up because every connection is held by a slow query, and the slow query is slow because it can't get a lock, because another connection holds it, and that connection is waiting for... a connection.&lt;/p&gt;

&lt;p&gt;You get the idea.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Wrong Fix
&lt;/h2&gt;

&lt;p&gt;The instinct is to increase &lt;code&gt;max_connections&lt;/code&gt;. This works until it doesn't: more connections means more RAM pressure, more context switching, and more lock contention. PostgreSQL is not designed for thousands of concurrent connections. It's designed for dozens of active queries with efficient I/O, and it's exceptional at that.&lt;/p&gt;

&lt;p&gt;The right fix is to not open connections you don't need.&lt;/p&gt;

&lt;h2&gt;
  
  
  PgBouncer: A Connection Pool in Front of PostgreSQL
&lt;/h2&gt;

&lt;p&gt;PgBouncer sits between your application and PostgreSQL. Your application thinks it's talking to PostgreSQL directly - same protocol, same port behavior. PgBouncer maintains a much smaller pool of real PostgreSQL connections and multiplexes client connections onto them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;App (100 client connections)
         |
    [PgBouncer]
         |
PostgreSQL (20 server connections)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;100 application connections, 20 actual PostgreSQL connections. The application never notices.&lt;/p&gt;

&lt;p&gt;PgBouncer has three pooling modes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session pooling&lt;/strong&gt; - a server connection is assigned to a client for the entire session duration. Equivalent to no pooling for persistent connections, but useful for clients that connect and disconnect frequently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transaction pooling&lt;/strong&gt; - a server connection is assigned only for the duration of a transaction. As soon as your transaction commits or rolls back, the connection goes back to the pool. This is the mode that actually reduces your connection count dramatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Statement pooling&lt;/strong&gt; - a server connection is assigned for a single statement. Very aggressive, incompatible with multi-statement transactions. Rarely the right choice.&lt;/p&gt;

&lt;p&gt;For most Node.js workloads, &lt;strong&gt;transaction pooling&lt;/strong&gt; is what you want.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up PgBouncer with Docker
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.yml&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pgbouncer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bitnami/pgbouncer:latest&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRESQL_HOST&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRESQL_PORT&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5432&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRESQL_DATABASE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRESQL_USERNAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app_user&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRESQL_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${DB_PASSWORD}&lt;/span&gt;
      &lt;span class="na"&gt;PGBOUNCER_PORT&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;6432&lt;/span&gt;
      &lt;span class="na"&gt;PGBOUNCER_POOL_MODE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;transaction&lt;/span&gt;
      &lt;span class="na"&gt;PGBOUNCER_MAX_CLIENT_CONN&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;
      &lt;span class="na"&gt;PGBOUNCER_DEFAULT_POOL_SIZE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;25&lt;/span&gt;
      &lt;span class="na"&gt;PGBOUNCER_MIN_POOL_SIZE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
      &lt;span class="na"&gt;PGBOUNCER_RESERVE_POOL_SIZE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
      &lt;span class="na"&gt;PGBOUNCER_RESERVE_POOL_TIMEOUT&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
      &lt;span class="na"&gt;PGBOUNCER_SERVER_IDLE_TIMEOUT&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;600&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;6432:6432"&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your application connects to port 6432 (PgBouncer) instead of 5432 (PostgreSQL). Everything else stays the same.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Before&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;pool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Pool&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;connectionString&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;postgresql://app_user:password@postgres:5432/myapp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;max&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// After&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;pool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Pool&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;connectionString&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;postgresql://app_user:password@pgbouncer:6432/myapp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;max&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// can be higher now - PgBouncer handles the real limit&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;p&gt;Same application, same workload, same PostgreSQL instance. Before and after adding PgBouncer in transaction mode:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Without PgBouncer&lt;/th&gt;
&lt;th&gt;With PgBouncer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PostgreSQL connections (idle)&lt;/td&gt;
&lt;td&gt;75&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PostgreSQL connections (peak load)&lt;/td&gt;
&lt;td&gt;210&lt;/td&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PostgreSQL RAM used by connections&lt;/td&gt;
&lt;td&gt;1.47GB&lt;/td&gt;
&lt;td&gt;175MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p99 query latency (peak)&lt;/td&gt;
&lt;td&gt;340ms&lt;/td&gt;
&lt;td&gt;95ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Errors under load&lt;/td&gt;
&lt;td&gt;connection limit exceeded&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The latency improvement is not because PgBouncer makes queries faster. It's because without it, queries were queuing for a connection slot. With transaction pooling, a query gets a connection, runs, and returns it immediately - no waiting.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Transaction Pooling Breaks
&lt;/h2&gt;

&lt;p&gt;This is important. Transaction pooling is not a drop-in change if you use any of the following:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Named prepared statements.&lt;/strong&gt; Prepared statements are created on a specific server connection. With transaction pooling, you might get a different connection per transaction, so the prepared statement doesn't exist there.&lt;/p&gt;

&lt;p&gt;Good news for Node.js developers: &lt;code&gt;pg&lt;/code&gt; does NOT use protocol-level prepared statements by default. Standard parameterized queries work fine with PgBouncer in transaction mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// This does NOT use a persistent prepared statement - works fine with PgBouncer&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;SELECT * FROM users WHERE id = $1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

&lt;span class="c1"&gt;// This DOES use a persistent prepared statement (the `name` property) - breaks with PgBouncer&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;get-user-by-id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;SELECT * FROM users WHERE id = $1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The issue only appears if you explicitly pass a &lt;code&gt;name&lt;/code&gt; property in the query object. If you're using standard &lt;code&gt;pool.query(sql, params)&lt;/code&gt; calls, you don't need to change anything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;SET&lt;/code&gt; statements and session-level configuration.&lt;/strong&gt; &lt;code&gt;SET search_path TO tenant_abc&lt;/code&gt; applies to the session, not the transaction. With transaction pooling, the setting evaporates when the transaction ends and the connection goes back to the pool.&lt;/p&gt;

&lt;p&gt;If you're using RLS with &lt;code&gt;set_config('app.organization_id', orgId, true)&lt;/code&gt;, the &lt;code&gt;true&lt;/code&gt; parameter already makes it transaction-scoped, so this works correctly with PgBouncer. Just make sure you're not relying on any session-level state persisting between transactions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advisory locks.&lt;/strong&gt; &lt;code&gt;pg_advisory_lock()&lt;/code&gt; is session-scoped. Use &lt;code&gt;pg_advisory_xact_lock()&lt;/code&gt; instead, which is transaction-scoped and releases automatically on commit/rollback.&lt;/p&gt;
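&lt;p&gt;&lt;code&gt;pg_advisory_xact_lock()&lt;/code&gt; takes an integer key, while application code usually has a string name. One common workaround is hashing the name to a signed 32-bit integer client-side; the FNV-1a hash below is an arbitrary illustrative choice, not what PostgreSQL's own &lt;code&gt;hashtext()&lt;/code&gt; uses:&lt;/p&gt;

```typescript
// Hash a string lock name into a signed 32-bit integer suitable for
// pg_advisory_xact_lock(int4). FNV-1a is one reasonable choice; any stable
// hash works as long as every caller uses the same one.
function advisoryLockKey(name: string): number {
  let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < name.length; i++) {
    hash ^= name.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash | 0; // force into signed int4 range
}

// Usage (inside a transaction, so the lock releases on commit/rollback):
//   await client.query("SELECT pg_advisory_xact_lock($1)", [advisoryLockKey("refresh:tenant_42")]);
console.log(advisoryLockKey("refresh:tenant_42"));
```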

&lt;p&gt;&lt;strong&gt;&lt;code&gt;LISTEN/NOTIFY&lt;/code&gt;.&lt;/strong&gt; Subscriptions are session-scoped. If you're using &lt;code&gt;LISTEN&lt;/code&gt;, you need a dedicated long-lived connection that bypasses PgBouncer - or use a separate direct PostgreSQL connection just for pub/sub.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Direct connection for LISTEN/NOTIFY, bypassing PgBouncer&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;notifyClient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;connectionString&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;DATABASE_DIRECT_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// points to :5432&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;notifyClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;notifyClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;LISTEN log_events&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  PgBouncer on Managed Databases
&lt;/h2&gt;

&lt;p&gt;If you're using RDS, Supabase, Neon, or similar, you often don't need to run PgBouncer yourself.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RDS&lt;/strong&gt;: RDS Proxy is AWS's managed connection pooler. It's PgBouncer-like, works in transaction mode, integrates with IAM authentication. It costs extra ($0.015/vCPU-hour) but removes the operational burden.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Supabase&lt;/strong&gt;: Has a built-in connection pooler called Supavisor (which replaced their PgBouncer setup in 2023) working in transaction mode on port 6543. Use that URL for your application instead of the direct connection string.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Neon&lt;/strong&gt;: Serverless pooling built-in, similar to transaction mode.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PlanetScale&lt;/strong&gt;: MySQL-based, different story entirely.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;If you're using Prisma with any connection pooler in transaction mode&lt;/strong&gt;, you must add &lt;code&gt;?pgbouncer=true&lt;/code&gt; to your database URL - otherwise Prisma's internal prepared statement handling will crash:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Without this flag, Prisma breaks silently with PgBouncer/Supavisor in transaction mode&lt;/span&gt;
&lt;span class="nv"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"postgresql://user:password@pgbouncer:6432/myapp?pgbouncer=true"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This one parameter has saved countless hours of "why is Prisma throwing random errors in production" debugging.&lt;/p&gt;

&lt;p&gt;For self-hosted PostgreSQL, running PgBouncer yourself is the standard approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tuning &lt;code&gt;max_connections&lt;/code&gt; in PostgreSQL
&lt;/h2&gt;

&lt;p&gt;Once PgBouncer is in front, you can lower PostgreSQL's &lt;code&gt;max_connections&lt;/code&gt; to something realistic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- See current value&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;max_connections&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- See current active connections&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_activity&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A reasonable formula for &lt;code&gt;max_connections&lt;/code&gt; when using a pool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="py"&gt;max_connections&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;(pool_size * number_of_pools) + reserved_superuser_connections&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For PgBouncer with &lt;code&gt;default_pool_size = 25&lt;/code&gt; and a few admin connections:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="py"&gt;max_connections&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;25 + 10 (headroom) = 35&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set this in &lt;code&gt;postgresql.conf&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="py"&gt;max_connections&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;35&lt;/span&gt;
&lt;span class="py"&gt;shared_buffers&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;256MB   # ~25% of available RAM&lt;/span&gt;
&lt;span class="py"&gt;work_mem&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;16MB          # per sort/hash operation, per connection&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lowering &lt;code&gt;max_connections&lt;/code&gt; lets PostgreSQL allocate more memory to &lt;code&gt;shared_buffers&lt;/code&gt; and &lt;code&gt;work_mem&lt;/code&gt;, which directly improves query performance. The memory that was being eaten by connection overhead goes back to the query executor.&lt;/p&gt;
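&lt;p&gt;To see why, a back-of-envelope budget using the ~7MB-per-backend figure from earlier in the article (assuming, optimistically, one &lt;code&gt;work_mem&lt;/code&gt; allocation per connection - a single query can allocate it several times):&lt;/p&gt;

```typescript
// Worst-case memory claimable by connections, in MB. Uses the ~7MB
// per-backend estimate from earlier; real numbers vary by workload.
function worstCaseMemMb(
  maxConnections: number,
  backendOverheadMb: number,
  workMemMb: number,
  sharedBuffersMb: number
): number {
  return maxConnections * (backendOverheadMb + workMemMb) + sharedBuffersMb;
}

// 280 direct connections vs 35 behind PgBouncer, with work_mem = 16MB
// and shared_buffers = 256MB:
console.log(worstCaseMemMb(280, 7, 16, 256)); // MB, old setup
console.log(worstCaseMemMb(35, 7, 16, 256)); // MB, pooled setup
```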

&lt;h2&gt;
  
  
  The Checklist
&lt;/h2&gt;

&lt;p&gt;If you're running Node.js with PostgreSQL in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is your pool size per process configured explicitly, or defaulting to 10?&lt;/li&gt;
&lt;li&gt;How many processes/replicas connect to the database? What's the total connection count?&lt;/li&gt;
&lt;li&gt;Are you within 80% of &lt;code&gt;max_connections&lt;/code&gt; at peak?&lt;/li&gt;
&lt;li&gt;Do you have PgBouncer or equivalent in front of PostgreSQL?&lt;/li&gt;
&lt;li&gt;Are you using &lt;code&gt;set_config&lt;/code&gt; for RLS context rather than &lt;code&gt;SET&lt;/code&gt; statements?&lt;/li&gt;
&lt;li&gt;Are you using &lt;code&gt;pg_advisory_xact_lock&lt;/code&gt; instead of &lt;code&gt;pg_advisory_lock&lt;/code&gt;?&lt;/li&gt;
&lt;li&gt;Do you have a dedicated connection for &lt;code&gt;LISTEN/NOTIFY&lt;/code&gt; that bypasses the pool?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Connection exhaustion is one of those problems that hides until traffic spikes, then appears as a cascade of unrelated-looking errors. The fix is not complicated, but it requires understanding what PostgreSQL is actually doing with each connection.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What connection pool setup are you running in production? Any gotchas with PgBouncer that aren't covered here? Comments are open.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>node</category>
      <category>backend</category>
      <category>devops</category>
    </item>
    <item>
      <title>I Ditched Prisma for Raw SQL (And My Queries Got 10x Faster)</title>
      <dc:creator>Polliog</dc:creator>
      <pubDate>Mon, 06 Apr 2026 13:28:19 +0000</pubDate>
      <link>https://forem.com/aws-builders/i-ditched-prisma-for-raw-sql-and-my-queries-got-10x-faster-4gen</link>
      <guid>https://forem.com/aws-builders/i-ditched-prisma-for-raw-sql-and-my-queries-got-10x-faster-4gen</guid>
      <description>&lt;p&gt;Prisma is genuinely good software. The schema DSL is clean, the type generation works well, and for a new project it gets you to a working data layer in an hour. I used it for about a year before I started noticing things.&lt;/p&gt;

&lt;p&gt;The first sign was a query that should have taken 5ms taking 80ms. The second was an N+1 that I'd technically solved with &lt;code&gt;include&lt;/code&gt; but that was still generating 15 SQL statements. The third was reaching for &lt;code&gt;prisma.$queryRaw&lt;/code&gt; for the third time in a week because the query builder couldn't express what I needed.&lt;/p&gt;

&lt;p&gt;At that point I stopped fighting the abstraction and started writing SQL directly.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Prisma Actually Does to Your Queries
&lt;/h2&gt;

&lt;p&gt;This is a simple query with a filter and pagination:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;logs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;prisma&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;logEntry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findMany&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;organizationId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;orgId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;gte&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;lt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;in&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;error&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;fatal&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;orderBy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;desc&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;take&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;skip&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SQL Prisma generates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="nv"&gt;"public"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"log_entries"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;"public"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"log_entries"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;"public"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"log_entries"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"service"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;"public"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"log_entries"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"level"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;"public"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"log_entries"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;"public"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"log_entries"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"metadata"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;"public"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"log_entries"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"organization_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;"public"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"log_entries"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"created_at"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;"public"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"log_entries"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"updated_at"&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;"public"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"log_entries"&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nv"&gt;"public"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"log_entries"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"organization_id"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="nv"&gt;"public"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"log_entries"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"timestamp"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="nv"&gt;"public"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"log_entries"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"timestamp"&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="nv"&gt;"public"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"log_entries"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"level"&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nv"&gt;"public"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"log_entries"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"timestamp"&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt; &lt;span class="k"&gt;OFFSET&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is fine SQL. But notice: it selects every column (including &lt;code&gt;created_at&lt;/code&gt; and &lt;code&gt;updated_at&lt;/code&gt;, which my UI doesn't need), it uses &lt;code&gt;OFFSET&lt;/code&gt; pagination (whose cost grows with page depth, because PostgreSQL still scans and discards every skipped row), and I have no control over any of it without escaping to &lt;code&gt;$queryRaw&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The equivalent raw query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;logs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="s2"&gt;`SELECT id, timestamp, service, level, message, metadata
   FROM log_entries
   WHERE organization_id = $1
     AND timestamp &amp;gt;= $2
     AND timestamp &amp;lt; $3
     AND level = ANY($4)
     AND (timestamp, id) &amp;lt; ($5, $6)
   ORDER BY timestamp DESC, id DESC
   LIMIT $7`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;orgId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;error&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;fatal&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nx"&gt;cursorTs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;cursorId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Keyset pagination instead of OFFSET, only the columns I need, and the query is exactly what I want the database to run.&lt;/p&gt;
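&lt;p&gt;To page through results, the client needs the &lt;code&gt;(timestamp, id)&lt;/code&gt; of the last row it saw. A minimal sketch of opaque cursor helpers for that query (the names are mine, not part of any library):&lt;/p&gt;

```typescript
// The cursor is the (timestamp, id) pair of the last row on the current
// page, base64url-encoded so clients treat it as an opaque token.
interface LogCursor {
  ts: string; // ISO timestamp of the last row on the page
  id: string; // tie-breaker for rows sharing a timestamp
}

function encodeCursor(cursor: LogCursor): string {
  return Buffer.from(JSON.stringify(cursor)).toString("base64url");
}

function decodeCursor(raw: string): LogCursor {
  return JSON.parse(Buffer.from(raw, "base64url").toString("utf8"));
}
```

&lt;p&gt;The decoded &lt;code&gt;ts&lt;/code&gt; and &lt;code&gt;id&lt;/code&gt; become &lt;code&gt;$5&lt;/code&gt; and &lt;code&gt;$6&lt;/code&gt; in the query above; for the first page, omit the row-comparison predicate rather than inventing a sentinel value.&lt;/p&gt;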

&lt;h2&gt;
  
  
  The N+1 Problem Prisma Doesn't Fully Solve
&lt;/h2&gt;

&lt;p&gt;Prisma's &lt;code&gt;include&lt;/code&gt; resolves N+1 queries by using &lt;code&gt;IN&lt;/code&gt; clauses instead of per-row queries. But "no N+1" doesn't mean "one query":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;projects&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;prisma&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;project&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findMany&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;organizationId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;orgId&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;include&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;members&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;apiKeys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;active&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;_count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;select&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;logEntries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Prisma executes this as 4 separate queries: one for projects, one for members, one for apiKeys, one for the count. Then it assembles the result in JavaScript.&lt;/p&gt;

&lt;p&gt;The raw equivalent is one query. The naive approach would be chaining multiple &lt;code&gt;LEFT JOIN&lt;/code&gt; clauses on one-to-many tables and relying on &lt;code&gt;GROUP BY&lt;/code&gt; - but that produces a Cartesian fan-out: if a project has 10 members, 5 API keys, and 100 log entries, the database materializes 10x5x100 = 5,000 intermediate rows per project before collapsing them. &lt;code&gt;COUNT(DISTINCT ...)&lt;/code&gt; hides the bug in the results, but performance collapses as the tables grow.&lt;/p&gt;
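&lt;p&gt;The fan-out is worth quantifying, because it grows multiplicatively, not additively:&lt;/p&gt;

```typescript
// Intermediate rows materialized per project when one-to-many tables
// are joined directly instead of pre-aggregated: the product, not the
// sum, of the related row counts.
function fanOutRows(memberCount: number, keyCount: number, logCount: number): number {
  return memberCount * keyCount * logCount;
}

// 10 members, 5 keys, 100 logs: 5,000 rows to answer for 115 rows of data.
fanOutRows(10, 5, 100); // === 5000
```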

&lt;p&gt;The correct version pre-aggregates each relationship with CTEs before joining:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;member_stats&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;member_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;jsonb_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jsonb_build_object&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'id'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'role'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;role&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;members&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;project_members&lt;/span&gt;
  &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;key_stats&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;active_key_count&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;api_keys&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;active&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
  &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;log_stats&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;log_entry_count&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;log_entries&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'24 hours'&lt;/span&gt;
  &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;member_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;member_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;active_key_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;active_key_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_entry_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;log_entry_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;members&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'[]'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;jsonb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;members&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;projects&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;member_stats&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;key_stats&lt;/span&gt; &lt;span class="n"&gt;ks&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;ks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;log_stats&lt;/span&gt; &lt;span class="n"&gt;ls&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;ls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;organization_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each CTE scans and aggregates its table independently. The final join works on already-collapsed rows - no fan-out, no wasted intermediate rows. One round trip, and actually faster than Prisma's 4 queries at scale.&lt;/p&gt;

&lt;p&gt;Prisma can't generate this query. &lt;code&gt;$queryRaw&lt;/code&gt; can run it, but then you lose the type safety that was the point of using Prisma.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Performance Numbers
&lt;/h2&gt;

&lt;p&gt;Same endpoint, same data, same index configuration. 50k rows in the table.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query type&lt;/th&gt;
&lt;th&gt;p50&lt;/th&gt;
&lt;th&gt;p95&lt;/th&gt;
&lt;th&gt;p99&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prisma &lt;code&gt;findMany&lt;/code&gt; with &lt;code&gt;include&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;45ms&lt;/td&gt;
&lt;td&gt;120ms&lt;/td&gt;
&lt;td&gt;310ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4 separate &lt;code&gt;pg&lt;/code&gt; queries&lt;/td&gt;
&lt;td&gt;18ms&lt;/td&gt;
&lt;td&gt;40ms&lt;/td&gt;
&lt;td&gt;95ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Single JOIN query&lt;/td&gt;
&lt;td&gt;6ms&lt;/td&gt;
&lt;td&gt;14ms&lt;/td&gt;
&lt;td&gt;28ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 10x headline comes from the p99 comparison. At p50 it's closer to 7x. Both are real.&lt;/p&gt;

&lt;p&gt;The Prisma numbers aren't bad in absolute terms for most applications. They become a problem when you're doing this on every request, at scale, with connection pool pressure from concurrent requests.&lt;/p&gt;

&lt;h2&gt;
  
  
  Migrating Without Rewriting Everything
&lt;/h2&gt;

&lt;p&gt;You don't have to replace Prisma everywhere at once. The practical path:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Keep Prisma for writes and simple reads&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Prisma is genuinely good for inserts, updates, and single-record lookups by primary key. The SQL it generates for these is as tight as anything you'd write by hand, and the type safety is useful.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Keep this in Prisma - it's fine&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;prisma&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;organizationId&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;prisma&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;project&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;prisma&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findUnique&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: Replace list queries and anything with joins&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where the overhead compounds. Add a &lt;code&gt;pg&lt;/code&gt; pool alongside Prisma:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Pool&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;pg&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;PrismaClient&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@prisma/client&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;prisma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;PrismaClient&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;pool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Pool&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;connectionString&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;DATABASE_URL&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3: Write a thin query layer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The thing I missed most from Prisma was typed results. TypeScript with raw SQL defaults to &lt;code&gt;any&lt;/code&gt;. Fix it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;queryLogs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;LogQuery&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;LogEntry&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Record&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;unknown&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;`SELECT id, timestamp, service, level, message, metadata
     FROM log_entries
     WHERE organization_id = $1
       AND timestamp &amp;gt;= $2
       AND timestamp &amp;lt; $3
     ORDER BY timestamp DESC
     LIMIT $4`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;orgId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;to&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The generic parameter on &lt;code&gt;pool.query&amp;lt;T&amp;gt;&lt;/code&gt; types the rows. It's an assertion rather than a guarantee - nothing verifies that the SQL actually returns those columns - and it's not as ergonomic as Prisma's generated types, but it's enough to catch most downstream mistakes at compile time.&lt;/p&gt;

&lt;p&gt;If you want SQL-level control with Prisma-level type safety, look into &lt;a href="https://kysely.dev" rel="noopener noreferrer"&gt;Kysely&lt;/a&gt; or &lt;a href="https://orm.drizzle.team" rel="noopener noreferrer"&gt;Drizzle ORM&lt;/a&gt;. Both let you write queries that stay close to SQL while inferring full TypeScript types from your schema - without the ORM magic that makes query optimization hard. Kysely in particular is worth a look if the manual typing in &lt;code&gt;pool.query&amp;lt;T&amp;gt;&lt;/code&gt; feels too brittle.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Actually Lose
&lt;/h2&gt;

&lt;p&gt;This is important to say clearly: there are real things you give up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Schema migrations.&lt;/strong&gt; Prisma Migrate is good. When you drop Prisma from your query layer you still want a migration tool. I use &lt;code&gt;node-pg-migrate&lt;/code&gt;, others use &lt;code&gt;db-migrate&lt;/code&gt; or just raw SQL files in a migrations folder with a simple runner. None of them are as polished as Prisma Migrate.&lt;/p&gt;
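
&lt;p&gt;The "raw SQL files in a migrations folder" approach needs surprisingly little code. The core decision - which files still need to run - is a pure function; here's a sketch (the file naming convention is an assumption, not part of any particular tool):&lt;/p&gt;

```typescript
// Which migration files still need to run? Compare the files on disk
// against the names already recorded in a migrations table.
// Prefixing each file with a sortable timestamp (e.g. 20260101_add_logs.sql)
// keeps the execution order stable.
function pendingMigrations(files: string[], applied: string[]): string[] {
  return files
    .filter((f) => f.endsWith(".sql") && !applied.includes(f))
    .sort();
}
```

The runner itself is then a loop: read the folder, query the migrations table, run each pending file in a transaction, and record it.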

&lt;p&gt;&lt;strong&gt;The schema as source of truth.&lt;/strong&gt; Prisma's schema file makes it easy to see your data model at a glance and generates types from it. With raw SQL you're maintaining types manually or generating them from the database schema with something like &lt;code&gt;pgtyped&lt;/code&gt; or &lt;code&gt;zapatos&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prisma Studio.&lt;/strong&gt; Minor thing but worth mentioning - having a UI to browse your data is useful during development.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Onboarding speed.&lt;/strong&gt; New developers on a project with raw SQL need to know SQL. This is not a bad thing, but it's a real cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Keep Prisma
&lt;/h2&gt;

&lt;p&gt;Prisma is the right choice when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your team isn't comfortable with SQL&lt;/li&gt;
&lt;li&gt;You're building a CRUD app where the Prisma query builder covers 90%+ of your needs&lt;/li&gt;
&lt;li&gt;You're early stage and query performance isn't a bottleneck yet&lt;/li&gt;
&lt;li&gt;The productivity gain from the DX outweighs the performance cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It stops being the right choice when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your most important queries can't be expressed through the query builder&lt;/li&gt;
&lt;li&gt;You're regularly escaping to &lt;code&gt;$queryRaw&lt;/code&gt; for anything beyond simple lookups&lt;/li&gt;
&lt;li&gt;Query times are a meaningful part of your latency budget&lt;/li&gt;
&lt;li&gt;You need fine-grained control over indexes, hints, or query plans&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The answer for most production systems that have been running for more than a year is: use both. Prisma for the simple stuff, raw SQL for the queries that matter.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What ORM or query approach are you using in production? Anything that changed your mind in either direction? Comments are open.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>node</category>
      <category>typescript</category>
      <category>database</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Your API Responses Are 40x Larger Than They Need to Be</title>
      <dc:creator>Polliog</dc:creator>
      <pubDate>Sat, 04 Apr 2026 10:43:14 +0000</pubDate>
      <link>https://forem.com/polliog/your-api-responses-are-40x-larger-than-they-need-to-be-5p4</link>
      <guid>https://forem.com/polliog/your-api-responses-are-40x-larger-than-they-need-to-be-5p4</guid>
      <description>&lt;p&gt;I was profiling a production API last year when I noticed something that should have been obvious from the start: the response body for a simple list endpoint was 2.4MB. The actual useful data? About 60KB.&lt;/p&gt;

&lt;p&gt;The rest was a mix of unused fields, redundant nesting, and no compression. It had been that way since day one, and nobody had noticed because on localhost it's fast enough that it doesn't register.&lt;/p&gt;

&lt;p&gt;This is not a rare situation. It's the default.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Ways APIs Bloat Their Responses
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. No Compression
&lt;/h3&gt;

&lt;p&gt;This is the easiest win and the most commonly skipped.&lt;/p&gt;

&lt;p&gt;HTTP has supported gzip compression since 1999. Brotli has been supported in every major browser since 2017. Most API frameworks still don't enable either by default.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check if your API compresses responses&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; /dev/null &lt;span class="nt"&gt;-w&lt;/span&gt; &lt;span class="s2"&gt;"%{size_download}"&lt;/span&gt; https://api.example.com/users
curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; /dev/null &lt;span class="nt"&gt;-w&lt;/span&gt; &lt;span class="s2"&gt;"%{size_download}"&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept-Encoding: gzip"&lt;/span&gt; https://api.example.com/users
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If both numbers are the same, you're not compressing.&lt;/p&gt;

&lt;p&gt;The fix in Node.js with Fastify:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;fastifyCompress&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@fastify/compress&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fastifyCompress&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;global&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;encodings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;br&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gzip&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;deflate&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// don't compress responses under 1KB&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And with Express:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;compression&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;compression&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;compression&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;x-no-compression&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;compression&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Real numbers from a list endpoint returning 500 records:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No compression&lt;/td&gt;
&lt;td&gt;2.4MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gzip&lt;/td&gt;
&lt;td&gt;280KB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;brotli&lt;/td&gt;
&lt;td&gt;210KB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's an 8-11x reduction with zero changes to your data model, zero changes to clients, and about 10 lines of code.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Handling compression in Node.js works well for most setups. At large scale, offloading it to your reverse proxy (NGINX) or CDN (Cloudflare) saves CPU cycles since Node.js is single-threaded and compression is CPU-intensive. If you're already behind a proxy, check whether it's compressing for you before adding it in Node.js too.&lt;/p&gt;
&lt;/blockquote&gt;
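
&lt;p&gt;If you go the proxy route, the NGINX side is only a few directives. A sketch - note that &lt;code&gt;brotli&lt;/code&gt; requires the separate &lt;code&gt;ngx_brotli&lt;/code&gt; module, so it's commented out here:&lt;/p&gt;

```nginx
# Compress JSON responses at the proxy instead of in Node.js.
gzip on;
gzip_comp_level 6;
gzip_min_length 1024;          # skip tiny responses, same idea as threshold
gzip_types application/json;   # text/html is compressed by default

# Brotli needs the ngx_brotli module compiled in:
# brotli on;
# brotli_types application/json;
```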

&lt;h3&gt;
  
  
  2. Over-fetching
&lt;/h3&gt;

&lt;p&gt;Every ORM makes it trivial to return entire database rows. Most codebases do exactly that.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// This returns every column in the users table&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;users&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;SELECT * FROM users&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Including: password_hash, internal_flags, created_at, updated_at,&lt;/span&gt;
&lt;span class="c1"&gt;// deleted_at, last_ip, raw_oauth_payload, internal_notes...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fix is not complicated - it just requires being deliberate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Return only what the client actually needs&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;users&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`
  SELECT id, name, email, avatar_url, role
  FROM users
  WHERE organization_id = $1
  ORDER BY created_at DESC
`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;orgId&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're using an ORM like Prisma, use &lt;code&gt;select&lt;/code&gt; explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;users&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;prisma&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findMany&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;organizationId&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;select&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;avatarUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The temptation to use &lt;code&gt;SELECT *&lt;/code&gt; or skip the &lt;code&gt;select&lt;/code&gt; clause is real because it saves two minutes of typing. The cost is paid on every request, by every client, forever.&lt;/p&gt;
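
&lt;p&gt;A thin serializer at the API boundary makes the whitelist explicit in one place: new columns stay out of responses until someone adds them here on purpose. A sketch - the &lt;code&gt;UserRow&lt;/code&gt; shape is an assumption mirroring the query above:&lt;/p&gt;

```typescript
// Whatever the row contains, only these five fields ever leave the API.
interface UserRow {
  id: string;
  name: string;
  email: string;
  avatar_url: string;
  role: string;
  [extra: string]: unknown; // password_hash, internal_flags, ...
}

function toPublicUser(row: UserRow) {
  return {
    id: row.id,
    name: row.name,
    email: row.email,
    avatarUrl: row.avatar_url,
    role: row.role,
  };
}
```

This also survives schema drift: adding a sensitive column to the table doesn't silently add it to every response.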

&lt;h3&gt;
  
  
  3. Redundant Nesting and Metadata
&lt;/h3&gt;

&lt;p&gt;This one is subtler. It's the pattern where every response wraps data in a consistent envelope, which is fine, but the envelope carries metadata that nobody uses.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"success"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"OK"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-04-02T10:00:00Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"requestId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"abc-123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"v1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"items"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"meta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"total"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1250&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"page"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"perPage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"lastPage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;63&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"from"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"to"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"currentPage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"hasNextPage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"hasPrevPage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;success&lt;/code&gt;, &lt;code&gt;status&lt;/code&gt;, &lt;code&gt;message&lt;/code&gt;, and &lt;code&gt;version&lt;/code&gt; are duplicating what HTTP already tells the client. &lt;code&gt;currentPage&lt;/code&gt; is &lt;code&gt;page&lt;/code&gt; renamed. &lt;code&gt;from&lt;/code&gt; and &lt;code&gt;to&lt;/code&gt; are derivable from &lt;code&gt;page&lt;/code&gt; and &lt;code&gt;perPage&lt;/code&gt;. &lt;code&gt;hasPrevPage&lt;/code&gt; is &lt;code&gt;page &amp;gt; 1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A cleaner version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"pagination"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"total"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1250&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"page"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"perPage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"hasNext"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Less noise, same information, smaller payload.&lt;/p&gt;
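
&lt;p&gt;Everything the fat envelope carried is recoverable client-side from those three numbers. A sketch of the derivations (field names follow the example above):&lt;/p&gt;

```typescript
// The dropped envelope fields are all one-liners over total/page/perPage.
interface Pagination {
  total: number;
  page: number;
  perPage: number;
}

function derivedFields(p: Pagination) {
  return {
    lastPage: Math.ceil(p.total / p.perPage),
    from: (p.page - 1) * p.perPage + 1,
    to: Math.min(p.page * p.perPage, p.total),
    hasPrevPage: p.page > 1,
  };
}
```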

&lt;h2&gt;
  
  
  Keyset Pagination vs. Offset Pagination
&lt;/h2&gt;

&lt;p&gt;While we're talking about list endpoints, there's a related issue worth covering.&lt;/p&gt;

&lt;p&gt;Offset pagination (&lt;code&gt;LIMIT 20 OFFSET 200&lt;/code&gt;) forces the database to read and discard every skipped row, so it gets slower the deeper you page. The page-numbered UI it usually serves also needs an expensive &lt;code&gt;COUNT(*)&lt;/code&gt; for the total.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- This scans and counts the entire table&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;organization_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- On a table with 50M rows this can take 2-5 seconds&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Keyset pagination avoids both problems:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Instead of page/offset, use the last seen ID.&lt;/span&gt;
&lt;span class="c1"&gt;// Request one extra row to check if there's a next page.&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;logs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`
  SELECT id, timestamp, level, message
  FROM logs
  WHERE organization_id = $1
    AND id &amp;lt; $2
  ORDER BY id DESC
  LIMIT $3
`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;orgId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;lastSeenId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;pageSize&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

&lt;span class="c1"&gt;// If we got more than pageSize rows, there's a next page.&lt;/span&gt;
&lt;span class="c1"&gt;// Trim the extra row before sending.&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;hasNext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;pageSize&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;pageSize&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You lose the ability to jump to arbitrary pages, which matters for some UIs but not for most. You gain: fast queries at any depth, no COUNT(*) needed, and stable pagination even when rows are being inserted concurrently.&lt;/p&gt;
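
&lt;p&gt;One practical detail: keyset cursors are usually handed to clients as an opaque token, so the server can change the underlying key later without breaking anyone. A minimal sketch - the &lt;code&gt;{ id }&lt;/code&gt; cursor shape is an assumption:&lt;/p&gt;

```typescript
// Opaque keyset cursor: the client echoes the token from the previous
// response; the server decodes it back into the last seen ID.
function encodeCursor(lastSeenId: number): string {
  return Buffer.from(JSON.stringify({ id: lastSeenId })).toString("base64url");
}

function decodeCursor(token: string): number {
  return JSON.parse(Buffer.from(token, "base64url").toString("utf8")).id;
}
```

If you later switch the sort key to &lt;code&gt;(timestamp, id)&lt;/code&gt;, only the cursor payload changes - the client-visible contract stays "pass the token back".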

&lt;h2&gt;
  
  
  Conditional Responses with ETags
&lt;/h2&gt;

&lt;p&gt;If your data doesn't change between requests, the client shouldn't have to download it again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;createHash&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;crypto&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// In your route handler&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetchData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;orgId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// If your data has an updated_at field, use that instead of hashing&lt;/span&gt;
&lt;span class="c1"&gt;// the full payload - much cheaper on large responses.&lt;/span&gt;
&lt;span class="c1"&gt;// const etag = createHash("md5").update(`${data.id}:${data.updatedAt}`).digest("hex");&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;etag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;createHash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;md5&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;hex&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;clientEtag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;if-none-match&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;clientEtag&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;etag&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;304&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// Not Modified - zero payload&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;res&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;header&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ETag&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;etag&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;header&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Cache-Control&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;private, max-age=0, must-revalidate&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One caveat: hashing the full &lt;code&gt;JSON.stringify(data)&lt;/code&gt; on every request is expensive if your payload is large. If the data has an &lt;code&gt;updated_at&lt;/code&gt; timestamp in the database, derive the ETag from that instead - &lt;code&gt;hash(id + updated_at)&lt;/code&gt; costs the same regardless of payload size and avoids blocking the event loop.&lt;/p&gt;
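
&lt;p&gt;The &lt;code&gt;updated_at&lt;/code&gt; variant is a two-line helper; this is a sketch, and the parameter names are mine:&lt;/p&gt;

```typescript
import { createHash } from "node:crypto";

// ETag derived from identity + last modification time: the cost is
// independent of how large the serialized payload is.
function etagFor(id: string, updatedAt: string): string {
  return createHash("md5").update(`${id}:${updatedAt}`).digest("hex");
}
```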

&lt;p&gt;For list endpoints where data changes frequently, this won't help much. For configuration endpoints, user profile data, or static reference data, a 304 response is massively cheaper than resending the same payload on every poll.&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting It Together
&lt;/h2&gt;

&lt;p&gt;The combined impact on that 2.4MB endpoint:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Reduction&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Baseline&lt;/td&gt;
&lt;td&gt;2.4MB&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+ brotli compression&lt;/td&gt;
&lt;td&gt;210KB&lt;/td&gt;
&lt;td&gt;91%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+ select only needed fields&lt;/td&gt;
&lt;td&gt;58KB&lt;/td&gt;
&lt;td&gt;97.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+ clean response envelope&lt;/td&gt;
&lt;td&gt;55KB&lt;/td&gt;
&lt;td&gt;97.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+ ETag (when data unchanged)&lt;/td&gt;
&lt;td&gt;0KB&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 40x number in the title is real. Most of it comes from compression alone - the rest is just discipline.&lt;/p&gt;

&lt;p&gt;None of these changes require a rewrite. They don't break existing clients. They're additive. The only cost is a bit of attention to defaults that most frameworks don't set correctly out of the box.&lt;/p&gt;

&lt;p&gt;Start with compression. It takes 10 minutes and the numbers will surprise you.&lt;/p&gt;
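&lt;p&gt;For a sense of what those 10 minutes buy you, here's a minimal sketch using Node's built-in brotli support - a real setup would negotiate via &lt;code&gt;Accept-Encoding&lt;/code&gt; and stream rather than buffer, or simply enable compression middleware or the reverse proxy's compression:&lt;/p&gt;

```typescript
import { createServer } from "node:http";
import { brotliCompressSync, constants } from "node:zlib";

// Compress a JSON payload with brotli. Quality 4-6 is a reasonable
// speed/ratio trade-off for dynamic responses (11 is for static assets).
function compressJson(payload: unknown): Buffer {
  const raw = Buffer.from(JSON.stringify(payload));
  return brotliCompressSync(raw, {
    params: { [constants.BROTLI_PARAM_QUALITY]: 4 },
  });
}

const server = createServer((_req, res) => {
  const body = compressJson({ items: new Array(1000).fill("hello world") });
  res.writeHead(200, {
    "Content-Type": "application/json",
    "Content-Encoding": "br", // only valid if the client sent Accept-Encoding: br
  });
  res.end(body);
});
// server.listen(3000);
```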




&lt;p&gt;&lt;em&gt;Found something I got wrong, or a pattern that works better for your stack? Drop it in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>performance</category>
      <category>api</category>
      <category>node</category>
    </item>
    <item>
      <title>I Tested PAIO Bot's New Security Layer for AI Agents — Here's the Honest Take</title>
      <dc:creator>Polliog</dc:creator>
      <pubDate>Thu, 02 Apr 2026 12:10:51 +0000</pubDate>
      <link>https://forem.com/polliog/i-tested-the-new-security-layer-for-local-ai-agents-heres-the-honest-take-5d47</link>
      <guid>https://forem.com/polliog/i-tested-the-new-security-layer-for-local-ai-agents-heres-the-honest-take-5d47</guid>
      <description>&lt;h2&gt;
  
  
  The problem: OpenClaw's localhost exposure is a real risk
&lt;/h2&gt;

&lt;p&gt;If you've been running local AI agents with &lt;strong&gt;OpenClaw&lt;/strong&gt;, there's a good chance your setup is more exposed than you think. Researchers recently found over &lt;strong&gt;135,000 OpenClaw instances publicly reachable online&lt;/strong&gt; - many of them with no authentication, open to prompt injection, API key theft, and arbitrary command execution.&lt;/p&gt;

&lt;p&gt;That's the problem &lt;strong&gt;PAIO&lt;/strong&gt; (Personal AI Operator) is trying to solve. Backed by PureVPN's 17 years of network security infrastructure, it positions itself as a drop-in security and optimization layer for OpenClaw-based agents. I was given Pro access to test it ahead of launch, and this is my honest, hands-on assessment.&lt;/p&gt;

&lt;p&gt;An exposed OpenClaw endpoint can let an attacker:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inject malicious prompts into your agent's context&lt;/li&gt;
&lt;li&gt;Read or exfiltrate your system prompt and conversation history&lt;/li&gt;
&lt;li&gt;Abuse your API keys for their own usage&lt;/li&gt;
&lt;li&gt;Execute tools and actions your agent has access to&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't theoretical. The 135,000 figure comes from Shodan-style scanning of known OpenClaw ports. If you've ever used &lt;code&gt;--host 0.0.0.0&lt;/code&gt; anywhere in your agent config, you've probably been in that list at some point.&lt;/p&gt;
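&lt;p&gt;The binding-side fix is generic Node, not PAIO-specific. A sketch (the port is arbitrary):&lt;/p&gt;

```typescript
import { createServer } from "node:http";

// "0.0.0.0" accepts connections from every network interface;
// "127.0.0.1" accepts connections from this machine only.
const server = createServer((_req, res) => {
  res.end("ok");
});

// Reachable only locally:
server.listen(8787, "127.0.0.1");

// The dangerous variant that lands instances in Shodan-style scans:
// server.listen(8787, "0.0.0.0");
```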




&lt;h2&gt;
  
  
  What PAIO actually does
&lt;/h2&gt;

&lt;p&gt;PAIO sits between your agent and the outside world. Instead of your OpenClaw instance binding directly to a network interface, PAIO proxies and controls that connection — sanitizing inputs, managing authentication, and exposing a controlled WebSocket endpoint that you can share safely.&lt;/p&gt;

&lt;p&gt;Once set up, your agent becomes accessible via a unique WSS endpoint like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wss://app.paio.bot/f73bb772-aaaa-aaaa-8b0f-a605aaaac/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or via an in-browser chat UI hosted on their platform. Your localhost is never directly exposed. That's the core value proposition, and it's architecturally sound.&lt;/p&gt;

&lt;p&gt;Beyond security, PAIO also adds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Token optimization&lt;/strong&gt; — context window and system prompt compression to reduce API costs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A simplified dashboard&lt;/strong&gt; — sessions, skills, and agent configuration in a cleaner UI than vanilla OpenClaw&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mac agent with browser relay&lt;/strong&gt; — lets the agent perform tasks like bookings and research in the background&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-provider AI support&lt;/strong&gt; — OpenAI, Anthropic, and others&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Setup: honest timing
&lt;/h2&gt;

&lt;p&gt;The marketing says "60-second deployment." In my experience, the full process from the landing page took closer to &lt;strong&gt;5–6 minutes&lt;/strong&gt; — though to be fair, PAIO measures their benchmark from first successful prompt, not from the landing page. They're also actively optimizing the provisioning pipeline with an internal target of under 60 seconds end-to-end. Fast either way, but worth knowing what to expect.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Important - no AI included:&lt;/strong&gt; PAIO does not bundle AI credits. Every plan requires you to either bring your own API key or purchase their credit packages separately. Factor this into your cost model before signing up.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once past setup, the dashboard is noticeably cleaner than OpenClaw's native interface. You get session management, skill configuration, and connection status in a single view. For teams or developers who find OpenClaw's UI overwhelming, this alone might justify the tool.&lt;/p&gt;




&lt;h2&gt;
  
  
  Token optimization: the claim vs. reality
&lt;/h2&gt;

&lt;p&gt;PAIO advertises up to &lt;strong&gt;50% token reduction&lt;/strong&gt; through aggressive context window and system prompt optimization. This is one of those claims that's highly dependent on your specific use case - the gains are real, but whether you hit 50% depends on how bloated your prompts are to begin with.&lt;/p&gt;

&lt;p&gt;In practice, if you're running agents with long system prompts, large tool schemas, or verbose context injection, you'll see meaningful savings. If your setup is already lean, the gains will be modest. The tool doesn't magically compress arbitrary LLM output — it compresses the &lt;em&gt;input&lt;/em&gt; side: context, system prompts, and tool definitions. Worth noting: a major token optimization patch was pushed to production shortly after launch, improving multi-step context pruning and pushing savings beyond 60% in their internal benchmarks. I haven't re-tested post-patch, but it's worth evaluating with your own workload.&lt;/p&gt;




&lt;h2&gt;
  
  
  Complexity: the honest critique
&lt;/h2&gt;

&lt;p&gt;Here's the thing: PAIO inherits OpenClaw's complexity and adds its own layer on top. If you're already comfortable with OpenClaw, the additional concepts (WSS endpoints, skills, session routing) are manageable. If you're newer to local agent infrastructure, this is not a beginner tool.&lt;/p&gt;

&lt;p&gt;The dashboard simplifies some things, but the underlying mental model - local agent + proxy layer + AI provider + browser relay - is still a lot to hold in your head. I'd love to see a more opinionated "just works" mode for simpler use cases.&lt;/p&gt;




&lt;h2&gt;
  
  
  Verdict
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What works ✅&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Genuine security improvement over raw OpenClaw&lt;/li&gt;
&lt;li&gt;Cleaner dashboard UX&lt;/li&gt;
&lt;li&gt;WSS endpoint approach is the right architecture&lt;/li&gt;
&lt;li&gt;Token optimization is real (if your prompts are verbose)&lt;/li&gt;
&lt;li&gt;Multi-provider AI support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What to watch ⚠️&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Setup is 5–6 min, not 60 sec as advertised&lt;/li&gt;
&lt;li&gt;No AI included — always BYOK or pay for credits&lt;/li&gt;
&lt;li&gt;Still complex for non-OpenClaw users&lt;/li&gt;
&lt;li&gt;Token savings vary widely by use case&lt;/li&gt;
&lt;li&gt;Mac-first; other platforms TBD&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you're running OpenClaw agents in any environment that's even partially network-accessible, PAIO is worth serious consideration. The localhost exposure problem is real and underappreciated, and PAIO's proxy approach is a legitimate fix. The token optimization is a nice bonus rather than the main draw.&lt;/p&gt;

&lt;p&gt;If you're on a tight budget, factor in that you'll always need AI credits on top of any PAIO plan. Run the numbers for your usage volume before committing.&lt;/p&gt;

&lt;p&gt;You can get started at &lt;a href="https://www.paio.bot" rel="noopener noreferrer"&gt;paio.bot&lt;/a&gt; - the free tier lets you evaluate the setup flow before committing to a paid plan.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was produced in partnership with PAIO. Testing was conducted independently with Pro plan access provided by the PAIO team.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>agents</category>
      <category>webdev</category>
    </item>
    <item>
      <title>I Removed Redis From My Stack and Used PostgreSQL for Job Queues Instead</title>
      <dc:creator>Polliog</dc:creator>
      <pubDate>Tue, 17 Mar 2026 11:17:17 +0000</pubDate>
      <link>https://forem.com/aws-builders/i-removed-redis-from-my-stack-and-used-postgresql-for-job-queues-instead-2lp5</link>
      <guid>https://forem.com/aws-builders/i-removed-redis-from-my-stack-and-used-postgresql-for-job-queues-instead-2lp5</guid>
      <description>&lt;p&gt;Every Node.js project eventually needs background jobs. Send this email. Process this file. Run this alert evaluation at midnight. The default answer in the ecosystem is Redis + BullMQ. It's fast, battle-tested, and has a great API.&lt;/p&gt;

&lt;p&gt;It also means running Redis.&lt;/p&gt;

&lt;p&gt;For projects already running PostgreSQL, that's a second database to provision, monitor, back up, and pay for. On AWS, an ElastiCache instance starts at ~$15/month for the smallest node - not catastrophic, but not nothing either. More importantly, it's another moving part that can fail.&lt;/p&gt;

&lt;p&gt;I recently shipped a Redis-free deployment mode for an open-source project I maintain. The job queue runs entirely on PostgreSQL using &lt;a href="https://worker.graphile.org/" rel="noopener noreferrer"&gt;graphile-worker&lt;/a&gt;. Here's everything I learned from the experience: what graphile-worker does well, where it has real limits, and when you should just keep Redis.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem With "Just Add Redis"
&lt;/h2&gt;

&lt;p&gt;Before getting into the comparison, it's worth being honest about what the Redis dependency actually costs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operationally&lt;/strong&gt;, Redis is simple to run but adds surface area. Every additional service is another thing that can go down, run out of memory, or need a version upgrade. In Docker Compose deployments (which is how most self-hosted tools get deployed), it's another container, another volume, another health check.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On AWS&lt;/strong&gt;, the options are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ElastiCache (managed, ~$15-50/month for a usable instance)&lt;/li&gt;
&lt;li&gt;Redis on EC2 (self-managed, cheaper but more work)&lt;/li&gt;
&lt;li&gt;Redis on the same instance as your app (fine for dev, risky in prod)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For multi-instance scaling&lt;/strong&gt;, Redis becomes mandatory - you can't share BullMQ queues across processes without it. But for a single-instance deployment, you're paying the Redis tax without getting the multi-instance benefit.&lt;/p&gt;

&lt;p&gt;The question I asked: &lt;em&gt;if I'm already running PostgreSQL and my job volume doesn't justify a dedicated queue broker, what do I lose by using Postgres as the queue?&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  How graphile-worker Works
&lt;/h2&gt;

&lt;p&gt;graphile-worker stores jobs in a PostgreSQL table and uses &lt;code&gt;SELECT ... FOR UPDATE SKIP LOCKED&lt;/code&gt; to claim them. That one clause is the key insight: it's an atomic operation that lets multiple workers poll the same table concurrently without contention.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;queue_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_identifier&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_at&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;graphile_worker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;run_at&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;locked_by&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;attempts&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;max_attempts&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;priority&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_at&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;
&lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;SKIP&lt;/span&gt; &lt;span class="n"&gt;LOCKED&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a worker claims a job, it locks the row. If the worker crashes, the lock is released automatically when the connection drops. No dead letter queue configuration needed - just &lt;code&gt;max_attempts&lt;/code&gt; and exponential backoff.&lt;/p&gt;

&lt;p&gt;Job results are kept in the same table. Completed jobs get deleted (or archived, if you configure it). Failed jobs increment their attempt counter and get rescheduled with backoff.&lt;/p&gt;
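&lt;p&gt;A hypothetical sketch of that retry bookkeeping - illustrative only, not graphile-worker's exact formula:&lt;/p&gt;

```typescript
interface Job {
  attempts: number;
  maxAttempts: number;
  runAt: Date;
  lastError: string | null;
}

// On failure: bump the counter and push run_at forward with exponential
// backoff; once attempts reaches maxAttempts the row stays put, failed,
// where you can inspect it with plain SQL.
function markFailed(job: Job, error: string, now: Date): Job {
  const attempts = job.attempts + 1;
  const delaySeconds = Math.min(2 ** attempts, 1024); // cap at ~17 min
  const exhausted = attempts >= job.maxAttempts;
  return {
    ...job,
    attempts,
    lastError: error,
    runAt: exhausted
      ? job.runAt // exhausted: no reschedule
      : new Date(now.getTime() + delaySeconds * 1000),
  };
}
```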

&lt;p&gt;It's genuinely elegant. The entire queue state lives in a place you already know how to query, back up, and monitor.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setting It Up in a Node.js/TypeScript Project
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;run&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;makeWorkerUtils&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;graphile-worker&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="c1"&gt;// Register task handlers&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;runner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;connectionString&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;concurrency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;taskList&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;send_email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;helpers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;body&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;EmailPayload&lt;/span&gt;
      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;sendEmail&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;body&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;evaluate_alert&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;helpers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;alertId&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;AlertPayload&lt;/span&gt;
      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;evaluateAlert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;alertId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;generate_report&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;helpers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;reportId&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;ReportPayload&lt;/span&gt;
      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;generateReport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;reportId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;// Add jobs from anywhere in your app&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;utils&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;makeWorkerUtils&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;connectionString&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;// One-off job&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;utils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addJob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;send_email&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user@example.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Your report is ready&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;// Scheduled job (run at specific time)&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;utils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addJob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;generate_report&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;reportId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;123&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;runAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;2026-03-15T09:00:00Z&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;// Recurring job (cron syntax)&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;utils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addJob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;evaluate_alert&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;alertId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;456&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;jobKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;alert-456-eval&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;jobKeyMode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;replace&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;runAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;cronNextRun&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;*/5 * * * *&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="c1"&gt;// every 5 minutes&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The API is intentionally minimal. No queue configuration, no connection pooling setup, no separate Redis client. You point it at your existing PostgreSQL connection string and start adding jobs.&lt;/p&gt;




&lt;h2&gt;
  
  
  BullMQ vs graphile-worker: The Real Comparison
&lt;/h2&gt;

&lt;p&gt;Let me be direct about where each one wins.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where graphile-worker wins
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Zero additional infrastructure.&lt;/strong&gt; If you're already on RDS PostgreSQL or Aurora, graphile-worker is free. No ElastiCache, no Redis on EC2, no second managed service to babysit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Full SQL visibility.&lt;/strong&gt; Your jobs are rows in a table. You can query them, join them against other tables, build admin UIs with a SELECT, and debug failures with psql. Compare this to inspecting BullMQ queues via the Bull Board UI or raw Redis commands.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Find all failed jobs in the last hour&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;task_identifier&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;last_error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attempts&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;graphile_worker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;last_error&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;updated_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'1 hour'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;updated_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Count pending jobs by type&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;task_identifier&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pending&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;graphile_worker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;locked_by&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;task_identifier&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Transactional job enqueueing.&lt;/strong&gt; This is the killer feature that BullMQ can't match. You can enqueue a job inside a database transaction, guaranteeing it only gets scheduled if the transaction commits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transaction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Create the user&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insertInto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;users&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userData&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;returningAll&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;executeTakeFirstOrThrow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

  &lt;span class="c1"&gt;// Enqueue welcome email — only runs if user creation succeeds&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;executeQuery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="s2"&gt;`SELECT graphile_worker.add_job('send_welcome_email', &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="p"&gt;})}&lt;/span&gt;&lt;span class="s2"&gt;)`&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With BullMQ, you'd add the job after the transaction commits - and if your process crashes between the commit and the &lt;code&gt;queue.add()&lt;/code&gt; call, the job never gets scheduled. Not a common failure mode, but a real one. To achieve this guarantee with BullMQ, you'd have to implement the Transactional Outbox pattern: writing the job to a database table first, then running a separate relay worker to move it to Redis. graphile-worker gives you this for free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operational simplicity for single-instance deployments.&lt;/strong&gt; One less service to configure in Docker Compose, one less thing to include in your backup strategy, one less connection string to manage in environment variables.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where BullMQ wins
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Throughput.&lt;/strong&gt; Redis is an in-memory data structure store purpose-built for this. BullMQ can process thousands of jobs per second. graphile-worker tops out around 100-200 jobs/second on typical PostgreSQL hardware before you start hitting lock contention. For most applications this is irrelevant. For high-volume pipelines (image processing, webhook delivery at scale, bulk email), it matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advanced queue features.&lt;/strong&gt; BullMQ has rate limiting, job priorities with fine-grained control, delayed jobs with millisecond precision, parent-child job dependencies, and repeatable jobs with complex scheduling. graphile-worker has most of these, but BullMQ's implementation is more complete and battle-hardened.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-time job events.&lt;/strong&gt; BullMQ emits events (completed, failed, progress) via Redis pub/sub. You can build live job monitoring dashboards easily. With graphile-worker, you'd poll the jobs table.&lt;/p&gt;
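&lt;p&gt;A sketch of the polling alternative - &lt;code&gt;runQuery&lt;/code&gt; here is a stand-in for your pg client's query method (an assumption, not graphile-worker API):&lt;/p&gt;

```typescript
// Count pending jobs per task type; a dashboard would call this on an
// interval instead of subscribing to events.
async function pendingByType(runQuery: any) {
  const rows = await runQuery(
    "SELECT task_identifier, COUNT(*)::int AS pending " +
      "FROM graphile_worker.jobs " +
      "WHERE locked_by IS NULL " +
      "GROUP BY task_identifier"
  );
  const counts: { [task: string]: number } = {};
  for (const row of rows) {
    counts[row.task_identifier] = row.pending;
  }
  return counts;
}

// Poll every few seconds and push the result to the UI:
// setInterval(async () => render(await pendingByType(query)), 5000);
```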

&lt;p&gt;&lt;strong&gt;Multi-instance horizontal scaling.&lt;/strong&gt; BullMQ was designed from the ground up for multiple workers across multiple processes/machines, all sharing the same Redis. graphile-worker supports this too (multiple workers polling the same PostgreSQL), but the throughput ceiling is lower.&lt;/p&gt;

&lt;h3&gt;
  
  
  The honest performance numbers
&lt;/h3&gt;

&lt;p&gt;On commodity hardware (the same AMD Ryzen 5 3600 from the benchmark article):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;BullMQ&lt;/th&gt;
&lt;th&gt;graphile-worker&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Job enqueue rate&lt;/td&gt;
&lt;td&gt;~5,000/s&lt;/td&gt;
&lt;td&gt;~500/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Job processing throughput (simple tasks)&lt;/td&gt;
&lt;td&gt;~2,000/s&lt;/td&gt;
&lt;td&gt;~100-200/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Job processing throughput (I/O bound tasks)&lt;/td&gt;
&lt;td&gt;~500/s&lt;/td&gt;
&lt;td&gt;~100/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency from enqueue to pickup&lt;/td&gt;
&lt;td&gt;&amp;lt;10ms&lt;/td&gt;
&lt;td&gt;&amp;lt;10ms (LISTEN/NOTIFY), 2s max (poll fallback)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;graphile-worker polls for new jobs at a configurable interval (default: every 2 seconds), with LISTEN/NOTIFY layered on top for near-immediate pickup. For most background job use cases — sending emails, generating reports, running scheduled checks — even the two-second worst case is completely acceptable. For near-real-time processing where job pickup latency matters, BullMQ wins.&lt;/p&gt;
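&lt;p&gt;If the worst-case pickup latency matters but you want to stay Redis-free, the interval is tunable. A minimal sketch of the options you'd pass to graphile-worker's &lt;code&gt;run()&lt;/code&gt; (option names per its RunnerOptions; verify against your installed version):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Shown as a standalone object so the trade-off is visible: a lower
// pollInterval means lower worst-case pickup latency, but more idle
// queries against PostgreSQL.
const runnerOptions = {
  connectionString: process.env.DATABASE_URL,
  concurrency: 5,     // jobs processed in parallel per worker process
  pollInterval: 500,  // ms; LISTEN/NOTIFY still gives near-immediate pickup
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;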




&lt;h2&gt;
  
  
  The AWS Decision Framework
&lt;/h2&gt;

&lt;p&gt;This is where the choice becomes concrete.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use graphile-worker when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're already on RDS PostgreSQL or Aurora&lt;/li&gt;
&lt;li&gt;Your job volume is under ~100 jobs/second&lt;/li&gt;
&lt;li&gt;You have a single-instance deployment or modest horizontal scale&lt;/li&gt;
&lt;li&gt;You want transactional job enqueueing&lt;/li&gt;
&lt;li&gt;You want SQL-queryable job state&lt;/li&gt;
&lt;li&gt;You want to avoid ElastiCache costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use BullMQ when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need &amp;gt;200 jobs/second sustained throughput&lt;/li&gt;
&lt;li&gt;You have real-time job progress tracking requirements&lt;/li&gt;
&lt;li&gt;You're scaling to many workers across many instances&lt;/li&gt;
&lt;li&gt;You already have ElastiCache for other purposes (caching, sessions)&lt;/li&gt;
&lt;li&gt;You need fine-grained rate limiting (e.g., "max 10 API calls/second to this external service")&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The cost math on AWS (rough estimates, us-east-1):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setup&lt;/th&gt;
&lt;th&gt;Monthly cost (approx)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RDS PostgreSQL db.t3.medium&lt;/td&gt;
&lt;td&gt;~$30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RDS PostgreSQL db.t3.medium + ElastiCache cache.t3.micro&lt;/td&gt;
&lt;td&gt;~$45&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aurora PostgreSQL (serverless v2, min capacity)&lt;/td&gt;
&lt;td&gt;~$45&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aurora PostgreSQL + ElastiCache cache.t3.micro&lt;/td&gt;
&lt;td&gt;~$60&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you're already paying for RDS and your jobs fit within graphile-worker's throughput ceiling, you're spending money on ElastiCache for infrastructure you don't need.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Migration Path
&lt;/h2&gt;

&lt;p&gt;If you're currently on BullMQ and considering a migration, it's straightforward. graphile-worker runs its schema migrations automatically on startup, so you don't manage the tables yourself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Before: BullMQ&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;Queue&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;bullmq&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;emailQueue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;email&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;redisConnection&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;emailQueue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;send&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;body&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;attempts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;backoff&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;exponential&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;// After: graphile-worker&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;utils&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;makeWorkerUtils&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;connectionString&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;DATABASE_URL&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;utils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addJob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;send_email&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;body&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;maxAttempts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The retry/backoff configuration moves from the job definition to the worker configuration. The task handler API is nearly identical.&lt;/p&gt;

&lt;p&gt;One thing to handle explicitly: BullMQ lets you attach &lt;code&gt;removeOnComplete&lt;/code&gt; and &lt;code&gt;removeOnFail&lt;/code&gt; policies per job. graphile-worker always removes completed jobs (keeping failed ones with their error details). If you need a completed job archive, add a separate table and write to it from your task handlers.&lt;/p&gt;
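&lt;p&gt;That archive can live entirely inside the task handler. A sketch, written with dependency injection so it's testable without a running queue; &lt;code&gt;helpers.query&lt;/code&gt; mirrors graphile-worker's job helpers, and the &lt;code&gt;job_archive&lt;/code&gt; table is our own addition, not part of the library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Minimal shape of graphile-worker's helpers that this handler touches.
type Helpers = { query: (sql: string, values?: unknown[]) =&gt; Promise&lt;unknown&gt; }

type EmailJob = { to: string; subject: string; body: string }

// Factory so the actual delivery function can be swapped out in tests.
export function makeSendEmail(deliver: (job: EmailJob) =&gt; Promise&lt;void&gt;) {
  return async function sendEmail(payload: EmailJob, helpers: Helpers) {
    await deliver(payload)
    // graphile-worker deletes the job row on success, so keep our own record
    await helpers.query(
      'INSERT INTO job_archive (task, payload, completed_at) VALUES ($1, $2, now())',
      ['send_email', JSON.stringify(payload)]
    )
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;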




&lt;h2&gt;
  
  
  What I Actually Run in Production
&lt;/h2&gt;

&lt;p&gt;The project I maintain ships two Docker Compose configurations: one with Redis + BullMQ for teams that need horizontal scaling, and one with graphile-worker only for single-instance deployments that want minimum operational overhead.&lt;/p&gt;

&lt;p&gt;The Redis-free setup works well for SMB deployments: teams running their own observability stack on a single VPS or a modest EC2 instance. The full setup with Redis makes sense when you're running multiple backend instances behind a load balancer.&lt;/p&gt;

&lt;p&gt;Both queue implementations share the same task handler interface. Switching between them is a config change, not a code change.&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;PostgreSQL-based job queues aren't a new idea: Delayed Job in Ruby, django-q in Python, and several others have proven the pattern works. graphile-worker brings it to Node.js with a clean API and genuine PostgreSQL integration.&lt;/p&gt;

&lt;p&gt;The choice isn't "which is better." It's "which matches your constraints." If you're paying for ElastiCache already, BullMQ is probably the right call. If you're running PostgreSQL and your job volume fits within graphile-worker's ceiling, eliminating Redis simplifies your stack without meaningful cost.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;SELECT ... FOR UPDATE SKIP LOCKED&lt;/code&gt; pattern is one of those PostgreSQL features that most developers don't know exists until they need it. Now you do.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The Redis-optional deployment mode has shipped in &lt;a href="https://github.com/logtide-dev/logtide" rel="noopener noreferrer"&gt;Logtide&lt;/a&gt;, a self-hosted observability platform built on Node.js + TimescaleDB, since v0.5.0. The &lt;code&gt;docker-compose.simple.yml&lt;/code&gt; uses graphile-worker; the standard &lt;code&gt;docker-compose.yml&lt;/code&gt; uses BullMQ.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>aws</category>
      <category>node</category>
      <category>webdev</category>
    </item>
    <item>
      <title>PII in Your Logs Is a GDPR Time Bomb - Here's How to Defuse It</title>
      <dc:creator>Polliog</dc:creator>
      <pubDate>Mon, 16 Mar 2026 20:28:29 +0000</pubDate>
      <link>https://forem.com/polliog/pii-in-your-logs-is-a-gdpr-time-bomb-heres-how-to-defuse-it-307l</link>
      <guid>https://forem.com/polliog/pii-in-your-logs-is-a-gdpr-time-bomb-heres-how-to-defuse-it-307l</guid>
      <description>&lt;p&gt;Your application is probably logging PII right now.&lt;/p&gt;

&lt;p&gt;Not maliciously - it happens naturally. A user submits a form with their email. Your framework logs the full request body for debugging. The email lands in CloudWatch, Datadog, or your ELK cluster. It sits there for 90 days, or 365, or however long your retention policy says.&lt;/p&gt;

&lt;p&gt;Under GDPR, that's a data breach waiting for a complaint. Under HIPAA, it's a violation. Under any audit, it's a finding.&lt;/p&gt;

&lt;p&gt;The fix isn't "tell developers to be careful." Developers are already careful - until they're debugging a production incident at 2am and add a quick &lt;code&gt;console.log(request.body)&lt;/code&gt;. The fix is a masking layer that runs automatically, before any log hits storage.&lt;/p&gt;

&lt;p&gt;This article is about building that layer in Node.js.&lt;/p&gt;




&lt;h2&gt;
  
  
  What PII Actually Looks Like in Logs
&lt;/h2&gt;

&lt;p&gt;Before masking, you need to know what you're masking. PII in logs shows up in three forms:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structured fields&lt;/strong&gt; - JSON payloads where the key makes the value obvious:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"email"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"alice@example.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"password"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"hunter2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"ssn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"123-45-6789"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Embedded in strings&lt;/strong&gt; - PII inside log messages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User alice@example.com failed login from 192.168.1.1
Authorization: Bearer eyJhbGciOiJIUzI1NiJ9...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Nested or transformed&lt;/strong&gt; - Base64-encoded, URL-encoded, or buried in stack traces:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error processing request body: %7B%22email%22%3A%22alice%40example.com%22%7D
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A good masking pipeline handles all three. Most tutorials only handle the first one.&lt;/p&gt;
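&lt;p&gt;The third form is the easy one to miss: decode before you match, or the email regex never sees the address. A sketch, using the same email pattern defined below (&lt;code&gt;maskEncoded&lt;/code&gt; is a hypothetical helper, and replacing the decoded line is a simplification; in production you may prefer to redact the whole encoded blob):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical helper: normalize percent-encoding, then mask emails, so
// "alice%40example.com" is caught by the same regex as plain text.
const EMAIL = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/g

export function maskEncoded(line: string): string {
  let decoded = line
  try {
    decoded = decodeURIComponent(line)
  } catch {
    // Not valid percent-encoding (e.g. a stray "%"); scan the raw line.
  }
  return decoded.replace(EMAIL, '[EMAIL]')
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;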




&lt;h2&gt;
  
  
  The Architecture: Mask at Ingestion, Not at Display
&lt;/h2&gt;

&lt;p&gt;There are two schools of thought on when to mask:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Mask at display&lt;/strong&gt; - store everything, redact when showing logs in the UI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mask at ingestion&lt;/strong&gt; - strip PII before it ever reaches storage&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Mask at ingestion is the only defensible choice for compliance. If PII reaches your database, it's already a GDPR problem - even if you never display it. The data is there, it can be breached, and you own the liability.&lt;/p&gt;

&lt;p&gt;The pipeline looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Application → Log event → [Masking layer] → Storage
                                ↑
                         This is where we operate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The masking layer runs synchronously, in-process, before any network call to your log storage. No PII leaves the machine.&lt;/p&gt;




&lt;h2&gt;
  
  
  Building the Masking Layer
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Define your masking strategies
&lt;/h3&gt;

&lt;p&gt;Before writing regex, decide what "masked" means for your use case. Three strategies cover most cases:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;MaskingStrategy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;mask&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;redact&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hash&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="c1"&gt;// mask: show partial value - useful for debugging (still recognizable, not storable)&lt;/span&gt;
&lt;span class="c1"&gt;// "alice@example.com" → "al***@***.com"&lt;/span&gt;

&lt;span class="c1"&gt;// redact: replace entirely - use when value has no debugging value&lt;/span&gt;
&lt;span class="c1"&gt;// "hunter2" → "[REDACTED]"&lt;/span&gt;

&lt;span class="c1"&gt;// hash: deterministic SHA-256 - use when you need to correlate without exposing&lt;/span&gt;
&lt;span class="c1"&gt;// "alice@example.com" → "sha256:2f3a4b..." (same input always produces same hash)&lt;/span&gt;
&lt;span class="c1"&gt;// ⚠️ Always set PII_HASH_SALT in your environment. Emails and SSNs have low entropy&lt;/span&gt;
&lt;span class="c1"&gt;// and are trivially reversible from unsalted hashes via rainbow tables.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hashing is underused. It lets you answer "did this user appear in these logs?" without storing the actual email. Useful for audit trails and correlation.&lt;/p&gt;
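&lt;p&gt;Correlation works because the transform is deterministic: hash the value you're searching for with the same salt, then grep the stored logs for the resulting token. A minimal sketch of that lookup, matching the &lt;code&gt;hash&lt;/code&gt; strategy above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { createHash } from 'crypto'

// Same salted, truncated SHA-256 used at ingestion. Identical input + salt
// always yields the same token, so stored logs can be searched for
// piiHash('alice@example.com') without ever storing the address itself.
const SALT = process.env.PII_HASH_SALT ?? ''

export function piiHash(value: string): string {
  return `sha256:${createHash('sha256').update(value + SALT).digest('hex').slice(0, 16)}`
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;One consequence to decide deliberately: rotating the salt breaks correlation across the rotation boundary, which can be a feature (storage limitation) or a bug (long audit trails).&lt;/p&gt;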

&lt;h3&gt;
  
  
  Step 2: Pattern-based detection
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;createHash&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;crypto&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;PII_PATTERNS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Array&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;
  &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;RegExp&lt;/span&gt;
  &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;MaskingStrategy&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="c1"&gt;// Email addresses&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;email&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\b[&lt;/span&gt;&lt;span class="sr"&gt;A-Za-z0-9._%+-&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;+@&lt;/span&gt;&lt;span class="se"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;A-Za-z0-9.-&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;\.[&lt;/span&gt;&lt;span class="sr"&gt;A-Z|a-z&lt;/span&gt;&lt;span class="se"&gt;]{2,}\b&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;mask&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="c1"&gt;// Credit card numbers (Format-valid patterns — prefix and length, not Luhn checksum)&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;credit_card&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\b(?:&lt;/span&gt;&lt;span class="sr"&gt;4&lt;/span&gt;&lt;span class="se"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;0-9&lt;/span&gt;&lt;span class="se"&gt;]{12}(?:[&lt;/span&gt;&lt;span class="sr"&gt;0-9&lt;/span&gt;&lt;span class="se"&gt;]{3})?&lt;/span&gt;&lt;span class="sr"&gt;|5&lt;/span&gt;&lt;span class="se"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;1-5&lt;/span&gt;&lt;span class="se"&gt;][&lt;/span&gt;&lt;span class="sr"&gt;0-9&lt;/span&gt;&lt;span class="se"&gt;]{14}&lt;/span&gt;&lt;span class="sr"&gt;|3&lt;/span&gt;&lt;span class="se"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;47&lt;/span&gt;&lt;span class="se"&gt;][&lt;/span&gt;&lt;span class="sr"&gt;0-9&lt;/span&gt;&lt;span class="se"&gt;]{13}&lt;/span&gt;&lt;span class="sr"&gt;|3&lt;/span&gt;&lt;span class="se"&gt;(?:&lt;/span&gt;&lt;span class="sr"&gt;0&lt;/span&gt;&lt;span class="se"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;0-5&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;|&lt;/span&gt;&lt;span class="se"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;68&lt;/span&gt;&lt;span class="se"&gt;][&lt;/span&gt;&lt;span class="sr"&gt;0-9&lt;/span&gt;&lt;span class="se"&gt;])[&lt;/span&gt;&lt;span class="sr"&gt;0-9&lt;/span&gt;&lt;span class="se"&gt;]{11})\b&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;redact&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="c1"&gt;// US Social Security Numbers&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ssn&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\b\d{3}&lt;/span&gt;&lt;span class="sr"&gt;-&lt;/span&gt;&lt;span class="se"&gt;\d{2}&lt;/span&gt;&lt;span class="sr"&gt;-&lt;/span&gt;&lt;span class="se"&gt;\d{4}\b&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;redact&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="c1"&gt;// Bearer tokens / JWT&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;bearer_token&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sr"&gt;/Bearer&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;A-Za-z0-9&lt;/span&gt;&lt;span class="se"&gt;\-&lt;/span&gt;&lt;span class="sr"&gt;_=&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;\.[&lt;/span&gt;&lt;span class="sr"&gt;A-Za-z0-9&lt;/span&gt;&lt;span class="se"&gt;\-&lt;/span&gt;&lt;span class="sr"&gt;_=&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;\.?[&lt;/span&gt;&lt;span class="sr"&gt;A-Za-z0-9&lt;/span&gt;&lt;span class="se"&gt;\-&lt;/span&gt;&lt;span class="sr"&gt;_.+&lt;/span&gt;&lt;span class="se"&gt;/&lt;/span&gt;&lt;span class="sr"&gt;=&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;*/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;redact&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="c1"&gt;// AWS access keys&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws_access_key&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\b(&lt;/span&gt;&lt;span class="sr"&gt;AKIA|AIPA|AKIA|ASIA&lt;/span&gt;&lt;span class="se"&gt;)[&lt;/span&gt;&lt;span class="sr"&gt;A-Z0-9&lt;/span&gt;&lt;span class="se"&gt;]{16}\b&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;redact&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="c1"&gt;// IPv4 addresses (optional — some teams want these, some don't)&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ipv4&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\b(?:(?:&lt;/span&gt;&lt;span class="sr"&gt;25&lt;/span&gt;&lt;span class="se"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;0-5&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;|2&lt;/span&gt;&lt;span class="se"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;0-4&lt;/span&gt;&lt;span class="se"&gt;][&lt;/span&gt;&lt;span class="sr"&gt;0-9&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;|&lt;/span&gt;&lt;span class="se"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;01&lt;/span&gt;&lt;span class="se"&gt;]?[&lt;/span&gt;&lt;span class="sr"&gt;0-9&lt;/span&gt;&lt;span class="se"&gt;][&lt;/span&gt;&lt;span class="sr"&gt;0-9&lt;/span&gt;&lt;span class="se"&gt;]?)\.){3}(?:&lt;/span&gt;&lt;span class="sr"&gt;25&lt;/span&gt;&lt;span class="se"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;0-5&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;|2&lt;/span&gt;&lt;span class="se"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;0-4&lt;/span&gt;&lt;span class="se"&gt;][&lt;/span&gt;&lt;span class="sr"&gt;0-9&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;|&lt;/span&gt;&lt;span class="se"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;01&lt;/span&gt;&lt;span class="se"&gt;]?[&lt;/span&gt;&lt;span class="sr"&gt;0-9&lt;/span&gt;&lt;span class="se"&gt;][&lt;/span&gt;&lt;span class="sr"&gt;0-9&lt;/span&gt;&lt;span class="se"&gt;]?)\b&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;mask&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="c1"&gt;// Phone numbers (loose — adjust for your region)&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;phone&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;(\+?[\d\s\-&lt;/span&gt;&lt;span class="sr"&gt;().&lt;/span&gt;&lt;span class="se"&gt;]{10,15})&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;mask&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;applyStrategy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;strategy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;MaskingStrategy&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;switch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;strategy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;redact&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[REDACTED]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hash&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s2"&gt;`sha256:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nf"&gt;createHash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sha256&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;PII_HASH_SALT&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hex&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;

    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;mask&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Email masking: show first 2 chars of local part and domain TLD&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;local&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;domainName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;tlds&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;local&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;***@***.&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;tlds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="c1"&gt;// Generic masking: show first and last char, mask middle&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;****&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]}${&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;*&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;repeat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)}${&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;maskString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;strategy&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;PII_PATTERNS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;match&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;applyStrategy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;match&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;strategy&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
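&lt;p&gt;As a quick sanity check, here is a standalone sketch of the &lt;code&gt;hash&lt;/code&gt; and &lt;code&gt;mask&lt;/code&gt; strategies above (the same logic, extracted for illustration; the real module dispatches through &lt;code&gt;applyStrategy&lt;/code&gt;):&lt;/p&gt;

```typescript
import { createHash } from 'crypto'

// 'hash': salted SHA-256 truncated to 16 hex chars. The same input always
// yields the same token, so you can still correlate events by value.
function hash(value: string): string {
  const salt = process.env.PII_HASH_SALT ?? ''
  return `sha256:${createHash('sha256').update(value + salt).digest('hex').slice(0, 16)}`
}

// 'mask': emails keep the first 2 chars of the local part plus the TLD;
// everything else keeps only the first and last character.
function mask(value: string): string {
  if (value.includes('@')) {
    const [local, domain] = value.split('@')
    const [, ...tlds] = domain.split('.')
    return `${local.slice(0, 2)}***@***.${tlds.join('.')}`
  }
  if (value.length <= 4) return '****'
  return `${value[0]}${'*'.repeat(value.length - 2)}${value[value.length - 1]}`
}

console.log(mask('alice@example.com')) // al***@***.com
console.log(mask('4111111111111111'))  // 4**************1
console.log(hash('alice@example.com')) // sha256: plus 16 hex chars, stable for a given salt
```

&lt;p&gt;Note that the hashed form is deliberately one-way: you can group logs by the same user without ever storing the address itself.&lt;/p&gt;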



&lt;h3&gt;
  
  
  Step 3: Field-name detection
&lt;/h3&gt;

&lt;p&gt;Pattern matching catches PII embedded in strings. But for structured JSON, matching on field names is faster and more reliable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;SENSITIVE_FIELD_NAMES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;password&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;passwd&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;secret&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;token&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;api_key&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;apikey&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;api-key&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;authorization&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;auth&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;credential&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;credentials&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;email&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;e_mail&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;e-mail&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ssn&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;social_security&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;national_id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;credit_card&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;card_number&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cvv&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cvc&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;phone&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;phone_number&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;mobile&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;dob&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;date_of_birth&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;birthday&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;address&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;street_address&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;postal_code&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;zip_code&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ip_address&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ip&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;x_forwarded_for&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;isFieldSensitive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;normalized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;-_&lt;/span&gt;&lt;span class="se"&gt;\s]&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;_&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;SENSITIVE_FIELD_NAMES&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;normalized&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
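&lt;p&gt;The normalization step is what makes the lookup robust: &lt;code&gt;Api-Key&lt;/code&gt;, &lt;code&gt;api key&lt;/code&gt; and &lt;code&gt;API_KEY&lt;/code&gt; all collapse to &lt;code&gt;api_key&lt;/code&gt; before the &lt;code&gt;Set&lt;/code&gt; lookup. A trimmed-down sketch with a three-entry deny list:&lt;/p&gt;

```typescript
// Trimmed-down version of isFieldSensitive, for illustration only.
const SENSITIVE_FIELD_NAMES = new Set(['password', 'api_key', 'credit_card'])

function isFieldSensitive(key: string): boolean {
  // Lowercase, then fold dashes and whitespace into underscores.
  const normalized = key.toLowerCase().replace(/[-_\s]/g, '_')
  return SENSITIVE_FIELD_NAMES.has(normalized)
}

console.log(isFieldSensitive('Api-Key'))     // true
console.log(isFieldSensitive('CREDIT CARD')) // true
console.log(isFieldSensitive('request_id'))  // false
```

&lt;p&gt;One caveat: the match is exact after normalization, so a key like &lt;code&gt;userPassword&lt;/code&gt; slips through. If your schemas use compound names, extend the set or add a substring check.&lt;/p&gt;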



&lt;h3&gt;
  
  
  Step 4: Recursive object traversal
&lt;/h3&gt;

&lt;p&gt;The masking function needs to traverse nested objects, because request bodies aren't always flat:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;LogValue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nx"&gt;LogObject&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nx"&gt;LogValue&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;LogObject&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="nx"&gt;LogValue&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;maskObject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;LogObject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;LogObject&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Prevent infinite recursion on circular references&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[max_depth_exceeded]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;LogObject&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;isFieldSensitive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="c1"&gt;// Field name match: redact or hash based on field type&lt;/span&gt;
      &lt;span class="c1"&gt;// Note: this hardcodes the strategy per field type for brevity. In a production&lt;/span&gt;
      &lt;span class="c1"&gt;// system, map field names to your central PII_PATTERNS configuration to keep&lt;/span&gt;
      &lt;span class="c1"&gt;// strategies consistent across both field-name and pattern-based detection.&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;strategy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;email&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hash&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;redact&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
      &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
        &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nf"&gt;applyStrategy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;strategy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[REDACTED]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
      &lt;span class="k"&gt;continue&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;maskString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isArray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
        &lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
          &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nf"&gt;maskObject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;LogObject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
          &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
          &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nf"&gt;maskString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
          &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;
      &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;maskObject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;LogObject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
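&lt;p&gt;To see the traversal in action, here is a condensed, self-contained variant (field-name check only, plain redaction instead of per-field strategies) run against a hypothetical nested payload:&lt;/p&gt;

```typescript
// Condensed sketch of the recursive traversal above: same shape, but every
// sensitive field is replaced with '[REDACTED]' instead of strategy-masked.
const SENSITIVE = new Set(['password', 'token', 'email'])

type LogValue = string | number | boolean | null | LogObject | LogValue[]
type LogObject = { [key: string]: LogValue }

function maskObject(obj: LogObject, depth = 0): LogObject {
  if (depth > 10) return { '[max_depth_exceeded]': true }
  const out: LogObject = {}
  for (const [key, value] of Object.entries(obj)) {
    if (SENSITIVE.has(key.toLowerCase())) {
      out[key] = '[REDACTED]'
    } else if (Array.isArray(value)) {
      out[key] = value.map((v) =>
        typeof v === 'object' && v !== null ? maskObject(v as LogObject, depth + 1) : v
      )
    } else if (typeof value === 'object' && value !== null) {
      out[key] = maskObject(value as LogObject, depth + 1)
    } else {
      out[key] = value
    }
  }
  return out
}

const masked = maskObject({
  user: { email: 'alice@example.com', role: 'admin' },
  sessions: [{ token: 'abc123' }],
})
console.log(JSON.stringify(masked))
// {"user":{"email":"[REDACTED]","role":"admin"},"sessions":[{"token":"[REDACTED]"}]}
```

&lt;p&gt;Non-sensitive fields like &lt;code&gt;role&lt;/code&gt; pass through untouched, while matches are redacted at any depth, including inside arrays of objects.&lt;/p&gt;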



&lt;h3&gt;
  
  
  Step 5: The masking pipeline entry point
&lt;/h3&gt;

&lt;p&gt;Wrap everything in a single function that handles both structured objects and raw strings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;maskPII&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;unknown&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;unknown&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;maskString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nb"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isArray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;maskObject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;LogObject&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isArray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;maskPII&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
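&lt;p&gt;The dispatch logic is easy to verify in isolation. In this sketch the string and object passes are stubbed out with minimal stand-ins (the hypothetical email regex is not the real pattern set), so only the routing is exercised:&lt;/p&gt;

```typescript
// Stand-ins for the real passes: a single hypothetical email pattern, and an
// object walk that just re-dispatches every value through maskPII.
const maskString = (s: string) => s.replace(/\S+@\S+/g, '[EMAIL]')
const maskObject = (o: Record<string, unknown>) =>
  Object.fromEntries(Object.entries(o).map(([k, v]) => [k, maskPII(v)]))

// Same routing as the entry point above: strings get pattern masking,
// arrays map element-wise, plain objects recurse, everything else passes through.
function maskPII(input: unknown): unknown {
  if (typeof input === 'string') return maskString(input)
  if (Array.isArray(input)) return input.map((item) => maskPII(item))
  if (typeof input === 'object' && input !== null) {
    return maskObject(input as Record<string, unknown>)
  }
  return input
}

console.log(JSON.stringify(maskPII(['login failed for bob@corp.io', { note: 'ok', count: 3 }])))
// ["login failed for [EMAIL]",{"note":"ok","count":3}]
```

&lt;p&gt;Numbers, booleans and &lt;code&gt;null&lt;/code&gt; fall through the final &lt;code&gt;return input&lt;/code&gt;, which is what you want: only strings can carry embedded PII.&lt;/p&gt;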






&lt;h2&gt;
  
  
  Integrating With Your Logger
&lt;/h2&gt;

&lt;h3&gt;
  
  
  With Pino (recommended for Node.js)
&lt;/h3&gt;

&lt;p&gt;Pino supports &lt;code&gt;redact&lt;/code&gt; paths natively, but only for field paths known ahead of time. For dynamic detection, use a &lt;code&gt;serializers&lt;/code&gt; hook:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;pino&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;pino&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;maskPII&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./masking&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pino&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;serializers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Mask the entire request object&lt;/span&gt;
    &lt;span class="na"&gt;req&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;maskPII&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="c1"&gt;// Mask arbitrary metadata&lt;/span&gt;
    &lt;span class="na"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;maskPII&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;// Usage&lt;/span&gt;
&lt;span class="nx"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;email&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Request received&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  With Winston
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;winston&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;winston&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;maskPII&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./masking&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;maskingTransform&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;winston&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;info&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;maskPII&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;info&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;info&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;winston&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createLogger&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;winston&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;format&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;combine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nf"&gt;maskingTransform&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="nx"&gt;winston&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;format&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;transports&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;winston&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;transports&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;()],&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  With a raw HTTP ingest endpoint
&lt;/h3&gt;

&lt;p&gt;If you're building an ingest endpoint that receives logs from external sources (SDKs, collectors), apply masking server-side before writing to storage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;Fastify&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;fastify&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;maskPII&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./masking&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Fastify&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/api/v1/ingest&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;reply&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;logs&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;LogObject&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;maskedLogs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nf"&gt;maskObject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;ingested_at&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
  &lt;span class="p"&gt;}))&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insertInto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;logs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;maskedLogs&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;reply&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;accepted&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;maskedLogs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Edge Cases Nobody Talks About
&lt;/h2&gt;

&lt;h3&gt;
  
  
  URL-encoded and Base64-encoded PII
&lt;/h3&gt;

&lt;p&gt;Attackers (and frameworks) encode data. Your masking needs to handle it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;maskStringWithDecoding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;

  &lt;span class="c1"&gt;// Try URL decode and re-mask&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;decoded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;decodeURIComponent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;decoded&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;encodeURIComponent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;maskString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;decoded&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

  &lt;span class="c1"&gt;// Try Base64 decode and re-mask&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;base64Pattern&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\b[&lt;/span&gt;&lt;span class="sr"&gt;A-Za-z0-9+&lt;/span&gt;&lt;span class="se"&gt;/]{20,}&lt;/span&gt;&lt;span class="sr"&gt;=&lt;/span&gt;&lt;span class="se"&gt;{0,2}\b&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;
  &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;base64Pattern&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;match&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;decoded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;match&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;base64&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;utf8&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="c1"&gt;// Only re-encode if it looks like it decoded to something meaningful&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/^&lt;/span&gt;&lt;span class="se"&gt;[\x&lt;/span&gt;&lt;span class="sr"&gt;20-&lt;/span&gt;&lt;span class="se"&gt;\x&lt;/span&gt;&lt;span class="sr"&gt;7E&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;+$/&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;decoded&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;masked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;maskString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;decoded&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;masked&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="nx"&gt;decoded&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;masked&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;base64&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;match&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;maskString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
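&lt;p&gt;A quick sanity check of the decoding pass, using a deliberately minimal &lt;code&gt;maskString&lt;/code&gt; (emails become &lt;code&gt;***&lt;/code&gt;) standing in for the full pipeline; everything below is illustrative, not Logtide internals:&lt;br&gt;
&lt;/p&gt;

```typescript
// Hypothetical minimal maskString for the demo: emails only.
const EMAIL = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/g

function maskString(input: string): string {
  return input.replace(EMAIL, '***')
}

// Condensed version of the URL-decode branch shown above.
function maskStringWithDecoding(input: string): string {
  let result = input
  try {
    const decoded = decodeURIComponent(result)
    if (decoded !== result) {
      result = encodeURIComponent(maskString(decoded))
    }
  } catch {}
  return maskString(result)
}

// URL-encoded email: plain maskString misses it, the decoding wrapper catches it.
const encoded = 'user=' + encodeURIComponent('alice@example.com')
console.log(maskString(encoded))             // 'user=alice%40example.com' (the %40 hides the @)
console.log(maskStringWithDecoding(encoded)) // 'user%3D***'
```

&lt;p&gt;Note the side effect: because the whole string is re-encoded after masking, the &lt;code&gt;=&lt;/code&gt; comes back as &lt;code&gt;%3D&lt;/code&gt;. Acceptable for logs, but worth knowing.&lt;/p&gt;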



&lt;h3&gt;
  
  
  Stack traces
&lt;/h3&gt;

&lt;p&gt;Stack traces can contain PII in exception messages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error: User not found for email alice@example.com
    at UserService.findByEmail (user.service.ts:42)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Mask the first line, where the exception message (and any PII) lives, and pass the stack frames through untouched:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;maskStackTrace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;stack&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;line&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="c1"&gt;// Mask the error message line (first line), leave stack frames alone&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;maskString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;line&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
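&lt;p&gt;Wired up end to end (again with a minimal, hypothetical &lt;code&gt;maskString&lt;/code&gt; so the snippet runs standalone), the message line is scrubbed while the frames keep their file and line information:&lt;br&gt;
&lt;/p&gt;

```typescript
// Stand-in maskString for the demo: emails become '***'.
const EMAIL_RE = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/g

function maskString(input: string): string {
  return input.replace(EMAIL_RE, '***')
}

function maskStackTrace(stack: string): string {
  return stack
    .split('\n')
    .map((line, index) => (index === 0 ? maskString(line) : line))
    .join('\n')
}

const err = new Error('User not found for email alice@example.com')
console.log(maskStackTrace(err.stack ?? String(err)))
// First line: "Error: User not found for email ***"
// Frame lines ("    at ...") pass through untouched
```

&lt;p&gt;One caveat: this only masks the first line, so multi-line error messages would still leak from line two onward.&lt;/p&gt;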



&lt;h3&gt;
  
  
  Performance considerations
&lt;/h3&gt;

&lt;p&gt;The masking pipeline runs on every log event. Profile it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Simple benchmark&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;iterations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sampleLog&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;User alice@example.com logged in from 192.168.1.1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;alice@example.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;authorization&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Bearer eyJhbGciOiJIUzI1NiJ9.test.test&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;performance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;iterations&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;maskObject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sampleLog&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;elapsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;performance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;iterations&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; iterations in &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;elapsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toFixed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;ms (&lt;/span&gt;&lt;span class="p"&gt;${(&lt;/span&gt;&lt;span class="nx"&gt;elapsed&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;iterations&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toFixed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;ms each)`&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On a modern machine, a well-implemented masking pipeline takes 0.05-0.2ms per log event. At 1,000 logs/second, that's 50-200ms of CPU per second — acceptable for most applications, but worth measuring for high-throughput services.&lt;/p&gt;

&lt;p&gt;If performance is a concern, compile your regex patterns once outside the function — the compilation cost is paid only once, not on every log event:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Bad: regex compiled on every call&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;maskEmail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\b[&lt;/span&gt;&lt;span class="sr"&gt;A-Za-z0-9._%+-&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;+@&lt;/span&gt;&lt;span class="se"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;A-Za-z0-9.-&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;\.[&lt;/span&gt;&lt;span class="sr"&gt;A-Z|a-z&lt;/span&gt;&lt;span class="se"&gt;]{2,}\b&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;***&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Good: compiled once, reused on every call&lt;/span&gt;
&lt;span class="c1"&gt;// Note: String.prototype.replace() manages lastIndex internally — no manual reset needed&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;EMAIL_PATTERN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\b[&lt;/span&gt;&lt;span class="sr"&gt;A-Za-z0-9._%+-&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;+@&lt;/span&gt;&lt;span class="se"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;A-Za-z0-9.-&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;\.[&lt;/span&gt;&lt;span class="sr"&gt;A-Z|a-z&lt;/span&gt;&lt;span class="se"&gt;]{2,}\b&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;maskEmail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;EMAIL_PATTERN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;***&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
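&lt;p&gt;One more optimization worth considering (our suggestion, not something the snippets above already do): guard the regex with a cheap substring check. Most log lines contain no &lt;code&gt;@&lt;/code&gt; at all, and &lt;code&gt;String.prototype.includes&lt;/code&gt; is far cheaper than a full regex scan:&lt;br&gt;
&lt;/p&gt;

```typescript
const EMAIL_PATTERN = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/g

function maskEmail(str: string): string {
  // Fast path: skip the regex entirely when no '@' is present,
  // which is the common case for most log lines.
  if (!str.includes('@')) return str
  return str.replace(EMAIL_PATTERN, '***')
}

console.log(maskEmail('Server started on port 3000')) // untouched, regex never runs
console.log(maskEmail('hello alice@example.com'))     // 'hello ***'
```

&lt;p&gt;The same guard works for the other patterns: &lt;code&gt;Bearer&lt;/code&gt; for tokens, a digit check for card numbers.&lt;/p&gt;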






&lt;h2&gt;
  
  
  Testing Your Masking Pipeline
&lt;/h2&gt;

&lt;p&gt;A masking layer without tests is worse than no masking layer — it gives you false confidence.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;it&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;expect&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;vitest&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;maskPII&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;maskObject&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./masking&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;PII masking&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;it&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;masks email addresses in strings&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;maskPII&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;User alice@example.com logged in&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;not&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toContain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;alice@example.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toContain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;// partial masking, not full redaction&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;

  &lt;span class="nf"&gt;it&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;redacts password fields&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;maskObject&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hunter2&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;alice&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;password&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[REDACTED]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;username&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;alice&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;// non-sensitive fields unchanged&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;

  &lt;span class="nf"&gt;it&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;handles nested objects&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;maskObject&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;alice@example.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;preferences&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;theme&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;dark&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;email&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;not&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;alice@example.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;preferences&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;theme&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;dark&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;

  &lt;span class="nf"&gt;it&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;redacts bearer tokens&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;maskPII&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Authorization: Bearer eyJhbGciOiJIUzI1NiJ9.test.sig&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toContain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[REDACTED]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;not&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toContain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;eyJhbGciOiJIUzI1NiJ9&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;

  &lt;span class="nf"&gt;it&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;does not modify non-PII strings&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Server started on port 3000&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;maskPII&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;

  &lt;span class="nf"&gt;it&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;handles null and undefined gracefully&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;maskPII&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nx"&gt;not&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toThrow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;maskPII&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nx"&gt;not&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toThrow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Masking Preview Problem
&lt;/h2&gt;

&lt;p&gt;One practical challenge: developers need to test whether their masking rules are working without shipping to production. Build a simple preview endpoint (dev/staging only) that runs the masking pipeline and returns the diff:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;NODE_ENV&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;production&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/debug/mask-preview&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;reply&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;masked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;maskPII&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;reply&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;original&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="nx"&gt;masked&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;changed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;masked&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Call it with a sample log payload and immediately see what gets masked. Faster than print-debugging your way through regex patterns.&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;PII masking in logs is not a nice-to-have. It's a compliance requirement, and more importantly, it's the right thing to do with your users' data.&lt;/p&gt;

&lt;p&gt;The pattern is straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Mask at ingestion, not at display&lt;/li&gt;
&lt;li&gt;Combine field-name detection (fast, reliable for structured data) with pattern matching (catches PII in strings)&lt;/li&gt;
&lt;li&gt;Choose the right strategy per field type: mask for emails, redact for passwords/tokens, hash for correlation keys&lt;/li&gt;
&lt;li&gt;Handle edge cases: URL encoding, Base64, stack traces&lt;/li&gt;
&lt;li&gt;Test it like production code, because it is production code&lt;/li&gt;
&lt;/ol&gt;
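&lt;p&gt;As a rough sketch of what point 3 looks like in practice (field names and rules here are illustrative, not Logtide's actual implementation):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { createHash } from 'node:crypto'

// Illustrative per-field strategies: redact secrets entirely, hash IDs that
// double as correlation keys, and partially mask emails.
const REDACT_FIELDS = ['password', 'token', 'authorization']
const HASH_FIELDS = ['user_id', 'customer_id']

function maskEmail(value: string): string {
  // Keep the first character of the local part: "alice@x.com" becomes "a***@x.com"
  return value.replace(/([^@\s])[^@\s]*(@)/g, '$1***$2')
}

function applyStrategy(field: string, value: string): string {
  const name = field.toLowerCase()
  if (REDACT_FIELDS.includes(name)) return '[REDACTED]'
  if (HASH_FIELDS.includes(name)) {
    // Hashing keeps the value stable across log lines without storing the raw ID
    return createHash('sha256').update(value).digest('hex').slice(0, 12)
  }
  if (name.includes('email')) return maskEmail(value)
  return value
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;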

&lt;p&gt;The implementation above is about 150 lines of TypeScript. There's no reason every Node.js application logging to CloudWatch, Datadog, or anywhere else shouldn't have something equivalent running before the first log event leaves the process.&lt;/p&gt;

</description>
      <category>security</category>
      <category>webdev</category>
      <category>node</category>
      <category>typescript</category>
    </item>
    <item>
      <title>I Benchmarked TimescaleDB vs ClickHouse vs MongoDB for Observability Data - The Results Surprised Me</title>
      <dc:creator>Polliog</dc:creator>
      <pubDate>Sat, 14 Mar 2026 20:35:57 +0000</pubDate>
      <link>https://forem.com/aws-builders/i-benchmarked-timescaledb-vs-clickhouse-vs-mongodb-for-observability-data-the-results-surprised-me-3d7d</link>
      <guid>https://forem.com/aws-builders/i-benchmarked-timescaledb-vs-clickhouse-vs-mongodb-for-observability-data-the-results-surprised-me-3d7d</guid>
      <description>&lt;p&gt;When we designed &lt;code&gt;@logtide/reservoir&lt;/code&gt; the pluggable storage abstraction layer for &lt;a href="https://github.com/logtide-dev/logtide" rel="noopener noreferrer"&gt;Logtide&lt;/a&gt; we had to make a real decision: which database should be the default for an observability platform?&lt;/p&gt;

&lt;p&gt;The conventional wisdom says: time-series data at scale → ClickHouse. It's what everyone building in this space seems to reach for. SigNoz, Uptrace, and a number of others are built on it or are moving toward it.&lt;/p&gt;

&lt;p&gt;We didn't. We picked TimescaleDB as our default, with ClickHouse available for enterprise deployments and MongoDB for teams already invested in that ecosystem.&lt;/p&gt;

&lt;p&gt;We built a proper benchmark suite and ran it. Here are the actual numbers.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;All three engines were benchmarked under identical conditions, running in Docker on the same machine, seeded with the same synthetic dataset, tested at four volume tiers: &lt;strong&gt;1K, 10K, 100K, and 1M records&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Three data types were tested separately: &lt;strong&gt;logs&lt;/strong&gt;, &lt;strong&gt;spans&lt;/strong&gt; (distributed traces), and &lt;strong&gt;metrics&lt;/strong&gt;, because the query patterns are fundamentally different for each. Each test ran 3 iterations with 1 warmup round. Results are p50 latency unless otherwise noted.&lt;/p&gt;

&lt;p&gt;The benchmark suite is open source: it ships in Logtide's repository, and you can run it yourself.&lt;/p&gt;




&lt;h2&gt;
  
  
  Ingestion: Where ClickHouse Has a Problem
&lt;/h2&gt;

&lt;p&gt;The first thing that jumped out was ClickHouse's ingestion behavior at small-to-medium batch sizes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Log ingestion p50 latency (batch 1,000):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;1K rows&lt;/th&gt;
&lt;th&gt;10K rows&lt;/th&gt;
&lt;th&gt;100K rows&lt;/th&gt;
&lt;th&gt;1M rows&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;TimescaleDB&lt;/td&gt;
&lt;td&gt;17.6ms&lt;/td&gt;
&lt;td&gt;14.2ms&lt;/td&gt;
&lt;td&gt;13.9ms&lt;/td&gt;
&lt;td&gt;13.3ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ClickHouse&lt;/td&gt;
&lt;td&gt;400.1ms&lt;/td&gt;
&lt;td&gt;400.4ms&lt;/td&gt;
&lt;td&gt;399.8ms&lt;/td&gt;
&lt;td&gt;400.0ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MongoDB&lt;/td&gt;
&lt;td&gt;37.0ms&lt;/td&gt;
&lt;td&gt;39.5ms&lt;/td&gt;
&lt;td&gt;37.2ms&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;ClickHouse is sitting at exactly 400ms for batch 1,000 across all volume tiers. That's not a coincidence: it's ClickHouse's async insert behavior. When &lt;code&gt;async_insert = 1&lt;/code&gt; is enabled (common in modern clients and managed services), ClickHouse buffers writes in memory and flushes them when &lt;code&gt;async_insert_busy_timeout_ms&lt;/code&gt; elapses. Our setup has that timeout at 400ms. The 400 isn't a random number; it's a configured flush interval.&lt;/p&gt;
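&lt;p&gt;For reference, these are the knobs involved (a sketch of per-insert settings a client might send; the setting names are real ClickHouse settings, the values mirror our setup):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;INSERT INTO logs
SETTINGS
    async_insert = 1,                   -- buffer small inserts server-side
    wait_for_async_insert = 1,          -- ack only after the buffer flushes
    async_insert_busy_timeout_ms = 400  -- the flush interval behind the 400ms
VALUES (...)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;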

&lt;p&gt;The buffering exists precisely because ClickHouse doesn't handle high-frequency small writes well natively. Its columnar storage format requires merging data into sorted chunks, a process that's expensive if triggered on every small insert. Async inserts are the workaround: batch writes in memory, flush periodically, pay the merge cost less often. It's the right design for bulk analytics ingestion. It's the wrong design if you're pushing logs from 10 microservices every few seconds.&lt;/p&gt;

&lt;p&gt;This matters a lot for observability workloads. When your application is logging in real time, you're not sending 10,000-log batches. You're sending small, frequent writes. At batch 100, ClickHouse delivers 250 ops/s. TimescaleDB delivers 14,200 ops/s. That's a 56x difference at a batch size that's very common in practice.&lt;/p&gt;

&lt;p&gt;ClickHouse closes most of the gap at batch 10,000: 83,843 ops/s vs 120,934 ops/s for TimescaleDB. At bulk-ingestion scale the two are comparable, but you need to be operating at that scale to benefit.&lt;/p&gt;

&lt;p&gt;MongoDB sits in the middle: consistent ~25K ops/s regardless of batch size, no timing artifacts. Predictable if not spectacular.&lt;/p&gt;




&lt;h2&gt;
  
  
  Query Latency: The Result That Settles the Debate
&lt;/h2&gt;

&lt;p&gt;This is where the numbers get dramatic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Log query p50 latency at 100K records:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;TimescaleDB&lt;/th&gt;
&lt;th&gt;ClickHouse&lt;/th&gt;
&lt;th&gt;MongoDB&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Single service filter&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.47ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;44.8ms&lt;/td&gt;
&lt;td&gt;304ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-filter&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.48ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;35.2ms&lt;/td&gt;
&lt;td&gt;309ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full-text search&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.45ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;32.2ms&lt;/td&gt;
&lt;td&gt;39.9ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Narrow time range (1h)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.49ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8.7ms&lt;/td&gt;
&lt;td&gt;3.4ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pagination (offset 1000)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.40ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;85.8ms&lt;/td&gt;
&lt;td&gt;320ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aggregate 1h buckets&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.41ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;15.1ms&lt;/td&gt;
&lt;td&gt;376ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;TimescaleDB is answering filtered log queries in under &lt;strong&gt;half a millisecond&lt;/strong&gt; at 100K records. ClickHouse takes 35-85ms for the same queries. MongoDB takes 300-400ms.&lt;/p&gt;

&lt;p&gt;The scaling story is equally stark. At 1M records, TimescaleDB's query latency barely moves: still &lt;strong&gt;0.46ms&lt;/strong&gt; for a service filter. ClickHouse degrades to 244ms. MongoDB wasn't tested at 1M for logs (the 100K numbers already showed where things were heading).&lt;/p&gt;

&lt;p&gt;This is the TimescaleDB superpower: &lt;strong&gt;hypertable partitioning + continuous aggregates&lt;/strong&gt;. Most log queries filter by time range and service. TimescaleDB chunks data by time, and those chunks are indexed by service. The queries skip entire partitions instead of scanning. The continuous aggregates make count and aggregate queries nearly free because the work is already done.&lt;/p&gt;
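&lt;p&gt;The two features look like this in SQL (a sketch with illustrative table and column names, not our actual schema):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT create_hypertable('logs', 'time');   -- chunk the table by time
CREATE INDEX ON logs (service, time DESC);  -- per-chunk service index

-- Continuous aggregate: hourly counts are maintained incrementally,
-- so dashboard aggregations read precomputed rows instead of raw logs
CREATE MATERIALIZED VIEW logs_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', time) AS bucket, service, count(*) AS n
FROM logs
GROUP BY bucket, service;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;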




&lt;h2&gt;
  
  
  The One Place ClickHouse Wins
&lt;/h2&gt;

&lt;p&gt;There's an important exception to the TimescaleDB dominance: &lt;strong&gt;count operations at scale&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Count p50 at 1M records:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;TimescaleDB&lt;/th&gt;
&lt;th&gt;ClickHouse&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Full count&lt;/td&gt;
&lt;td&gt;0.38ms&lt;/td&gt;
&lt;td&gt;11.25ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Filtered count&lt;/td&gt;
&lt;td&gt;0.43ms&lt;/td&gt;
&lt;td&gt;14.42ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Wait, TimescaleDB wins here too? Yes, because of the &lt;code&gt;countEstimate&lt;/code&gt; optimization we built: instead of &lt;code&gt;COUNT(*)&lt;/code&gt;, we use &lt;code&gt;EXPLAIN&lt;/code&gt; planner estimates for approximate counts. Zero scan, sub-millisecond.&lt;/p&gt;
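&lt;p&gt;The trick is easy to replicate (a sketch; &lt;code&gt;estimatedRows&lt;/code&gt; is an illustrative helper, not reservoir's actual code). PostgreSQL's &lt;code&gt;EXPLAIN (FORMAT JSON)&lt;/code&gt; returns the planner's row estimate without scanning anything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Parse the planner's estimate out of EXPLAIN (FORMAT JSON) output,
// which has the shape [{ "Plan": { "Plan Rows": 12345, ... } }]
function estimatedRows(explainJson: any): number {
  return explainJson[0]['Plan']['Plan Rows']
}

// Usage with a pg client (sketch):
//   const { rows } = await client.query(
//     "EXPLAIN (FORMAT JSON) SELECT * FROM logs WHERE service = $1", ['api'])
//   const approx = estimatedRows(rows[0]['QUERY PLAN'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;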

&lt;p&gt;Where ClickHouse genuinely wins is &lt;strong&gt;aggregate throughput&lt;/strong&gt; at high volume: at 1M records, its &lt;code&gt;aggregate (1m)&lt;/code&gt; operation reaches 55,507 ops/s. ClickHouse is built for columnar analytical queries over huge datasets; if you're running complex analytics across months of data with many group-by combinations, it'll outperform.&lt;/p&gt;

&lt;p&gt;For the interactive dashboard queries that dominate observability UIs ("show me the last hour, filtered by this service"), it's not even a fair fight: TimescaleDB wins outright.&lt;/p&gt;




&lt;h2&gt;
  
  
  Spans: The Interesting Reversal
&lt;/h2&gt;

&lt;p&gt;The span (distributed tracing) results tell a different story from logs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trace query p50 at 10K records:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;TimescaleDB&lt;/th&gt;
&lt;th&gt;ClickHouse&lt;/th&gt;
&lt;th&gt;MongoDB&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Query all traces&lt;/td&gt;
&lt;td&gt;2.5ms&lt;/td&gt;
&lt;td&gt;23.6ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.6ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query error traces&lt;/td&gt;
&lt;td&gt;1.6ms&lt;/td&gt;
&lt;td&gt;22.6ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.3ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Get trace by ID&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.29ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4.3ms&lt;/td&gt;
&lt;td&gt;0.40ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Service dependencies&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.42ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;179ms&lt;/td&gt;
&lt;td&gt;444ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;MongoDB is faster than TimescaleDB on some trace queries at this scale. The reason: MongoDB's document model fits trace data naturally. A trace is a document with nested spans. The &lt;code&gt;queryTraces (all)&lt;/code&gt; query maps directly to a collection scan with a simple index lookup. TimescaleDB has to join spans to reconstruct traces.&lt;/p&gt;

&lt;p&gt;Both MongoDB and TimescaleDB stay well ahead of ClickHouse on span queries. At the 10K tier, ClickHouse takes &lt;strong&gt;1.76 seconds&lt;/strong&gt; to answer 50 parallel span queries; TimescaleDB handles the same load in &lt;strong&gt;10ms&lt;/strong&gt;. That's what "not designed for point lookups" looks like in practice.&lt;/p&gt;

&lt;p&gt;At 100K spans, the MongoDB advantage on trace queries disappears: &lt;code&gt;querySpans (by service)&lt;/code&gt; goes from 82ms to 159ms, while TimescaleDB holds at 0.65ms. The document model helps at smaller scales but doesn't index-skip the way hypertables do.&lt;/p&gt;




&lt;h2&gt;
  
  
  Concurrency: The Story Nobody Tells
&lt;/h2&gt;

&lt;p&gt;Single-query latency is fine for benchmarks. Production workloads are concurrent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concurrent log queries (50 parallel) p50:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Volume&lt;/th&gt;
&lt;th&gt;TimescaleDB&lt;/th&gt;
&lt;th&gt;ClickHouse&lt;/th&gt;
&lt;th&gt;MongoDB&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1K&lt;/td&gt;
&lt;td&gt;6.8ms&lt;/td&gt;
&lt;td&gt;334ms&lt;/td&gt;
&lt;td&gt;665ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10K&lt;/td&gt;
&lt;td&gt;6.7ms&lt;/td&gt;
&lt;td&gt;401ms&lt;/td&gt;
&lt;td&gt;792ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100K&lt;/td&gt;
&lt;td&gt;6.2ms&lt;/td&gt;
&lt;td&gt;895ms&lt;/td&gt;
&lt;td&gt;2,380ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;td&gt;6.2ms&lt;/td&gt;
&lt;td&gt;6,307ms&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;TimescaleDB's concurrency numbers are remarkably flat. 50 parallel queries at 100K records: 6.2ms. Same 50 parallel queries at 1M records: still 6.2ms.&lt;/p&gt;

&lt;p&gt;ClickHouse at 50 parallel queries on 1M records: &lt;strong&gt;6.3 seconds&lt;/strong&gt;. PostgreSQL's process-per-connection model and MVCC handle concurrent readers without degradation; ClickHouse's columnar engine serializes heavy queries and saturates its thread pool.&lt;/p&gt;

&lt;p&gt;This matters if you're running Logtide for a team. Multiple people with dashboards open, alert evaluations running in the background, scheduled reports firing: that's concurrent load. TimescaleDB absorbs it. ClickHouse struggles with it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Metrics: MongoDB's Surprise
&lt;/h2&gt;

&lt;p&gt;Metrics data was the unexpected MongoDB story.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concurrent metric queries (50 parallel) at 100K:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;p50&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;TimescaleDB&lt;/td&gt;
&lt;td&gt;6.3ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ClickHouse&lt;/td&gt;
&lt;td&gt;284.9ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MongoDB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;53.7ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;MongoDB beats ClickHouse on concurrent metric queries by 5x. The reason: our MongoDB metrics implementation uses the native &lt;code&gt;$percentile&lt;/code&gt; aggregation pipeline, which MongoDB handles efficiently in-memory at this scale. ClickHouse's columnar approach adds overhead for the many small aggregations typical of metrics dashboards.&lt;/p&gt;

&lt;p&gt;At 1K and 10K records, MongoDB's metric aggregations (avg, sum, min, max, percentiles) all land in the 11-17ms range, in the same band as ClickHouse's 8-21ms, though well behind TimescaleDB's sub-millisecond performance.&lt;/p&gt;

&lt;p&gt;The catch that these latency numbers don't show: MongoDB stores metrics as BSON documents without time-series-specific compression. TimescaleDB uses columnar compression on hypertables, and ClickHouse uses Gorilla encoding (delta-of-delta) for floats and Delta encoding for timestamps algorithms designed specifically for the repetitive patterns in metrics data. In practice, the same year of metrics data will occupy significantly less disk on TimescaleDB or ClickHouse than on MongoDB. If storage cost matters at your scale, that tradeoff should factor into the decision.&lt;/p&gt;

&lt;p&gt;MongoDB won 4 out of 52 benchmark categories at 1K records, 2 at 10K. Small wins, but real ones, mostly around span lookups by trace ID and narrow time-range queries, where its document indexing shines.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Decision Framework
&lt;/h2&gt;

&lt;p&gt;After seeing these numbers, here's how we think about the choice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use TimescaleDB (default) when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're running Logtide for a single team or SMB&lt;/li&gt;
&lt;li&gt;You're already comfortable with PostgreSQL operationally&lt;/li&gt;
&lt;li&gt;You want the lowest query latency across the board&lt;/li&gt;
&lt;li&gt;You have mixed concurrent load (dashboards + alerts + searches)&lt;/li&gt;
&lt;li&gt;You're on AWS RDS for PostgreSQL with TimescaleDB extension, or Aurora PostgreSQL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use ClickHouse when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're ingesting exclusively in large batches (10K+ per request)&lt;/li&gt;
&lt;li&gt;Your primary use case is analytical queries over months of historical data&lt;/li&gt;
&lt;li&gt;You have a dedicated ops team managing ClickHouse infrastructure&lt;/li&gt;
&lt;li&gt;You're on AWS EC2 with a self-managed ClickHouse cluster&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use MongoDB when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're already running MongoDB in your infrastructure (DocumentDB, Atlas, FerretDB, Cosmos DB in Mongo mode)&lt;/li&gt;
&lt;li&gt;Your workload is trace-heavy with many individual document lookups&lt;/li&gt;
&lt;li&gt;You want to avoid running a separate database just for observability&lt;/li&gt;
&lt;li&gt;You're on AWS DocumentDB and don't want another managed service&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;@logtide/reservoir&lt;/code&gt; abstraction means the application code doesn't care which engine you pick. You swap the config, run the migrations, and the same Logtide instance works on all three.&lt;/p&gt;
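&lt;p&gt;To make that concrete, the adapter contract boils down to something like this (a hypothetical shape; the real &lt;code&gt;@logtide/reservoir&lt;/code&gt; interface is considerably richer):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical adapter contract: every engine implements the same surface,
// so application code never touches engine-specific query syntax.
interface LogStore {
  insert(entry: { service: string; message: string }): void
  countByService(service: string): number
}

// A trivial in-memory adapter; a TimescaleDB or ClickHouse adapter would
// implement the same interface over SQL.
class MemoryStore implements LogStore {
  private entries: { service: string; message: string }[] = []
  insert(entry: { service: string; message: string }): void {
    this.entries.push(entry)
  }
  countByService(service: string): number {
    return this.entries.filter(function (e) { return e.service === service }).length
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;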




&lt;h2&gt;
  
  
  What These Numbers Don't Tell You
&lt;/h2&gt;

&lt;p&gt;Benchmarks lie in specific ways, and this one has a scale ceiling you should be aware of.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1M records is not a large dataset.&lt;/strong&gt; A moderately busy production service can generate 1M logs in minutes. At 100M or 1B rows, where real enterprise observability workloads live, the picture changes. TimescaleDB's B-tree indexes eventually stop fitting in RAM. When that happens, queries start hitting disk and latency climbs non-linearly. ClickHouse's columnar format and extreme compression (often 10:1 or better for log data) mean its working set stays in RAM much longer. At billion-row scale, the engines invert: ClickHouse's full-table scans become faster than TimescaleDB's index misses.&lt;/p&gt;

&lt;p&gt;These benchmarks represent &lt;strong&gt;SMB-scale workloads&lt;/strong&gt;: teams generating tens of millions of log entries per day, not hundreds of millions per hour. That's exactly Logtide's target. But if you're evaluating engines for a platform that will eventually ingest at Datadog or Cloudflare scale, treat the 1M results as a floor, not a ceiling.&lt;/p&gt;

&lt;p&gt;The other caveats: these tests ran on a single machine, fresh database, warm connection pool, no competing load. Production has network latency, shared compute, background vacuum processes (TimescaleDB), and background part merges (ClickHouse). The 400ms ClickHouse ingestion artifact gets worse under real-world conditions with high-frequency small writes from multiple SDK clients simultaneously.&lt;/p&gt;

&lt;p&gt;MongoDB's metrics performance advantage at small scale comes with a storage cost that isn't visible in these benchmarks: MongoDB doesn't compress numeric time-series data the way TimescaleDB (using columnar compression) or ClickHouse (using Gorilla/Delta-Delta encoding) do. The same metrics dataset will use significantly more disk and RAM on MongoDB at production scale.&lt;/p&gt;

&lt;p&gt;The benchmark suite is in the repo if you want to run it against your own infrastructure with your own dataset shapes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why TimescaleDB Won 96% of Tests
&lt;/h2&gt;

&lt;p&gt;The summary from the benchmark runner:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;timescale     50 wins ( 96%)
clickhouse     0 wins (  0%)
mongodb        4 wins (  4%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Zero wins for ClickHouse isn't a bug in the benchmark; it's a reflection of the workload. Observability query patterns are point lookups, short time ranges, service filters, and dashboard aggregations. That's TimescaleDB's wheelhouse.&lt;/p&gt;

&lt;p&gt;ClickHouse excels at full-table analytics. When you're doing &lt;code&gt;SELECT service, sum(errors) FROM logs WHERE month = 'February' GROUP BY service&lt;/code&gt; across 500 million rows, ClickHouse will leave TimescaleDB behind. That query pattern doesn't dominate an observability dashboard. It dominates a data warehouse.&lt;/p&gt;

&lt;p&gt;We made the right call. But we're glad we have the numbers to prove it now.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;&lt;code&gt;@logtide/reservoir&lt;/code&gt; is open source&lt;/strong&gt;: TimescaleDB, ClickHouse, and MongoDB adapters ship in &lt;a href="https://github.com/logtide-dev/logtide" rel="noopener noreferrer"&gt;Logtide 0.8.0&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you run it against your own setup and get different results, open an issue. We'd genuinely like to know.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>database</category>
      <category>aws</category>
      <category>performance</category>
    </item>
    <item>
      <title>Logtide 0.8.0: Browser Observability, MongoDB Support, and Golden Signals</title>
      <dc:creator>Polliog</dc:creator>
      <pubDate>Sat, 14 Mar 2026 20:10:01 +0000</pubDate>
      <link>https://forem.com/polliog/logtide-080-browser-observability-mongodb-support-and-golden-signals-87i</link>
      <guid>https://forem.com/polliog/logtide-080-browser-observability-mongodb-support-and-golden-signals-87i</guid>
      <description>&lt;p&gt;Logtide 0.8.0 is out today. It's a release that started with a single promise from the 0.7.0 article: "full dashboard integration is the first thing on the 0.8.x list." We kept that promise, and then kept going.&lt;/p&gt;

&lt;p&gt;This is the release that closes three major open threads at once: browser observability, MongoDB support for &lt;code&gt;@logtide/reservoir&lt;/code&gt;, and Golden Signals with real percentile data. Plus a benchmark suite, smart project selectors, and enough performance work to make dashboards feel instant on large deployments.&lt;/p&gt;

&lt;p&gt;If you're new here: Logtide is an open-source log management and SIEM platform built for European SMBs. Privacy-first, self-hostable, GDPR-compliant. No Elastic cluster to babysit just Docker Compose and the storage engine of your choice.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🌐 &lt;strong&gt;Cloud&lt;/strong&gt;: &lt;a href="https://logtide.dev" rel="noopener noreferrer"&gt;logtide.dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;💻 &lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/logtide-dev/logtide" rel="noopener noreferrer"&gt;logtide-dev/logtide&lt;/a&gt; (345+ ⭐)&lt;/li&gt;
&lt;li&gt;📖 &lt;strong&gt;Docs&lt;/strong&gt;: &lt;a href="https://logtide.dev/docs" rel="noopener noreferrer"&gt;logtide.dev/docs&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What's New
&lt;/h2&gt;

&lt;h3&gt;
  
  
  🌐 Browser SDK: Observability for Your Frontend
&lt;/h3&gt;

&lt;p&gt;Backend observability was already solid. Browser instrumentation was the gap. 0.8.0 closes it with &lt;code&gt;@logtide/browser&lt;/code&gt;, a dedicated browser SDK built from the ground up, available as a drop-in addition to all existing framework packages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session tracking&lt;/strong&gt; assigns a &lt;code&gt;session_id&lt;/code&gt; to each browser tab via &lt;code&gt;sessionStorage&lt;/code&gt;. That ID flows through the full stack (SDK → ingestion → database column → reservoir layer → UI filter), so you can slice any view by session and see exactly what a user experienced before an error fired.&lt;/p&gt;
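&lt;p&gt;The mechanism is simple enough to sketch (the key name and helper below are assumptions, not the SDK's actual code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Per-tab session ID: sessionStorage is scoped to the tab, so each tab gets
// its own ID that survives reloads but not tab closes.
function getSessionId(storage: { getItem(k: string): string | null; setItem(k: string, v: string): void }): string {
  const KEY = 'logtide_session_id'  // assumed key name
  const existing = storage.getItem(KEY)
  if (existing !== null) return existing
  const id = crypto.randomUUID()
  storage.setItem(KEY, id)
  return id
}

// In the browser: const sessionId = getSessionId(window.sessionStorage)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;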

&lt;p&gt;&lt;strong&gt;Core Web Vitals&lt;/strong&gt; are collected automatically: LCP, INP, and CLS via the &lt;code&gt;web-vitals&lt;/code&gt; library, with a configurable sampling rate so you're not flooding your instance for low-traffic pages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Breadcrumbs&lt;/strong&gt; work on two axes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Click breadcrumbs&lt;/em&gt; use event delegation to track click and input interactions. &lt;code&gt;data-testid&lt;/code&gt; attributes are captured when present. Input values are never captured.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Network breadcrumbs&lt;/em&gt; patch &lt;code&gt;fetch&lt;/code&gt; and &lt;code&gt;XMLHttpRequest&lt;/code&gt; to record method, URL, status code, and duration. Query params are stripped by default; you can add a deny list for sensitive endpoints.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Offline resilience&lt;/strong&gt; wraps the transport layer with an &lt;code&gt;OfflineTransport&lt;/code&gt; that buffers logs and spans when connectivity drops (bounded queue, no unbounded memory growth), flushes on reconnect, and uses &lt;code&gt;sendBeacon&lt;/code&gt; on page unload so nothing is lost when the tab closes.&lt;/p&gt;
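&lt;p&gt;The bounded queue is the interesting part; a minimal Python sketch of the behaviour (illustrative, not the SDK's actual API):&lt;/p&gt;

```python
from collections import deque

class OfflineBuffer:
    """Sketch of a bounded offline buffer: when full, the oldest
    entry is dropped, so memory stays bounded however long the
    connection is down."""

    def __init__(self, max_size: int = 100):
        # deque with maxlen evicts the oldest item automatically
        self.queue = deque(maxlen=max_size)

    def enqueue(self, event: dict) -> None:
        self.queue.append(event)

    def flush(self, send) -> int:
        """On reconnect, drain the queue through `send`; returns the
        number of events sent."""
        sent = 0
        while self.queue:
            send(self.queue.popleft())
            sent += 1
        return sent
```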

&lt;p&gt;&lt;strong&gt;Source maps&lt;/strong&gt; ship with a new &lt;code&gt;@logtide/cli&lt;/code&gt; package and a &lt;code&gt;logtide sourcemaps upload&lt;/code&gt; command. Upload your build artifacts once, and stack frames in error reports automatically show the original file, line, column, and function name. You can toggle between minified and original frames directly in the UI.&lt;/p&gt;

&lt;p&gt;Each framework got targeted improvements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Next.js&lt;/strong&gt;: RSC error detection tagged with &lt;code&gt;mechanism: 'react.server-component'&lt;/code&gt;, route params from &lt;code&gt;__NEXT_DATA__&lt;/code&gt; in navigation breadcrumbs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nuxt&lt;/strong&gt;: &lt;code&gt;logtidePiniaPlugin&lt;/code&gt; for automatic Pinia action breadcrumbs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SvelteKit&lt;/strong&gt;: route context in &lt;code&gt;handleError&lt;/code&gt;, &lt;code&gt;createBoundaryHandler()&lt;/code&gt; for &lt;code&gt;&amp;lt;svelte:boundary&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Angular&lt;/strong&gt;: &lt;code&gt;NgZone&lt;/code&gt; context detection tagging errors as &lt;code&gt;angular.zone: 'inside'/'outside'&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Projects using the browser SDK automatically get two new dashboard tabs: &lt;strong&gt;Performance&lt;/strong&gt; (Web Vitals over time) and &lt;strong&gt;Sessions&lt;/strong&gt; (session-based filtering and replay context). The &lt;strong&gt;Capabilities API&lt;/strong&gt; (&lt;code&gt;GET /api/v1/projects/:id/capabilities&lt;/code&gt;) auto-detects whether a project has Web Vitals or Sessions data and shows those tabs only when relevant.&lt;/p&gt;




&lt;h3&gt;
  
  
  📈 Metrics Dashboard: The Dashboard We Promised in 0.7.0
&lt;/h3&gt;

&lt;p&gt;We shipped OTLP metrics ingestion in 0.7.0 with the store and API client ready but no visualization layer. 0.8.0 delivers it.&lt;/p&gt;

&lt;p&gt;The redesigned metrics page has two tabs: &lt;strong&gt;Overview&lt;/strong&gt; and &lt;strong&gt;Explorer&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Overview groups your metrics by service. Each service gets a card with a sparkline (ECharts), plus latest, avg, min, and max values at a glance. The cards cross-link to traces and logs: click a data point on a chart and you jump straight to the traces in that time window. Service selection and time range are in a persistent header that stays in sync with URL parameters.&lt;/p&gt;

&lt;p&gt;Under the hood, Overview is powered by pre-aggregated rollups rather than scanning raw data on every load:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TimescaleDB&lt;/strong&gt;: &lt;code&gt;metrics_hourly_stats&lt;/code&gt; and &lt;code&gt;metrics_daily_stats&lt;/code&gt; continuous aggregates with automatic refresh policies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ClickHouse&lt;/strong&gt;: &lt;code&gt;metrics_hourly_rollup&lt;/code&gt; and &lt;code&gt;metrics_daily_rollup&lt;/code&gt; materialized views&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MongoDB&lt;/strong&gt;: on-the-fly aggregation pipeline (no separate materialized views needed at this scale)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The query layer uses smart rollup routing: if your request is asking for 1h or 1d intervals with a compatible aggregation function, it hits the pre-aggregated table. Otherwise it falls back to raw data. You get dashboard speed without sacrificing query flexibility.&lt;/p&gt;
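&lt;p&gt;The routing decision itself is a small predicate. A Python sketch of the idea (the interval set, aggregation set, and table names here are assumptions for illustration):&lt;/p&gt;

```python
ROLLUP_INTERVALS = {"1h", "1d"}
# assumed set of aggregations the rollups can answer exactly
ROLLUP_AGGS = {"avg", "min", "max", "sum", "count"}

def resolve_source(interval: str, agg: str) -> str:
    """Route a metrics query to a pre-aggregated table when the
    request matches one, otherwise fall back to raw data."""
    if interval in ROLLUP_INTERVALS and agg in ROLLUP_AGGS:
        return f"metrics_{'hourly' if interval == '1h' else 'daily'}_rollup"
    return "metrics_raw"
```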




&lt;h3&gt;
  
  
  🍃 MongoDB Storage Adapter: &lt;code&gt;@logtide/reservoir&lt;/code&gt; Is Now a Tri-Engine System
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;@logtide/reservoir&lt;/code&gt; launched with TimescaleDB and ClickHouse. 0.8.0 adds the third engine: MongoDB.&lt;/p&gt;

&lt;p&gt;All 33 &lt;code&gt;StorageEngine&lt;/code&gt; interface methods are implemented: logs, spans, traces, metrics, and exemplars. The adapter ships with &lt;code&gt;MongoDBQueryTranslator&lt;/code&gt; for filter translation, a Docker Compose profile-gated MongoDB 7.0 service for local development, and full admin dashboard integration showing health status for all three engines.&lt;/p&gt;

&lt;p&gt;It also auto-detects MongoDB 5.0+ features: &lt;code&gt;$dateTrunc&lt;/code&gt; for time bucketing and native time-series collections when available, with fallback for older versions.&lt;/p&gt;
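&lt;p&gt;The pre-5.0 fallback amounts to truncating timestamps client-side instead of in the aggregation pipeline; a sketch of the hourly case:&lt;/p&gt;

```python
from datetime import datetime, timezone

def truncate_to_hour(ts: datetime) -> datetime:
    """Client-side equivalent of $dateTrunc with unit 'hour', for
    MongoDB servers older than 5.0 that lack the operator."""
    return ts.replace(minute=0, second=0, microsecond=0)
```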

&lt;p&gt;The adapter comes with 100 tests: 34 unit tests and 66 integration tests covering the full query surface.&lt;/p&gt;

&lt;p&gt;Practical compatibility: if you're running DocumentDB, FerretDB, or Cosmos DB in MongoDB compatibility mode, the adapter works with those too. The storage layer stays fully abstracted: swapping engines doesn't touch a line of application code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// reservoir.config.ts&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="nf"&gt;createStorageEngine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;mongodb&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;mongodb://localhost:27017/logtide&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;authSource&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;admin&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  📊 Golden Signals with Percentiles
&lt;/h3&gt;

&lt;p&gt;Rate, errors, duration: the golden signals of observability. Duration without percentiles is noise. 0.8.0 adds P50, P95, and P99 aggregation across all three storage engines.&lt;/p&gt;

&lt;p&gt;The new Golden Signals panel has dedicated charts for request rate, error rate, and latency percentiles side by side. The percentile aggregation implementation is engine-native: &lt;code&gt;percentile_cont&lt;/code&gt; on TimescaleDB, &lt;code&gt;quantile&lt;/code&gt; on ClickHouse, &lt;code&gt;$percentile&lt;/code&gt; on MongoDB. No application-level approximation.&lt;/p&gt;
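&lt;p&gt;For reference, &lt;code&gt;percentile_cont&lt;/code&gt;-style continuous percentiles interpolate linearly between neighbouring values rather than picking an existing one; the same calculation in a few lines of Python:&lt;/p&gt;

```python
def percentile_cont(values: list, p: float) -> float:
    """Continuous percentile with linear interpolation, matching the
    semantics of SQL's percentile_cont."""
    xs = sorted(values)
    if not xs:
        raise ValueError("empty input")
    rank = p * (len(xs) - 1)          # fractional rank into the sorted list
    lo, frac = int(rank), rank - int(rank)
    if frac == 0:
        return xs[lo]
    return xs[lo] + frac * (xs[lo + 1] - xs[lo])
```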

&lt;p&gt;You can filter by service name and additional attributes, and all three charts load in parallel.&lt;/p&gt;




&lt;h3&gt;
  
  
  Everything Else Worth Knowing
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Smart project selectors&lt;/strong&gt;: project dropdowns throughout the app now only show projects that actually have data in the relevant category. If a project has no traces, it won't appear in the traces page selector. A new &lt;code&gt;GET /api/v1/projects/data-availability&lt;/code&gt; endpoint powers this, with graceful fallback to all projects if the check fails.&lt;/p&gt;
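&lt;p&gt;The graceful-fallback pattern is worth a sketch (function names are illustrative, not Logtide's actual code): a failed availability check should never leave the user with an empty dropdown.&lt;/p&gt;

```python
def projects_for_selector(fetch_availability, all_projects: list, category: str) -> list:
    """Show only projects with data in `category`, degrading to the
    full list when the availability check fails -- a transient error
    should never empty the dropdown."""
    try:
        # hypothetical shape: {"proj-a": {"traces": True, "logs": False}}
        availability = fetch_availability()
    except Exception:
        return all_projects
    return [p for p in all_projects if availability.get(p, {}).get(category)]
```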

&lt;p&gt;&lt;strong&gt;Reservoir benchmark suite&lt;/strong&gt;: k6-based benchmarking scripts for ingestion and query workloads across all three engines. Seed up to 100k events per run. If you want to make an informed decision between TimescaleDB, ClickHouse, and MongoDB for your specific workload, this gives you a reproducible way to test it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Custom time range picker&lt;/strong&gt;: the &lt;code&gt;TimeRangePicker&lt;/code&gt; now supports arbitrary custom ranges, synced to URL parameters. Bookmark any time window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DSN copy on API key creation&lt;/strong&gt;: when you create a new API key, the dialog now shows the full DSN string (&lt;code&gt;https://KEY@host&lt;/code&gt;) ready to copy. One step instead of three.&lt;/p&gt;




&lt;h2&gt;
  
  
  Performance Work
&lt;/h2&gt;

&lt;p&gt;0.8.0 has more targeted performance work than any previous release. A few highlights:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TimescaleDB skip-scan via Recursive CTEs&lt;/strong&gt;: distinct queries on high-cardinality fields like &lt;code&gt;service&lt;/code&gt; were doing full table scans. Recursive CTEs implement the index skip-scan pattern PostgreSQL lacks natively, dropping execution time from minutes to milliseconds on large tables.&lt;/p&gt;
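&lt;p&gt;The skip-scan idea is easy to picture outside SQL: instead of walking every index entry, repeatedly seek to the first value greater than the last one found. A Python sketch with a sorted list standing in for the index:&lt;/p&gt;

```python
import bisect

def distinct_skip_scan(sorted_index: list) -> list:
    """Emulates index skip-scan: each loop iteration is one index
    seek (bisect) past all duplicates of the current value, instead
    of scanning every entry."""
    result = []
    pos = 0
    while pos < len(sorted_index):
        value = sorted_index[pos]
        result.append(value)
        pos = bisect.bisect_right(sorted_index, value)  # seek past duplicates
    return result
```

Three seeks instead of a 3,000-entry scan in the test below; on a real table with millions of rows per service, that is the minutes-to-milliseconds difference.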

&lt;p&gt;&lt;strong&gt;Dashboard intelligent optimization&lt;/strong&gt;: all three engines now support &lt;code&gt;countEstimate&lt;/code&gt; for approximate counts, bypassing heavy &lt;code&gt;COUNT(*)&lt;/code&gt; operations on high-volume projects. The dashboard loads instantly regardless of log volume.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MongoDB-specific&lt;/strong&gt;: &lt;code&gt;insertMany({ordered: false})&lt;/code&gt; for maximum write throughput, compound indexes matching actual query patterns, sparse indexes on nullable fields, atomic trace upsert with a single &lt;code&gt;bulkWrite&lt;/code&gt; (one network round trip), and cursor-based keyset pagination with &lt;code&gt;(time, id)&lt;/code&gt; tuples for consistent pagination under concurrent writes.&lt;/p&gt;
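&lt;p&gt;Keyset pagination deserves a quick sketch, since it's the piece that keeps pages consistent under concurrent writes (illustrative Python over in-memory rows, not the adapter's code):&lt;/p&gt;

```python
def keyset_page(rows, after, limit):
    """Fetch the next page strictly after the (time, id) cursor.
    Unlike OFFSET, the cursor is a position in key order, so results
    stay consistent when new rows are inserted concurrently."""
    ordered = sorted(rows, key=lambda r: (r["time"], r["id"]))
    if after is not None:
        ordered = [r for r in ordered if (r["time"], r["id"]) > after]
    return ordered[:limit]
```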

&lt;p&gt;&lt;strong&gt;Capabilities detection&lt;/strong&gt;: reduced the scanning range from 7 days to 24 hours for Web Vitals and Sessions detection, making the initial project dashboard load instant.&lt;/p&gt;




&lt;h2&gt;
  
  
  Upgrading
&lt;/h2&gt;

&lt;p&gt;No breaking changes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose pull
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Redis-free deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose &lt;span class="nt"&gt;-f&lt;/span&gt; docker-compose.simple.yml pull
docker compose &lt;span class="nt"&gt;-f&lt;/span&gt; docker-compose.simple.yml up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To use the MongoDB adapter, enable the profile in your Compose setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose &lt;span class="nt"&gt;--profile&lt;/span&gt; mongodb up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;0.8.0 closes the observability foundation. What's left before v1.0 (our beta milestone):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Log parsing pipelines&lt;/strong&gt; (#152): structured extraction for syslog, legacy formats, and custom patterns without writing VRL transforms by hand&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Webhook receivers&lt;/strong&gt; (#154): ingest external events from GitHub, PagerDuty, Stripe, and others without custom code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proactive health monitoring&lt;/strong&gt; (#151): status pages built from the data already in Logtide, with uptime history and alerting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scheduled digest reports&lt;/strong&gt; (#153): weekly email summaries of error trends, anomalies, and key metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The query abstraction layer is also a candidate for extraction as a standalone open-source library. If you have thoughts on that, open a discussion.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Full Changelog&lt;/strong&gt;: &lt;a href="https://github.com/logtide-dev/logtide/compare/v0.7.0...v0.8.0" rel="noopener noreferrer"&gt;v0.7.0...v0.8.0&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Star the project, open an issue, or just try it: the Docker setup takes about 5 minutes.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>devops</category>
      <category>monitoring</category>
      <category>discuss</category>
    </item>
    <item>
      <title>PostgreSQL as a Vector Database: When to Use pgvector vs Pinecone vs Weaviate</title>
      <dc:creator>Polliog</dc:creator>
      <pubDate>Wed, 04 Mar 2026 11:12:38 +0000</pubDate>
      <link>https://forem.com/polliog/postgresql-as-a-vector-database-when-to-use-pgvector-vs-pinecone-vs-weaviate-4kfi</link>
      <guid>https://forem.com/polliog/postgresql-as-a-vector-database-when-to-use-pgvector-vs-pinecone-vs-weaviate-4kfi</guid>
      <description>&lt;p&gt;"Should we use PostgreSQL as our vector database?"&lt;/p&gt;

&lt;p&gt;I've heard this question &lt;strong&gt;a lot&lt;/strong&gt; in 2026. pgvector is everywhere. Every Postgres instance now has vector search capabilities.&lt;/p&gt;

&lt;p&gt;But is it &lt;strong&gt;actually&lt;/strong&gt; better than Pinecone or Weaviate?&lt;/p&gt;

&lt;p&gt;I tested all three with 10 million vectors (1536 dimensions, OpenAI embeddings). Here's what I found.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Vector Database Landscape in 2026
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Quick summary:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pinecone&lt;/strong&gt;: Fully managed, serverless, 70% market share&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weaviate&lt;/strong&gt;: Hybrid search (vectors + BM25), open-source&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pgvector&lt;/strong&gt;: PostgreSQL extension, ACID compliance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The big shift in 2026:&lt;/strong&gt; pgvector is no longer "the slow option."&lt;/p&gt;

&lt;p&gt;With pgvectorscale (Timescale's addition), PostgreSQL now delivers &lt;strong&gt;471 QPS at 99% recall&lt;/strong&gt; on 50M vectors. That's &lt;strong&gt;11.4x better than Qdrant&lt;/strong&gt; and competitive with Pinecone.&lt;/p&gt;

&lt;p&gt;Let's break this down.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Even Is a Vector Database?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Vectors&lt;/strong&gt; are just arrays of numbers that represent meaning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Text → Vector embedding
&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;machine learning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.41&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.88&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...,&lt;/span&gt; &lt;span class="mf"&gt;0.15&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# 1536 dimensions
&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deep learning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;    &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.21&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.39&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.91&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...,&lt;/span&gt; &lt;span class="mf"&gt;0.13&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Similar vector!
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Vector databases&lt;/strong&gt; let you find &lt;strong&gt;similar&lt;/strong&gt; vectors fast:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Find documents similar to "machine learning"&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; 
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'[0.23, -0.41, ...]'&lt;/span&gt; 
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Use cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RAG (Retrieval-Augmented Generation)&lt;/strong&gt;: Give LLMs relevant context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic search&lt;/strong&gt;: "Find products like this"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recommendations&lt;/strong&gt;: "Users similar to you also liked..."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anomaly detection&lt;/strong&gt;: "This behavior is unusual"&lt;/li&gt;
&lt;/ul&gt;
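&lt;p&gt;"Similar" here usually means cosine similarity: how closely two vectors point in the same direction, ignoring magnitude. A minimal implementation:&lt;/p&gt;

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity: 1.0 for vectors pointing the same way,
    0.0 for orthogonal ones. (pgvector's cosine *distance* operator
    returns the complement, 1 - similarity.)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```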




&lt;h2&gt;
  
  
  pgvector: PostgreSQL's Vector Extension
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What It Is
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;pgvector&lt;/strong&gt; is an extension that adds vector data types and similarity search to PostgreSQL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;SERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;-- OpenAI embedding size&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Create HNSW index for fast similarity search&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;hnsw&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector_cosine_ops&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The 2026 Performance Revolution
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Before 2025:&lt;/strong&gt; pgvector was slow. "Use it for &amp;lt;1M vectors, then switch to Pinecone."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;March 2026:&lt;/strong&gt; pgvector + &lt;strong&gt;pgvectorscale&lt;/strong&gt; (from Timescale) changed everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benchmark results&lt;/strong&gt; (50M vectors, 1536 dims, 99% recall):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;pgvectorscale&lt;/strong&gt;: 471 QPS, &lt;strong&gt;p95 latency: 28ms&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pinecone s1&lt;/strong&gt;: 471 QPS, p95 latency: &lt;strong&gt;784ms&lt;/strong&gt; (28x slower)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Qdrant&lt;/strong&gt;: 41 QPS (11.4x slower than pgvector)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Source: &lt;a href="https://www.firecrawl.dev/blog/best-vector-databases" rel="noopener noreferrer"&gt;Timescale benchmarks, May 2025&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Features (pgvector 0.8.0, March 2026)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. HNSW Index&lt;/strong&gt; (Hierarchical Navigable Small World)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-layer graph for fast approximate search&lt;/li&gt;
&lt;li&gt;Sub-millisecond latency at high recall&lt;/li&gt;
&lt;li&gt;Configurable &lt;code&gt;m&lt;/code&gt; (connections) and &lt;code&gt;ef_construction&lt;/code&gt; (quality)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Iterative Scan&lt;/strong&gt; (New in 0.8.0)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fixes "overfiltering" problem with metadata filters&lt;/li&gt;
&lt;li&gt;Returns &lt;strong&gt;complete&lt;/strong&gt; result sets (not partial)&lt;/li&gt;
&lt;li&gt;5.7x query performance improvement over 0.7.4&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. ACID Transactions&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full transactional guarantees&lt;/li&gt;
&lt;li&gt;Rollback support&lt;/li&gt;
&lt;li&gt;Consistency for vectors + relational data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. SQL Integration&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Combine vector search with JOINs, WHERE clauses, CTEs&lt;/li&gt;
&lt;li&gt;No context switching between databases&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Limitations
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Single-Node Scaling&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tested reliably up to &lt;strong&gt;10-50M vectors&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Beyond that, you need sharding (Citus, manual partitioning)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Infrastructure Requirements at Scale&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&amp;gt;10M vectors&lt;/strong&gt; requires optimized hardware:

&lt;ul&gt;
&lt;li&gt;Fast NVMe SSDs (index I/O is critical)&lt;/li&gt;
&lt;li&gt;High RAM (32-64GB+ for index caching)&lt;/li&gt;
&lt;li&gt;Parameter tuning (&lt;code&gt;m&lt;/code&gt;, &lt;code&gt;ef_search&lt;/code&gt;, &lt;code&gt;maintenance_work_mem&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;"Zero ops" is misleading at scale you'll spend time on infrastructure&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. No Built-In Embedding Models&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You bring your own embeddings&lt;/li&gt;
&lt;li&gt;Pinecone/Weaviate have hosted inference&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Performance Degrades with High Write Volume&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HNSW rebuild overhead&lt;/li&gt;
&lt;li&gt;Less optimized than purpose-built vector DBs&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Pinecone: The Managed Leader
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What It Is
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pinecone&lt;/strong&gt; is a fully managed, serverless vector database. No infrastructure, no ops.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pinecone&lt;/span&gt;

&lt;span class="n"&gt;pinecone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pinecone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Upsert vectors
&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vectors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.41&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...],&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.21&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.39&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...],&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Query
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.41&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...],&lt;/span&gt;
    &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pricing (2026 Serverless)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cost Type&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.33/GB/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Read Units&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$8.25 per 1M reads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Write Units&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$2.00 per 1M writes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Minimum&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$50/month (Standard), $500/month (Enterprise)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt; (5M vectors, 500K queries/month):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Storage: 10GB × $0.33 = &lt;strong&gt;$3.30/month&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Reads: 500K × $8.25/M = &lt;strong&gt;$4.13/month&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Writes: 50K × $2/M = &lt;strong&gt;$0.10/month&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total: ~$58/month&lt;/strong&gt; (including minimum)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Gotcha:&lt;/strong&gt; Read units are &lt;strong&gt;unpredictable&lt;/strong&gt;. A query with metadata filters can consume &lt;strong&gt;5-10 read units&lt;/strong&gt;, not 1.&lt;/p&gt;
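&lt;p&gt;The example above is easy to sanity-check with a tiny calculator (treating the plan minimum as a base fee, as the total above does; read-unit amplification from filters is ignored, so this is a lower bound):&lt;/p&gt;

```python
def pinecone_monthly_cost(storage_gb: float, reads: int, writes: int,
                          minimum: float = 50.0) -> float:
    """Back-of-envelope serverless cost at the 2026 rates quoted
    above. Metadata-filter read amplification (5-10 units per query)
    is deliberately ignored, so treat the result as a lower bound."""
    usage = (storage_gb * 0.33            # storage, $/GB/month
             + reads / 1e6 * 8.25         # read units, $/1M
             + writes / 1e6 * 2.00)       # write units, $/1M
    return minimum + usage
```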

&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;p&gt;✅ &lt;strong&gt;Zero ops&lt;/strong&gt; - No servers, no tuning, no maintenance&lt;br&gt;
✅ &lt;strong&gt;Auto-scaling&lt;/strong&gt; - Handle traffic spikes automatically&lt;br&gt;
✅ &lt;strong&gt;Compliance&lt;/strong&gt; - SOC 2, HIPAA, GDPR out-of-the-box&lt;br&gt;
✅ &lt;strong&gt;Consistent latency&lt;/strong&gt; - 20-100ms p95 (production-ready)&lt;br&gt;
✅ &lt;strong&gt;Hosted embeddings&lt;/strong&gt; - Pinecone Inference for models&lt;/p&gt;
&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;p&gt;❌ &lt;strong&gt;Expensive at scale&lt;/strong&gt; - Above 10M vectors, costs escalate&lt;br&gt;
❌ &lt;strong&gt;Vendor lock-in&lt;/strong&gt; - Proprietary API, migration is painful&lt;br&gt;
❌ &lt;strong&gt;Read unit unpredictability&lt;/strong&gt; - Hard to forecast costs&lt;br&gt;
❌ &lt;strong&gt;No ACID transactions&lt;/strong&gt; - Purpose-built, not general DB&lt;/p&gt;


&lt;h2&gt;
  
  
  Weaviate: The Hybrid Search Specialist
&lt;/h2&gt;
&lt;h3&gt;
  
  
  What It Is
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Weaviate&lt;/strong&gt; combines &lt;strong&gt;vector similarity&lt;/strong&gt; with &lt;strong&gt;BM25 keyword search&lt;/strong&gt;. Best for semantic + keyword hybrid retrieval.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;weaviate&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weaviate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://your-cluster.weaviate.network&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Hybrid search (vector + keyword)
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_hybrid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;machine learning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.75&lt;/span&gt;  &lt;span class="c1"&gt;# 75% vector, 25% BM25
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;do&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
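&lt;p&gt;&lt;em&gt;(The snippet above uses the v3 Python client; the v4 client exposes hybrid search through a different API.)&lt;/em&gt; Conceptually, &lt;code&gt;alpha&lt;/code&gt; blends two normalized score lists per document. Here is a minimal sketch of relative-score fusion, an illustration of the idea rather than Weaviate's exact internals:&lt;/p&gt;

```python
# Conceptual sketch of hybrid score fusion, NOT Weaviate's exact internals:
# alpha weights the (min-max normalized) vector score against the BM25 score.

def fuse(vector_scores, bm25_scores, alpha=0.75):
    """Blend per-document scores: alpha * vector + (1 - alpha) * bm25."""
    def normalize(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    v, k = normalize(vector_scores), normalize(bm25_scores)
    docs = set(v) | set(k)
    return {d: alpha * v.get(d, 0.0) + (1 - alpha) * k.get(d, 0.0) for d in docs}

# Document "a" wins on vector similarity, "b" on BM25; alpha=0.75 favors "a".
ranked = fuse({"a": 0.9, "b": 0.5}, {"a": 2.1, "b": 7.3})
```

&lt;p&gt;With &lt;code&gt;alpha=1.0&lt;/code&gt; this degenerates to pure vector search, and with &lt;code&gt;alpha=0.0&lt;/code&gt; to pure BM25, which is a handy way to debug which side of the hybrid is pulling in bad results.&lt;/p&gt;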



&lt;h3&gt;
  
  
  Pricing (2026 Shared Cloud)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Plan&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Flex&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$45/month&lt;/td&gt;
&lt;td&gt;Shared, HA, 99.5% uptime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Plus&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$280/month (annual)&lt;/td&gt;
&lt;td&gt;Shared or Dedicated, 99.9% uptime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Premium&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom&lt;/td&gt;
&lt;td&gt;Dedicated, 99.95% uptime, HIPAA&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Pricing dimensions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vector dimensions&lt;/strong&gt;: $0.095 per 1M dimensions/month (Standard tier)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt;: Variable by region/compression&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backups&lt;/strong&gt;: Based on volume&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt; (10M vectors, 1536 dims):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dimensions: 10M × 1536 = 15.36B dims&lt;/li&gt;
&lt;li&gt;Cost: 15,360M dims × $0.095/M = &lt;strong&gt;~$1,459/month&lt;/strong&gt; (ballpark)&lt;/li&gt;
&lt;/ul&gt;
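&lt;p&gt;The arithmetic in that example, as a quick sanity check:&lt;/p&gt;

```python
# Back-of-envelope check of the dimension-based pricing above
# ($0.095 per 1M stored dimensions/month, Standard tier).
vectors = 10_000_000
dims = 1536
rate_per_million_dims = 0.095

total_dims = vectors * dims                         # 15.36B dimensions
monthly = total_dims / 1_000_000 * rate_per_million_dims
print(f"${monthly:,.0f}/month")                     # ~$1,459/month
```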

&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;p&gt;✅ &lt;strong&gt;Hybrid search&lt;/strong&gt; - Combine semantic + keyword (unique strength)&lt;br&gt;
✅ &lt;strong&gt;Multi-modal&lt;/strong&gt; - Text, images, audio in one index&lt;br&gt;
✅ &lt;strong&gt;Open-source&lt;/strong&gt; - Self-host option for full control&lt;br&gt;
✅ &lt;strong&gt;GraphQL API&lt;/strong&gt; - Powerful filtering/aggregation&lt;br&gt;
✅ &lt;strong&gt;BM25 built-in&lt;/strong&gt; - No separate keyword index needed&lt;/p&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;p&gt;❌ &lt;strong&gt;Complexity&lt;/strong&gt; - Steeper learning curve than Pinecone&lt;br&gt;
❌ &lt;strong&gt;Higher costs&lt;/strong&gt; - Vector dimensions pricing scales fast&lt;br&gt;
❌ &lt;strong&gt;Self-hosting burden&lt;/strong&gt; - Need DevOps for production&lt;br&gt;
❌ &lt;strong&gt;No ACID&lt;/strong&gt; - Like Pinecone, not a general database&lt;/p&gt;




&lt;h2&gt;
  
  
  Head-to-Head Benchmark (10M Vectors)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Test setup:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dataset&lt;/strong&gt;: 10M vectors, 1536 dimensions (OpenAI &lt;code&gt;text-embedding-3-small&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hardware&lt;/strong&gt;: AWS r6g.xlarge (4 vCPU, 32GB RAM)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query&lt;/strong&gt;: Top-10 similarity search, 95% recall target&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filters&lt;/strong&gt;: 20% of queries with metadata filters (&lt;code&gt;category = 'X'&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
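&lt;p&gt;Recall@10 in a benchmark like this is typically measured by comparing each approximate result set against exact (brute-force) nearest neighbors. A minimal sketch of the metric:&lt;/p&gt;

```python
# Recall@k: fraction of the exact top-k neighbors that the ANN index returned.
def recall_at_k(approx_ids, exact_ids):
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# e.g. the index returned 9 of the 10 true neighbors:
r = recall_at_k([1, 2, 3, 4, 5, 6, 7, 8, 9, 99], list(range(1, 11)))
```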

&lt;h3&gt;
  
  
  Query Latency
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Database&lt;/th&gt;
&lt;th&gt;P50 Latency&lt;/th&gt;
&lt;th&gt;P95 Latency&lt;/th&gt;
&lt;th&gt;P99 Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;pgvector 0.8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;22ms&lt;/td&gt;
&lt;td&gt;78ms&lt;/td&gt;
&lt;td&gt;142ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pinecone (serverless)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;45ms&lt;/td&gt;
&lt;td&gt;103ms&lt;/td&gt;
&lt;td&gt;187ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Weaviate (shared)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;38ms&lt;/td&gt;
&lt;td&gt;95ms&lt;/td&gt;
&lt;td&gt;168ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Winner&lt;/strong&gt;: pgvector (fastest p50/p95)&lt;/p&gt;

&lt;h3&gt;
  
  
  Write Throughput (Inserts/sec)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Database&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;pgvector&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8,500/sec&lt;/td&gt;
&lt;td&gt;HNSW rebuild overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pinecone&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;12,000/sec&lt;/td&gt;
&lt;td&gt;Write units charged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Weaviate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;10,200/sec&lt;/td&gt;
&lt;td&gt;Depends on compression&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Winner&lt;/strong&gt;: Pinecone (optimized for writes)&lt;/p&gt;

&lt;h3&gt;
  
  
  Storage Efficiency
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Database&lt;/th&gt;
&lt;th&gt;Raw Size&lt;/th&gt;
&lt;th&gt;Compressed&lt;/th&gt;
&lt;th&gt;Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;pgvector&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;60 GB&lt;/td&gt;
&lt;td&gt;60 GB&lt;/td&gt;
&lt;td&gt;1x (no compression)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pinecone&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;~55 GB&lt;/td&gt;
&lt;td&gt;Proprietary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Weaviate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;62 GB&lt;/td&gt;
&lt;td&gt;42 GB&lt;/td&gt;
&lt;td&gt;1.5x (with PQ)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Winner&lt;/strong&gt;: Weaviate (best compression)&lt;/p&gt;

&lt;h3&gt;
  
  
  Recall @ Top-10
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Database&lt;/th&gt;
&lt;th&gt;Recall&lt;/th&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;pgvector&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;95.2%&lt;/td&gt;
&lt;td&gt;HNSW m=16, ef_search=40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pinecone&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;96.1%&lt;/td&gt;
&lt;td&gt;Default&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Weaviate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;95.8%&lt;/td&gt;
&lt;td&gt;Default HNSW&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Winner&lt;/strong&gt;: Pinecone (slightly better recall)&lt;/p&gt;




&lt;h2&gt;
  
  
  Cost Comparison (Real Production Numbers)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: 10M vectors, 1536 dims, 500K queries/month&lt;/p&gt;

&lt;h3&gt;
  
  
  pgvector (Self-Hosted)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AWS RDS PostgreSQL&lt;/strong&gt; (r6g.xlarge): $180/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt; (60GB): $14/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backups&lt;/strong&gt;: $8/month&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: $202/month&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Plus:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DevOps time: ~4 hours/month (monitoring, updates)&lt;/li&gt;
&lt;li&gt;No per-query costs&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pinecone (Serverless)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt; (55GB): $18/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reads&lt;/strong&gt; (500K): $41/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Writes&lt;/strong&gt; (50K): $1/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total: $60/month&lt;/strong&gt; (but scales with usage)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;At 5M queries/month:&lt;/strong&gt; ~$430/month&lt;/p&gt;

&lt;h3&gt;
  
  
  Weaviate (Shared Cloud, Plus Plan)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Base plan&lt;/strong&gt;: $280/month (annual)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector dimensions&lt;/strong&gt; (15.36B): ~$1,459/month&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: ~$1,740/month&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;(Note: Pricing varies by region/compression; this is approximate)&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost Winner
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Small scale (&amp;lt;1M vectors, &amp;lt;100K queries/month):&lt;/strong&gt; Pinecone&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medium scale (1-10M vectors, 500K queries/month):&lt;/strong&gt; pgvector&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large scale (10M+ vectors, high query volume):&lt;/strong&gt; Pinecone or self-hosted Weaviate&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  When to Use Each
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Choose &lt;strong&gt;pgvector&lt;/strong&gt; if:
&lt;/h3&gt;

&lt;p&gt;✅ You &lt;strong&gt;already run PostgreSQL&lt;/strong&gt; (zero new infrastructure)&lt;br&gt;
✅ You need &lt;strong&gt;ACID transactions&lt;/strong&gt; (vectors + relational data)&lt;br&gt;
✅ You want &lt;strong&gt;SQL flexibility&lt;/strong&gt; (JOINs, complex queries)&lt;br&gt;
✅ Cost matters (self-hosted, &lt;strong&gt;~75% cheaper than Pinecone&lt;/strong&gt;)&lt;br&gt;
✅ Your scale is &lt;strong&gt;&amp;lt;10M vectors&lt;/strong&gt; (proven sweet spot)&lt;br&gt;
✅ You have &lt;strong&gt;DevOps capacity&lt;/strong&gt; for Postgres management&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚠️ Reality check for &amp;gt;10M vectors:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You'll need &lt;strong&gt;optimized hardware&lt;/strong&gt; (fast NVMe SSDs, 32-64GB RAM)&lt;/li&gt;
&lt;li&gt;Expect to tune &lt;strong&gt;HNSW parameters&lt;/strong&gt; (&lt;code&gt;m&lt;/code&gt;, &lt;code&gt;ef_search&lt;/code&gt;, &lt;code&gt;maintenance_work_mem&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Index builds can take &lt;strong&gt;hours&lt;/strong&gt; on large datasets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pinecone's "zero ops" sometimes justifies the cost&lt;/strong&gt; to avoid these infrastructure headaches&lt;/li&gt;
&lt;/ul&gt;
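&lt;p&gt;The tuning knobs above map directly to SQL. A sketch that builds the relevant statements; the values for &lt;code&gt;m&lt;/code&gt;, &lt;code&gt;ef_construction&lt;/code&gt;, &lt;code&gt;ef_search&lt;/code&gt;, and &lt;code&gt;maintenance_work_mem&lt;/code&gt; are illustrative starting points, not recommendations for your workload:&lt;/p&gt;

```python
# Illustrative pgvector HNSW tuning: generate the DDL/SET statements.
# Parameter values here are example starting points only.

def hnsw_setup(table="documents", column="embedding",
               m=16, ef_construction=64, ef_search=40,
               maintenance_work_mem="8GB"):
    return [
        # A larger maintenance_work_mem speeds up the initial index build.
        f"SET maintenance_work_mem = '{maintenance_work_mem}';",
        f"CREATE INDEX ON {table} USING hnsw ({column} vector_cosine_ops) "
        f"WITH (m = {m}, ef_construction = {ef_construction});",
        # Per-session recall/speed trade-off at query time.
        f"SET hnsw.ef_search = {ef_search};",
    ]

statements = hnsw_setup()
```

&lt;p&gt;Raising &lt;code&gt;ef_search&lt;/code&gt; buys recall at the cost of latency, which is the knob you will touch most often after the index exists.&lt;/p&gt;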

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Startups with existing Postgres infrastructure&lt;/li&gt;
&lt;li&gt;Apps needing vectors + relational data together&lt;/li&gt;
&lt;li&gt;Cost-sensitive projects (&amp;lt;10M vectors)&lt;/li&gt;
&lt;li&gt;RAG with SQL-heavy data pipelines&lt;/li&gt;
&lt;/ul&gt;
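&lt;p&gt;The "vectors + relational data" point is where pgvector genuinely differs from the dedicated engines: a permission-aware similarity search is one SQL statement. Table and column names below are hypothetical:&lt;/p&gt;

```python
# Hypothetical schema: documents(id, content, embedding) and
# permissions(doc_id, user_id). One statement combines the permission
# JOIN with the ANN search ("<=>" is pgvector's cosine distance operator).
PERMISSION_AWARE_SEARCH = """
SELECT d.id, d.content, d.embedding <=> %(query_vec)s AS distance
FROM documents d
JOIN permissions p ON p.doc_id = d.id
WHERE p.user_id = %(user_id)s
ORDER BY d.embedding <=> %(query_vec)s
LIMIT 10;
"""
```

&lt;p&gt;With Pinecone or Weaviate this typically becomes two systems to keep in sync: permissions denormalized into vector metadata, plus application logic to enforce them.&lt;/p&gt;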




&lt;h3&gt;
  
  
  Choose &lt;strong&gt;Pinecone&lt;/strong&gt; if:
&lt;/h3&gt;

&lt;p&gt;✅ You want &lt;strong&gt;zero ops&lt;/strong&gt; (fully managed, no servers)&lt;br&gt;
✅ You need to &lt;strong&gt;ship fast&lt;/strong&gt; (production in hours, not weeks)&lt;br&gt;
✅ &lt;strong&gt;Compliance&lt;/strong&gt; is non-negotiable (SOC 2, HIPAA built-in)&lt;br&gt;
✅ You don't have &lt;strong&gt;DevOps capacity&lt;/strong&gt;&lt;br&gt;
✅ &lt;strong&gt;Consistent latency&lt;/strong&gt; is critical (p95 &amp;lt;100ms guaranteed)&lt;br&gt;
✅ Your scale is &lt;strong&gt;unpredictable&lt;/strong&gt; (auto-scaling)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enterprises with strict compliance requirements&lt;/li&gt;
&lt;li&gt;Teams without infrastructure expertise&lt;/li&gt;
&lt;li&gt;MVP/prototype that needs to scale fast&lt;/li&gt;
&lt;li&gt;Apps with variable traffic (serverless shines)&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Choose &lt;strong&gt;Weaviate&lt;/strong&gt; if:
&lt;/h3&gt;

&lt;p&gt;✅ You need &lt;strong&gt;hybrid search&lt;/strong&gt; (vectors + BM25 keywords)&lt;br&gt;
✅ You're working with &lt;strong&gt;multi-modal data&lt;/strong&gt; (text, images, audio)&lt;br&gt;
✅ You want &lt;strong&gt;open-source&lt;/strong&gt; with self-host option&lt;br&gt;
✅ &lt;strong&gt;Advanced filtering&lt;/strong&gt; is critical (payload-aware HNSW)&lt;br&gt;
✅ You need &lt;strong&gt;GraphQL&lt;/strong&gt; for complex queries&lt;br&gt;
✅ Cost predictability matters (resource-based, not per-query)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enterprise RAG with strict data sovereignty&lt;/li&gt;
&lt;li&gt;Multi-modal AI applications&lt;/li&gt;
&lt;li&gt;Teams with strong DevOps (self-hosted)&lt;/li&gt;
&lt;li&gt;Apps needing semantic + keyword search&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Decision Matrix
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Requirement&lt;/th&gt;
&lt;th&gt;pgvector&lt;/th&gt;
&lt;th&gt;Pinecone&lt;/th&gt;
&lt;th&gt;Weaviate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Zero ops&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅ (managed cloud) / ❌ (self-hosted)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ACID transactions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hybrid search&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost (&amp;lt;10M vectors)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Compliance (built-in)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SQL integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-modal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalability (&amp;gt;50M)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Migration Paths
&lt;/h2&gt;

&lt;h3&gt;
  
  
  From Pinecone → pgvector
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt; Cost savings (75%+ reduction)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Steps:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Export vectors from Pinecone (use their API)&lt;/li&gt;
&lt;li&gt;Load into PostgreSQL with COPY&lt;/li&gt;
&lt;li&gt;Create HNSW index&lt;/li&gt;
&lt;li&gt;Dual-write during transition&lt;/li&gt;
&lt;li&gt;Cutover reads incrementally&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Gotcha:&lt;/strong&gt; Pinecone's metadata filters → PostgreSQL WHERE clauses&lt;/p&gt;
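&lt;p&gt;That filter translation can be mechanical for the common operators. A minimal sketch handling only flat Pinecone-style &lt;code&gt;$eq&lt;/code&gt;/&lt;code&gt;$in&lt;/code&gt; filters; nested &lt;code&gt;$and&lt;/code&gt;/&lt;code&gt;$or&lt;/code&gt; conditions need porting by hand:&lt;/p&gt;

```python
# Translate a flat Pinecone-style metadata filter into a parameterized
# WHERE clause. Only $eq and $in are handled; anything else raises.
def filter_to_where(pinecone_filter):
    clauses, params = [], []
    for field, cond in pinecone_filter.items():
        op, value = next(iter(cond.items()))
        if op == "$eq":
            clauses.append(f"{field} = %s")
            params.append(value)
        elif op == "$in":
            clauses.append(f"{field} = ANY(%s)")
            params.append(list(value))
        else:
            raise ValueError(f"port {op} by hand")
    return " AND ".join(clauses), params

where, params = filter_to_where({"category": {"$eq": "X"}})
```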

&lt;h3&gt;
  
  
  From pgvector → Pinecone
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt; Scale beyond 10M vectors, reduce ops burden&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Steps:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Export from PostgreSQL&lt;/li&gt;
&lt;li&gt;Upsert to Pinecone (batch API)&lt;/li&gt;
&lt;li&gt;Dual-write during transition&lt;/li&gt;
&lt;li&gt;Validate recall/latency&lt;/li&gt;
&lt;li&gt;Cutover&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Gotcha:&lt;/strong&gt; SQL queries → Pinecone API calls (rewrite logic)&lt;/p&gt;
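&lt;p&gt;Step 2's upserts are usually chunked client-side, since Pinecone caps request sizes. A sketch of the batching, with the actual Pinecone call stubbed out because it needs a live index (200 is an illustrative batch size, not an official limit):&lt;/p&gt;

```python
# Chunk (id, vector, metadata) rows for batch upserts. Clients typically
# send a few hundred vectors per request; 200 here is illustrative.
def batches(rows, size=200):
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

rows = [(str(i), [0.0] * 1536, {"category": "X"}) for i in range(450)]
sizes = [len(b) for b in batches(rows)]

# With a live index, each batch would go through something like:
#   index.upsert(vectors=[{"id": i, "values": v, "metadata": m}
#                         for i, v, m in batch])
```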




&lt;h2&gt;
  
  
  Real-World Use Cases (2026)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  RAG for Customer Support (Weaviate)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Company&lt;/strong&gt;: Neople (game publisher)&lt;br&gt;
&lt;strong&gt;Scale&lt;/strong&gt;: 5M support documents&lt;br&gt;
&lt;strong&gt;Why Weaviate&lt;/strong&gt;: Hybrid search (semantic + keyword) for accurate retrieval&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Results:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;40% reduction in false positives&lt;/li&gt;
&lt;li&gt;Sub-50ms query latency&lt;/li&gt;
&lt;li&gt;Native BM25 for exact phrase matching&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Internal Document Search (pgvector)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Company&lt;/strong&gt;: Mid-size SaaS ($15M ARR)&lt;br&gt;
&lt;strong&gt;Scale&lt;/strong&gt;: 2M documents&lt;br&gt;
&lt;strong&gt;Why pgvector&lt;/strong&gt;: Already running Postgres, needed ACID&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Results:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$0 new infrastructure costs&lt;/li&gt;
&lt;li&gt;Combined vector search with user permissions (SQL JOINs)&lt;/li&gt;
&lt;li&gt;95% recall, &amp;lt;100ms p95 latency&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Product Recommendations (Pinecone)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Company&lt;/strong&gt;: E-commerce (50M products)&lt;br&gt;
&lt;strong&gt;Scale&lt;/strong&gt;: 50M vectors&lt;br&gt;
&lt;strong&gt;Why Pinecone&lt;/strong&gt;: Auto-scaling for Black Friday traffic&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Results:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero ops overhead&lt;/li&gt;
&lt;li&gt;Handled 100K QPS spike (auto-scaled)&lt;/li&gt;
&lt;li&gt;99.9% uptime SLA&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2026 Industry Trends
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. pgvector adoption exploded&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every major Postgres hosting platform now supports pgvector&lt;/li&gt;
&lt;li&gt;Supabase, Neon, Timescale, AWS RDS, GCP Cloud SQL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Hybrid search is table-stakes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Weaviate's BM25 + vector is now expected&lt;/li&gt;
&lt;li&gt;Pinecone added sparse vectors (SPLADE) in 2024&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Serverless won&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pinecone eliminated "pod management" in 2024&lt;/li&gt;
&lt;li&gt;Pay-as-you-go is the default&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Compliance pressure&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GDPR, HIPAA, SOC 2 are non-negotiable for enterprise&lt;/li&gt;
&lt;li&gt;Pinecone/Weaviate have certifications; pgvector = DIY&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Myths Debunked
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Myth 1:&lt;/strong&gt; "pgvector is too slow for production"&lt;br&gt;
&lt;strong&gt;Truth:&lt;/strong&gt; pgvectorscale delivers 471 QPS at 99% recall (competitive with Pinecone)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Myth 2:&lt;/strong&gt; "Purpose-built vector DBs are always faster"&lt;br&gt;
&lt;strong&gt;Truth:&lt;/strong&gt; At &amp;lt;10M vectors, pgvector matches or beats dedicated DBs&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Myth 3:&lt;/strong&gt; "Pinecone is expensive"&lt;br&gt;
&lt;strong&gt;Truth:&lt;/strong&gt; Serverless pricing is competitive &lt;strong&gt;at low scale&lt;/strong&gt;; costs escalate above 10M vectors&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Myth 4:&lt;/strong&gt; "You need a vector DB for RAG"&lt;br&gt;
&lt;strong&gt;Truth:&lt;/strong&gt; For &amp;lt;1M documents, pgvector is perfect (and you're already running Postgres)&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;pgvector is no longer "the slow option."&lt;/strong&gt; In 2026, it's a &lt;strong&gt;legitimate competitor&lt;/strong&gt; to Pinecone and Weaviate for most use cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But&lt;/strong&gt; — at scale (&amp;gt;10M vectors), the "zero cost" advantage diminishes when you factor in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hardware upgrades (fast SSDs, high RAM)&lt;/li&gt;
&lt;li&gt;DevOps time (index tuning, performance monitoring)&lt;/li&gt;
&lt;li&gt;Operational complexity (backup strategies, index rebuilds)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pinecone's "zero ops" can justify the 2-3x cost premium&lt;/strong&gt; if infrastructure headaches aren't worth your team's time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision framework:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prototype/MVP&lt;/strong&gt; → Start with pgvector (if you have Postgres)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise compliance&lt;/strong&gt; → Pinecone (SOC 2/HIPAA out-of-box)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid search&lt;/strong&gt; → Weaviate (semantic + keyword)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-sensitive (&amp;lt;10M vectors)&lt;/strong&gt; → pgvector (self-hosted)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale &amp;gt;10M vectors&lt;/strong&gt; → Pinecone (unless you have strong DevOps)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale &amp;gt;50M vectors&lt;/strong&gt; → Pinecone or self-hosted Weaviate (dedicated team)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; If you're already running PostgreSQL and staying under 10M vectors, pgvector is a no-brainer. Beyond that, seriously evaluate whether &lt;strong&gt;saving money is worth the infrastructure complexity&lt;/strong&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;What's your vector database setup?&lt;/strong&gt; Share your experience in the comments.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/pgvector/pgvector" rel="noopener noreferrer"&gt;pgvector GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.pinecone.io" rel="noopener noreferrer"&gt;Pinecone Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://weaviate.io/developers/weaviate" rel="noopener noreferrer"&gt;Weaviate Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.timescale.com/blog/pgvector-vs-pinecone" rel="noopener noreferrer"&gt;Timescale pgvectorscale Benchmarks&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>postgres</category>
      <category>ai</category>
      <category>rag</category>
      <category>vectordatabase</category>
    </item>
  </channel>
</rss>
