<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Sebastian</title>
    <description>The latest articles on Forem by Sebastian (@boringcontributor).</description>
    <link>https://forem.com/boringcontributor</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1258367%2F432433ba-54ba-4411-8ed5-788fe5bb1533.jpeg</url>
      <title>Forem: Sebastian</title>
      <link>https://forem.com/boringcontributor</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/boringcontributor"/>
    <language>en</language>
    <item>
      <title>Open Source Serverless Product Analytics on AWS</title>
      <dc:creator>Sebastian</dc:creator>
      <pubDate>Fri, 02 Jan 2026 10:40:00 +0000</pubDate>
      <link>https://forem.com/boringcontributor/open-source-serverless-product-analytics-on-aws-3pg2</link>
      <guid>https://forem.com/boringcontributor/open-source-serverless-product-analytics-on-aws-3pg2</guid>
      <description>&lt;p&gt;Product analytics shouldn't require managing servers, containers, or complex infrastructure. Yet most self-hosted alternatives to tools like Plausible or Umami assume you'll spin up Docker containers, manage databases, and deal with scaling headaches.&lt;/p&gt;

&lt;p&gt;I built this open source solution to change that. It's a &lt;strong&gt;fully serverless, self-hostable analytics platform&lt;/strong&gt; that deploys into &lt;strong&gt;your own AWS account&lt;/strong&gt; with a single CDK command.&lt;/p&gt;

&lt;p&gt;No servers. No Docker builds. &lt;strong&gt;Minimal, predictable baseline cost.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You get privacy-focused analytics infrastructure that scales from zero to millions of events without operational overhead.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important note&lt;/strong&gt;&lt;br&gt;
This repository provides a &lt;strong&gt;production-grade analytics ingestion pipeline&lt;/strong&gt;, not a polished analytics SaaS.&lt;/p&gt;

&lt;p&gt;Event collection, buffering, replay, and storage are solid and designed for real workloads.&lt;br&gt;
What is still evolving:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;authorization and multi-tenant access control&lt;/li&gt;
&lt;li&gt;the query / insights layer (dashboards, funnels, cohorts)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’re comfortable building on top of a strong foundation—or want to contribute—this project is for you.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this post, I’ll walk through how analytics platforms work under the hood, explore two serverless architectures on AWS, and explain the trade-offs behind the approach I chose.&lt;/p&gt;




&lt;h2&gt;How Analytics Platforms Work&lt;/h2&gt;

&lt;p&gt;Every analytics system—whether it's Google Analytics, Plausible, or a custom solution—follows the same fundamental pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser → Ingestion API → Buffer → Processor → Storage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This separation of concerns is what allows analytics systems to scale reliably without impacting application performance.&lt;/p&gt;

&lt;h3&gt;Collection (Browser)&lt;/h3&gt;

&lt;p&gt;A lightweight JavaScript snippet runs on your site and captures events: page views, clicks, web vitals. It batches these events and sends them to your backend using &lt;code&gt;sendBeacon&lt;/code&gt; for reliability or &lt;code&gt;fetch&lt;/code&gt; for flexibility.&lt;/p&gt;

&lt;p&gt;The script should be &lt;strong&gt;tiny (ideally under ~1KB gzipped)&lt;/strong&gt; so it doesn’t affect page performance or Core Web Vitals.&lt;/p&gt;
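&lt;p&gt;A minimal sketch of the batching idea, with the sender injected so the logic is testable outside a browser. The name &lt;code&gt;EventBuffer&lt;/code&gt; and the flush threshold are illustrative, not the project's actual API:&lt;/p&gt;

```typescript
// Minimal event buffer: collect events and flush them in batches through
// a pluggable sender. In the browser the sender would wrap
// navigator.sendBeacon or fetch; here it is injected for testability.
type AnalyticsEvent = { name: string; ts: number };
type Sender = (batch: AnalyticsEvent[]) => void;

class EventBuffer {
  private queue: AnalyticsEvent[] = [];

  constructor(private send: Sender, private maxBatch = 10) {}

  track(name: string): void {
    this.queue.push({ name, ts: Date.now() });
    // Flush once the batch is full so no single request grows unbounded.
    if (this.queue.length >= this.maxBatch) this.flush();
  }

  flush(): void {
    if (this.queue.length === 0) return;
    this.send(this.queue);
    this.queue = [];
  }
}
```

&lt;p&gt;A real snippet would also flush on &lt;code&gt;visibilitychange&lt;/code&gt; so queued events aren't lost when the tab closes.&lt;/p&gt;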

&lt;h3&gt;Ingestion API&lt;/h3&gt;

&lt;p&gt;An HTTP endpoint receives events from the browser. Its responsibilities should be minimal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;validate the payload&lt;/li&gt;
&lt;li&gt;enrich it with metadata (e.g. geolocation from request headers)&lt;/li&gt;
&lt;li&gt;push the event downstream&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The API should &lt;strong&gt;return immediately&lt;/strong&gt; and never block on heavy processing or database writes.&lt;/p&gt;
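&lt;p&gt;The validate-and-enrich step can be sketched as a pure function. The field names and the geolocation header are illustrative assumptions, not the project's actual schema:&lt;/p&gt;

```typescript
// Sketch of the validate-and-enrich step an ingestion handler performs
// before pushing an event downstream. Field names and the country header
// are illustrative; CDNs like CloudFront pass a header of this shape.
type RawEvent = { name?: unknown; url?: unknown };

function validateAndEnrich(body: RawEvent, headers: { [k: string]: string }) {
  // Reject malformed payloads early; the API should do nothing heavier.
  if (typeof body.name !== "string" || typeof body.url !== "string") {
    return null;
  }
  return {
    name: body.name,
    url: body.url,
    // Geolocation from request headers, as described above.
    country: headers["cloudfront-viewer-country"] || "unknown",
    receivedAt: Date.now(),
  };
}
```

&lt;p&gt;After this step the handler pushes the enriched record to the buffer and returns immediately; nothing heavier belongs in the request path.&lt;/p&gt;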

&lt;h3&gt;Buffer&lt;/h3&gt;

&lt;p&gt;The buffer decouples ingestion from processing.&lt;/p&gt;

&lt;p&gt;Events are written to a queue or stream so your ingestion API remains fast even during traffic spikes. This layer absorbs bursts, smooths load, and allows downstream consumers to process events at their own pace.&lt;/p&gt;

&lt;h3&gt;Processor&lt;/h3&gt;

&lt;p&gt;A worker reads events from the buffer, transforms them into the shape your storage expects, and writes them out.&lt;/p&gt;

&lt;p&gt;This is also where batching happens to reduce write amplification and keep storage costs under control.&lt;/p&gt;
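&lt;p&gt;The grouping step can be sketched as a small pure function. The name &lt;code&gt;chunkEvents&lt;/code&gt; and the batch size are illustrative; real limits come from the storage backend:&lt;/p&gt;

```typescript
// Sketch of the batching step: group decoded events into fixed-size
// chunks so each storage write covers many events instead of one,
// reducing write amplification.
function chunkEvents(events: object[], size: number): object[][] {
  const batches: object[][] = [];
  let rest = events.slice();
  while (rest.length > 0) {
    batches.push(rest.slice(0, size));
    rest = rest.slice(size);
  }
  return batches;
}
```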

&lt;h3&gt;Storage&lt;/h3&gt;

&lt;p&gt;This is the query layer. It must handle analytical workloads efficiently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;aggregations over time ranges&lt;/li&gt;
&lt;li&gt;grouping by dimensions (referrer, country, device, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Row-based databases work at small scale, but &lt;strong&gt;columnar stores like ClickHouse&lt;/strong&gt; are dramatically more efficient as data volume grows.&lt;/p&gt;




&lt;h2&gt;Two Serverless Approaches on AWS&lt;/h2&gt;

&lt;p&gt;When designing this for AWS, I evaluated two architectures. Both are fully serverless, but they differ in cost characteristics, operational complexity, and replay capabilities.&lt;/p&gt;

&lt;h3&gt;Approach 1: EventBridge + SQS (Near-Zero Cost at Rest)&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser → Lambda Function URL → EventBridge → SQS → Processor Lambda → Storage
                                           ↘ S3 (raw archive)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the purest pay-per-request model.&lt;/p&gt;

&lt;p&gt;EventBridge acts as the routing layer: one rule forwards events to SQS for processing, while another rule triggers a Lambda that archives raw events to S3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Near-zero cost when there’s no traffic&lt;/li&gt;
&lt;li&gt;Simple mental model with declarative routing rules&lt;/li&gt;
&lt;li&gt;Easy extensibility—new consumers are just new EventBridge rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Trade-offs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No built-in replay mechanism&lt;/li&gt;
&lt;li&gt;Reprocessing requires manual replay from S3&lt;/li&gt;
&lt;li&gt;Limited control over batching semantics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach works well for side projects, low-traffic sites, or scenarios where minimizing idle cost is the top priority.&lt;/p&gt;




&lt;h3&gt;Approach 2: Kinesis Data Streams + Firehose (What I Built)&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser → Lambda Function URL → Kinesis Data Stream → Firehose → S3
                                                    ↘ Lambda → Storage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the architecture I chose.&lt;/p&gt;

&lt;p&gt;Kinesis Data Streams acts as the central event log. Firehose handles archival to S3 automatically, while a Lambda consumer processes events and writes them to the analytics database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Built-in replay via configurable retention (24 hours by default, extendable up to 365 days)&lt;/li&gt;
&lt;li&gt;Strict ordering guarantees within partitions (important for session reconstruction)&lt;/li&gt;
&lt;li&gt;Seamless Firehose integration for batching, compression, and delivery to S3&lt;/li&gt;
&lt;li&gt;Predictable throughput and backpressure via shards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Trade-offs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not zero-cost at rest (one shard runs roughly $11/month)&lt;/li&gt;
&lt;li&gt;Requires basic capacity planning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I chose this approach because &lt;strong&gt;replayability and operational simplicity matter more than absolute zero idle cost&lt;/strong&gt; for a production analytics system. The baseline cost is predictable, and the architecture scales cleanly as traffic grows.&lt;/p&gt;




&lt;h2&gt;Storage Layer&lt;/h2&gt;

&lt;p&gt;By default, the project uses &lt;strong&gt;Amazon Aurora DSQL&lt;/strong&gt;. I chose it to experiment with a fully serverless SQL database.&lt;/p&gt;

&lt;p&gt;It works—but for analytical workloads, &lt;strong&gt;ClickHouse is the better choice&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Columnar storage, compression, and built-in aggregation functions make a significant difference for time-series analytics.&lt;/p&gt;

&lt;p&gt;The storage layer is abstracted behind an interface, so swapping backends is a configuration change. For real-world traffic, I recommend pointing the system at ClickHouse (ClickHouse Cloud or self-hosted) instead of DSQL.&lt;/p&gt;
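&lt;p&gt;The abstraction can be sketched like this. The interface and method names are illustrative, not the project's actual code:&lt;/p&gt;

```typescript
// Sketch of the storage abstraction described above: the pipeline writes
// through one interface, so swapping DSQL for ClickHouse is a config
// change rather than an architectural one.
interface AnalyticsStore {
  writeBatch(events: object[]): void;
  count(): number;
}

// In-memory implementation, useful for local development and tests.
class MemoryStore implements AnalyticsStore {
  private events: object[] = [];
  writeBatch(batch: object[]): void {
    this.events = this.events.concat(batch);
  }
  count(): number {
    return this.events.length;
  }
}

// A ClickHouse- or DSQL-backed class would implement the same interface,
// so the processor never knows which backend it is writing to.
function makeStore(backend: string): AnalyticsStore {
  if (backend === "memory") return new MemoryStore();
  throw new Error("unknown backend: " + backend);
}
```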




&lt;h2&gt;Getting Started&lt;/h2&gt;

&lt;p&gt;The entire stack deploys with a single command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;make deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This provisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the ingestion API (built in Rust)&lt;/li&gt;
&lt;li&gt;buffering infrastructure (Kinesis + Firehose)&lt;/li&gt;
&lt;li&gt;raw event archival to S3&lt;/li&gt;
&lt;li&gt;a query API defined with OpenAPI and configured with sane defaults&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything is defined in CDK and can be customized via configuration without touching the core architecture.&lt;/p&gt;

&lt;p&gt;The repository is available on GitHub:&lt;br&gt;
👉 &lt;a href="https://github.com/boringContributor/aws-serverless-product-analytics" rel="noopener noreferrer"&gt;https://github.com/boringContributor/aws-serverless-product-analytics&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;Contributing &amp;amp; Roadmap&lt;/h2&gt;

&lt;p&gt;This project is intentionally modular and open for contributions.&lt;/p&gt;

&lt;p&gt;Areas where help is especially valuable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;authorization &amp;amp; multi-tenant access control&lt;/li&gt;
&lt;li&gt;query API design (funnels, breakdowns, cohorts)&lt;/li&gt;
&lt;li&gt;ClickHouse schemas and query optimizations&lt;/li&gt;
&lt;li&gt;dashboard and visualization experiments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’re interested, issues are labeled and the architecture is documented to make onboarding easier.&lt;/p&gt;




&lt;h2&gt;Wrapping Up&lt;/h2&gt;

&lt;p&gt;Serverless analytics on AWS is not only possible—it’s practical.&lt;/p&gt;

&lt;p&gt;You get the privacy and control benefits of self-hosting &lt;strong&gt;without&lt;/strong&gt; managing servers, containers, or always-on infrastructure. Whether you choose a near-zero-cost EventBridge pipeline or a replay-friendly Kinesis-based architecture depends on your traffic patterns and tolerance for baseline cost.&lt;/p&gt;

&lt;p&gt;The code is open source. Deploy it, fork it, or use it as a reference for building your own event ingestion pipelines.&lt;/p&gt;

&lt;p&gt;If you have questions or want to adapt this setup to your needs, feel free to set up a quick call for a one-time collaboration:&lt;br&gt;
👉 &lt;a href="https://cal.com/someone" rel="noopener noreferrer"&gt;https://cal.com/someone&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>analytics</category>
      <category>cdk</category>
    </item>
    <item>
      <title>Storing Sensitive Information in DynamoDB with KMS</title>
      <dc:creator>Sebastian</dc:creator>
      <pubDate>Wed, 24 Dec 2025 13:10:10 +0000</pubDate>
      <link>https://forem.com/boringcontributor/storing-sensitive-information-in-dynamodb-with-kms-510k</link>
      <guid>https://forem.com/boringcontributor/storing-sensitive-information-in-dynamodb-with-kms-510k</guid>
      <description>&lt;p&gt;Recently I faced an issue with AWS EventBridge Connections. It's a managed AWS service that handles secrets for you—you configure authentication for an API (either for yourself or your customers, like webhooks), and EventBridge Connections handles the rest when attached to EventBridge API Destination or Step Functions HTTP invoke tasks.&lt;/p&gt;

&lt;p&gt;Both services seem great at first glance, but reveal limitations once you move beyond simple use cases. In my case, the lack of customization and control became a blocker. This led me to research alternatives: Where can I store customer-provided secrets or sensitive data securely?&lt;/p&gt;

&lt;h2&gt;The Obvious Choice: AWS Secrets Manager&lt;/h2&gt;

&lt;p&gt;For most people, the first solution that comes to mind is AWS Secrets Manager. Secrets Manager is a fully managed service designed specifically for storing and rotating secrets like database credentials, API keys, and OAuth tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Secrets Manager?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AWS Secrets Manager helps you protect access to your applications, services, and IT resources without upfront investment and ongoing maintenance costs. It enables you to rotate, manage, and retrieve database credentials, API keys, and other secrets throughout their lifecycle.&lt;/p&gt;

&lt;p&gt;Key features include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatic secret rotation&lt;/li&gt;
&lt;li&gt;Fine-grained access control via IAM&lt;/li&gt;
&lt;li&gt;Audit and compliance through CloudTrail logging&lt;/li&gt;
&lt;li&gt;Integration with RDS, DocumentDB, and other AWS services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Downsides&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While Secrets Manager is powerful, it's not always necessary:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt;: Secrets Manager charges $0.40 per secret per month, plus $0.05 per 10,000 API calls. For applications managing many customer secrets, this adds up quickly: 1,000 customer secrets already cost $400 per month before a single API call.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overkill for simple use cases&lt;/strong&gt;: If you don't need automatic rotation or the advanced features, you're paying for functionality you won't use.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complexity&lt;/strong&gt;: For straightforward encryption needs, the service adds unnecessary overhead.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is where AWS Key Management Service (KMS) becomes an attractive alternative.&lt;/p&gt;

&lt;h2&gt;A Better Fit: AWS KMS&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is KMS?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AWS Key Management Service (KMS) is a managed service that makes it easy to create and control cryptographic keys used to encrypt your data. Unlike Secrets Manager, KMS doesn't store your secrets—it stores encryption keys that you use to encrypt and decrypt data yourself.&lt;/p&gt;

&lt;p&gt;Think of it this way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Secrets Manager&lt;/strong&gt;: A secure vault that stores your secrets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KMS&lt;/strong&gt;: A key custodian that holds the keys you use to lock/unlock your own vault&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why KMS for DynamoDB?&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;DynamoDB Encryption vs Application-Level Encryption&lt;/h3&gt;

&lt;p&gt;It's important to clarify that DynamoDB already encrypts all data at rest by default, using an AWS owned KMS key (you can opt into an AWS managed or customer managed key instead). This protects your data against physical disk access and AWS infrastructure-level threats.&lt;/p&gt;

&lt;p&gt;However, server-side encryption (SSE) alone is often not sufficient when dealing with customer-provided secrets.&lt;/p&gt;

&lt;p&gt;Application-level encryption (encrypting data before storing it in DynamoDB) provides additional guarantees:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Protects against overly permissive IAM policies&lt;/li&gt;
&lt;li&gt;Limits exposure in case of accidental data access&lt;/li&gt;
&lt;li&gt;Keeps data encrypted in exports, backups, and logs&lt;/li&gt;
&lt;li&gt;Enables fine-grained access control at the application boundary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this guide, we’re focusing on application-level encryption, where sensitive values are encrypted using KMS before being written to DynamoDB, rather than relying solely on DynamoDB’s built-in encryption at rest.&lt;/p&gt;

&lt;p&gt;When storing sensitive data in DynamoDB, you have two main approaches:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Store references in DynamoDB&lt;/strong&gt;: Store secrets in AWS Secrets Manager or SSM Parameter Store, then store only the reference in DynamoDB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Store encrypted data directly in DynamoDB&lt;/strong&gt;: Encrypt the sensitive data with KMS and store the encrypted value directly in your DynamoDB table&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The second approach is simpler and more cost-effective for many use cases. Let's explore how to implement it.&lt;/p&gt;

&lt;h2&gt;Setting Up KMS with AWS CDK&lt;/h2&gt;

&lt;p&gt;Here's how to create a KMS key and configure it for use with DynamoDB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;kms&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-cdk-lib/aws-kms&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;dynamodb&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-cdk-lib/aws-dynamodb&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;iam&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-cdk-lib/aws-iam&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Stack&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;StackProps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;RemovalPolicy&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-cdk-lib&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Construct&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;constructs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SecureStorageStack&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Stack&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Construct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="nx"&gt;StackProps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Create a KMS key for encrypting sensitive data&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;encryptionKey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;kms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SensitiveDataKey&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Key for encrypting customer secrets in DynamoDB&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;enableKeyRotation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// Automatically rotate key every year&lt;/span&gt;
      &lt;span class="na"&gt;removalPolicy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;RemovalPolicy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;RETAIN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// Keep key even if stack is deleted&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Create DynamoDB table&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;secretsTable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;dynamodb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SecretsTable&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;partitionKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;customerId&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;dynamodb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;AttributeType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;STRING&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;sortKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;secretId&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;dynamodb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;AttributeType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;STRING&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;billingMode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;dynamodb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;BillingMode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;PAY_PER_REQUEST&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Grant your Lambda function access to the key&lt;/span&gt;
    &lt;span class="c1"&gt;// (Assuming you have a Lambda function defined)&lt;/span&gt;
    &lt;span class="nx"&gt;encryptionKey&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;grantEncryptDecrypt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;yourLambdaFunction&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;secretsTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;grantReadWriteData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;yourLambdaFunction&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Add key ARN to Lambda environment variables&lt;/span&gt;
    &lt;span class="nx"&gt;yourLambdaFunction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addEnvironment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;KMS_KEY_ID&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;encryptionKey&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;keyId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;yourLambdaFunction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addEnvironment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SECRETS_TABLE_NAME&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;secretsTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tableName&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Encrypting and Decrypting in TypeScript&lt;/h2&gt;

&lt;h3&gt;Important Limitation: KMS Encrypt Size Limit&lt;/h3&gt;

&lt;p&gt;AWS KMS Encrypt has a maximum plaintext size of 4 KB. This works well for small secrets such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API keys&lt;/li&gt;
&lt;li&gt;Webhook secrets&lt;/li&gt;
&lt;li&gt;Short OAuth tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, it will &lt;strong&gt;not work&lt;/strong&gt; for larger payloads like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PEM certificates&lt;/li&gt;
&lt;li&gt;Large JSON credentials&lt;/li&gt;
&lt;li&gt;Multi-field configuration blobs&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Envelope Encryption for Larger Secrets&lt;/h3&gt;

&lt;p&gt;For secrets larger than 4 KB, you should use envelope encryption:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use KMS to generate a data encryption key (DEK)&lt;/li&gt;
&lt;li&gt;Encrypt the secret locally using a symmetric algorithm (e.g. AES-256-GCM)&lt;/li&gt;
&lt;li&gt;Store the encrypted secret and the encrypted data key together in DynamoDB&lt;/li&gt;
&lt;li&gt;Decrypt the data key with KMS only when needed&lt;/li&gt;
&lt;/ol&gt;
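&lt;p&gt;The local half of these steps can be sketched with Node's built-in &lt;code&gt;crypto&lt;/code&gt; module. This is a sketch of step 2 only: the KMS calls (&lt;code&gt;GenerateDataKey&lt;/code&gt; to obtain the data key, &lt;code&gt;Decrypt&lt;/code&gt; to recover it later) are omitted, and the 32-byte data key is simply passed in.&lt;/p&gt;

```typescript
import * as crypto from "crypto";

// Sketch of step 2: encrypt a large secret locally with AES-256-GCM,
// using a 32-byte data key that would come from a KMS GenerateDataKey
// call. Only the KMS-encrypted copy of the key is stored alongside it.
function encryptWithDataKey(plaintext: string, dataKey: Buffer) {
  const iv = crypto.randomBytes(12); // standard GCM nonce length
  const cipher = crypto.createCipheriv("aes-256-gcm", dataKey, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return { iv, ciphertext, authTag: cipher.getAuthTag() };
}

function decryptWithDataKey(
  payload: { iv: Buffer; ciphertext: Buffer; authTag: Buffer },
  dataKey: Buffer
): string {
  const decipher = crypto.createDecipheriv("aes-256-gcm", dataKey, payload.iv);
  decipher.setAuthTag(payload.authTag); // GCM verifies integrity on final()
  return Buffer.concat([decipher.update(payload.ciphertext), decipher.final()]).toString("utf8");
}
```

&lt;p&gt;In DynamoDB you would store &lt;code&gt;iv&lt;/code&gt;, &lt;code&gt;ciphertext&lt;/code&gt;, &lt;code&gt;authTag&lt;/code&gt;, and the KMS-encrypted data key together in one item.&lt;/p&gt;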

&lt;p&gt;This approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scales to arbitrarily large secrets&lt;/li&gt;
&lt;li&gt;Minimizes KMS API calls&lt;/li&gt;
&lt;li&gt;Is the approach AWS recommends as a best practice&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this article, we focus on the direct Encrypt / Decrypt approach for simplicity and small secrets. For production systems handling larger payloads, envelope encryption should be used instead.&lt;/p&gt;

&lt;p&gt;Here's how to encrypt and decrypt sensitive data using the AWS SDK for JavaScript v3:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;KMSClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;EncryptCommand&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;DecryptCommand&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@aws-sdk/client-kms&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;DynamoDBClient&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@aws-sdk/client-dynamodb&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;DynamoDBDocumentClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;PutCommand&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;GetCommand&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@aws-sdk/lib-dynamodb&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;kmsClient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;KMSClient&lt;/span&gt;&lt;span class="p"&gt;({});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dynamoClient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;DynamoDBDocumentClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;DynamoDBClient&lt;/span&gt;&lt;span class="p"&gt;({}));&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;KMS_KEY_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;KMS_KEY_ID&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;TABLE_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;SECRETS_TABLE_NAME&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;CustomerSecret&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;customerId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;secretId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;encryptedValue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;createdAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="cm"&gt;/**
 * Encrypt a sensitive value using KMS
 */&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;encryptSecret&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;plaintext&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;EncryptCommand&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;KeyId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;KMS_KEY_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;Plaintext&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;plaintext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;utf-8&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;kmsClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;command&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;CiphertextBlob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Encryption failed: no ciphertext returned&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Convert to base64 for storage&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;CiphertextBlob&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;base64&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="cm"&gt;/**
 * Decrypt a KMS-encrypted value
 */&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;decryptSecret&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;encryptedValue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;DecryptCommand&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;CiphertextBlob&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;encryptedValue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;base64&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="c1"&gt;// Note: KeyId is optional for decrypt - KMS knows which key was used&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;kmsClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;command&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Plaintext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Decryption failed: no plaintext returned&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Plaintext&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;utf-8&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="cm"&gt;/**
 * Store an encrypted secret in DynamoDB
 */&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;storeSecret&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;customerId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;secretId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;plainSecret&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;encryptedValue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;encryptSecret&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;plainSecret&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="na"&gt;item&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;CustomerSecret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;customerId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;secretId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;encryptedValue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;createdAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;dynamoClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;PutCommand&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;TableName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;TABLE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;Item&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;}));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="cm"&gt;/**
 * Retrieve and decrypt a secret from DynamoDB
 */&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;getSecret&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;customerId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;secretId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;dynamoClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;GetCommand&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;TableName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;TABLE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;customerId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;secretId&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;}));&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;secret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Item&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;CustomerSecret&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;decryptSecret&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;encryptedValue&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Example usage&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;example&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Store a customer's API key&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;storeSecret&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;customer-123&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;api-key&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;super-secret-api-key-xyz&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Retrieve and decrypt it later&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;apiKey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;getSecret&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;customer-123&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;api-key&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Decrypted API key:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  SSM Parameter Store vs KMS: The Trade-offs
&lt;/h2&gt;

&lt;p&gt;You might wonder: should I use SSM Parameter Store with KMS encryption, or encrypt data directly with KMS and store it in DynamoDB?&lt;/p&gt;

&lt;h3&gt;
  
  
  SSM Parameter Store Approach
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Centralized secret management&lt;/li&gt;
&lt;li&gt;Built-in versioning&lt;/li&gt;
&lt;li&gt;Free tier: up to 10,000 standard parameters&lt;/li&gt;
&lt;li&gt;Parameter Store integrates with many AWS services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extra API calls (SSM + DynamoDB)&lt;/li&gt;
&lt;li&gt;Additional latency&lt;/li&gt;
&lt;li&gt;Two services to manage&lt;/li&gt;
&lt;li&gt;10,000 standard-parameter limit may be restrictive for large-scale applications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Store in SSM, reference in DynamoDB&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;paramName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`/customers/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;customerId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/secrets/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;secretId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ssm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;putParameter&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;paramName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;plainSecret&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SecureString&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// Uses KMS encryption&lt;/span&gt;
  &lt;span class="na"&gt;KeyId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;KMS_KEY_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Store reference in DynamoDB&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;dynamodb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;putItem&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;TableName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Customers&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;Item&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;customerId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;S&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;customerId&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;secretRef&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;S&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;paramName&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="c1"&gt;// Just the reference&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Direct KMS Encryption in DynamoDB
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single service (DynamoDB)&lt;/li&gt;
&lt;li&gt;Lower latency (one API call instead of two)&lt;/li&gt;
&lt;li&gt;No parameter count limits&lt;/li&gt;
&lt;li&gt;Simpler architecture&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No built-in versioning (you'd implement it yourself)&lt;/li&gt;
&lt;li&gt;Less visibility in AWS Console&lt;/li&gt;
&lt;li&gt;Manual rotation handling&lt;/li&gt;
&lt;/ul&gt;
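&lt;p&gt;If you do need versioning with the direct-KMS approach, one common DynamoDB pattern is to write each version as its own item under a composite sort key and keep a &lt;code&gt;latest&lt;/code&gt; pointer. The sketch below is my own illustration (the key names are hypothetical), not code from the project above:&lt;/p&gt;

```typescript
// Hypothetical versioning layout for the secrets table:
// every version becomes its own immutable item, addressed by a composite sort key.
//   pk: customerId
//   sk: `${secretId}#v${n}`      (immutable version items)
//   sk: `${secretId}#latest`     (pointer item holding the current version number)

function versionKey(secretId: string, version: number): string {
  return `${secretId}#v${version}`;
}

function latestKey(secretId: string): string {
  return `${secretId}#latest`;
}

// Given the current version number (read from the `#latest` item), compute
// the keys touched by a new write: the new version item plus the updated pointer.
function planNewVersion(secretId: string, currentVersion: number) {
  const next = currentVersion + 1;
  return {
    newItemKey: versionKey(secretId, next),
    pointerKey: latestKey(secretId),
    nextVersion: next,
  };
}

console.log(planNewVersion('api-key', 2));
```

&lt;p&gt;In a real table you would write the version item and update the pointer in a single &lt;code&gt;TransactWriteItems&lt;/code&gt; call, so readers never observe a dangling pointer.&lt;/p&gt;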

&lt;p&gt;&lt;strong&gt;When to use which:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use SSM Parameter Store&lt;/strong&gt; if you need versioning, have &amp;lt; 10,000 secrets, or want integration with other AWS services&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use direct KMS encryption&lt;/strong&gt; for high-scale applications, lower latency requirements, or when secrets are tightly coupled with your DynamoDB data model&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Pricing Comparison
&lt;/h2&gt;

&lt;p&gt;Let's compare costs for storing 50,000 customer secrets:&lt;/p&gt;

&lt;h3&gt;
  
  
  Secrets Manager
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Storage: 50,000 secrets × $0.40/month = &lt;strong&gt;$20,000/month&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;API calls: $0.05 per 10,000 calls; assuming 1M calls/month: 100 × $0.05 = &lt;strong&gt;$5/month&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: ~$20,005/month&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  KMS + DynamoDB
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;KMS key: 1 key × $1/month = &lt;strong&gt;$1/month&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;KMS API calls: Encrypt and Decrypt requests are billed at $0.03 per 10,000 requests (pricing varies slightly by region). Assuming ~1M total requests per month: &lt;strong&gt;~$3/month&lt;/strong&gt; (KMS also includes a small free tier for requests)&lt;/li&gt;
&lt;li&gt;DynamoDB storage: 50,000 items × ~1KB = ~50MB × $0.25/GB = &lt;strong&gt;$0.01/month&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;DynamoDB reads/writes: Varies by usage, let's say &lt;strong&gt;$10/month&lt;/strong&gt; for moderate traffic&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: ~$14/month&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The cost difference is dramatic: &lt;strong&gt;$20,005 vs $14 per month&lt;/strong&gt; for the same number of secrets.&lt;/p&gt;

&lt;h3&gt;
  
  
  SSM Parameter Store + DynamoDB
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;SSM: the first 10,000 standard parameters are free; beyond that, advanced parameters cost $0.05 per parameter/month&lt;/li&gt;
&lt;li&gt;For 50,000 params: 40,000 × $0.05 = &lt;strong&gt;$2,000/month&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Plus KMS and DynamoDB costs: &lt;strong&gt;~$14/month&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: ~$2,014/month&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
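&lt;p&gt;The arithmetic above condenses into a small cost model. The rates are hard-coded from the figures quoted in this post; verify them against the current AWS pricing pages before relying on them:&lt;/p&gt;

```typescript
// Monthly cost model for 50,000 secrets and ~1M API calls per month,
// using the per-unit rates quoted in this article.
const SECRETS = 50_000;
const API_CALLS = 1_000_000;

// Secrets Manager: $0.40 per secret/month plus $0.05 per 10,000 API calls
const secretsManager = SECRETS * 0.40 + (API_CALLS / 10_000) * 0.05;

// KMS + DynamoDB: $1 for the key, $0.03 per 10,000 KMS requests,
// ~$0.01 storage, ~$10 for moderate read/write traffic
const kmsPlusDynamo = 1 + (API_CALLS / 10_000) * 0.03 + 0.01 + 10;

// SSM: advanced parameters beyond the free 10,000, plus the KMS/DynamoDB baseline
const ssmPlusDynamo = (SECRETS - 10_000) * 0.05 + kmsPlusDynamo;

console.log({ secretsManager, kmsPlusDynamo, ssmPlusDynamo });
```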

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;While AWS Secrets Manager is excellent for managing application secrets with automatic rotation, it's often overkill—and expensive—for storing customer-provided secrets or sensitive data at scale.&lt;/p&gt;

&lt;p&gt;For most use cases involving customer secrets in DynamoDB, encrypting data directly with KMS offers the best balance of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt;: Strong encryption with managed keys&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance&lt;/strong&gt;: Single API call to retrieve data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt;: Dramatically lower than Secrets Manager&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simplicity&lt;/strong&gt;: Fewer moving parts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trade-off is that you'll need to implement your own versioning logic if required; for many applications, that is a worthwhile exchange for the cost savings and performance benefits. Rotation is less of a concern than it might seem: most customer-provided secrets cannot and should not be rotated automatically by your system. API keys, webhook secrets, and OAuth credentials are typically owned and rotated by the customer or an external provider, making Secrets Manager's rotation features largely irrelevant for these use cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key takeaway&lt;/strong&gt;: Choose the right tool for the job. Secrets Manager shines for application secrets with rotation needs. KMS excels for high-volume, customer-specific data encryption.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you dealt with similar challenges managing secrets at scale? I'd love to hear about your approach in the comments below.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>kms</category>
      <category>dynamodb</category>
      <category>ssm</category>
    </item>
    <item>
      <title>Practical usage of asynchronous context tracking in NodeJS and AWS Lambda</title>
      <dc:creator>Sebastian</dc:creator>
      <pubDate>Fri, 07 Nov 2025 09:55:20 +0000</pubDate>
      <link>https://forem.com/boringcontributor/practical-usage-of-asynchronous-context-tracking-in-nodejs-and-aws-lambda-1gff</link>
      <guid>https://forem.com/boringcontributor/practical-usage-of-asynchronous-context-tracking-in-nodejs-and-aws-lambda-1gff</guid>
      <description>&lt;p&gt;Asynchronous context tracking in NodeJS, introduced with version 16, addresses a common challenge in node applications: maintaining context across asynchronous operations. In such environment, where non-blocking I/O operations are the norm, it can be difficult to preserve a "context" or "state" across callbacks, promises, or async/await operations. This is crucial for tasks like tracking user sessions, handling transactions, or implementing logging that depends on knowing the sequence of operations that led to a particular state.&lt;/p&gt;

&lt;p&gt;It provides a way for developers to preserve this context without resorting to complex workarounds. It builds on the &lt;strong&gt;AsyncLocalStorage API&lt;/strong&gt;, part of the &lt;a href="https://nodejs.org/docs/latest-v18.x/api/async_hooks.html#class-asynclocalstorage" rel="noopener noreferrer"&gt;async_hooks module&lt;/a&gt;, which lets developers store and access data specific to a particular sequence of asynchronous operations. This makes it possible to pass context through the many layers of asynchronous calls, improving monitoring, debugging, and maintainability. Essentially, it lets developers keep track of the execution flow even in the inherently asynchronous environment of NodeJS.&lt;/p&gt;
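&lt;p&gt;Before looking at Lambda specifically, here is a minimal standalone sketch of the API (the function names are my own): a store bound with &lt;code&gt;run()&lt;/code&gt; remains readable via &lt;code&gt;getStore()&lt;/code&gt; after any number of awaits, without threading it through parameters:&lt;/p&gt;

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

// Holds a per-request context object for the current async call chain.
const als = new AsyncLocalStorage();

function log(message: string) {
  // getStore() returns the object bound by the closest enclosing run(),
  // or undefined when called outside any context.
  const store = als.getStore() as { requestId: string } | undefined;
  return `[${store ? store.requestId : 'no-context'}] ${message}`;
}

async function handle(requestId: string) {
  // Everything inside this callback, including code after awaits,
  // sees the same store.
  return als.run({ requestId }, async () => {
    await new Promise((resolve) => setTimeout(resolve, 5));
    return log('finished');
  });
}

handle('req-42').then((line) => console.log(line));
```

&lt;p&gt;The middleware below uses &lt;code&gt;enterWith()&lt;/code&gt; instead of &lt;code&gt;run()&lt;/code&gt;, which binds the store for the remainder of the current execution context. That is convenient in middleware chains, where you cannot easily wrap the downstream handler in a callback.&lt;/p&gt;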

&lt;h2&gt;
  
  
  AsyncLocalStorage within AWS Lambda Environments
&lt;/h2&gt;

&lt;p&gt;In the dynamic world of AWS Lambda, where functions respond to events in isolated invocations, managing context across asynchronous operations can be a juggling act. &lt;strong&gt;AsyncLocalStorage&lt;/strong&gt; is designed to handle exactly this scenario. It offers a seamless way to maintain context without the cumbersome need to pass state through function parameters. As long as each invocation establishes its context at the start of the handler (as the middleware below does), context does not leak between invocations, even in warm containers that are reused for efficiency. This gives every function run a clean slate for its asynchronous context, making code more readable and maintainable, and significantly reducing the likelihood of bugs caused by improper context management.&lt;/p&gt;

&lt;p&gt;Let's consider the following example, which tracks user claims by introducing a middy middleware:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;AsyncLocalStorage&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;node:async_hooks&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;middy&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@middy/core&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;zod&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;zod&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;Claims&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;zod&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Claims&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;zod&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;infer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;Claims&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;claimsStorage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;AsyncLocalStorage&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Claims&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;useClaims&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;claimsStorage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getStore&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;store&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;invalid claims&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;store&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;withUserStoredInContext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nx"&gt;middy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;MiddlewareObj&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;APIGatewayProxyEventV2WithJWTAuthorizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;APIGatewayProxyStructuredResultV2&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;before&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="na"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;middy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Request&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;APIGatewayProxyEventV2WithJWTAuthorizer&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;claims&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Claims&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;requestContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;authorizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;jwt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;claims&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="nx"&gt;claimsStorage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enterWith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;claims&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We implement a function &lt;strong&gt;useClaims&lt;/strong&gt; and can now access the claims throughout the whole request lifecycle:&lt;br&gt;
&lt;/p&gt;
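&lt;p&gt;The &lt;strong&gt;useClaims&lt;/strong&gt; implementation is little more than a typed wrapper around &lt;strong&gt;getStore&lt;/strong&gt;. A minimal sketch, assuming a module-level &lt;strong&gt;claimsStorage&lt;/strong&gt; shared with the middleware above (the Claims type here is a placeholder for the parsed schema):&lt;/p&gt;

```typescript
import { AsyncLocalStorage } from 'node:async_hooks'

// Placeholder claims shape; in the real code this is the result of
// the Claims schema parsed in the middleware above.
type Claims = { sub?: string; [key: string]: unknown }

// Module-level store, shared with the middleware that calls enterWith.
export const claimsStorage = new AsyncLocalStorage<Claims>()

// Returns the claims of the current request, or an empty object when
// the store was never populated (e.g. in tests or outside a request).
export const useClaims = (): Claims => claimsStorage.getStore() ?? {}
```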

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;useClaims&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;DB&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./db&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;isEmpty&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;remeda&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;sub&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useClaims&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;allPromotions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;DB&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listBySub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;allPromotions&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nf"&gt;isEmpty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;allPromotions&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;transformPromotions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;allPromotions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Pitfalls
&lt;/h2&gt;

&lt;p&gt;Using &lt;strong&gt;AsyncLocalStorage&lt;/strong&gt; with AWS Lambda offers many benefits for context management across asynchronous operations. However, there are some considerations and potential pitfalls to be aware of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;enterWith&lt;/strong&gt; is still experimental. The only stable way to populate the store is to use the &lt;strong&gt;run&lt;/strong&gt; method. See the related Stack Overflow discussion.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Understanding Context Propagation:&lt;/strong&gt; Developers must have a clear understanding of how context is propagated across asynchronous calls to effectively use AsyncLocalStorage. Misunderstandings can lead to context loss or incorrect assumptions about the availability of context, resulting in bugs that are difficult to diagnose.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;💡&lt;br&gt;
I have run into this quite often, e.g. calling &lt;strong&gt;getStore&lt;/strong&gt; and wondering why it returns undefined even though I thought I had populated the store.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Potential for Context Leaks&lt;/strong&gt;: Incorrect usage of AsyncLocalStorage, especially not properly entering and exiting the context, can lead to context information leaking across Lambda invocations in warm containers. Although AWS Lambda provides isolation between invocations, improper management of context can introduce subtle bugs related to context contamination.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cold Start &amp;amp; Memory Consumption:&lt;/strong&gt; I have not verified this yet, but there are quite a few older discussions about performance degradation when using these hooks. I can imagine that async hooks have some impact on the performance of an application, be it container startup time or memory consumption.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
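&lt;p&gt;If you want to stay on stable API only, the same effect can be achieved by wrapping the handler body with &lt;strong&gt;run&lt;/strong&gt; instead of &lt;strong&gt;enterWith&lt;/strong&gt;. A minimal sketch (the names are illustrative, not the middleware from above):&lt;/p&gt;

```typescript
import { AsyncLocalStorage } from 'node:async_hooks'

// Illustrative names; the shape mirrors the claims example above.
const storage = new AsyncLocalStorage<{ sub: string }>()

// run() scopes the store strictly to the callback's (possibly async)
// call tree, so nothing can leak into the next warm invocation.
export const withClaims = <T>(claims: { sub: string }, fn: () => T): T =>
  storage.run(claims, fn)

export const currentSub = (): string | undefined => storage.getStore()?.sub
```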

&lt;p&gt;How do you use AsyncLocalStorage? Let's discuss!&lt;/p&gt;

&lt;p&gt;Otherwise, give it a try. Cheers!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>lambda</category>
      <category>node</category>
    </item>
    <item>
      <title>Scaling Notification Systems: How a Single Timestamp Improved Our DynamoDB Performance</title>
      <dc:creator>Sebastian</dc:creator>
      <pubDate>Wed, 14 May 2025 08:02:51 +0000</pubDate>
      <link>https://forem.com/epilot/scaling-notification-systems-how-a-single-timestamp-improved-our-dynamodb-performance-5c84</link>
      <guid>https://forem.com/epilot/scaling-notification-systems-how-a-single-timestamp-improved-our-dynamodb-performance-5c84</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The epilot platform contains a comprehensive notification system. Users receive notifications about ongoing tasks, such as new assignments or overdue tasks. They can also get notified about incoming emails or when someone mentions them in notes, and the list goes on. Users can choose to receive these notifications via email or as in-app notifications. This article focuses on the latter.&lt;br&gt;
Initially, in-app notifications were stored in Aurora (AWS's managed relational database service). This setup soon became a major pain point, prompting us to migrate to DynamoDB. The simplicity of the notification data structure and the volume of read and write operations we expected made DynamoDB the perfect choice to scale.&lt;br&gt;
However, if you don't think carefully about how you design access patterns in DynamoDB, more problems arise than you'd expect.&lt;br&gt;
Let's dive into why a bad implementation of a &lt;strong&gt;markAllAsRead&lt;/strong&gt; feature caused us some headaches and how we reduced its complexity from &lt;strong&gt;O(n)&lt;/strong&gt; to &lt;strong&gt;O(1)&lt;/strong&gt; with a timestamp-based approach for unread notifications.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;The initial design was straightforward. Every user gets a new notification item in the DynamoDB table. The partition key (pk) was a combination of user_id and organization_id, while the sort key (sk) contains the notification_id. The access patterns were simple: &lt;strong&gt;fetch all notifications for a given user&lt;/strong&gt;, &lt;strong&gt;mark a notification as read&lt;/strong&gt;, and &lt;strong&gt;mark all notifications as read&lt;/strong&gt; for the lazy ones. The latter is the origin of this article.&lt;br&gt;
An attribute &lt;strong&gt;read_state&lt;/strong&gt; indicates whether a notification has already been read by a user. Marking a single notification as read was as simple as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;markAsRead&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ddb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;TableName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;NOTIFICATIONS_TABLE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;toUserNotificationSK&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;UpdateExpression&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SET read_state = :read_state&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;ExpressionAttributeValues&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;:read_state&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// binary 1 is true&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once a notification is read, the item is updated and &lt;strong&gt;read_state&lt;/strong&gt; is set to &lt;strong&gt;1&lt;/strong&gt;. A Global Secondary Index (GSI) called &lt;strong&gt;byReadState&lt;/strong&gt; then allows us to read all unread notifications for a given tenant (org + user). This created two operations that performed poorly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;A bad implementation of the &lt;strong&gt;markAllAsRead&lt;/strong&gt; feature. It first queried all unread notifications and then performed a batch operation to update them all to read. As shown in the graph below, DynamoDB began to throttle under load when users with lots of unread notifications used the mark-all-as-read feature. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To indicate that a user has unread messages, a &lt;strong&gt;getTotalUnreadCount&lt;/strong&gt; endpoint is exposed. This allows us to render a notification bell in the UI to show the unread count. &lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz3taktv37h743x7thcfu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz3taktv37h743x7thcfu.png" alt="dynamodb throttles" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The naive implementation to batch update all unread notifications worked surprisingly well in the beginning. However, as the volume of notifications increased, we started experiencing more and more throttling events in DynamoDB. What started as occasional hiccups became a serious bottleneck in our notification service's performance.&lt;/p&gt;

&lt;p&gt;The issue was multi-faceted. &lt;strong&gt;First&lt;/strong&gt;, DynamoDB has limits on batch operations, requiring us to split large batches into multiple smaller operations. This not only added complexity to our code but also increased the probability of partial failures. &lt;br&gt;
&lt;strong&gt;Second&lt;/strong&gt;, each notification update consumed Write Capacity Units (WCUs) from our table's provisioned capacity. For users with hundreds or thousands of unread notifications, a single "Mark All as Read" action would consume a significant portion of our available WCUs, causing other notification operations to be throttled.&lt;br&gt;
&lt;strong&gt;Importantly&lt;/strong&gt;, these issues didn't affect the entire epilot platform, but were isolated to the notification service itself. Users would see timeouts or delayed responses specifically when interacting with notifications, while the rest of the platform continued to function normally. &lt;br&gt;
However, this created a frustrating user experience, especially for power users who relied heavily on notifications to manage their workflows.&lt;br&gt;
The problem was particularly severe for organizations with large teams, where notification counts could grow rapidly, and the "Mark All as Read" feature was used frequently to manage notification overload. &lt;/p&gt;
&lt;h2&gt;
  
  
  The Solution: Last Read Timestamp
&lt;/h2&gt;

&lt;p&gt;After evaluating several options, we settled on a timestamp-based approach that would fundamentally change how we track read states while maintaining backward compatibility with our existing system.&lt;br&gt;
Instead of updating each notification individually when a user clicks "Mark All as Read," we simply record the timestamp of when this action occurred. Any notification created before this timestamp is considered "read," while notifications arriving after it are "unread." This solution transforms what was an O(n) operation into an O(1) operation, regardless of how many notifications a user has.&lt;/p&gt;
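&lt;p&gt;The read/unread decision thereby becomes a pure string comparison. A minimal sketch of such a predicate (attribute names mirror the tables described in this article; ISO 8601 timestamps compare correctly as plain strings):&lt;/p&gt;

```typescript
// A notification counts as read if a readmark newer than it exists,
// or if it was individually marked as read (read_state = 1).
// ISO 8601 timestamps sort lexicographically, so <= works on strings.
export const isRead = (
  notification: { timestamp: string; read_state?: number },
  lastReadAt?: string,
): boolean =>
  (lastReadAt !== undefined && notification.timestamp <= lastReadAt) ||
  notification.read_state === 1
```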

&lt;p&gt;&lt;strong&gt;The New Table Structure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We created a new DynamoDB table called notifications-read-state with the following structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;pk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`ORG#&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;orgId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;#USER#&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Partition key&lt;/span&gt;
  &lt;span class="nx"&gt;sk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`READMARK#&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;         &lt;span class="c1"&gt;// Sort key&lt;/span&gt;
  &lt;span class="nx"&gt;read_at&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ISO8601Timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;// When the user marked all as read&lt;/span&gt;
  &lt;span class="nx"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ISO8601Timestamp&lt;/span&gt;         &lt;span class="c1"&gt;// When this record was created&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The primary key design allows us to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Efficiently lookup the most recent "mark all as read" timestamp for any user&lt;/li&gt;
&lt;li&gt;Support multiple organizations per user&lt;/li&gt;
&lt;li&gt;Maintain a history of read events if needed for analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;strong&gt;read_at&lt;/strong&gt; attribute stores an ISO-formatted timestamp that serves as our "high water mark" for read notifications. This single attribute is the cornerstone of our solution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;markAllAsRead&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;orgId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Step 1: Query for all unread notifications&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;unreadNotifications&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;getUnreadNotifications&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;orgId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Step 2: Prepare batch updates (25 items per batch due to DynamoDB limits)&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;batches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;createBatches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;unreadNotifications&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Step 3: Execute all batch updates&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;batch&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;batches&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ddb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;batchWrite&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;RequestItems&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;NOTIFICATIONS_TABLE&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="nx"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
          &lt;span class="na"&gt;PutRequest&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;Item&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;read_state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
          &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}))&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Complexity: O(n) - As the number of notifications increases, both processing time and database load increase linearly.&lt;/strong&gt;&lt;/p&gt;
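&lt;p&gt;The &lt;strong&gt;createBatches&lt;/strong&gt; helper referenced above is nothing more than a chunker honoring DynamoDB's 25-item BatchWriteItem limit. A minimal sketch:&lt;/p&gt;

```typescript
// Splits items into chunks of `size` elements each
// (25 is DynamoDB's BatchWriteItem limit per request).
export const createBatches = <T>(items: T[], size: number): T[][] => {
  const batches: T[][] = []
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size))
  }
  return batches
}
```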

&lt;p&gt;&lt;strong&gt;After&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;markAllAsRead&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;orgId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="c1"&gt;// Single write operation&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ddb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;TableName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;NOTIFICATIONS_READ_STATE_TABLE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;Item&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;pk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`ORG#&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;orgId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;#USER#&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;sk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`READMARK#&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;now&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;read_at&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;now&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;now&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Complexity: O(1) - Constant time operation regardless of notification count.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This simple change drastically improved our system's performance. The "Mark All as Read" operation now completes in milliseconds instead of potentially seconds, uses a predictable amount of database capacity, and never times out, even for users with thousands of unread notifications.&lt;br&gt;
What makes this approach particularly powerful is that we don't need to modify any existing notifications. Instead, we're recording a state transition that implicitly affects all notifications for a user at once.&lt;/p&gt;

&lt;p&gt;Given the following pseudo-code, the &lt;strong&gt;byReadState&lt;/strong&gt; index can be removed completely. All you need is to fetch the latest read timestamp via &lt;strong&gt;getLastReadTimestamp&lt;/strong&gt; for a given user and calculate whether a notification has already been seen.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;lastReadAt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;getLastReadTimestamp&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;orgId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;orgId&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Items&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;LastEvaluatedKey&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ddb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="p"&gt;...&lt;/span&gt;
  &lt;span class="na"&gt;TableName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;NOTIFICATIONS_TABLE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;IndexName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;byTimestamp&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;ExclusiveStartKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cursor&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nf"&gt;decodeLastEvaluatedKey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;...&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;notificationsWithReadStatus&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// A notification is considered read if:&lt;/span&gt;
  &lt;span class="c1"&gt;// - It was created before the last "mark all as read" time OR&lt;/span&gt;
  &lt;span class="c1"&gt;// - It has been individually marked as read (read_state = 1)&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;isRead&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="nx"&gt;lastReadAt&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;read_state&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;read_state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;isRead&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
 &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
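&lt;p&gt;For completeness, &lt;strong&gt;getLastReadTimestamp&lt;/strong&gt; itself boils down to a single newest-first query on the read-state table. A sketch of the query input, kept SDK-agnostic (table and key names follow the structure above; the builder name is an illustration, not our actual code):&lt;/p&gt;

```typescript
// Builds the DynamoDB query input for the most recent readmark.
// ISO timestamps in the sort key sort lexicographically, so reading
// the partition backwards with Limit 1 yields the newest entry.
export const buildLastReadQuery = (orgId: string, userId: string) => ({
  TableName: 'notifications-read-state',
  KeyConditionExpression: 'pk = :pk AND begins_with(sk, :sk)',
  ExpressionAttributeValues: {
    ':pk': `ORG#${orgId}#USER#${userId}`,
    ':sk': 'READMARK#',
  },
  ScanIndexForward: false, // newest readmark first
  Limit: 1,
})
```

<p>The result's first item (if any) carries the &lt;strong&gt;read_at&lt;/strong&gt; high water mark.</p>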



&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;Our journey from the first version to the optimized solution taught us a lot about designing for scale—especially with DynamoDB.&lt;/p&gt;

&lt;p&gt;At first, updating each notification one by one seemed fine. It was simple, worked great in dev, and handled early traffic just fine. But as usage grew, that approach quickly hit its limits. It was a good reminder: what works now might not work when your data grows 10x.&lt;/p&gt;

&lt;p&gt;The breakthrough came when we stopped trying to optimize the old way and instead rethought the problem. Rather than updating every record, we started recording state changes with timestamps. That shift made things both simpler and faster—and it's a pattern that applies well beyond notifications.&lt;/p&gt;

&lt;p&gt;Most importantly, we learned to play to DynamoDB’s strengths: fast, predictable access with simple operations. Once we aligned our design with that, everything clicked.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;It’s easy to overthink scalability from the start, but the truth is: you won’t know your real problems until users are actually using the system. Our experience reminded us that it's totally fine to start simple and ship. You learn way more from real-world usage than from guessing at edge cases.&lt;/p&gt;

&lt;p&gt;Scalability issues aren’t failures—they’re signs of growth. When we hit our limits, it forced us to rethink things. And funny enough, the fix—a single timestamp—ended up being both simple and powerful. It made the system faster, more reliable, and easier to reason about.&lt;/p&gt;

&lt;p&gt;So if you’re torn between shipping something basic now or building for every possible future, go with the simple version. Ship it, learn from it, and improve as you go.&lt;/p&gt;

&lt;p&gt;Do you want to work on features like this? Check out our &lt;a href="https://www.epilot.cloud/en/company/careers#Offene-Stellenangebote" rel="noopener noreferrer"&gt;career page&lt;/a&gt; or reach out to me on &lt;a href="https://x.com/boingCntributor" rel="noopener noreferrer"&gt;X&lt;/a&gt; or &lt;a href="https://www.linkedin.com/in/sauerer/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>dynamodb</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Building a Scalable Audit Log System with AWS and ClickHouse</title>
      <dc:creator>Sebastian</dc:creator>
      <pubDate>Tue, 26 Nov 2024 09:00:05 +0000</pubDate>
      <link>https://forem.com/epilot/building-a-scalable-audit-log-system-with-aws-and-clickhouse-jn5</link>
      <guid>https://forem.com/epilot/building-a-scalable-audit-log-system-with-aws-and-clickhouse-jn5</guid>
      <description>&lt;p&gt;Audit logs might seem like a backend feature that only a few people care about, but they play a crucial role in keeping things running smoothly and securely in any SaaS or tech company. Let me take you through our journey of building a robust and scalable audit log system. Along the way, I’ll share why we needed it, what exactly audit logs are, and how we combined tools like AWS, ClickHouse, and OpenAPI to craft a solution that works like a charm.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Case of the Disappearing Configuration
&lt;/h2&gt;

&lt;p&gt;At epilot, we’ve encountered a frustratingly familiar scenario. A customer reaches out, upset that one of their workflow configurations has mysteriously vanished. Their immediate question? “Who deleted it?”—and the assumption is that someone on our team is responsible.&lt;/p&gt;

&lt;p&gt;Now here’s the tricky part: how do we, as engineers, figure out who did what and when?&lt;/p&gt;

&lt;p&gt;One obvious approach is to dive into the application logs. But here’s the catch: most of the production logs aren’t enabled by default. Even when they are, they’re often sampled, capturing only about 10% of the actual traffic. Additionally, those logs often seem to lack the required information. This means we’re left piecing together incomplete data, like trying to solve a puzzle with half the pieces missing.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Are Audit Logs Anyway?
&lt;/h2&gt;

&lt;p&gt;Audit logs provide clear visibility into system changes, aiding teams in investigations, diagnosing incidents, and tracing unauthorized actions. They empower admins by reducing support reliance and ensuring clarity on actions like role or workflow updates. For enterprise customers, audit logs are a critical, expected feature that supports compliance with standards like ISO 27001. Additionally, they lay the groundwork for enhanced threat detection capabilities in the future. In simple terms, audit logs help answer the following questions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WHO&lt;/strong&gt; is doing something? Typically a user or a system (API call).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WHAT&lt;/strong&gt; is that user/system doing?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WHERE&lt;/strong&gt; is that occurring from? (e.g. an IP address)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WHEN&lt;/strong&gt; did it occur?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WHY&lt;/strong&gt; did it occur? (optional) “Why did the user log in?” → we don’t know; “Why is this IP blocked?” → the user logged in 5 times with the wrong password.&lt;/p&gt;
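
&lt;p&gt;To make these questions concrete, here is a sketch of what a single audit log entry could look like in TypeScript. The field names are our illustration, not the exact epilot schema:&lt;/p&gt;

```typescript
// Hypothetical shape of a single audit log entry -- the field names
// are illustrative, not the exact production schema.
interface AuditLogEntry {
  actor: { type: 'user' | 'system'; id: string }; // WHO
  action: string;                                 // WHAT, e.g. "workflow.configuration.deleted"
  sourceIp: string;                               // WHERE
  timestamp: string;                              // WHEN (ISO 8601)
  reason?: string;                                // WHY (optional, often unknown)
}

const example: AuditLogEntry = {
  actor: { type: 'user', id: 'user-123' },
  action: 'workflow.configuration.deleted',
  sourceIp: '203.0.113.42',
  timestamp: '2024-11-26T09:00:05Z',
};
```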

&lt;h2&gt;
  
  
  Key Considerations for a Successful Audit Log System
&lt;/h2&gt;

&lt;p&gt;Before diving into the technical details, it’s crucial to define what makes an audit log system effective. While the exact requirements depend on your company’s domain, there are some universal points worth considering:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compliance&lt;/strong&gt;: Ensure the system adheres to regulations like GDPR. For example, customers may request the deletion of personal data, so you’ll need a straightforward way to erase all logs tied to a specific customer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sustainability&lt;/strong&gt;: Audit logs grow rapidly, especially in high-traffic systems. Storing them indefinitely may not be feasible. Decide on strategies for archiving or purging logs over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Permissions&lt;/strong&gt;: Define who is allowed to access audit logs to maintain security and privacy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Format&lt;/strong&gt;: Standardize the structure of your logs to ensure they’re easy to interpret and query.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Selection&lt;/strong&gt;: Carefully determine what actions and events are worth logging to answer critical questions effectively, without unnecessary noise.&lt;/p&gt;




&lt;h2&gt;
  
  
  Making It Happen: How We Built Our Audit Logs
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsp6ixm91x3oeb3ywahek.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsp6ixm91x3oeb3ywahek.png" alt=" " width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At epilot, our APIs are built around serverless components provided by AWS. From the outset, we recognized that AWS API Gateway events provided a rich source of information for building audit logs. These events capture critical details such as user identities, actions performed (through the request payload), IP addresses, headers, and more.&lt;/p&gt;

&lt;p&gt;Given our microservices architecture, where services are organized by domain and accessed through an API Gateway (&lt;a href="https://docs.epilot.io/docs/architecture/overview/#system-architecture-diagram" rel="noopener noreferrer"&gt;see our system architecture&lt;/a&gt;), we needed a solution that seamlessly integrated with this structure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;High-Level Overview&lt;/strong&gt;&lt;br&gt;
Our approach to audit logging can be summarized as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Capturing events asynchronously.&lt;/li&gt;
&lt;li&gt;Validating and transforming raw events into a standard format.&lt;/li&gt;
&lt;li&gt;Persisting the data in a read-only, scalable, and query-friendly storage system.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This design adheres to several key technical principles:&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Asynchronous Event Capture&lt;/u&gt;&lt;br&gt;
We use Amazon SQS to decouple event capture from the main HTTP request flow. For example, when a user creates a new workflow configuration, the relevant API Gateway event is pushed to an SQS queue by middleware wrapping the API. This ensures that audit logging does not introduce latency or affect the performance of the core application logic.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;From Raw to Standardized Events&lt;/u&gt;&lt;br&gt;
Our focus is on capturing system modifications, specifically HTTP methods like POST, PUT, PATCH, and DELETE. These provide meaningful insights into changes occurring within the system. GET requests, on the other hand, generate excessive noise and are generally excluded—though we offer an opt-in mechanism for services where logging GET requests adds value.&lt;/p&gt;

&lt;p&gt;A Lambda function processes raw API Gateway events from the SQS queue, transforming them into a structured and validated format. This includes filtering relevant data, enhancing it using metadata like OpenAPI specifications, and ensuring consistency across all logged events.&lt;/p&gt;
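
&lt;p&gt;A rough sketch of that transformation step: the input fields below exist on the standard API Gateway proxy event, while the output shape and helper name are our illustration:&lt;/p&gt;

```typescript
// Minimal sketch: map a raw API Gateway proxy event to a standardized
// audit event. Types are trimmed to only the fields used here.
interface RawApiGatewayEvent {
  httpMethod: string;
  path: string;
  requestContext: {
    identity: { sourceIp: string };
    requestTimeEpoch: number;              // request time in ms since epoch
    authorizer?: { principalId?: string }; // set when the request was authorized
  };
}

interface StandardAuditEvent {
  actorId: string;
  method: string;
  path: string;
  sourceIp: string;
  occurredAt: string; // ISO 8601
}

function toAuditEvent(raw: RawApiGatewayEvent): StandardAuditEvent {
  return {
    actorId: raw.requestContext.authorizer?.principalId ?? 'anonymous',
    method: raw.httpMethod,
    path: raw.path,
    sourceIp: raw.requestContext.identity.sourceIp,
    occurredAt: new Date(raw.requestContext.requestTimeEpoch).toISOString(),
  };
}
```

The real pipeline additionally enriches events with metadata from the OpenAPI specifications; that part is omitted here.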

&lt;p&gt;&lt;u&gt;Data Persistence&lt;/u&gt;&lt;br&gt;
For storing audit logs, we chose &lt;a href="https://clickhouse.com/" rel="noopener noreferrer"&gt;ClickHouse&lt;/a&gt;, a highly scalable, SQL-based database that aligns with our requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read-only access: Supports immutability to preserve data integrity.&lt;/li&gt;
&lt;li&gt;Scalability: Proven in our data lake setup to handle large volumes of data efficiently.&lt;/li&gt;
&lt;li&gt;Querying: SQL capabilities allow for precise filtering and analysis, which is more complex with alternatives like DynamoDB.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By leveraging ClickHouse, we ensure a robust and scalable foundation for our audit logs, simplifying future integrations and analysis.&lt;/p&gt;
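
&lt;p&gt;For illustration, a table for such entries might look like the following ClickHouse DDL (a sketch with assumed column names, not our production schema):&lt;/p&gt;

```sql
-- Illustrative sketch, not the production schema.
CREATE TABLE audit_logs (
    org_id      String,
    actor_id    String,
    method      LowCardinality(String),
    path        String,
    source_ip   String,
    occurred_at DateTime
)
ENGINE = MergeTree
ORDER BY (org_id, occurred_at);
```

&lt;p&gt;A GDPR erasure request for a single customer can then be served with a mutation such as &lt;code&gt;ALTER TABLE audit_logs DELETE WHERE org_id = '...'&lt;/code&gt;, which addresses the compliance point raised earlier.&lt;/p&gt;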

&lt;p&gt;&lt;u&gt;Integration&lt;/u&gt;&lt;br&gt;
To make audit logging effortless for our microservices, we focused on seamless integration. At epilot, we rely heavily on &lt;a href="https://middy.js.org/" rel="noopener noreferrer"&gt;middy&lt;/a&gt;, a middleware engine used across all our services. Building on this, we introduced a new middleware: &lt;strong&gt;withAuditLog&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { withAuditLog } from '@epilot/audit-log'
import middy from '@middy/core'
import type { Handler } from 'aws-lambda'


export const withMiddlewares = (handler: Handler) =&amp;gt; {
  return middy(handler)
    .use(enableCorrelationIds())
    .use(...)
    .use(
      withAuditLog({
        ignorePaths: ['/v1/webhooks/configs/{configId}/trigger']
      })
    )
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This middleware integrates directly into existing services and simplifies the audit logging process by:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Capturing API Gateway Events&lt;/strong&gt;: It hooks into the request lifecycle to extract the API Gateway event details.&lt;br&gt;
&lt;strong&gt;Omitting GET Requests by Default&lt;/strong&gt;: To reduce noise, it filters out GET requests, with an option to opt them in for specific services where needed.&lt;br&gt;
&lt;strong&gt;Forwarding to SQS&lt;/strong&gt;: Its primary role is to forward the event to an SQS queue for asynchronous processing.&lt;/p&gt;

&lt;p&gt;With this middleware, adding audit logging to any microservice is as simple as including withAuditLog in the service's middleware stack and granting the sqs:SendMessage permission. This ensures consistency, reduces implementation effort, and keeps the integration process dead simple.&lt;/p&gt;
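
&lt;p&gt;The filtering rules described above boil down to a small decision function. The sketch below is illustrative; it is not the actual @epilot/audit-log implementation:&lt;/p&gt;

```typescript
// Illustrative sketch of the middleware's filtering decision --
// not the actual @epilot/audit-log implementation.
interface AuditOptions {
  ignorePaths?: string[]; // exact resource paths to skip entirely
  captureGet?: boolean;   // opt-in for services where GETs add value
}

const MUTATING = new Set(['POST', 'PUT', 'PATCH', 'DELETE']);

function shouldAudit(method: string, resourcePath: string, opts: AuditOptions = {}): boolean {
  if (opts.ignorePaths?.includes(resourcePath)) return false;
  if (method === 'GET') return opts.captureGet ?? false; // GETs are noise by default
  return MUTATING.has(method);
}
```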

&lt;p&gt;&lt;strong&gt;Technical Considerations&lt;/strong&gt;&lt;br&gt;
This article focuses on our high-level approach to building audit logs, as there are numerous ways to tackle the problem, each with its trade-offs. During our research, we explored alternatives like EventBridge for emitting events at the end of each request or Kinesis for streaming data. Ultimately, we chose a solution that met our key requirements: decoupling log emission from the main flow while offering flexibility in managing throughput and batching.&lt;/p&gt;

&lt;p&gt;Here’s why we chose SQS:&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Decoupling from the Main Flow&lt;/u&gt;&lt;br&gt;
SQS allows us to process audit logs asynchronously, ensuring that the main HTTP request flow remains unaffected. This means audit log processing won’t slow down user-facing operations.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Flexibility with Throughput and Batching&lt;/u&gt;&lt;br&gt;
With SQS, we can fine-tune parameters like long-polling and batch windows to optimize throughput without compromising efficiency. This ensures scalable and reliable processing regardless of traffic spikes.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Scalability for POST/PUT/PATCH/DELETE Events&lt;/u&gt;&lt;br&gt;
Since we exclude GET requests by default, the system handles fewer, more meaningful events. Capturing GET requests would mean a much higher event volume and could lead to Lambda concurrency issues, as a surge of Lambda environments subscribing to the queue could starve other services that also rely on Lambda.&lt;/p&gt;




&lt;h2&gt;
  
  
  Exposing Audit Logs to Users
&lt;/h2&gt;

&lt;p&gt;To make audit logs accessible and actionable, we introduced a new &lt;a href="https://sst.dev/" rel="noopener noreferrer"&gt;SST&lt;/a&gt;-based microservice that acts as a bridge to query data from ClickHouse. This microservice provides a simple and intuitive interface for users to explore their audit logs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Search and Filtering: A user-friendly search bar allows users to combine filters effortlessly, enabling them to pinpoint specific events or patterns within the logs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Activity Messages: Each audit log entry includes an activity message, a concise summary of what occurred. This message is dynamically constructed on the API side, tailored to the specific service name, making it customizable and relevant.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By customizing the activity messages for each service, users can quickly understand what happened in their systems without wading through raw data. This tailored approach ensures that the audit logs deliver immediate value and clarity to end users.&lt;/p&gt;
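
&lt;p&gt;As a sketch of how such a per-service activity message could be assembled (the template mapping below is hypothetical; the real one lives on the API side and is richer):&lt;/p&gt;

```typescript
// Hypothetical per-service activity-message templates -- the real
// mapping is richer and maintained on the API side.
const templates: { [service: string]: (method: string, resource: string) => string } = {
  workflow: (m, r) =>
    m === 'DELETE' ? `Workflow configuration ${r} was deleted`
    : m === 'POST' ? `Workflow configuration ${r} was created`
    : `Workflow configuration ${r} was updated`,
};

function activityMessage(service: string, method: string, resource: string): string {
  const template = templates[service];
  // Fall back to a generic summary for services without custom templates.
  return template ? template(method, resource) : `${method} on ${service}/${resource}`;
}
```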

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgat6dm8vy7qv7tt4sr1c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgat6dm8vy7qv7tt4sr1c.png" alt=" " width="800" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;In this article, we detailed the design and implementation of our audit log system at epilot, highlighting the key decisions and considerations that shaped its architecture. Our approach leverages AWS serverless components to seamlessly integrate audit logging into our microservices, ensuring scalability, efficiency, and ease of use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Capturing Events:&lt;/strong&gt; Using a custom middleware, withAuditLog, we extract API Gateway events asynchronously and forward them to an SQS queue, ensuring the logging process does not block the main application flow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Processing and Storing Logs:&lt;/strong&gt; A Lambda function transforms raw events into a standardized format, focusing on meaningful system modifications (POST, PUT, PATCH, DELETE) and stores them in a scalable, SQL-based ClickHouse database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User Accessibility:&lt;/strong&gt; A new SST-based microservice provides a simple interface for querying and filtering logs. Tailored activity messages enhance usability, helping users quickly understand what occurred.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Technical Considerations:&lt;/strong&gt; SQS was chosen for its ability to decouple the logging process, optimize throughput, and handle scalability challenges. While other solutions like EventBridge or Kinesis were viable, SQS met our specific requirements effectively.&lt;/p&gt;

&lt;p&gt;This high-level overview provides a flexible, scalable, and user-friendly solution for audit logging while ensuring system integrity and maintaining performance.&lt;/p&gt;

&lt;p&gt;Do you want to work on features like this? Check out our &lt;a href="https://www.epilot.cloud/en/company/careers#Offene-Stellenangebote" rel="noopener noreferrer"&gt;career page&lt;/a&gt; or reach out on &lt;a href="https://x.com/boingCntributor" rel="noopener noreferrer"&gt;X&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>auditlog</category>
      <category>clickhouse</category>
      <category>sqs</category>
    </item>
    <item>
      <title>How Epilot Builds a Powerful Webhook Feature with AWS</title>
      <dc:creator>Sebastian</dc:creator>
      <pubDate>Wed, 17 Jan 2024 13:31:58 +0000</pubDate>
      <link>https://forem.com/epilot/how-epilot-builds-a-powerful-webhook-feature-with-aws-4glo</link>
      <guid>https://forem.com/epilot/how-epilot-builds-a-powerful-webhook-feature-with-aws-4glo</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;At epilot, we're committed to simplifying the sale and technical implementation of renewable energy solutions through a digital foundation, supporting energy suppliers, grid operators, and solution providers in the energy transition.&lt;/p&gt;

&lt;p&gt;One of our features is the integration of webhooks for data synchronization with third-party systems. This allows for timely updates and efficient data exchange, a crucial factor in enhancing our customer service in the energy sector.&lt;/p&gt;

&lt;p&gt;In this blog post, we'll dive into how we have harnessed the power of AWS to build a robust webhook feature, enhancing our service capabilities and offering our clients an even more powerful and reliable platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Evolution of Our Webhook System
&lt;/h2&gt;

&lt;p&gt;The initial version of our webhook feature was developed around the time AWS launched a new product known as &lt;a href="https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-api-destinations.html" rel="noopener noreferrer"&gt;API Destinations&lt;/a&gt;. The concept is simple but powerful: an API Destination, created as the target of an EventBridge rule, seamlessly forwards the request to any configured third-party service. One significant advantage of this approach is the use of EventBridge connections to secure webhook requests. Securing requests is a common challenge on many platforms, where it is either unsupported or only possible through a signing secret. With EventBridge connections, securing a request becomes versatile and robust, offering options like basic authentication (username/password), API keys (e.g., Authorization: ), or OAuth – a feature frequently demanded by larger enterprise customers. This method also eliminates the need for us to manually store client credentials, as the API Destination handles the signing and forwarding of the request.&lt;/p&gt;

&lt;p&gt;The following showcases a sketch of necessary components for our initial webhook architecture&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdqgh90gf32q6k5d9i3im.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdqgh90gf32q6k5d9i3im.png" alt="Initial Architecture" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The user creates a webhook configuration through our UI. A Lambda function creates an API Destination and an EventBridge connection, then attaches the connection to the API Destination. Next, an EventBridge rule is created with the API Destination as its target. Whenever this rule matches, the target is invoked. The API Destination forwards failed requests to a Dead Letter Queue (DLQ); a Lambda function picks up messages from the queue and stores these events in a table so failed events can be displayed to the user.&lt;/p&gt;

&lt;h3&gt;
  
  
  Caveats
&lt;/h3&gt;

&lt;p&gt;As our platform scaled with increased traffic and users, it surfaced unforeseen issues. The architecture we initially implemented revealed deficiencies in areas we hadn't anticipated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Frequent Timeouts&lt;/strong&gt;: Our customers often synchronise data generated by epilot with their systems, some of which may be slower and unable to handle requests asynchronously. A notable limitation of API Destination is its strict 5-second timeout on requests. This constraint is frequently encountered when syncing data with third-party systems, as their response times can easily exceed this duration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Payload Size&lt;/strong&gt;: EventBridge has a hard event size limit of 256kb. While this is a substantial data allowance, we occasionally reach this limit due to extensive data usage. In serverless environments, a typical solution to circumvent such limitations is the &lt;a href="https://serverlessland.com/event-driven-architecture/visuals/claim-check-pattern" rel="noopener noreferrer"&gt;Claim-Check-Pattern&lt;/a&gt;. However, this approach is not supported by API Destination.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No Analytics&lt;/strong&gt;: Monitoring within EventBridge remains a complex issue, particularly in determining the success of requests and reflecting this in the user interface. While Dead Letter Queue (DLQ) setups enable us to capture failed events, the challenge lies in effectively tracking and displaying successful events.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Was the request successful?&lt;/strong&gt;: In our platform, webhooks can be triggered by automations. An automation is a set of predefined actions, such as triggering a webhook. We often received feedback from customers who found it confusing when webhook actions appeared to be successful but ultimately failed. Given the 'fire-and-forget' nature of webhooks, a challenge arises: How can we promptly display a failure when a request doesn't go through successfully?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No static IP support&lt;/strong&gt;: Larger enterprise customers often require the support of static IPs for using webhook features, which poses a challenge as API Destinations currently do not offer this capability.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How AWS Step Functions Fulfilled Our Requirements
&lt;/h2&gt;

&lt;p&gt;The lack of the features mentioned above made the case for a new webhook architecture.&lt;br&gt;
The AWS Step Functions team recently published a new &lt;a href="https://docs.aws.amazon.com/step-functions/latest/dg/connect-thirdy-party-apis.html" rel="noopener noreferrer"&gt;HTTP task&lt;/a&gt;, which is very similar to API Destinations. One can reuse the EventBridge connection to authorize the request, and the HTTP task forwards it to a third-party system. It has no CDK support yet, though, and needs to be stable for a few months before we adopt it. The announcement, however, gave us the idea of using a Step Function to implement our webhook architecture. With Step Functions we can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;remove any timeout issues&lt;/li&gt;
&lt;li&gt;call them synchronously (30s API GW timeout) and asynchronously (no timeout)&lt;/li&gt;
&lt;li&gt;create a lambda task that forwards the request manually:

&lt;ul&gt;
&lt;li&gt;allows us to use the Claim-Check-Pattern and send larger payloads&lt;/li&gt;
&lt;li&gt;can run within a VPC, i.e. a static IP is easy to add&lt;/li&gt;
&lt;li&gt;complete control over how we fetch and store the HTTP response&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;store all http responses in a new event table&lt;/li&gt;

&lt;li&gt;easily extend the Step Function with new features when necessary&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Playing around with the awesome Step Function builder gives us the following output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3kzp5613x0ycxxxdic28.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3kzp5613x0ycxxxdic28.png" alt="Webhook Step Function" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The goal is to use as few Lambda functions as possible to mitigate &lt;a href="https://www.researchgate.net/figure/Cascading-cold-starts-in-AWS-Step-Functions-ASF-and-Azure-Durable-functions-ADF_fig3_347698258" rel="noopener noreferrer"&gt;cascading cold starts&lt;/a&gt;. The Step Function architecture itself is straightforward and consists of the following tasks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;GetItemTask&lt;/strong&gt; Fetch the webhook configuration to know where and how to send the event&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PutItemTask&lt;/strong&gt; Persist an event to DynamoDB with some initial data and an 'in_progress' state&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LambdaInvokeTask&lt;/strong&gt; Call the 3rd party with the input of the state machine. When the input contains an s3_key, hydrate the payload first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LambdaInvokeTask&lt;/strong&gt;  Set the event to 'failed' or 'succeeded' based on the HTTP response.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LambdaInvokeTask&lt;/strong&gt; (exceptions): Catch unknown exceptions, which raises alerts and sets the event to 'failed' as well.&lt;/li&gt;
&lt;/ol&gt;
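
&lt;p&gt;In Amazon States Language, a stripped-down sketch of that state machine could look like the following. State names match the list above; resource ARNs, table names, and parameters are illustrative, not our production definition:&lt;/p&gt;

```json
{
  "Comment": "Illustrative sketch of the webhook state machine",
  "StartAt": "GetItemTask",
  "States": {
    "GetItemTask": {
      "Type": "Task",
      "Resource": "arn:aws:states:::dynamodb:getItem",
      "Parameters": {
        "TableName": "webhook-configs",
        "Key": { "config_id": { "S.$": "$.config_id" } }
      },
      "ResultPath": "$.config",
      "Next": "PutItemTask"
    },
    "PutItemTask": {
      "Type": "Task",
      "Resource": "arn:aws:states:::dynamodb:putItem",
      "Parameters": {
        "TableName": "webhook-events",
        "Item": { "status": { "S": "in_progress" } }
      },
      "ResultPath": null,
      "Next": "ForwardRequest"
    },
    "ForwardRequest": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "forward-webhook", "Payload.$": "$" },
      "Catch": [ { "ErrorEquals": ["States.ALL"], "Next": "HandleException" } ],
      "Next": "PersistResult"
    },
    "PersistResult": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "persist-result", "Payload.$": "$" },
      "End": true
    },
    "HandleException": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "handle-exception", "Payload.$": "$" },
      "End": true
    }
  }
}
```

&lt;p&gt;Note how the two DynamoDB states use direct service integrations rather than Lambda invocations, which is exactly the cold-start mitigation mentioned above.&lt;/p&gt;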

&lt;p&gt;This results in the following (high level) architecture sketch:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fom5xx1niubfh4y75w1tg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fom5xx1niubfh4y75w1tg.png" alt="High Level Architecture" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We're updating our system to use AWS Step Functions instead of API Destinations with EventBridge. This change is pretty straightforward, so we don't need any complex migration scripts. We can still use the EventBridge connections we already have, but we'll need to attach them manually to our Lambda tasks for now. We're hoping to automate this attachment by using the new HTTP task mentioned above soon.&lt;/p&gt;

&lt;p&gt;For event publishing, we're using a new API endpoint &lt;br&gt;
&lt;code&gt;/webhook/{config_id}/trigger?sync=true|false&lt;/code&gt;. The endpoint checks if the data is bigger than 256kb and, if so, stores it on S3. After that, it triggers the Step Function either in the background or synchronously. This setup is great because it means consumers don't have to worry about permissions; they just need to set up our &lt;a href="https://github.com/epilot-dev/sdk-js/tree/main/clients/webhooks-client" rel="noopener noreferrer"&gt;webhook client&lt;/a&gt;. Of course, the consumer can still use the old method of just sending an EventBridge event to trigger the webhook like before.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;API Destination proved to be an excellent service for creating a basic webhook feature, but its limitations led us to transition to AWS Step Functions. This shift has enabled us to offer our customers enhanced capabilities, including static IP support, improved analytics, handling of larger payloads, and the elimination of timeout issues. With Step Functions, we now have the flexibility to scale and evolve our architecture to meet our growing needs and those of our customers.&lt;/p&gt;

&lt;p&gt;Do you want to work on features like this? Check out &lt;a href="https://epilot.recruitee.com/" rel="noopener noreferrer"&gt;our career page&lt;/a&gt; or reach out on &lt;a href="https://twitter.com/boingCntributor" rel="noopener noreferrer"&gt;X&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>webhook</category>
      <category>stepfunctions</category>
      <category>eventbridge</category>
    </item>
  </channel>
</rss>
