<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Nabin Debnath</title>
    <description>The latest articles on Forem by Nabin Debnath (@nabindebnath).</description>
    <link>https://forem.com/nabindebnath</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3538672%2F1e51e7ad-4c1f-4d31-9f68-eaae38f6a125.jpg</url>
      <title>Forem: Nabin Debnath</title>
      <link>https://forem.com/nabindebnath</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/nabindebnath"/>
    <language>en</language>
    <item>
      <title>The "Stateful Island" Paradox: Architecting Astro for Enterprise Scale</title>
      <dc:creator>Nabin Debnath</dc:creator>
      <pubDate>Tue, 10 Feb 2026 13:00:50 +0000</pubDate>
      <link>https://forem.com/nabindebnath/the-stateful-island-paradox-architecting-astro-for-enterprise-scale-2m49</link>
      <guid>https://forem.com/nabindebnath/the-stateful-island-paradox-architecting-astro-for-enterprise-scale-2m49</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Astro is fantastic for content sites, but it gets tricky when you build complex apps. The specific pain point is state management: because Astro's "Islands" run in isolation, they can't easily talk to each other. This article details a pattern that uses Nano Stores and Edge Middleware to let disjointed islands share state without turning your app back into a bloated SPA.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Reality Check
&lt;/h2&gt;

&lt;p&gt;We’ve all seen the Astro adoption cycle. You pitch it to the team because the performance metrics are undeniable. You build the marketing pages, the blog, and the "About Us" section. The Lighthouse scores hit 100, the JS bundle is nonexistent, and the stack feels perfect.&lt;/p&gt;

&lt;p&gt;Then you hit the wall.&lt;/p&gt;

&lt;p&gt;Usually, it happens when a Product Manager asks for something seemingly simple: "Can we keep the shopping cart count updated in the header when the user adds an item in the sidebar?"&lt;/p&gt;

&lt;p&gt;In a standard React app (Next.js or CRA), this is trivial. You wrap the app in a Context Provider and move on. But in Astro, that header and that sidebar are effectively strangers. They don't share a Virtual DOM. They don't share a parent component. They are two separate mini-apps floating in a sea of static HTML.&lt;/p&gt;

&lt;p&gt;The immediate reflex is to wrap the entire &lt;code&gt;&amp;lt;body&amp;gt;&lt;/code&gt; in a giant React Provider, but that defeats the entire purpose of using Astro. You’ve just accidentally rebuilt a worse Single Page Application.&lt;/p&gt;

&lt;p&gt;We need a way to keep the performance of isolated islands while getting the data consistency of a monolith.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern: Subterranean State
&lt;/h2&gt;

&lt;p&gt;To solve this, we have to stop thinking about state as something that flows down from a parent component. In Astro, there is no persistent parent.&lt;/p&gt;

&lt;p&gt;Instead, think of state as "subterranean." The UI islands float on the surface, disconnected from each other. The data lives "underground," in a framework-agnostic layer that tunnels information up to whatever component needs it.&lt;/p&gt;

&lt;p&gt;We need a stack that meets three criteria:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Framework Agnostic:&lt;/strong&gt; It has to work even if the Header is React and the Cart is Svelte (a common migration scenario).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hydration Independent:&lt;/strong&gt; It needs to exist before the components even wake up.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Server Safe:&lt;/strong&gt; It must accept initial state from the Edge to prevent layout shift.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The solution that works best in production right now is &lt;strong&gt;Nano Stores&lt;/strong&gt; for the client, bridged with &lt;strong&gt;Astro Middleware&lt;/strong&gt; for the server.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Flow Visualization
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsyvrys4atmwq3irppgmn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsyvrys4atmwq3irppgmn.png" alt="The Subterranean Layer" width="800" height="802"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Implementation Details
&lt;/h2&gt;

&lt;p&gt;Let's look at the actual code. We will build a shared cart state that works across frameworks.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Agnostic Domain Logic
&lt;/h3&gt;

&lt;p&gt;We define the store in pure TypeScript. No React, no Svelte, just logic. This makes it incredibly easy to unit test because you don't need to mock a DOM.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// src/stores/cartStore.ts
import { map, computed } from 'nanostores';

export type CartItem = { id: string; price: number; title: string };

export type CartState = {
  items: CartItem[];
  isDrawerOpen: boolean;
};

// 1. Define the Atom
export const $cart = map&amp;lt;CartState&amp;gt;({
  items: [],
  isDrawerOpen: false
});

// 2. Computed State (Performance Optimization)
// Only subscribers to $totalPrice will re-render when items change.
export const $totalPrice = computed($cart, cart =&amp;gt; 
  cart.items.reduce((acc, item) =&amp;gt; acc + item.price, 0)
);

// 3. Actions (The API)
// This is where your business logic lives.
export function addToCart(item: CartItem) {
  const current = $cart.get();

  // Example business logic
  if (current.items.length &amp;gt;= 10) {
    return console.warn("Cart limit reached"); 
  }

  $cart.setKey('items', [...current.items, item]);
  $cart.setKey('isDrawerOpen', true);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
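&lt;p&gt;Because the store is plain TypeScript, the business rules can be unit tested with no DOM and no framework harness. Here is a dependency-free sketch of such a test: the tiny inline store is a stand-in for the real &lt;code&gt;nanostores&lt;/code&gt; import so the snippet runs anywhere, and the cap check is simplified to an equality test (items are only ever added one at a time):&lt;/p&gt;

```typescript
// Stand-in for nanostores' `map` so this sketch has zero dependencies.
// In a real test suite you would import { $cart, addToCart } from './cartStore'.
type CartItem = { id: string; price: number; title: string };

const $cart = {
  value: { items: [] as CartItem[], isDrawerOpen: false },
  get: function () { return this.value; },
  setKey: function (key: string, v: unknown) {
    this.value = Object.assign({}, this.value, { [key]: v });
  },
};

function addToCart(item: CartItem) {
  const current = $cart.get();
  // Items are added one at a time, so length reaches exactly 10 at the cap.
  if (current.items.length === 10) {
    console.warn('Cart limit reached');
    return;
  }
  $cart.setKey('items', [...current.items, item]);
  $cart.setKey('isDrawerOpen', true);
}

// The "unit test": pure assertions on state transitions, no DOM required.
addToCart({ id: '1', title: 'Widget', price: 25 });
if ($cart.get().items.length !== 1) throw new Error('item was not added');
if (!$cart.get().isDrawerOpen) throw new Error('drawer did not open');
console.log('cart store logic: all assertions passed');
```

&lt;p&gt;Swapping the stand-in for the real &lt;code&gt;nanostores&lt;/code&gt; &lt;code&gt;map&lt;/code&gt; changes nothing about the assertions, which is exactly the point of keeping domain logic framework-free.&lt;/p&gt;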



&lt;h3&gt;
  
  
  The React Consumer (Headless)
&lt;/h3&gt;

&lt;p&gt;The React component is now just a dumb view. It doesn't manage state; it just reflects it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// src/components/Header.tsx
import { useStore } from '@nanostores/react';
import { $cart, $totalPrice } from '../stores/cartStore';

export const Header = () =&amp;gt; {
  // This component will re-render AUTOMATICALLY when $cart changes
  const cart = useStore($cart);
  const total = useStore($totalPrice);

  return (
    &amp;lt;nav&amp;gt;
      &amp;lt;h1&amp;gt;Enterprise Store&amp;lt;/h1&amp;gt;
      &amp;lt;div className="cart-summary"&amp;gt;
        {cart.items.length} items (${total})
      &amp;lt;/div&amp;gt;
    &amp;lt;/nav&amp;gt;
  );
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Svelte Producer
&lt;/h3&gt;

&lt;p&gt;Here is the cool part. The Svelte component imports the exact same file. No "props drilling" through three layers of layout components.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;script&amp;gt;
  import { addToCart } from '../stores/cartStore';
  export let product;
&amp;lt;/script&amp;gt;

&amp;lt;div class="card"&amp;gt;
  &amp;lt;h3&amp;gt;{product.title}&amp;lt;/h3&amp;gt;
  &amp;lt;button on:click={() =&amp;gt; addToCart(product)}&amp;gt;
    Add to Cart
  &amp;lt;/button&amp;gt;
&amp;lt;/div&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Solving the "Flash of Zero State"
&lt;/h2&gt;

&lt;p&gt;If you stop here, you have a race condition.&lt;/p&gt;

&lt;p&gt;When a user refreshes the page, the store initializes with its empty defaults. Then, maybe 500ms later, your client-side JS kicks in, reads from &lt;code&gt;localStorage&lt;/code&gt;, and the cart count jumps from 0 to 5.&lt;/p&gt;

&lt;p&gt;That layout shift is a user experience killer. In a real app, the initial state usually comes from the server (a session cookie, a user database).&lt;/p&gt;

&lt;p&gt;We can use Astro Middleware to fetch this data on the server and hand it off to the store before the browser even paints.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Middleware Injection
&lt;/h3&gt;

&lt;p&gt;We intercept the request at the edge to fetch the user's session.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// src/middleware.ts
import { defineMiddleware } from 'astro/middleware';

export const onRequest = defineMiddleware(async (context, next) =&amp;gt; {
  // 1. Identify User (Simulated)
  const sessionToken = context.cookies.get('auth_token');

  // 2. Fetch State (Simulated DB call)
  // In reality, this would be await db.getCart(sessionToken)
  const userCart = { 
    items: [{ id: '1', title: 'Saved Item', price: 50 }], 
    isDrawerOpen: false 
  };

  // 3. Attach to locals so the Layout can see it
  context.locals.initialState = userCart;

  return next();
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The HTML Handoff
&lt;/h3&gt;

&lt;p&gt;In your main layout, we bridge the server-client gap. We write the state directly into a global variable so the store can pick it up synchronously.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
// src/layouts/Layout.astro
const { initialState } = Astro.locals;
---
&amp;lt;head&amp;gt;
  &amp;lt;script define:vars={{ initialState }}&amp;gt;
    window.SERVER_STATE = initialState;
  &amp;lt;/script&amp;gt;

  &amp;lt;script&amp;gt;
    import { $cart } from '../stores/cartStore';

    if (window.SERVER_STATE) {
      $cart.set(window.SERVER_STATE);
    }
  &amp;lt;/script&amp;gt;
&amp;lt;/head&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Why This Approach Scales
&lt;/h2&gt;

&lt;p&gt;This isn't just a hack to make things work; it's a better architectural pattern for large teams.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decoupling:&lt;/strong&gt; Team A can work on the Search Bar (React) and Team B can work on the Checkout Sidebar (Svelte). As long as they agree on the &lt;code&gt;cartStore.ts&lt;/code&gt; interface, they never step on each other's toes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance:&lt;/strong&gt; You maintain the "Island" benefits. The header hydrates immediately (&lt;code&gt;client:load&lt;/code&gt;), but the heavy cart sidebar can wait until the user clicks a button (&lt;code&gt;client:idle&lt;/code&gt; or &lt;code&gt;client:only&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Portability:&lt;/strong&gt; If you decide to ditch React for SolidJS next year, your business logic (the store) stays exactly the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The "Stateful Island" paradox is only a problem if you try to force Astro to behave like Next.js. Once you decouple your state from your UI framework and let it live in the "subterranean" layer, Astro becomes a serious contender for complex, enterprise-grade applications.&lt;/p&gt;

&lt;p&gt;Stop fighting the isolation. Embrace it, and tunnel your data underneath.&lt;/p&gt;

</description>
      <category>astro</category>
      <category>webdev</category>
      <category>javascript</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Composite SLOs for Serverless Event-Driven Systems</title>
      <dc:creator>Nabin Debnath</dc:creator>
      <pubDate>Mon, 05 Jan 2026 13:46:01 +0000</pubDate>
      <link>https://forem.com/nabindebnath/composite-slos-for-serverless-event-driven-systems-40do</link>
      <guid>https://forem.com/nabindebnath/composite-slos-for-serverless-event-driven-systems-40do</guid>
      <description>&lt;p&gt;&lt;strong&gt;Measuring What Users Experience Across API Gateway -&amp;gt; Lambda -&amp;gt; DynamoDB -&amp;gt; EventBridge&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;

&lt;p&gt;Serverless systems rarely fail at a single component. Failures occur at the junctures between managed services. Yet most SLO implementations still measure API Gateway, Lambda and DynamoDB in isolation.&lt;br&gt;
This article shows how to define and operate composite, end-to-end SLOs for a real serverless chain. You'll see how to derive availability and latency SLIs across multiple AWS services, calculate error budgets correctly, wire burn-rate alerts, and ship a working dashboard using CloudWatch metric math and infrastructure as code.&lt;/p&gt;


&lt;h2&gt;
  
  
  Introduction: Why "Everything Is Green" Is Still Not Good Enough
&lt;/h2&gt;

&lt;p&gt;If you have operated a serverless system for a long time, you'll eventually experience this situation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API Gateway shows 99.9% availability&lt;/li&gt;
&lt;li&gt;Lambda error rate looks fine&lt;/li&gt;
&lt;li&gt;DynamoDB has no throttles&lt;/li&gt;
&lt;li&gt;EventBridge metrics are quiet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Despite this, users are retrying operations, workflows remain unfinished, and crucial events are failing to reach downstream systems.&lt;/p&gt;

&lt;p&gt;Nothing is individually broken. The system is. The problem is not observability coverage. It's how reliability is modeled.&lt;/p&gt;

&lt;p&gt;Serverless architectures push complexity into managed services. That's a good trade until reliability is measured per service instead of per request. At that point, SLOs stop representing user experience and start representing dashboards.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Gap We're Trying to Close
&lt;/h2&gt;

&lt;p&gt;There is a lot of content online about SLOs, but what's missing is specificity.&lt;br&gt;
What I mostly find:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generic SLO explainers based on microservices&lt;/li&gt;
&lt;li&gt;Burn-rate math explained with Prometheus examples&lt;/li&gt;
&lt;li&gt;AWS blog posts measuring one service at a time&lt;/li&gt;
&lt;li&gt;"Composite SLO" is mentioned as a concept&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What’s consistently absent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A step-by-step SLO model for serverless request chains&lt;/li&gt;
&lt;li&gt;Clear guidance on what counts as failure when managed services retry, buffer, or partially succeed&lt;/li&gt;
&lt;li&gt;Concrete examples using CloudWatch metric math&lt;/li&gt;
&lt;li&gt;A way to combine sync and async paths into a single availability signal&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article closes that gap.&lt;/p&gt;


&lt;h2&gt;
  
  
  The System We're Measuring
&lt;/h2&gt;

&lt;p&gt;We'll use a very common production pattern:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8bg4o091c0smd2uhs4t6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8bg4o091c0smd2uhs4t6.png" alt="Common Serverless Architecture Flow" width="800" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From a user’s perspective, the request is successful only if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API Gateway accepts and processes the request&lt;/li&gt;
&lt;li&gt;Lambda executes successfully&lt;/li&gt;
&lt;li&gt;The DynamoDB write succeeds&lt;/li&gt;
&lt;li&gt;The event is published to EventBridge&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a deliberately strict definition, and it drives everything that follows: any result short of complete success is a partial failure, even if the HTTP response code is 200.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why Per-Service SLOs Break Down in Serverless
&lt;/h2&gt;

&lt;p&gt;Per-service SLOs assume clean failure boundaries. Serverless doesn’t have those.&lt;/p&gt;

&lt;p&gt;Consider this real scenario:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API Gateway returns 200&lt;/li&gt;
&lt;li&gt;Lambda executes successfully&lt;/li&gt;
&lt;li&gt;DynamoDB write succeeds&lt;/li&gt;
&lt;li&gt;EventBridge PutEvents partially fails&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your API metrics look perfect. Your Lambda metrics look perfect. Your DynamoDB metrics look perfect.&lt;br&gt;
Your business workflow is broken.&lt;br&gt;
This is why composite SLOs are not "advanced"; they're table stakes for event-driven systems.&lt;/p&gt;
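&lt;p&gt;The arithmetic behind this is unforgiving. If every hop must succeed and the hops fail (roughly) independently, per-hop availabilities multiply, so four services that each meet a 99.9% target only promise about 99.6% end to end. A quick sketch with illustrative numbers:&lt;/p&gt;

```typescript
// Four chained services, each individually meeting "three nines" (99.9%).
const perHopAvailability = 0.999;
const hops = 4; // API Gateway, Lambda, DynamoDB, EventBridge

// If every hop must succeed, availabilities multiply.
const composite = Math.pow(perHopAvailability, hops);

console.log(composite.toFixed(4)); // "0.9960" - noticeably worse than any single hop
```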


&lt;h2&gt;
  
  
  Defining the Composite SLO
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Availability Objective
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;99.5% of requests must complete end-to-end successfully over a rolling 30-day window
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;A request is counted as successful only if all four steps succeed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important Scope Clarification:&lt;/strong&gt; Note that for the asynchronous step (EventBridge), "success" means the event was successfully published to the bus. This SLO measures the promise of work, not the eventual consumption by downstream subscribers.&lt;br&gt;
If you have critical downstream consumers, they need their own separate SLOs. Trying to jam async consumption into a synchronous API availability metric will only create noise.&lt;/p&gt;
&lt;h3&gt;
  
  
  Latency Objective
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;95% of successful requests must complete within 800 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Latency here reflects user-visible delay, not async processing time. EventBridge publishing is included in availability, not latency.&lt;br&gt;
This distinction matters more than most teams realize.&lt;/p&gt;


&lt;h2&gt;
  
  
  Choosing SLIs That Actually Map to Reality
&lt;/h2&gt;

&lt;p&gt;We will compose existing AWS metrics.&lt;br&gt;
&lt;strong&gt;Availability Signals&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API Gateway

&lt;ul&gt;
&lt;li&gt;Count&lt;/li&gt;
&lt;li&gt;5XXError&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Lambda

&lt;ul&gt;
&lt;li&gt;Invocations&lt;/li&gt;
&lt;li&gt;Errors&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;DynamoDB

&lt;ul&gt;
&lt;li&gt;UserErrors&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;EventBridge

&lt;ul&gt;
&lt;li&gt;PutEventsFailedEntries&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These metrics already exist. The work is in combining them correctly.&lt;/p&gt;


&lt;h2&gt;
  
  
  Composite Availability: Turning Fragments Into a Single Signal
&lt;/h2&gt;

&lt;p&gt;The core question is simple: Out of all incoming requests, how many completed the full chain?&lt;br&gt;
We model this explicitly using CloudWatch metric math.&lt;br&gt;
&lt;strong&gt;Composite Availability Expression&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CompositeAvailabilityMetric:
  Type: AWS::CloudWatch::Alarm
  Properties:
    Metrics:
      - Id: totalRequests
        MetricStat:
          Metric:
            Namespace: AWS/ApiGateway
            MetricName: Count
            Dimensions:
              - Name: ApiName
                Value: OrdersAPI
          Period: 60
          Stat: Sum

      - Id: apiFailures
        MetricStat:
          Metric:
            Namespace: AWS/ApiGateway
            MetricName: 5XXError
            Dimensions:
              - Name: ApiName
                Value: OrdersAPI
          Period: 60
          Stat: Sum

      - Id: lambdaFailures
        MetricStat:
          Metric:
            Namespace: AWS/Lambda
            MetricName: Errors
            Dimensions:
              - Name: FunctionName
                Value: OrdersHandler
          Period: 60
          Stat: Sum

      - Id: eventFailures
        MetricStat:
          Metric:
            Namespace: AWS/Events
            # Entries rejected by PutEvents, i.e. events that never reached the bus.
            # (FailedInvocations would measure downstream rule delivery, which is
            # outside this SLO's scope.)
            MetricName: PutEventsFailedEntries
          Period: 60
          Stat: Sum

      - Id: availability
        # IF guards the division in periods with zero traffic
        Expression: "IF(totalRequests &amp;gt; 0, 1 - ((FILL(apiFailures,0) + FILL(lambdaFailures,0) + FILL(eventFailures,0)) / totalRequests), 1)"
        Label: CompositeAvailability
        ReturnData: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This produces a &lt;strong&gt;single availability SLI&lt;/strong&gt; that reflects user reality, not service health.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A Note on "Strict" Math &amp;amp; Retries&lt;/strong&gt;&lt;br&gt;
You might notice this formula is ruthless: &lt;code&gt;1 - (failures / total)&lt;/code&gt;. In a serverless world, services like Lambda often retry automatically on failure.&lt;/p&gt;

&lt;p&gt;If a Lambda fails twice and succeeds on the third try, this metric counts it as a failure. This is intentional. Hidden retries burn your error budget and increase latency. By penalizing retries in your availability score, you force the team to fix the underlying flakiness rather than letting the retry policy hide it.&lt;/p&gt;
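&lt;p&gt;With illustrative numbers, the effect of counting every failed attempt looks like this:&lt;/p&gt;

```typescript
// 1,000 requests in a window. 30 of them succeeded only after an automatic
// retry (so Lambda's Errors metric still recorded a failed attempt), and 5
// failed outright. The strict formula charges all 35 to the error budget.
const totalRequests = 1000;
const failedAttempts = 35; // 30 retried-then-succeeded plus 5 hard failures

const strictAvailability = 1 - failedAttempts / totalRequests;

console.log((strictAvailability * 100).toFixed(2)); // "96.50" percent
```

&lt;p&gt;A lenient model that counted only the 5 hard failures would report 99.5% and declare the SLO met; the strict model reports 96.5% and surfaces the flaky dependency.&lt;/p&gt;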


&lt;h2&gt;
  
  
  Composite Latency: Measuring the Critical Path
&lt;/h2&gt;

&lt;p&gt;Latency is additive across synchronous hops. (One caveat: summing per-hop p95s typically overestimates the true end-to-end p95, so treat the result as a conservative approximation.)&lt;/p&gt;

&lt;p&gt;We use percentile metrics to avoid averages masking tail behavior.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CompositeLatencyMetric:
  Metrics:
    - Id: apiLatency
      MetricStat:
        Metric:
          Namespace: AWS/ApiGateway
          MetricName: Latency
          Dimensions:
            - Name: ApiName
              Value: OrdersAPI
        Period: 60
        Stat: p95

    - Id: lambdaDuration
      MetricStat:
        Metric:
          Namespace: AWS/Lambda
          MetricName: Duration
          Dimensions:
            - Name: FunctionName
              Value: OrdersHandler
        Period: 60
        Stat: p95

    - Id: dynamoLatency
      MetricStat:
        Metric:
          Namespace: AWS/DynamoDB
          MetricName: SuccessfulRequestLatency
          Dimensions:
            - Name: TableName
              Value: Orders
        Period: 60
        Stat: p95

    - Id: totalLatency
      Expression: "apiLatency + lambdaDuration + dynamoLatency"
      Label: EndToEndLatency
      ReturnData: true

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prevents the common error of prematurely claiming "P95 latency is acceptable" even while users are still experiencing delays.&lt;/p&gt;
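&lt;p&gt;The "percentiles, not averages" rule deserves a concrete illustration. With a hypothetical window of 100 requests where 10 hit a slow tail, the average still looks healthy while the p95 tells the truth:&lt;/p&gt;

```typescript
// 100 request latencies in ms: 90 fast ones and a 10-request slow tail.
const latencies = Array(90).fill(100).concat(Array(10).fill(2000));

// Nearest-rank percentile of a sample.
function percentile(values: number[], p: number) {
  const sorted = values.slice().sort(function (a, b) { return a - b; });
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[rank];
}

const average =
  latencies.reduce(function (sum, v) { return sum + v; }, 0) / latencies.length;

console.log(average);                   // 290 - "fine" by most dashboards
console.log(percentile(latencies, 95)); // 2000 - what the tail users actually feel
```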




&lt;h2&gt;
  
  
  Error Budgets and Burn Rate (Where SLOs Become Useful)
&lt;/h2&gt;

&lt;p&gt;For a 99.5% SLO over 30 days:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total error budget: 0.5%&lt;/li&gt;
&lt;li&gt;Budget in minutes: ~216 minutes/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We use multi-window burn-rate alerts to avoid noise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fast Burn (Page)&lt;/strong&gt;: If we burn the monthly budget in under 2 days, something is seriously wrong.&lt;br&gt;
&lt;strong&gt;Slow Burn (Ticket)&lt;/strong&gt;: If we’re slowly bleeding reliability, the system needs attention, but not at 2am.&lt;/p&gt;

&lt;p&gt;These alerts are driven by the composite availability metric, not individual services. That alignment is the entire point.&lt;/p&gt;
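&lt;p&gt;The budget and burn-rate numbers above fall straight out of the SLO definition. A minimal sketch of the arithmetic (the 2-day paging threshold is a policy choice, not a law):&lt;/p&gt;

```typescript
// 99.5% availability over a rolling 30-day window.
const slo = 0.995;
const windowDays = 30;
const windowMinutes = windowDays * 24 * 60; // 43,200 minutes

// The error budget is everything the SLO does not promise.
const budgetMinutes = Math.round(windowMinutes * (1 - slo));
console.log(budgetMinutes); // 216 minutes of full unavailability per month

// Burn rate: how many times faster than "exactly on budget" we are failing.
// Exhausting the whole budget in 2 days instead of 30 is a 15x burn rate.
const fastBurnDays = 2;
const fastBurnRate = windowDays / fastBurnDays;
console.log(fastBurnRate); // 15 - fast burn, page someone
```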




&lt;h2&gt;
  
  
  Dashboard Design: Fewer Charts, Better Decisions
&lt;/h2&gt;

&lt;p&gt;To keep investigations short and focused, the dashboard needs only these widgets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Composite availability (rolling 30 days)&lt;/li&gt;
&lt;li&gt;Error budget remaining&lt;/li&gt;
&lt;li&gt;End-to-end latency p95&lt;/li&gt;
&lt;li&gt;API 5xx&lt;/li&gt;
&lt;li&gt;Lambda errors&lt;/li&gt;
&lt;li&gt;EventBridge failed entries&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Best Practices and Anti-Patterns
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best Practices&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define failure conservatively&lt;/li&gt;
&lt;li&gt;Use percentiles, not averages&lt;/li&gt;
&lt;li&gt;Treat async failures as first-class reliability issues&lt;/li&gt;
&lt;li&gt;Alert on burn rate, not raw errors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Anti-Patterns&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ignoring retries in SLI math&lt;/li&gt;
&lt;li&gt;Counting HTTP 200 as success unconditionally&lt;/li&gt;
&lt;li&gt;Measuring latency per service in isolation&lt;/li&gt;
&lt;li&gt;Treating EventBridge as "eventually reliable"&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Serverless systems fail in ways that traditional SLO models don’t capture. Composite SLOs fix that by forcing reliability to align with user experience instead of service boundaries.&lt;/p&gt;

&lt;p&gt;If you run event-driven systems and still rely on per-service health, you're measuring the wrong thing.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>serverless</category>
      <category>observability</category>
      <category>aws</category>
    </item>
    <item>
      <title>I Replaced a Docker-based Microservice with WebAssembly and It's 100x+ Faster</title>
      <dc:creator>Nabin Debnath</dc:creator>
      <pubDate>Sat, 13 Dec 2025 15:22:44 +0000</pubDate>
      <link>https://forem.com/nabindebnath/i-replaced-a-docker-based-microservice-with-webassembly-and-its-100x-faster-4f6d</link>
      <guid>https://forem.com/nabindebnath/i-replaced-a-docker-based-microservice-with-webassembly-and-its-100x-faster-4f6d</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;br&gt;
We've all heard the quote from Docker's founder, Solomon Hykes, back in 2019: "&lt;em&gt;If WASM+WASI existed in 2008, we wouldn't have needed to create Docker.&lt;/em&gt;"&lt;/p&gt;

&lt;p&gt;For years, this was just a prophecy. But in 2025, the tech has finally caught up.&lt;/p&gt;

&lt;p&gt;I decided to find out if he was right. I took a simple, everyday Node.js microservice running in Docker, rewrote it in Rust compiled to WebAssembly (Wasm), and benchmarked them head-to-head.&lt;/p&gt;

&lt;p&gt;The results weren't just better; they were shocking. We're talking 99% smaller artifacts, incremental build times cut by 10x, and cold-start times that are over 100x faster.&lt;/p&gt;

&lt;p&gt;Here's the full story, with all the code and benchmarks.&lt;/p&gt;
&lt;h2&gt;
  
  
  Part 1: The "Before" - Our Bloated Docker Service
&lt;/h2&gt;

&lt;p&gt;To make this a fair comparison, I picked a perfect candidate for a microservice: a JWT (JSON Web Token) Validator.&lt;/p&gt;

&lt;p&gt;It's a common, real-world task. An API gateway or backend service receives a request, takes the &lt;code&gt;Authorization: Bearer &amp;lt;token&amp;gt;&lt;/code&gt; header, and needs to ask a different service, "Is this token valid?"&lt;/p&gt;

&lt;p&gt;It's a simple, stateless function, an ideal candidate for its own container.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Node.js / Express Code
&lt;/h3&gt;

&lt;p&gt;It's an Express server with a single endpoint, &lt;code&gt;/validate&lt;/code&gt;. It uses the &lt;code&gt;jsonwebtoken&lt;/code&gt; library to verify the token against a secret.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// validator-node/index.js
import express from 'express';
import jwt from 'jsonwebtoken';

const app = express();
app.use(express.json());

// The one secret key our service knows
const JWT_SECRET = process.env.JWT_SECRET || 'a-very-strong-secret-key';

app.post('/validate', (req, res) =&amp;gt; {
  const { token } = req.body;

  if (!token) {
    return res.status(400).send({ valid: false, error: 'No token provided' });
  }

  try {
    // The core logic!
    jwt.verify(token, JWT_SECRET);
    // If it doesn't throw, it's valid
    res.status(200).send({ valid: true });
  } catch (err) {
    // If it throws, it's invalid
    res.status(401).send({ valid: false, error: err.message });
  }
});

const port = process.env.PORT || 3000;
app.listen(port, () =&amp;gt; {
  console.log(`Node.js validator listening on port ${port}`);
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
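&lt;p&gt;To poke at the endpoint locally you need a token signed with the same secret. A small sketch using only Node's built-in &lt;code&gt;crypto&lt;/code&gt; module (not part of the service itself) mints one:&lt;/p&gt;

```typescript
// mint-token.ts - create an HS256 JWT for local testing of /validate.
// Uses only the Node standard library; the secret matches the service default.
import { createHmac } from 'node:crypto';
import { Buffer } from 'node:buffer';

function base64url(input: string) {
  return Buffer.from(input).toString('base64url');
}

function signHS256(payload: object, secret: string) {
  const header = base64url(JSON.stringify({ alg: 'HS256', typ: 'JWT' }));
  const body = base64url(JSON.stringify(payload));
  const signature = createHmac('sha256', secret)
    .update(header + '.' + body)
    .digest('base64url');
  return header + '.' + body + '.' + signature;
}

const token = signHS256(
  { sub: 'user-123', exp: Math.floor(Date.now() / 1000) + 3600 },
  'a-very-strong-secret-key'
);

console.log(token); // three dot-separated base64url segments
```

&lt;p&gt;Post the output as &lt;code&gt;{"token": "..."}&lt;/code&gt; to &lt;code&gt;/validate&lt;/code&gt; on either implementation; both should accept it.&lt;/p&gt;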



&lt;h3&gt;
  
  
  The Dockerfile
&lt;/h3&gt;

&lt;p&gt;We use a multi-stage build with an Alpine base image to keep it small.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Dockerfile
# --- Build Stage ---
FROM node:18-alpine AS build
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci

COPY . .

# --- Production Stage ---
FROM node:18-alpine
WORKDIR /app
COPY --from=build /app/node_modules ./node_modules
COPY --from=build /app/index.js ./index.js
# package.json is still required at runtime: it declares "type": "module",
# without which Node would reject the ESM `import` syntax in index.js
COPY --from=build /app/package.json ./package.json

ENV NODE_ENV=production
CMD ["node", "index.js"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;Let's check a few things after Docker does its work: the true cost of this simple service.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Build Time:&lt;/strong&gt; On my machine, building this from a cold cache takes ~81 seconds. Even with Docker layer caching, re-building after a small code change takes about 45 seconds due to context switching and layer hashing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Artifact Size:&lt;/strong&gt; After building, the final image is 188MB. That's 188MB to ship a 30-line script.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Cold Start:&lt;/strong&gt; When deployed to a serverless platform (like Cloud Run or scaled-to-zero K8s), the cold start is painful. The container has to be pulled, and the Node.js runtime has to boot. I was seeing cold starts between 800ms and 1.5 seconds. That's a user-facing delay.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Part 2: The "After" - Rebuilding with WebAssembly
&lt;/h2&gt;

&lt;p&gt;Wasm modules are small, compile-to-binary, and run in a secure, sandboxed runtime that starts in microseconds. Unlike Docker, which packages a whole OS, Wasm just packages your code.&lt;/p&gt;

&lt;p&gt;I chose to rewrite it in Rust because of its first-class Wasm support and performance. I used the &lt;a href="https://github.com/spinframework/spin" rel="noopener noreferrer"&gt;Spin framework&lt;/a&gt;, which makes building Wasm-based HTTP services incredibly simple.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Rust / Spin Code
&lt;/h3&gt;

&lt;p&gt;First, let's install the Spin CLI and scaffold a new project.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ spin new
# I selected: http-rust (HTTP trigger with Rust)
Project name: validator-wasm
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This generates a &lt;code&gt;src/lib.rs&lt;/code&gt; file. I opted to use the &lt;code&gt;jwt-simple&lt;/code&gt; crate instead of the standard &lt;code&gt;jsonwebtoken&lt;/code&gt; because &lt;code&gt;jwt-simple&lt;/code&gt; is a pure-Rust implementation. This avoids C-binding issues and compiles down to an incredibly small Wasm binary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// validator-wasm/src/lib.rs
use anyhow::{Result, Context};
use spin_sdk::{
    http::{Request, Response, Router, Params},
    http_component,
};
use serde::{Deserialize, Serialize};
use jwt_simple::prelude::*;

// 1. Define our request and response structs
#[derive(Deserialize)]
struct TokenRequest {
    token: String,
}

#[derive(Serialize)]
struct TokenResponse {
    valid: bool,
    #[serde(skip_serializing_if = "Option::is_none")]
    error: Option&amp;lt;String&amp;gt;,
}

// Get the JWT secret from environment or use a default
fn get_secret() -&amp;gt; HS256Key {
    let secret = std::env::var("JWT_SECRET").unwrap_or_else(|_| "a-very-strong-secret-key".to_string());
    HS256Key::from_bytes(secret.as_bytes())
}

/// The Spin HTTP component
#[http_component]
fn handle_validator(req: Request) -&amp;gt; Result&amp;lt;Response&amp;gt; {
    let mut router = Router::new();
    router.post("/validate", validate_token);
    Ok(router.handle(req))
}

// 2. JWT validation using jwt-simple
fn validate_token(req: Request, _params: Params) -&amp;gt; Result&amp;lt;Response&amp;gt; {
    // Read the request body
    let body = req.body();
    if body.is_empty() {
        return Ok(json_response(400, false, "Empty request body"));
    }

    let token_req: TokenRequest = serde_json::from_slice(body)
        .context("Failed to parse request body")?;

    let key = get_secret();

    // The `verify_token` function does the validation
    match key.verify_token::&amp;lt;serde_json::Value&amp;gt;(&amp;amp;token_req.token, None) {
        Ok(_) =&amp;gt; Ok(json_response(200, true, "")),
        Err(e) =&amp;gt; Ok(json_response(401, false, &amp;amp;e.to_string())),
    }
}

// Helper to build a JSON response
fn json_response(status: u16, valid: bool, error_msg: &amp;amp;str) -&amp;gt; Response {
    let error = if error_msg.is_empty() { 
        None 
    } else { 
        Some(error_msg.to_string()) 
    };

    Response::builder()
        .status(status)
        .header("Content-Type", "application/json")
        .body(serde_json::to_string(&amp;amp;TokenResponse { valid, error }).unwrap())
        .build()
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is admittedly more code than the Node.js version. But it's also type-safe, compiled, and, as we'll see, unbelievably fast.&lt;/p&gt;

&lt;h3&gt;
  
  
  The "Build"
&lt;/h3&gt;

&lt;p&gt;There's no &lt;code&gt;Dockerfile&lt;/code&gt;. Instead, I configured the &lt;code&gt;spin.toml&lt;/code&gt; manifest to use the modern &lt;code&gt;wasm32-wasip1&lt;/code&gt; target.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#:schema https://schemas.spinframework.dev/spin/manifest-v2/latest.json

spin_manifest_version = 2
[application]
name = "validator-wasm"
version = "0.1.0"

[[trigger.http]]
route = "/..."
component = "validator-wasm"

[component.validator-wasm]
source = "target/wasm32-wasip1/release/validator_wasm.wasm"  # The build output
allowed_outbound_hosts = []
[component.validator-wasm.build]
command = "cargo build --target wasm32-wasip1 --release"
watch = ["src/**/*.rs", "Cargo.toml"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Build the entire project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ spin build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This one command compiles the Rust code to a Wasm module.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 3: The Showdown - Docker vs. Wasm Benchmarks
&lt;/h2&gt;

&lt;p&gt;I've run and measured both the Docker container and the Spin Wasm application. Docker ships a full OS userland inside an isolated container, while Wasm runs a tiny, sandboxed module directly on the host runtime.&lt;br&gt;
This architectural difference leads to some staggering benchmark results.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Docker (Node.js)&lt;/th&gt;
&lt;th&gt;WebAssembly (Rust/Spin)&lt;/th&gt;
&lt;th&gt;The Winner&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Artifact Size&lt;/td&gt;
&lt;td&gt;188 MB&lt;/td&gt;
&lt;td&gt;0.5 MB&lt;/td&gt;
&lt;td&gt;Wasm (99.7% smaller)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Build Time (Incremental)&lt;/td&gt;
&lt;td&gt;~45 sec (Docker layer caching)&lt;/td&gt;
&lt;td&gt;4.2 seconds&lt;/td&gt;
&lt;td&gt;Wasm (10x faster)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cold Start Time&lt;/td&gt;
&lt;td&gt;~1.2 seconds (1200ms)&lt;/td&gt;
&lt;td&gt;~10ms&lt;/td&gt;
&lt;td&gt;Wasm (120x faster)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory Usage&lt;/td&gt;
&lt;td&gt;~85 MB (idle)&lt;/td&gt;
&lt;td&gt;~4 MB (idle)&lt;/td&gt;
&lt;td&gt;Wasm (95% less)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Artifact Size:&lt;/strong&gt; The Wasm module is 0.5 MB (548 KB, to be exact), not 188 MB. I can send this file in a Slack message. It's 99.7% smaller.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build Time (Incremental):&lt;/strong&gt; This is the developer "inner loop" metric. Rust's incremental builds are blazing fast: once dependencies are compiled, changing your code and running &lt;code&gt;spin build&lt;/code&gt; takes ~4 seconds. Compared to waiting ~45 seconds for Docker to upload the build context and re-hash layers, it feels like a superpower.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cold Start:&lt;/strong&gt; This is the headline. The Wasm runtime starts in the low-millisecond range. I benchmarked it using &lt;code&gt;spin up&lt;/code&gt; and got startup times consistently around 10ms. Compared to the 1200ms of the container, it's not even a contest.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the "100x faster" promise. It's not that the code executes 100x faster (though the Rust version is quicker); it's that the service can go from zero-to-ready 100 times faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 4: The Verdict - Is Docker Dead?
&lt;/h2&gt;

&lt;p&gt;No, of course not. Wasm is not a Docker killer; it's a Docker alternative for a specific job.&lt;/p&gt;

&lt;p&gt;You should still use Docker/Containers for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Large, complex, stateful applications (like a database).&lt;/li&gt;
&lt;li&gt;Monolithic apps you're lifting-and-shifting.&lt;/li&gt;
&lt;li&gt;Services that truly need a full Linux environment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But WebAssembly is the new king for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Serverless Functions (FaaS)&lt;/li&gt;
&lt;li&gt;Microservices (or "nano-services")&lt;/li&gt;
&lt;li&gt;Edge Computing (where low startup time is critical)&lt;/li&gt;
&lt;li&gt;Plugin Systems (like for a SaaS)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My takeaway: that quote from Solomon Hykes wasn't just a spicy take. He was right.&lt;/p&gt;

&lt;p&gt;The next time you're about to &lt;code&gt;docker init&lt;/code&gt; a new, simple serverless function, ask yourself whether your use case is a good candidate for Wasm instead. It may or may not be.&lt;/p&gt;

&lt;p&gt;Try it yourself. You might be shocked, too.&lt;/p&gt;

</description>
      <category>webassembly</category>
      <category>docker</category>
      <category>rust</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Zero-Code Observability: Using eBPF to Auto-Instrument Services with OpenTelemetry</title>
      <dc:creator>Nabin Debnath</dc:creator>
      <pubDate>Fri, 07 Nov 2025 14:19:13 +0000</pubDate>
      <link>https://forem.com/nabindebnath/zero-code-observability-using-ebpf-to-auto-instrument-services-with-opentelemetry-oki</link>
      <guid>https://forem.com/nabindebnath/zero-code-observability-using-ebpf-to-auto-instrument-services-with-opentelemetry-oki</guid>
      <description>&lt;p&gt;Instrumenting services for observability often means sprinkling tracing code across hundreds of files which is painful to maintain and easy to forget.&lt;br&gt;
&lt;strong&gt;eBPF + OpenTelemetry (OTel)&lt;/strong&gt;: a powerful combination that hooks into your running processes and emits traces, metrics, and logs without touching application code.&lt;/p&gt;

&lt;p&gt;In this post, you’ll learn how to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use an eBPF agent to automatically instrument apps&lt;/li&gt;
&lt;li&gt;Export telemetry data through OpenTelemetry Collector&lt;/li&gt;
&lt;li&gt;Visualize it with Grafana&lt;/li&gt;
&lt;li&gt;Control overhead and noise&lt;/li&gt;
&lt;li&gt;Roll it out safely in production&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Why observability shouldn’t require rewriting code
&lt;/h2&gt;

&lt;p&gt;Modern apps are stitched together from dozens of microservices. We push features daily, yet visibility into performance often lags.&lt;/p&gt;

&lt;p&gt;You’ve probably heard: “&lt;em&gt;We’ll add tracing later.&lt;/em&gt;” …and then it never happens.&lt;/p&gt;

&lt;p&gt;Manual instrumentation with OpenTelemetry SDKs gives fine-grained control, but it comes with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code changes across many repos,&lt;/li&gt;
&lt;li&gt;Version mismatches between SDKs,&lt;/li&gt;
&lt;li&gt;Extra CI/CD validation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Wouldn’t it be nice if the system could observe itself, automatically?&lt;/p&gt;

&lt;p&gt;That’s what eBPF (extended Berkeley Packet Filter) delivers. It hooks into the Linux kernel, captures runtime events (like syscalls, network, and process activity), and forwards them all with low overhead. Combine that with OpenTelemetry, and &lt;em&gt;you get a zero-code observability pipeline&lt;/em&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  eBPF + OpenTelemetry in plain English
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;eBPF&lt;/strong&gt;: Think of eBPF as a programmable microscope for the Linux kernel. It lets you attach tiny programs to events such as network packets or function calls and safely collect data in real-time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenTelemetry&lt;/strong&gt;: OpenTelemetry (OTel) is a vendor-neutral standard for generating and exporting traces, metrics, and logs. It’s supported by almost every major observability backend (Grafana, Datadog, AWS X-Ray, etc.).&lt;/p&gt;

&lt;p&gt;An eBPF agent can auto-discover and instrument running services (HTTP, gRPC, database calls, etc.) and emit OTel-formatted data to your collector.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4kxqqxi8i9y18bbc3qtk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4kxqqxi8i9y18bbc3qtk.png" alt="eBPF and OpenTelemetry integration" width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;No SDKs. No code injection. Everything happens at runtime.&lt;/p&gt;


&lt;h2&gt;
  
  
  Setting up your environment
&lt;/h2&gt;

&lt;p&gt;For today's demo, we’ll use a simple Node.js app and an eBPF agent (Grafana Beyla). You can adapt this for Java, Python, Go, etc.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Create a minimal service&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir ebpf-otel-demo &amp;amp;&amp;amp; cd $_
npm init -y
npm install express
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;index.js&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const express = require("express");
const app = express();

app.get("/orders/:id", async (req, res) =&amp;gt; {
  await new Promise(r =&amp;gt; setTimeout(r, Math.random() * 200));
  res.json({ orderId: req.params.id, status: "OK" });
});

app.listen(3000, () =&amp;gt; console.log("Service running on port 3000"));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Dockerfile&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM node:18
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
CMD ["node", "index.js"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Build and run&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker build -t ebpf-otel-demo .
docker run -p 3000:3000 ebpf-otel-demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your API is now live at &lt;code&gt;http://localhost:3000/orders/123&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Install an eBPF agent&lt;/strong&gt;&lt;br&gt;
Install Beyla on the host or as a sidecar container. (Requires Linux kernel ≥ 5.8.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt-get install linux-headers-$(uname -r)
curl -sSfL https://github.com/grafana/beyla/releases/latest/download/beyla-linux-amd64.tar.gz | tar xz
sudo mv beyla /usr/local/bin/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3: Configure the agent&lt;/strong&gt;&lt;br&gt;
Create beyla-config.yml:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;listen:
  interfaces: [eth0]
otlp:
  endpoint: "localhost:4317"
service:
  name: "orders-service"
instrumentation:
  language: "nodejs"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo beyla run --config beyla-config.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent now attaches to your running container, intercepts HTTP calls, and sends spans to your OTel Collector.&lt;/p&gt;




&lt;h2&gt;
  
  
  Connect OpenTelemetry Collector
&lt;/h2&gt;

&lt;p&gt;The collector acts as a bridge between producers (Beyla) and your observability backend (Grafana, Tempo, or Jaeger).&lt;/p&gt;

&lt;p&gt;Create otel-collector-config.yml:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  logging:
  otlp:
    endpoint: "tempo:4317"
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [logging, otlp]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the collector (in Docker for simplicity):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run --rm -p 4317:4317 -v $(pwd)/otel-collector-config.yml:/etc/otel/config.yml \
  otel/opentelemetry-collector:latest --config /etc/otel/config.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Visualize traces in Grafana
&lt;/h2&gt;

&lt;p&gt;If you’re using Grafana Tempo + Loki + Grafana OSS:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -d --name=grafana -p 3001:3000 grafana/grafana
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add Tempo as a data source and point it to your collector’s OTLP endpoint. Within seconds, you’ll see spans like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "traceId": "5dfb0e7c16b6f9c1",
  "spanId": "8aeb32afaa3e41d9",
  "name": "GET /orders/:id",
  "attributes": {
    "http.method": "GET",
    "http.status_code": 200,
    "service.name": "orders-service"
  },
  "duration_ms": 52.8
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Behind the scenes: what eBPF is doing
&lt;/h2&gt;

&lt;p&gt;eBPF attaches probes (kprobes/uprobes) to kernel and user-space events:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Socket reads/writes -&amp;gt; network latency&lt;/li&gt;
&lt;li&gt;HTTP libraries -&amp;gt; method, route, status&lt;/li&gt;
&lt;li&gt;Syscalls -&amp;gt; file I/O, DNS, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent aggregates these into OTel spans, adds tags (service, method, latency), and exports them asynchronously, typically consuming under 1–2% CPU.&lt;/p&gt;
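To make that aggregation step concrete, here is an illustrative sketch, not Beyla's actual internals, of how raw probe events might be folded into an OTel-style span. The event field names here are hypothetical:

```javascript
// Hypothetical raw-event shape; field names are illustrative, not Beyla's.
function eventToSpan(event) {
  return {
    traceId: event.traceId,
    name: event.method + " " + event.route,
    attributes: {
      "http.method": event.method,
      "http.status_code": event.status,
      "service.name": event.service,
    },
    // Probes record nanosecond timestamps; spans report milliseconds.
    duration_ms: (event.endNs - event.startNs) / 1e6,
  };
}

const span = eventToSpan({
  traceId: "5dfb0e7c16b6f9c1",
  method: "GET",
  route: "/orders/:id",
  status: 200,
  service: "orders-service",
  startNs: 0,
  endNs: 52800000,
});
console.log(span.name); // "GET /orders/:id"
```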

&lt;p&gt;Here’s a simplified view:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frd280sta0fuvv9kwifi0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frd280sta0fuvv9kwifi0.png" alt="Simplified view of eBPF behind the scene" width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Controlling overhead and noise
&lt;/h2&gt;

&lt;p&gt;Auto-instrumentation is powerful, but it can produce a lot of data. Here’s how to keep it efficient:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sampling&lt;/strong&gt;&lt;br&gt;
In beyla-config.yml:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sampling:
  probability: 0.2   # capture 20% of requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Filtering&lt;/strong&gt;&lt;br&gt;
Capture only interesting routes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;filters:
  include_paths: ["/orders/*"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Resource limits&lt;/strong&gt;&lt;br&gt;
Run the agent with limited CPU/memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo systemd-run --property=CPUQuota=20% beyla run ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Security considerations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;eBPF programs run with kernel privileges.&lt;/li&gt;
&lt;li&gt;Always use signed binaries or build from source.&lt;/li&gt;
&lt;li&gt;Test in staging first. Avoid root unless required.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Production Rollout Checklist&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Test in staging with representative traffic.&lt;/li&gt;
&lt;li&gt;Enable sampling (≤ 20%) before full rollout.&lt;/li&gt;
&lt;li&gt;Run the agent in restricted mode (non-root if possible).&lt;/li&gt;
&lt;li&gt;Compare baseline latency before/after attach.&lt;/li&gt;
&lt;li&gt;Use dashboards to monitor agent CPU/memory usage.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why this approach matters
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwcn2sdl1kjpyqilos8rn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwcn2sdl1kjpyqilos8rn.png" alt="real world value" width="800" height="220"&gt;&lt;/a&gt;&lt;br&gt;
You can onboard dozens of services instantly, a huge win for teams with legacy stacks or microservice sprawl.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's next?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Combine with Service Mesh&lt;/strong&gt;: Use eBPF telemetry to enrich service-mesh metrics (Istio, Linkerd).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Join Logs + Traces&lt;/strong&gt;: Since OTel supports logs too, you can correlate application logs with eBPF spans via trace IDs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build Compliance Dashboards&lt;/strong&gt;: In regulated industries (finance, 
healthcare), eBPF traces create immutable audit trails of service interactions without leaking business data.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Common problems you may face
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Kernel version too old: upgrade or use COS/Ubuntu 22+.&lt;/li&gt;
&lt;li&gt;Container visibility: run agent on host or enable --privileged if sidecar fails to attach.&lt;/li&gt;
&lt;li&gt;Over-collection: fine-tune filters.&lt;/li&gt;
&lt;li&gt;Trace backend mismatch: ensure OTel Collector exporter matches your backend format (Tempo, Jaeger, Zipkin)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;You’ve now built an observability stack that requires zero code changes yet delivers full visibility.&lt;/p&gt;

&lt;p&gt;Key takeaways:&lt;br&gt;
✅ eBPF captures runtime events safely and efficiently.&lt;br&gt;
✅ OpenTelemetry unifies data into a portable format.&lt;br&gt;
✅ Together they let developers focus on features.&lt;/p&gt;

&lt;p&gt;Start small: pick one service, attach an agent, visualize traces, and scale gradually.&lt;br&gt;
Once you see that first automatic trace appear in Grafana, you’ll realize: &lt;em&gt;observability doesn’t need to slow you down&lt;/em&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Further reading&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://grafana.com/docs/beyla/latest/" rel="noopener noreferrer"&gt;Grafana Beyla Docs&lt;/a&gt;&lt;br&gt;
&lt;a href="https://opentelemetry.io/docs/collector/" rel="noopener noreferrer"&gt;OpenTelemetry Collector&lt;/a&gt;&lt;br&gt;
&lt;a href="https://ebpf.io/what-is-ebpf/" rel="noopener noreferrer"&gt;eBPF.io Guide&lt;/a&gt;&lt;br&gt;
&lt;a href="https://landscape.cncf.io/" rel="noopener noreferrer"&gt;CNCF Observability Landscape&lt;/a&gt;&lt;/p&gt;

</description>
      <category>observability</category>
      <category>ebpf</category>
      <category>opentelemetry</category>
      <category>devops</category>
    </item>
    <item>
      <title>The "Shift-Left" Imperative: Implementing Data Contracts in CI/CD Pipeline</title>
      <dc:creator>Nabin Debnath</dc:creator>
      <pubDate>Fri, 17 Oct 2025 12:56:49 +0000</pubDate>
      <link>https://forem.com/nabindebnath/the-shift-left-imperative-implementing-data-contracts-in-cicd-pipeline-40cl</link>
      <guid>https://forem.com/nabindebnath/the-shift-left-imperative-implementing-data-contracts-in-cicd-pipeline-40cl</guid>
      <description>&lt;p&gt;Having spent years in the trenches of software development, I've observed countless systems crumble under the weight of one silent killer: &lt;em&gt;data quality drift&lt;/em&gt;. Microservices promise independence, but they are glued together by the data they exchange. When a producer service quietly changes an API response or a database column, downstream consumers break, leading to expensive root-cause-analysis.&lt;/p&gt;

&lt;p&gt;The solution isn't better error handling; it's prevention.&lt;/p&gt;

&lt;p&gt;It's time for Data Engineering and DevOps to fully embrace the &lt;strong&gt;Shift-Left&lt;/strong&gt; philosophy. We must move the validation of our most critical asset, data, from runtime monitoring to compile-time automation. This is the Shift-Left Imperative for data, and the mechanism to achieve it is the Data Contract, implemented directly within the CI/CD pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Exactly is a Data Contract?
&lt;/h2&gt;

&lt;p&gt;A Data Contract is a formal, explicit agreement between a data producer (the service or application that creates the data) and all its consumers (the services, analytical systems, or data warehouses).&lt;/p&gt;

&lt;p&gt;A Data Contract is a versioned schema that specifies the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Structure&lt;/strong&gt;: The field names, data types (e.g., string, integer, timestamp) and required/optional status.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantics (Quality)&lt;/strong&gt;: Expectations for the data's content (e.g., user_id must be a positive integer; email must be a valid format).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SLAs&lt;/strong&gt;: Commitments on availability, latency, and retention.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it as an API specification (like OpenAPI/Swagger), but for data payloads, whether they flow through a REST endpoint, an event stream (Kafka/Pulsar), or a database table.&lt;/p&gt;

&lt;h2&gt;
  
  
  Shift-Left Data Contracts
&lt;/h2&gt;

&lt;p&gt;In traditional data pipelines or microservice architectures, data validation often happens late:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Runtime&lt;/strong&gt;: An error log gets generated when a consumer service crashes because the upstream service suddenly sent something other than what was expected.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-Mortem&lt;/strong&gt;: A downstream data analyst reports a broken dashboard because a column name was changed in the source database.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This phenomenon is called &lt;em&gt;data drift&lt;/em&gt;, and it’s inherently a DevOps problem. It exposes a gap in the software release process: data dependencies go unaccounted for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Shift-Left approach mandates&lt;/strong&gt;: Data Contracts are defined, versioned, and validated before any code that interacts with that data is deployed to a production-like environment. By moving the contract validation into CI/CD, we turn a potential runtime incident into a fast, fixable build failure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementing Data Contracts in the CI/CD Pipeline
&lt;/h2&gt;

&lt;p&gt;The true power of Data Contracts emerges when their validation is fully automated. Let's go through the multi-stage CI/CD flow to enforce the contract across an organization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 1: Contract Definition and Storage&lt;/strong&gt;&lt;br&gt;
The contract should live in a source control repository, often alongside the producer's code, to enforce versioning and a peer-review (Pull Request) process.&lt;/p&gt;

&lt;p&gt;We can use JSON Schema or Avro Schema as the contract format for maximum tooling compatibility.&lt;/p&gt;

&lt;p&gt;Example Contract&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "$id": "order_placed_v1",
  "type": "object",
  "properties": {
    "order_id": {
      "type": "string",
      "format": "uuid"
    },
    "customer_id": {
      "type": "integer",
      "minimum": 1
    },
    "timestamp": {
      "type": "string",
      "format": "date-time"
    }
  },
  "required": ["order_id", "customer_id", "timestamp"]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Stage 2: CI Validation (The Linter for Data)&lt;/strong&gt;&lt;br&gt;
When a developer proposes a change to the producer service or the contract itself, the CI pipeline must immediately enforce two critical checks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Structural Validation (The Contract Check)&lt;/strong&gt;&lt;br&gt;
Use a tool like ajv (for JSON Schema) or a custom Avro parser to ensure the contract file itself is well-formed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Backward and Forward Compatibility Check (The Dependency Check)&lt;/strong&gt;&lt;br&gt;
This is the most crucial step. If the developer is updating the contract (e.g., v1 to v2), we must ensure the new version is backward compatible with all existing consumers. This check is often performed against a Schema Registry API.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the change involves removing a required field or changing a field's data type (e.g., integer to string), the CI pipeline fails. The developer is forced to either re-evaluate the change or propose a major version bump, which signals a breaking change to all consumers.&lt;/p&gt;

&lt;p&gt;Here is a pseudo-code snippet illustrating this check in the CI script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# CI Script (e.g., in a Jenkins/GitLab/GitHub Action pipeline)

# 1. Fetch the last published contract version (vN-1)
OLD_SCHEMA=$(curl -s "schema-registry.corp/api/v1/schemas/${TOPIC}/latest")

# 2. Register the new contract version (vN) in a test mode
RESPONSE=$(curl -X POST -H "Content-Type: application/json" \
  "schema-registry.corp/api/v1/schemas/${TOPIC}/versions" \
  --data @new_contract_file.json)

# 3. Check the compatibility flag returned by the Schema Registry
COMPATIBILITY_STATUS=$(echo "$RESPONSE" | jq -r '.is_compatible')

if [ "$COMPATIBILITY_STATUS" != "true" ]; then
  echo "Data Contract failed compatibility check!"
  echo "Breaking changes detected: New schema vN is not backward compatible with vN-1."
  exit 1
else
  echo "Contract is compatible. Proceeding with registration and code generation."
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
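If you don't have a Schema Registry handy, the registry-side rule itself can be sketched locally. This is a toy check covering only the two breaking changes described above (removed required fields and changed field types); real registries such as Confluent apply a much richer rule set:

```javascript
// Toy backward-compatibility check between two JSON Schema versions.
function isBackwardCompatible(oldSchema, newSchema) {
  const newRequired = newSchema.required || [];
  // Every field the old contract required must still be required.
  for (const field of oldSchema.required || []) {
    if (!newRequired.includes(field)) return false;
  }
  // No existing field may change its declared type.
  for (const name of Object.keys(oldSchema.properties || {})) {
    const oldDef = oldSchema.properties[name];
    const newDef = (newSchema.properties || {})[name];
    if (newDef !== undefined) {
      if (newDef.type !== oldDef.type) return false;
    }
  }
  return true;
}

const v1 = {
  properties: { order_id: { type: "string" }, customer_id: { type: "integer" } },
  required: ["order_id", "customer_id"],
};
const v2 = {
  properties: { order_id: { type: "string" }, customer_id: { type: "string" } },
  required: ["order_id", "customer_id"],
};
console.log(isBackwardCompatible(v1, v2)); // false: customer_id changed type
```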



&lt;p&gt;&lt;strong&gt;Stage 3: Artifact Generation and Distribution&lt;/strong&gt;&lt;br&gt;
Once the contract passes validation, the CI/CD pipeline executes tasks that make the contract immediately useful to consumers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Code Generation&lt;/strong&gt;: Automatically generate domain-specific objects (Pojos, Structs, Classes) in the language of the producer/consumer (e.g., Python, Java, Go). This is known as Schema-First Development. The service code now uses the generated objects, ensuring the code always conforms to the contract.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema Registry Publish&lt;/strong&gt;: The final, approved contract is published to a centralized Schema Registry (like Confluent Schema Registry or an AWS Glue Data Catalog). This registry acts as the single source of truth for all consumers.&lt;/li&gt;
&lt;/ul&gt;
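The code-generation step is essentially a mechanical mapping from contract to typed definition. Here is a toy sketch that emits a TypeScript-style interface as a string; real pipelines use mature tools like quicktype or json-schema-to-typescript rather than anything hand-rolled:

```javascript
// Toy schema-first codegen: emit a TypeScript-style interface from a
// JSON Schema contract. Purely illustrative.
const TYPE_MAP = { string: "string", integer: "number", number: "number", boolean: "boolean" };

function schemaToInterface(name, schema) {
  const required = schema.required || [];
  const lines = [];
  for (const field of Object.keys(schema.properties)) {
    const def = schema.properties[field];
    // Optional fields get a trailing "?" in the generated definition.
    const optional = required.includes(field) ? "" : "?";
    lines.push("  " + field + optional + ": " + (TYPE_MAP[def.type] || "unknown") + ";");
  }
  return "interface " + name + " {\n" + lines.join("\n") + "\n}";
}

const orderSchema = {
  properties: {
    order_id: { type: "string" },
    customer_id: { type: "integer" },
    timestamp: { type: "string" },
  },
  required: ["order_id", "customer_id", "timestamp"],
};
console.log(schemaToInterface("OrderPlacedV1", orderSchema));
```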

&lt;p&gt;&lt;strong&gt;Stage 4: Consumer Service Integration&lt;/strong&gt;&lt;br&gt;
When a consumer service deploys, its CI/CD pipeline does two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dependency Check&lt;/strong&gt;: It pulls the latest approved version of the contract from the Schema Registry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runtime Embedding&lt;/strong&gt;: It embeds the contract directly into its production code. At runtime, the consumer can use this contract to perform fast, local validation checks on incoming data, providing immediate and informative error feedback instead of silent failures.&lt;/li&gt;
&lt;/ul&gt;
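That local validation step can be sketched as follows. This is a toy validator checking only presence and primitive type of fields; a real consumer would run a library like Ajv against the registry copy of the contract:

```javascript
// Toy contract validation inside a consumer: checks required fields and
// primitive types only. Illustrative, not a replacement for Ajv.
function validatePayload(payload, schema) {
  const errors = [];
  for (const field of schema.required || []) {
    if (!(field in payload)) errors.push("missing required field: " + field);
  }
  for (const name of Object.keys(schema.properties || {})) {
    if (name in payload) {
      const wanted = schema.properties[name].type;
      // JSON Schema "integer" maps to JavaScript's "number".
      const jsType = wanted === "integer" ? "number" : wanted;
      if (typeof payload[name] !== jsType) {
        errors.push(name + ": expected " + wanted);
      }
    }
  }
  return errors;
}

const contract = {
  properties: { order_id: { type: "string" }, customer_id: { type: "integer" } },
  required: ["order_id", "customer_id"],
};
console.log(validatePayload({ order_id: "abc" }, contract));
// [ "missing required field: customer_id" ]
```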

&lt;p&gt;&lt;strong&gt;Tools of the Trade&lt;/strong&gt;&lt;br&gt;
You can leverage established tools to do all these jobs instead of building from scratch.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4yoql5f53hrpi340qqhb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4yoql5f53hrpi340qqhb.png" alt="Different tools based on use case" width="800" height="196"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion: Building Robust Data Architectures&lt;/strong&gt;&lt;br&gt;
The "Shift-Left" imperative in data is about recognizing that data quality is not a downstream concern; it is an architectural concern.&lt;/p&gt;

&lt;p&gt;By implementing Data Contracts and automating their validation within the CI/CD pipeline, we fundamentally change the team's development mindset. We move from a reactive model (fixing broken data) to a proactive, contract-driven model. This dramatically reduces integration risk, accelerates feature development, and allows the data architecture to scale and evolve gracefully.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>cicd</category>
      <category>devops</category>
      <category>shiftleft</category>
    </item>
    <item>
      <title>Eyes on change: Building a Custom Watcher With Async Notifications</title>
      <dc:creator>Nabin Debnath</dc:creator>
      <pubDate>Thu, 02 Oct 2025 12:16:25 +0000</pubDate>
      <link>https://forem.com/nabindebnath/eyes-on-change-building-a-custom-watcher-with-async-notifications-3nph</link>
      <guid>https://forem.com/nabindebnath/eyes-on-change-building-a-custom-watcher-with-async-notifications-3nph</guid>
      <description>&lt;p&gt;Watching data for changes is a core task in modern applications. In a collaborative application, the same data gets modified by different stakeholders or users, so it is important to notify the record's owner, or any user who cares about that change. A &lt;em&gt;watcher&lt;/em&gt; is the feature that lets a user keep an eye on changes programmatically. In this article, we'll peel back that seemingly complex functionality and build a simple, custom watcher from the ground up.&lt;/p&gt;</description>

&lt;h3&gt;
  
  
  Architecture Decisions
&lt;/h3&gt;

&lt;p&gt;We will keep the implementation simple but extensible:&lt;br&gt;
&lt;strong&gt;UI:&lt;/strong&gt; React → lightweight, quick to build forms/buttons.&lt;br&gt;
&lt;strong&gt;Backend:&lt;/strong&gt; Express → simple routing for watch + update endpoints.&lt;br&gt;
&lt;strong&gt;Database:&lt;/strong&gt; SQLite → file-based, no setup, perfect for local dev.&lt;br&gt;
&lt;strong&gt;Notifications:&lt;/strong&gt; Console logs first → keep it easy to demo.&lt;br&gt;
&lt;strong&gt;Async Layer:&lt;/strong&gt; BullMQ + Redis → realistic queue-based processing without too much setup.&lt;/p&gt;

&lt;p&gt;This stack lets us run locally in minutes but can be upgraded later to Postgres, RabbitMQ, or Kafka if needed.&lt;/p&gt;
&lt;h3&gt;
  
  
  Architecture Diagram
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4lhynzgyu7ennb6gqlf0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4lhynzgyu7ennb6gqlf0.png" alt="High level architecture diagram" width="800" height="313"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;React UI → User clicks Watch.&lt;/li&gt;
&lt;li&gt;Express API → Handles request and updates DB.&lt;/li&gt;
&lt;li&gt;SQLite DB → Stores records and watcher subscriptions.&lt;/li&gt;
&lt;li&gt;BullMQ Queue → Stores notification jobs asynchronously.&lt;/li&gt;
&lt;li&gt;Worker → Pulls jobs and executes notifications.&lt;/li&gt;
&lt;li&gt;Notification Channel → Console logs, email, Slack, etc.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Database setup
&lt;/h3&gt;

&lt;p&gt;We will create two simple tables and insert a demo record.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE records (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  name TEXT,
  value TEXT
);

CREATE TABLE watchers (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  user TEXT,
  record_id INTEGER,
  FOREIGN KEY(record_id) REFERENCES records(id)
);

INSERT INTO records (name, value)
VALUES ('Demo Record', 'Initial Value');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
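
&lt;p&gt;The server code in the next section calls two database helpers, &lt;em&gt;db.updateRecord&lt;/em&gt; and &lt;em&gt;db.getWatchers&lt;/em&gt;, that are not shown there. Here is a minimal sketch of what they might look like (a hypothetical &lt;strong&gt;db.js&lt;/strong&gt;; the sqlite3 connection handle is injected explicitly so the functions can be tested with a stub, while the demo binds them to a shared connection):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Hypothetical db.js helpers assumed by server.js.
// The sqlite3-style handle is a parameter, so a stub works for tests.
function updateRecord(db, recordId, newValue, cb) {
  db.run("UPDATE records SET value = ? WHERE id = ?", [newValue, recordId], cb);
}

function getWatchers(db, recordId, cb) {
  db.all("SELECT user FROM watchers WHERE record_id = ?", [recordId], cb);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
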



&lt;h3&gt;
  
  
  Backend implementation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;server.js&lt;/strong&gt; - The Express server runs on port 4000. It adds a row to the &lt;em&gt;watchers&lt;/em&gt; table when the watch button is clicked. The next time the &lt;em&gt;records&lt;/em&gt; table is updated, it checks for any watchers and puts a notification job in the queue.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;app.post("/api/update/:id", (req, res) =&amp;gt; {
  const recordId = req.params.id;
  const newValue = req.body.value;

  db.updateRecord(recordId, newValue, (err) =&amp;gt; {
    if (err) return res.status(500).json({ error: err.message });

    db.getWatchers(recordId, (err, watchers) =&amp;gt; {
      if (err) return res.status(500).json({ error: err.message });

      watchers.forEach((w) =&amp;gt; {
        enqueueNotification(w.user, recordId, newValue);
      });

      res.json({ message: "Record updated and notifications queued." });
    });
  });
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
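
&lt;p&gt;The fan-out step inside the route above is easy to unit-test if we extract it into a small pure helper. This is a refactoring sketch, not part of the demo code; the name &lt;em&gt;buildNotificationJobs&lt;/em&gt; is hypothetical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Build one job descriptor per watcher for a record change.
// Pure function: testable without Express, SQLite, or Redis.
function buildNotificationJobs(watchers, recordId, newValue) {
  return watchers.map(function (w) {
    return { name: "notify", data: { user: w.user, recordId: recordId, newValue: newValue } };
  });
}

// The route body would then become:
// buildNotificationJobs(watchers, recordId, newValue)
//   .forEach(function (job) { notificationQueue.add(job.name, job.data); });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
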



&lt;p&gt;&lt;strong&gt;queue.js&lt;/strong&gt; - This adds the notification to the BullMQ queue for processing by the worker.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const { Queue } = require("bullmq");

const notificationQueue = new Queue("notifications", {
  connection: { host: "127.0.0.1", port: 6379 },
});

function enqueueNotification(user, recordId, newValue) {
  notificationQueue.add("notify", { user, recordId, newValue });
  console.log(`Job queued for ${user} on record ${recordId}`);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
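
&lt;p&gt;As written, a job has no retry policy, so a transient Redis or channel failure drops the notification. BullMQ accepts per-job options as a third argument to &lt;em&gt;add()&lt;/em&gt;; here is a sketch with illustrative values, not tuned defaults:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Per-job options passed as the third argument to notificationQueue.add().
// attempts and backoff are standard BullMQ job options; values are illustrative.
const jobOptions = {
  attempts: 3,                                   // retry a failed notification up to 3 times
  backoff: { type: "exponential", delay: 1000 }, // wait 1s, then 2s, then 4s between attempts
  removeOnComplete: true                         // drop finished jobs so Redis stays tidy
};

// notificationQueue.add("notify", { user, recordId, newValue }, jobOptions);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
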



&lt;h3&gt;
  
  
  Frontend implementation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;App.js&lt;/strong&gt; - Simple app for the watcher demo&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import React, { useEffect, useState } from "react";
import { getRecord, watchRecord, updateRecord } from "./api";

function App() {
  const [record, setRecord] = useState(null);
  const [newValue, setNewValue] = useState("");

  useEffect(() =&amp;gt; {
    async function load() {
      const data = await getRecord(1);
      setRecord(data);
    }
    load();
  }, []);

  const handleWatch = async () =&amp;gt; {
    const res = await watchRecord(1);
    alert(res.message);
  };

  const handleUpdate = async () =&amp;gt; {
    if (!newValue) return alert("Enter a new value first!");
    const res = await updateRecord(1, newValue);
    alert(res.message);

    const data = await getRecord(1);
    setRecord(data);
    setNewValue("");
  };

  if (!record) return &amp;lt;div&amp;gt;Loading...&amp;lt;/div&amp;gt;;

  return (
    &amp;lt;div style={{ padding: "20px" }}&amp;gt;
      &amp;lt;h2&amp;gt;Watcher Demo&amp;lt;/h2&amp;gt;
      &amp;lt;p&amp;gt;&amp;lt;strong&amp;gt;Record:&amp;lt;/strong&amp;gt; {record.name}&amp;lt;/p&amp;gt;
      &amp;lt;p&amp;gt;&amp;lt;strong&amp;gt;Value:&amp;lt;/strong&amp;gt; {record.value}&amp;lt;/p&amp;gt;

      &amp;lt;div style={{ marginTop: "20px" }}&amp;gt;
        &amp;lt;button onClick={handleWatch} style={{ marginRight: "10px" }}&amp;gt;Watch&amp;lt;/button&amp;gt;
        &amp;lt;input value={newValue} onChange={e =&amp;gt; setNewValue(e.target.value)} placeholder="New Value"/&amp;gt;
        &amp;lt;button onClick={handleUpdate} style={{ marginLeft: "10px" }}&amp;gt;Update&amp;lt;/button&amp;gt;
      &amp;lt;/div&amp;gt;
    &amp;lt;/div&amp;gt;
  );
}

export default App;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;api.js&lt;/strong&gt; - Thin fetch wrappers the React app uses to call the backend.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const API_BASE = "http://localhost:4000/api";

export async function getRecord(id) {
  const res = await fetch(`${API_BASE}/record/${id}`);
  return res.json();
}

export async function watchRecord(id, user = "demo-user") {
  const res = await fetch(`${API_BASE}/watch/${id}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ user }),
  });
  return res.json();
}

export async function updateRecord(id, newValue) {
  const res = await fetch(`${API_BASE}/update/${id}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ value: newValue }),
  });
  return res.json();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
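
&lt;p&gt;Note that api.js calls GET /api/record/:id and POST /api/watch/:id, two routes the server.js excerpt above doesn't show. A hedged sketch of what those handlers could look like (written as factories that take the database handle, so they can be exercised without a live server; the response message is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Hypothetical handlers matching what api.js expects.
// app.get("/api/record/:id", makeGetRecordHandler(db));
function makeGetRecordHandler(db) {
  return function (req, res) {
    db.get("SELECT * FROM records WHERE id = ?", [req.params.id], function (err, row) {
      if (err) return res.status(500).json({ error: err.message });
      res.json(row);
    });
  };
}

// app.post("/api/watch/:id", makeWatchHandler(db));
function makeWatchHandler(db) {
  return function (req, res) {
    db.run("INSERT INTO watchers (user, record_id) VALUES (?, ?)",
      [req.body.user, req.params.id],
      function (err) {
        if (err) return res.status(500).json({ error: err.message });
        res.json({ message: "You are now watching this record." });
      });
  };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
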



&lt;h3&gt;
  
  
  Worker implementation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;worker.js&lt;/strong&gt; - This pulls notification jobs from the queue and sends them to the right channel (console, email, Slack, etc.).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const { Worker } = require("bullmq");

const worker = new Worker(
  "notifications",
  async (job) =&amp;gt; {
    const { user, recordId, newValue } = job.data;
    console.log(`Notifying ${user}: Record ${recordId} changed to "${newValue}"`);
  },
  {
    connection: { host: "127.0.0.1", port: 6379 },
  }
);

worker.on("completed", (job) =&amp;gt; console.log(`Job ${job.id} completed`));
worker.on("failed", (job, err) =&amp;gt; console.error(`Job ${job.id} failed: ${err.message}`));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
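
&lt;p&gt;The worker above logs to the console, but the design calls for email and Slack as well. One way to keep the worker channel-agnostic is a small dispatch table; the registry shape and names here are hypothetical, not part of the demo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Hypothetical channel registry: maps a channel name to a sender function.
const channels = {
  console: function (user, message) {
    console.log("[" + user + "] " + message);
  }
  // email and slack senders would plug in here with the same signature
};

// Pick the requested channel, falling back to console; returns the channel used.
function dispatch(channelName, user, message) {
  const name = channels[channelName] ? channelName : "console";
  channels[name](user, message);
  return name;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
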



&lt;h3&gt;
  
  
  Build, deploy, and run the services
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Start Redis:&lt;/strong&gt; docker run -d -p 6379:6379 redis&lt;br&gt;
&lt;strong&gt;Start backend:&lt;/strong&gt; npm install &amp;amp;&amp;amp; npm start&lt;br&gt;
&lt;strong&gt;Start worker:&lt;/strong&gt; npm install &amp;amp;&amp;amp; npm start&lt;br&gt;
&lt;strong&gt;Start frontend:&lt;/strong&gt; npm install &amp;amp;&amp;amp; npm start&lt;/p&gt;

&lt;h3&gt;
  
  
  Test the application
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Click Watch to subscribe to the demo record.&lt;/li&gt;
&lt;li&gt;Update the value and confirm the notification appears in the worker console.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Frontend&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8p6it1u9riwh3rjfqa3n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8p6it1u9riwh3rjfqa3n.png" alt="User clicks watch button to watch the record change" width="800" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79j8f25z3xpz6y0qt76x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79j8f25z3xpz6y0qt76x.png" alt="Record is updated with new value as 1000" width="800" height="253"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Backend job queued&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgegf3an8uqu6jpd6o31o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgegf3an8uqu6jpd6o31o.png" alt="Record update notification is pushed into queue" width="638" height="168"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worker processed jobs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31ctuucwsf7jicfwre3f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31ctuucwsf7jicfwre3f.png" alt="Worker pulled the events from queue and processed" width="716" height="194"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this article we built a custom watcher system that lets users “watch” a record and get notified when it changes - all in a way that scales. From a user's standpoint, this is one of the most essential features for keeping an eye on a record. We not only built a working demo but also learned a design pattern used in real-world systems like GitHub, Jira, and Confluence.&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>watcher</category>
      <category>javascript</category>
      <category>softwaredevelopment</category>
    </item>
  </channel>
</rss>
