<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Nabin Debnath</title>
    <description>The latest articles on Forem by Nabin Debnath (@nabindebnath).</description>
    <link>https://forem.com/nabindebnath</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3538672%2F1e51e7ad-4c1f-4d31-9f68-eaae38f6a125.jpg</url>
      <title>Forem: Nabin Debnath</title>
      <link>https://forem.com/nabindebnath</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/nabindebnath"/>
    <language>en</language>
    <item>
      <title>The "Stateful Island" Paradox: Architecting Astro for Enterprise Scale</title>
      <dc:creator>Nabin Debnath</dc:creator>
      <pubDate>Tue, 10 Feb 2026 13:00:50 +0000</pubDate>
      <link>https://forem.com/nabindebnath/the-stateful-island-paradox-architecting-astro-for-enterprise-scale-2m49</link>
      <guid>https://forem.com/nabindebnath/the-stateful-island-paradox-architecting-astro-for-enterprise-scale-2m49</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Astro is fantastic for content sites, but it gets tricky when you build complex apps. The specific pain point is state management: because Astro's "Islands" run in isolation, they can't easily talk to each other. This article details a pattern that uses Nano Stores and Edge Middleware to let disjointed islands share state without turning your app back into a bloated SPA.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Reality Check
&lt;/h2&gt;

&lt;p&gt;We’ve all seen the Astro adoption cycle. You pitch it to the team because the performance metrics are undeniable. You build the marketing pages, the blog, and the "About Us" section. The Lighthouse scores hit 100, the JS bundle is nonexistent, and the stack feels perfect.&lt;/p&gt;

&lt;p&gt;Then you hit the wall.&lt;/p&gt;

&lt;p&gt;Usually, it happens when a Product Manager asks for something seemingly simple: "Can we keep the shopping cart count updated in the header when the user adds an item in the sidebar?"&lt;/p&gt;

&lt;p&gt;In a standard React app (Next.js or CRA), this is trivial. You wrap the app in a Context Provider and move on. But in Astro, that header and that sidebar are effectively strangers. They don't share a Virtual DOM. They don't share a parent component. They are two separate mini-apps floating in a sea of static HTML.&lt;/p&gt;

&lt;p&gt;The immediate reflex is to wrap the entire &lt;code&gt;&amp;lt;body&amp;gt;&lt;/code&gt; in a giant React Provider, but that defeats the entire purpose of using Astro. You’ve just accidentally rebuilt a worse Single Page Application.&lt;/p&gt;

&lt;p&gt;We need a way to keep the performance of isolated islands while getting the data consistency of a monolith.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern: Subterranean State
&lt;/h2&gt;

&lt;p&gt;To solve this, we have to stop thinking about state as something that flows down from a parent component. In Astro, there is no persistent parent.&lt;/p&gt;

&lt;p&gt;Instead, think of state as "subterranean." The UI islands float on the surface, disconnected from each other. The data lives "underground," in a framework-agnostic layer that tunnels information up to whatever component needs it.&lt;/p&gt;

&lt;p&gt;We need a stack that meets three criteria:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Framework Agnostic:&lt;/strong&gt; It has to work even if the Header is React and the Cart is Svelte (a common migration scenario).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hydration Independent:&lt;/strong&gt; It needs to exist before the components even wake up.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Server Safe:&lt;/strong&gt; It must accept initial state from the Edge to prevent layout shift.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The solution that works best in production right now is &lt;strong&gt;Nano Stores&lt;/strong&gt; for the client, bridged with &lt;strong&gt;Astro Middleware&lt;/strong&gt; for the server.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Flow Visualization
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsyvrys4atmwq3irppgmn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsyvrys4atmwq3irppgmn.png" alt="The Subterranean Layer" width="800" height="802"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Implementation Details
&lt;/h2&gt;

&lt;p&gt;Let's look at the actual code. We will build a shared cart state that works across frameworks.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Agnostic Domain Logic
&lt;/h3&gt;

&lt;p&gt;We define the store in pure TypeScript. No React, no Svelte, just logic. This makes it incredibly easy to unit test because you don't need to mock a DOM.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// src/stores/cartStore.ts
import { map, computed } from 'nanostores';

export type CartItem = { id: string; price: number; title: string };

export type CartState = {
  items: CartItem[];
  isDrawerOpen: boolean;
};

// 1. Define the Atom
export const $cart = map&amp;lt;CartState&amp;gt;({
  items: [],
  isDrawerOpen: false
});

// 2. Computed State (Performance Optimization)
// Only subscribers to $totalPrice will re-render when items change.
export const $totalPrice = computed($cart, cart =&amp;gt; 
  cart.items.reduce((acc, item) =&amp;gt; acc + item.price, 0)
);

// 3. Actions (The API)
// This is where your business logic lives.
export function addToCart(item: CartItem) {
  const current = $cart.get();

  // Example business logic
  if (current.items.length &amp;gt;= 10) {
    return console.warn("Cart limit reached"); 
  }

  $cart.setKey('items', [...current.items, item]);
  $cart.setKey('isDrawerOpen', true);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
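&lt;p&gt;Because the store is plain TypeScript, the business rules can be unit tested with no DOM and no framework harness. Here is a dependency-free sketch of such a test: the tiny inline store is a stand-in for the real &lt;code&gt;nanostores&lt;/code&gt; import so the snippet runs anywhere, and the cap check is simplified to an equality test (items are only ever added one at a time):&lt;/p&gt;

```typescript
// Stand-in for nanostores' `map` so this sketch has zero dependencies.
// In a real test suite you would import { $cart, addToCart } from './cartStore'.
type CartItem = { id: string; price: number; title: string };

const $cart = {
  value: { items: [] as CartItem[], isDrawerOpen: false },
  get: function () { return this.value; },
  setKey: function (key: string, v: unknown) {
    this.value = Object.assign({}, this.value, { [key]: v });
  },
};

function addToCart(item: CartItem) {
  const current = $cart.get();
  // Items are added one at a time, so length reaches exactly 10 at the cap.
  if (current.items.length === 10) {
    console.warn('Cart limit reached');
    return;
  }
  $cart.setKey('items', [...current.items, item]);
  $cart.setKey('isDrawerOpen', true);
}

// The "unit test": pure assertions on state transitions, no DOM required.
addToCart({ id: '1', title: 'Widget', price: 25 });
if ($cart.get().items.length !== 1) throw new Error('item was not added');
if (!$cart.get().isDrawerOpen) throw new Error('drawer did not open');
console.log('cart store logic: all assertions passed');
```

&lt;p&gt;Swapping the stand-in for the real &lt;code&gt;nanostores&lt;/code&gt; &lt;code&gt;map&lt;/code&gt; changes nothing about the assertions, which is exactly the point of keeping domain logic framework-free.&lt;/p&gt;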



&lt;h3&gt;
  
  
  The React Consumer (Headless)
&lt;/h3&gt;

&lt;p&gt;The React component is now just a dumb view. It doesn't manage state; it just reflects it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// src/components/Header.tsx
import { useStore } from '@nanostores/react';
import { $cart, $totalPrice } from '../stores/cartStore';

export const Header = () =&amp;gt; {
  // This component will re-render AUTOMATICALLY when $cart changes
  const cart = useStore($cart);
  const total = useStore($totalPrice);

  return (
    &amp;lt;nav&amp;gt;
      &amp;lt;h1&amp;gt;Enterprise Store&amp;lt;/h1&amp;gt;
      &amp;lt;div className="cart-summary"&amp;gt;
        {cart.items.length} items (${total})
      &amp;lt;/div&amp;gt;
    &amp;lt;/nav&amp;gt;
  );
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Svelte Producer
&lt;/h3&gt;

&lt;p&gt;Here is the cool part. The Svelte component imports the exact same file. No "props drilling" through three layers of layout components.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;script&amp;gt;
  import { addToCart } from '../stores/cartStore';
  export let product;
&amp;lt;/script&amp;gt;

&amp;lt;div class="card"&amp;gt;
  &amp;lt;h3&amp;gt;{product.title}&amp;lt;/h3&amp;gt;
  &amp;lt;button on:click={() =&amp;gt; addToCart(product)}&amp;gt;
    Add to Cart
  &amp;lt;/button&amp;gt;
&amp;lt;/div&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Solving the "Flash of Zero State"
&lt;/h2&gt;

&lt;p&gt;If you stop here, you have a race condition.&lt;/p&gt;

&lt;p&gt;When a user refreshes the page, the store initializes with its empty defaults. Then, maybe 500ms later, your client-side JS kicks in, reads from &lt;code&gt;localStorage&lt;/code&gt;, and the cart count jumps from 0 to 5.&lt;/p&gt;

&lt;p&gt;That layout shift is a user experience killer. In a real app, the initial state usually comes from the server (a session cookie, a user database).&lt;/p&gt;

&lt;p&gt;We can use Astro Middleware to fetch this data on the server and hand it off to the store before the browser even paints.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Middleware Injection
&lt;/h3&gt;

&lt;p&gt;We intercept the request at the edge to fetch the user's session.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// src/middleware.ts
import { defineMiddleware } from 'astro/middleware';

export const onRequest = defineMiddleware(async (context, next) =&amp;gt; {
  // 1. Identify User (Simulated)
  const sessionToken = context.cookies.get('auth_token');

  // 2. Fetch State (Simulated DB call)
  // In reality, this would be await db.getCart(sessionToken)
  const userCart = { 
    items: [{ id: '1', title: 'Saved Item', price: 50 }], 
    isDrawerOpen: false 
  };

  // 3. Attach to locals so the Layout can see it
  context.locals.initialState = userCart;

  return next();
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The HTML Handoff
&lt;/h3&gt;

&lt;p&gt;In your main layout, we bridge the server-client gap. We write the state directly into a global variable so the store can pick it up synchronously.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
// src/layouts/Layout.astro
const { initialState } = Astro.locals;
---
&amp;lt;head&amp;gt;
  &amp;lt;script define:vars={{ initialState }}&amp;gt;
    window.SERVER_STATE = initialState;
  &amp;lt;/script&amp;gt;

  &amp;lt;script&amp;gt;
    import { $cart } from '../stores/cartStore';

    if (window.SERVER_STATE) {
      $cart.set(window.SERVER_STATE);
    }
  &amp;lt;/script&amp;gt;
&amp;lt;/head&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Why This Approach Scales
&lt;/h2&gt;

&lt;p&gt;This isn't just a hack to make things work; it's a better architectural pattern for large teams.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decoupling:&lt;/strong&gt; Team A can work on the Search Bar (React) and Team B can work on the Checkout Sidebar (Svelte). As long as they agree on the &lt;code&gt;cartStore.ts&lt;/code&gt; interface, they never step on each other's toes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance:&lt;/strong&gt; You maintain the "Island" benefits. The header hydrates immediately (&lt;code&gt;client:load&lt;/code&gt;), but the heavy cart sidebar can wait until the user clicks a button (&lt;code&gt;client:idle&lt;/code&gt; or &lt;code&gt;client:only&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Portability:&lt;/strong&gt; If you decide to ditch React for SolidJS next year, your business logic (the store) stays exactly the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The "Stateful Island" paradox is only a problem if you try to force Astro to behave like Next.js. Once you decouple your state from your UI framework and let it live in the "subterranean" layer, Astro becomes a serious contender for complex, enterprise-grade applications.&lt;/p&gt;

&lt;p&gt;Stop fighting the isolation. Embrace it, and tunnel your data underneath.&lt;/p&gt;

</description>
      <category>astro</category>
      <category>webdev</category>
      <category>javascript</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Composite SLOs for Serverless Event-Driven Systems</title>
      <dc:creator>Nabin Debnath</dc:creator>
      <pubDate>Mon, 05 Jan 2026 13:46:01 +0000</pubDate>
      <link>https://forem.com/nabindebnath/composite-slos-for-serverless-event-driven-systems-40do</link>
      <guid>https://forem.com/nabindebnath/composite-slos-for-serverless-event-driven-systems-40do</guid>
      <description>&lt;p&gt;&lt;strong&gt;Measuring What Users Experience Across API Gateway -&amp;gt; Lambda -&amp;gt; DynamoDB -&amp;gt; EventBridge&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;

&lt;p&gt;Serverless systems rarely fail at a single component. Failures occur at the junctures between managed services. Yet most SLO implementations still measure API Gateway, Lambda and DynamoDB in isolation.&lt;br&gt;
This article shows how to define and operate composite, end-to-end SLOs for a real serverless chain. You'll see how to derive availability and latency SLIs across multiple AWS services, calculate error budgets correctly, wire burn-rate alerts, and ship a working dashboard using CloudWatch metric math and infrastructure as code.&lt;/p&gt;


&lt;h2&gt;
  
  
  Introduction: Why "Everything Is Green" Is Still Not Good Enough
&lt;/h2&gt;

&lt;p&gt;If you have operated a serverless system for a long time, you'll eventually experience this situation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API Gateway shows 99.9% availability&lt;/li&gt;
&lt;li&gt;Lambda error rate looks fine&lt;/li&gt;
&lt;li&gt;DynamoDB has no throttles&lt;/li&gt;
&lt;li&gt;EventBridge metrics are quiet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Despite this, users are retrying operations, workflows remain unfinished, and crucial events are failing to reach downstream systems.&lt;/p&gt;

&lt;p&gt;Nothing is individually broken. The system is. The problem is not observability coverage. It's how reliability is modeled.&lt;/p&gt;

&lt;p&gt;Serverless architectures push complexity into managed services. That's a good trade until reliability is measured per service instead of per request. At that point, SLOs stop representing user experience and start representing dashboards.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Gap We're Trying to Close
&lt;/h2&gt;

&lt;p&gt;There is a lot of content online about SLOs, but what's missing is specificity.&lt;br&gt;
What I mostly find:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generic SLO explainers based on microservices&lt;/li&gt;
&lt;li&gt;Burn-rate math explained with Prometheus examples&lt;/li&gt;
&lt;li&gt;AWS blog posts measuring one service at a time&lt;/li&gt;
&lt;li&gt;"Composite SLO" is mentioned as a concept&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What’s consistently absent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A step-by-step SLO model for serverless request chains&lt;/li&gt;
&lt;li&gt;Clear guidance on what counts as failure when managed services retry, buffer, or partially succeed&lt;/li&gt;
&lt;li&gt;Concrete examples using CloudWatch metric math&lt;/li&gt;
&lt;li&gt;A way to combine sync and async paths into a single availability signal&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article closes that gap.&lt;/p&gt;


&lt;h2&gt;
  
  
  The System We're Measuring
&lt;/h2&gt;

&lt;p&gt;We'll use a very common production pattern:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8bg4o091c0smd2uhs4t6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8bg4o091c0smd2uhs4t6.png" alt="Common Serverless Architecture Flow" width="800" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From a user’s perspective, the request is successful only if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API Gateway accepts and processes the request&lt;/li&gt;
&lt;li&gt;Lambda executes successfully&lt;/li&gt;
&lt;li&gt;The DynamoDB write succeeds&lt;/li&gt;
&lt;li&gt;The event is published to EventBridge&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a deliberately strict definition, and it drives everything that follows: any result short of complete success is a partial failure, even if the HTTP response code is 200.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why Per-Service SLOs Break Down in Serverless
&lt;/h2&gt;

&lt;p&gt;Per-service SLOs assume clean failure boundaries. Serverless doesn’t have those.&lt;/p&gt;

&lt;p&gt;Consider this real scenario:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API Gateway returns 200&lt;/li&gt;
&lt;li&gt;Lambda executes successfully&lt;/li&gt;
&lt;li&gt;DynamoDB write succeeds&lt;/li&gt;
&lt;li&gt;EventBridge PutEvents partially fails&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your API metrics look perfect. Your Lambda metrics look perfect. Your DynamoDB metrics look perfect.&lt;br&gt;
Your business workflow is broken.&lt;br&gt;
This is why composite SLOs are not "advanced"; they're table stakes for event-driven systems.&lt;/p&gt;
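&lt;p&gt;The arithmetic behind this is unforgiving. If every hop must succeed and the hops fail (roughly) independently, per-hop availabilities multiply, so four services that each meet a 99.9% target only promise about 99.6% end to end. A quick sketch with illustrative numbers:&lt;/p&gt;

```typescript
// Four chained services, each individually meeting "three nines" (99.9%).
const perHopAvailability = 0.999;
const hops = 4; // API Gateway, Lambda, DynamoDB, EventBridge

// If every hop must succeed, availabilities multiply.
const composite = Math.pow(perHopAvailability, hops);

console.log(composite.toFixed(4)); // "0.9960" - noticeably worse than any single hop
```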


&lt;h2&gt;
  
  
  Defining the Composite SLO
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Availability Objective
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;99.5% of requests must complete end-to-end successfully over a rolling 30-day window
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;A request is counted as successful only if all four steps succeed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important Scope Clarification:&lt;/strong&gt; Note that for the asynchronous step (EventBridge), "success" means the event was successfully published to the bus. This SLO measures the promise of work, not the eventual consumption by downstream subscribers.&lt;br&gt;
If you have critical downstream consumers, they need their own separate SLOs. Trying to jam async consumption into a synchronous API availability metric will only create noise.&lt;/p&gt;
&lt;h3&gt;
  
  
  Latency Objective
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;95% of successful requests must complete within 800 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Latency here reflects user-visible delay, not async processing time. EventBridge publishing is included in availability, not latency.&lt;br&gt;
This distinction matters more than most teams realize.&lt;/p&gt;


&lt;h2&gt;
  
  
  Choosing SLIs That Actually Map to Reality
&lt;/h2&gt;

&lt;p&gt;We will compose existing AWS metrics.&lt;br&gt;
&lt;strong&gt;Availability Signals&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API Gateway

&lt;ul&gt;
&lt;li&gt;Count&lt;/li&gt;
&lt;li&gt;5XXError&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Lambda

&lt;ul&gt;
&lt;li&gt;Invocations&lt;/li&gt;
&lt;li&gt;Errors&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;DynamoDB

&lt;ul&gt;
&lt;li&gt;UserErrors&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;EventBridge

&lt;ul&gt;
&lt;li&gt;PutEventsFailedEntries&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These metrics already exist. The work is in combining them correctly.&lt;/p&gt;


&lt;h2&gt;
  
  
  Composite Availability: Turning Fragments Into a Single Signal
&lt;/h2&gt;

&lt;p&gt;The core question is simple: Out of all incoming requests, how many completed the full chain?&lt;br&gt;
We model this explicitly using CloudWatch metric math.&lt;br&gt;
&lt;strong&gt;Composite Availability Expression&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CompositeAvailabilityMetric:
  Type: AWS::CloudWatch::Alarm
  Properties:
    Metrics:
      - Id: totalRequests
        MetricStat:
          Metric:
            Namespace: AWS/ApiGateway
            MetricName: Count
            Dimensions:
              - Name: ApiName
                Value: OrdersAPI
          Period: 60
          Stat: Sum

      - Id: apiFailures
        MetricStat:
          Metric:
            Namespace: AWS/ApiGateway
            MetricName: 5XXError
            Dimensions:
              - Name: ApiName
                Value: OrdersAPI
          Period: 60
          Stat: Sum

      - Id: lambdaFailures
        MetricStat:
          Metric:
            Namespace: AWS/Lambda
            MetricName: Errors
            Dimensions:
              - Name: FunctionName
                Value: OrdersHandler
          Period: 60
          Stat: Sum

      - Id: eventFailures
        MetricStat:
          Metric:
            Namespace: AWS/Events
            # Entries rejected by PutEvents, i.e. events that never reached the bus.
            # (FailedInvocations would measure downstream rule delivery, which is
            # outside this SLO's scope.)
            MetricName: PutEventsFailedEntries
          Period: 60
          Stat: Sum

      - Id: availability
        # IF guards the division in periods with zero traffic
        Expression: "IF(totalRequests &amp;gt; 0, 1 - ((FILL(apiFailures,0) + FILL(lambdaFailures,0) + FILL(eventFailures,0)) / totalRequests), 1)"
        Label: CompositeAvailability
        ReturnData: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This produces a &lt;strong&gt;single availability SLI&lt;/strong&gt; that reflects user reality, not service health.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A Note on "Strict" Math &amp;amp; Retries&lt;/strong&gt;&lt;br&gt;
You might notice this formula is ruthless: &lt;code&gt;1 - (failures / total)&lt;/code&gt;. In a serverless world, services like Lambda often retry automatically on failure.&lt;/p&gt;

&lt;p&gt;If a Lambda fails twice and succeeds on the third try, this metric counts it as a failure. This is intentional. Hidden retries burn your error budget and increase latency. By penalizing retries in your availability score, you force the team to fix the underlying flakiness rather than letting the retry policy hide it.&lt;/p&gt;
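&lt;p&gt;With illustrative numbers, the effect of counting every failed attempt looks like this:&lt;/p&gt;

```typescript
// 1,000 requests in a window. 30 of them succeeded only after an automatic
// retry (so Lambda's Errors metric still recorded a failed attempt), and 5
// failed outright. The strict formula charges all 35 to the error budget.
const totalRequests = 1000;
const failedAttempts = 35; // 30 retried-then-succeeded plus 5 hard failures

const strictAvailability = 1 - failedAttempts / totalRequests;

console.log((strictAvailability * 100).toFixed(2)); // "96.50" percent
```

&lt;p&gt;A lenient model that counted only the 5 hard failures would report 99.5% and declare the SLO met; the strict model reports 96.5% and surfaces the flaky dependency.&lt;/p&gt;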


&lt;h2&gt;
  
  
  Composite Latency: Measuring the Critical Path
&lt;/h2&gt;

&lt;p&gt;Latency is additive across synchronous hops. (One caveat: summing per-hop p95s typically overestimates the true end-to-end p95, so treat the result as a conservative approximation.)&lt;/p&gt;

&lt;p&gt;We use percentile metrics to avoid averages masking tail behavior.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CompositeLatencyMetric:
  Metrics:
    - Id: apiLatency
      MetricStat:
        Metric:
          Namespace: AWS/ApiGateway
          MetricName: Latency
          Dimensions:
            - Name: ApiName
              Value: OrdersAPI
        Period: 60
        Stat: p95

    - Id: lambdaDuration
      MetricStat:
        Metric:
          Namespace: AWS/Lambda
          MetricName: Duration
          Dimensions:
            - Name: FunctionName
              Value: OrdersHandler
        Period: 60
        Stat: p95

    - Id: dynamoLatency
      MetricStat:
        Metric:
          Namespace: AWS/DynamoDB
          MetricName: SuccessfulRequestLatency
          Dimensions:
            - Name: TableName
              Value: Orders
        Period: 60
        Stat: p95

    - Id: totalLatency
      Expression: "apiLatency + lambdaDuration + dynamoLatency"
      Label: EndToEndLatency
      ReturnData: true

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prevents the common error of prematurely claiming "P95 latency is acceptable" even while users are still experiencing delays.&lt;/p&gt;
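&lt;p&gt;The "percentiles, not averages" rule deserves a concrete illustration. With a hypothetical window of 100 requests where 10 hit a slow tail, the average still looks healthy while the p95 tells the truth:&lt;/p&gt;

```typescript
// 100 request latencies in ms: 90 fast ones and a 10-request slow tail.
const latencies = Array(90).fill(100).concat(Array(10).fill(2000));

// Nearest-rank percentile of a sample.
function percentile(values: number[], p: number) {
  const sorted = values.slice().sort(function (a, b) { return a - b; });
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[rank];
}

const average =
  latencies.reduce(function (sum, v) { return sum + v; }, 0) / latencies.length;

console.log(average);                   // 290 - "fine" by most dashboards
console.log(percentile(latencies, 95)); // 2000 - what the tail users actually feel
```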




&lt;h2&gt;
  
  
  Error Budgets and Burn Rate (Where SLOs Become Useful)
&lt;/h2&gt;

&lt;p&gt;For a 99.5% SLO over 30 days:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total error budget: 0.5%&lt;/li&gt;
&lt;li&gt;Budget in minutes: ~216 minutes/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We use multi-window burn-rate alerts to avoid noise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fast Burn (Page)&lt;/strong&gt;: If we burn the monthly budget in under 2 days, something is seriously wrong.&lt;br&gt;
&lt;strong&gt;Slow Burn (Ticket)&lt;/strong&gt;: If we’re slowly bleeding reliability, the system needs attention, but not at 2am.&lt;/p&gt;

&lt;p&gt;These alerts are driven by the composite availability metric, not individual services. That alignment is the entire point.&lt;/p&gt;
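&lt;p&gt;The budget and burn-rate numbers above fall straight out of the SLO definition. A minimal sketch of the arithmetic (the 2-day paging threshold is a policy choice, not a law):&lt;/p&gt;

```typescript
// 99.5% availability over a rolling 30-day window.
const slo = 0.995;
const windowDays = 30;
const windowMinutes = windowDays * 24 * 60; // 43,200 minutes

// The error budget is everything the SLO does not promise.
const budgetMinutes = Math.round(windowMinutes * (1 - slo));
console.log(budgetMinutes); // 216 minutes of full unavailability per month

// Burn rate: how many times faster than "exactly on budget" we are failing.
// Exhausting the whole budget in 2 days instead of 30 is a 15x burn rate.
const fastBurnDays = 2;
const fastBurnRate = windowDays / fastBurnDays;
console.log(fastBurnRate); // 15 - fast burn, page someone
```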




&lt;h2&gt;
  
  
  Dashboard Design: Fewer Charts, Better Decisions
&lt;/h2&gt;

&lt;p&gt;To keep investigations short and focused, the dashboard needs only these widgets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Composite availability (rolling 30 days)&lt;/li&gt;
&lt;li&gt;Error budget remaining&lt;/li&gt;
&lt;li&gt;End-to-end latency p95&lt;/li&gt;
&lt;li&gt;API 5xx&lt;/li&gt;
&lt;li&gt;Lambda errors&lt;/li&gt;
&lt;li&gt;EventBridge failed entries&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Best Practices and Anti-Patterns
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best Practices&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define failure conservatively&lt;/li&gt;
&lt;li&gt;Use percentiles, not averages&lt;/li&gt;
&lt;li&gt;Treat async failures as first-class reliability issues&lt;/li&gt;
&lt;li&gt;Alert on burn rate, not raw errors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Anti-Patterns&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ignoring retries in SLI math&lt;/li&gt;
&lt;li&gt;Counting HTTP 200 as success unconditionally&lt;/li&gt;
&lt;li&gt;Measuring latency per service in isolation&lt;/li&gt;
&lt;li&gt;Treating EventBridge as "eventually reliable"&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Serverless systems fail in ways that traditional SLO models don’t capture. Composite SLOs fix that by forcing reliability to align with user experience instead of service boundaries.&lt;/p&gt;

&lt;p&gt;If you run event-driven systems and still rely on per-service health, you're measuring the wrong thing.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>serverless</category>
      <category>observability</category>
      <category>aws</category>
    </item>
    <item>
      <title>I Replaced a Docker-based Microservice with WebAssembly and It's 100x+ Faster</title>
      <dc:creator>Nabin Debnath</dc:creator>
      <pubDate>Sat, 13 Dec 2025 15:22:44 +0000</pubDate>
      <link>https://forem.com/nabindebnath/i-replaced-a-docker-based-microservice-with-webassembly-and-its-100x-faster-4f6d</link>
      <guid>https://forem.com/nabindebnath/i-replaced-a-docker-based-microservice-with-webassembly-and-its-100x-faster-4f6d</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;br&gt;
We've all heard the quote from Docker's founder, Solomon Hykes, back in 2019: "&lt;em&gt;If WASM+WASI existed in 2008, we wouldn't have needed to create Docker.&lt;/em&gt;"&lt;/p&gt;

&lt;p&gt;For years, this was just a prophecy. But in 2025, the tech has finally caught up.&lt;/p&gt;

&lt;p&gt;I decided to find out if he was right. I took a simple, everyday Node.js microservice running in Docker, rewrote it in Rust compiled to WebAssembly (Wasm), and benchmarked them head-to-head.&lt;/p&gt;

&lt;p&gt;The results weren't just better; they were shocking. We're talking 99% smaller artifacts, incremental build times cut by 10x, and cold-start times that are over 100x faster.&lt;/p&gt;

&lt;p&gt;Here's the full story, with all the code and benchmarks.&lt;/p&gt;
&lt;h2&gt;
  
  
  Part 1: The "Before" - Our Bloated Docker Service
&lt;/h2&gt;

&lt;p&gt;To make this a fair comparison, I picked a perfect candidate for a microservice: a JWT (JSON Web Token) Validator.&lt;/p&gt;

&lt;p&gt;It's a common, real-world task. An API gateway or backend service receives a request, takes the &lt;code&gt;Authorization: Bearer &amp;lt;token&amp;gt;&lt;/code&gt; header, and needs to ask a different service, "Is this token valid?"&lt;/p&gt;

&lt;p&gt;It's a simple, stateless function, an ideal candidate for its own container.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Node.js / Express Code
&lt;/h3&gt;

&lt;p&gt;It's an Express server with a single endpoint, &lt;code&gt;/validate&lt;/code&gt;. It uses the &lt;code&gt;jsonwebtoken&lt;/code&gt; library to verify the token against a secret.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// validator-node/index.js
import express from 'express';
import jwt from 'jsonwebtoken';

const app = express();
app.use(express.json());

// The one secret key our service knows
const JWT_SECRET = process.env.JWT_SECRET || 'a-very-strong-secret-key';

app.post('/validate', (req, res) =&amp;gt; {
  const { token } = req.body;

  if (!token) {
    return res.status(400).send({ valid: false, error: 'No token provided' });
  }

  try {
    // The core logic!
    jwt.verify(token, JWT_SECRET);
    // If it doesn't throw, it's valid
    res.status(200).send({ valid: true });
  } catch (err) {
    // If it throws, it's invalid
    res.status(401).send({ valid: false, error: err.message });
  }
});

const port = process.env.PORT || 3000;
app.listen(port, () =&amp;gt; {
  console.log(`Node.js validator listening on port ${port}`);
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
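&lt;p&gt;To poke at the endpoint locally you need a token signed with the same secret. A small sketch using only Node's built-in &lt;code&gt;crypto&lt;/code&gt; module (not part of the service itself) mints one:&lt;/p&gt;

```typescript
// mint-token.ts - create an HS256 JWT for local testing of /validate.
// Uses only the Node standard library; the secret matches the service default.
import { createHmac } from 'node:crypto';
import { Buffer } from 'node:buffer';

function base64url(input: string) {
  return Buffer.from(input).toString('base64url');
}

function signHS256(payload: object, secret: string) {
  const header = base64url(JSON.stringify({ alg: 'HS256', typ: 'JWT' }));
  const body = base64url(JSON.stringify(payload));
  const signature = createHmac('sha256', secret)
    .update(header + '.' + body)
    .digest('base64url');
  return header + '.' + body + '.' + signature;
}

const token = signHS256(
  { sub: 'user-123', exp: Math.floor(Date.now() / 1000) + 3600 },
  'a-very-strong-secret-key'
);

console.log(token); // three dot-separated base64url segments
```

&lt;p&gt;Post the output as &lt;code&gt;{"token": "..."}&lt;/code&gt; to &lt;code&gt;/validate&lt;/code&gt; on either implementation; both should accept it.&lt;/p&gt;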



&lt;h3&gt;
  
  
  The Dockerfile
&lt;/h3&gt;

&lt;p&gt;We use a multi-stage build with an Alpine base image to keep it small.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Dockerfile
# --- Build Stage ---
FROM node:18-alpine AS build
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci

COPY . .

# --- Production Stage ---
FROM node:18-alpine
WORKDIR /app
COPY --from=build /app/node_modules ./node_modules
COPY --from=build /app/index.js ./index.js
# package.json is still required at runtime: it declares "type": "module",
# without which Node would reject the ESM `import` syntax in index.js
COPY --from=build /app/package.json ./package.json

ENV NODE_ENV=production
CMD ["node", "index.js"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;Let's check a few things after Docker does its work: the true cost of this simple service.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Build Time:&lt;/strong&gt; On my machine, building this from a cold cache takes ~81 seconds. Even with Docker layer caching, re-building after a small code change takes about 45 seconds due to context switching and layer hashing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Artifact Size:&lt;/strong&gt; After building, the final image is 188MB. That's 188MB to ship a 30-line script.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Cold Start:&lt;/strong&gt; When deployed to a serverless platform (like Cloud Run or scaled-to-zero K8s), the cold start is painful. The container has to be pulled, and the Node.js runtime has to boot. I was seeing cold starts between 800ms and 1.5 seconds. That's a user-facing delay.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Part 2: The "After" - Rebuilding with WebAssembly
&lt;/h2&gt;

&lt;p&gt;Wasm modules are small, compile-to-binary, and run in a secure, sandboxed runtime that starts in microseconds. Unlike Docker, which packages a whole OS, Wasm just packages your code.&lt;/p&gt;

&lt;p&gt;I chose to rewrite it in Rust because of its first-class Wasm support and performance. I used the &lt;a href="https://github.com/spinframework/spin" rel="noopener noreferrer"&gt;Spin framework&lt;/a&gt;, which makes building Wasm-based HTTP services incredibly simple.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Rust / Spin Code
&lt;/h3&gt;

&lt;p&gt;First, let's install the Spin CLI and scaffold a new project.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ spin new
# I selected: http-rust (HTTP trigger with Rust)
Project name: validator-wasm
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This generates a &lt;code&gt;src/lib.rs&lt;/code&gt; file. I opted to use the &lt;code&gt;jwt-simple&lt;/code&gt; crate instead of the standard &lt;code&gt;jsonwebtoken&lt;/code&gt; because &lt;code&gt;jwt-simple&lt;/code&gt; is a pure-Rust implementation. This avoids C-binding issues and compiles down to an incredibly small Wasm binary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// validator-wasm/src/lib.rs
use anyhow::{Result, Context};
use spin_sdk::{
    http::{Request, Response, Router, Params},
    http_component,
};
use serde::{Deserialize, Serialize};
use jwt_simple::prelude::*;

// 1. Define our request and response structs
#[derive(Deserialize)]
struct TokenRequest {
    token: String,
}

#[derive(Serialize)]
struct TokenResponse {
    valid: bool,
    #[serde(skip_serializing_if = "Option::is_none")]
    error: Option&amp;lt;String&amp;gt;,
}

// Get the JWT secret from environment or use a default
fn get_secret() -&amp;gt; HS256Key {
    let secret = std::env::var("JWT_SECRET").unwrap_or_else(|_| "a-very-strong-secret-key".to_string());
    HS256Key::from_bytes(secret.as_bytes())
}

/// The Spin HTTP component
#[http_component]
fn handle_validator(req: Request) -&amp;gt; Result&amp;lt;Response&amp;gt; {
    let mut router = Router::new();
    router.post("/validate", validate_token);
    Ok(router.handle(req))
}

// 2. JWT validation using jwt-simple
fn validate_token(req: Request, _params: Params) -&amp;gt; Result&amp;lt;Response&amp;gt; {
    // Read the request body
    let body = req.body();
    if body.is_empty() {
        return Ok(json_response(400, false, "Empty request body"));
    }

    let token_req: TokenRequest = serde_json::from_slice(body)
        .context("Failed to parse request body")?;

    let key = get_secret();

    // The `verify_token` function does the validation
    match key.verify_token::&amp;lt;serde_json::Value&amp;gt;(&amp;amp;token_req.token, None) {
        Ok(_) =&amp;gt; Ok(json_response(200, true, "")),
        Err(e) =&amp;gt; Ok(json_response(401, false, &amp;amp;e.to_string())),
    }
}

// Helper to build a JSON response
fn json_response(status: u16, valid: bool, error_msg: &amp;amp;str) -&amp;gt; Response {
    let error = if error_msg.is_empty() { 
        None 
    } else { 
        Some(error_msg.to_string()) 
    };

    Response::builder()
        .status(status)
        .header("Content-Type", "application/json")
        .body(serde_json::to_string(&amp;amp;TokenResponse { valid, error }).unwrap())
        .build()
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is admittedly more code than the Node.js version. But it's also type-safe, compiled, and, as we'll see, unbelievably fast.&lt;/p&gt;

&lt;h3&gt;
  
  
  The "Build"
&lt;/h3&gt;

&lt;p&gt;There's no &lt;code&gt;Dockerfile&lt;/code&gt;. Instead, I configured the &lt;code&gt;spin.toml&lt;/code&gt; manifest to use the modern &lt;code&gt;wasm32-wasip1&lt;/code&gt; target.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#:schema https://schemas.spinframework.dev/spin/manifest-v2/latest.json

spin_manifest_version = 2
[application]
name = "validator-wasm"
version = "0.1.0"

[[trigger.http]]
route = "/..."
component = "validator-wasm"

[component.validator-wasm]
source = "target/wasm32-wasip1/release/validator_wasm.wasm"  # The build output
allowed_outbound_hosts = []
[component.validator-wasm.build]
command = "cargo build --target wasm32-wasip1 --release"
watch = ["src/**/*.rs", "Cargo.toml"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Build the entire project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ spin build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This one command compiles the Rust code to a Wasm module.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 3: The Showdown - Docker vs. Wasm Benchmarks
&lt;/h2&gt;

&lt;p&gt;I've run and measured both the Docker container and the Spin Wasm application. Docker ships a full OS userland inside an isolated container, while Wasm runs a tiny, sandboxed module directly on the host runtime.&lt;br&gt;
This architectural difference leads to some staggering benchmark results.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Docker (Node.js)&lt;/th&gt;
&lt;th&gt;WebAssembly (Rust/Spin)&lt;/th&gt;
&lt;th&gt;The Winner&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Artifact Size&lt;/td&gt;
&lt;td&gt;188 MB&lt;/td&gt;
&lt;td&gt;0.5 MB&lt;/td&gt;
&lt;td&gt;Wasm (99.7% smaller)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Build Time (Incremental)&lt;/td&gt;
&lt;td&gt;~45 sec (Docker layer caching)&lt;/td&gt;
&lt;td&gt;4.2 seconds&lt;/td&gt;
&lt;td&gt;Wasm (10x faster)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cold Start Time&lt;/td&gt;
&lt;td&gt;~1.2 seconds (1200ms)&lt;/td&gt;
&lt;td&gt;~10ms&lt;/td&gt;
&lt;td&gt;Wasm (120x faster)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory Usage&lt;/td&gt;
&lt;td&gt;~85 MB (idle)&lt;/td&gt;
&lt;td&gt;~4 MB (idle)&lt;/td&gt;
&lt;td&gt;Wasm (95% less)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Artifact Size:&lt;/strong&gt; The Wasm module is 0.5 MB (548 KB, to be exact), not 188 MB. I can send this file in a Slack message. It's 99.7% smaller.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build Time (Incremental):&lt;/strong&gt; This is the developer "inner loop" metric. Rust's incremental builds are blazing fast: once dependencies are compiled, changing your code and running &lt;code&gt;spin build&lt;/code&gt; takes ~4 seconds. Compared to waiting ~45 seconds for Docker to upload the build context and re-hash layers, it feels like a superpower.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cold Start:&lt;/strong&gt; This is the headline. The Wasm runtime starts in the low-millisecond range. I benchmarked it using &lt;code&gt;spin up&lt;/code&gt; and got startup times consistently around 10ms. Compared to the 1200ms of the container, it's not even a contest.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the "100x faster" promise. It's not that the code executes 100x faster (though the Rust version is quicker); it's that the service can go from zero-to-ready 100 times faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 4: The Verdict - Is Docker Dead?
&lt;/h2&gt;

&lt;p&gt;No, of course not. Wasm is not a Docker killer; it's a Docker alternative for a specific job.&lt;/p&gt;

&lt;p&gt;You should still use Docker/Containers for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Large, complex, stateful applications (like a database).&lt;/li&gt;
&lt;li&gt;Monolithic apps you're lifting-and-shifting.&lt;/li&gt;
&lt;li&gt;Services that truly need a full Linux environment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But WebAssembly is the new king for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Serverless Functions (FaaS)&lt;/li&gt;
&lt;li&gt;Microservices (or "nano-services")&lt;/li&gt;
&lt;li&gt;Edge Computing (where low startup time is critical)&lt;/li&gt;
&lt;li&gt;Plugin Systems (like for a SaaS)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My takeaway: that quote from Solomon Hykes wasn't just a spicy take. He was right.&lt;/p&gt;

&lt;p&gt;The next time you're about to &lt;code&gt;docker init&lt;/code&gt; a new, simple serverless function, ask yourself whether your use case is a good candidate for Wasm instead. It may or may not be.&lt;/p&gt;

&lt;p&gt;Try it yourself. You might be shocked, too.&lt;/p&gt;

</description>
      <category>webassembly</category>
      <category>docker</category>
      <category>rust</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Zero-Code Observability: Using eBPF to Auto-Instrument Services with OpenTelemetry</title>
      <dc:creator>Nabin Debnath</dc:creator>
      <pubDate>Fri, 07 Nov 2025 14:19:13 +0000</pubDate>
      <link>https://forem.com/nabindebnath/zero-code-observability-using-ebpf-to-auto-instrument-services-with-opentelemetry-oki</link>
      <guid>https://forem.com/nabindebnath/zero-code-observability-using-ebpf-to-auto-instrument-services-with-opentelemetry-oki</guid>
      <description>&lt;p&gt;Instrumenting services for observability often means sprinkling tracing code across hundreds of files which is painful to maintain and easy to forget.&lt;br&gt;
&lt;strong&gt;eBPF + OpenTelemetry (OTel)&lt;/strong&gt;: a powerful combination that hooks into your running processes and emits traces, metrics, and logs without touching application code.&lt;/p&gt;

&lt;p&gt;In this post, you’ll learn how to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use an eBPF agent to automatically instrument apps&lt;/li&gt;
&lt;li&gt;Export telemetry data through OpenTelemetry Collector&lt;/li&gt;
&lt;li&gt;Visualize it with Grafana&lt;/li&gt;
&lt;li&gt;Control overhead and noise&lt;/li&gt;
&lt;li&gt;Roll it out safely in production&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Why observability shouldn’t require rewriting code
&lt;/h2&gt;

&lt;p&gt;Modern apps are stitched together from dozens of microservices. We push features daily, yet visibility into performance often lags.&lt;/p&gt;

&lt;p&gt;You’ve probably heard: “&lt;em&gt;We’ll add tracing later.&lt;/em&gt;” …and then it never happens.&lt;/p&gt;

&lt;p&gt;Manual instrumentation with OpenTelemetry SDKs gives fine-grained control, but it comes with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code changes across many repos,&lt;/li&gt;
&lt;li&gt;Version mismatches between SDKs,&lt;/li&gt;
&lt;li&gt;Extra CI/CD validation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Wouldn’t it be nice if the system could observe itself, automatically?&lt;/p&gt;

&lt;p&gt;That’s what eBPF (extended Berkeley Packet Filter) delivers. It hooks into the Linux kernel, captures runtime events (like syscalls, network, and process activity), and forwards them all with low overhead. Combine that with OpenTelemetry, and &lt;em&gt;you get a zero-code observability pipeline&lt;/em&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  eBPF + OpenTelemetry in plain English
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;eBPF&lt;/strong&gt;: Think of eBPF as a programmable microscope for the Linux kernel. It lets you attach tiny programs to events such as network packets or function calls and safely collect data in real-time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenTelemetry&lt;/strong&gt;: OpenTelemetry (OTel) is a vendor-neutral standard for generating and exporting traces, metrics, and logs. It’s supported by almost every major observability backend (Grafana, Datadog, AWS X-Ray, etc.).&lt;/p&gt;

&lt;p&gt;An eBPF agent can auto-discover and instrument running services (HTTP, gRPC, database calls, etc.) and emit OTel-formatted data to your collector.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4kxqqxi8i9y18bbc3qtk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4kxqqxi8i9y18bbc3qtk.png" alt="eBPF and OpenTelemetry integration" width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;No SDKs. No code injection. Everything happens at runtime.&lt;/p&gt;


&lt;h2&gt;
  
  
  Setting up your environment
&lt;/h2&gt;

&lt;p&gt;For today's demo, we’ll use a simple Node.js app and an eBPF agent (Grafana Beyla). You can adapt this for Java, Python, Go, etc.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Create a minimal service&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir ebpf-otel-demo &amp;amp;&amp;amp; cd $_
npm init -y
npm install express
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;index.js&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const express = require("express");
const app = express();

app.get("/orders/:id", async (req, res) =&amp;gt; {
  await new Promise(r =&amp;gt; setTimeout(r, Math.random() * 200));
  res.json({ orderId: req.params.id, status: "OK" });
});

app.listen(3000, () =&amp;gt; console.log("Service running on port 3000"));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Dockerfile&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM node:18
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
CMD ["node", "index.js"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Build and run&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker build -t ebpf-otel-demo .
docker run -p 3000:3000 ebpf-otel-demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your API is now live at &lt;code&gt;http://localhost:3000/orders/123&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Install an eBPF agent&lt;/strong&gt;&lt;br&gt;
Install Beyla on the host or as a sidecar container. (Requires Linux kernel ≥ 5.8.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt-get install linux-headers-$(uname -r)
curl -sSfL https://github.com/grafana/beyla/releases/latest/download/beyla-linux-amd64.tar.gz | tar xz
sudo mv beyla /usr/local/bin/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3: Configure the agent&lt;/strong&gt;&lt;br&gt;
Create beyla-config.yml:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;listen:
  interfaces: [eth0]
otlp:
  endpoint: "localhost:4317"
service:
  name: "orders-service"
instrumentation:
  language: "nodejs"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo beyla run --config beyla-config.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent now attaches to your running container, intercepts HTTP calls, and sends spans to your OTel Collector.&lt;/p&gt;




&lt;h2&gt;
  
  
  Connect OpenTelemetry Collector
&lt;/h2&gt;

&lt;p&gt;The collector acts as a bridge between producers (Beyla) and your observability backend (Grafana, Tempo, or Jaeger).&lt;/p&gt;

&lt;p&gt;Create otel-collector-config.yml:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  logging:
  otlp:
    endpoint: "tempo:4317"
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [logging, otlp]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the collector (in Docker for simplicity):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run --rm -p 4317:4317 -v $(pwd)/otel-collector-config.yml:/etc/otel/config.yml \
  otel/opentelemetry-collector:latest --config /etc/otel/config.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Visualize traces in Grafana
&lt;/h2&gt;

&lt;p&gt;If you’re using Grafana Tempo + Loki + Grafana OSS:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -d --name=grafana -p 3001:3000 grafana/grafana
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add Tempo as a data source and point it to your collector’s OTLP endpoint. Within seconds, you’ll see spans like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "traceId": "5dfb0e7c16b6f9c1",
  "spanId": "8aeb32afaa3e41d9",
  "name": "GET /orders/:id",
  "attributes": {
    "http.method": "GET",
    "http.status_code": 200,
    "service.name": "orders-service"
  },
  "duration_ms": 52.8
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Behind the scenes: what eBPF is doing
&lt;/h2&gt;

&lt;p&gt;eBPF attaches probes (kprobes/uprobes) to kernel and user-space events:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Socket reads/writes -&amp;gt; network latency&lt;/li&gt;
&lt;li&gt;HTTP libraries -&amp;gt; method, route, status&lt;/li&gt;
&lt;li&gt;Syscalls -&amp;gt; file I/O, DNS, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent aggregates these into OTel spans, adds tags (service, method, latency), and exports them asynchronously, typically consuming under 1–2% CPU.&lt;/p&gt;
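To make that aggregation step concrete, here is an illustrative sketch, not Beyla's actual internals, of how raw probe events might be folded into an OTel-style span. The event field names here are hypothetical:

```javascript
// Hypothetical raw-event shape; field names are illustrative, not Beyla's.
function eventToSpan(event) {
  return {
    traceId: event.traceId,
    name: event.method + " " + event.route,
    attributes: {
      "http.method": event.method,
      "http.status_code": event.status,
      "service.name": event.service,
    },
    // Probes record nanosecond timestamps; spans report milliseconds.
    duration_ms: (event.endNs - event.startNs) / 1e6,
  };
}

const span = eventToSpan({
  traceId: "5dfb0e7c16b6f9c1",
  method: "GET",
  route: "/orders/:id",
  status: 200,
  service: "orders-service",
  startNs: 0,
  endNs: 52800000,
});
console.log(span.name); // "GET /orders/:id"
```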

&lt;p&gt;Here’s a simplified view:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frd280sta0fuvv9kwifi0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frd280sta0fuvv9kwifi0.png" alt="Simplified view of eBPF behind the scene" width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Controlling overhead and noise
&lt;/h2&gt;

&lt;p&gt;Auto-instrumentation is powerful, but it can produce a lot of data. Here’s how to keep it efficient:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sampling&lt;/strong&gt;&lt;br&gt;
In beyla-config.yml:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sampling:
  probability: 0.2   # capture 20% of requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Filtering&lt;/strong&gt;&lt;br&gt;
Capture only interesting routes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;filters:
  include_paths: ["/orders/*"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Resource limits&lt;/strong&gt;&lt;br&gt;
Run the agent with limited CPU/memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo systemd-run --property=CPUQuota=20% beyla run ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Security considerations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;eBPF programs run with kernel privileges.&lt;/li&gt;
&lt;li&gt;Always use signed binaries or build from source.&lt;/li&gt;
&lt;li&gt;Test in staging first. Avoid root unless required.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Production Rollout Checklist&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Test in staging with representative traffic.&lt;/li&gt;
&lt;li&gt;Enable sampling (≤ 20%) before full rollout.&lt;/li&gt;
&lt;li&gt;Run the agent in restricted mode (non-root if possible).&lt;/li&gt;
&lt;li&gt;Compare baseline latency before/after attach.&lt;/li&gt;
&lt;li&gt;Use dashboards to monitor agent CPU/memory usage.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why this approach matters
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwcn2sdl1kjpyqilos8rn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwcn2sdl1kjpyqilos8rn.png" alt="real world value" width="800" height="220"&gt;&lt;/a&gt;&lt;br&gt;
You can onboard dozens of services instantly, a huge win for teams with legacy stacks or microservice sprawl.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's next?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Combine with Service Mesh&lt;/strong&gt;: Use eBPF telemetry to enrich service-mesh metrics (Istio, Linkerd).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Join Logs + Traces&lt;/strong&gt;: Since OTel supports logs too, you can correlate application logs with eBPF spans via trace IDs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build Compliance Dashboards&lt;/strong&gt;: In regulated industries (finance, 
healthcare), eBPF traces create immutable audit trails of service interactions without leaking business data.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Common problems you may face
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Kernel version too old: upgrade or use COS/Ubuntu 22+.&lt;/li&gt;
&lt;li&gt;Container visibility: run agent on host or enable --privileged if sidecar fails to attach.&lt;/li&gt;
&lt;li&gt;Over-collection: fine-tune filters.&lt;/li&gt;
&lt;li&gt;Trace backend mismatch: ensure OTel Collector exporter matches your backend format (Tempo, Jaeger, Zipkin)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;You’ve now built an observability stack that requires zero code changes yet delivers full visibility.&lt;/p&gt;

&lt;p&gt;Key takeaways:&lt;br&gt;
✅ eBPF captures runtime events safely and efficiently.&lt;br&gt;
✅ OpenTelemetry unifies data into a portable format.&lt;br&gt;
✅ Together they let developers focus on features.&lt;/p&gt;

&lt;p&gt;Start small: pick one service, attach an agent, visualize traces, and scale gradually.&lt;br&gt;
Once you see that first automatic trace appear in Grafana, you’ll realize: &lt;em&gt;observability doesn’t need to slow you down&lt;/em&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Further reading&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://grafana.com/docs/beyla/latest/" rel="noopener noreferrer"&gt;Grafana Beyla Docs&lt;/a&gt;&lt;br&gt;
&lt;a href="https://opentelemetry.io/docs/collector/" rel="noopener noreferrer"&gt;OpenTelemetry Collector&lt;/a&gt;&lt;br&gt;
&lt;a href="https://ebpf.io/what-is-ebpf/" rel="noopener noreferrer"&gt;eBPF.io Guide&lt;/a&gt;&lt;br&gt;
&lt;a href="https://landscape.cncf.io/" rel="noopener noreferrer"&gt;CNCF Observability Landscape&lt;/a&gt;&lt;/p&gt;

</description>
      <category>observability</category>
      <category>ebpf</category>
      <category>opentelemetry</category>
      <category>devops</category>
    </item>
    <item>
      <title>The "Shift-Left" Imperative: Implementing Data Contracts in CI/CD Pipeline</title>
      <dc:creator>Nabin Debnath</dc:creator>
      <pubDate>Fri, 17 Oct 2025 12:56:49 +0000</pubDate>
      <link>https://forem.com/nabindebnath/the-shift-left-imperative-implementing-data-contracts-in-cicd-pipeline-40cl</link>
      <guid>https://forem.com/nabindebnath/the-shift-left-imperative-implementing-data-contracts-in-cicd-pipeline-40cl</guid>
      <description>&lt;p&gt;Having spent years in the trenches of software development, I've observed countless systems crumble under the weight of one silent killer: &lt;em&gt;data quality drift&lt;/em&gt;. Microservices promise independence, but they are glued together by the data they exchange. When a producer service quietly changes an API response or a database column, downstream consumers break, leading to expensive root-cause-analysis.&lt;/p&gt;

&lt;p&gt;The solution isn't better error handling; it's prevention.&lt;/p&gt;

&lt;p&gt;It's time for Data Engineering and DevOps to fully embrace the &lt;strong&gt;Shift-Left&lt;/strong&gt; philosophy. We must move the validation of our most critical asset, data, from runtime monitoring to compile-time automation. This is the Shift-Left Imperative for data, and the mechanism to achieve it is the Data Contract, implemented directly within the CI/CD pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Exactly is a Data Contract?
&lt;/h2&gt;

&lt;p&gt;A Data Contract is a formal, explicit agreement between a data producer (the service or application that creates the data) and all its consumers (the services, analytical systems, or data warehouses).&lt;/p&gt;

&lt;p&gt;A Data Contract is a versioned schema that specifies the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Structure&lt;/strong&gt;: The field names, data types (e.g., string, integer, timestamp) and required/optional status.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantics (Quality)&lt;/strong&gt;: Expectations for the data's content (e.g., user_id must be a positive integer; email must be a valid format).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SLAs&lt;/strong&gt;: Commitments on availability, latency, and retention.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it as an API specification (like OpenAPI/Swagger), but for data payloads, whether they flow through a REST endpoint, an event stream (Kafka/Pulsar), or a database table.&lt;/p&gt;

&lt;h2&gt;
  
  
  Shift-Left Data Contracts
&lt;/h2&gt;

&lt;p&gt;In traditional data pipelines or microservice architectures, data validation often happens late:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Runtime&lt;/strong&gt;: An error log gets generated when a consumer service crashes because the upstream service suddenly sent something other than what was expected.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-Mortem&lt;/strong&gt;: A downstream data analyst reports a broken dashboard because a column name was changed in the source database.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This phenomenon is called &lt;em&gt;data drift&lt;/em&gt;, and it’s inherently a DevOps problem. It exposes a gap in the software release process: data dependencies go unaccounted for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Shift-Left approach mandates&lt;/strong&gt;: Data Contracts are defined, versioned, and validated before any code that interacts with that data is deployed to a production-like environment. By moving the contract validation into CI/CD, we turn a potential runtime incident into a fast, fixable build failure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementing Data Contracts in the CI/CD Pipeline
&lt;/h2&gt;

&lt;p&gt;The true power of Data Contracts emerges when their validation is fully automated. Let's go through the multi-stage CI/CD flow to enforce the contract across an organization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 1: Contract Definition and Storage&lt;/strong&gt;&lt;br&gt;
The contract should live in a source control repository, often alongside the producer's code, to enforce versioning and a peer-review (Pull Request) process.&lt;/p&gt;

&lt;p&gt;We can use JSON Schema or Avro Schema as the contract format for maximum tooling compatibility.&lt;/p&gt;

&lt;p&gt;Example Contract&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "$id": "order_placed_v1",
  "type": "object",
  "properties": {
    "order_id": {
      "type": "string",
      "format": "uuid"
    },
    "customer_id": {
      "type": "integer",
      "minimum": 1
    },
    "timestamp": {
      "type": "string",
      "format": "date-time"
    }
  },
  "required": ["order_id", "customer_id", "timestamp"]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Stage 2: CI Validation (The Linter for Data)&lt;/strong&gt;&lt;br&gt;
When a developer proposes a change to the producer service or the contract itself, the CI pipeline must immediately enforce two critical checks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Structural Validation (The Contract Check)&lt;/strong&gt;&lt;br&gt;
Use a tool like ajv (for JSON Schema) or a custom Avro parser to ensure the contract file itself is well-formed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Backward and Forward Compatibility Check (The Dependency Check)&lt;/strong&gt;&lt;br&gt;
This is the most crucial step. If the developer is updating the contract (e.g., v1 to v2), we must ensure the new version is backward compatible with all existing consumers. This check is often performed against a Schema Registry API.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the change involves removing a required field or changing a field's data type (e.g., integer to string), the CI pipeline fails. The developer is forced to either re-evaluate the change or propose a major version bump, which signals a breaking change to all consumers.&lt;/p&gt;

&lt;p&gt;Here is a pseudo-code snippet illustrating this check in the CI script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# CI Script (e.g., in a Jenkins/GitLab/GitHub Action pipeline)

# 1. Fetch the last published contract version (vN-1)
OLD_SCHEMA=$(curl -s "schema-registry.corp/api/v1/schemas/${TOPIC}/latest")

# 2. Register the new contract version (vN) in a test mode
RESPONSE=$(curl -X POST -H "Content-Type: application/json" \
  "schema-registry.corp/api/v1/schemas/${TOPIC}/versions" \
  --data @new_contract_file.json)

# 3. Check the compatibility flag returned by the Schema Registry
COMPATIBILITY_STATUS=$(echo "$RESPONSE" | jq -r '.is_compatible')

if [ "$COMPATIBILITY_STATUS" != "true" ]; then
  echo "Data Contract failed compatibility check!"
  echo "Breaking changes detected: New schema vN is not backward compatible with vN-1."
  exit 1
else
  echo "Contract is compatible. Proceeding with registration and code generation."
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
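If you don't have a Schema Registry handy, the registry-side rule itself can be sketched locally. This is a toy check covering only the two breaking changes described above (removed required fields and changed field types); real registries such as Confluent apply a much richer rule set:

```javascript
// Toy backward-compatibility check between two JSON Schema versions.
function isBackwardCompatible(oldSchema, newSchema) {
  const newRequired = newSchema.required || [];
  // Every field the old contract required must still be required.
  for (const field of oldSchema.required || []) {
    if (!newRequired.includes(field)) return false;
  }
  // No existing field may change its declared type.
  for (const name of Object.keys(oldSchema.properties || {})) {
    const oldDef = oldSchema.properties[name];
    const newDef = (newSchema.properties || {})[name];
    if (newDef !== undefined) {
      if (newDef.type !== oldDef.type) return false;
    }
  }
  return true;
}

const v1 = {
  properties: { order_id: { type: "string" }, customer_id: { type: "integer" } },
  required: ["order_id", "customer_id"],
};
const v2 = {
  properties: { order_id: { type: "string" }, customer_id: { type: "string" } },
  required: ["order_id", "customer_id"],
};
console.log(isBackwardCompatible(v1, v2)); // false: customer_id changed type
```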



&lt;p&gt;&lt;strong&gt;Stage 3: Artifact Generation and Distribution&lt;/strong&gt;&lt;br&gt;
Once the contract passes validation, the CI/CD pipeline executes tasks that make the contract immediately useful to consumers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Code Generation&lt;/strong&gt;: Automatically generate domain-specific objects (Pojos, Structs, Classes) in the language of the producer/consumer (e.g., Python, Java, Go). This is known as Schema-First Development. The service code now uses the generated objects, ensuring the code always conforms to the contract.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema Registry Publish&lt;/strong&gt;: The final, approved contract is published to a centralized Schema Registry (like Confluent Schema Registry or an AWS Glue Data Catalog). This registry acts as the single source of truth for all consumers.&lt;/li&gt;
&lt;/ul&gt;
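The code-generation step is essentially a mechanical mapping from contract to typed definition. Here is a toy sketch that emits a TypeScript-style interface as a string; real pipelines use mature tools like quicktype or json-schema-to-typescript rather than anything hand-rolled:

```javascript
// Toy schema-first codegen: emit a TypeScript-style interface from a
// JSON Schema contract. Purely illustrative.
const TYPE_MAP = { string: "string", integer: "number", number: "number", boolean: "boolean" };

function schemaToInterface(name, schema) {
  const required = schema.required || [];
  const lines = [];
  for (const field of Object.keys(schema.properties)) {
    const def = schema.properties[field];
    // Optional fields get a trailing "?" in the generated definition.
    const optional = required.includes(field) ? "" : "?";
    lines.push("  " + field + optional + ": " + (TYPE_MAP[def.type] || "unknown") + ";");
  }
  return "interface " + name + " {\n" + lines.join("\n") + "\n}";
}

const orderSchema = {
  properties: {
    order_id: { type: "string" },
    customer_id: { type: "integer" },
    timestamp: { type: "string" },
  },
  required: ["order_id", "customer_id", "timestamp"],
};
console.log(schemaToInterface("OrderPlacedV1", orderSchema));
```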

&lt;p&gt;&lt;strong&gt;Stage 4: Consumer Service Integration&lt;/strong&gt;&lt;br&gt;
When a consumer service deploys, its CI/CD pipeline does two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dependency Check&lt;/strong&gt;: It pulls the latest approved version of the contract from the Schema Registry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runtime Embedding&lt;/strong&gt;: It embeds the contract directly into its production code. At runtime, the consumer can use this contract to perform fast, local validation checks on incoming data, providing immediate and informative error feedback instead of silent failures.&lt;/li&gt;
&lt;/ul&gt;
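That local validation step can be sketched as follows. This is a toy validator checking only presence and primitive type of fields; a real consumer would run a library like Ajv against the registry copy of the contract:

```javascript
// Toy contract validation inside a consumer: checks required fields and
// primitive types only. Illustrative, not a replacement for Ajv.
function validatePayload(payload, schema) {
  const errors = [];
  for (const field of schema.required || []) {
    if (!(field in payload)) errors.push("missing required field: " + field);
  }
  for (const name of Object.keys(schema.properties || {})) {
    if (name in payload) {
      const wanted = schema.properties[name].type;
      // JSON Schema "integer" maps to JavaScript's "number".
      const jsType = wanted === "integer" ? "number" : wanted;
      if (typeof payload[name] !== jsType) {
        errors.push(name + ": expected " + wanted);
      }
    }
  }
  return errors;
}

const contract = {
  properties: { order_id: { type: "string" }, customer_id: { type: "integer" } },
  required: ["order_id", "customer_id"],
};
console.log(validatePayload({ order_id: "abc" }, contract));
// [ "missing required field: customer_id" ]
```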

&lt;p&gt;&lt;strong&gt;Tools of the Trade&lt;/strong&gt;&lt;br&gt;
You can leverage established tools to do all these jobs instead of building from scratch.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4yoql5f53hrpi340qqhb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4yoql5f53hrpi340qqhb.png" alt="Different tools based on use case" width="800" height="196"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion: Building Robust Data Architectures&lt;/strong&gt;&lt;br&gt;
The "Shift-Left" imperative in data is about recognizing that data quality is not a downstream concern; it is an architectural concern.&lt;/p&gt;

&lt;p&gt;By implementing Data Contracts and automating their validation within the CI/CD pipeline, we fundamentally change the team's development mindset. We move from a reactive model (fixing broken data) to a proactive, contract-driven model. This dramatically reduces integration risk, accelerates feature development, and allows the data architecture to scale and evolve gracefully.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>cicd</category>
      <category>devops</category>
      <category>shiftleft</category>
    </item>
    <item>
      <title>Eyes on change: Building a Custom Watcher With Async Notifications</title>
      <dc:creator>Nabin Debnath</dc:creator>
      <pubDate>Thu, 02 Oct 2025 12:16:25 +0000</pubDate>
      <link>https://forem.com/nabindebnath/eyes-on-change-building-a-custom-watcher-with-async-notifications-3nph</link>
      <guid>https://forem.com/nabindebnath/eyes-on-change-building-a-custom-watcher-with-async-notifications-3nph</guid>
      <description>&lt;p&gt;Watching data for changes is a core task in modern applications. In a collaborative application, the same data gets modified by different stakeholders or users, so it is important to notify the record's owner, or any user who cares about that change. A &lt;em&gt;watcher&lt;/em&gt; is the feature that lets a user keep an eye on changes programmatically. In this article, we'll peel back that seemingly complex functionality and build a simple, custom watcher from the ground up.&lt;/p&gt;</description>

&lt;h3&gt;
  
  
  Architecture Decisions
&lt;/h3&gt;

&lt;p&gt;We will keep the implementation simple but extensible:&lt;br&gt;
&lt;strong&gt;UI:&lt;/strong&gt; React → lightweight, quick to build forms/buttons.&lt;br&gt;
&lt;strong&gt;Backend:&lt;/strong&gt; Express → simple routing for watch + update endpoints.&lt;br&gt;
&lt;strong&gt;Database:&lt;/strong&gt; SQLite → file-based, no setup, perfect for local dev.&lt;br&gt;
&lt;strong&gt;Notifications:&lt;/strong&gt; Console logs first → keep it easy to demo.&lt;br&gt;
&lt;strong&gt;Async Layer:&lt;/strong&gt; BullMQ + Redis → realistic queue-based processing without too much setup.&lt;/p&gt;

&lt;p&gt;This stack lets us run locally in minutes but can be upgraded later to Postgres, RabbitMQ, or Kafka if needed.&lt;/p&gt;
&lt;h3&gt;
  
  
  Architecture Diagram
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4lhynzgyu7ennb6gqlf0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4lhynzgyu7ennb6gqlf0.png" alt="High level architecture diagram" width="800" height="313"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;React UI → User clicks Watch.&lt;/li&gt;
&lt;li&gt;Express API → Handles request and updates DB.&lt;/li&gt;
&lt;li&gt;SQLite DB → Stores records and watcher subscriptions.&lt;/li&gt;
&lt;li&gt;BullMQ Queue → Stores notification jobs asynchronously.&lt;/li&gt;
&lt;li&gt;Worker → Pulls jobs and executes notifications.&lt;/li&gt;
&lt;li&gt;Notification Channel → Console logs, email, Slack, etc.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Database setup
&lt;/h3&gt;

&lt;p&gt;We will create two simple tables and insert a demo record.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE records (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  name TEXT,
  value TEXT
);

CREATE TABLE watchers (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  user TEXT,
  record_id INTEGER,
  FOREIGN KEY(record_id) REFERENCES records(id)
);

INSERT INTO records (name, value)
VALUES ('Demo Record', 'Initial Value');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
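
&lt;p&gt;The server code in the next section calls two database helpers, &lt;em&gt;db.updateRecord&lt;/em&gt; and &lt;em&gt;db.getWatchers&lt;/em&gt;, that are not shown there. Here is a minimal sketch of what they might look like (a hypothetical &lt;strong&gt;db.js&lt;/strong&gt;; the sqlite3 connection handle is injected explicitly so the functions can be tested with a stub, while the demo binds them to a shared connection):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Hypothetical db.js helpers assumed by server.js.
// The sqlite3-style handle is a parameter, so a stub works for tests.
function updateRecord(db, recordId, newValue, cb) {
  db.run("UPDATE records SET value = ? WHERE id = ?", [newValue, recordId], cb);
}

function getWatchers(db, recordId, cb) {
  db.all("SELECT user FROM watchers WHERE record_id = ?", [recordId], cb);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
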



&lt;h3&gt;
  
  
  Backend implementation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;server.js&lt;/strong&gt; - The Express server runs on port 4000. It adds a row to the &lt;em&gt;watchers&lt;/em&gt; table when the watch button is clicked. The next time the &lt;em&gt;records&lt;/em&gt; table is updated, it checks for any watchers and puts a notification job in the queue.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;app.post("/api/update/:id", (req, res) =&amp;gt; {
  const recordId = req.params.id;
  const newValue = req.body.value;

  db.updateRecord(recordId, newValue, (err) =&amp;gt; {
    if (err) return res.status(500).json({ error: err.message });

    db.getWatchers(recordId, (err, watchers) =&amp;gt; {
      if (err) return res.status(500).json({ error: err.message });

      watchers.forEach((w) =&amp;gt; {
        enqueueNotification(w.user, recordId, newValue);
      });

      res.json({ message: "Record updated and notifications queued." });
    });
  });
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
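
&lt;p&gt;The fan-out step inside the route above is easy to unit-test if we extract it into a small pure helper. This is a refactoring sketch, not part of the demo code; the name &lt;em&gt;buildNotificationJobs&lt;/em&gt; is hypothetical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Build one job descriptor per watcher for a record change.
// Pure function: testable without Express, SQLite, or Redis.
function buildNotificationJobs(watchers, recordId, newValue) {
  return watchers.map(function (w) {
    return { name: "notify", data: { user: w.user, recordId: recordId, newValue: newValue } };
  });
}

// The route body would then become:
// buildNotificationJobs(watchers, recordId, newValue)
//   .forEach(function (job) { notificationQueue.add(job.name, job.data); });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
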



&lt;p&gt;&lt;strong&gt;queue.js&lt;/strong&gt; - This adds the notification to the BullMQ queue for processing by the worker.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const { Queue } = require("bullmq");

const notificationQueue = new Queue("notifications", {
  connection: { host: "127.0.0.1", port: 6379 },
});

function enqueueNotification(user, recordId, newValue) {
  notificationQueue.add("notify", { user, recordId, newValue });
  console.log(`Job queued for ${user} on record ${recordId}`);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
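
&lt;p&gt;As written, a job has no retry policy, so a transient Redis or channel failure drops the notification. BullMQ accepts per-job options as a third argument to &lt;em&gt;add()&lt;/em&gt;; here is a sketch with illustrative values, not tuned defaults:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Per-job options passed as the third argument to notificationQueue.add().
// attempts and backoff are standard BullMQ job options; values are illustrative.
const jobOptions = {
  attempts: 3,                                   // retry a failed notification up to 3 times
  backoff: { type: "exponential", delay: 1000 }, // wait 1s, then 2s, then 4s between attempts
  removeOnComplete: true                         // drop finished jobs so Redis stays tidy
};

// notificationQueue.add("notify", { user, recordId, newValue }, jobOptions);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
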



&lt;h3&gt;
  
  
  Frontend implementation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;App.js&lt;/strong&gt; - Simple app for the watcher demo&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import React, { useEffect, useState } from "react";
import { getRecord, watchRecord, updateRecord } from "./api";

function App() {
  const [record, setRecord] = useState(null);
  const [newValue, setNewValue] = useState("");

  useEffect(() =&amp;gt; {
    async function load() {
      const data = await getRecord(1);
      setRecord(data);
    }
    load();
  }, []);

  const handleWatch = async () =&amp;gt; {
    const res = await watchRecord(1);
    alert(res.message);
  };

  const handleUpdate = async () =&amp;gt; {
    if (!newValue) return alert("Enter a new value first!");
    const res = await updateRecord(1, newValue);
    alert(res.message);

    const data = await getRecord(1);
    setRecord(data);
    setNewValue("");
  };

  if (!record) return &amp;lt;div&amp;gt;Loading...&amp;lt;/div&amp;gt;;

  return (
    &amp;lt;div style={{ padding: "20px" }}&amp;gt;
      &amp;lt;h2&amp;gt;Watcher Demo&amp;lt;/h2&amp;gt;
      &amp;lt;p&amp;gt;&amp;lt;strong&amp;gt;Record:&amp;lt;/strong&amp;gt; {record.name}&amp;lt;/p&amp;gt;
      &amp;lt;p&amp;gt;&amp;lt;strong&amp;gt;Value:&amp;lt;/strong&amp;gt; {record.value}&amp;lt;/p&amp;gt;

      &amp;lt;div style={{ marginTop: "20px" }}&amp;gt;
        &amp;lt;button onClick={handleWatch} style={{ marginRight: "10px" }}&amp;gt;Watch&amp;lt;/button&amp;gt;
        &amp;lt;input value={newValue} onChange={e =&amp;gt; setNewValue(e.target.value)} placeholder="New Value"/&amp;gt;
        &amp;lt;button onClick={handleUpdate} style={{ marginLeft: "10px" }}&amp;gt;Update&amp;lt;/button&amp;gt;
      &amp;lt;/div&amp;gt;
    &amp;lt;/div&amp;gt;
  );
}

export default App;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;api.js&lt;/strong&gt; - Thin fetch wrappers the React app uses to call the backend.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const API_BASE = "http://localhost:4000/api";

export async function getRecord(id) {
  const res = await fetch(`${API_BASE}/record/${id}`);
  return res.json();
}

export async function watchRecord(id, user = "demo-user") {
  const res = await fetch(`${API_BASE}/watch/${id}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ user }),
  });
  return res.json();
}

export async function updateRecord(id, newValue) {
  const res = await fetch(`${API_BASE}/update/${id}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ value: newValue }),
  });
  return res.json();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
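
&lt;p&gt;Note that api.js calls GET /api/record/:id and POST /api/watch/:id, two routes the server.js excerpt above doesn't show. A hedged sketch of what those handlers could look like (written as factories that take the database handle, so they can be exercised without a live server; the response message is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Hypothetical handlers matching what api.js expects.
// app.get("/api/record/:id", makeGetRecordHandler(db));
function makeGetRecordHandler(db) {
  return function (req, res) {
    db.get("SELECT * FROM records WHERE id = ?", [req.params.id], function (err, row) {
      if (err) return res.status(500).json({ error: err.message });
      res.json(row);
    });
  };
}

// app.post("/api/watch/:id", makeWatchHandler(db));
function makeWatchHandler(db) {
  return function (req, res) {
    db.run("INSERT INTO watchers (user, record_id) VALUES (?, ?)",
      [req.body.user, req.params.id],
      function (err) {
        if (err) return res.status(500).json({ error: err.message });
        res.json({ message: "You are now watching this record." });
      });
  };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
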



&lt;h3&gt;
  
  
  Worker implementation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;worker.js&lt;/strong&gt; - This pulls notification jobs from the queue and sends them to the right channel (console, email, Slack, etc.).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const { Worker } = require("bullmq");

const worker = new Worker(
  "notifications",
  async (job) =&amp;gt; {
    const { user, recordId, newValue } = job.data;
    console.log(`Notifying ${user}: Record ${recordId} changed to "${newValue}"`);
  },
  {
    connection: { host: "127.0.0.1", port: 6379 },
  }
);

worker.on("completed", (job) =&amp;gt; console.log(`Job ${job.id} completed`));
worker.on("failed", (job, err) =&amp;gt; console.error(`Job ${job.id} failed: ${err.message}`));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
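
&lt;p&gt;The worker above logs to the console, but the design calls for email and Slack as well. One way to keep the worker channel-agnostic is a small dispatch table; the registry shape and names here are hypothetical, not part of the demo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Hypothetical channel registry: maps a channel name to a sender function.
const channels = {
  console: function (user, message) {
    console.log("[" + user + "] " + message);
  }
  // email and slack senders would plug in here with the same signature
};

// Pick the requested channel, falling back to console; returns the channel used.
function dispatch(channelName, user, message) {
  const name = channels[channelName] ? channelName : "console";
  channels[name](user, message);
  return name;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
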



&lt;h3&gt;
  
  
  Build, deploy, and run the services
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Start Redis:&lt;/strong&gt; docker run -d -p 6379:6379 redis&lt;br&gt;
&lt;strong&gt;Start backend:&lt;/strong&gt; npm install &amp;amp;&amp;amp; npm start&lt;br&gt;
&lt;strong&gt;Start worker:&lt;/strong&gt; npm install &amp;amp;&amp;amp; npm start&lt;br&gt;
&lt;strong&gt;Start frontend:&lt;/strong&gt; npm install &amp;amp;&amp;amp; npm start&lt;/p&gt;

&lt;h3&gt;
  
  
  Test the application
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Click Watch to subscribe to the demo record.&lt;/li&gt;
&lt;li&gt;Update the value and confirm the notification appears in the worker console.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Frontend&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8p6it1u9riwh3rjfqa3n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8p6it1u9riwh3rjfqa3n.png" alt="User clicks watch button to watch the record change" width="800" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79j8f25z3xpz6y0qt76x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79j8f25z3xpz6y0qt76x.png" alt="Record is updated with new value as 1000" width="800" height="253"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Backend job queued&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgegf3an8uqu6jpd6o31o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgegf3an8uqu6jpd6o31o.png" alt="Record update notification is pushed into queue" width="638" height="168"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worker processed jobs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31ctuucwsf7jicfwre3f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31ctuucwsf7jicfwre3f.png" alt="Worker pulled the events from queue and processed" width="716" height="194"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this article we built a custom watcher system that lets users “watch” a record and get notified when it changes - all in a way that scales. From a user's standpoint, this is one of the most essential features for keeping an eye on a record. We not only built a working demo but also learned a design pattern used in real-world systems like GitHub, Jira, and Confluence.&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>watcher</category>
      <category>javascript</category>
      <category>softwaredevelopment</category>
    </item>
  </channel>
</rss>
