Forem

Synchronization in Node.js: Why Single-Threaded Does Not Mean Safe From Concurrency Problems

CodeWithIshwar — Mon, 18 May 2026 16:41:53 +0000

One of the most common misconceptions about Node.js is:

“Node.js is single-threaded, so synchronization problems cannot happen.”

This is only partially true.

Yes, JavaScript execution in Node.js runs on a single thread using the event loop. But modern backend applications deal with asynchronous operations, external services, databases, queues, and distributed systems — all of which introduce concurrency challenges.

Understanding synchronization in Node.js is essential if you want to build scalable and reliable backend systems.

What Is Synchronization?

Synchronization is the process of controlling access to shared resources when multiple operations happen at the same time.

The goal is to prevent:

Race conditions
Data inconsistency
Duplicate processing
Lost updates
Unexpected behavior

Why Synchronization Still Matters in Node.js

Node.js applications commonly handle:

Multiple API requests simultaneously
Concurrent database updates
Shared cache access
Async file operations
Background job processing

Even though JavaScript itself runs on one thread, async operations interleave: whenever one operation awaits I/O, the event loop is free to run another.

This creates situations where multiple operations read and modify the same resource before any of them has finished.

Example: Race Condition

Imagine a wallet service:

Initial balance = ₹1000

Two requests arrive:

Request A deducts ₹300
Request B deducts ₹500

If both requests:

Read the same balance
Update independently
Save the result

…the final balance ends up as ₹700 or ₹500 instead of the correct ₹200, because whichever write lands last overwrites the other.

This is called a race condition.
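The wallet scenario can be reproduced in a few lines. This is a minimal sketch, assuming an in-memory object stands in for the database and a short timer stands in for I/O latency; createWallet, deduct, and runRace are hypothetical names used only for illustration.

```javascript
// Stand-in for I/O latency (assumption: real code would await a database call).
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical in-memory "database" holding one wallet balance.
function createWallet(initial) {
  const db = { balance: initial };
  return {
    async read() { await delay(10); return db.balance; },
    async write(value) { await delay(10); db.balance = value; },
  };
}

// Unsafe read-modify-write: other requests can run between the read and the write.
async function deduct(wallet, amount) {
  const current = await wallet.read();
  await wallet.write(current - amount);
}

async function runRace() {
  const wallet = createWallet(1000);
  // Both requests read 1000 before either one writes.
  await Promise.all([deduct(wallet, 300), deduct(wallet, 500)]);
  return wallet.read();
}

// Prints 500 or 700, never the correct 200: one deduction is lost.
runRace().then((final) => console.log('final balance:', final));
```

No second thread is involved here. The overlap comes entirely from the two awaits inside deduct.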

Common Synchronization Techniques in Node.js

1. Database Transactions

Transactions ensure operations execute safely as a single unit.

Useful for:

Payment systems
Banking workflows
Order processing

2. Atomic Operations

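Databases provide atomic update and locking mechanisms, so the read-modify-write sequence does not have to run unprotected in application code.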

Examples:

MongoDB $inc
PostgreSQL row locking
Optimistic locking

These reduce concurrency conflicts.
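Optimistic locking from the list above can be sketched as follows. This is an illustration only, assuming an in-memory record with a version field stands in for a database row; createVersionedWallet, compareAndSet, and deductOptimistic are hypothetical names, not a real driver API.

```javascript
// Stand-in for I/O latency.
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical versioned record: every successful write bumps the version.
function createVersionedWallet(initial) {
  const row = { balance: initial, version: 0 };
  return {
    async read() { await delay(5); return { ...row }; },
    // Writes only if nobody else has written since `expectedVersion` was read.
    async compareAndSet(expectedVersion, newBalance) {
      await delay(5);
      if (row.version !== expectedVersion) return false;
      row.balance = newBalance;
      row.version += 1;
      return true;
    },
  };
}

// Retry the whole read-modify-write cycle when the version check fails.
async function deductOptimistic(wallet, amount, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt += 1) {
    const snapshot = await wallet.read();
    const ok = await wallet.compareAndSet(snapshot.version, snapshot.balance - amount);
    if (ok) return true;
  }
  return false;
}

async function runOptimistic() {
  const wallet = createVersionedWallet(1000);
  await Promise.all([deductOptimistic(wallet, 300), deductOptimistic(wallet, 500)]);
  return (await wallet.read()).balance;
}

runOptimistic().then((final) => console.log('final balance:', final)); // 200
```

In a real database the version check and the write would be a single conditional update statement. The retry loop stays in the application.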

3. Redis Distributed Locks

In distributed systems, Redis locks help ensure only one worker processes a task at a time.

Commonly used in:

Payment handling
Cron jobs
Distributed workers

4. Mutexes

Mutexes restrict access to critical sections of code.

Only one async operation can enter at a time.
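A minimal in-process mutex can be built from a promise chain. This is a sketch under stated assumptions: the Mutex class, runExclusive, and the wallet helpers are hypothetical names, and the timer again stands in for I/O.

```javascript
// Stand-in for I/O latency.
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Promise-chain mutex: each task starts only after the previous one settles.
class Mutex {
  constructor() {
    this.tail = Promise.resolve();
  }

  runExclusive(task) {
    const result = this.tail.then(() => task());
    // Keep the chain usable even if a task rejects.
    this.tail = result.catch(() => {});
    return result;
  }
}

// Same unsafe read-modify-write as in the race condition example.
async function deductUnsafe(wallet, amount) {
  const current = wallet.balance;
  await delay(10);
  wallet.balance = current - amount;
}

async function runWithMutex() {
  const wallet = { balance: 1000 };
  const lock = new Mutex();
  // The critical section is wrapped, so the second deduction waits for the first.
  await Promise.all([
    lock.runExclusive(() => deductUnsafe(wallet, 300)),
    lock.runExclusive(() => deductUnsafe(wallet, 500)),
  ]);
  return wallet.balance;
}

runWithMutex().then((final) => console.log('final balance:', final)); // 200
```

A mutex like this only protects a single Node.js process. Once several instances share the same data, the guard has to live in the database or in a distributed lock.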

5. Message Queues

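Queues serialize workloads: jobs that touch the same resource are handed to a consumer one after another instead of being executed concurrently by request handlers.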

Popular tools:

BullMQ
RabbitMQ
Kafka

Important Takeaway

Single-threaded does NOT mean concurrency-safe.

As applications scale, synchronization becomes critical for:

High-traffic APIs
Financial systems
Real-time platforms
Distributed architectures

The real complexity in backend engineering often comes from handling concurrency correctly.

Conclusion

Node.js provides excellent performance through asynchronous and non-blocking architecture.

But scalable backend systems still require proper synchronization strategies to maintain correctness and reliability.

Understanding concurrency is what transforms developers into backend engineers capable of designing production-grade systems.

#NodeJS #JavaScript #BackendDevelopment #SystemDesign #Concurrency #SoftwareEngineering #Programming #DistributedSystems #WebDevelopment #Tech #codewithishwar

AI Coding Tools Need Better Boundaries, Not Better Prompts

ClickIT - DevOps and Software Development — Mon, 18 May 2026 16:41:48 +0000

One thing becoming increasingly obvious with AI-assisted development:

LLMs are great at generating code.
They’re not great at making architectural decisions.

A lot of teams are discovering the same pattern:

rapid prototyping feels amazing,
shipping gets faster,
but long-term maintainability starts degrading quietly in the background.

The problem usually isn’t the generated code itself.

It’s the lack of:

clear contracts,
deterministic workflows,
validation layers,
and shared engineering conventions before generation even starts.

Without those boundaries, AI tends to optimize for local correctness instead of system consistency.

That’s why workflows like Spec-Driven Development (SDD) are becoming more relevant as teams integrate AI deeper into production environments.

Instead of relying on increasingly complex prompts, SDD focuses on:

defining contracts first,
validating specs before implementation,
constraining generation scope,
and treating LLMs more like implementation engines than autonomous architects.

In practice, this tends to produce:

more predictable outputs,
cleaner collaboration between engineers,
and codebases that are actually maintainable months later.

We’ve been exploring this topic internally and recently put together a breakdown of how Spec-Driven Development can help create more reliable AI-assisted workflows in real-world engineering environments.

If the topic sounds interesting, here’s the discussion:

Stop "Vibe Coding" and Start Spec-Driven Development | Part 1

Curious how other teams here are approaching this shift:

Are you introducing stricter boundaries around AI-generated code?
Have specs become more important in your workflow?
Or are you still experimenting with prompting strategies first?

Feels like the industry is slowly moving from: “AI can generate code”

to: “How do we engineer systems around probabilistic generators?”

And that’s a much more interesting problem...

11 Agentic Testing Tools to Know in 2026

Alvin Lee — Mon, 18 May 2026 16:40:46 +0000

Agentic testing tools help teams plan, generate, adapt, and run tests with far less manual effort. They’re quickly becoming part of how modern QA scales without slowing delivery.

One thing to get right from the start is scope. Agentic testing tools vary significantly in what they do, where they fit, and how much strategic impact they have. Some are point solutions that help you author or run tests faster. Others sit inside broader AI-driven quality platforms that prioritize risk, optimize test portfolios, and enforce quality gates across the pipeline.

This post covers 11 agentic testing tools to know about in 2026. They’re grouped so you can compare them based on scope, strengths, and fit for your organization.

What is an agentic testing tool?

An agentic testing tool is software that uses AI agents to autonomously plan, generate, maintain, and execute tests. It often makes decisions based on context, such as requirements, code changes, risk signals, or past results.

It goes beyond AI-assisted automation by adding initiative and workflow-level decision-making. Instead of only suggesting what to do next, it takes action within defined boundaries.

Here are 11 agentic testing tools grouped by scope. Each includes a summary and key strengths and considerations. Let’s go!

Enterprise AI-driven quality platforms

These platforms extend beyond test creation to orchestrate automation, intelligence, and governance at scale. They are suited for organizations that require stability, risk prioritization, and release confidence across complex environments.

1. Tricentis Tosca

Tricentis Tosca is designed for enterprise test automation where stability, scale, and governance matter. In an agentic context, the shift is moving from “write and maintain scripts” to “orchestrate outcomes,” especially across complex apps and high-change environments.

Tricentis enables AI-driven testing and agentic quality engineering across your delivery pipeline. It also positions the Model Context Protocol (MCP) as a way to bridge AI and testing tools through a universal integration approach, which matters if you’re thinking about agentic workflows that span multiple systems.

Strengths

Suitable for large regression suites and complex end-to-end workflows.
AI-assisted resilience helps reduce long-term maintenance costs.

Considerations

The highest value shows up when teams commit to governance and standardization (not “ad hoc scripts”).
Adoption typically requires alignment across QA, engineering, and release stakeholders.

2. SmartBear

SmartBear is best viewed as a broad testing portfolio vendor that has been positioning around AI across testing workflows.

Strengths

Covers multiple testing disciplines.
Suitable for consolidated vendor strategies.

Considerations

AI depth varies across products.
Portfolio integration matters.

3. UiPath Test Suite

UiPath Test Suite extends testing into broader automation ecosystems. In an agentic context, it is relevant for teams that want testing integrated into AI-driven business process automation and orchestration environments.

Strengths

Aligns testing with broader automation initiatives.
Fits organizations standardizing around enterprise automation platforms.

Considerations

Strongest value when already invested in the UiPath ecosystem.
Organizations must evaluate how deeply autonomous testing workflows integrate with CI/CD.

AI-native testing platforms

AI-native testing platforms are built with AI at the core of test creation and execution workflows. They aim to reduce friction from requirements to automation and help teams maintain speed and stability as systems evolve.

4. ACCELQ

ACCELQ positions itself around AI-powered automation and end-to-end testing acceleration. For agentic buyers, the key question is whether the platform reduces friction from requirements to automation to execution and whether it can keep pace as systems change.

Strengths

Faster ramp-up for automation.
Structured automation workflows.

Considerations

Like any platform, success depends on fit with your stack and operating model.
Ensure governance and explainability are strong enough for enterprise release standards.

5. mabl

mabl is an AI-native testing vendor geared toward continuous testing and reducing maintenance overhead. For agentic tool evaluation, focus on whether AI helps you run reliably at speed, not just generate tests during setup.

Strengths

CI/CD integration.
Automation resilience focus.

Considerations

Primarily web-centric workflows.
Enterprise governance depth varies.

6. Functionize

Functionize is commonly positioned as AI-forward test automation focused on reducing manual work across authoring, execution, and maintenance. In a practical agentic sense, tools like this aim to do more of the work for you, especially around test upkeep as systems evolve.

Strengths

Lifecycle focus: value isn’t only authoring, but also keeping tests healthy over time.
AI-forward orientation fits teams pushing toward higher autonomy.

Considerations

Scope depends on team maturity.
Organizations may need to evaluate governance needs more deeply.

Point-solution agentic tools

Point-solution agentic tools focus on solving a specific testing bottleneck rather than managing the full quality lifecycle. They are often used to accelerate test authoring, execution, or UI interaction without requiring a broader platform shift.

7. testRigor

testRigor is typically associated with natural-language-driven test creation and reducing scripting complexity. For agentic buyers, it often lands in the “make authoring easier” category.

Strengths

Lower barrier to authoring.
Rapid initial automation.

Considerations

Primarily focused on UI regression.
Potential trade-off between depth and creation speed.

8. QA Wolf

QA Wolf is often positioned around fast test creation and managed execution models for teams that want results without building everything in-house. In an agentic tooling conversation, this fits as a way to compress time-to-value, especially when internal bandwidth is limited.

Strengths

Fast time to coverage.
Managed execution support.

Considerations

The operational model differs from in-house-only tools.
Evaluate long-term scaling fit.

9. Virtuoso QA

Virtuoso is frequently grouped with AI-led UI testing approaches that aim to reduce manual scripting and increase resilience. Its relevance depends on whether it meaningfully adapts and maintains tests as the app changes, not just how quickly it creates them.

Strengths

Faster UI automation creation.
Reduced scripting complexity.

Considerations

Validate the reality of flake handling and maintenance in your environment (dynamic UIs expose gaps quickly).
Ensure pipeline integration and evidence output meet enterprise needs.

10. AskUI

AskUI approaches automation through UI perception and interaction. That can matter when you test across varied front ends, remote desktops, or environments where DOM-level automation is not always feasible.

Strengths

Useful for UI-driven automation challenges.
Works across heterogeneous UI surfaces.

Considerations

Typically narrower in scope than end-to-end platforms.
Validate stability and evidence outputs for long-running regression usage.

11. CoTester by TestGrid

CoTester lands in the agentic assistant space for testing workflows. Tools in this category typically let you offload specific tasks, helping your team by generating tests, suggesting validations, or scaling coverage with less effort.

Strengths

Assistant-style support for testing tasks.
Accelerates defined QA activities.

Considerations

Not a full end-to-end platform.
Best as a complementary capability.

How agentic technology applies to modern testing

Agentic testing brings the agent loop into quality workflows. It decides what to test, executes the work, evaluates results, and adjusts based on context.

Here’s what that looks like in real delivery pipelines:

Planning: Interpreting requirements, code changes, and risk signals to select the right tests.
Execution: Running tests and collecting evidence.
Adaptation: Repairing brittle selectors and managing flakiness as systems change.
Governance: Enforcing quality gates based on measurable signals such as coverage and change impact.

Agentic testing is not AI that writes tests. It is AI that runs a quality workflow.

How to choose the right agentic testing tool

Buying decisions usually fail for one of two reasons: teams choose a point tool when they actually need a platform, or they buy a platform when they need quick, targeted relief. Use this checklist to avoid both mistakes.

1. Start with scope: assistant, point solution, or platform?

Ask one blunt question: Do you need help authoring tests, or do you need help governing release confidence?

2. Demand measurable outcomes, not demos

Demos can look impressive, but real value shows up in production metrics. Look for clear improvements in regression time, maintenance effort, flake rate, defect escapes, and coverage visibility. If success cannot be measured, ROI will be hard to prove.

3. Validate governance: explainability, auditability, control

Agentic systems take action, so your team must understand why. You should be able to explain test selection, recent changes, and the evidence behind a release decision, especially in regulated and enterprise environments.

If you want agentic testing that scales beyond a single team or application, you need more than a test generator. You need an AI-driven approach that connects automation, intelligence, and governance.

FAQ: Agentic testing tools in 2026

What makes a testing tool truly agentic?

A testing tool is truly agentic if it can independently plan and execute testing actions based on context, such as code changes, requirements, or risk signals. It does not just suggest next steps. It selects tests after a pull request, generates tests from requirements, repairs broken locators, and enforces quality gates with minimal human input.

Are agentic testing tools the same as AI test automation?

No. AI test automation typically assists with parts of automation, such as smarter locators or faster script creation. Agentic testing tools go further by automating decision-making across workflows. They can decide which tests to run for a build, identify untested code changes, and prioritize high-risk areas without manual triage.

What results should I expect from agentic testing?

Most teams see measurable improvements in regression cycle time and maintenance effort when agentic workflows are implemented correctly. A realistic benchmark is reducing regression runtime by 30–70% through change-based test selection and cutting maintenance effort by 30–50% through self-healing automation and flake reduction.

I Replaced a Polling Loop With Three React Hooks and a Firestore Rule

R.N.Krishnan — Mon, 18 May 2026 16:40:45 +0000

The first version of the VORTEX dashboard polled an API endpoint every five seconds. It worked. It also meant the UI could be up to five seconds behind reality, and every agent write to Firestore required a separate read path just to surface it on screen. I replaced the whole thing with three custom hooks and onSnapshot listeners. The dashboard has been real-time since, with no polling, no message queue, and no separate read model.

The Data Model First

Before writing a single hook, I mapped out exactly which Firestore collections existed and who owned them:

Collection	Writers	Readers
`leads`	Agent 1, Agent 7	Dashboard, Agent 4
`activity_feed`	All agents (append only)	Dashboard
`product_intelligence`	Agent 6	Dashboard
`agent_logs`	All agents	Dashboard (Debate Log)

This table is the reason the security rules look the way they do:

// firestore.rules
match /activity_feed/{eventId} {
  allow read:   if true;
  allow create: if true;    // All agents append
  // No update, no delete — the feed is append-only by design
}

match /leads/{leadId} {
  allow read:  if request.auth != null;
  allow write: if request.auth != null;
}

The activity_feed collection is append-only deliberately. No agent ever updates or deletes a feed entry. This means the feed is a reliable audit trail of what happened, in order — you can replay it from any point without worrying about entries being mutated after the fact.

The Three Hooks

The entire dashboard data layer is three hooks: useLeads, useActivityFeed, and useProductIntel. Each one owns one collection and one onSnapshot listener.

// hooks/index.jsx — useLeads
import { useEffect, useState } from 'react';
import { collection, limit, onSnapshot, orderBy, query } from 'firebase/firestore';
// `db` (the Firestore instance) and INITIAL_LEADS (seed data) are defined elsewhere in the project.

export function useLeads() {
  const [leads, setLeads] = useState(INITIAL_LEADS);

  useEffect(() => {
    const q = query(
      collection(db, 'leads'),
      orderBy('intent_score', 'desc'),
      limit(50)
    );
    return onSnapshot(q, (snapshot) => {
      const fbLeads = snapshot.docs.map(doc => ({
        id: doc.id,
        ...doc.data(),
      }));
      if (fbLeads.length > 0) setLeads(fbLeads);
    });
  }, []);

  return leads;
}

Three things worth noting here:

The if (fbLeads.length > 0) guard. Without this, an empty Firestore collection on first load would wipe the seed data. The hook falls back to INITIAL_LEADS — a hardcoded set of mock leads — until real data arrives. This means the dashboard is never blank, even before Firebase is configured.

onSnapshot returns its own unsubscribe function. Returning it directly from useEffect means React calls it on unmount, cleaning up the listener automatically. No manual cleanup needed.

orderBy('intent_score', 'desc') means the Kanban always shows highest-intent leads first within each column, without any client-side sorting logic.

The activity feed hook is similar, but it orders by timestamp and keeps only the 20 most recent events:

// hooks/index.jsx — useActivityFeed
export function useActivityFeed() {
  const [events, setEvents] = useState(INITIAL_EVENTS);

  useEffect(() => {
    const q = query(
      collection(db, 'activity_feed'),
      orderBy('timestamp', 'desc'),
      limit(20)
    );
    return onSnapshot(q, (snapshot) => {
      const fbEvents = snapshot.docs.map(doc => ({
        id: doc.id,
        ...doc.data(),
      }));
      if (fbEvents.length > 0) setEvents(fbEvents);
    });
  }, []);

  return events;
}

Twenty events, most recent first. Every agent write to activity_feed triggers this listener and the feed item appears in the UI within milliseconds.

The Metrics Hook Problem

useMetrics caused the most grief. The original version returned an array:

// The version that broke everything downstream
export function useMetrics(leads) {
  return useMemo(() => [
    { label: "Total Leads", value: leads.length, trend: "+12%" },
    { label: "Hot Leads Today", value: hotLeads, trend: "+5" },
    // ...
  ], [leads]);
}

Six components consumed this hook. Three of them destructured it as an array. Three treated it as a named object — metrics.totalLeads, metrics.hotLeads, metrics.conversionRate. The array-consuming components worked. The object-consuming components silently got undefined for every value and displayed nothing.

The fix was making useMetrics return a proper object:

export function useMetrics(leads) {
  return useMemo(() => {
    const totalLeads = leads.length;
    const hotLeads = leads.filter(l => l.status === 'HOT_LEAD').length;
    const conversionRate = 12.4;
    const highestScoreLead = leads.reduce(
      (max, l) => (!max || l.intent_score > max.intent_score ? l : max),
      null
    );
    return {
      totalLeads,
      hotLeads,
      emailsSent: 312,
      callsPlaced: 89,
      demosBooked: 12,
      conversionRate,
      highestScore: highestScoreLead?.intent_score || 0,
      highestScoreLead,
    };
  }, [leads]);
}

The lesson: if a hook returns structured data that multiple components consume, make it a named object from day one. Arrays are fine for lists. They're not fine for typed data shapes where consumers care about specific fields.
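The silent failure is easy to reproduce outside React. A minimal sketch with made-up sample values (the numbers and variable names are illustrative, not from the dashboard):

```javascript
// Hypothetical sample values, for illustration only.
const metricsAsArray = [
  { label: 'Total Leads', value: 42 },
  { label: 'Hot Leads Today', value: 7 },
];
const metricsAsObject = { totalLeads: 42, hotLeads: 7 };

// Array consumers read by position.
const [total] = metricsAsArray;

// Object consumers read by name. On an array this is not an error, just undefined.
const missing = metricsAsArray.totalLeads;
const present = metricsAsObject.totalLeads;

console.log(total.value, missing, present); // 42 undefined 42
```

Because reading an unknown property never throws, the component renders happily with undefined values.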

The useCountUp Hook

The sidebar metrics animate from zero to their real value on load. That required a useCountUp hook — something the codebase was importing but that didn't exist yet.

export function useCountUp(target, duration = 1000) {
  const [value, setValue] = useState(0);
  const rafRef = useRef(null);

  useEffect(() => {
    const start = performance.now();
    const to = Number(target) || 0;

    const tick = (now) => {
      const elapsed = now - start;
      const progress = Math.min(elapsed / duration, 1);
      const eased = 1 - Math.pow(1 - progress, 3); // ease-out cubic
      setValue(Math.round(to * eased));
      if (progress < 1) rafRef.current = requestAnimationFrame(tick);
    };

    rafRef.current = requestAnimationFrame(tick);
    return () => cancelAnimationFrame(rafRef.current);
  }, [target, duration]);

  return value;
}

Ease-out cubic means the number counts up fast at first and slows as it approaches the target. requestAnimationFrame keeps it tied to the display refresh rate rather than a fixed interval. The rafRef holds the animation frame ID so the cleanup function can cancel it properly on unmount — without this, switching tabs mid-animation would leave a hanging rAF loop.
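To make the easing curve concrete, the same formula can be evaluated as a pure function at a few points in time. This is a sketch: easeOutCubic and countUpValue are hypothetical helpers that mirror the arithmetic inside the hook's tick callback.

```javascript
// Ease-out cubic: maps progress in [0, 1] to an eased value in [0, 1].
function easeOutCubic(progress) {
  return 1 - Math.pow(1 - progress, 3);
}

// Value displayed after `elapsed` ms of a count-up to `target`.
function countUpValue(target, elapsed, duration = 1000) {
  const progress = Math.min(elapsed / duration, 1);
  return Math.round(target * easeOutCubic(progress));
}

// Sample the animation for a target of 100 over one second.
for (const elapsed of [0, 250, 500, 750, 1000]) {
  console.log(elapsed, countUpValue(100, elapsed));
}
// 0 0
// 250 58
// 500 88
// 750 98
// 1000 100
```

More than half of the distance is covered in the first quarter of the duration, which is what gives the counter its fast start and soft landing.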

The StrictMode Bug

React's StrictMode runs effects twice in development — mount, unmount, remount. This exposed a bug in the DebateTerminal component that replays the Hindsight agent log line by line:

// The broken version
useEffect(() => {
  let idx = 0;
  const show = () => {
    if (idx >= allLines.length) return;
    setDisplayed(prev => [...prev, allLines[idx]]);
    idx++;
    setTimeout(show, 60); // recursive — never cleaned up
  };
  const t = setTimeout(show, 300);
  return () => clearTimeout(t); // only cancels the first timeout
}, []);

The cleanup only cancelled the initial 300ms delay. Once show started calling itself recursively, those timeouts had no handle. In StrictMode, the simulated unmount left the first chain running, then the remount started a second chain. Two parallel loops, both writing to displayed state with independent idx counters, producing duplicate and out-of-order lines.

The fix was storing the currently pending timeout ID in a ref and updating it on every step:

const timeoutRef = useRef(null);

useEffect(() => {
  let idx = 0;
  const show = () => {
    if (idx >= allLines.length) { setPlaying(false); return; }
    const line = allLines[idx]; // capture before idx++: the state updater below may run later
    setDisplayed(prev => [...prev, line]);
    idx++;
    timeoutRef.current = setTimeout(show, line.isHeader ? 400 : 60);
  };
  timeoutRef.current = setTimeout(show, 300);
  return () => clearTimeout(timeoutRef.current); // cancels the whole chain
}, []);

Now the cleanup always cancels whichever timeout is currently pending. The chain breaks cleanly on unmount.
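The same pattern works outside React, which makes it easy to test in isolation. A minimal sketch, assuming a plain variable plays the role of the ref; startChain and its parameters are hypothetical names.

```javascript
// Starts a self-rescheduling timeout chain and returns a cancel function.
// `pending` plays the role of timeoutRef.current: it always holds the latest handle.
function startChain(lines, onLine, stepMs = 5) {
  let pending = null;
  let idx = 0;

  const show = () => {
    if (idx >= lines.length) return;
    onLine(lines[idx]);
    idx += 1;
    pending = setTimeout(show, stepMs); // overwrite with the newest handle
  };

  pending = setTimeout(show, stepMs);
  return () => clearTimeout(pending); // cancels whichever step is pending
}

// Simulate mount, unmount, remount: only the second chain should emit lines.
const output = [];
const cancelFirst = startChain(['a', 'b', 'c'], (line) => output.push(line));
cancelFirst();
startChain(['a', 'b', 'c'], (line) => output.push(line));
setTimeout(() => console.log(output), 100); // [ 'a', 'b', 'c' ]
```

Because each step is only ever scheduled by the previous one, clearing the single pending handle is enough to stop everything that would have followed.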

Seed Data as the Demo Mode

The hooks layer has a deliberate fallback: if Firestore returns empty or throws, the UI renders from INITIAL_LEADS and INITIAL_EVENTS. This means the entire dashboard works without a Firebase project configured — useful for demos, useful for development, useful when the backend is down.

const [leads, setLeads] = useState(INITIAL_LEADS); // fallback always set first

return onSnapshot(q, (snapshot) => {
  const fbLeads = snapshot.docs.map(doc => ({ id: doc.id, ...doc.data() }));
  if (fbLeads.length > 0) setLeads(fbLeads); // only overwrite if real data exists
});

The VITE_USE_DEMO_DATA environment flag extends this further — when set, the Firebase initialization is skipped entirely and the hooks return seed data without attempting any Firestore connection.

Takeaways

onSnapshot is simpler than it looks. It returns its own cleanup function, it handles reconnection automatically, and it pushes updates to all listeners simultaneously. For a dashboard that needs to reflect agent writes in real time, it's the right tool and it requires less infrastructure than a polling setup.

Return named objects from data hooks, not arrays. The useMetrics bug would have been caught immediately with TypeScript. Without it, the silent undefined failures are hard to trace because the component renders without errors — it just shows nothing.

StrictMode is a useful stress test. The DebateTerminal bug only appeared in development because of StrictMode's double-invoke behavior. That's the point — it surfaces cleanup bugs before they reach production.

Seed data is infrastructure. Having realistic fallback data in the hooks layer means the dashboard is always demonstrable, always developable, and always recoverable. It's not a hack — it's a design decision.

Closing

The dashboard started as a polling loop hitting a REST endpoint. It's now three hooks, each owning one Firestore collection, each cleaning up after itself on unmount. The real-time behavior came for free once the data model was right. The hard part wasn't the Firestore integration — it was making the hooks clean enough that six different components could consume them without knowing anything about the underlying data source.

5 Free Image Compression Tools Compared: Privacy, Speed, and Quality (2026)

yangjiaqiang12 — Mon, 18 May 2026 16:37:02 +0000

The Test

I tested 5 popular free image compression tools on the same 2MB photo to compare privacy, speed, and output quality. Here are the results.

Results

Tool	Privacy	Batch	WebP	Output Size	Time
Squash	Local	Yes	?	384KB	1.2s
Squoosh	Local	No	?	367KB	1.8s
TinyPNG	Upload	Yes	?	402KB	4.3s
Compressor.io	Upload	Yes	?	411KB	6.1s
Optimizilla	Upload	Yes	?	426KB	5.8s

Key Findings

Privacy-first tools are faster. Squash and Squoosh process images locally using the Canvas API. With no network round-trip, compression finished roughly 2-5x faster in this test (1.2-1.8s versus 4.3-6.1s).

Batch mode matters. Squoosh produces the smallest files but processes images one at a time. If you have 20 product photos, that is 20 manual clicks. Squash combines batch processing with local privacy -- a combination no other free tool offers.

WebP is the format to beat. Tools supporting WebP output achieved 20-30% smaller files than JPEG-only tools at equivalent quality. WebP browser support is now at 97% globally.

Upload-based tools are slower. TinyPNG and Compressor.io add 3-6 seconds of network latency per image. For batch work, this adds up quickly.

The Privacy Factor

Uploading images to a third-party server is not just a privacy concern -- it is a compliance issue. If you handle client work, medical images, financial documents, or unreleased products, server-based tools are a liability.

Tools that process images locally in the browser avoid this entirely. The image never leaves your device. There is no server to hack, no database to leak, no privacy policy to trust.

Bottom Line

Best overall: Squash -- free, private, batch mode, multi-format
Best quality: Squoosh -- slightly better compression but no batch mode
Best if you do not care about privacy: TinyPNG -- established, reliable, but uploads your files

Try Squash: yangjiaqiang12.github.io/squash-image-compressor

Source: github.com/yangjiaqiang12/squash-image-compressor

Support: ko-fi.com/squashtools

Custom behavior without custom code

Ian Johnson — Mon, 18 May 2026 16:36:06 +0000

Every successful SaaS product eventually meets the same question: a customer asks for something specific to them, you build it, and now you have a feature in your codebase that's only meant to run for one tenant. A year later, you have a dozen of these. The codebase has if-statements checking tenant IDs, the test suite mocks out customer-specific paths, and the senior engineer who knows which branch belongs to which customer is the only person who can refactor anything.

There's a better shape, and it doesn't require giving up the per-customer customization. It does require separating, cleanly and firmly, the code that defines what behaviors are possible from the data that selects and parameterizes them. This article is about how to do that, where to store the data, and the security cliff you'll fall off if you let the data become code.

What not to do

A handful of approaches show up over and over, and each has a fatal flaw:

Separate deployed instances per customer. This solves customization by forking the operational surface. Now you have N versions of the database, N sets of background jobs, N deploy pipelines, N versions of every bug fix to roll out. It works for two or three customers and collapses by ten.
Conditional code in the backend — if tenant_id == "acme": .... Cheap on day one, untenable by month six. Every developer has to know the customer landscape to make changes safely. Every refactor is risky in proportion to how many tenants have branches. Customer-specific logic spreads across the codebase by capillary action.
Code injected at build time. A configuration that produces a different binary per tenant. Has the same operational cost as separate instances, plus the added joy of debugging behavior that depends on what compile-time flag was set. Don't.

The pattern that scales is to keep one codebase, one running cluster, one deploy pipeline — and to let per-tenant behavior live in data that the code consults. Basically, I am describing multi-tenancy.

Code defines the possibilities; data selects among them

Identify the points in your system where behavior can vary per tenant. These are extension points: the discount engine, the approval workflow, the export format, the notification rules. At each one, your code defines a small set of behaviors it knows how to perform. Per-tenant data picks which behaviors to use and supplies the parameters.

Concretely: a class hierarchy. A common shape is a CustomRule base class with a contract (say, applies(context) and apply(context)) and a set of concrete implementations:

class CustomRule:
    def applies(self, context) -> bool: ...
    def apply(self, context) -> None: ...

class PercentageDiscountRule(CustomRule):
    def __init__(self, percent, min_order):
        self.percent = percent
        self.min_order = min_order

    def applies(self, context):
        return context.order_total >= self.min_order

    def apply(self, context):
        context.discount += context.order_total * (self.percent / 100)

class FirstPurchaseDiscountRule(CustomRule):
    def __init__(self, amount):
        self.amount = amount

    def applies(self, context):
        return context.customer.order_count == 0

    def apply(self, context):
        context.discount += self.amount

A tenant's configuration is then a small declarative description — which rules they have, with what parameters:

{
  "discount_rules": [
    {"type": "percentage", "percent": 10, "min_order": 100},
    {"type": "first_purchase", "amount": 5}
  ]
}

At runtime, you load the tenant's config, hydrate it into instances of the right rule classes, and run them. The code knows how to perform every behavior; the data says which behaviors to apply, in what order, with what parameters. To add a new kind of rule, you add a new class. To add a new tenant configuration, you change data — no deploy, no migration, no engineering.
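The hydration step can be sketched as a registry lookup. This is a minimal sketch under assumptions, not the one true implementation: RULE_TYPES, hydrate_rules, run_rules, and the Context and Customer dataclasses are hypothetical names introduced for illustration, and the rule classes are trimmed copies of the ones above.

```python
from dataclasses import dataclass, field


@dataclass
class Customer:
    order_count: int = 0


@dataclass
class Context:
    # Hypothetical context object holding only the fields the rules read or write.
    order_total: float
    customer: Customer = field(default_factory=Customer)
    discount: float = 0.0


class PercentageDiscountRule:
    def __init__(self, percent, min_order):
        self.percent = percent
        self.min_order = min_order

    def applies(self, context):
        return context.order_total >= self.min_order

    def apply(self, context):
        context.discount += context.order_total * (self.percent / 100)


class FirstPurchaseDiscountRule:
    def __init__(self, amount):
        self.amount = amount

    def applies(self, context):
        return context.customer.order_count == 0

    def apply(self, context):
        context.discount += self.amount


# Every "type" string the data may select maps to a class the code already ships.
RULE_TYPES = {
    "percentage": PercentageDiscountRule,
    "first_purchase": FirstPurchaseDiscountRule,
}


def hydrate_rules(config):
    """Turn the declarative rule list into rule instances, preserving order."""
    rules = []
    for entry in config.get("discount_rules", []):
        params = dict(entry)  # copy so the stored config is not mutated
        rule_type = params.pop("type")
        if rule_type not in RULE_TYPES:
            raise ValueError(f"unknown rule type: {rule_type!r}")
        rules.append(RULE_TYPES[rule_type](**params))
    return rules


def run_rules(rules, context):
    """Apply every rule whose applies() check passes, in configured order."""
    for rule in rules:
        if rule.applies(context):
            rule.apply(context)
    return context
```

An unknown type fails loudly instead of being skipped, so a typo in a tenant's configuration surfaces at load time rather than as a silently missing discount.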

Notice that the apply methods mutate the incoming context. If you would rather not mutate it, have each rule return its contribution and let the caller combine the results; a reasonable name for that method is result. Whether to use mutable or immutable data here is largely a matter of preference. In the context of a web app, you usually do want mutability (for example, encoding and decoding a value from the database to a particular meaning for a tenant). If there is more complexity, you can put it behind a port to unit test it separately.

The shape generalizes: any extension point in your system can have its own base class, its own family of implementations, and its own data schema describing how it's configured per tenant.

Where the data lives

The configuration has to be persisted somewhere. The options aren't equivalent:

In-memory cache. Tempting because it's fast, but caches get invalidated, evicted, and reset on deploy. If the cache is the source of truth, you've lost the data the moment something restarts. Caches belong in front of the source of truth, not in place of it.
Files on disk. Workable for very small, very stable configurations, but file I/O is slow at scale, file deployment is operational overhead, and "edit a file and redeploy" doesn't fit the case where customer success needs to toggle something for a tenant at 4pm on a Friday.
Static configuration baked into the app. Fine for values that genuinely never change between deploys. But if the values are tenant-specific, you're back to the "code per customer" problem.
A database. If you're already running one — and you almost certainly are — this is the clear winner. Reads are fast (especially with a thin cache in front), updates are transactional, the data sits next to the tenant records it's associated with, and you get backups, replication, and access control for free.

Use the database you already have. Don't introduce a new piece of infrastructure for this.

A note on schema

Whichever shape you pick, the configuration has to be retrievable by tenant. That means a tenant_id foreign key, typically a dedicated tenant_configurations table with tenant_id referencing tenants, indexed for fast lookup. The runtime question is always the same: "given the tenant for this request, what's their configuration?" Get that relationship in place first; everything else flows from being able to find the right rules for the right tenant.

If you're using a relational database, the principled approach beyond that is to model the configuration with normalized tables — a tenant_discount_rules table with tenant_id, typed columns for rule type, percent, min_order, and so on, or a polymorphic schema with a separate table per rule type. This is fine, and you may end up there. But I'd push back on starting there.

For an initial proof of concept, a single table is enough:

CREATE TABLE tenant_configurations (
  tenant_id   BIGINT PRIMARY KEY REFERENCES tenants(id),
  config      JSONB  NOT NULL DEFAULT '{}'::jsonb,
  updated_at  TIMESTAMP NOT NULL DEFAULT NOW()
);

One row per tenant, the primary key handles the lookup index, no migrations needed when you add a new kind of rule. You fetch the row by tenant_id, parse the config JSON, hydrate it into your rule classes, run them. When the configuration stabilizes, when querying into the configuration becomes important, or when validation needs to live at the database level, that's the moment to normalize. Until then, JSON in a column is the shortest path from idea to working code, and you can refactor toward structure once you know what the structure should be.
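The fetch-and-parse step might look like the following. It is a sketch only: it uses Python's built-in sqlite3 with a TEXT column as a stand-in for PostgreSQL and JSONB so that it runs anywhere, and load_tenant_config and the sample tenant id are hypothetical.

```python
import json
import sqlite3


def load_tenant_config(conn, tenant_id):
    """Fetch one tenant's config row and parse it; unknown tenants get an empty config."""
    row = conn.execute(
        # Parameterized query: the tenant id is never concatenated into the SQL text.
        "SELECT config FROM tenant_configurations WHERE tenant_id = ?",
        (tenant_id,),
    ).fetchone()
    return json.loads(row[0]) if row else {}


# Stand-in schema: SQLite has no JSONB type, so the config is stored as JSON text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tenant_configurations ("
    " tenant_id INTEGER PRIMARY KEY,"
    " config TEXT NOT NULL DEFAULT '{}')"
)
conn.execute(
    "INSERT INTO tenant_configurations (tenant_id, config) VALUES (?, ?)",
    (42, json.dumps({"discount_rules": [{"type": "first_purchase", "amount": 5}]})),
)
```

Calling load_tenant_config(conn, 42) returns the parsed dictionary, ready to be hydrated into rule classes.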

The security cliff

There is one thing you must not do, no matter how convenient it looks: do not store executable code in the configuration, and do not let configuration values be interpreted and run.

That means no eval, no exec, no embedded JavaScript or Python or Ruby expressions, no SQL fragments concatenated into queries, no template engines that allow arbitrary function calls. It is tempting (really tempting) to support a configuration that looks like:

{
  "discount_amount": "order.total * 0.1 if customer.tier == 'gold' else 0"
}

…and eval that string at runtime. Do not. The moment you do, anyone who can write to that configuration row can execute arbitrary code on your servers, with the privileges of your application. That's not a feature; that's a remote code execution vulnerability you built on purpose. It doesn't matter that the configuration is "only" editable by admins, or "only" through your UI — the surface area expands the moment another bug exposes that table, the moment a credential leaks, the moment an internal account is phished. The configuration becomes the attacker's payload delivery mechanism, and you handed them the loaded gun.

The correct discipline is strict: configuration is data. It selects between behaviors the code already knows how to perform and supplies typed parameters to them. It never describes a new behavior. If a customer needs a behavior the code doesn't have, the answer is to add a new rule class, not to let them write logic into a JSON blob.

This is also what makes the system safe to expose to customer-success people, support engineers, and eventually self-service customers. The blast radius of a misconfigured rule is "the rule doesn't apply" or "the rule applies wrong". Never "the server runs whatever I told it to."
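One way to enforce "configuration is data" is to validate every rule entry against an allowlist before it is saved or hydrated. The sketch below assumes a hand-written schema; RULE_SCHEMAS and validate_rule are hypothetical names, and a schema library could do the same job.

```python
# Hypothetical allowlist: each permitted rule type and the parameter types it accepts.
RULE_SCHEMAS = {
    "percentage": {"percent": (int, float), "min_order": (int, float)},
    "first_purchase": {"amount": (int, float)},
}


def validate_rule(entry):
    """Return a list of problems with one rule entry; an empty list means it is valid."""
    rule_type = entry.get("type")
    if not isinstance(rule_type, str) or rule_type not in RULE_SCHEMAS:
        return [f"unknown rule type: {rule_type!r}"]
    schema = RULE_SCHEMAS[rule_type]
    params = {key: value for key, value in entry.items() if key != "type"}
    problems = []
    for name in sorted(params.keys() - schema.keys()):
        problems.append(f"unexpected parameter: {name}")
    for name, expected in schema.items():
        if name not in params:
            problems.append(f"missing parameter: {name}")
        elif isinstance(params[name], bool) or not isinstance(params[name], expected):
            # Strings are rejected outright, so an expression can never sneak in.
            problems.append(f"{name} must be a number")
    return problems
```

With this in place, an entry such as {"type": "percentage", "percent": "order.total * 0.1", "min_order": 100} is rejected with "percent must be a number" instead of ever reaching an interpreter.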

The shape, summarized

Identify per-tenant extension points and write a small base class for each.
Implement the concrete behaviors as subclasses of that base.
Store tenant configurations as data; start with a JSON column keyed by tenant_id, normalize later if it earns it.
Hydrate the data into classes at runtime; let the classes do the work.
Never, ever let the data become code.

The principle underneath all of this is that code is the menu (the list of things your system is capable of doing) and data is the order. Customers can pick from the menu, in any combination, with any parameters. They cannot rewrite the menu. The chef writes the menu. That's how you keep the kitchen safe.

The Compiler: Heart and Tools of All Software

Gideon Towolawi — Mon, 18 May 2026 16:36:00 +0000

Every program you have ever run — your operating system, your browser, the app that woke you up this morning, the firmware in your coffee machine — was once just text. Human-readable text. Ideas typed by someone who understood a problem well enough to describe its solution.

But computers do not read ideas. They read instructions. Binary. Electrical signals that mean nothing without precise interpretation.

The bridge between human intention and machine execution is the compiler. It is the most consequential piece of software ever invented. Without it, computer science as we know it does not exist.

What Computer Science Would Be Without Compilers

Imagine a world where every programmer writes raw machine code. Not assembly — actual binary. Opcodes and operands encoded by hand. Every program is a miracle of patience, and every bug is a nightmare of hexadecimal archaeology.

In this world:

Software development is artisanal, not industrial. A single application takes years.
Portability is a myth. Every CPU architecture requires rewriting everything from scratch.
Abstraction dies. There are no functions, no types, no modules — just raw memory and jumps.
Security is impossible. Human minds cannot track the state of thousands of registers and memory locations simultaneously.

Computer science without compilers is not computer science. It is digital craftsmanship at the limit of human endurance. The compiler is what lets us think in concepts instead of circuits.

The Compiler as a Pipeline of Principles

A compiler is not a single program. It is a pipeline of transformations, each stage reducing complexity and increasing structure. The quality of a compiler depends entirely on the principles baked into each stage.

Most people know the classical stages:

Lexer — characters → tokens
Parser — tokens → syntax tree
Semantic Analysis — syntax tree → validated intermediate representation
Optimization — IR → faster IR
Code Generation — IR → machine code

But this description misses the point. The stages are not just mechanical steps. They are guardians of meaning.

Stage 1: The Lexer — Dumb by Design

The lexer is where principles begin. Its job is simple: convert a stream of characters into a stream of tokens. int, x, =, 42, ;.

A bad lexer tries to be smart. It merges = = into ==. It strips whitespace because "it doesn't matter." It reconstructs strings and throws away the original quotes.

A principled lexer stays dumb. It emits raw tokens with precise spatial information — where each token starts, where it ends, what line, what column. It does not interpret. It does not merge. It does not discard.

Why? Because semantics belong to the parser. The lexer cannot know whether :: is a scope resolution operator or two separate colons in a ternary expression. It cannot know whether whitespace inside a string literal is significant or decorative. By staying dumb, the lexer preserves all information for downstream stages to make informed decisions.

The token structure I use reflects this:

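#include <cstddef>  // size_t
#include <string>   // std::string

enum class TokenType;  // defined elsewhere: one enumerator per kind of token

struct Token {
  TokenType type;      // what kind of token
  std::string lexeme;  // the raw text
  size_t line;         // visual line for errors
  size_t column;       // visual column for errors
  size_t span_to;      // exclusive byte offset in source
};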

span_to is the critical field. It lets the parser reconstruct multi-token operators. It lets the formatter preserve original spacing. It lets the LSP highlight exact ranges. The lexer does not use this information — it merely records it, faithfully and without interpretation.
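To make the role of span_to concrete, here is a small sketch in Python rather than the C++ of the struct above. It is not my compiler's actual lexer: lex, adjacent, and is_equality_operator are hypothetical helpers, the token type field is left out, and the source is assumed to be ASCII so that character offsets equal byte offsets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    lexeme: str   # the raw text
    line: int     # 1-based line
    column: int   # 1-based column
    span_to: int  # exclusive end offset in the source

    @property
    def span_from(self) -> int:
        # Start offset derived from the end offset and the raw text length.
        return self.span_to - len(self.lexeme)


def lex(source: str) -> list[Token]:
    """Dumb lexer: words and numbers become tokens, every other character is its own token."""
    tokens, i, line, col = [], 0, 1, 1
    while i < len(source):
        ch = source[i]
        if ch == "\n":
            i, line, col = i + 1, line + 1, 1
        elif ch.isspace():
            # Whitespace emits no token, but the gap stays visible in the spans.
            i, col = i + 1, col + 1
        else:
            j = i + 1
            if ch.isalnum() or ch == "_":
                while j < len(source) and (source[j].isalnum() or source[j] == "_"):
                    j += 1
            tokens.append(Token(source[i:j], line, col, j))
            col += j - i
            i = j
    return tokens


def adjacent(a: Token, b: Token) -> bool:
    """True when b starts exactly where a ends, with nothing in between."""
    return a.span_to == b.span_from


def is_equality_operator(a: Token, b: Token) -> bool:
    """Parser-side decision: two '=' tokens form '==' only if they touch."""
    return a.lexeme == "=" and b.lexeme == "=" and adjacent(a, b)
```

Lexing "a == b" and "a = = b" yields the same four lexemes, but only in the first case do the two = tokens touch, so the parser can accept one as == and report the other as an unexpected = after =.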

This is the first principle: reduce at the right stage, never earlier.

Why Principles Matter More Than Performance

It is tempting to optimize the lexer. Merge tokens early. Strip separators. Compress the token stream. These optimizations feel productive.

They are traps.

Every piece of information discarded in the lexer is a piece of information that cannot be recovered in the parser, the semantic analyzer, or the code generator. A stripped space cannot be restored for formatting. A merged == cannot be split back if the parser needs to report "unexpected token = after =". An interpreted string literal loses the original escape sequences.

The cost of a "smart" lexer is permanent information loss. The cost of a dumb lexer is a slightly larger token stream — trivial to optimize later, impossible to reconstruct if deleted early.

This principle extends through every compiler stage:

Parser: Validate syntax strictly, but do not constant-fold yet
Semantic Graph: Resolve types and ownership, but do not lower to machine concepts yet
IR: Represent semantics faithfully, optimize only when correctness is provable
Backend: Generate code for the target, but never modify semantic truth

Each stage has one job. Each stage does that job completely. No stage does another stage's work prematurely.

Building Correct by Construction

The compiler is not just a tool. It is a proof system. It proves that your program means what you think it means, that it will not leak memory, that it will not access invalid lifetimes, that it will execute deterministically across architectures.

This is not about being clever. It is about being correct by construction.

What Comes Next

Over the next weeks, I will document each stage of compiler construction in detail:

Why the lexer stays dumb and what that enables
How the semantic graph builds structure from raw tokens
What compile-time invariants mean for systems programming
How to translate semantics into machine resources without losing correctness

If you are building compilers, thinking about language design, or simply curious about how software becomes real, subscribe to the newsletter. I share what I learn, what I get wrong, and how to avoid the traps I fall into.

The compiler is the heart of software. Understanding it is understanding how we turn thought into action.

Building a systems language that writes like C++ and proves safety like Rust, without the mental overhead. Join the newsletter for weekly deep-dives on compiler architecture, language design, and systems programming.

How I Built FreeLabTools Using Only Claude and Gemini (And Why It Changes Everything)

Free Lab Tools — Mon, 18 May 2026 16:35:26 +0000

As a solo developer, the biggest bottleneck usually isn't the ideas; it's the time required to execute them. Recently, I wanted to launch a suite of free web tools for developers and creators, but doing it all from scratch would have taken weeks.

Instead, I decided to run an experiment: building the entire platform using a "tag-team" of Large Language Models (Claude and Gemini).

The result? FreeLabTools.com is now live, fully functional, and was built in a fraction of the time. Here is exactly how I did it.

The Strategy: Playing to Each AI's Strengths

I quickly realized that treating AI models as generalists is a mistake. To build FreeLabTools efficiently, I assigned specific roles to each LLM based on their core strengths:

1. Claude: The Architect & Lead Coder

I used Claude (specifically Claude 3.5 Sonnet) as my primary software engineer.

What it did: Generated the clean, modular JavaScript logic for the tools, handled complex algorithms, and structured the UI using modern CSS/Tailwind.
Why it shined: Claude’s ability to maintain context over long conversations and write production-ready code with minimal bugs is unmatched. It understood the "edge cases" of client-side web tools perfectly.

2. Gemini: The Researcher, Optimizer & Copywriter

While Claude was busy coding, I used Gemini to handle the broader scope of the project.

What it did: Optimized the code for speed, generated SEO-friendly meta descriptions, structured the JSON-LD schema for Google, and helped brainstorm user-friendly UI copy.
Why it shined: Gemini’s integration with up-to-date web standards and its fast processing made it the perfect tool for refining, auditing, and preparing the site for launch.

The Workflow: How They Worked Together

The synergy was surprisingly smooth. I would ask Claude to generate a specific tool (for example, a robust code formatter or a secure password generator). Once the tool was functional, I would feed that code into Gemini with the prompt: “Review this code for performance bottlenecks and suggest SEO metadata for the tool page.”

Gemini would often spot tiny optimizations or suggest better accessibility (ARIA) attributes, which I would then feed back to Claude to implement. It felt like managing a highly cooperative two-person dev team.

Key Takeaways for Solo Devs

If you are planning to build your own SaaS or utility site like FreeLabTools.com, here is my advice:

Be specific with prompts: Don't just say "build a tool." Define the inputs, expected outputs, and constraints.
Double-check the math/logic: Even though both AIs are incredibly smart, human oversight is still required to test the final output.
Automate the boring stuff: Let AI handle the boilerplate code so you can focus on user experience and deployment.

What's Next?

Building this project proved to me that the barrier to entry for launching web platforms has completely collapsed.

I'd love for you to check out the final result at FreeLabTools.com and let me know what you think. If you have any questions about the specific prompts I used to pair Claude and Gemini, drop a comment below!

Have you tried building a full project using multiple AI models? What was your experience?

LLM Evaluation in CI: Stop Manual Testing Before It Costs You

Charlie Hadley — Mon, 18 May 2026 16:35:21 +0000

You ship a prompt change to production. Two hours later, a customer complains your LLM is returning hallucinated data. You roll back. You have lost an hour of revenue and some user trust.

This happens because you tested the happy path, not the edge cases. LLM systems are probabilistic — the same input doesn't always produce the same output quality.

The enterprise solution is Braintrust ($249/mo), LangSmith ($99/mo), or Arize. If you're indie, bootstrapped, or pre-PMF, those budgets simply don't exist.

The Core Idea: Eval-as-Code

Instead of vibes-based testing, you define quality as a rubric with concrete attributes:

Correctness (0–10): Is the answer factually right?
Conciseness (0–10): Does it avoid unnecessary padding?
Hallucination risk (0–10): Does it cite things it can't know?
Tone (0–10): Does it match expected register?
Usefulness (0–10): Would a real user find this helpful?

A cheap judge model (GPT-4o-mini at ~$0.0001/call) scores each output against your rubric. You run 50 test cases per eval. Total cost: about £0.20 per full run.
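As a rough sketch, the rubric can live in code next to a prompt builder and a strict parser for the judge's reply. The prompt wording, the JSON reply format, and the names RUBRIC, build_judge_prompt, and parse_judge_reply are all assumptions made for illustration; the call to the judge model itself is left out.

```python
import json

# The five rubric attributes from the list above, keyed by a machine-friendly name.
RUBRIC = {
    "correctness": "Is the answer factually right?",
    "conciseness": "Does it avoid unnecessary padding?",
    "hallucination_risk": "Does it cite things it can't know?",
    "tone": "Does it match expected register?",
    "usefulness": "Would a real user find this helpful?",
}


def build_judge_prompt(question, expected, actual):
    """Render the rubric and one test case into a prompt for the judge model."""
    criteria = "\n".join(f"- {name} (0-10): {desc}" for name, desc in RUBRIC.items())
    return (
        "Score the ACTUAL answer against the rubric. "
        "Reply with a JSON object mapping each attribute to a number from 0 to 10.\n"
        f"Rubric:\n{criteria}\n\n"
        f"INPUT: {question}\nEXPECTED: {expected}\nACTUAL: {actual}\n"
    )


def parse_judge_reply(reply_text):
    """Parse the judge's JSON reply, rejecting missing or out-of-range scores."""
    scores = json.loads(reply_text)
    if not isinstance(scores, dict):
        raise ValueError("judge reply must be a JSON object")
    for name in RUBRIC:
        value = scores.get(name)
        if isinstance(value, bool) or not isinstance(value, (int, float)) or not 0 <= value <= 10:
            raise ValueError(f"bad score for {name}: {value!r}")
    return {name: float(scores[name]) for name in RUBRIC}
```

Parsing strictly matters because the judge is itself an LLM: a malformed reply should fail the eval run loudly rather than be averaged in as a zero.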

Building This in GitHub Actions

Here's the minimal structure:

name: LLM Eval
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run evals
        run: python run_evals.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Check threshold
        run: python check_threshold.py --min-score 7.5

The run_evals.py script:

Loads your golden dataset (JSON file of input/expected-output pairs)
Runs your LLM system on each input
Sends (input, expected, actual) to GPT-4o-mini with your rubric
Aggregates scores by attribute
Writes results to eval_results.json

If aggregate score drops below your threshold, check_threshold.py exits with code 1 — the PR fails.
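A minimal check_threshold.py could look like this. The layout of eval_results.json (a "scores" object of per-attribute means) and the rule of averaging those means into one aggregate are assumptions for the sketch; the --results flag is also hypothetical.

```python
import argparse
import json
from statistics import mean


def aggregate_score(results):
    """Average the per-attribute means into one number (an assumed aggregation rule)."""
    return mean(results["scores"].values())


def check(results, min_score):
    """Return the process exit code: 0 when the aggregate clears the bar, 1 otherwise."""
    score = aggregate_score(results)
    print(f"aggregate score {score:.2f} (minimum {min_score:.2f})")
    return 0 if score >= min_score else 1


def main(argv=None):
    parser = argparse.ArgumentParser(description="Fail CI when eval quality drops.")
    parser.add_argument("--min-score", type=float, required=True)
    parser.add_argument("--results", default="eval_results.json")
    args = parser.parse_args(argv)
    with open(args.results, encoding="utf-8") as fh:
        return check(json.load(fh), args.min_score)


# As a script, finish with: import sys; sys.exit(main())
```

GitHub Actions marks a step as failed on any non-zero exit code, so returning 1 from main and passing it to sys.exit is all it takes to block the PR.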

A Real Example From Production

I changed a classification system prompt to improve response formatting. The change looked solid in manual testing on 5 examples. But I accidentally dropped a critical piece of context the model needed for correct classification.

Without evals: ships to users. Angry support tickets. Rollback. Lost trust.

With evals: CI caught it in 4 minutes. PR fails. I fix the prompt. Evals pass. Ship confidently.

Golden Datasets: The Hard Part

The hardest part is building your test cases. The key insight: start with failures, not successes.

Every time your LLM system makes a mistake:

Save the input
Write down what the correct output should have been
Add it to your golden dataset

After 2–3 weeks of normal usage, you'll have 30–50 meaningful test cases that represent real failure modes — far more valuable than synthetic test cases you invented upfront.
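Capturing a failure can be a few lines of code. This sketch assumes the golden dataset is a JSON array of objects with "input" and "expected" keys; record_failure and the optional "note" field are hypothetical.

```python
import json
from pathlib import Path


def record_failure(dataset_path, model_input, expected_output, note=""):
    """Append one observed failure to the golden dataset, skipping inputs already present."""
    path = Path(dataset_path)
    cases = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    if not any(case["input"] == model_input for case in cases):
        cases.append({"input": model_input, "expected": expected_output, "note": note})
        path.write_text(json.dumps(cases, indent=2, ensure_ascii=False), encoding="utf-8")
    return len(cases)  # current size of the dataset
```

Keeping the file in the repository means every new test case arrives through a reviewed commit, just like the prompt changes it guards against.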

Multi-Model Comparison

Before committing to an expensive model, run your eval suite across providers:

models = ["gpt-4o-mini", "gpt-4o", "claude-3-5-haiku", "gemini-1.5-flash"]
results = {}
for model in models:
    results[model] = run_eval_suite(model, golden_dataset)

# Sort by (score / cost_per_1k_tokens) to find optimal tradeoff

This stops you from paying for GPT-4o when Claude Haiku scores 92% as well at 20% of the cost.
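The ranking step can be sketched as follows. rank_by_value, the min_score floor, and the shape of the two dictionaries are assumptions; you would fill in real scores from your own eval runs and real prices from each provider's pricing page.

```python
def rank_by_value(scores, cost_per_1k_tokens, min_score=0.0):
    """Rank models by score per unit cost, best value first.

    Models scoring below min_score are dropped first, so a very cheap
    but unusable model cannot win on ratio alone.
    """
    eligible = [model for model in scores if scores[model] >= min_score]
    return sorted(
        eligible,
        key=lambda model: scores[model] / cost_per_1k_tokens[model],
        reverse=True,
    )
```

The quality floor is a design choice worth keeping: score divided by cost rewards cheapness without limit, so the cheapest model can top the ranking even when its output is not good enough to ship.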

Cost Optimization

Batch your calls: OpenAI batch API gives 50% discount on async evals
Cache responses: Hash (model + prompt + input) → cache hit avoids re-scoring
Coarse-to-fine: Use a 2-stage system — cheap model filters obvious passes, expensive model only sees borderline cases
Limit full runs: Run the full suite on PRs to main (or on a weekly schedule), not on every commit

A well-optimized setup runs 100 eval cases for under £0.10.
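The caching idea can be sketched like this. cache_key and ScoreCache are hypothetical, and the cache is in-memory only; a real setup would persist it between CI runs (a file or a CI cache step, for instance).

```python
import hashlib
import json


def cache_key(model, prompt, model_input):
    """Stable hash of everything that determines the scored output."""
    # A JSON list keeps field boundaries, so ("ab", "c") and ("a", "bc") hash differently.
    payload = json.dumps([model, prompt, model_input], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ScoreCache:
    """In-memory cache that only calls the scoring function on a miss."""

    def __init__(self):
        self._scores = {}
        self.misses = 0

    def get_or_score(self, model, prompt, model_input, score_fn):
        key = cache_key(model, prompt, model_input)
        if key not in self._scores:
            self.misses += 1
            self._scores[key] = score_fn(model, prompt, model_input)
        return self._scores[key]
```

Because the prompt is part of the key, editing the prompt invalidates exactly the entries it affects, while untouched test cases keep hitting the cache.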

What I've Packaged Up

I've turned this into a complete ready-to-use system in The Indie Hacker's LLM Eval Playbook:

6 golden dataset templates for common LLM tasks (classification, summarization, retrieval, generation, code review, reasoning)
Complete rubric scoring system in Python (copy-paste ready)
Multi-model comparison script with cost-efficiency ranking
GitHub Actions workflow — drop it in your repo and it works
Cost optimization guide with benchmarks

£29 one-time. One avoided production incident pays for it 10× over.

If you have questions about implementing eval-as-code for your specific use case, drop them in the comments — happy to help.

Jenkins as a Code, or how I stopped clicking around in the UI

Khachatur Ashotyan — Mon, 18 May 2026 16:35:13 +0000

I've been running Jenkins in one form or another for years now. Different companies, different sizes of teams, but somehow the same story keeps repeating itself, and at some point I just couldn't take it anymore. So I decided to write down what I went through, what I learned, and where this journey took me. This is Part 1 of what I'm calling My CI/CD Odyssey — a series where I want to share the ideas, the mistakes, and the things that actually worked.

Future chapters will go deeper into the painful stuff — building macOS workers without losing your mind, using spot instances as GitHub Actions runners to cut costs, and a few other rabbit holes I went into. But before we get there, let's start at the beginning, because the beginning is where most of the pain lives.

The "before" picture, and why it hurts

If you've worked with Jenkins for any reasonable amount of time, you probably know this scene: someone opens the Jenkins UI, clicks "New Item", picks a freestyle or pipeline job, fills in twenty-something fields, scrolls past a wall of plugin options, and clicks Save. Then a month later somebody has to figure out why a job behaves differently in dev than in prod, and the answer is "because Arthur clicked a different checkbox in February and nobody remembers".

That was basically my world for a long time. We had multi-tier environments — dev, stage, sometimes more — and on top of that, sometimes more than one Jenkins instance per tier. Each one was configured by hand. Plugins installed by hand. Pipelines copy-pasted from one Jenkins to another and edited by hand. Credentials added by hand. Workers attached by hand. Then one day you wake up and realize:

Nobody remembers what plugins are installed where.
The "stage" Jenkins doesn't match production anymore, and you only notice when a pipeline breaks in prod.
A plugin update on Friday afternoon kills a build, and rolling it back means a human clicking buttons under stress.
A new team member joins and you spend three days explaining tribal knowledge that should really live in a repo.

That last point is what really got me. Tribal knowledge is fine when there are two of you. It stops being fine very quickly.

The idea: treat Jenkins like any other piece of code

So I started doing some research, and the direction was pretty obvious in hindsight: if Jenkins is a piece of infrastructure, and we treat infrastructure as code everywhere else (Terraform for cloud, Helm for Kubernetes, Ansible for hosts), then Jenkins itself shouldn't be the special snowflake we manage by hand. The whole controller, all the jobs, all the credentials wiring, the workers — everything should come out of a git repo. End to end.

The goal I wrote down for myself was something like this:

I want a Jenkins instance where I can throw away the whole VM, the whole cluster, the whole config, run a pipeline, and ten minutes later have an identical Jenkins back. And I want dev to be code-to-code identical to prod, so when I test a plugin upgrade or a pipeline change in dev, I actually know it will behave the same in prod.

If you've ever burned yourself on a "but it worked in stage" deploy, you know exactly why that sentence matters.

The building blocks

Once I started designing this, the picture broke down into a few moving pieces. None of these are revolutionary on their own — what matters is how they fit together.

1. JCasC — Jenkins Configuration as Code

This is the foundation. JCasC is a Jenkins plugin that lets you define the entire controller config in YAML. System settings, security realm, authorization strategy, clouds, credentials wiring, tools, global libraries — all of it. The controller reads the YAML on boot and configures itself.

The moment I plugged JCasC in and could rebuild a controller from a YAML file, I knew I wasn't going back. No more "what's installed where". Whatever is in the YAML is the truth. If it's not in the YAML, it doesn't exist.

A minimal taste of what that looks like:

jenkins:
  systemMessage: "Managed by JCasC — do not edit in the UI"
  numExecutors: 0
  mode: EXCLUSIVE
  securityRealm:
    github:
      clientID: ${GITHUB_CLIENT_ID}
      clientSecret: ${GITHUB_CLIENT_SECRET}
  clouds:
    - kubernetes:
        name: "eks"
        namespace: "jenkins"
        jenkinsUrl: "http://jenkins.jenkins.svc.cluster.local:8080"
unclassified:
  globalLibraries:
    libraries:
      - name: "ci-libs"
        defaultVersion: "main"
        retriever:
          modernSCM:
            scm:
              git:
                remote: "https://github.com/<org>/ci-libs.git"

Two dozen lines or so, and the whole controller knows who it is.

2. Job DSL — jobs from a git repo

JCasC handles the controller, but it doesn't really handle jobs. For that I leaned on the Job DSL plugin. Jobs are defined in Groovy files in a git repo, and a small "seeder" job in Jenkins polls the repo, picks up all the DSL files, and recreates jobs from them. If a job is removed from git, it disappears from Jenkins. If a parameter changes in git, it changes in Jenkins on the next seed run.

This means the Jenkins UI becomes basically read-only from a configuration point of view. Nobody edits a job in the UI anymore — if you do, the seeder will overwrite you on the next run. That's a feature, not a bug.

See the Job DSL plugin's API reference for the declarative syntax.

3. Helm + Kubernetes for the controller

I run the Jenkins controller in Kubernetes. Helm chart for the deploy, persistent volume for the home dir, a sidecar that injects JCasC config from a ConfigMap. Upgrading Jenkins is just bumping a chart version. Rolling back is rolling back a chart version. Plugin lists are values in a Helm values.yaml file, version-pinned, and reviewed in a pull request like any other change.

This is honestly the part that made plugin upgrades stop being scary. They go through a PR. They get tested in dev first. They get the same review as application code.

Side note: if you'd rather not deal with Helm at all, the community also maintains a Jenkins Kubernetes Operator that takes a CRD-first approach. I went with Helm for the simpler upgrade story, but the operator is a perfectly reasonable alternative if you're already heavy into the operator pattern.

4. Packer for worker images

The next big piece is the workers — the actual machines that run your builds. Here I went all-in on Packer. Every worker image is baked from a Packer template that lives in git: base OS, language runtimes, SDKs, build tools, everything pre-installed. The image gets a version. The version gets pinned in the worker config.

This was the moment that builds started to feel reproducible. Before Packer, every worker was a slightly different snowflake, hand-installed and slowly drifting. After Packer, every worker that boots from image v1.2.3 is byte-for-byte the same as every other worker booted from image v1.2.3. If a dependency upgrade breaks something, you know exactly which image introduced it, and you can pin back to the previous one in a one-line PR.

5. Ephemeral workers — born, used, destroyed

This is the part that connects everything, and honestly the part I'm proudest of. Workers in this setup are ephemeral. Not "long-lived agents we reboot once a week", but actually ephemeral. A pipeline asks Jenkins for a worker, a plugin or a dedicated job spins one up from a known Packer image, the worker runs the build, the worker dies. Always. Every build gets a pristine environment.

What does the spinning up depends on the platform, but the pattern is identical across all of them:

Linux builds — the Jenkins Kubernetes plugin schedules a pod in the EKS cluster from a container image we baked. Build finishes, pod is deleted. Lifecycle is seconds to minutes.
AWS EC2 / Azure VMs (Linux and Windows): a dedicated job runs Terraform to provision and de-provision instances from Packer-built images.
macOS VMs — same idea, but the underlying virtualization is its own world. We spin up a fresh macOS VM from a Packer-baked image on each build (via Tart on Apple Silicon hosts, or vSphere for older fleets, or Orchard for pooled remote Macs), the build runs, and the VM is torn down at the end. macOS is messier and deserves its own post — that's Part 2 — but the contract is the same: born for one build, destroyed after.

The point is: every build starts from byte-identical state. Not "mostly the same". Not "the same modulo ~/.cache". Identical. If v1.2.3 of an image is what's running, then every build on that image starts from the exact same filesystem snapshot the Packer pipeline produced. There's no human in between leaving footprints.

That kills a whole category of bugs. No more "leftover state on the agent". No more "this worker has a weird ~/.cache somebody never cleaned up". No more "the disk filled up because of build artifacts from three weeks ago". No more "this only fails on Friday because the agent's been up since Monday and something is leaking". The worker simply doesn't live long enough to accumulate any of that.

It also makes "build is non-reproducible" investigations a lot shorter. If two builds against the same commit produce different artifacts, the cause is almost never the worker — because the worker is brand new in both cases. That narrows the search dramatically.

And it turns out to be a beautiful security property too: secrets that get pulled onto a worker disappear with it. There's no long-lived agent holding old tokens. If a credential leaks into a build environment, its blast radius is measured in minutes, not weeks.

6. Terraform / Terragrunt for everything else

All the things that aren't Jenkins itself — VPCs, IAM, secret stores, the EKS cluster, image galleries — live in Terraform, organized with Terragrunt so the same modules get reused across dev and prod with different inputs. Same code, different variables. That's how I get dev to be code-to-code identical to prod.

If you ever want to test how production will behave, just run the same Terraform with ENV=stage instead of ENV=prod. Same modules, same versions, just a different namespace. No surprises.

How it all clicks together

The flow ends up looking like this:

Somebody opens a pull request — could be a new job, a plugin bump, a JCasC tweak, a new Packer image.
CI runs validation: YAML lint, Groovy compile checks, Terraform plan, Packer build for changed images.
PR gets reviewed and merged.
On merge, GitHub Actions applies infra changes via Terraform, and the Jenkins seeder picks up new DSL files on its next poll.
Next build that needs a worker pulls the new image. No human in the loop.

That's the loop. That's the whole point. The Jenkins UI becomes a window into what the repo says should be running, not the source of truth.

What this fixed for me

Here's what I noticed had actually changed:

No more "works on stage, breaks on prod". Because the two are literally the same code with different inputs. If it works on stage, it works on prod, modulo data differences.
Plugin upgrades stopped being scary. They go through a PR. They get tried on dev. They roll back with git revert.
Onboarding got faster. New engineers read the repo. They don't have to be told secrets or shown a Jenkins UI tour.
Disaster recovery got real. I can lose the controller VM, the EKS cluster, even the entire account, and as long as I have the repo I can rebuild.
Audit trail came for free. Every change to any pipeline is a git commit, with an author, a timestamp, and a PR description. No more "who changed this and when".

What I'm still figuring out

I don't want to make this sound like a finished story, because it's not. A few things still keep me up at night:

macOS workers are their own special kind of hell. You can't just spin up a Mac VM in AWS the same way you spin up Linux. There's a whole ecosystem of hypervisors, licensing rules, and hardware constraints to deal with. This deserves its own post — and it's getting one. Part 2 will be all about macOS workers: Tart, virtualization on Apple Silicon, the trade-offs between self-hosted and cloud-mac providers, and how to make signing and notarization not feel like a horror movie.
GitHub Actions cost at scale. There is an easy way to run spot instances as GitHub Actions runners and offload certain workloads cheaply, but that's its own rabbit hole: different trade-offs, different failure modes, different cost curves. Part 3 will cover spot-based GitHub Actions runners end to end.

Closing thought

If there's one thing I'd say to anyone reading this who's still managing Jenkins by clicking buttons, it's this: you're not lazy for doing it, you're just paying the cost in places that don't show up on a dashboard. The cost shows up when someone leaves the team, when a plugin update breaks a build at 2am, when a customer-facing deploy fails because stage lied to you. Jenkins as Code doesn't make those costs disappear, but it makes them visible and reviewable. And that, honestly, has been worth all the work.

Appendix — tools and plugins I leaned on

For anyone who wants to skip straight to the implementations, here's the short list of what's actually wired up in this setup:

Jenkins plugins

Configuration as Code (JCasC) — the controller config in YAML.
Job DSL — jobs defined in Groovy in a git repo.
Kubernetes plugin — ephemeral pod agents in EKS.
Pipeline: Shared Groovy Libraries — the global libraries that hold reusable pipeline code.

Deployment

Jenkins official Helm chart — what I use to deploy the controller.
Jenkins Kubernetes Operator — the CRD-based alternative, if you prefer operators over Helm.

Image building

HashiCorp Packer — bakes all the worker images (Linux, Windows, macOS).

Infrastructure

Terraform — everything outside Jenkins (VPCs, IAM, secrets, EKS, image galleries).
Terragrunt — keeps the same modules DRY across dev / stage / prod.
Kubernetes / Amazon EKS — where the Jenkins controller lives.
Helm — package manager for the Kubernetes side.
GitHub Actions — applies Terraform on merge.

Coming up in later parts

Tart — macOS VMs on Apple Silicon (Part 2).
Orchard — Tart cluster orchestration for macOS fleets (Part 2).

This is Part 1 of My CI/CD Odyssey. If you want to be pinged when Part 2 drops, follow me here on dev.to. And if you're doing JaaC differently — I'd love to hear about it in the comments.

I Built a Debugger for LLM Agents — Here's Why "Observability" Wasn't Enough

Raju Shanigarapu — Mon, 18 May 2026 16:32:33 +0000

Every time I changed a prompt, I was running a hypothesis test.

But I had no debugger. No way to pause execution. No structural comparison between "before" and "after." Just two terminal windows and a vague feeling that maybe it was better now.

I built agent-lens to fix this.

The Problem with "Observability"

Langfuse, LangSmith, Phoenix — these are great tools. They show you what happened. Traces, spans, token counts.

But none of them answer the question I actually had: did this change make it better?

That requires something different:

A way to compare two runs structurally
A record of why you made the change (the hypothesis)
A verdict — not just "here are the numbers," but "this was an improvement"

What agent-lens Does Differently

1. Pause a live agent mid-run

import agent_lens
from openai import OpenAI

agent_lens.install()          # auto-patches OpenAI + Anthropic
agent_lens.dashboard.start()  # localhost:7878

client = OpenAI()

@agent_lens.trace
def my_agent(query: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": query}]
    ).choices[0].message.content

Open the dashboard, click Pause. The agent blocks at the next LLM call.

2. State a hypothesis before you change anything

POST /runs/{run_id}/fork
{
  "span_id": "abc123",
  "edited_messages": [{"role": "system", "content": "Be concise."}],
  "notes": "Hypothesis: shorter system prompt reduces hallucination",
  "expected_output": "concise"
}

The note travels with the run forever. Future you can read your reasoning.
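As a rough illustration, the fork request above could be assembled from Python with only the standard library. This is a sketch under assumptions: it assumes the API is served from the same address as the dashboard (localhost:7878), and the run id "run-a" and the helper name build_fork_request are made up for the example. It builds the request without sending it.

```python
import json
import urllib.request

# Assumption: the HTTP API lives on the same host/port as the dashboard.
BASE_URL = "http://localhost:7878"


def build_fork_request(run_id: str, span_id: str, system_prompt: str,
                       hypothesis: str, expected_output: str) -> urllib.request.Request:
    """Build (but do not send) a POST /runs/{run_id}/fork request."""
    payload = {
        "span_id": span_id,
        "edited_messages": [{"role": "system", "content": system_prompt}],
        "notes": f"Hypothesis: {hypothesis}",
        "expected_output": expected_output,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/runs/{run_id}/fork",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_fork_request(
    run_id="run-a",  # hypothetical run id
    span_id="abc123",
    system_prompt="Be concise.",
    hypothesis="shorter system prompt reduces hallucination",
    expected_output="concise",
)

# To send it while the dashboard is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

Keeping the hypothesis in the same payload as the edit is what ties the "why" to the forked run, rather than leaving it in a commit message or a notebook.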

3. GET /diff — one call, one verdict

GET /runs/{run_a}/diff/{run_b}

{
  "metrics_delta": {
    "latency_ms":   {"a": 1847, "b": 820,  "pct_change": -55.6},
    "total_tokens": {"a": 453,  "b": 87,   "pct_change": -80.8},
    "cost_usd":     {"a": 0.0045, "b": 0.00087, "pct_change": -80.7}
  },
  "assertion_result": {
    "expected_output": "concise",
    "passed_in_a": false,
    "passed_in_b": true,
    "verdict": "improved"
  }
}

Hypothesis confirmed. With numbers.
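The percentages in that response are plain relative changes from run A to run B, which is easy to check by hand. The sketch below recomputes them from the raw values shown above. The verdict rule and the "regressed" and "unchanged" labels are assumptions for illustration; only "improved" appears in the response, and the real logic in agent-lens may differ.

```python
def pct_change(a: float, b: float) -> float:
    """Relative change from run A to run B, in percent, rounded to one decimal."""
    return round((b - a) / a * 100, 1)


def verdict(passed_in_a: bool, passed_in_b: bool) -> str:
    """Assumed rule: compare whether the assertion passed in each run."""
    if passed_in_b and not passed_in_a:
        return "improved"
    if passed_in_a and not passed_in_b:
        return "regressed"  # hypothetical label
    return "unchanged"      # hypothetical label


# Raw (a, b) values copied from the example response.
raw_metrics = {
    "latency_ms": (1847, 820),
    "total_tokens": (453, 87),
    "cost_usd": (0.0045, 0.00087),
}

metrics_delta = {
    name: {"a": a, "b": b, "pct_change": pct_change(a, b)}
    for name, (a, b) in raw_metrics.items()
}

for name, delta in metrics_delta.items():
    print(name, delta["pct_change"])
print(verdict(passed_in_a=False, passed_in_b=True))
```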

The Full Flow

[Agent running] → Pause → agent blocks at next LLM call
                              ↓
                    [Edit messages in dashboard]
                              ↓
                    Fork → new run diverges
                              ↓
                    Resume → original continues
                              ↓
              [Two runs. GET /diff. Get verdict.]

No restarts. No re-running preceding steps.

Zero Infrastructure

Everything runs locally. SQLite at ~/.agent-lens/runs.db. No Docker. No cloud. No API keys needed to start exploring:

pip install agentlens-tracer
python examples/07_demo_mock.py  # runs a full demo with no API key

Works with LangChain and LlamaIndex Too

from agent_lens.integrations.langchain import AgentLensCallbackHandler
from agent_lens.integrations.llamaindex import AgentLensLlamaIndexHandler

Pass either handler as a callback, and every LLM call is traced automatically.

Why This Matters

You're not debugging a function. You're debugging a probabilistic system. Every prompt change is a hypothesis test.

Today you run that test by eyeballing outputs. agent-lens makes it structural, repeatable, and recorded.

Vibes-based prompt engineering is debugging without a debugger.
agent-lens is the debugger.

GitHub: https://github.com/RAJUSHANIGARAPU/agent-lens
Install: pip install agentlens-tracer

Would love to hear how you're currently debugging LLM agents — drop a comment below.

I Built a Free Image Compressor That Never Uploads Your Files

yangjiaqiang12 — Mon, 18 May 2026 16:29:39 +0000

The Problem

With most online image compressors, your files get uploaded to someone else's computer. TinyPNG does it. Compressor.io does it. Even most so-called "free" tools collect your data on their servers.

This has always felt wrong to me. Images can contain sensitive stuff -- screenshots of conversations, private photos, business documents, unreleased designs. Why should making them smaller require handing them to a stranger?

The Solution: Squash

Squash is a free, open-source image compressor that runs entirely in your browser.

🚫 No uploads -- everything stays on your device
⚡ Instant compression -- no waiting for network round-trips
🎨 Multi-format -- JPEG, PNG, WebP support
📦 Batch processing -- compress multiple images at once
🎚️ Quality slider -- full control from 1% to 100%
📐 Resize -- set max dimensions while compressing
🌓 Dark mode -- works great at night
💰 Completely free -- no limits, no watermarks

How It Works

Squash uses the browser Canvas API to decode images, apply compression settings, and re-encode them at your chosen quality level. All the heavy lifting happens on your device hardware -- not on some server farm.

The processing pipeline:

Load -- Image is decoded from file into raw pixel data
Resize -- If a max width is set, image is scaled down
Encode -- Pixel data is re-encoded at the chosen quality and format
Download -- The compressed result is ready to save
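The Resize step comes down to a little arithmetic. Squash itself is JavaScript, so the Python sketch below is not its code. It only illustrates how a max width could map to output dimensions, assuming the aspect ratio is preserved and images already narrower than the limit are left alone. The function name fit_within is hypothetical.

```python
from typing import Optional, Tuple


def fit_within(width: int, height: int, max_width: Optional[int] = None) -> Tuple[int, int]:
    """Return output dimensions for an image, scaled down to max_width if needed.

    Assumptions: aspect ratio is preserved, and images are never scaled up.
    """
    if max_width is None or width <= max_width:
        return width, height
    # Scale the height by the same factor as the width; keep at least 1 pixel.
    new_height = max(1, round(height * max_width / width))
    return max_width, new_height


print(fit_within(4000, 3000, max_width=1600))
print(fit_within(800, 600, max_width=1600))
```

The resulting width and height would then be used as the canvas size before the Encode step re-encodes the pixels at the chosen quality.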

No server. No upload. No privacy concern.

Why I Built This

I wanted a tool that respects privacy. Not as a marketing slogan -- as a technical guarantee. Your images literally cannot be uploaded because there is no server to upload them to. The entire application is static HTML, CSS, and JavaScript served from GitHub Pages.

The source code is open (MIT license). You can inspect every line. You can host it yourself. You can verify that nothing leaves your browser.

Comparison with Other Tools

Tool	Privacy	Batch Mode	WebP Support	Price
Squash	✅ Local only	✅ Yes	✅ Yes	Free
Squoosh	✅ Local only	❌ No	✅ Yes	Free
TinyPNG	❌ Uploads	✅ Yes	✅ Yes	Free (20/day)
Compressor.io	❌ Uploads	✅ Yes	❌ No	Free (10MB cap)

The key difference: Squash and Squoosh process locally. But Squoosh has no batch mode. Squash is the only tool that combines local processing + batch mode + multi-format + unlimited use.

Tech Stack

Vanilla HTML/CSS/JavaScript -- No frameworks, no dependencies
Canvas API -- Browser-native image processing
GitHub Pages -- Free, fast static hosting
Zero backend -- No server, no database, no API calls

Try It Yourself

👉 Launch Squash

📂 Source Code on GitHub

If you find it useful, consider buying me a coffee ☕ -- it helps keep the project alive and improving.

Built with vanilla HTML/CSS/JS. No frameworks, no dependencies, no build step. MIT licensed.