Forem: Nidhish Akolkar

State Drift in Parallel Execution Systems - What It Is and How to Fix It By Nidhish Akolkar

Nidhish Akolkar — Sun, 24 May 2026 21:00:48 +0000

There is a category of bug that is worse than a crash.
A crash is loud. It stops execution. It produces an error. It tells you something went wrong and roughly where. You can find it, fix it, and move on.
State drift is silent. Execution continues. Logs look clean. Individual nodes succeed. Payloads validate. And yet the system becomes progressively, subtly, semantically wrong, producing outputs that are structurally correct but meaningfully broken.
This is what happens when parallel execution systems lose control of shared state. And once you've experienced it at scale, it changes how you think about building distributed systems entirely.

What State Drift Actually Is
Most developers encounter state management as a local problem. A single thread, a single process, a single source of truth. State is mutable, but mutations are sequential. There's always one definitive version of reality.
Parallelism breaks this assumption completely.
When multiple execution branches run simultaneously, each reading, transforming, and writing to shared context, "state" stops being a single truth and starts behaving like fragmented timelines. Different branches hold different versions of the same object. Each believes it is operating on the latest state. None of them are wrong, exactly. But none of them are fully right either.
The result is state drift: the gradual divergence of a system's internal model of the world from any consistent, coherent reality.
At small scale, drift is manageable. Individual inconsistencies surface as minor anomalies. At large scale, across hundreds of parallel branches, long-running workflows, and complex agent systems, drift compounds. What starts as a subtle inconsistency in one branch propagates into downstream decisions, corrupts aggregations, and eventually produces a system that is technically executing correctly while being semantically broken.

How It Manifests
The most insidious property of state drift is that it looks like correct behavior.
When I was building a 600+ node AI orchestration infrastructure, state drift didn't announce itself with errors. It announced itself with strangeness:
Duplicated reasoning chains: agents re-solving subtasks that had already been completed, because their version of context didn't include the prior resolution.
Stale summaries overriding richer context: a compression branch finishing late and writing an older, lower-fidelity summary over a richer one that a parallel branch had already produced.
Downstream references to entities that no longer existed: branches making decisions based on context graphs that had been restructured by a concurrent branch they weren't aware of.
Merge nodes silently accepting semantically outdated state: structurally valid payloads that passed all schema checks but carried context from an earlier execution window, quietly overwriting correct state with stale state.
The execution logs were technically correct. Every node succeeded. Every payload validated. The system was not crashing.
It was drifting.
And the real problem underneath all of these symptoms was the same: uncontrolled mutability across parallel execution branches.
Parallelism itself wasn't the issue. The problem was allowing multiple concurrent branches to mutate shared state without discipline.

The Naive Fixes and Why They Fall Short
The first instinct when hitting state drift is to add synchronization:

Delayed merges
Wait nodes
Branch dependency chains
Execution barriers
Forced sequencing for critical paths

These help. They reduce collision frequency. They prevent the most obvious race conditions. But they don't eliminate drift, because branches still carry stale local assumptions: state that was valid when they started but has since been modified by concurrent execution.
You can synchronize when branches merge without solving the problem of what state each branch was operating on during execution. Synchronization controls timing. It doesn't control consistency.
The real solution requires something more fundamental: distributed-state discipline.

Five Strategies That Actually Work

Versioned context objects
The first meaningful fix was making state versioning explicit rather than implicit. Every major state object started carrying lineage metadata: a version ID, a parent execution reference, a mutation depth counter, and a timestamp. Branches stopped assuming they owned the state. Instead they operated on explicit, identified versions of state.
The immediate effect was revealing. A shocking amount of hidden drift became visible the moment every state object carried a version. Branches that were confidently overwriting "the" context were suddenly obviously overwriting version 3 with results derived from version 1. Versioning doesn't prevent drift on its own. But it makes drift visible, and visible drift is debuggable drift.
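A minimal sketch of what a versioned context object might look like. The field names, `deriveVersion`, and `commitVersion` are hypothetical illustrations, not the schema from the system described here. The idea it shows: a write is only accepted if it was derived from the current head version, so a stale write becomes an explicit, inspectable rejection.

```typescript
// Hypothetical shapes: field names are illustrative, not the author's schema.
interface VersionedContext<T> {
  versionId: number;              // identity of this version
  parentVersionId: number | null; // version this one was derived from
  mutationDepth: number;          // number of derivations since the root
  updatedAt: number;              // epoch milliseconds
  data: T;
}

type CommitResult<T> =
  | { accepted: true; head: VersionedContext<T> }
  | { accepted: false; reason: string };

// A branch derives a new version from the exact version it read.
function deriveVersion<T>(
  base: VersionedContext<T>,
  data: T,
  now: number = Date.now()
): VersionedContext<T> {
  return {
    versionId: base.versionId + 1,
    parentVersionId: base.versionId,
    mutationDepth: base.mutationDepth + 1,
    updatedAt: now,
    data,
  };
}

// Accept a candidate only if it was derived from the current head.
function commitVersion<T>(
  head: VersionedContext<T>,
  candidate: VersionedContext<T>
): CommitResult<T> {
  if (candidate.parentVersionId !== head.versionId) {
    return {
      accepted: false,
      reason: `derived from v${candidate.parentVersionId}, but head is v${head.versionId}`,
    };
  }
  return { accepted: true, head: candidate };
}
```

A rejected commit is not an error to hide. It is the signal that a branch needs to re-read state or hand its output to a reconciliation step.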
Immutable intermediate snapshots
This was the most structurally significant change. Instead of allowing shared mutable memory across long-running parallel branches, critical execution stages began producing immutable snapshots. Downstream systems could read these snapshots and derive new state from them, but they could not rewrite them. The principle: append and derive is safer than mutate in place.
This eliminated three categories of problems immediately:

Accidental overwrites, where a branch writing valid new state silently destroyed valid existing state
Recursive corruption, where a mutated object was read by another branch, mutated again, and written back, each mutation compounding the previous one
Merge ambiguity, where it was genuinely unclear which of two competing state versions was more recent or more correct

Immutability doesn't remove the need for reconciliation. It makes reconciliation tractable.
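To make "append and derive" concrete, here is a small sketch. `ContextData`, `appendSnapshot`, and `deriveSnapshot` are hypothetical names, and an in-memory array stands in for whatever store a real system would use. Snapshots are frozen on creation, and new state is always a new snapshot that points back at its source.

```typescript
// Hypothetical context shape for illustration.
interface ContextData {
  summary: string;
  resolvedTasks: readonly string[];
}

interface ContextSnapshot {
  readonly snapshotId: number;
  readonly derivedFrom: number | null;
  readonly data: Readonly<ContextData>;
}

// Append-only log: entries are added, never rewritten.
const snapshotLog: ContextSnapshot[] = [];

function appendSnapshot(
  data: ContextData,
  derivedFrom: number | null = null
): ContextSnapshot {
  const snapshot: ContextSnapshot = Object.freeze({
    snapshotId: snapshotLog.length + 1,
    derivedFrom,
    // Copy and freeze so later changes to the caller's object cannot leak in.
    data: Object.freeze({
      ...data,
      resolvedTasks: Object.freeze(data.resolvedTasks.slice()),
    }),
  });
  snapshotLog.push(snapshot);
  return snapshot;
}

// Deriving never touches the base snapshot; it appends a new one.
function deriveSnapshot(
  base: ContextSnapshot,
  change: Partial<ContextData>
): ContextSnapshot {
  return appendSnapshot({ ...base.data, ...change }, base.snapshotId);
}
```

Because every snapshot records `derivedFrom`, the question "which of these two is newer or more correct" becomes a question about explicit ancestry rather than about write timing.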

Explicit reconciliation layers
Once multiple branches produce competing enrichments of the same state object, you need a principled way to merge them. "Last write wins" is not a reconciliation strategy; it's an abdication of one. Explicit reconciliation passes do the following:

Compare branch outputs field by field
Resolve conflicts using explicit priority rules or confidence scores
Merge structured fields selectively rather than wholesale
Discard low-confidence mutations rather than allowing them to corrupt high-confidence state

This is expensive in performance terms. It adds latency and complexity to the execution graph. But the alternative, allowing branches to clobber each other's outputs silently, produces a system that is fast and wrong. The performance cost of reconciliation is the cost of correctness.
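A reconciliation pass along those lines can be sketched as follows. Everything here is an assumption for illustration: the `FieldProposal` shape, the 0.5 cutoff, and the rule that the higher-confidence value wins. A real system might use priority rules per field instead.

```typescript
// Hypothetical: each field a branch proposes carries a confidence in [0, 1].
interface FieldProposal {
  value: string;
  confidence: number;
}

type BranchOutput = Record<string, FieldProposal>;

// Assumed cutoff; proposals below it are discarded rather than merged.
const MIN_CONFIDENCE = 0.5;

function reconcile(
  current: BranchOutput,
  proposals: readonly BranchOutput[]
): BranchOutput {
  const merged: BranchOutput = { ...current };
  for (const proposal of proposals) {
    // Compare field by field instead of replacing the object wholesale.
    for (const field of Object.keys(proposal)) {
      const candidate = proposal[field];
      if (candidate === undefined || candidate.confidence < MIN_CONFIDENCE) {
        continue;
      }
      const existing = merged[field];
      if (existing === undefined || candidate.confidence > existing.confidence) {
        merged[field] = candidate;
      }
    }
  }
  return merged;
}
```

Note that the order in which branches finish no longer decides the outcome. Only the explicit rule does.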

State validity checks before promotion
A significant amount of instability came from a specific failure mode: incomplete partial writes masquerading as valid context. A branch would start writing a state object, fail partway through, and leave behind a partial structure that passed schema validation but was semantically incomplete. Downstream branches would read this partial state, derive conclusions from it, and propagate the corruption forward.
The fix was a promotion gate: before any state object was allowed to enter shared context, it passed through structural validation that checked:

Schema integrity
Dependency reference validity: every referenced entity actually exists in the current context graph
Orphaned link detection: references to entities that have been removed
Partial write detection: objects that are structurally present but semantically incomplete

Partial context objects were quarantined rather than merged. A branch failing to produce valid state was treated as a branch failure, not as a valid state update.
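A promotion gate of this kind might be sketched like this. The `CandidateState` shape and the `writeComplete` marker are assumptions of mine, one possible way to detect a partial write. The sketch also folds reference validity and orphaned-link detection into a single check against the set of live entity IDs.

```typescript
// Hypothetical candidate shape.
interface CandidateState {
  id: string;
  fields: Record<string, unknown>;
  references: string[];   // ids of entities this object links to
  writeComplete: boolean; // assumed marker, set as the writer's final step
}

type PromotionResult =
  | { promoted: true }
  | { promoted: false; quarantineReasons: string[] };

function checkPromotion(
  candidate: CandidateState,
  requiredFields: readonly string[],
  liveEntityIds: ReadonlySet<string>
): PromotionResult {
  const reasons: string[] = [];

  // Schema integrity: every required field is present.
  for (const field of requiredFields) {
    if (candidate.fields[field] === undefined) {
      reasons.push(`missing field: ${field}`);
    }
  }

  // Reference validity and orphaned links: every target must still exist.
  for (const ref of candidate.references) {
    if (!liveEntityIds.has(ref)) {
      reasons.push(`dangling reference: ${ref}`);
    }
  }

  // Partial write detection via the completion marker.
  if (!candidate.writeComplete) {
    reasons.push("partial write: completion marker not set");
  }

  return reasons.length === 0
    ? { promoted: true }
    : { promoted: false, quarantineReasons: reasons };
}
```

Returning the reasons, rather than a bare boolean, keeps quarantined objects debuggable.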

Temporal isolation of execution phases
The final and perhaps most important strategy was recognizing that certain types of operations simply should not coexist in the same active mutation window. Reasoning, enrichment, memory compression, and orchestration decisions each operate on state in fundamentally different ways. Allowing them to run concurrently against shared mutable state is asking for inconsistency.
The solution was separating these into isolated execution phases with strict promotion order:

Reasoning phases complete and produce versioned outputs
Enrichment phases read reasoning outputs and produce enriched snapshots
Memory compression runs on finalized enriched state, not on live state
Orchestration decisions are made on fully reconciled, promoted state

Cross-phase contamination dropped dramatically. The system became predictable in a way it hadn't been before, not because the individual components changed, but because the temporal boundaries between them became explicit and enforced.
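The promotion order can be sketched as a tiny pipeline. The names here (`PhaseName`, `runPhases`, string payloads) are hypothetical, and the sketch only shows the ordering rule: each phase sees the frozen, promoted output of the phase before it and nothing else. Work inside a phase could still run in parallel.

```typescript
type PhaseName = "reasoning" | "enrichment" | "compression" | "orchestration";

// Strict promotion order, taken from the list above.
const PHASE_ORDER: readonly PhaseName[] = [
  "reasoning",
  "enrichment",
  "compression",
  "orchestration",
];

interface PhaseOutput {
  phase: PhaseName;
  payload: string; // stand-in for real phase results
}

// A phase receives only the promoted output of the previous phase.
type PhaseHandler = (input: PhaseOutput | null) => string;

function runPhases(handlers: Record<PhaseName, PhaseHandler>): PhaseOutput[] {
  const promoted: PhaseOutput[] = [];
  let previous: PhaseOutput | null = null;
  for (const phase of PHASE_ORDER) {
    // Freeze on promotion so a later phase cannot rewrite an earlier result.
    const output: PhaseOutput = Object.freeze({
      phase,
      payload: handlers[phase](previous),
    });
    promoted.push(output);
    previous = output;
  }
  return promoted;
}
```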

The Hardest Part: Observability
Everything above describes how to reduce drift. But before you can fix drift, you have to see it.
This is the genuinely hard part.
At scale, debugging state drift feels like debugging distributed cognition. You are not tracing a single failure. You are tracing tiny inconsistencies propagating through asynchronous reasoning layers over time. The hardest bugs are the ones where every individual node succeeded, every payload validated, and yet the overall system became semantically wrong.
Standard logging is not enough. You need to be able to answer questions like:

Which version of this state object did this branch operate on?
When did this field in this object last change, and which branch changed it?
What was the full lineage of this output: every transformation it went through to get here?

Without this level of observability, you are debugging by intuition. With it, drift becomes traceable, and traceable problems are solvable problems.
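One way to make those questions answerable is an append-only lineage log. This is a sketch under assumptions: the `LineageEntry` shape and both query helpers are hypothetical, and the log is assumed to be ordered by time.

```typescript
// Hypothetical log entry written every time a branch produces a new version.
interface LineageEntry {
  objectId: string;
  version: number;
  parentVersion: number | null;
  branchId: string;
  changedFields: string[];
  timestamp: number;
}

// "When did this field last change, and which branch changed it?"
function lastChangeTo(
  log: readonly LineageEntry[],
  objectId: string,
  field: string
): LineageEntry | undefined {
  for (let i = log.length - 1; i >= 0; i--) {
    const entry = log[i];
    if (
      entry !== undefined &&
      entry.objectId === objectId &&
      entry.changedFields.indexOf(field) !== -1
    ) {
      return entry;
    }
  }
  return undefined;
}

// "What was the full lineage of this output?" Walks parent pointers to the root.
function lineageOf(
  log: readonly LineageEntry[],
  objectId: string,
  version: number
): LineageEntry[] {
  const chain: LineageEntry[] = [];
  let cursor: number | null = version;
  // The length guard stops the walk if the log contains a cycle.
  while (cursor !== null && chain.length < log.length) {
    const wanted: number = cursor;
    const entry: LineageEntry | undefined = log.filter(
      (e) => e.objectId === objectId && e.version === wanted
    )[0];
    if (entry === undefined) break;
    chain.unshift(entry);
    cursor = entry.parentVersion;
  }
  return chain;
}
```

The first question in the list, which version a branch operated on, is answered by the `parentVersion` a branch records when it writes.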

The Mental Model Shift
Building through these problems changed how I think about parallel systems fundamentally.
The intuition most developers start with is: parallelism is about speed. You run things concurrently to finish faster.
That's true as far as it goes. But at the level of complexity where state drift becomes a real problem, parallelism is also about epistemology: about what each branch of the system knows, when it knows it, and how confident it can be that its knowledge is current.
State drift is what happens when different parts of a system have inconsistent answers to those questions.
The fixes are not primarily about synchronization primitives or locking mechanisms. They are about designing a system where every component has a clear, explicit, trustworthy answer to: "what version of reality am I operating on, and how do I know?"
Get that right, and parallel systems become predictable. Get it wrong, and you will spend a very long time debugging systems that are technically correct and semantically broken.

I am Nidhish Akolkar, a Computer Engineering student and AI Systems Engineer based in Pune, India. I build autonomous multi-agent AI infrastructure and run a funded institutional AI & ML laboratory.
GitHub: github.com/nidhishakolkar01-lgtm
LinkedIn: linkedin.com/in/nidhish-a-akolkar-30a33238b

Why Most AI Demos Are Useless - And What Production AI Actually Looks Like By Nidhish Akolkar

Nidhish Akolkar — Sun, 24 May 2026 18:45:06 +0000

There is a specific kind of AI demo that has become almost a cliche.
A founder or researcher stands on a stage or records a screen capture and shows an AI doing something impressive. It answers a complex question perfectly. It writes flawless code on the first try. It completes a multi-step task without a single failure. The audience applauds. Twitter shares it ten thousand times.
And then developers try to build something real with the same technology and discover that nothing works the way the demo suggested it would.
This is not a coincidence. It is a pattern. And understanding why it happens is one of the most important things you can know before building serious AI systems.

What a Demo Optimizes For
A demo has one job: to impress.
Everything about how a demo is constructed flows from that single objective. The inputs are carefully chosen to produce impressive outputs. Edge cases are avoided. The happy path is rehearsed until it's reliable. Failure modes are never shown. Context is controlled completely.
This is not dishonest; it's just how demonstrations work. Nobody demos the part where the system breaks.
But the gap between "this works in a controlled demo" and "this works reliably in production" is enormous. In AI systems, that gap is larger than in almost any other category of software.

The Four Things Demos Never Show

What happens when the input is messy
Demo inputs are clean. Real-world inputs are not. In a demo, the user types a clear, well-formed request. In production, users type incomplete sentences, make spelling mistakes, ask ambiguous questions, provide contradictory information, and expect the system to figure out what they meant. A model that performs beautifully on clean inputs can degrade dramatically on messy ones. Demos never show this because the demo author controls the inputs.
What happens at the edges
Every AI system has failure modes. The question is not whether they exist; it's whether you've found them before your users do. Demos are constructed to stay well inside the system's reliable operating range. Production systems encounter the edges constantly. Users do unexpected things. Data arrives in unexpected formats. External APIs return unexpected responses. Combinations of factors the developer never considered occur regularly at scale.
What happens over time
A demo runs once, cleanly, and ends. Production systems run continuously. They accumulate state. They encounter the same edge cases repeatedly. They interact with systems that change. Models that worked six months ago behave differently after updates. Data distributions shift. Prompts that were reliable stop being reliable. Long-term stability is invisible in a demo. It is one of the hardest problems in production AI.
What the infrastructure actually looks like
The most misleading thing about most AI demos is what they leave out entirely: everything except the model. A demo shows you the input and the output. It hides the orchestration layer that routes requests, the validation system that catches bad outputs, the retry logic that handles failures, the monitoring that alerts when something goes wrong, the fallback systems that maintain service when the primary model fails, the rate limiting that prevents abuse, and the logging that makes debugging possible. None of that is in the demo. All of it is in the production system. And building all of it is where most of the engineering work actually lives.

The Demo-to-Production Gap in Practice
I have built AI systems that went through this exact transition: from something that worked impressively in controlled conditions to something that had to work reliably in the real world.
The experience taught me something that no demo ever communicated: the model is the easy part.
Connecting to an LLM API, giving it a prompt, and getting a response takes an hour. Any developer can do it. The result often looks impressive immediately. You can demo it the same day you built it.
What takes months is everything around the model:
The orchestration layer that coordinates multiple models or agents working together. The state management that maintains context across steps without drift or corruption. The error handling that makes the system resilient when individual components fail. The observability that tells you what happened when something goes wrong. The evaluation pipeline that tells you whether the system is actually performing well, not just whether it looks like it is.
When I built a 600+ node AI orchestration infrastructure in n8n, the model calls themselves were maybe 10% of the total complexity. The other 90% was the infrastructure around them: routing logic, parallel execution, aggregation layers, state synchronization, failure recovery. None of that would have shown up in a demo. All of it was what made the system actually work.

What Production AI Actually Looks Like
Production AI systems share characteristics that demos almost never have:
They are conservative by default. A production system is designed to fail safely: to do less rather than do something wrong. It has hard constraints on what it will and won't attempt. It escalates to humans when confidence is low. It never takes irreversible actions without confirmation.
They have explicit failure modes. Not "the system doesn't fail" but rather "we know exactly how this system fails and we've designed around those failure modes." Every failure case has a handler. Every timeout has a fallback. Every unexpected input has a graceful degradation path.
They are observable. You can look at any point in the system's execution history and understand exactly what happened. Every model call is logged. Every decision point is traced. Every failure is captured with enough context to reproduce and debug it.
They separate concerns cleanly. The model does what models are good at: reasoning, language, pattern recognition. The infrastructure does what infrastructure is good at: reliability, consistency, state management, error handling. Mixing these concerns produces systems that are hard to debug and harder to improve.
They are evaluated continuously. Not "we tested it before launch" but rather "we measure how well it performs in production on an ongoing basis." Output quality is monitored. Regression is detected. The system gets better over time rather than drifting worse.
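As a small illustration of "conservative by default", here is a sketch of a decision gate that sits between a model's proposed action and its execution. The `ProposedAction` shape, the 0.8 threshold, and the decision names are all assumptions for the example.

```typescript
// Hypothetical description of something the model wants to do.
interface ProposedAction {
  description: string;
  confidence: number;    // model or evaluator confidence in [0, 1]
  irreversible: boolean; // e.g. sending an email, deleting a record
}

type ActionDecision = "execute" | "ask_confirmation" | "escalate_to_human";

// Assumed threshold; a real system would tune this per action type.
const MIN_ACTION_CONFIDENCE = 0.8;

function decideAction(
  action: ProposedAction,
  userConfirmed: boolean
): ActionDecision {
  // Low confidence: hand the decision to a person instead of guessing.
  if (action.confidence < MIN_ACTION_CONFIDENCE) {
    return "escalate_to_human";
  }
  // Irreversible actions always need an explicit confirmation first.
  if (action.irreversible && !userConfirmed) {
    return "ask_confirmation";
  }
  return "execute";
}
```

The point of the sketch is that the safe paths are checked first, so "execute" is only reachable when nothing argues against it.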

Why This Matters for Anyone Building AI Systems
If you are building AI systems, or evaluating whether to, the gap between demo and production is the most important thing to understand.
It means the right question to ask about any AI capability is not "can it do this?" but "can it do this reliably, at scale, on messy real-world inputs, over time, with acceptable failure modes?"
Those are completely different questions. The first one gets answered in an afternoon. The second one takes months of engineering to answer honestly.
It also means that the hardest skill in AI engineering right now is not knowing how to use the models; it's knowing how to build the infrastructure around them. How to design systems that are reliable when the model is not. How to build observability into the execution layer. How to handle failure gracefully. How to maintain state consistency across complex workflows.
These are systems engineering skills. They are not what AI demos show. They are what production AI actually requires.

The One Takeaway
The next time you watch an impressive AI demo, ask one question:
"What would this look like after six months in production with real users?"
If you can answer that question confidently, you understand production AI.
If the demo gives you no way to even start answering it, which most demos don't, you are watching a proof of concept, not a product.
The distance between those two things is where the real engineering lives.

What is Agentic AI - Explained Without the Hype By Nidhish Akolkar

Nidhish Akolkar — Sun, 24 May 2026 18:34:06 +0000

Every few months, a new term takes over the AI industry.
Right now, that term is agentic AI.
You see it in startup pitches, research papers, product announcements, and LinkedIn posts. Everyone is building "agentic systems." Every major AI lab has an "agentic" roadmap. The word is everywhere.
And almost nobody explains what it actually means.
This article does. No hype. No buzzwords. Just a clear, honest explanation of what agentic AI is, how it differs from what came before, and why it actually matters.

Start Here: What Regular AI Does
To understand agentic AI, you first need to understand what non-agentic AI looks like.
Most AI systems, including the chatbots and tools most people use daily, follow a simple pattern:
Input → Model → Output
You give the model a prompt. The model processes it. The model returns a response. Done.
This is called a request-response model. It's linear, stateless, and passive. The system has no memory between interactions. It takes no actions in the world. It cannot decide to do anything on its own. It just answers.
This is genuinely useful. But it has hard limits.
A request-response AI cannot execute a multi-step task without you manually managing every step. It cannot decide mid-task to use a different approach. It cannot monitor something over time and respond to changes. It cannot take action in external systems without you explicitly orchestrating every call.
It is, fundamentally, a very sophisticated text transformation function.

What Makes AI "Agentic"
Agentic AI is defined by three properties that request-response systems don't have:

Autonomy
An agentic system can decide what to do next without being told explicitly. Given a goal, it plans its own path to that goal, breaking it into steps, choosing which tools to use, and adapting when something doesn't work.
Tool use
An agentic system can take actions in the world, not just generate text. It can search the web, write and execute code, call APIs, read and write files, send messages, interact with databases. It has hands, not just a voice.
Persistence
An agentic system maintains state across steps. It remembers what it did, what worked, what failed, and what it still needs to do. It can run for minutes, hours, or continuously, not just for the duration of a single prompt.

Put these three together and you get something qualitatively different from a chatbot. You get a system that can pursue a goal over time, using whatever tools it needs, making its own decisions along the way.

A Concrete Example
Here's the clearest way to understand the difference.
Non-agentic approach, research task:
You: "Research the top 5 competitors to my product and summarize their pricing."
The model returns a text response based on its training data. It might be outdated. It can't browse the web. It can't format the output into a spreadsheet. It gives you text and stops.
Agentic approach, same task:
You give the same instruction to an agentic system. It:

Decides it needs current information, so it searches the web for each competitor
Visits each competitor's pricing page
Extracts the relevant pricing data
Identifies that pricing structures differ and decides to normalize them for comparison
Formats the output as a structured table
Notices one competitor's page is paywalled and decides to search for third-party reviews instead
Returns a complete, current, formatted comparison

You gave it a goal. It figured out the steps. It used tools. It adapted when something didn't work. You didn't manage any of it.
That's agentic AI.

The Four Components of an Agentic System
Most agentic systems, regardless of how they're built, share four core components:
The planner
Takes a high-level goal and breaks it into a sequence of steps. This is where the reasoning happens. Modern LLMs are surprisingly good at this: given a goal and context, they can generate reasonable execution plans.
The executor
Carries out individual steps. This is where tool calls happen: web searches, code execution, API calls, file operations. The executor is what gives the agent hands.
The memory
Maintains context across steps. Without memory, an agent forgets what it did two steps ago and loses coherence on long tasks. Memory can be short-term (within a single run) or long-term (persisted across sessions).
The evaluator
Checks whether each step succeeded and whether the overall task is progressing. When something fails, the evaluator decides whether to retry, try a different approach, or escalate. This is what makes agents resilient rather than brittle.
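Here is how the four components might fit together in code. This is a deliberately simplified sketch: `AgentParts`, `runAgent`, and the retry rule are hypothetical, and everything is synchronous, whereas real tool calls would be asynchronous. The planner produces steps, the executor runs them, memory records each outcome, and the evaluator decides whether to retry or stop.

```typescript
interface AgentStep {
  tool: string;
  input: string;
}

// One memory record per step: what ran, what came back, how it went.
interface StepRecord {
  step: AgentStep;
  output: string;
  ok: boolean;
  attempts: number;
}

interface AgentParts {
  plan: (goal: string) => AgentStep[];                            // planner
  execute: (step: AgentStep, memory: readonly StepRecord[]) => string; // executor
  evaluate: (step: AgentStep, output: string) => boolean;         // evaluator
}

function runAgent(
  goal: string,
  parts: AgentParts,
  maxRetries: number = 1
): StepRecord[] {
  const memory: StepRecord[] = [];
  for (const step of parts.plan(goal)) {
    let attempts = 0;
    let ok = false;
    let output = "";
    // Retry a failed step up to maxRetries times.
    while (!ok && attempts <= maxRetries) {
      output = parts.execute(step, memory);
      ok = parts.evaluate(step, output);
      attempts++;
    }
    memory.push({ step, output, ok, attempts });
    // Still failing after retries: stop here so a caller can escalate.
    if (!ok) break;
  }
  return memory;
}
```

Stopping on an unrecoverable step, instead of pushing on with bad context, is one simple form of the escalation the evaluator is responsible for.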

Why "Agentic" Is Not the Same as "Autonomous"
These two words get used interchangeably. They shouldn't be.
Autonomous means operating without human control. A fully autonomous system makes all decisions independently with no human oversight.
Agentic means goal-directed with tool use and planning. An agentic system can be highly autonomous, or it can include human-in-the-loop checkpoints where a person reviews decisions before they're executed.
Most production agentic systems today sit somewhere in the middle. They handle routine steps autonomously but escalate uncertain decisions to a human. This is intentional: it's how you build systems that are both powerful and safe.
The level of autonomy is a design choice. The agentic architecture is what enables that choice to exist.

The Real Engineering Challenge
Here's what most explanations miss: the hard part of agentic AI is not the AI.
Any competent developer can connect an LLM to a few tools and call it an agent. That takes an afternoon.
The hard part is everything around the model:
Reliability
Agents fail in ways that chatbots don't. A chatbot gives a bad answer. An agent executes a bad action, which might have real consequences. Building agents that fail gracefully, retry intelligently, and never take irreversible actions without confirmation is a serious engineering challenge.
State management
Maintaining consistent context across dozens of steps, across parallel execution branches, across sessions that span hours: this is where most agent implementations break down in practice.
Observability
When an agent fails on step 14 of a 20-step task, you need to know exactly what happened at each prior step. Logging, tracing, and debugging agentic systems require completely different tooling from standard software.
Trust boundaries
An agent with access to your email, calendar, files, and APIs is powerful. It's also a significant security surface. Defining what an agent can and cannot do, and enforcing those boundaries reliably, is non-trivial.
These are not AI problems. They are systems engineering problems. And they are why building production-grade agentic systems is hard even for experienced engineers.
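Trust boundaries in particular are easier to reason about with something concrete. Below is a minimal sketch of an allowlist gateway, assuming every tool call goes through one chokepoint. `makeToolGateway` and the tool names are hypothetical. Enforcement lives in ordinary code outside the model, so the model cannot talk its way past it.

```typescript
type ToolFn = (input: string) => string;

// Wraps a set of tools so only explicitly allowed ones can be invoked.
function makeToolGateway(
  tools: Record<string, ToolFn>,
  allowedTools: readonly string[]
): (toolName: string, input: string) => string {
  return (toolName, input) => {
    // Deny by default: anything not on the allowlist is rejected.
    if (allowedTools.indexOf(toolName) === -1) {
      throw new Error(`tool not permitted: ${toolName}`);
    }
    const tool = tools[toolName];
    if (tool === undefined) {
      throw new Error(`unknown tool: ${toolName}`);
    }
    return tool(input);
  };
}
```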

Where Agentic AI Actually Works Today
Agentic AI is not hype, but it's also not magic. It works well in specific contexts:
Well-defined, multi-step tasks
Research, data collection, report generation, code review. Tasks where the steps are knowable even if the execution requires judgment.
Tool-heavy workflows
Tasks that require interacting with multiple systems, APIs, or data sources. This is where the "hands" of an agent matter most.
High-volume repetitive processes
Tasks that would take a human hours to do manually but follow a consistent enough pattern that an agent can handle them reliably.
Monitoring and response
Systems that need to watch something over time and take action when conditions change. This is the least explored and most promising category.
It works poorly in contexts that require deep common sense, nuanced human judgment, or accountability for high-stakes irreversible decisions. Knowing the difference is what separates useful deployments from expensive failures.

The One-Sentence Definition
If someone asks you what agentic AI is, here it is:
An agentic AI system is one that pursues goals autonomously, using tools to take actions in the world, across multiple steps, without requiring a human to manage each step explicitly.
Everything else (the architectures, the frameworks, the research papers) is elaboration on that core idea.

Why It Actually Matters
The shift from request-response AI to agentic AI is not incremental. It's categorical.
Request-response AI augments human work: it makes you faster at things you're already doing.
Agentic AI delegates human work: it takes over entire workflows, not just individual steps.
That distinction changes what's possible. Not in a sci-fi way. In a practical, near-term, already-happening way. The systems being built today, not five years from now, are beginning to handle tasks that previously required sustained human attention, judgment, and action.
Understanding what agentic AI actually is, beneath the hype, is the first step to building it well.

How I Built a Proactive Heartbeat Loop in TypeScript By Nidhish Akolkar

Nidhish Akolkar — Sun, 24 May 2026 18:17:55 +0000

Most developers treat an LLM like a vending machine.
You put in a prompt. You get out a response. The machine goes back to sleep.
That works fine for chatbots. It breaks down completely the moment you want an AI that doesn't wait to be asked: one that monitors your environment continuously, evaluates your state in the background, and decides on its own when to step in.
Building that kind of system requires something most AI tutorials never cover: a persistent background execution loop that runs independently of user input, collects telemetry silently, and triggers the model only when the data says it's time.
This is how I built that loop for JARVIS, and what I learned doing it.

Why a Heartbeat Loop at All
The fundamental problem with passive AI is the direction of control.
In a standard implementation the flow is always:
User acts → System executes → Model responds → System sleeps
The AI is completely blind between interactions. It has no awareness of how long you've been working, whether your attention has shifted, or whether anything in your environment has changed. It only knows what you explicitly tell it.
A heartbeat loop inverts this. Instead of waiting for user input, a background process fires on a set interval, independently of anything happening in the UI. It collects environmental data, evaluates it against thresholds, and decides whether intervention is warranted.
The flow becomes:
Heartbeat fires → Telemetry collected → Threshold evaluated → Intervention triggered (if needed)
The user never initiates. The system does.

The Architecture
Before writing any code, the architecture decision that mattered most was this: the heartbeat loop must never directly touch UI state.
This sounds obvious. It isn't. The temptation when building this is to have the background loop directly update state when it detects something: trigger a voice response, animate an element, change a display. If you do that, you introduce shared mutable state between a background thread and a UI thread, and you will hit race conditions. The UI framework assumes state only changes in response to user actions. When a background process violates that assumption, you get freezes, infinite re-render loops, and failures that are almost impossible to trace because they're emergent: they don't live in any single component.
The solution is strict producer/consumer isolation:

The heartbeat loop is a pure producer: it reads environment data and dispatches events. It never writes to UI state directly.
The UI layer is a pure consumer: it listens for events and manages its own state independently.

No shared mutable state between them. This single architectural decision eliminated an entire category of bugs.
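The isolation can be sketched with a tiny event bus. This is not the JARVIS code: `InterventionBus`, `publishIfFatigued`, and `BannerState` are hypothetical stand-ins, and the 120-minute threshold is hardcoded only to keep the example short. The producer knows only the bus, and the consumer is the only code that touches its own state.

```typescript
interface InterventionEvent {
  type: "STRATEGIC" | "PROTECTIVE";
  timestamp: number;
}

type InterventionListener = (event: InterventionEvent) => void;

// Minimal in-process event bus standing in for an event emitter.
class InterventionBus {
  private listeners: InterventionListener[] = [];

  subscribe(listener: InterventionListener): () => void {
    this.listeners.push(listener);
    // Return an unsubscribe function for cleanup.
    return () => {
      this.listeners = this.listeners.filter((l) => l !== listener);
    };
  }

  publish(event: InterventionEvent): void {
    // Each listener gets a frozen copy, so consumers cannot mutate shared data.
    for (const listener of this.listeners.slice()) {
      listener(Object.freeze({ ...event }));
    }
  }
}

// Producer side: reads data and dispatches. It has no reference to UI state.
function publishIfFatigued(
  bus: InterventionBus,
  sessionMinutes: number,
  now: number
): void {
  if (sessionMinutes > 120) {
    bus.publish({ type: "PROTECTIVE", timestamp: now });
  }
}

// Consumer side: owns its state and is the only code that changes it.
class BannerState {
  visible = false;
  lastType: InterventionEvent["type"] | null = null;

  attach(bus: InterventionBus): () => void {
    return bus.subscribe((event) => {
      this.visible = true;
      this.lastType = event.type;
    });
  }
}
```

In a UI framework, the listener body is where the component's own state setter would be called.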

The Telemetry Gateway Pattern
The other critical design decision was around API efficiency.
A naive implementation would query the LLM on every heartbeat cycle. That's wasteful, slow, and expensive. More importantly, it's unnecessary: most heartbeat cycles don't need model involvement at all.
Instead, JARVIS uses what I call a Deterministic Gateway Pattern:
Step 1 — Continuous local ingestion
Every cycle, the heartbeat collects micro-metrics and writes them to an isolated local cache. No API calls. No model involvement. Just lightweight data collection:

```typescript
interface SystemTelemetry {
  activeSessionMinutes: number;
  lastIdleGapMinutes: number;
  interfaceFocusState: "active" | "idle" | "away";
  sessionStartTimestamp: number;
}

// sessionStartTimestamp, getLastIdleGapMinutes, and getInterfaceFocusState
// are defined elsewhere in the application and are not shown here.
const collectTelemetry = async (): Promise<SystemTelemetry> => {
  const now = Date.now();
  const sessionMinutes = (now - sessionStartTimestamp) / 60000;
  const idleGap = getLastIdleGapMinutes();
  const focusState = getInterfaceFocusState();

  return {
    activeSessionMinutes: sessionMinutes,
    lastIdleGapMinutes: idleGap,
    interfaceFocusState: focusState,
    sessionStartTimestamp
  };
};
```

Step 2 — Local threshold evaluation
The collected metrics are evaluated entirely locally: no model call, no network request. Just lightweight conditional logic:

    const FATIGUE_THRESHOLD_MINUTES = 120;
    const BREAK_MINIMUM_MINUTES = 10;
    const STRATEGIC_THRESHOLD_MINUTES = 60;

    type AgentMode = "PASSIVE" | "STRATEGIC" | "PROTECTIVE";

    const evaluateTelemetry = (telemetry: SystemTelemetry): AgentMode => {
      const { activeSessionMinutes, lastIdleGapMinutes } = telemetry;

      if (
        activeSessionMinutes > FATIGUE_THRESHOLD_MINUTES &&
        lastIdleGapMinutes < BREAK_MINIMUM_MINUTES
      ) {
        return "PROTECTIVE";
      }

      if (activeSessionMinutes > STRATEGIC_THRESHOLD_MINUTES) {
        return "STRATEGIC";
      }

      return "PASSIVE";
    };

Step 3 — Autonomous inversion of control
Only when a threshold is crossed does the system involve the model. At that point, the orchestrator packages the full telemetry history, attaches a behavioral directive appropriate to the intervention type, and fires a targeted payload:

    // PROTECTIVE_DIRECTIVE, STRATEGIC_DIRECTIVE, loadSessionContext and
    // interventionEventEmitter are assumed to be defined elsewhere.
    const triggerIntervention = async (
      mode: AgentMode,
      telemetry: SystemTelemetry
    ): Promise<void> => {
      // Nothing to do when no threshold has been crossed.
      if (mode === "PASSIVE") return;

      const directive = mode === "PROTECTIVE"
        ? PROTECTIVE_DIRECTIVE
        : STRATEGIC_DIRECTIVE;

      const payload = {
        telemetry,
        directive,
        sessionContext: await loadSessionContext()
      };

      interventionEventEmitter.emit("intervention", {
        type: mode,
        payload,
        timestamp: Date.now()
      });
    };

The model is never called speculatively. It's called when the data says it's necessary.
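
One detail is worth spelling out. If the loop calls `triggerIntervention` on every cycle, a session that stays above a threshold would emit a fresh event every interval. A minimal sketch of one way to fire only when the mode changes is below. `createModeGate` is a hypothetical helper of my own for illustration, not something the system above is confirmed to contain.

```typescript
type AgentMode = "PASSIVE" | "STRATEGIC" | "PROTECTIVE";

// Hypothetical helper: remembers the last mode seen and reports whether
// the current cycle should trigger an intervention.
const createModeGate = () => {
  let lastMode: AgentMode = "PASSIVE";

  return (mode: AgentMode): boolean => {
    const changed = mode !== lastMode;
    lastMode = mode;
    // Fire only on a transition into a non-passive mode.
    return changed && mode !== "PASSIVE";
  };
};

// Five consecutive heartbeat cycles and the decision made on each one.
const shouldTrigger = createModeGate();
const modes: AgentMode[] = ["PASSIVE", "STRATEGIC", "STRATEGIC", "PROTECTIVE", "PASSIVE"];
const decisions = modes.map((mode) => shouldTrigger(mode));
// decisions is [false, true, false, true, false]
```

The repeated STRATEGIC cycle is suppressed, while the escalation to PROTECTIVE still gets through.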

The Full Heartbeat Loop
Putting it all together:
    const SESSION_START = Date.now();
    const HEARTBEAT_INTERVAL_MS = 30000; // 30 seconds

    // runHeartbeat (body not shown) runs one collect -> evaluate -> trigger cycle inside a try/catch.

    // Start the loop
    setInterval(runHeartbeat, HEARTBEAT_INTERVAL_MS);
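
The snippet above calls `runHeartbeat` without showing its body. A minimal sketch of what such a cycle function could look like follows, assuming it simply chains the three steps. `createHeartbeat` and `HeartbeatDeps` are hypothetical names, and the steps are passed in as parameters so the sketch stays self-contained.

```typescript
// Hypothetical dependency bundle: the three steps plus an internal logger.
interface HeartbeatDeps<T, M> {
  collect: () => Promise<T>;
  evaluate: (telemetry: T) => M;
  trigger: (mode: M, telemetry: T) => Promise<void>;
  logError: (error: unknown) => void;
}

// Builds one heartbeat cycle. The returned function never rejects:
// any failure is logged internally and the next tick runs as normal.
const createHeartbeat = <T, M>(deps: HeartbeatDeps<T, M>) =>
  async (): Promise<void> => {
    try {
      const telemetry = await deps.collect();
      const mode = deps.evaluate(telemetry);
      await deps.trigger(mode, telemetry);
    } catch (error) {
      // Log internally only; nothing is surfaced to the UI.
      deps.logError(error);
    }
  };
```

Wired up with the functions from the earlier steps, the result of `createHeartbeat({ collect: collectTelemetry, evaluate: evaluateTelemetry, trigger: triggerIntervention, logError: console.error })` could be passed straight to `setInterval`.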

Two things worth noting:
The try/catch is non-negotiable. A heartbeat loop runs indefinitely. One unhandled exception in a cycle can take down the background process or leave it failing with no visible sign: no error in the UI, no warning, just a loop that has stopped doing its job. Always wrap the cycle body.
Fail silently toward the user. Heartbeat failures should be logged internally but never surfaced as UI errors. The user's workspace should be completely unaffected if the background loop has a bad cycle.

The UI Consumer Side
On the interface side, the component listens for intervention events and handles them entirely within its own state:

    // Inside the component body; useEffect is the React hook.
    useEffect(() => {
      const handleIntervention = (event: InterventionEvent) => {
        setInterventionQueue(prev => [...prev, event]);
      };

      interventionEventEmitter.on("intervention", handleIntervention);

      // Unsubscribe on unmount so the listener does not leak.
      return () => {
        interventionEventEmitter.off("intervention", handleIntervention);
      };
    }, []);

The UI never asks the heartbeat what's happening. It just listens. The heartbeat never asks the UI what to display. It just dispatches.
Clean boundary. No shared state. No race conditions.

What This Architecture Gives You
Once the heartbeat loop is in place, the system fundamentally changes in character.
The AI is no longer a tool you invoke. It becomes an environment that runs continuously, maintains awareness of your state, and exercises its own judgment about when to act. The user's relationship with the system shifts: it's no longer a conversation, it's more like a presence.
That shift requires one more thing beyond the technical implementation: the intervention logic has to be conservative. An AI that interrupts constantly is worse than one that never interrupts. The thresholds, the modes, the decision logic: these all need to be calibrated carefully. The heartbeat loop is infrastructure. What you do with it is design.

Key Takeaways

A heartbeat loop requires strict producer/consumer isolation between the background thread and the UI layer. Any shared mutable state will cause race conditions.
Use a deterministic gateway pattern: evaluate thresholds locally first, invoke the model only when necessary. Never query speculatively.
Always wrap the cycle body in try/catch. Silent failures in a background loop are the hardest bugs to diagnose.
The interval matters. 30 seconds is long enough to avoid performance impact, short enough to feel responsive. Tune it for your specific use case.
The loop is infrastructure. The intervention logic is where the real design work lives.

I Built a 600+ Node AI Orchestration Infrastructure in n8n. Here's What Actually Happened.

Nidhish Akolkar — Sun, 24 May 2026 12:52:28 +0000

[Image: The full infrastructure at peak scale. This is what 600+ nodes of parallel agent execution looks like.]

Most people use n8n for simple automations.
Webhook → Gmail → Slack.
That's fine. That's what it was built for.
I used it to build a distributed AI orchestration system with 600+ interconnected nodes, parallel agent execution pipelines, dynamic routing layers, modular tool registries, and centralized aggregation systems.
It took several months of continuous iteration to get there. It was never designed to be this big. It grew: one workflow became ten, ten became interconnected execution chains, and eventually the whole thing became something that looked less like automation and more like a distributed cognitive infrastructure.
This is the honest story of how that happened, what broke along the way, and what building it actually taught me.

Why n8n and Why It Became Surprisingly Powerful
The original problem was simple: I wanted to build AI systems that could think modularly, route tasks dynamically, use tools autonomously, and execute multi-step reasoning without me hardcoding every decision path.
Most frameworks at the time either hid too much, required too much setup, or made it impossible to see what was actually happening during execution.
n8n accidentally solved one problem I didn't expect: it made complex execution visible.
You could literally see the execution graph. Every branch, every connection, every data flow laid out in front of you. That changed how I thought about building systems entirely. Instead of reasoning about code, I started reasoning about execution architecture. That shift is what eventually led to 600+ nodes.

The Architecture: Five Layers
The infrastructure wasn't built randomly. It evolved into five distinct layers, each with a specific responsibility.
Layer 1: Trigger Layer
Everything entered through a centralized HTTP trigger system. The entire infrastructure was API-first: external applications, frontend systems, and agents themselves could invoke execution dynamically. Every incoming request passed through normalization before touching any execution logic. Nothing entered the pipeline raw.
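
To make "nothing entered the pipeline raw" concrete, here is a minimal sketch of a normalization step, written in TypeScript rather than as n8n nodes. The request fields (`source`, `intent`, `payload`) and the function name `normalizeRequest` are assumptions for illustration. The actual request shape is not described here.

```typescript
type Source = "external" | "frontend" | "agent";

// Hypothetical normalized shape; the real system's fields may differ.
interface NormalizedRequest {
  source: Source;
  intent: string;
  payload: { [key: string]: unknown };
  receivedAt: number;
}

type NormalizeResult =
  | { kind: "accepted"; request: NormalizedRequest }
  | { kind: "rejected"; reason: string };

const isSource = (value: unknown): value is Source =>
  value === "external" || value === "frontend" || value === "agent";

// Turns an untrusted body into one canonical shape, or rejects it with a reason.
function normalizeRequest(raw: unknown, now: number): NormalizeResult {
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    return { kind: "rejected", reason: "body must be a JSON object" };
  }
  const body = raw as { [key: string]: unknown };

  const source = body["source"];
  if (!isSource(source)) return { kind: "rejected", reason: "unknown source" };

  const intent = body["intent"];
  if (typeof intent !== "string" || intent.trim() === "") {
    return { kind: "rejected", reason: "missing intent" };
  }

  // A missing or malformed payload becomes an empty object.
  const rawPayload = body["payload"];
  const payload =
    typeof rawPayload === "object" && rawPayload !== null && !Array.isArray(rawPayload)
      ? (rawPayload as { [key: string]: unknown })
      : {};

  return {
    kind: "accepted",
    request: { source, intent: intent.trim().toLowerCase(), payload, receivedAt: now },
  };
}
```

Everything downstream can then rely on one shape instead of re-checking fields in every branch.
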
Layer 2: Preprocessing Layer
Before any AI execution began, requests were enriched, classified, and transformed. This layer handled intent classification, capability mapping, context injection, and validation. At scale this became the most quietly important part of the system: without it, downstream agent execution became inconsistent in ways that were almost impossible to trace back to their source.
Layer 3: Routing Layer
This was the most complex section. The routing layer functioned like an operating system scheduler for AI execution, deciding which agents should run, in what order, under what conditions, with what tools, and at what priority. It contained conditional execution systems, intent routers, capability dispatchers, fallback mechanisms, and retry logic. This is where the workflow stopped being linear and became a tree.
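
The scheduler idea can be sketched as a small routing function. This is an illustrative reduction, not the n8n routing graph itself: `Route`, `planRoute`, and the agent names are hypothetical, and retry logic is left out to keep it short.

```typescript
// Hypothetical route definition: which agent runs for an intent, and when.
interface Route {
  intent: string;
  priority: number; // higher priority runs first
  when: (input: string) => boolean;
  agent: string;
}

// Returns agent names in execution order, or the fallback when nothing matches.
function planRoute(routes: Route[], intent: string, input: string, fallback: string): string[] {
  const matched = routes
    .filter((r) => r.intent === intent && r.when(input)) // filter copies, so sort is safe
    .sort((a, b) => b.priority - a.priority)
    .map((r) => r.agent);
  return matched.length > 0 ? matched : [fallback];
}

const routes: Route[] = [
  { intent: "research", priority: 1, when: () => true, agent: "summarizer" },
  { intent: "research", priority: 5, when: (input) => input.length > 0, agent: "researcher" },
  { intent: "research", priority: 3, when: (input) => input.indexOf("http") !== -1, agent: "validator" },
];

const withLink = planRoute(routes, "research", "check http://example.com", "general");
// withLink is ["researcher", "validator", "summarizer"]
const unknownIntent = planRoute(routes, "translate", "hello", "general");
// unknownIntent is ["general"]
```

Conditions, ordering, and fallback live in one place, which is what lets the execution graph branch without each agent knowing about the others.
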
Layer 4: Parallel Agent Execution
This is where the infrastructure became genuinely powerful. Instead of a single chain, the system ran specialized agents simultaneously across parallel branches: research agents, summarization agents, validation agents, transformation layers, classification pipelines. Multiple branches executed at the same time, and their outputs were later merged downstream. New capabilities could be added as new branches without touching anything else. One branch failing didn't collapse the whole system.
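
The "one branch failing didn't collapse the whole system" property is easy to show in code. Below is a minimal sketch, assuming each branch can be modeled as an async function. `fanOut`, `Branch`, and `BranchResult` are hypothetical names, and n8n's own execution model is not reproduced here.

```typescript
type BranchResult<T> =
  | { branch: string; status: "completed"; output: T }
  | { branch: string; status: "failed"; error: string };

interface Branch<T> {
  name: string;
  run: () => Promise<T>;
}

// Runs every branch concurrently. A rejection is captured as a "failed"
// result for that branch instead of rejecting the whole fan-out.
async function fanOut<T>(branches: Branch<T>[]): Promise<BranchResult<T>[]> {
  return Promise.all(
    branches.map(async (b): Promise<BranchResult<T>> => {
      try {
        return { branch: b.name, status: "completed", output: await b.run() };
      } catch (error) {
        const message = error instanceof Error ? error.message : String(error);
        return { branch: b.name, status: "failed", error: message };
      }
    })
  );
}
```

Results come back in the same order as the input branches, so adding a new branch means adding one entry to the array and nothing else.
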
Layer 5: Aggregation Layer
Every parallel branch eventually converged into a centralized aggregation system. This layer collected outputs, resolved conflicts, synthesized context, ranked results, and formatted responses. It was the hardest layer to get right and the one that nearly broke everything.

The Night Everything Broke
I want to be specific about this because vague references to "debugging challenges" don't capture what working at this scale actually feels like.
The worst problem I hit was context consistency during aggregation.
Here's what was happening: parallel branches were executing simultaneously, but they didn't finish at the same time. One branch would complete immediately. Another would hit a retry loop after a transient failure. A third would take longer because of a heavy processing step. When these branches finally converged at the aggregation layer, they were merging states from different points in time.
The result: outputs became inconsistent. Memory objects got corrupted. Valid data from one branch would collide with stale or partial data from another. And the worst part: the individual branches looked completely fine when you inspected them in isolation. The problem only appeared at convergence.
I spent an entire night manually tracing execution paths across dozens of interconnected branches trying to understand why perfectly valid outputs were collapsing only during aggregation. It wasn't a bug in any single node. It was an emergent property of timing differences across the entire graph.
The fix was enforcing strict execution barriers: aggregation layers would only merge outputs after all upstream branches either completed or explicitly failed, never during an intermediate state. Simple in hindsight. Invisible until you've been burned by it.
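
A minimal sketch of such a barrier is below, assuming each branch reports a terminal state exactly once. `AggregationBarrier` and its method names are hypothetical. The real fix lived in the n8n workflow structure, which this does not try to reproduce.

```typescript
type BranchState<T> =
  | { status: "pending" }
  | { status: "completed"; output: T }
  | { status: "failed"; reason: string };

class AggregationBarrier<T> {
  private readonly names: string[];
  private readonly states: BranchState<T>[];

  constructor(expected: string[]) {
    this.names = expected.slice();
    this.states = expected.map((): BranchState<T> => ({ status: "pending" }));
  }

  complete(branch: string, output: T): void {
    this.settle(branch, { status: "completed", output });
  }

  fail(branch: string, reason: string): void {
    this.settle(branch, { status: "failed", reason });
  }

  // True only when no expected branch is still pending.
  isOpen(): boolean {
    return this.states.every((s) => s.status !== "pending");
  }

  // Refuses to merge while any branch is still in an intermediate state.
  merge(): { outputs: T[]; failed: string[] } {
    if (!this.isOpen()) throw new Error("barrier closed: branches still pending");
    const outputs: T[] = [];
    const failed: string[] = [];
    this.states.forEach((state, i) => {
      if (state.status === "completed") outputs.push(state.output);
      else if (state.status === "failed") failed.push(this.names[i]);
    });
    return { outputs, failed };
  }

  private settle(branch: string, state: BranchState<T>): void {
    const index = this.names.indexOf(branch);
    if (index === -1) throw new Error(`unknown branch: ${branch}`);
    if (this.states[index].status !== "pending") {
      throw new Error(`branch already settled: ${branch}`);
    }
    this.states[index] = state;
  }
}
```

The important property is that `merge()` has no partial mode. A branch stuck in a retry loop keeps the barrier closed until it either completes or is explicitly marked as failed.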

What This Scale Actually Taught Me
Building to 600+ nodes teaches you things that tutorials and courses simply cannot.
Debugging changes completely. At small scale, bugs are local. At this scale, failures are emergent: they appear in places completely disconnected from where they originate. You need execution observability built into the architecture from the start, not added later.
State is the hardest problem. Parallel systems share state across branches that execute at different speeds. State drift, where different parts of your system have inconsistent views of the same data, becomes a constant engineering concern.
Visual architecture has real limits. n8n's visual nature was the biggest advantage early on and the biggest challenge later on. At 600+ nodes, navigation became difficult. Dependency tracing became painful. What made experimentation fast made maintenance hard.
Modularity is not optional at scale. The tool registry system (abstracting tools into reusable callable modules rather than hardcoding execution logic) was the single architectural decision that made the difference between a system that could evolve and one that would have collapsed under its own complexity.
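
For readers who have not built one, a tool registry can be very small. The sketch below is illustrative only: `ToolRegistry`, the `Tool` shape, and the example tools are assumptions, and tools are synchronous here to keep it short, where real ones would usually be async.

```typescript
// Hypothetical tool shape: a name to call it by and a function to run.
interface Tool {
  name: string;
  description: string;
  run: (input: string) => string;
}

class ToolRegistry {
  private tools: Tool[] = [];

  register(tool: Tool): void {
    if (this.has(tool.name)) throw new Error(`tool already registered: ${tool.name}`);
    this.tools.push(tool);
  }

  has(name: string): boolean {
    return this.tools.some((t) => t.name === name);
  }

  // Callers refer to tools by name only, never by implementation.
  call(name: string, input: string): string {
    const matches = this.tools.filter((t) => t.name === name);
    if (matches.length === 0) throw new Error(`unknown tool: ${name}`);
    return matches[0].run(input);
  }

  list(): string[] {
    return this.tools.map((t) => t.name);
  }
}

const registry = new ToolRegistry();
registry.register({ name: "uppercase", description: "Uppercases text", run: (s) => s.toUpperCase() });
registry.register({ name: "word-count", description: "Counts words", run: (s) => String(s.trim().split(/\s+/).length) });

const shouted = registry.call("uppercase", "merge complete"); // "MERGE COMPLETE"
const counted = registry.call("word-count", "three short words"); // "3"
```

Agents depend on the registry interface, so swapping or adding a tool does not touch any execution logic.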

The Bigger Lesson
The most important thing this project taught me has nothing to do with n8n specifically.
It's that intelligence alone is not the hard part of AI systems.
Calling an LLM API is easy. Getting one model to respond to one prompt is a solved problem. The real engineering challenge begins when you try to coordinate multiple capabilities, manage execution state across parallel processes, route tasks dynamically based on context, recover from partial failures, and aggregate outputs from systems that don't finish at the same time.
That's orchestration. And orchestration is where most AI systems fail quietly, not because the model was wrong, but because the infrastructure around it wasn't built to handle real-world execution complexity.
This project was my education in that problem.
Six hundred nodes. Several months. One very long night staring at an aggregation layer that shouldn't have been failing.
Worth every minute of it.

4 Years of Coding Before College — What I Actually Learned

Nidhish Akolkar — Sun, 24 May 2026 09:23:12 +0000

I didn't start coding because someone told me to.
There was no career counselor, no parent pushing me toward it, no school assignment that sparked something. I started because I was obsessed with one idea: creating things instead of just consuming them.
I'd spend hours watching tech videos, exploring random software, breaking things, fixing them, breaking them again. The whole ecosystem fascinated me: AI, futuristic systems, communities, events, branding, all of it. But it wasn't about becoming a developer. It was pure curiosity. It was the feeling that somewhere inside all of this, I could build things that only existed in my head.
The moment that clicked, I just kept going deeper.
That was four years ago. I was still in school.

The Trap I Fell Into Early: Perfecting the Vision Instead of Shipping It
I think big. Always have.
When I get an idea, I don't just see the first version; I see the whole thing. The scale, the community around it, the systems underneath it, the experience someone would have using it three years from now.
For a long time, I thought that was a strength. And in some ways it is. But I learned the hard way that thinking too far ahead without executing in the present is just a comfortable way to avoid the discomfort of starting.
I'd plan projects in extreme detail. Build mental structures around them. Imagine the future scale of what they could become. And then delay actually building them because they didn't feel ready enough yet.
Nothing is ever ready enough. That's not a cliché; it's something you only really understand after watching a perfectly planned project die in a notebook while a half-baked idea someone else shipped becomes real.
The other thing I learned the hard way: motivation isn't a system. Some days you feel unstoppable. Some days you don't want to touch anything. I used to wait for the motivated days. Now I build anyway. The consistency of showing up even on the flat days is what separates things that get finished from things that stay ideas.

The Moment It Stopped Feeling Like Practice
There's a specific feeling I remember.
It was when people started looking at me for direction instead of me looking at others.
Not because I announced anything. Not because I put a title on it. Just because things I had imagined (clubs, events, workshops, systems, communities) had become real enough that other people were experiencing them. Things that were random thoughts in my head were now things that existed in the world, that people showed up for, that had a life outside of me.
That was the moment it started feeling real.
Not the first time code ran. Not the first time a project worked. It was the first time I looked at something and thought: I made that. It didn't exist before me. Now it does.
That feeling is still what drives everything.

The Part Nobody Talks About: It Can Feel Lonely
Most people in my college knew me as the tech and community guy. Teachers noticed the clubs and workshops and events. Students would come to me for direction on technical things.
But the actual journey, the real learning, happened almost entirely outside of classrooms.
YouTube videos at midnight. Random communities online. Experimenting with things nobody around me was experimenting with. Failing at ideas that nobody around me even understood well enough to tell me were bad ideas.
When you're trying to build things that are bigger than your current environment, there's a part of the journey that feels internally lonely. Not in a sad way. Just in an "I'm figuring this out without a map" kind of way.
I don't think that's a bad thing. I think it's just what it actually feels like to be early on something. The people around you are living in the present version of the world. You're spending your energy trying to build the next one.

What I Actually Know Now That I Didn't Know Then
Execution matters more than potential.
That's it. That's the whole lesson.
You can have the best ideas, the clearest vision, genuine talent, and real ambition, and none of it matters if things never get finished. The world doesn't reward people for what they were capable of. It rewards people for what they actually shipped.
I used to think bigger was always better. That the scale of your vision was the measure of your seriousness. Now I think the real skill, the one that actually compounds over time, is learning how to sustainably turn ideas into reality. How to build systems around your work instead of relying on energy and inspiration. How to finish things. How to be patient with the process without losing the ambition that started it.
My biggest growth over these four years isn't that I learned how to think bigger.
It's that I'm learning how to build better.

Four years in. First year of college. A funded AI lab, a multi-agent system in development, and a patent in progress.
And genuinely, the most useful thing I can tell anyone starting out is this: start before you're ready, ship before it's perfect, and stay consistent when the motivation runs out.
Everything else is details.

Nidhish Akolkar is a Computer Engineering student and AI Systems Engineer based in Pune, India. He builds autonomous AI systems and runs a funded institutional AI & ML laboratory.
GitHub: github.com/nidhishakolkar01-lgtm
LinkedIn: linkedin.com/in/nidhish-a-akolkar-30a33238b