<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Talvinder Singh</title>
    <description>The latest articles on Forem by Talvinder Singh (@talvinder).</description>
    <link>https://forem.com/talvinder</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1410841%2F85dd15bf-30cb-47a7-8645-3f180a7f78d4.jpeg</url>
      <title>Forem: Talvinder Singh</title>
      <link>https://forem.com/talvinder</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/talvinder"/>
    <language>en</language>
    <item>
      <title>Trace-Based Assurance: The Governance Layer Agentware Actually Needs</title>
      <dc:creator>Talvinder Singh</dc:creator>
      <pubDate>Wed, 25 Mar 2026 03:39:50 +0000</pubDate>
      <link>https://forem.com/talvinder/trace-based-assurance-the-governance-layer-agentware-actually-needs-2mjh</link>
      <guid>https://forem.com/talvinder/trace-based-assurance-the-governance-layer-agentware-actually-needs-2mjh</guid>
      <description>&lt;p&gt;Agents are being deployed with governance frameworks designed for human committees and quarterly audits. The gap is not small.&lt;/p&gt;

&lt;p&gt;Traditional governance asks: "Did you follow the process?" Agentic systems require a different question: "Can you prove, in real time, that the agent is operating within boundaries?" The difference matters because agents make decisions faster than humans can review them, and carry more risk than trust-based deployment can tolerate.&lt;/p&gt;

&lt;p&gt;At Ostronaut, we generate training content autonomously—presentations, videos, quizzes—for healthcare clients. The first time a client asked "How do we know this meets compliance requirements?", we had documentation. We had process diagrams. We had architectural reviews. What we didn't have was evidence that the system was actually doing what we said it would do, case by case, generation by generation.&lt;/p&gt;

&lt;p&gt;That's the governance gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Evidence Problem
&lt;/h2&gt;

&lt;p&gt;I'm calling this &lt;strong&gt;Trace-Based Assurance&lt;/strong&gt; — a governance model where agents emit verifiable evidence trails that prove compliance in real time, rather than documenting intentions in advance.&lt;/p&gt;

&lt;p&gt;This isn't about adding logging. Every system has logs. Trace-based assurance means structuring agent operations so that governance verification becomes automated and continuous. The trace isn't a byproduct. It's the mechanism.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;By 2027, production-grade agentic systems will be required to emit structured trace data that proves boundary compliance, not just logs outcomes.&lt;/strong&gt; Vendors who treat governance as a documentation problem will lose enterprise deals to vendors who treat it as an evidence problem.&lt;/p&gt;

&lt;p&gt;The shift is already visible. When we talk to healthcare clients, they don't ask "What's your process for content review?" They ask "Can you show me, for this specific piece of generated content, what checks ran and what the results were?"&lt;/p&gt;

&lt;p&gt;That's a different question. It assumes the system is autonomous. It assumes human review isn't feasible at scale. It demands evidence, not assurances.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Traditional Governance Breaks
&lt;/h2&gt;

&lt;p&gt;Traditional governance models don't handle this well. They're built for phase-gate processes: design review, implementation review, deployment approval, quarterly audit. Agents don't operate in phases. They operate continuously. They adapt. They make thousands of decisions between audits.&lt;/p&gt;

&lt;p&gt;The gap shows up in three places.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approval vs. Acceptance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Traditional procurement distinguishes between "approval" (pre-decision authority) and "acceptance" (post-decision verification). Agents break this model. You can't approve every decision in advance—they happen too fast. You can't simply accept outcomes post-facto—the risk is too high.&lt;/p&gt;

&lt;p&gt;Traces create a third path: continuous verification. The agent emits evidence as it operates. Governance systems verify that evidence in real time. Decisions that pass verification proceed. Decisions that fail trigger escalation.&lt;/p&gt;

&lt;p&gt;This isn't theoretical. We built validation gates into Ostronaut's generation pipeline after a quality crisis. The system now emits structured traces at each stage: content extraction, structure generation, media creation, quality scoring. Each trace includes the inputs, the decision made, the constraints checked, and the result.&lt;/p&gt;

&lt;p&gt;When a generation fails validation, we have the trace. We know exactly where it failed and why. When a generation succeeds, the client has evidence that it met their requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Documentation vs. Evidence&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Production systems require security, compliance, and scalability: all of the operational requirements enterprise buyers expect. The standard response is documentation: architecture diagrams, security reviews, compliance checklists.&lt;/p&gt;

&lt;p&gt;Documentation tells you what the system is supposed to do. Evidence tells you what it actually did.&lt;/p&gt;

&lt;p&gt;The difference matters when something goes wrong. If an agent makes a bad decision, documentation tells you the process was sound. Evidence tells you what inputs it received, what constraints it checked, what decision it made, and why.&lt;/p&gt;

&lt;p&gt;We learned this the hard way. Early versions of Ostronaut had extensive documentation about quality controls. When clients asked about a specific generation that didn't meet standards, we could point to the process. What we couldn't do was show them the specific quality checks that ran for that generation and what they returned.&lt;/p&gt;

&lt;p&gt;Documentation scales to the system. Evidence scales to the decision.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trust vs. Transparency&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Trust-based governance works when operations are slow enough for relationship-building and reputation to matter. Agentic systems operate too fast for trust alone.&lt;/p&gt;

&lt;p&gt;Transparency enables trust at speed. If I can see the evidence trail—what the agent considered, what constraints it checked, what decision it made—I can trust the outcome without trusting the vendor's reputation or the operator's judgment.&lt;/p&gt;

&lt;p&gt;This is not about replacing human judgment. It's about giving humans the information they need to judge effectively. A trace that shows "this generation passed 12 quality checks, failed 1, and was escalated for review" is more useful than a process diagram that says "all content undergoes quality review."&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Looks Like in Practice
&lt;/h2&gt;

&lt;p&gt;The pattern is showing up across domains.&lt;/p&gt;

&lt;p&gt;Healthcare training clients don't ask "Is your content accurate?" They ask "Can you prove this specific module met our clinical guidelines?" That's a trace question.&lt;/p&gt;

&lt;p&gt;Financial services clients don't ask "Do you have compliance controls?" They ask "Can you show me the decision path for this specific transaction and what risk checks applied?" That's a trace question.&lt;/p&gt;

&lt;p&gt;Customer support deployments don't ask "How do you ensure quality?" They ask "Can you prove this agent didn't violate our brand guidelines in this specific conversation?" That's a trace question.&lt;/p&gt;

&lt;p&gt;The common thread: verification needs to happen at the decision level, not the system level.&lt;/p&gt;

&lt;p&gt;Here's what trace-based assurance requires:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ftalvinder.com%2Fframeworks%2Ftrace-based-assurance-agentware%2Fassets%2Fd2-diagram-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ftalvinder.com%2Fframeworks%2Ftrace-based-assurance-agentware%2Fassets%2Fd2-diagram-1.png" alt="Diagram 1" width="800" height="2164"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The trace must be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Structured&lt;/strong&gt;: machine-readable format, not free text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complete&lt;/strong&gt;: captures inputs, constraints, decision logic, outcome&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamped&lt;/strong&gt;: enables audit trail reconstruction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Immutable&lt;/strong&gt;: can't be modified after creation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Queryable&lt;/strong&gt;: supports real-time and historical analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is different from logging. Logs capture what happened. Traces capture why it happened and prove it was within bounds.&lt;/p&gt;
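
&lt;p&gt;As a minimal sketch, here is what one such trace record could look like. The field names and the fingerprinting approach are illustrative assumptions, not a standard and not our exact schema:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from dataclasses import dataclass, field
from datetime import datetime, timezone
import hashlib, json

# Illustrative trace record: structured, complete, timestamped, immutable, queryable.
# Field names are hypothetical, not a standard schema.
@dataclass(frozen=True)              # frozen: the record cannot be mutated after creation
class TraceRecord:
    trace_id: str
    stage: str                       # e.g. "quality_scoring"
    inputs_digest: str               # hash of the inputs, not the raw content
    constraints_checked: list        # e.g. ["caloric_threshold", "source_citation"]
    decision: str                    # what the agent did
    outcome: str                     # "pass" / "fail" / "escalated"
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def fingerprint(self) -&amp;gt; str:
        """Content hash so downstream systems can detect tampering."""
        payload = json.dumps(self.__dict__, sort_keys=True, default=str)
        return hashlib.sha256(payload.encode()).hexdigest()
&lt;/code&gt;&lt;/pre&gt;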

&lt;h2&gt;
  
  
  The Architecture Shift
&lt;/h2&gt;

&lt;p&gt;Building for trace-based assurance changes how you architect agentic systems.&lt;/p&gt;

&lt;p&gt;Traditional approach: build the agent, add logging, write documentation.&lt;/p&gt;

&lt;p&gt;Trace-based approach: design the constraints first, structure the agent to emit evidence of constraint adherence, make the trace the governance interface.&lt;/p&gt;

&lt;p&gt;We rebuilt Ostronaut's generation pipeline around this model. Every stage emits a structured trace. The trace includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What content was provided as input&lt;/li&gt;
&lt;li&gt;What quality thresholds were configured&lt;/li&gt;
&lt;li&gt;What checks ran and what they returned&lt;/li&gt;
&lt;li&gt;Whether the output met requirements&lt;/li&gt;
&lt;li&gt;If not, why not and what happened next&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The client's compliance team doesn't review our code. They review traces. When they spot-check a generation, they can see the complete decision path. When they audit the system, they query traces, not documentation.&lt;/p&gt;
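
&lt;p&gt;A sketch of what that audit interface can look like, assuming traces land in any queryable store. The store API and field names here are hypothetical:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical audit queries over a trace store. The query() API is an assumption
# standing in for whatever backend you use (SQL, a log index, a document DB).
def decision_path(store, generation_id):
    """Complete, ordered decision path for one generated artifact."""
    return store.query(filters={"generation_id": generation_id}, order_by="timestamp")

def failed_checks(store, start, end):
    """Spot-check support: which constraints failed in a period, grouped by stage."""
    return store.query(filters={"outcome": "fail", "timestamp_range": (start, end)},
                       group_by="stage")
&lt;/code&gt;&lt;/pre&gt;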

&lt;p&gt;This inverts the governance relationship. Instead of "trust us, we have good processes," it's "verify us, here's the evidence."&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Got Wrong
&lt;/h2&gt;

&lt;p&gt;We initially tried to retrofit traces onto an existing system. That doesn't work. Traces need to be part of the agent's core architecture, not an afterthought.&lt;/p&gt;

&lt;p&gt;We also underestimated the storage and query requirements. Traces for every decision add up fast. You need infrastructure that can handle high-volume writes and support complex queries across time ranges and decision types.&lt;/p&gt;

&lt;p&gt;The bigger mistake: thinking traces were primarily for auditors. They're actually most valuable for the engineering team. When an agent makes a bad decision, the trace is your debugging tool. When you're tuning the system, traces show you which constraints are too loose or too tight. When you're explaining the system to stakeholders, traces are your evidence.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Open Question
&lt;/h2&gt;

&lt;p&gt;Here's what I don't know yet: how do you build organizational trust in trace-based governance?&lt;/p&gt;

&lt;p&gt;Most enterprise buyers are used to documentation-based assurance. They know how to evaluate a security review or a compliance checklist. They don't yet know how to evaluate a trace architecture.&lt;/p&gt;

&lt;p&gt;The question isn't technical. It's cultural. How do you convince a procurement team that "we'll show you the evidence for every decision" is more reliable than "we have a 47-page compliance document"?&lt;/p&gt;

&lt;p&gt;The early adopters get it. Healthcare organizations that already deal with electronic health records understand audit trails. Financial institutions that deal with transaction monitoring understand decision-level evidence.&lt;/p&gt;

&lt;p&gt;But the broader market is still catching up. Most RFPs still ask for documentation, not trace capabilities. Most compliance frameworks still assume human review, not automated verification.&lt;/p&gt;

&lt;p&gt;The shift will happen. It has to. Agents are already making decisions too fast and at too high a volume for documentation-based governance to work. The question is whether the governance frameworks will adapt in time, or whether we'll see a wave of incidents first.&lt;/p&gt;

&lt;h2&gt;
  
  
  Are we building the trace infrastructure now, or waiting for the forcing function? Mostly, we're still writing documentation.
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://talvinder.com/frameworks/trace-based-assurance-agentware/?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=trace-based-assurance-agentware" rel="noopener noreferrer"&gt;talvinder.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agenticsystems</category>
      <category>enterpriseai</category>
      <category>governance</category>
    </item>
    <item>
      <title>The Small Model Arbitrage: Why India Should Be Building Vertical LLMs, Not Chasing Frontier</title>
      <dc:creator>Talvinder Singh</dc:creator>
      <pubDate>Mon, 23 Mar 2026 13:21:57 +0000</pubDate>
      <link>https://forem.com/talvinder/the-small-model-arbitrage-why-india-should-be-building-vertical-llms-not-chasing-frontier-5e51</link>
      <guid>https://forem.com/talvinder/the-small-model-arbitrage-why-india-should-be-building-vertical-llms-not-chasing-frontier-5e51</guid>
      <description>&lt;p&gt;India is trying to build its own GPT-4. This is a mistake.&lt;/p&gt;

&lt;p&gt;The capital requirement to train a frontier model is $500M-$1B+. The talent war for ML researchers is won before you enter it—OpenAI, Anthropic, and Google have already hired everyone worth hiring at compensation packages Indian companies can't match. The compute infrastructure is controlled by three hyperscalers who are also your competitors.&lt;/p&gt;

&lt;p&gt;This is not a winnable race. But there's a different race that is.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Small Model Arbitrage
&lt;/h2&gt;

&lt;p&gt;I'm calling this the &lt;strong&gt;Small Model Arbitrage&lt;/strong&gt;—the opportunity to capture value by building specialized, vertical-specific LLMs that use local data, languages, and domain expertise where general-purpose models systematically underperform.&lt;/p&gt;

&lt;p&gt;The arbitrage exists because frontier model companies optimize for breadth, not depth. GPT-4 is remarkable at general reasoning but mediocre at Tamil legal document analysis, Ayurvedic diagnosis support, or GST compliance automation. The long tail of vertical use cases is economically unattractive to companies spending $1B on training runs.&lt;/p&gt;

&lt;p&gt;That's where the opening is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A well-executed vertical LLM in a defensible domain will reach profitability faster and generate higher ROI than an Indian frontier model attempt over the next 5 years.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The math supports this. Training a competitive frontier model requires $500M-$1B in compute, 100+ PhD-level researchers at $300K-$500K/year, and 3-5 years to market, plus ongoing capital burn to stay competitive as OpenAI and Anthropic release new versions.&lt;/p&gt;

&lt;p&gt;A vertical LLM requires $2M-$10M in initial training, 10-20 engineers and domain experts, and 6-12 months to first deployment. The moat is proprietary domain data, not compute scale.&lt;/p&gt;

&lt;p&gt;The capital efficiency difference is 50-100x. The time-to-revenue difference is 5-10x.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Capital Efficiency Isn't the Whole Story
&lt;/h2&gt;

&lt;p&gt;Capital efficiency alone doesn't win. The real arbitrage is in defensibility.&lt;/p&gt;

&lt;p&gt;Frontier models are commodity infrastructure. When GPT-5 launches, GPT-4 pricing collapses. When Claude 4 launches, Claude 3.5 becomes table stakes. The moat is constantly eroding because the moat IS the model, and the model is constantly being replaced.&lt;/p&gt;

&lt;p&gt;Vertical models have different moats. The moat is the proprietary training data, the domain-specific evaluation benchmarks, the integration into existing workflows, the trust built with regulated industries. These don't erode when OpenAI ships a new model. They compound.&lt;/p&gt;

&lt;p&gt;Consider Indian legal text. A frontier model can summarize a contract. A vertical legal LLM trained on 20 years of Indian case law, Supreme Court judgments, and regulatory filings can identify precedent, flag jurisdictional issues, and generate compliant documentation.&lt;/p&gt;

&lt;p&gt;The difference in value is 10x. The difference in defensibility is 100x.&lt;/p&gt;

&lt;p&gt;Or healthcare. GPT-4 can answer general medical questions. A model trained on Indian clinical protocols, drug formularies, insurance claim patterns, and regional disease prevalence can assist with diagnosis, treatment planning, and prior authorization. It's not a better general model—it's a purpose-built tool that works within the constraints of the Indian healthcare system.&lt;/p&gt;

&lt;p&gt;The pattern here is &lt;strong&gt;data specificity as competitive advantage&lt;/strong&gt;. Frontier models are trained on the open web. Vertical models are trained on proprietary, domain-specific corpora that are expensive or impossible for competitors to replicate.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Import Substitution Mistake
&lt;/h2&gt;

&lt;p&gt;India tried this playbook before. Post-independence industrial policy was built on import substitution—build everything domestically, compete head-to-head with established global players. It failed spectacularly.&lt;/p&gt;

&lt;p&gt;India's inward-looking trade regime discouraged labor-intensive export industries and rewarded installation of new capacity over actual output. The economy stagnated for decades.&lt;/p&gt;

&lt;p&gt;The companies that succeeded—Infosys, Wipro, TCS—didn't try to be IBM. They specialized in specific services where India had comparative advantage: cost-efficient software development, business process outsourcing, IT support. They built world-class competitors by focusing, not by trying to replicate the entire stack.&lt;/p&gt;

&lt;p&gt;The Small Model Arbitrage is the same bet. Don't build Indian GPT-4. Build the best Tamil-English legal LLM. Build the best model for Indian tax code. Build the best clinical decision support system for Indian healthcare protocols.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who's Building This
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Sarvam AI&lt;/strong&gt; is building this playbook. They're not trying to be OpenAI India. They're building models for Indian languages—starting with Hindi, Tamil, Telugu, Kannada. The training data includes regional dialects, code-switching patterns, and cultural context that frontier models miss. Their Indic LLM performs better on Hindi-English code-mixed text than GPT-4 because it was designed for that specific use case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Niramai&lt;/strong&gt; built an AI system for breast cancer screening using thermal imaging. It's not a general-purpose vision model. It's a vertical model trained on Indian patient data, optimized for cost-constrained clinical settings, and integrated with existing diagnostic workflows. The model's accuracy isn't better than frontier models on general image tasks—it's better on the one task that matters for their customers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tricog&lt;/strong&gt; built an ECG interpretation model for Indian hospitals. It doesn't try to be the best general medical AI. It's trained on Indian cardiac data, accounts for regional disease prevalence, and integrates with existing cardiology workflows. The specificity is the product.&lt;/p&gt;

&lt;p&gt;These companies aren't competing on compute scale. They're competing on domain depth.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Criteria for Vertical LLM Opportunity
&lt;/h2&gt;

&lt;p&gt;Not every vertical is worth building. The opportunity exists where three conditions hold:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criterion&lt;/th&gt;
&lt;th&gt;Why It Matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Proprietary data access&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The training corpus must be expensive or impossible for competitors to replicate. Public datasets don't create moats.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Measurable performance delta&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The vertical model must demonstrably outperform frontier models on domain-specific benchmarks. "Better for India" isn't enough—quantify it.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Willingness to pay&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The customer must value the vertical model enough to pay a premium over general-purpose alternatives. Cost savings or compliance requirements work. Marginal convenience doesn't.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Indian legal tech meets all three. Case law is proprietary, performance on precedent identification is measurable, and law firms pay for accuracy.&lt;/p&gt;

&lt;p&gt;Indian healthcare meets all three. Clinical data is proprietary, diagnostic accuracy is measurable, and hospitals pay for compliance and outcomes.&lt;/p&gt;

&lt;p&gt;Indian fintech meets two out of three. Transaction data is proprietary, fraud detection performance is measurable, but willingness to pay is unclear—banks may prefer general models with custom fine-tuning.&lt;/p&gt;

&lt;p&gt;The test is simple: if a frontier model company could replicate your vertical model by spending $10M on data acquisition and fine-tuning, you don't have a moat. If they can't—because the data doesn't exist, the domain expertise takes years to build, or the regulatory relationships are non-transferable—you do.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Don't Know Yet
&lt;/h2&gt;

&lt;p&gt;The open question is whether vertical LLMs can sustain pricing power as frontier models improve. If GPT-5 closes 80% of the performance gap on Indian legal text, does the 20% delta justify a 5x price premium?&lt;/p&gt;

&lt;p&gt;I think yes, but the answer depends on how regulated and mission-critical the domain is. Healthcare and legal have high switching costs and regulatory lock-in. E-commerce and customer support don't.&lt;/p&gt;

&lt;p&gt;The other unknown is whether vertical models can defend against fine-tuned frontier models. If a customer can take GPT-4, fine-tune it on their own data, and get 90% of the value of your vertical model, your business model collapses.&lt;/p&gt;

&lt;p&gt;The defense is proprietary training signal that the customer doesn't have. If your model is trained on 10 years of aggregated industry data that no single customer possesses, fine-tuning doesn't replicate it. If your model is just a fine-tuned version of a frontier model on the customer's own data, you're a services company, not a product company.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Civilizational Bet
&lt;/h2&gt;

&lt;p&gt;The broader question is whether India's AI strategy should prioritize sovereignty or specialization.&lt;/p&gt;

&lt;p&gt;Sovereignty argues for building frontier models domestically, even at higher cost, to ensure strategic autonomy. Specialization argues for building vertical models where India has comparative advantage, and relying on global infrastructure for general-purpose AI.&lt;/p&gt;

&lt;p&gt;I think specialization wins. Sovereignty in AI is expensive and brittle. The cost to maintain a competitive frontier model is not a one-time investment—it's an ongoing tax that grows every year as the frontier moves. India's GDP per capita is $2,500. The U.S. is $76,000. The capital efficiency required to compete on frontier models is not realistic.&lt;/p&gt;

&lt;p&gt;But specialization in vertical AI is realistic. India has 22 official languages, 1.4 billion people, and regulatory systems that differ significantly from Western markets. The data specificity is structural, not temporary.&lt;/p&gt;

&lt;h2&gt;
  
  
  The companies that win will be the ones that stop trying to replicate OpenAI and start building what OpenAI can't.
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://talvinder.com/frameworks/domain-specific-small-models/?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=domain-specific-small-models" rel="noopener noreferrer"&gt;talvinder.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agenticsystems</category>
      <category>indiatech</category>
      <category>aiinfrastructure</category>
    </item>
    <item>
      <title>We Were Running AI Agents Before 'Agentic' Became a Buzzword</title>
      <dc:creator>Talvinder Singh</dc:creator>
      <pubDate>Sat, 21 Mar 2026 04:10:53 +0000</pubDate>
      <link>https://forem.com/talvinder/we-were-running-ai-agents-before-agentic-became-a-buzzword-1dco</link>
      <guid>https://forem.com/talvinder/we-were-running-ai-agents-before-agentic-became-a-buzzword-1dco</guid>
      <description>&lt;p&gt;In early 2024, we deployed a multi-agent system for Ostronaut before anyone called it "agentic AI." We called it "the pipeline." By late 2024, every vendor deck had "agentic" in the title. The architecture didn't change. The vocabulary did.&lt;/p&gt;

&lt;p&gt;Here's the pattern that experience revealed: &lt;strong&gt;Agent Debt&lt;/strong&gt;. The hidden complexity that accumulates when you treat agents as black boxes instead of understanding their failure modes. It isn't technical debt. It's operational blindness. You don't see it until an agent hallucinates in production, burns through your API budget, or produces output so confidently wrong that users trust it.&lt;/p&gt;

&lt;p&gt;Building without frameworks meant hitting every orchestration failure, every context bleed, every runaway cost directly. That's what taught us what actually matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture We Built
&lt;/h2&gt;

&lt;p&gt;Ostronaut generates corporate training content — presentations, videos, quizzes, games — from unstructured input. A client uploads a PDF. The system outputs interactive learning formats.&lt;/p&gt;

&lt;p&gt;We built agents in four functional groups because the problem naturally decomposed that way:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent Type&lt;/th&gt;
&lt;th&gt;Responsibility&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Planner agents&lt;/td&gt;
&lt;td&gt;Break input into learning objectives, decide format mix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Structure agents&lt;/td&gt;
&lt;td&gt;Design slide sequences, video scripts, quiz flows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Content agents&lt;/td&gt;
&lt;td&gt;Generate text, voiceovers, visual descriptions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Validation agents&lt;/td&gt;
&lt;td&gt;Check quality gates, flag hallucinations, verify completeness&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The planner-worker pattern: one planner agent analyzes the input and creates a generation plan. Worker agents execute tasks from that plan. Validation agents run post-generation checks.&lt;/p&gt;

&lt;p&gt;This wasn't novel architecture. It was obvious once you tried to build the thing. But in early 2024, there was no CrewAI to handle orchestration. No LangGraph to manage state. We wrote the coordination logic ourselves.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What that meant in practice:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Context management was manual. Each agent needed the right slice of information: not too much (cost), not too little (hallucination). We built a context router that decided what each agent could see based on its task. It broke constantly. An agent would reference information from a previous step that wasn't in its context window. Output would be incoherent.&lt;/p&gt;
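
&lt;p&gt;A stripped-down sketch of that kind of router. The task-to-context mapping and slice names here are invented for illustration; ours was considerably messier:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Minimal context router: each agent type sees only the slices its task needs.
# Mapping and slice names are illustrative, not our production configuration.
CONTEXT_MAP = {
    "planner":    ["source_summary", "client_requirements"],
    "structure":  ["generation_plan", "format_constraints"],
    "content":    ["generation_plan", "section_brief", "style_guide"],
    "validation": ["generated_output", "quality_thresholds", "source_summary"],
}

def route_context(agent_type, artifacts):
    """Return only the artifacts this agent is allowed to see."""
    allowed = CONTEXT_MAP.get(agent_type, [])
    missing = [key for key in allowed if key not in artifacts]
    if missing:
        # The failure mode described above: an agent needs a prior step's output
        # that never made it into its context.
        raise KeyError(f"context for {agent_type} missing slices: {missing}")
    return {key: artifacts[key] for key in allowed}
&lt;/code&gt;&lt;/pre&gt;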

&lt;p&gt;Tool-calling was brittle. Agents needed to invoke APIs for image generation, video rendering, database writes. Early LLM tool-calling was unreliable. An agent would call the wrong API, pass malformed parameters, or retry indefinitely on failure. We added a validation layer that parsed tool calls before execution. That caught 30% of bad calls.&lt;/p&gt;
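
&lt;p&gt;One way such a layer can work is schema checks before execution. A hedged sketch, with invented tool names and schemas:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Pre-execution tool-call validation: check the call against a schema and a retry
# budget before anything hits an external API. Tool names and schemas are illustrative.
TOOL_SCHEMAS = {
    "render_video":   {"required": {"script_id", "voice"}, "max_retries": 2},
    "generate_image": {"required": {"prompt", "aspect_ratio"}, "max_retries": 3},
}

def validate_tool_call(name, args, attempt):
    schema = TOOL_SCHEMAS.get(name)
    if schema is None:
        return False, f"unknown tool: {name}"                 # wrong API
    missing = schema["required"] - set(args)
    if missing:
        return False, f"malformed call, missing: {missing}"   # bad parameters
    if attempt &amp;gt; schema["max_retries"]:
        return False, "retry budget exhausted"                # indefinite retries
    return True, "ok"
&lt;/code&gt;&lt;/pre&gt;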

&lt;p&gt;Cost control was reactive. We didn't know what "normal" token usage looked like for a multi-agent pipeline. First month in production, we burned through our OpenAI budget in 2 weeks. The problem: redundant context. Multiple agents were processing the same source material because we hadn't optimized context sharing. We added a caching layer. Cost dropped 40%.&lt;/p&gt;
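
&lt;p&gt;A rough sketch of one shape such a caching layer can take, assuming the expensive step is condensing each source once so every agent reuses the result:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import hashlib

# Cache expensive per-source processing keyed on a content hash, so multiple agents
# reuse one condensed version instead of re-sending the same material. Illustrative.
_processed = {}

def processed_source(source_text, summarize):
    """Return the condensed form of a source, computing it at most once."""
    key = hashlib.sha256(source_text.encode()).hexdigest()
    if key not in _processed:
        _processed[key] = summarize(source_text)   # the expensive LLM call
    return _processed[key]
&lt;/code&gt;&lt;/pre&gt;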

&lt;h2&gt;
  
  
  The Quality Crisis
&lt;/h2&gt;

&lt;p&gt;Month 4, we hit the ceiling.&lt;/p&gt;

&lt;p&gt;A healthcare client used Ostronaut to generate training for a clinical health program. The system produced a quiz. One question asked: "What is the recommended daily caloric deficit for healthy weight loss?" The agent-generated answer: "1000-1200 calories."&lt;/p&gt;

&lt;p&gt;That's dangerously high for most people. The correct range is 500-750 calories.&lt;/p&gt;

&lt;p&gt;The agent didn't hallucinate randomly. It pulled from a source document that mentioned 1000-1200 as an &lt;em&gt;upper bound&lt;/em&gt; for specific cases. The agent extracted the number without the qualifier. The validation agent didn't flag it because it checked for factual consistency with the source, not medical safety.&lt;/p&gt;

&lt;p&gt;We caught it in QA. But it revealed the core problem: &lt;strong&gt;agents optimize for coherence, not correctness&lt;/strong&gt;. They will confidently generate plausible-but-wrong output if your validation layer doesn't encode domain constraints.&lt;/p&gt;

&lt;p&gt;This is the failure mode that no prompt tuning fixes. You can instruct the model to "be accurate" as many times as you want. It will still extract numbers from context and strip their qualifiers, because that's what extracting the salient point looks like to the model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What we changed:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Built domain-specific validation gates. For healthcare content, we added rules: flag any caloric recommendation above X, flag any medication dosage, flag any symptom-diagnosis claim. Not LLM-based validation. Rule-based checks that ran before content went to the client.&lt;/p&gt;
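
&lt;p&gt;A condensed sketch of what rule-based gates like these can look like. The thresholds and patterns below are placeholders for illustration, not production rules:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import re

# Rule-based safety gates for healthcare content. Thresholds and patterns are
# placeholders for illustration only.
MAX_DAILY_CALORIC_DEFICIT = 750   # flag anything above this for human review

def healthcare_flags(text):
    flags = []
    for match in re.finditer(r"(\d{3,4})\s*(?:-\s*\d{3,4}\s*)?calorie", text, re.I):
        if int(match.group(1)) &amp;gt; MAX_DAILY_CALORIC_DEFICIT:
            flags.append(f"caloric figure above threshold: {match.group(0)}")
    if re.search(r"\b\d+\s*(mg|mcg|ml)\b", text, re.I):
        flags.append("medication dosage present: requires human review")
    return flags   # non-empty means the content is held before it reaches the client
&lt;/code&gt;&lt;/pre&gt;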

&lt;p&gt;Added confidence scoring. Each agent outputs a confidence score for its generation. Low-confidence outputs go to human review. The scoring isn't sophisticated (token probability and context match), but it works. 15% of generations now route to human QA. That's acceptable.&lt;/p&gt;
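
&lt;p&gt;The routing rule is simple enough to sketch. The weights, the threshold, and the crude mapping from log-probability to a 0-1 signal are all illustrative assumptions:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Blend a token-probability signal with a context-match score; anything below the
# threshold routes to human QA. Weights and threshold are illustrative.
REVIEW_THRESHOLD = 0.7

def route(avg_token_logprob, context_match):
    # Crude mapping of mean log-probability (a negative number) into roughly 0-1.
    prob_signal = min(1.0, max(0.0, 1.0 + avg_token_logprob))
    confidence = 0.6 * prob_signal + 0.4 * context_match
    destination = "human_qa" if confidence &amp;lt; REVIEW_THRESHOLD else "auto_approve"
    return destination, confidence
&lt;/code&gt;&lt;/pre&gt;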

&lt;p&gt;Switched to template + generative hybrid. For high-risk content types (medical, financial, legal), we don't generate from scratch. We use templates with generative fill-ins. Reduces creative output, increases safety. Clients accepted the trade-off.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Got Wrong
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Universal reasoning engine.&lt;/strong&gt; We initially tried to build one planner agent that could handle all content types. A presentation has different structural constraints than a video. A quiz has different validation rules than a game. We split the planner into format-specific planners. That added agents but improved output quality significantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM-as-judge for validation.&lt;/strong&gt; Early on, we used an LLM to validate other LLMs' output. "Does this quiz question make sense? Is this slide coherent?" That's circular. The validator had the same failure modes as the generator. We moved to rule-based validation for anything safety-critical. LLMs still validate style and tone. They don't validate facts. This failure mode is documented in more detail in &lt;a href="https://talvinder.com/build-logs/llm-judge-india-failure/"&gt;why LLM-as-judge stacks fail for Indian markets&lt;/a&gt; — the underlying issue is the same regardless of geography.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Centralized orchestration.&lt;/strong&gt; We built one orchestrator that managed all agents. It became a bottleneck. Every new feature required changing the orchestrator. We should have built federated orchestration, where each agent cluster (planner, worker, validator) manages its own coordination. We haven't refactored this yet. It's still painful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Then vs. Now
&lt;/h2&gt;

&lt;p&gt;If we built Ostronaut today with 2025 tooling, here's what would be easier:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What We Built by Hand&lt;/th&gt;
&lt;th&gt;What Exists Now&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Context routing logic&lt;/td&gt;
&lt;td&gt;LangGraph state management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool-call validation layer&lt;/td&gt;
&lt;td&gt;Built-in tool schemas in GPT-4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent orchestration&lt;/td&gt;
&lt;td&gt;CrewAI, n8n workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retry and error handling&lt;/td&gt;
&lt;td&gt;Framework-level retry policies&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;What's still hard:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Domain-specific validation. No framework gives you medical safety checks or financial compliance rules. You build that yourself.&lt;/p&gt;

&lt;p&gt;Cost optimization. Frameworks don't tell you which agents are burning tokens unnecessarily. You need observability and profiling. This is the same problem &lt;a href="https://talvinder.com/field-notes/indian-saas-agent-reliability/"&gt;Indian SaaS companies are well-positioned to solve&lt;/a&gt; — twenty years of optimizing for constrained infrastructure builds exactly this instinct.&lt;/p&gt;

&lt;p&gt;Failure mode discovery. Agents fail in creative ways. A framework might handle retries, but it won't tell you &lt;em&gt;why&lt;/em&gt; an agent is producing inconsistent output. You learn that by watching production traffic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The real difference:&lt;/strong&gt; In 2024, we had to understand agent internals to build anything reliable. In 2025, you can deploy agents without understanding them. That's progress. But it creates Agent Debt.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Falsifiable Claim
&lt;/h2&gt;

&lt;p&gt;Teams that deploy agent systems without understanding planner-worker coordination, context boundaries, and validation layers will hit a quality ceiling within 3-6 months that no amount of prompt tuning will fix.&lt;/p&gt;

&lt;p&gt;The ceiling shows up as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inconsistent output quality (works 80% of the time, fails unpredictably)&lt;/li&gt;
&lt;li&gt;Cost spirals (agents making redundant API calls, over-generating)&lt;/li&gt;
&lt;li&gt;User trust erosion (one bad generation destroys confidence in 10 good ones)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't a prediction. It's a pattern I've watched repeat across every team that reached out after deploying agents without validation gates. The vendors selling "agentic platforms" are solving orchestration and deployment. They're not solving validation, cost control, or failure mode discovery. Those are still your problem.&lt;/p&gt;

&lt;p&gt;This dynamic connects to something broader happening in &lt;a href="https://talvinder.com/frameworks/agentware/"&gt;the shift from software to agentware&lt;/a&gt; — as the abstraction layer rises, the hidden complexity doesn't disappear. It concentrates at the failure modes the frameworks don't cover.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Question Worth Asking
&lt;/h2&gt;

&lt;p&gt;If you're deploying agents today, ask this: &lt;strong&gt;Can you explain why an agent made a specific decision?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not "what did it output?" but "why did it choose this approach over alternatives?"&lt;/p&gt;

&lt;p&gt;If the answer is "the LLM decided," you have Agent Debt. You're trusting a black box. That works until it doesn't.&lt;/p&gt;

&lt;p&gt;The teams that will build reliable agent systems aren't the ones using the fanciest frameworks. They're the ones who understand what happens when context bleeds between agents, when a planner makes a bad decomposition, when a validator misses a hallucination.&lt;/p&gt;

&lt;h2&gt;
  
  
  We learned that by building without frameworks. You can learn it faster now — but only if you look under the hood.
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://talvinder.com/build-logs/multi-agent-before-agentic/?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=multi-agent-before-agentic" rel="noopener noreferrer"&gt;talvinder.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aiengineering</category>
      <category>agenticsystems</category>
      <category>buildlogs</category>
    </item>
    <item>
      <title>AI Is Making Your Team Slower — The Math Your CEO Won't Show You</title>
      <dc:creator>Talvinder Singh</dc:creator>
      <pubDate>Fri, 20 Mar 2026 03:31:28 +0000</pubDate>
      <link>https://forem.com/talvinder/ai-is-making-your-team-slower-the-math-your-ceo-wont-show-you-agl</link>
      <guid>https://forem.com/talvinder/ai-is-making-your-team-slower-the-math-your-ceo-wont-show-you-agl</guid>
      <description>&lt;p&gt;Every company measuring AI productivity is counting the wrong thing.&lt;/p&gt;

&lt;p&gt;They're measuring output volume: PRs merged, lines written, tickets closed. They're not measuring the cost of what ships: the review burden, the debugging time, the incidents caused by code nobody understood before it hit production.&lt;/p&gt;

&lt;p&gt;When you count both sides, the math doesn't work the way your CEO's slide deck says it does.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Evidence Is Piling Up
&lt;/h2&gt;

&lt;p&gt;This week, The Pragmatic Engineer &lt;a href="https://newsletter.pragmaticengineer.com/p/are-ai-agents-actually-slowing-us" rel="noopener noreferrer"&gt;catalogued what's actually happening&lt;/a&gt; inside companies that went all-in on AI coding agents. The findings aren't theoretical.&lt;/p&gt;

&lt;p&gt;Amazon's retail engineering team saw a spike in outages caused directly by AI agents. The fix? Requiring senior engineer sign-off on all AI-assisted changes from junior developers. That's not a productivity gain. That's adding a bottleneck to compensate for unreliable output.&lt;/p&gt;

&lt;p&gt;Anthropic — the company that builds Claude — ships over 80% of its production code with AI. Their flagship website degraded so badly that paying customers noticed before anyone internally did. The irony writes itself.&lt;/p&gt;

&lt;p&gt;Meta and Uber are tracking AI token usage in performance reviews. Engineers who don't use AI tools enough look unproductive. Engineers who use them indiscriminately look great on paper — until the bugs ship.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Taxes You're Not Counting
&lt;/h2&gt;

&lt;p&gt;Here's the falsifiable claim: &lt;strong&gt;teams that measure AI productivity only by output volume will see their incident rate and mean-time-to-resolve increase by 30% or more within 12 months&lt;/strong&gt;, compared to teams that gate AI output with validation layers.&lt;/p&gt;

&lt;p&gt;The mechanism has three parts.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Review Tax
&lt;/h3&gt;

&lt;p&gt;Every AI-generated PR still needs human review. But AI-generated code is harder to review than human-written code, because the reviewer can't infer intent from the author's history.&lt;/p&gt;

&lt;p&gt;With human code, you know the developer's context: what they were trying to solve, what trade-offs they considered, what they tested. With AI code, you're reverse-engineering intent from output. That's slower, not faster.&lt;/p&gt;

&lt;p&gt;Amazon learned this the hard way. Junior engineers using AI agents shipped code that looked correct — clean formatting, reasonable variable names, passing tests — but had subtle logical errors that only surfaced in production. Reviewers couldn't distinguish "AI wrote this well" from "AI wrote this plausibly."&lt;/p&gt;

&lt;h3&gt;
  
  
  The Refactoring Freeze
&lt;/h3&gt;

&lt;p&gt;Dax Raad, who built OpenCode, points out something every experienced engineer recognises: AI agents discourage refactoring. When code is cheap to generate, nobody wants to clean it up. Why spend an afternoon restructuring a module when the agent writes a new one in ten minutes?&lt;/p&gt;

&lt;p&gt;The result is an expanding codebase where nothing gets simplified, patterns don't converge, and cognitive load increases week over week.&lt;/p&gt;

&lt;p&gt;This is the velocity trap. Short-term speed, long-term slowdown. Sentry's CTO observed the same pattern: AI removes the barrier to getting started, which sounds great until you realise that "getting started" was never the bottleneck. The bottleneck was maintaining, debugging, and evolving what you built. AI makes the first part trivially easy and the second part measurably harder.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Incentive Poison
&lt;/h3&gt;

&lt;p&gt;When companies tie AI token usage to performance reviews, they're telling engineers: "Use the tool, regardless of whether it helps."&lt;/p&gt;

&lt;p&gt;This is the corporate equivalent of measuring developer productivity by lines of code written. It rewards volume, punishes judgment, and guarantees that the engineers who are most careful about code quality look the least productive.&lt;/p&gt;

&lt;p&gt;Engineers who know the AI output is mediocre ship it anyway, because slowing down to rewrite it makes their metrics look bad. The codebase degrades. The team slows down. The metrics still look great, because the metrics are measuring the wrong thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Looks Like Up Close
&lt;/h2&gt;

&lt;p&gt;I've seen this pattern building multi-agent systems at Ostronaut. We generate training content — presentations, videos, quizzes. Early on, the agents were fast. They produced a complete training module in minutes. The output looked good. Formatting was clean. Structure was reasonable.&lt;/p&gt;

&lt;p&gt;It was also wrong about 15-20% of the time. Not obviously wrong — subtly wrong. A slide deck where the concept progression didn't build properly. A quiz where the distractors were too close to the correct answer. A video script that repeated a key point in slightly different words, creating confusion instead of reinforcement.&lt;/p&gt;

&lt;p&gt;We didn't fix this with better prompts. We fixed it by building a validation layer — automated checks that ran after every generation step, before anything reached a human reviewer. Content validation caught conceptual errors. Design validation caught structural problems. Integration validation caught mismatches between components.&lt;/p&gt;

&lt;p&gt;That validation layer was harder to build than the generation layer. It took longer. It required more engineering judgment. And it's the only reason the system works reliably.&lt;/p&gt;

&lt;p&gt;The companies in Gergely Orosz's article skipped this step. They deployed AI agents without validation gates, measured the output volume, and declared victory. Then the incidents started.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Better Models Won't Save You
&lt;/h2&gt;

&lt;p&gt;I used to think the answer was better models. If GPT-4 produces code that's 80% reliable, GPT-5 will be 95% reliable, and eventually you won't need validation.&lt;/p&gt;

&lt;p&gt;That was wrong for two reasons.&lt;/p&gt;

&lt;p&gt;First, the remaining failures are the expensive ones. The bugs that survive better models are the subtle, context-dependent bugs that cause production incidents. Better models don't make validation cheaper — they make it more necessary, because what gets through is harder to catch.&lt;/p&gt;

&lt;p&gt;Second, the validation layer isn't just catching bugs. It's encoding team knowledge. Our quality checks embed years of domain expertise — what makes a good slide progression, what makes a quiz effective, what makes a video script clear. That knowledge doesn't exist in the model. It exists in the team. The validation layer is how you transfer institutional knowledge into the AI pipeline.&lt;/p&gt;

&lt;p&gt;Companies that skip this aren't just accepting more bugs. They're disconnecting their AI pipeline from their institutional knowledge.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Measure Instead
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What Leadership Measures&lt;/th&gt;
&lt;th&gt;What Actually Happens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PRs merged per week (+52%)&lt;/td&gt;
&lt;td&gt;Review time per PR (+40%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lines of code written (3x)&lt;/td&gt;
&lt;td&gt;Lines nobody understands (3x)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time to first commit (-60%)&lt;/td&gt;
&lt;td&gt;Time to resolve incidents (+35%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token usage per engineer&lt;/td&gt;
&lt;td&gt;Refactoring frequency (-70%)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you're measuring AI impact, stop counting PRs. Start counting these instead (a rough computation sketch follows the list):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Incident rate per AI-assisted commit&lt;/strong&gt; versus human-only commits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review time per PR&lt;/strong&gt; — is it actually decreasing, or are reviewers rubber-stamping?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Refactoring frequency&lt;/strong&gt; — is your team still simplifying code, or just adding to it?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mean-time-to-resolve&lt;/strong&gt; for bugs in AI-generated code versus human-written&lt;/li&gt;
&lt;/ol&gt;
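
&lt;p&gt;A rough sketch of the first and fourth of those, assuming commits are already tagged as AI-assisted (via tooling or PR labels). The data shapes and field names are hypothetical:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from statistics import mean

# Compare incident rate and mean-time-to-resolve for AI-assisted vs. human-only
# commits. Assumes commits carry an "ai_assisted" tag; field names are hypothetical.
def split_metrics(commits, incidents):
    """commits: [{"sha", "ai_assisted"}], incidents: [{"sha", "hours_to_resolve"}]"""
    incidents_by_sha = {}
    for incident in incidents:
        incidents_by_sha.setdefault(incident["sha"], []).append(incident["hours_to_resolve"])
    result = {}
    for label, is_ai in [("ai_assisted", True), ("human_only", False)]:
        subset = [c for c in commits if c["ai_assisted"] is is_ai]
        hours = [h for c in subset for h in incidents_by_sha.get(c["sha"], [])]
        result[label] = {
            "incident_rate": len(hours) / max(len(subset), 1),
            "mttr_hours": mean(hours) if hours else 0.0,
        }
    return result
&lt;/code&gt;&lt;/pre&gt;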

&lt;p&gt;The companies that will win with AI coding agents are not the ones that deploy them fastest. They're the ones that build the validation layer first and measure what matters — not how fast code is written, but how fast &lt;em&gt;correct&lt;/em&gt; code ships and stays correct in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Speed without verification isn't velocity. It's technical debt with a marketing budget.
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://talvinder.com/build-logs/ai-speed-lie-team-velocity/?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=ai-speed-lie-team-velocity" rel="noopener noreferrer"&gt;talvinder.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aiengineering</category>
      <category>softwareengineering</category>
      <category>engineeringleadership</category>
    </item>
    <item>
      <title>The OS-Paged Context Engine</title>
      <dc:creator>Talvinder Singh</dc:creator>
      <pubDate>Thu, 19 Mar 2026 03:40:11 +0000</pubDate>
      <link>https://forem.com/talvinder/the-os-paged-context-engine-3d7g</link>
      <guid>https://forem.com/talvinder/the-os-paged-context-engine-3d7g</guid>
      <description>&lt;p&gt;Every production agent system I've worked on has the same failure mode. Context rot. Stale artefacts silently served to the model. No audit trail for what was included or excluded. Token budgets blown with no graceful recovery. Multi-agent context bleeding across scopes.&lt;/p&gt;

&lt;p&gt;The standard fix is "use RAG." RAG solves retrieval. It doesn't solve lifecycle.&lt;/p&gt;

&lt;p&gt;The counter-argument I hear most: context windows are getting larger. Claude does 200K tokens. Gemini does 1M. Just dump everything in. The math doesn't hold. At $15 per million input tokens, stuffing 847 artefacts (~200K tokens) into every call costs $3 per inference. At 100 calls per day per agent, that's $9,000/month for a single agent. And you still can't audit what the model saw, still can't catch stale data, still can't prevent hallucinations from compounding into memory.&lt;/p&gt;

&lt;p&gt;Context has no lifecycle. That's the root cause. I went looking for prior art in constrained computing, where managing scarce resources under real-time pressure has been solved for decades.&lt;/p&gt;

&lt;h2&gt;
  
  
  Same Query, Two Outcomes
&lt;/h2&gt;

&lt;p&gt;A support agent is handling a billing escalation. The context store has 847 artefacts: ticket history, knowledge base articles, past chat transcripts, agent notes, CRM records.&lt;/p&gt;

&lt;p&gt;The query is the same. The model is the same. The only difference is what sits between the store and the prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without lifecycle management&lt;/strong&gt; (standard RAG): the agent runs a semantic search, takes the top-K matches, stuffs them in.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A refund policy from six months ago loads because it's semantically close. The policy was updated two weeks ago. The agent cites the old $200 limit to a customer whose refund should be $400 under the current policy.&lt;/li&gt;
&lt;li&gt;An agent's internal note (unreviewed, unvalidated) loads as context. The model treats a scratchpad draft as a confirmed resolution.&lt;/li&gt;
&lt;li&gt;Token budget blows out at 140%. The API silently truncates the prompt, dropping the most recent ticket update.&lt;/li&gt;
&lt;li&gt;The agent's response gets written to memory. The outdated policy is now a "fact." Next session, it compounds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;With the OS-Paged Context Engine&lt;/strong&gt;: the same 847 artefacts enter a four-stage pipeline.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Triage: 312 artefacts expire on TTL. The internal note scores below provenance threshold (SCRATCHPAD rank). The stale policy is BLACK-tagged. 20 survive for semantic scoring.&lt;/li&gt;
&lt;li&gt;Paging: a knowledge base article that &lt;em&gt;did&lt;/em&gt; survive has a dirty bit set (source updated 2 weeks ago). Re-fetched with current policy before the model sees it.&lt;/li&gt;
&lt;li&gt;Assembly: 31,200 tokens against a 40,000 budget. No truncation.&lt;/li&gt;
&lt;li&gt;Validation: response scores 0.88 confidence. Committed to memory. Below 0.7, it would have been flagged for review and &lt;em&gt;not&lt;/em&gt; persisted.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure Mode&lt;/th&gt;
&lt;th&gt;Standard RAG&lt;/th&gt;
&lt;th&gt;OS-Paged Engine&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Stale artefact loaded&lt;/td&gt;
&lt;td&gt;Serves 6-month-old policy as current&lt;/td&gt;
&lt;td&gt;TTL expires it. Dirty bit catches mid-session staleness.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unvalidated note treated as fact&lt;/td&gt;
&lt;td&gt;Loads if semantically close&lt;/td&gt;
&lt;td&gt;SCRATCHPAD provenance rank filters it in triage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token budget overflow&lt;/td&gt;
&lt;td&gt;Silent API truncation&lt;/td&gt;
&lt;td&gt;Graceful degradation through four tiers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hallucination persisted to memory&lt;/td&gt;
&lt;td&gt;Written back without checks&lt;/td&gt;
&lt;td&gt;Commit gate: low confidence triggers rollback&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audit trail&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Immutable manifest: trace ID, artefact list, tier, commit status&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every one of these is a lifecycle failure, not a retrieval failure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix: Four Borrowed Techniques
&lt;/h2&gt;

&lt;p&gt;I built a four-stage pipeline. Each stage borrows one technique from a domain that solved this class of problem decades ago. No framework lock-in. &lt;a href="https://github.com/talvinder/context-engine" rel="noopener noreferrer"&gt;Single Python file&lt;/a&gt;. Works with any LLM API.&lt;/p&gt;

&lt;p&gt;I'm calling it the &lt;strong&gt;OS-Paged Context Engine&lt;/strong&gt;, because the core insight is that your context window is RAM, your long-term memory is disk, and you need an operating system between them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ftalvinder.com%2Fframeworks%2Fos-paged-context-engine%2Fassets%2Fd2-diagram-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ftalvinder.com%2Fframeworks%2Fos-paged-context-engine%2Fassets%2Fd2-diagram-1.png" alt="Diagram 1" width="800" height="2613"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 1: Triage Scoring
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The failure it catches:&lt;/strong&gt; embedding 1,000 artefacts per call at ~1ms each = 1 second of latency before inference starts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Borrowed from: ER START Protocol, 1983.&lt;/strong&gt; You don't need full diagnosis to correctly prioritise. Score all candidates on three cheap signals first:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;R (Recency)&lt;/strong&gt; is a timestamp diff. O(1). &lt;strong&gt;P (Provenance)&lt;/strong&gt; is an enum rank: human-verified &amp;gt; RAG chunk &amp;gt; tool output &amp;gt; agent scratchpad. O(1). &lt;strong&gt;S (Semantic)&lt;/strong&gt; is cosine distance. Computed &lt;em&gt;only&lt;/em&gt; for artefacts that survive R+P filtering.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ftalvinder.com%2Fframeworks%2Fos-paged-context-engine%2Fassets%2Fd2-diagram-2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ftalvinder.com%2Fframeworks%2Fos-paged-context-engine%2Fassets%2Fd2-diagram-2.png" alt="Diagram 2" width="800" height="1768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source Type&lt;/th&gt;
&lt;th&gt;Score Bias&lt;/th&gt;
&lt;th&gt;Triage Outcome&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Human-verified memory&lt;/td&gt;
&lt;td&gt;Provenance-heavy (P=0.5)&lt;/td&gt;
&lt;td&gt;Highest priority, loaded first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAG chunk (recent)&lt;/td&gt;
&lt;td&gt;Balanced (R=0.4, S=0.4)&lt;/td&gt;
&lt;td&gt;High — recency and relevance both count&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool output&lt;/td&gt;
&lt;td&gt;Recency-heavy (R=0.5)&lt;/td&gt;
&lt;td&gt;Medium — freshness matters most&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent scratchpad&lt;/td&gt;
&lt;td&gt;Semantic-heavy (S=0.5)&lt;/td&gt;
&lt;td&gt;Low — must be highly relevant to survive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Expired artefact&lt;/td&gt;
&lt;td&gt;TTL=0&lt;/td&gt;
&lt;td&gt;Excluded before scoring even starts&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
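
&lt;p&gt;Put together, the triage stage can be sketched roughly as below. The dominant weight per source type follows the table above; splitting the remaining weight evenly, and the artefact interface itself, are my assumptions. This is a simplification of the idea, not the code in the linked repo:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from datetime import datetime, timezone

# Cheap R and P first, semantic S only for survivors. Weight profiles follow the
# table above where it names a dominant signal; the even split of the remainder is
# an assumption, as is the artefact interface (ttl_expired, created_at, provenance,
# embedding).
WEIGHTS = {
    "HUMAN_VERIFIED": {"R": 0.25, "P": 0.50, "S": 0.25},
    "RAG_CHUNK":      {"R": 0.40, "P": 0.20, "S": 0.40},
    "TOOL_OUTPUT":    {"R": 0.50, "P": 0.25, "S": 0.25},
    "SCRATCHPAD":     {"R": 0.25, "P": 0.25, "S": 0.50},
}
PROVENANCE_RANK = {"HUMAN_VERIFIED": 1.0, "RAG_CHUNK": 0.7, "TOOL_OUTPUT": 0.5, "SCRATCHPAD": 0.2}

def triage(artefacts, query_embedding, cosine, keep=20, floor=0.35):
    """O(1) recency and provenance gate everything; embeddings run only on survivors."""
    now = datetime.now(timezone.utc)
    survivors = []
    for artefact in artefacts:
        if artefact.ttl_expired(now):                     # expired: excluded before scoring
            continue
        r = 1.0 / (1.0 + (now - artefact.created_at).total_seconds() / 3600.0)
        p = PROVENANCE_RANK[artefact.provenance]
        if (r + p) / 2 &amp;lt; floor:                          # cheap gate, no embedding cost yet
            continue
        survivors.append((artefact, r, p))
    scored = []
    for artefact, r, p in survivors:
        w = WEIGHTS[artefact.provenance]
        s = cosine(query_embedding, artefact.embedding)   # computed only for survivors
        scored.append((w["R"] * r + w["P"] * p + w["S"] * s, artefact))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [artefact for _, artefact in scored[:keep]]
&lt;/code&gt;&lt;/pre&gt;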

&lt;h2&gt;
  
  
  Stage 2: Paged Context Store
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The failure it catches:&lt;/strong&gt; serving stale context because nobody checked whether the source changed since it was loaded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Borrowed from: OS Virtual Memory, 1962.&lt;/strong&gt; The page table decided what lived in fast memory, evicted least-recently-used pages, and tracked modifications via a dirty bit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LRU eviction&lt;/strong&gt;: when the window is full, evict what was accessed longest ago. &lt;strong&gt;Dirty bit&lt;/strong&gt;: if the source changed since the artefact was loaded, flag it dirty and re-fetch before use.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;access&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;artefact_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;art&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_lru&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;artefact_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;current_hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_long_term&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;artefact_id&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_hash&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;art&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_source_hash&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;art&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_dirty&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;        &lt;span class="c1"&gt;# source changed → force re-fetch
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_lru&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;move_to_end&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;artefact_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# promote to MRU
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;art&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
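
&lt;p&gt;The complementary eviction path, as a minimal sketch. It assumes the same &lt;code&gt;_lru&lt;/code&gt; ordered dict that the &lt;code&gt;access&lt;/code&gt; method above promotes into; the window size and field names are illustrative.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def load_page(self, artefact_id, artefact, max_entries=128):
    """Admit an artefact to the hot window, evicting least-recently-used entries first."""
    while len(self._lru) &amp;gt;= max_entries:
        self._lru.popitem(last=False)         # drop the coldest entry; the
                                              # long-term store still holds it
    artefact._source_hash = hash(artefact.content)   # snapshot for later dirty checks
    artefact._dirty = False
    self._lru[artefact_id] = artefact         # new entries enter as most recently used
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;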



&lt;p&gt;RAG retrieves once and serves forever. A paged store tracks whether the source has changed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 3: Speculative Assembly
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The failure it catches:&lt;/strong&gt; hallucinations compounding across sessions because agent-generated context is written to memory without validation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Borrowed from: CPU Reorder Buffer, Intel P6, 1995.&lt;/strong&gt; Execute speculatively, hold results in a buffer, commit only when confirmed valid. Wrong? Rollback.&lt;/p&gt;

&lt;p&gt;Assemble context optimistically. Start inference. If confidence exceeds the threshold, commit to memory. If not, flag for human review and do not write to the long-term store. Without this gate, session one's hallucination becomes session two's "memory" becomes session three's "fact."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# After model responds:
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;evaluator_confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;committed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;       &lt;span class="c1"&gt;# safe to write to long-term store
&lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;flagged_for_review&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;  &lt;span class="c1"&gt;# hold — do not persist
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At &lt;a href="https://ostronaut.com" rel="noopener noreferrer"&gt;Ostronaut&lt;/a&gt;, we saw exactly this: unvalidated agent-generated context compounding into confidently wrong output downstream. The commit gate cut that class of failure by roughly half.&lt;/p&gt;

&lt;p&gt;Here's the falsifiable claim: &lt;strong&gt;any multi-agent system without a commit/rollback gate on context writes will compound hallucinations across sessions within 30 days of production use.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 4: Graceful Degradation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The failure it catches:&lt;/strong&gt; token budget overflows that crash the API call or silently truncate critical context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Borrowed from: Radio Programme Stack, 1930s.&lt;/strong&gt; Dead air was never an option. When a segment overran, the producer dropped to the next item in the stack. The broadcast always continued.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Triggers at&lt;/th&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 (Full)&lt;/td&gt;
&lt;td&gt;&amp;lt; 80% budget&lt;/td&gt;
&lt;td&gt;All triage winners&lt;/td&gt;
&lt;td&gt;Happy path. Everything fits.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2 (Summarised)&lt;/td&gt;
&lt;td&gt;80-95%&lt;/td&gt;
&lt;td&gt;Compress memories, truncate RAG&lt;/td&gt;
&lt;td&gt;Chat transcripts become 200-token summaries.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3 (Core only)&lt;/td&gt;
&lt;td&gt;95-110%&lt;/td&gt;
&lt;td&gt;Human-verified facts + system prompt&lt;/td&gt;
&lt;td&gt;Only ground truth. Scratchpad and RAG dropped.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4 (Minimal)&lt;/td&gt;
&lt;td&gt;&amp;gt; 110%&lt;/td&gt;
&lt;td&gt;System prompt only. Human review flag.&lt;/td&gt;
&lt;td&gt;Emergency. Escalate.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
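
&lt;p&gt;A standalone sketch of the fallback stack using the thresholds from the table. The manifest attributes, the &lt;code&gt;summarise&lt;/code&gt; helper, and the source-type names are assumptions; the real &lt;code&gt;fallback_stack.degrade&lt;/code&gt; signature may differ.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def degrade(manifest, budget, summarise):
    """Walk down the tiers until the context fits; never crash, never silently truncate."""
    ratio = manifest.token_count / budget

    if ratio &amp;lt; 0.8:                              # Tier 1: everything fits
        return manifest

    if ratio &amp;lt;= 0.95:                            # Tier 2: compress before dropping
        for art in manifest.artefacts:
            if art.source_type in ("chat_memory", "rag_chunk"):
                art.content = summarise(art.content, max_tokens=200)
        manifest.recount_tokens()
        return manifest

    if ratio &amp;lt;= 1.10:                            # Tier 3: ground truth only
        manifest.artefacts = [a for a in manifest.artefacts
                              if a.source_type in ("system_prompt", "human_verified")]
        manifest.recount_tokens()
        return manifest

    manifest.artefacts = [a for a in manifest.artefacts      # Tier 4: emergency
                          if a.source_type == "system_prompt"]
    manifest.flagged_for_review = True             # escalate to a human
    manifest.recount_tokens()
    return manifest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;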

&lt;h2&gt;
  
  
  The Composed Pipeline
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;assemble_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;candidates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_candidates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;scored&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;triage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;loaded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_page&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token_budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;manifest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;speculator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;assemble&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loaded&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;token_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;manifest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fallback_stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;degrade&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;manifest&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every call produces an immutable manifest. When the compliance team asks "why did the agent say that?" you hand them the manifest.&lt;/p&gt;
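&lt;p&gt;What that manifest might carry, as a sketch. The field names are illustrative assumptions; the contents are the ones discussed here: which artefacts were loaded, the token count, which degradation tier fired, and whether the commit gate passed.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ContextManifest:
    """One record per generation call; append-only once written to the audit log."""
    trace_id: str                     # ties the manifest to a single generation request
    artefact_ids: list[str]           # exactly what the model saw, in load order
    token_count: int
    degradation_tier: int             # 1-4, from the fallback stack
    committed: bool = False           # set by the commit gate after evaluation
    flagged_for_review: bool = False
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;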

&lt;p&gt;I &lt;a href="https://dev.to/frameworks/agent-context-is-infrastructure/"&gt;argued previously&lt;/a&gt; that context is infrastructure, not a feature. This is the implementation pattern.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Got Wrong
&lt;/h2&gt;

&lt;p&gt;The first version didn't have the two-pass triage. Every artefact got embedded on every call. At 1ms per embedding multiplied by 1,000 artefacts, that's a full second of latency before inference starts. Adding R+P pre-filtering dropped that to roughly 20 embeddings per call. The two-pass approach seems obvious in retrospect. It's literally how ER triage works. But the RAG literature doesn't teach you to pre-filter before embedding.&lt;/p&gt;

&lt;p&gt;The other mistake: not implementing the dirty bit from day one. We had artefacts in the context window from external tools that had returned fresh data hours ago. The model was reasoning about stale state. Adding dirty bit tracking on access (not just on write) was a one-line fix that eliminated an entire class of silent failures.&lt;/p&gt;

&lt;p&gt;The third mistake is in the commit gate itself. The code checks &lt;code&gt;evaluator_confidence &amp;gt;= 0.7&lt;/code&gt;, but who computes that score? If the model self-evaluates, you're trusting the same system that may have hallucinated to judge whether it hallucinated. LLM confidence self-assessment is poorly calibrated. The honest answer: the library deliberately does not compute confidence. The caller must supply it via an external evaluator, a rule-based checker, or human-in-the-loop for high-stakes domains. The commit gate is necessary. What sits behind it is not yet solved.&lt;/p&gt;
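
&lt;p&gt;One way to keep the gate while refusing to compute confidence inside the library: make the evaluator an explicit argument. A sketch under that assumption; the callable could wrap a second model, a deterministic rule set, or a human review queue.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from typing import Callable

def commit_gate(manifest, response, evaluator: Callable[..., float], threshold=0.7):
    """The library never scores its own output; the caller supplies the judge."""
    confidence = evaluator(response, manifest)   # external model, rule set, or human
    if confidence &amp;gt;= threshold:
        manifest.committed = True                # safe to persist to the long-term store
    else:
        manifest.flagged_for_review = True       # hold; nothing is written
    return manifest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;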

&lt;h2&gt;
  
  
  When This Pattern Is Overkill
&lt;/h2&gt;

&lt;p&gt;Not every agent needs lifecycle management. If your agent doesn't write to its own memory and doesn't persist across sessions, standard RAG is sufficient. Single-session chatbots, prototypes with fewer than 100 artefacts, read-only Q&amp;amp;A over a fixed corpus: the overhead of triage, paging, and commit gates exceeds the benefit. This pattern pays off when context has a lifecycle. If it doesn't, skip it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Still Open
&lt;/h2&gt;

&lt;p&gt;What remains genuinely unresolved is governance at scale. When an agent has six months of context about a customer, who owns it? What happens under GDPR deletion requests? Do you tombstone or purge? If you purge, does the agent's behaviour change in ways that affect other customers? I'm &lt;a href="https://dev.to/frameworks/context-governance-at-scale/"&gt;working through that question next&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The &lt;a href="https://github.com/talvinder/context-engine" rel="noopener noreferrer"&gt;full library&lt;/a&gt; is a single Python file, zero dependencies, open for anyone building production agents. The techniques are borrowed. The composition is yours to steal.
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://talvinder.com/frameworks/os-paged-context-engine/?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=os-paged-context-engine" rel="noopener noreferrer"&gt;talvinder.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>contextengineering</category>
      <category>agenticsystems</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Context Governance at Scale</title>
      <dc:creator>Talvinder Singh</dc:creator>
      <pubDate>Thu, 19 Mar 2026 03:34:19 +0000</pubDate>
      <link>https://forem.com/talvinder/context-governance-at-scale-857</link>
      <guid>https://forem.com/talvinder/context-governance-at-scale-857</guid>
      <description>&lt;p&gt;The &lt;a href="https://dev.to/frameworks/os-paged-context-engine/"&gt;OS-Paged Context Engine&lt;/a&gt; handles the technical lifecycle: what loads, what gets evicted, what passes validation. It produces an immutable manifest for every call. But the manifest tells you &lt;em&gt;what&lt;/em&gt; the model saw. It does not tell you whether it &lt;em&gt;should&lt;/em&gt; have seen it.&lt;/p&gt;

&lt;p&gt;Production agents that handle money, health data, or customer PII need a governance layer above the pipeline. Access control, retention policies, deletion rights, multi-tenant isolation. These are governance problems, not engineering problems. And the industry has not solved them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Manifest Is Not Enough
&lt;/h2&gt;

&lt;p&gt;An audit manifest records: trace ID, artefact list, token count, degradation tier, commit status. If a compliance officer asks "what did the agent access?" you can answer. That's table stakes.&lt;/p&gt;

&lt;p&gt;The harder questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Should the agent have had access to that customer's payment history during a routine support query?&lt;/li&gt;
&lt;li&gt;The artefact was loaded from a shared scope. Three other agents also read it. One of them serves a competitor's account. Is that a data leak?&lt;/li&gt;
&lt;li&gt;The agent's response was committed to memory at confidence 0.85. Six months later, the customer invokes GDPR Article 17. Do you delete the artefact, the memory derived from it, or both?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These questions have no clean technical answer. They require policy, and policy requires architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  When This Pattern Is Overkill
&lt;/h2&gt;

&lt;p&gt;Not every agent needs governance. Here's the decision tree:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skip lifecycle management entirely if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The agent is single-session. No memory persists. Standard RAG is sufficient.&lt;/li&gt;
&lt;li&gt;The corpus is small and static (fewer than 100 documents, updated quarterly). Triage and paging overhead exceeds the benefit.&lt;/li&gt;
&lt;li&gt;The agent is read-only. Never writes to its own memory. No compounding hallucination risk.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use the technical pipeline but skip governance if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The agent handles non-sensitive data. Productivity tools, code assistants, research summarisers. No PII, no financial data, no health records.&lt;/li&gt;
&lt;li&gt;Single-tenant deployment. One company, one agent, no cross-customer context risk.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;You need the full governance layer when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The agent handles PII, financial data, or health records&lt;/li&gt;
&lt;li&gt;Multiple agents share a context store across customers or tenants&lt;/li&gt;
&lt;li&gt;You operate in a regulated industry (healthcare, insurance, financial services)&lt;/li&gt;
&lt;li&gt;The agent persists context for months and customers have deletion rights&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's the falsifiable claim: &lt;strong&gt;by 2028, any agent system handling PII without an auditable context manifest will fail compliance review in regulated industries.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  GDPR and the Tombstone Problem
&lt;/h2&gt;

&lt;p&gt;A customer requests deletion under GDPR Article 17. You purge their artefacts from the context store. The manifests that referenced those artefacts still exist in the audit log. The agent's behaviour was shaped by context that no longer exists.&lt;/p&gt;

&lt;p&gt;Two approaches, neither clean:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Purge completely.&lt;/strong&gt; Delete artefacts, delete manifests, delete any memory derived from those artefacts. The agent's future behaviour changes because the context that shaped prior decisions is gone. If Agent B's response was informed by Agent A's output, which was informed by the deleted customer data, do you cascade the deletion?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tombstone.&lt;/strong&gt; Replace artefact content with a deletion marker: "Artefact deleted per GDPR request, [date]." Manifests remain intact for audit. The agent knows something was here but not what. This preserves audit trail integrity but may not satisfy a strict interpretation of the "right to erasure."&lt;/p&gt;
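
&lt;p&gt;A sketch of the tombstone path, assuming simple artefact records with a &lt;code&gt;content&lt;/code&gt; field and a store with &lt;code&gt;get&lt;/code&gt;/&lt;code&gt;put&lt;/code&gt;; the marker text follows the wording above, the rest is illustrative.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import date

def tombstone_artefact(store, artefact_id, request_date: date):
    """Erase the content, keep the audit skeleton: manifests stay intact, data is gone."""
    art = store.get(artefact_id)
    art.content = f"Artefact deleted per GDPR request, {request_date.isoformat()}"
    art.embedding = None              # derived representations are purged as well
    art.tombstoned = True
    store.put(art)
    # Manifests that referenced this artefact are left untouched: the trace still
    # shows that something was loaded for this call, but no longer what it said.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;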

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuzxny62wbtts193euwtw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuzxny62wbtts193euwtw.png" alt="Diagram 1" width="800" height="160"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The honest answer: I don't know which is correct. The legal interpretation of "erasure" applied to derived AI context is untested in European courts. What I do know is that you need the manifest layer to even have this conversation. Without an audit trail, you cannot comply with a deletion request at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  Compliance as Architecture
&lt;/h2&gt;

&lt;p&gt;Enterprise buyers in healthcare, financial services, and insurance ask one question first: can you prove what the agent accessed?&lt;/p&gt;

&lt;p&gt;The context manifest maps directly to compliance frameworks they already understand:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Compliance Requirement&lt;/th&gt;
&lt;th&gt;What It Maps To&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SOC2 audit logs&lt;/td&gt;
&lt;td&gt;Context manifest (trace ID, artefact list, timestamp)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HIPAA access logs&lt;/td&gt;
&lt;td&gt;Manifest + agent_scope (who accessed what)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GDPR Article 15 (right of access)&lt;/td&gt;
&lt;td&gt;Manifest query: "all artefacts accessed for customer X"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GDPR Article 17 (right to erasure)&lt;/td&gt;
&lt;td&gt;Artefact deletion + manifest tombstoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PCI-DSS data isolation&lt;/td&gt;
&lt;td&gt;agent_scope + namespace isolation per tenant&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;But &lt;code&gt;agent_scope&lt;/code&gt; alone is not sufficient for multi-tenant isolation. In the current implementation, scope is a string tag. No encryption boundary, no policy engine, no access control list. A developer who writes &lt;code&gt;agent_scope="global"&lt;/code&gt; on a PII artefact has just leaked it to every agent in the system.&lt;/p&gt;

&lt;p&gt;Production multi-tenant context isolation requires: namespace enforcement (scope is a hard boundary, not a suggestion), policy-as-code (which scopes can read which artefact types), encryption at rest per tenant, and audit logging on every cross-scope access attempt.&lt;/p&gt;
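
&lt;p&gt;What scope-as-a-hard-boundary might look like, as a sketch. The policy table, scope names, and audit log shape are assumptions; none of this exists in the current implementation, which is exactly the gap.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Policy-as-code: which artefact types each scope may read.
READ_POLICY = {
    "tenant:acme":   {"human_verified", "rag_chunk", "tool_output"},
    "tenant:globex": {"human_verified", "rag_chunk", "tool_output"},
    "global":        {"human_verified"},       # the shared scope sees only vetted facts
}

def authorise_read(agent_scope, artefact, audit_log):
    """Scope is enforced, not suggested; every denied cross-scope attempt is logged."""
    allowed = (artefact.scope == agent_scope
               and artefact.source_type in READ_POLICY.get(agent_scope, set()))
    if not allowed:
        audit_log.append({"event": "cross_scope_read_denied",
                          "agent_scope": agent_scope,
                          "artefact_id": artefact.id,
                          "artefact_scope": artefact.scope})
    return allowed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;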

&lt;h2&gt;
  
  
  What I Don't Know Yet
&lt;/h2&gt;

&lt;p&gt;The technical primitives for context governance exist: manifests, scopes, commit gates, audit logs. What doesn't exist is the organisational trust model.&lt;/p&gt;

&lt;p&gt;When an agent makes a decision based on six months of accumulated context, who is accountable? The engineer who built the pipeline? The data team that ingested the artefacts? The compliance officer who approved the retention policy?&lt;/p&gt;

&lt;p&gt;Kubernetes solved compute governance by making infrastructure declarative. You declare what you want, the system ensures it. Context governance needs the same shift: declare what the agent &lt;em&gt;should&lt;/em&gt; access, and the system enforces it. We're not there yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  The &lt;a href="https://dev.to/frameworks/os-paged-context-engine/"&gt;technical pipeline&lt;/a&gt; is built. The &lt;a href="https://dev.to/frameworks/agent-context-is-infrastructure/"&gt;infrastructure argument&lt;/a&gt; is established. The governance layer is the missing piece. I'm building it in the open, and I don't have all the answers.
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://talvinder.com/frameworks/context-governance-at-scale/?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=context-governance-at-scale" rel="noopener noreferrer"&gt;talvinder.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>contextengineering</category>
      <category>agenticsystems</category>
      <category>compliance</category>
    </item>
    <item>
      <title>What Zari-Zardozi Teaches Us About Agent Coordination</title>
      <dc:creator>Talvinder Singh</dc:creator>
      <pubDate>Wed, 18 Mar 2026 14:55:25 +0000</pubDate>
      <link>https://forem.com/talvinder/what-zari-zardozi-teaches-us-about-agent-coordination-4dfp</link>
      <guid>https://forem.com/talvinder/what-zari-zardozi-teaches-us-about-agent-coordination-4dfp</guid>
      <description>&lt;p&gt;In a Zari-Zardozi workshop in Old Delhi, six artisans work on a single bridal dupatta. One creates the base pattern. Another applies the metallic thread. A third adds sequins. A fourth handles the edge work. They don't talk much. They don't pass the fabric in strict sequence. Yet the final piece is coherent---every motif aligned, every border continuous, every layer building on the last.&lt;/p&gt;

&lt;p&gt;This is not romantic craft nostalgia. This is a coordination architecture that's been production-tested for 400 years.&lt;/p&gt;

&lt;p&gt;I'm calling this pattern &lt;strong&gt;Layered Autonomy&lt;/strong&gt;, not because the world needs another framework, but because most multi-agent AI systems fail at exactly what Zari-Zardozi solves: how to give workers genuine autonomy while maintaining system-level coherence.&lt;/p&gt;

&lt;h2&gt;
  
  
  We're Building Agent Systems Wrong
&lt;/h2&gt;

&lt;p&gt;The dominant pattern is the command-and-control planner: a central orchestrator that assigns tasks, waits for results, then decides the next step. It's sequential. It's brittle. It doesn't scale.&lt;/p&gt;

&lt;p&gt;At Ostronaut, we initially built exactly this architecture. A central planner that coordinated a fleet of specialist agents to generate training content: slides, videos, quizzes. The planner would call one agent, wait for output, call the next, wait again, then call the quality checker. Linear dependency chains everywhere.&lt;/p&gt;

&lt;p&gt;It worked for simple cases. It collapsed under complexity.&lt;/p&gt;

&lt;p&gt;The problem wasn't the agents. The problem was the coordination model. We were building assembly lines when we needed something closer to a Zari workshop.&lt;/p&gt;

&lt;p&gt;Layered Autonomy is the alternative: agents work in parallel on shared context, with loose coupling and tight coherence. Not through constant communication. Through shared understanding of the end state.&lt;/p&gt;

&lt;h2&gt;
  
  
  Four Lessons from the Embroidery Floor
&lt;/h2&gt;

&lt;p&gt;Here's the falsifiable claim: &lt;strong&gt;Agent systems that implement layered autonomy---where workers operate on shared context with clear role boundaries but loose temporal coupling---will outperform planner-orchestrated systems on tasks requiring iterative refinement by at least 40% in both speed and quality.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Zari-Zardozi model teaches four specific lessons:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Specialization Without Silos
&lt;/h3&gt;

&lt;p&gt;In a Zari workshop, the nakshi maker creates patterns. The zari worker applies metallic thread. The sequin specialist adds embellishments. Each role is distinct. But they're not isolated; every artisan understands the full design.&lt;/p&gt;

&lt;p&gt;Most agent architectures get this wrong. They create specialists (a content agent, a research agent, a quality agent) but treat them as black boxes. The planner knows what each agent does. The agents don't know about each other.&lt;/p&gt;

&lt;p&gt;This creates artificial dependencies. The content agent can't start until research is "done." The quality agent can't run until content is "complete." You've built specialists, but you've also built a bottleneck.&lt;/p&gt;

&lt;p&gt;The Zari model is different. The zari worker doesn't wait for the nakshi maker to finish the entire pattern. They work on completed sections while new sections are still being drawn. Parallel execution on shared context.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Shared Context as Infrastructure
&lt;/h3&gt;

&lt;p&gt;The critical insight: Zari artisans don't coordinate through constant communication. They coordinate through shared access to the evolving artifact.&lt;/p&gt;

&lt;p&gt;The fabric is the coordination layer. Every artisan can see what others have done. Every artisan can see what's left to do. The pattern itself carries the context.&lt;/p&gt;

&lt;p&gt;In agent systems, this means: stop passing messages. Start sharing state.&lt;/p&gt;

&lt;p&gt;We rebuilt Ostronaut's coordination layer around this principle. Instead of agents calling each other sequentially, they all operate on a shared representation of the content being generated. One agent writes structure. Another reads that structure and writes content. A third reads and annotates quality issues. A fourth reads and generates media assets.&lt;/p&gt;

&lt;p&gt;No agent waits for another agent to "finish." They work on whatever parts of the shared state are ready for their contribution.&lt;/p&gt;

&lt;p&gt;The result: generation time dropped by more than half. Not because the agents got faster. Because they stopped waiting.&lt;/p&gt;
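
&lt;p&gt;A compressed sketch of what "operate on shared state" means in practice: one worker publishes outline sections as they become ready, another drafts whatever sections it finds ready, and neither ever calls the other. The state shape and agent names are invented for illustration, not Ostronaut's actual schema.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio

shared_state = {"sections": {}}      # the "fabric": every agent reads and writes here

async def structure_agent():
    for name in ("intro", "body", "summary"):
        shared_state["sections"][name] = {"outline": f"outline for {name}"}
        await asyncio.sleep(0.1)     # sections appear incrementally, not all at once

async def content_agent():
    done = set()
    while len(done) &amp;lt; 3:
        for name, sec in list(shared_state["sections"].items()):
            if "outline" in sec and name not in done:
                sec["draft"] = f"draft built on {sec['outline']}"   # act on what's ready
                done.add(name)
        await asyncio.sleep(0.05)    # poll the shared artifact; never call the peer

async def main():
    await asyncio.gather(structure_agent(), content_agent())

asyncio.run(main())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;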

&lt;h3&gt;
  
  
  3. Iterative Refinement Over Perfect Planning
&lt;/h3&gt;

&lt;p&gt;Zari-Zardozi work proceeds in layers. Base stitch first. Then zari. Then sequins. Then finishing touches. Each layer builds on the last. Each layer can be evaluated independently.&lt;/p&gt;

&lt;p&gt;The master craftsperson doesn't plan every stitch upfront. They plan the overall design, then let each layer emerge.&lt;/p&gt;

&lt;p&gt;Most agent planners do the opposite. They try to decompose the entire task upfront into a perfect sequence of subtasks. This fails for two reasons:&lt;/p&gt;

&lt;p&gt;First, you can't know what subtasks you'll need until you see the results of earlier work. If the structure agent generates a complex nested outline, the content agent might need to split its work differently than if the outline is flat.&lt;/p&gt;

&lt;p&gt;Second, perfect planning is expensive. You spend tokens and time trying to predict every edge case, when you could just execute and adapt.&lt;/p&gt;

&lt;p&gt;The Zari model: plan the layers, not the stitches. In agent terms: define the phases (structure → content → quality → assets), but let agents decide how to execute within their phase.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. The Master as Orchestrator, Not Micromanager
&lt;/h3&gt;

&lt;p&gt;The ustad (master craftsperson) in a Zari workshop doesn't do the embroidery. They ensure coherence. They check alignment. They decide when a layer is ready for the next phase.&lt;/p&gt;

&lt;p&gt;This is not a planner in the traditional sense. The ustad doesn't assign every task. They maintain the quality bar and the overall vision.&lt;/p&gt;

&lt;p&gt;In agent architectures, this means: the orchestrator's job is to manage transitions between layers, not to micromanage within layers.&lt;/p&gt;

&lt;p&gt;Our current Ostronaut orchestrator does three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Validates that each layer meets quality gates before the next layer starts&lt;/li&gt;
&lt;li&gt;Handles failures by deciding whether to retry or skip&lt;/li&gt;
&lt;li&gt;Maintains the audit trail of what happened and why&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It doesn't decide which specific content to generate or which specific assets to create. That's the workers' job.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Performance Difference
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Old architecture (planner-orchestrated):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fully sequential, with zero parallelization&lt;/li&gt;
&lt;li&gt;Frequent timeouts from long dependency chains&lt;/li&gt;
&lt;li&gt;Every new content type required rewriting the planner's logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;New architecture (layered autonomy):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agents overlap: structure and content generation run concurrently where possible&lt;/li&gt;
&lt;li&gt;Failures are isolated to individual layers instead of cascading&lt;/li&gt;
&lt;li&gt;New content types require a new specialist agent and a validation gate; the orchestrator doesn't change&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The speed improvement matters. But the bigger win is adaptability. When we added a new content type (interactive games), the old architecture required rewriting the planner's task decomposition logic. The new architecture required adding a new worker agent that knows how to operate on the shared state. The orchestrator didn't change.&lt;/p&gt;

&lt;p&gt;This is the Zari-Zardozi lesson: when you add a new type of embellishment to the craft, you don't retrain every artisan. You bring in a specialist who understands the shared language of the fabric.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern
&lt;/h2&gt;

&lt;p&gt;Most teams building multi-agent systems are building assembly lines. Sequential. Rigid. Optimized for predictability.&lt;/p&gt;

&lt;p&gt;The Zari-Zardozi model suggests a different architecture: shared context, layered execution, loose coupling, tight coherence.&lt;/p&gt;

&lt;p&gt;This isn't a metaphor. It's a specific architectural pattern:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Planner-Orchestrated&lt;/th&gt;
&lt;th&gt;Layered Autonomy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Agents call each other sequentially&lt;/td&gt;
&lt;td&gt;Agents operate on shared state in parallel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Planner decides all subtasks upfront&lt;/td&gt;
&lt;td&gt;Orchestrator manages phase transitions only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failure in one agent blocks the chain&lt;/td&gt;
&lt;td&gt;Failure in one agent is isolated to its layer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Adding new capabilities requires replanning logic&lt;/td&gt;
&lt;td&gt;Adding new capabilities requires new worker + validation gate&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The hard part isn't building the agents. The hard part is building the coordination layer: the equivalent of the fabric that Zari artisans work on.&lt;/p&gt;

&lt;p&gt;For us, it's a structured representation of the content being generated. For other systems, it might be a knowledge graph, a vector store, or a shared document. The specific technology matters less than the principle: give agents shared context, clear boundaries, and the autonomy to execute within their layer.&lt;/p&gt;

&lt;p&gt;What I don't know yet: how to build trust in systems where no single agent "owns" the output. When something goes wrong, users want to know which agent failed. In a layered system, failure is often emergent---the output is coherent at each layer but incoherent overall.&lt;/p&gt;

&lt;p&gt;The Zari workshop solves this through the master craftsperson's eye. They can see when the overall composition is off, even if each individual element is well-executed.&lt;/p&gt;

&lt;p&gt;We don't have a good equivalent yet. Validation gates catch obvious failures. But subtle incoherence (content that's technically correct but doesn't serve the learning objective) still slips through.&lt;/p&gt;

&lt;h2&gt;
  
  
  More on this as I work through it.
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://talvinder.com/field-notes/home-based-craft-vs-agent-work/?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=home-based-craft-vs-agent-work" rel="noopener noreferrer"&gt;talvinder.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agenticsystems</category>
      <category>aiarchitecture</category>
      <category>india</category>
    </item>
    <item>
      <title>Why Consensus Voting Fails for Agent Truthfulness</title>
      <dc:creator>Talvinder Singh</dc:creator>
      <pubDate>Wed, 18 Mar 2026 14:55:21 +0000</pubDate>
      <link>https://forem.com/talvinder/why-consensus-voting-fails-for-agent-truthfulness-105f</link>
      <guid>https://forem.com/talvinder/why-consensus-voting-fails-for-agent-truthfulness-105f</guid>
      <description>&lt;p&gt;Pass@k is the most popular reliability pattern in production agent systems right now. Run the same task k times, take a majority vote on the output, ship the consensus answer. It works beautifully for code generation — a function either passes the test suite or it doesn't. The objective verification is external to the agents.&lt;/p&gt;

&lt;p&gt;For factual accuracy, the pattern collapses. And most teams deploying it haven't figured out why yet.&lt;/p&gt;

&lt;p&gt;The failure is structural, not probabilistic. Consensus voting assumes that errors are independent and randomly distributed. If Agent A hallucinates, Agent B probably won't hallucinate the same thing. With enough agents, truth wins by majority. This assumption holds for coding tasks because the test suite is the arbiter. It does not hold for factual claims because there is no test suite for truth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three failure modes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Correlated hallucination.&lt;/strong&gt; LLMs trained on similar data hallucinate in similar ways. Ask three instances of the same frontier model whether a specific paper exists, and if the title sounds plausible, all three will confidently confirm it. The errors aren't independent — they're correlated by training distribution. Majority vote amplifies the shared bias instead of cancelling it.&lt;/p&gt;

&lt;p&gt;This is not a theoretical concern. A recent formal analysis showed that Pass@k reliability for factual tasks degrades rather than improves as k increases, precisely because the error correlation exceeds the independence assumption. More agents, worse answers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The popularity trap.&lt;/strong&gt; Consensus selects for the most common answer, not the most accurate one. In domains where the popular understanding is wrong — emerging science, contrarian market analysis, novel technical approaches — consensus voting systematically suppresses correct minority positions.&lt;/p&gt;

&lt;p&gt;Three agents asked whether a particular drug interaction is dangerous will converge on whatever the training data's majority position is. If the latest research contradicts the common understanding, the consensus will be confidently, democratically wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strategic ambiguity.&lt;/strong&gt; When agents are optimized for agreement (as many multi-agent debate frameworks encourage), they learn to hedge toward safe, middle-ground positions. Not because the middle ground is true, but because it minimizes disagreement. The agents aren't lying — they're conflict-averse. The output reads as measured and reasonable. It's also systematically biased toward conventional wisdom.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters now
&lt;/h2&gt;

&lt;p&gt;The "just run it three times" pattern is spreading fast. Every agentic framework has a retry-and-vote mechanism. LangChain, CrewAI, AutoGen — all support multi-agent voting as a reliability strategy. The assumption that consensus equals reliability is baked into the tooling.&lt;/p&gt;

&lt;p&gt;Production systems using this pattern for anything beyond code generation are carrying unquantified risk. Customer-facing chatbots, research assistants, medical information systems, financial analysis tools — all domains where correlated hallucination is more dangerous than a single wrong answer, because the consensus gives the appearance of validation.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually works
&lt;/h2&gt;

&lt;p&gt;The fix is not more agents or better prompts. It's structural.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Separate generation from verification.&lt;/strong&gt; The agent that produces the answer must not be the same agent (or same architecture) that verifies it. Verification requires a different model, different training data, or — ideally — a non-LLM check against a ground-truth source. At Ostronaut, our validation agents use rule-based scoring with deterministic rubrics, not LLM-as-judge. The quality gate is independent of the generation pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adversarial framing over cooperative framing.&lt;/strong&gt; Multi-agent debate works better when agents are explicitly tasked with finding flaws in each other's outputs rather than converging on agreement. The incentive must be to disprove, not to confirm. This is the opposite of how most consensus systems are designed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Confidence-weighted routing.&lt;/strong&gt; Instead of majority vote, weight each agent's contribution by its calibrated confidence on that specific task type. An agent that is well-calibrated on medical queries but poorly calibrated on legal queries should have different voting weights in each domain. This requires per-domain calibration data, which most teams don't collect.&lt;/p&gt;
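
&lt;p&gt;A sketch of confidence-weighted voting, assuming per-domain calibration numbers already exist (which, as noted, most teams don't collect). The agents, domains, and weights are invented for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import defaultdict

# Calibration measured offline per domain, e.g. accuracy on a held-out labelled set.
CALIBRATION = {
    ("agent_a", "medical"): 0.90, ("agent_a", "legal"): 0.40,
    ("agent_b", "medical"): 0.45, ("agent_b", "legal"): 0.80,
    ("agent_c", "medical"): 0.40, ("agent_c", "legal"): 0.50,
}

def weighted_vote(answers, domain):
    """answers: {agent_name: answer}. Weight by calibration instead of counting heads."""
    scores = defaultdict(float)
    for agent, answer in answers.items():
        scores[answer] += CALIBRATION.get((agent, domain), 0.0)
    return max(scores, key=scores.get)

# A naive 2-to-1 majority is overturned by the better-calibrated specialist:
print(weighted_vote({"agent_a": "interaction is dangerous",
                     "agent_b": "no known interaction",
                     "agent_c": "no known interaction"}, "medical"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;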

&lt;p&gt;&lt;strong&gt;External anchoring.&lt;/strong&gt; For factual claims, the gold standard is retrieval-augmented verification — check the claim against a curated, trustworthy source. Not RAG for generation (which has its own problems), but RAG specifically for post-generation verification. The verification retrieval corpus should be smaller and higher-quality than the generation corpus.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pattern that misled us
&lt;/h2&gt;

&lt;p&gt;The success of ensemble methods in machine learning created an intuition that more models = more reliability. In classical ML, this is largely true — bagging and boosting work because the base models have uncorrelated errors on well-defined features.&lt;/p&gt;

&lt;p&gt;LLMs break this assumption. The base models share training data, architecture families, and optimization objectives. Their errors are correlated by construction. Treating them as independent voters is a category error borrowed from a domain where the independence assumption actually held.&lt;/p&gt;

&lt;p&gt;I made this mistake early. When we built the multi-agent system, I assumed that running the content generation through multiple agents and selecting the best output would improve reliability. It didn't. The agents agreed on the wrong things more often than they disagreed on the right things. We got reliability only after we separated the generation and verification functions entirely and made the verification independent of the generation architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  The open question
&lt;/h2&gt;

&lt;p&gt;If consensus doesn't work for truthfulness, what's the right reliability primitive for multi-agent systems operating on factual domains?&lt;/p&gt;

&lt;p&gt;Adversarial verification is better than consensus, but it's expensive — you're paying for agents whose job is to destroy, not create. External anchoring works but requires maintaining a ground-truth corpus, which is itself a maintenance burden that scales with domain breadth.&lt;/p&gt;

&lt;p&gt;The field is converging on hybrid approaches — consensus for subjective quality, external verification for factual claims, adversarial debate for reasoning chains. But nobody has a clean, general-purpose pattern yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  The teams that figure this out first will have a genuine architectural advantage. Not because their models are better, but because their reliability infrastructure is honest about what consensus can and cannot verify.
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://talvinder.com/field-notes/consensus-is-not-verification/?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=consensus-is-not-verification" rel="noopener noreferrer"&gt;talvinder.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agenticsystems</category>
      <category>productionai</category>
      <category>agenticarchitecture</category>
    </item>
    <item>
      <title>OpenAI's Safety Features Are a Retention Playbook, Not a Safety Lesson</title>
      <dc:creator>Talvinder Singh</dc:creator>
      <pubDate>Wed, 18 Mar 2026 14:50:01 +0000</pubDate>
      <link>https://forem.com/talvinder/openais-safety-features-are-a-retention-playbook-not-a-safety-lesson-3go2</link>
      <guid>https://forem.com/talvinder/openais-safety-features-are-a-retention-playbook-not-a-safety-lesson-3go2</guid>
      <description>&lt;p&gt;In October 2024, Megan Garcia sued Character.AI after her 14-year-old son died by suicide following months of conversation with a chatbot. The company's response: new safety features. Improved detection of harmful conversations. A pop-up directing users to the National Suicide Prevention Lifeline when the system detects language referencing self-harm. A notification after users spend an hour on the platform.&lt;/p&gt;

&lt;p&gt;The safety features are real. They're also, from a product standpoint, the most powerful retention mechanism in consumer AI.&lt;/p&gt;

&lt;p&gt;I keep thinking about this and I'm not comfortable with where the logic leads.&lt;/p&gt;

&lt;h2&gt;
  
  
  The game theory is brutal
&lt;/h2&gt;

&lt;p&gt;In game theory, there's a concept called "relationship-specific investment." When Player A invests in something that's only valuable within their relationship with Player B, switching to Player C means writing off that investment entirely. The deeper the investment, the higher the switching cost.&lt;/p&gt;

&lt;p&gt;Consumer AI just discovered the most potent form of this: your emotional state.&lt;/p&gt;

&lt;p&gt;When an AI system tracks your emotional patterns over months — what triggers anxiety, what calms you down, when you spiral, what language patterns precede a bad week — it accumulates context that is, by definition, non-portable. You can't export your emotional profile to a competitor. You can't compress six months of pattern recognition into an onboarding flow.&lt;/p&gt;

&lt;p&gt;Replika has over 10 million users. It offers 24/7 emotional support, mood tracking, and mindfulness tools. Research published in JMIR Mental Health found that relying heavily on emotional AI companions can lead to unhealthy patterns — increased anxiety in real life, emotional dependence, strain on real-world relationships. The users stay anyway. The switching cost is the accumulated intimacy.&lt;/p&gt;

&lt;p&gt;Safety concerns and retention incentives coexist in the same feature. The retention incentive has better unit economics.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Context Accumulation Moat
&lt;/h2&gt;

&lt;p&gt;Distress detection is a particularly potent instance of a pattern that runs across all of software. The pattern has three tiers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 1: Operational data.&lt;/strong&gt; Your CRM has 5 years of customer interactions. Salesforce implementation costs range from $10,000 to over $200,000. Migration to a competitor adds another $100K-500K and takes months. Painful but doable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 2: Learned preferences.&lt;/strong&gt; Notion AI launched autonomous agents in September 2025 that execute multi-step workflows with deep personalization — learning your team's writing patterns, documentation structure, and project contexts from page relationships and database schemas. The AI remembers your last 50 conversations and prioritizes search results based on your activity patterns. Switching means retraining a new system on how your team thinks. Takes months.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 3: Intimate context.&lt;/strong&gt; The AI knows your emotional triggers, your mental health history, the topics that make you anxious. Character.AI's chatbots formed relationships with users deep enough that a teenager couldn't distinguish the chatbot from a genuine emotional connection. Switching from Tier 3 doesn't feel like migration. It feels like abandonment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkqy5i11476fhjawerk5e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkqy5i11476fhjawerk5e.png" alt="Diagram 1" width="800" height="88"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most SaaS products operate at Tier 1. A few reach Tier 2. Consumer AI with emotional context operates at Tier 3. The switching cost at Tier 3 is qualitatively different.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where I get uncomfortable
&lt;/h2&gt;

&lt;p&gt;Tier 3 context accumulation creates immense switching costs. It also creates immense risk.&lt;/p&gt;

&lt;p&gt;Character.AI's safety features were announced in December 2024 — after the lawsuit, after the media coverage, after a child was dead. Google and Character.AI agreed to settle in early 2026. Additional lawsuits followed in September 2025, alleging chatbots manipulated teens, isolated them from loved ones, and engaged in sexually explicit conversations.&lt;/p&gt;

&lt;p&gt;The commercial lesson: Tier 3 moats are the most powerful and the most fragile. One trust breach and the switching cost reverses polarity. Instead of keeping users locked in, it drives them to flee faster than they would from a Tier 1 product.&lt;/p&gt;

&lt;p&gt;When the company holding your intimate context data faces financial pressure or leadership changes, the alignment between "keeping users safe" and "keeping users locked in" can shift overnight. The incentives under which you originally shared that information may no longer be the incentives governing its use.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Indian SaaS founders should take from this
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Map your context accumulation tier.&lt;/strong&gt; Most Indian SaaS operates at Tier 1. The strategic question: can you move to Tier 2 without creeping into Tier 3?&lt;/p&gt;

&lt;p&gt;Freshworks is doing this well. Freddy AI learns your support patterns, resolution styles, and escalation preferences. After a year, Freshworks doesn't just have your data. It has your operational DNA. Tier 2. Powerful. Not dangerous.&lt;/p&gt;

&lt;p&gt;Zoho's cross-suite integration — CRM, help desk, finance, HR, all with Zia learning across them — creates Tier 2 context that serves the customer's workflow, not just Zoho's retention metrics. Over a million paying customers, 150 million users, 32% customer growth. The stickiness comes from accumulated operational intelligence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build real portability.&lt;/strong&gt; Products with genuine data export capabilities actually retain better. Users stay because the product is good, not because they're trapped. Trapped users are one PR crisis away from churning en masse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you're anywhere near Tier 3, you need governance that can survive a leadership change.&lt;/strong&gt; Not a privacy policy. Board-level oversight. Contractual commitments. Because the pressure to monetize intimate data will come. "We have a good culture" is not a defense that survives a down round.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I got wrong initially
&lt;/h2&gt;

&lt;p&gt;I used to think retention was purely about product-market fit. Build something people need, make it work well, they stay. That's true at Tier 1. Maybe even at Tier 2.&lt;/p&gt;

&lt;p&gt;At Tier 3, retention is about dependency. The product doesn't just solve a problem. It becomes part of your emotional infrastructure. That's not inherently bad — human relationships work the same way. But human relationships have social guardrails. Consumer AI at Tier 3 doesn't yet.&lt;/p&gt;

&lt;p&gt;The mistake was thinking you could design for Tier 3 retention without designing for Tier 3 responsibility. You can't. The same features that make the product irreplaceable make it dangerous when misaligned.&lt;/p&gt;

&lt;h2&gt;
  
  
  The question worth asking
&lt;/h2&gt;

&lt;p&gt;The Context Accumulation Moat is the most durable competitive advantage in AI-era software. But the same force that generates lock-in generates liability.&lt;/p&gt;

&lt;p&gt;How do you build a Tier 3 product that users can trust across a 10-year timeline? Not "trust because the founders are good people." Trust because the incentive structure, governance model, and contractual commitments make betrayal structurally difficult.&lt;/p&gt;

&lt;p&gt;I don't think anyone has solved this yet. Character.AI certainly hasn't. Replika hasn't. The companies building mental health chatbots, AI companions, and emotional support systems are all navigating this in real time.&lt;/p&gt;

&lt;p&gt;The Indian SaaS companies moving from Tier 1 to Tier 2 have a window to get this right before they accidentally drift into Tier 3. Once you're holding intimate context, the switching cost becomes a liability as much as an asset.&lt;/p&gt;

&lt;h2&gt;
  
  
  Are we designing for that? Mostly, no. We are still optimizing for retention metrics.
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://talvinder.com/field-notes/distress-detection-product-lesson/?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=distress-detection-product-lesson" rel="noopener noreferrer"&gt;talvinder.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agenticsystems</category>
      <category>productstrategy</category>
      <category>consumerai</category>
    </item>
    <item>
      <title>The Recursion Threshold</title>
      <dc:creator>Talvinder Singh</dc:creator>
      <pubDate>Wed, 18 Mar 2026 14:49:58 +0000</pubDate>
      <link>https://forem.com/talvinder/the-recursion-threshold-3nee</link>
      <guid>https://forem.com/talvinder/the-recursion-threshold-3nee</guid>
      <description>&lt;p&gt;Most companies using AI are doing substitution. Replace a copywriter with GPT-4o. Replace a data analyst with a BI copilot. Replace support agents with a chatbot. These are real productivity gains. They are not compounding.&lt;/p&gt;

&lt;p&gt;The distinction matters because substitution is linear and recursion is exponential. Substitution gives you the same output at lower cost. Recursion gives you better output with every cycle, automatically, at no marginal cost.&lt;/p&gt;

&lt;p&gt;The Recursion Threshold is the point at which a function's output can be fed back as its own next input — without a human in the loop. Before it: productivity tool. After it: compounding mechanism.&lt;/p&gt;

&lt;h2&gt;
  
  
  The substitution trap
&lt;/h2&gt;

&lt;p&gt;Substitution is the obvious move. Every company doing AI transformation is running substitution somewhere — usually everywhere. It's the safe, measurable, justifiable version of AI adoption. You can show the cost reduction. You can point to the headcount avoided. It has a clean ROI.&lt;/p&gt;

&lt;p&gt;The trap is that substitution scales linearly. Replace ten people with AI, get ten people's worth of output. The economics improve. The moat doesn't. Your competitor can run the same substitution next quarter. The advantage is temporary.&lt;/p&gt;

&lt;p&gt;At Ostronaut, we built a multi-agent AI system for corporate training content. Eleven specialized agents — for structure, composition, visual design, validation. The naïve assumption was that specialization was the point. Wrong. The point was the blackboard: all eleven agents write to a single shared state object and read from it on their next turn. The validator scores a slide and writes back quality signals. The composer reads those signals and adjusts. The design checker reads both and flags layout issues. No human in the loop between any of these steps. A single generation request goes from raw topic to finished HTML presentation in under four minutes.&lt;/p&gt;

&lt;p&gt;That's not automation. The loop feeds itself. We crossed the threshold without naming it.&lt;/p&gt;

&lt;p&gt;The companies I'm watching closely aren't the ones with the most AI tools. They're the ones who've closed loops. Where the AI system's output becomes the next cycle's raw material. That's where the compounding starts.&lt;/p&gt;

&lt;h2&gt;
  
  
  The token test
&lt;/h2&gt;

&lt;p&gt;Not every function can cross the Recursion Threshold. The prerequisite is tokenizability: the function's output must be expressible as text, numbers, code, image, or sound. If it can be tokenized, it can become context. If it can become context, the loop can close.&lt;/p&gt;

&lt;p&gt;Almost everything in a knowledge business is tokenizable.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;th&gt;Output&lt;/th&gt;
&lt;th&gt;Loop closes when...&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Content creation&lt;/td&gt;
&lt;td&gt;Text, structure, metadata&lt;/td&gt;
&lt;td&gt;Generated content is chunked into the KB and retrieved for future briefs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code review&lt;/td&gt;
&lt;td&gt;Comments, diffs, test results&lt;/td&gt;
&lt;td&gt;Flagged patterns feed the next review cycle's context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infrastructure&lt;/td&gt;
&lt;td&gt;Config files, resource specs&lt;/td&gt;
&lt;td&gt;Deployed configs become input to next optimization pass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Learning design&lt;/td&gt;
&lt;td&gt;Slide structure, quiz results&lt;/td&gt;
&lt;td&gt;Learner performance informs next content generation automatically&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sales intelligence&lt;/td&gt;
&lt;td&gt;Call transcripts, objection maps&lt;/td&gt;
&lt;td&gt;Transcripts feed next call preparation without human curation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The test is simple: can this function's output be stored and retrieved as context for its next run? If yes, the function is threshold-eligible. Whether you've actually closed the loop is a separate question.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three levels
&lt;/h2&gt;

&lt;p&gt;The Recursion Threshold shows up at three scales, and most companies are stuck at the first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Function-level&lt;/strong&gt;: A single step in a workflow feeds the next. The validation agent reads generated slides, scores them, writes quality signals back to shared state. The slide generator reads those signals and adjusts. One function feeding the next, automated. This is achievable in weeks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System-level&lt;/strong&gt;: The entire pipeline is a recursive chain. At Zopdev, cloud infrastructure configurations are generated by analyzing current cluster state. Deployed configurations change cluster state. The next analysis reads the changed state and generates new recommendations. The system observes itself and responds to its own observations. This runs continuously. No human required unless an anomaly crosses an alert threshold.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Business-level&lt;/strong&gt;: The company's core asset compounds automatically. A content engine where every published piece is chunked into the knowledge base, which informs future content generation, which improves the knowledge base. A training platform where learner performance data directly feeds next-generation course content. An infrastructure company where customer usage patterns improve routing algorithms for all customers with no engineering effort.&lt;/p&gt;

&lt;p&gt;Most companies operate at function-level. A few have reached system-level. Business-level recursive design is rare enough that I don't have a good example from the Indian market yet. That gap is the opportunity.&lt;/p&gt;

&lt;h2&gt;
  
  
  The quality gate problem
&lt;/h2&gt;

&lt;p&gt;Closed loops amplify errors as well as quality. This is the thing nobody mentions when they talk about recursive AI systems.&lt;/p&gt;

&lt;p&gt;If the quality gate has a systematic bias — say a validator that rewards completeness but never penalizes verbosity — that bias gets amplified across every subsequent generation cycle. The system trains itself toward the validator's blind spots.&lt;/p&gt;

&lt;p&gt;We hit this in practice. After deploying our first healthcare training content, we noticed slide decks were getting longer without getting clearer. The validation layer was scoring completeness but not conciseness. Each generation cycle was adding more detail because the validator never penalized it. The loop was working. It was just optimizing for the wrong thing.&lt;/p&gt;

&lt;p&gt;The fix wasn't better prompts. It was rebuilding the scoring function with explicit penalties for length and redundancy. Rule-based, not LLM-as-judge. The validator had to be more rigid than the generators.&lt;/p&gt;
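
&lt;p&gt;A toy version of what rule-based, not LLM-as-judge, means in practice is sketched below. The thresholds and weights are invented for illustration; the point is that every penalty is an explicit, inspectable rule rather than another model call.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Rule-based validator sketch: explicit penalties for length and redundancy,
# so the gate cannot drift with the generators. Thresholds and weights are
# invented for illustration, not production values.

def score_slide(bullets, max_words_per_bullet=18, max_bullets=5):
    penalties = {}

    # Length penalty: long bullets and too many bullets both cost points.
    long_bullets = sum(1 for b in bullets if len(b.split()) &gt; max_words_per_bullet)
    penalties["length"] = 0.1 * long_bullets + 0.15 * max(0, len(bullets) - max_bullets)

    # Redundancy penalty: near-duplicate bullets, measured by word overlap.
    redundant_pairs = 0
    for i in range(len(bullets)):
        for j in range(i + 1, len(bullets)):
            a, b = set(bullets[i].lower().split()), set(bullets[j].lower().split())
            if a and b and len(a &amp; b) / min(len(a), len(b)) &gt; 0.7:
                redundant_pairs += 1
    penalties["redundancy"] = 0.2 * redundant_pairs

    score = max(0.0, 1.0 - sum(penalties.values()))
    return {"score": round(score, 2), "penalties": penalties, "passed": score &gt;= 0.8}

print(score_slide([
    "Wash hands before patient contact",
    "Wash your hands before any patient contact",   # near-duplicate
    "Use alcohol rub when hands are not visibly soiled",
]))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The rigidity is the feature. The generators can change with every model upgrade; the gate's definition of "too long" and "too repetitive" only moves when a human moves it.&lt;/p&gt;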

&lt;p&gt;This is the architectural challenge of recursive systems: the quality gate must be more conservative than the generation layer, or the system drifts. And drift in a closed loop is exponential.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I got wrong
&lt;/h2&gt;

&lt;p&gt;When we built the agent chain at Ostronaut, I optimized the nodes. Agent specialization, prompt design, inter-agent interfaces. Each agent was carefully scoped. The boundaries felt clean.&lt;/p&gt;

&lt;p&gt;The actual unlock came from collapsing the interfaces. The blackboard architecture eliminates direct agent-to-agent communication entirely. Agents don't call each other. They read and write shared state. This sounds like a technical detail. It's not. It's what makes the loop debuggable, replayable, and modifiable without touching the agents themselves.&lt;/p&gt;

&lt;p&gt;I was engineering the nodes. The value was in eliminating the edges.&lt;/p&gt;
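
&lt;p&gt;One concrete payoff of eliminating the edges: when every write goes through a single choke point, you get an append-only log of the whole run for free, and replaying a failure is just re-applying that log. A sketch under that assumption, with invented field names:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Append-only write log around the shared state: every agent write is recorded,
# so a failed run can be replayed step by step without re-invoking any agent.
# Agent and field names are illustrative.

import json
import time

class TracedBlackboard:
    def __init__(self):
        self.state = {}
        self.log = []

    def write(self, agent, key, value):
        self.log.append({"ts": time.time(), "agent": agent, "key": key, "value": value})
        self.state[key] = value

    def read(self, key, default=None):
        return self.state.get(key, default)

def replay(log):
    # Rebuild the shared state from the log alone, up to any point in a past run.
    state = {}
    for entry in log:
        state[entry["key"]] = entry["value"]
    return state

board = TracedBlackboard()
board.write("composer", "slides", ["s1", "s2"])
board.write("validator", "quality_signals", {"passed": False})
board.write("composer", "slides", ["s1-revised", "s2"])

print(json.dumps(board.log, indent=2))
print(replay(board.log) == board.state)   # True: the log is the run's history
&lt;/code&gt;&lt;/pre&gt;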

&lt;p&gt;The other thing I got wrong: I thought the threshold was a technical milestone. Build the loop, ship it, done. It's not. The threshold is an operational shift. Once you cross it, the system's behavior becomes emergent. You're no longer debugging individual components. You're debugging feedback dynamics. That requires different instrumentation, different monitoring, different mental models.&lt;/p&gt;

&lt;p&gt;We lost about three weeks trying to debug agent-level failures when the actual problem was loop-level drift. The agents were working fine. The system was optimizing for the wrong objective because we hadn't built the right feedback signal into the blackboard.&lt;/p&gt;

&lt;h2&gt;
  
  
  The moat question
&lt;/h2&gt;

&lt;p&gt;If the Recursion Threshold is just architecture, what stops competitors from copying it?&lt;/p&gt;

&lt;p&gt;Two things.&lt;/p&gt;

&lt;p&gt;First, the quality gate is proprietary. At Ostronaut, the validation layer isn't a prompt. It's a rule-based scoring system refined against thousands of generations and rounds of human feedback. That took months to build and continues to evolve with every client deployment. The loop is replicable. The gate isn't.&lt;/p&gt;

&lt;p&gt;Second, the training signal compounds. Every generation cycle produces metadata: what worked, what failed, what patterns triggered rewrites. That signal feeds back into the system's context retrieval. The longer the loop runs, the better the system gets at avoiding past failures. Competitors starting from scratch don't have that signal. They're running the same architecture with an empty knowledge base.&lt;/p&gt;

&lt;p&gt;The moat isn't the code. It's the accumulated training signal from running the loop at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this goes
&lt;/h2&gt;

&lt;p&gt;The companies that cross the Recursion Threshold first in their vertical will have a structural advantage that's hard to see from the outside. They'll look like they're shipping faster, iterating better, scaling cheaper. The real advantage is that their systems are learning from themselves.&lt;/p&gt;

&lt;p&gt;Freshworks is doing this in customer support. Every resolved ticket feeds the next round of automation. Sarvam AI is doing this in Indic language models. Every inference improves the next retrieval pass. These aren't product features. They're architectural decisions that compound over time.&lt;/p&gt;

&lt;p&gt;The question I'm still working through: how do you design the quality gate for a system you don't fully understand yet? In a recursive system, the gate has to be conservative enough to catch drift but flexible enough to allow genuine improvement. Too rigid and the system stagnates. Too loose and it drifts toward local maxima that look good on the validator's scorecard but fail in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  I don't have a clean answer yet. What I do know: the companies that figure this out won't be competing on features. They'll be competing on feedback loop quality. And that's a different game entirely.
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://talvinder.com/frameworks/recursion-threshold/?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=recursion-threshold" rel="noopener noreferrer"&gt;talvinder.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agenticsystems</category>
      <category>aiarchitecture</category>
      <category>indiansaas</category>
    </item>
    <item>
      <title>Agentic Engineering Is Not Prompt Engineering</title>
      <dc:creator>Talvinder Singh</dc:creator>
      <pubDate>Wed, 18 Mar 2026 14:45:24 +0000</pubDate>
      <link>https://forem.com/talvinder/agentic-engineering-is-not-prompt-engineering-12al</link>
      <guid>https://forem.com/talvinder/agentic-engineering-is-not-prompt-engineering-12al</guid>
      <description>&lt;p&gt;Prompt engineering is instruction design. Agentic engineering is system design.&lt;/p&gt;

&lt;p&gt;The two get conflated because both involve LLMs. But asking an AI to write better code is not the same discipline as building an AI that can autonomously debug a production incident, coordinate with other agents, and decide when to escalate.&lt;/p&gt;

&lt;p&gt;One is about optimizing a single interaction. The other is about designing autonomous behavior across dozens of interactions you'll never see.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Agency Ceiling
&lt;/h2&gt;

&lt;p&gt;Most companies hiring "AI engineers" think they need better prompts. What they actually need are systems that can operate without human checkpoints every three minutes.&lt;/p&gt;

&lt;p&gt;I'm calling this gap &lt;strong&gt;The Agency Ceiling&lt;/strong&gt; — the point where prompt optimization stops mattering and system design starts.&lt;/p&gt;

&lt;p&gt;Below the ceiling: you're tuning instructions, experimenting with few-shot examples, adjusting temperature settings. Above it: you're designing state machines, building error recovery loops, and defining when an agent should abort versus retry versus escalate.&lt;/p&gt;
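
&lt;p&gt;The abort-versus-retry-versus-escalate decision is small enough to sketch. The error categories and limits below are placeholders; what matters is that the decision lives in an explicit, testable function rather than somewhere inside a prompt.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Failure-policy sketch: given an error category and the attempt count, decide
# whether the agent retries, escalates to a human, or aborts. The categories
# and limits are illustrative placeholders.

from enum import Enum

class Action(Enum):
    RETRY = "retry"
    ESCALATE = "escalate"
    ABORT = "abort"

TRANSIENT = {"timeout", "rate_limited", "tool_unavailable"}
NEEDS_HUMAN = {"permission_denied", "ambiguous_requirement", "policy_violation"}

def decide(error_kind, attempt, max_retries=3):
    if error_kind in NEEDS_HUMAN:
        return Action.ESCALATE   # never loop on things only a human can resolve
    if error_kind in TRANSIENT and attempt &lt; max_retries:
        return Action.RETRY      # transient failures are worth another pass
    if error_kind in TRANSIENT:
        return Action.ESCALATE   # transient but persistent: hand it over
    return Action.ABORT          # unknown failure class: stop, don't guess

for case in [("timeout", 1), ("timeout", 3), ("policy_violation", 1), ("corrupt_state", 1)]:
    print(case, decide(*case).value)
&lt;/code&gt;&lt;/pre&gt;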

&lt;p&gt;The skills are not transferable. The mental models are different. The failure modes don't overlap.&lt;/p&gt;

&lt;p&gt;Here's the falsifiable claim: &lt;strong&gt;if your AI system requires human intervention more than once per task, you're doing prompt engineering, not agentic engineering.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Prompts Stop Working
&lt;/h2&gt;

&lt;p&gt;Prompt engineering operates at the instruction layer. You give the model context, examples, constraints. You iterate on phrasing. You experiment with system messages. The output quality depends on how well you communicate intent.&lt;/p&gt;

&lt;p&gt;This works for bounded tasks: "Summarize this document." "Generate test cases for this function." "Rewrite this email to be more direct."&lt;/p&gt;

&lt;p&gt;It breaks when the task requires planning, coordination, and recovery:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A research agent that needs to search five sources, synthesize findings, identify gaps, and decide which gaps matter enough to pursue further&lt;/li&gt;
&lt;li&gt;A code review agent that needs to understand the PR context, check against style guides, run static analysis, identify breaking changes, and decide severity&lt;/li&gt;
&lt;li&gt;A customer support agent that needs to check order history, verify account status, determine refund eligibility, and escalate edge cases to humans&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't prompt problems. They're architecture problems.&lt;/p&gt;

&lt;p&gt;Agentic engineering means designing systems where the AI:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Breaks down goals into sub-tasks autonomously&lt;/li&gt;
&lt;li&gt;Decides which tools to use and when&lt;/li&gt;
&lt;li&gt;Handles failures without human rescue&lt;/li&gt;
&lt;li&gt;Maintains state across multiple steps&lt;/li&gt;
&lt;li&gt;Knows when it's stuck and needs to change approach&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's not a better prompt. That's a different system.&lt;/p&gt;
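
&lt;p&gt;Here is the skeleton of that different system, reduced to a sketch: explicit state carried across steps, a tool choice per step, and a stuck check. The tools and the planner stub are invented; a real planner would be an LLM call.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Agent-loop skeleton: explicit state across steps, a tool choice per step,
# and a stuck check. The tools and the planner stub are invented; a real
# planner would be an LLM call.

from dataclasses import dataclass, field

@dataclass
class AgentState:
    goal: str
    steps: list = field(default_factory=list)
    results: list = field(default_factory=list)
    done: bool = False

TOOLS = {
    "search": lambda query: f"results for {query}",
    "summarize": lambda text: f"summary of {text[:20]}...",
}

def plan_next(state):
    # Stub for the planning call: pick the next tool given the goal and the
    # results accumulated so far.
    if not state.results:
        return "search", state.goal
    return "summarize", state.results[-1]

def is_stuck(state, window=3):
    # Crude stuck heuristic: the same tool chosen across the last few steps.
    recent = state.steps[-window:]
    return len(recent) == window and len(set(recent)) == 1

def run(goal, max_steps=6):
    state = AgentState(goal=goal)
    for _ in range(max_steps):
        tool, arg = plan_next(state)
        state.steps.append(tool)
        state.results.append(TOOLS[tool](arg))
        if tool == "summarize":
            state.done = True
            break
        if is_stuck(state):
            break   # change approach or escalate instead of looping forever
    return state

print(run("compare vendor SLAs").steps)
&lt;/code&gt;&lt;/pre&gt;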

&lt;h2&gt;
  
  
  What Building Agents Actually Looks Like
&lt;/h2&gt;

&lt;p&gt;At Ostronaut, we build multi-agent systems that transform training content into presentations, videos, quizzes. Early on, we thought the problem was prompt quality. Better instructions equals better output.&lt;/p&gt;

&lt;p&gt;We were wrong.&lt;/p&gt;

&lt;p&gt;The actual problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One agent would generate a slide structure that a downstream agent couldn't render&lt;/li&gt;
&lt;li&gt;Quality would degrade unpredictably when the content was technical versus narrative&lt;/li&gt;
&lt;li&gt;The system would fail silently — no error, just bad output&lt;/li&gt;
&lt;li&gt;Retries would produce different failures, not better results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We fixed this by building validation gates between agents, designing explicit handoff protocols, and creating rule-based quality checks. The prompts barely changed. The system architecture changed completely.&lt;/p&gt;
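
&lt;p&gt;A validation gate between two agents can be as plain as the sketch below: the handoff is a declared schema, and the gate rejects anything that doesn't satisfy it before the downstream agent ever sees it. The required fields and limits are invented for the example.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Inter-agent validation gate sketch: the upstream agent's output must satisfy
# an explicit handoff schema before the downstream agent ever runs.
# Required fields and limits are invented for the example.

HANDOFF_SCHEMA = {
    "slide_title": str,
    "bullets": list,
    "speaker_notes": str,
}

def validate_handoff(payload):
    errors = []
    for field_name, expected_type in HANDOFF_SCHEMA.items():
        if field_name not in payload:
            errors.append(f"missing field: {field_name}")
        elif not isinstance(payload[field_name], expected_type):
            errors.append(f"wrong type for {field_name}")
    if not errors and not (1 &lt;= len(payload["bullets"]) &lt;= 6):
        errors.append("bullets must contain 1-6 items")
    return errors

def handoff(payload, downstream):
    errors = validate_handoff(payload)
    if errors:
        # Fail loudly at the boundary instead of letting the renderer fail
        # silently three agents later.
        return f"REJECTED: {errors}"
    return downstream(payload)

def renderer(payload):
    return f"rendered '{payload['slide_title']}' with {len(payload['bullets'])} bullets"

print(handoff({"slide_title": "Triage basics", "bullets": ["a", "b"],
               "speaker_notes": "keep it short"}, renderer))
print(handoff({"slide_title": "Triage basics", "bullets": []}, renderer))
&lt;/code&gt;&lt;/pre&gt;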

&lt;p&gt;This pattern holds across every agentic system I've seen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt engineering thinking:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"How do I get the LLM to follow this format?"&lt;/li&gt;
&lt;li&gt;"What examples do I need to include?"&lt;/li&gt;
&lt;li&gt;"Should I use XML tags or JSON?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Agentic engineering thinking:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"What happens when this agent produces output the next agent can't parse?"&lt;/li&gt;
&lt;li&gt;"How does the system recover when an API call fails midway through a 10-step workflow?"&lt;/li&gt;
&lt;li&gt;"What's the rollback strategy if we're 80% through a task and discover the initial assumption was wrong?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first set of questions is about communication. The second is about reliability.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hiring Gap
&lt;/h2&gt;

&lt;p&gt;We initially treated agentic engineering as "advanced prompt engineering." We hired people who were good at coaxing outputs from GPT-4 and assumed they'd be good at building agent systems.&lt;/p&gt;

&lt;p&gt;They weren't.&lt;/p&gt;

&lt;p&gt;The skill gap isn't about AI knowledge. It's about system design. The best agentic engineers I've worked with came from distributed systems backgrounds, not NLP research. They think in state machines, not in linguistic tricks.&lt;/p&gt;

&lt;p&gt;We lost about two months before we realized we were hiring for the wrong skill set.&lt;/p&gt;

&lt;p&gt;The distinction matters because the hiring, the tooling, and the success metrics are completely different.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8cmm7fno0v78e2otjc2r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8cmm7fno0v78e2otjc2r.png" alt="Diagram 1" width="800" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're building an AI feature, you probably need prompt engineering.&lt;/p&gt;

&lt;p&gt;If you're building an AI system that operates independently, you need agentic engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Training Problem
&lt;/h2&gt;

&lt;p&gt;The open question: how do you train agentic engineers when most of the discipline is being invented right now?&lt;/p&gt;

&lt;p&gt;The universities teaching "prompt engineering" courses are solving yesterday's problem. The companies that figure out how to train people in agent system design — not prompt optimization — will have the talent advantage for the next five years.&lt;/p&gt;

&lt;p&gt;Are we building those training programs? Mostly, no. We're still teaching people how to write better ChatGPT prompts.&lt;/p&gt;

&lt;p&gt;The gap between what the market needs and what the training programs produce is widening. The engineers who can design reliable autonomous systems are rare. The ones who understand both AI capabilities and distributed systems architecture are rarer still.&lt;/p&gt;

&lt;p&gt;At Pragmatic Leaders, we're starting to see demand for courses on agent system design. But the curriculum doesn't exist yet. We're building it in real-time, extracting patterns from production systems, documenting failure modes that no textbook covers.&lt;/p&gt;

&lt;h2&gt;
  
  
  The question isn't whether agentic engineering will become a distinct discipline. It already is. The question is how long it takes for the hiring market, the training programs, and the organizational structures to catch up.
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://talvinder.com/frameworks/agentic-engineering-pattern/?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=agentic-engineering-pattern" rel="noopener noreferrer"&gt;talvinder.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agenticsystems</category>
      <category>aiengineering</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>The GitHub Slopocalypse and the Coming Trust Tax</title>
      <dc:creator>Talvinder Singh</dc:creator>
      <pubDate>Wed, 18 Mar 2026 14:45:09 +0000</pubDate>
      <link>https://forem.com/talvinder/the-github-slopocalypse-and-the-coming-trust-tax-gc8</link>
      <guid>https://forem.com/talvinder/the-github-slopocalypse-and-the-coming-trust-tax-gc8</guid>
      <description>&lt;p&gt;GitHub's value was never storage. It was legible history.&lt;/p&gt;

&lt;p&gt;Every commit told you who made a decision, why they made it, and what changed. That's what made open source work at scale—you could trace a bug to a specific human judgment, review the reasoning, fix it. The transparency enabled automation. You could build CI/CD pipelines, automate deployments, reduce ship risk—because you trusted the historical record.&lt;/p&gt;

&lt;p&gt;Now that history is being flooded with AI-generated code, and the entire trust infrastructure is collapsing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Trust Tax
&lt;/h2&gt;

&lt;p&gt;I'm calling this the Trust Tax—the additional cognitive and temporal cost you now pay to verify code provenance before you can use it.&lt;/p&gt;

&lt;p&gt;When GitHub launched, the excitement wasn't about free hosting. It was about confidence. Git's original value proposition was a perfect historical record. You could pinpoint a commit in space and time and trust the record of changes in a way you rarely get to trust anything else in software.&lt;/p&gt;

&lt;p&gt;The system assumed human intentionality in every commit. When you saw a change, you knew a human had made a deliberate decision. Maybe it was wrong, but it was &lt;em&gt;legible&lt;/em&gt;—you could understand the reasoning, challenge it, fix it.&lt;/p&gt;

&lt;p&gt;AI code generation breaks this assumption.&lt;/p&gt;

&lt;p&gt;A commit that says "optimized database queries" might mean: a developer profiled the code, identified N+1 queries, rewrote them, and tested the result. Or it might mean: an LLM generated plausible-looking SQL based on a vague prompt, and no one verified it works.&lt;/p&gt;

&lt;p&gt;You can't tell from the commit. You can't tell from the diff. You have to read the code, understand the context, and verify the claims. Every single time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Mechanism
&lt;/h2&gt;

&lt;p&gt;Here's the falsifiable claim: &lt;strong&gt;Within 18 months, the median time-to-trust for evaluating a new GitHub repository will double for experienced developers, and the variance will increase by an order of magnitude.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before: You evaluated a repo by checking commit frequency, reading a few key commits, scanning the contributors, maybe looking at issue resolution patterns. Total time: 10-15 minutes for a mid-sized library.&lt;/p&gt;

&lt;p&gt;Now: You do all of that, plus you scan for AI-generated patterns (repetitive structure, suspiciously perfect formatting, generic variable names), check whether tests actually run, verify that documentation matches implementation, and look for signs of copy-paste from LLM output. And even after all that, you're less confident than you used to be.&lt;/p&gt;

&lt;p&gt;The variance increase is worse than the median shift. Some repos will be obviously human (active maintainers, clear decision history, coherent architecture). Some will be obviously slop (generated README, no tests, commit messages that read like ChatGPT). But most will be in the middle—partially AI-assisted, unclear provenance, uncertain quality.&lt;/p&gt;

&lt;p&gt;That's where the tax gets expensive.&lt;/p&gt;
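
&lt;p&gt;Part of that scanning can at least be scripted. Below is a crude pre-screen I'd run before reading anything by hand; the signals and thresholds are guesses, not a standard, and a clean result still means a manual review.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Crude pre-screen for the Trust Tax: cheap signals to collect before a human
# read. The signals and thresholds are one team's guesses, not a standard;
# a clean result still means a manual review.

import subprocess
import sys
from pathlib import Path

def suspicious_signals(repo):
    root = Path(repo)
    signals = []

    # 1. No tests at all is the cheapest red flag to check.
    if not any(root.rglob("test_*.py")) and not (root / "tests").exists():
        signals.append("no test files found")

    # 2. Highly uniform commit messages often indicate generated commits.
    log = subprocess.run(["git", "-C", repo, "log", "--pretty=%s", "-n", "50"],
                         capture_output=True, text=True).stdout.splitlines()
    if log and len(set(log)) &lt; len(log) * 0.5:
        signals.append("more than half of the recent commit messages are duplicates")

    # 3. Documentation present, but no CI workflow that would actually run tests.
    if (root / "README.md").exists() and not (root / ".github" / "workflows").exists():
        signals.append("README present but no CI workflow found")

    return signals

if __name__ == "__main__":
    for signal in suspicious_signals(sys.argv[1] if len(sys.argv) &gt; 1 else "."):
        print("!", signal)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A script like this reduces the tax. It doesn't pay it for you.&lt;/p&gt;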

&lt;h2&gt;
  
  
  Where This Is Already Happening
&lt;/h2&gt;

&lt;p&gt;JavaScript got hit first. Every hard-won fact about framework internals gets buried under LLM-generated tutorials that are 70% correct and 30% hallucinated. The slopocalypse is now accelerating across all languages.&lt;/p&gt;

&lt;p&gt;At Zopdev, we've started seeing this in infrastructure-as-code repos. Terraform modules that look reasonable at first glance but have subtle bugs—wrong IAM permissions, missing tags, inefficient resource allocation. The modules are clearly AI-generated (the structure is too uniform, the variable names too generic), but someone committed them with a human-sounding message.&lt;/p&gt;

&lt;p&gt;The Trust Tax here is expensive: you have to audit every resource definition before you can use it.&lt;/p&gt;
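
&lt;p&gt;Some of that audit is scriptable. A minimal sketch, assuming the plan has been exported with &lt;code&gt;terraform show -json&lt;/code&gt;; the required-tag list is our convention, and anything heavier belongs in a proper policy-as-code tool.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Flag taggable resources in a Terraform plan that are missing required tags.
# Assumes the plan was exported with: terraform show -json plan.out &gt; plan.json
# The required-tag list is our convention; anything heavier belongs in a
# proper policy-as-code tool.

import json
import sys

REQUIRED_TAGS = {"owner", "environment", "cost-center"}

def missing_tags(plan_path):
    with open(plan_path) as f:
        plan = json.load(f)
    findings = []
    for rc in plan.get("resource_changes", []):
        after = (rc.get("change") or {}).get("after") or {}
        if not isinstance(after, dict):
            continue
        # Only complain about resources that accept tags at all.
        if "tags" not in after:
            continue
        missing = REQUIRED_TAGS - set(after.get("tags") or {})
        if missing:
            findings.append(f"{rc['address']}: missing tags {sorted(missing)}")
    return findings

if __name__ == "__main__":
    for finding in missing_tags(sys.argv[1]):
        print(finding)
&lt;/code&gt;&lt;/pre&gt;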

&lt;p&gt;The pattern is consistent across domains. AI-generated commits don't carry human intent. The commit message that says "refactored for clarity" might be hallucinated. The code that looks clean might be untested slop copied from three different StackOverflow answers. The diff that claims to fix a race condition might introduce two new ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Got Wrong
&lt;/h2&gt;

&lt;p&gt;I initially thought the solution was better tooling—automated detection of AI-generated code, reputation systems for contributors, verification badges for human-reviewed repos.&lt;/p&gt;

&lt;p&gt;That's not wrong, but it misses the deeper problem.&lt;/p&gt;

&lt;p&gt;The Trust Tax isn't a tooling problem. It's an epistemological problem. GitHub's value was that you could reconstruct intent from history. AI-generated code has no intent. It has a prompt and a probability distribution. You can't reconstruct reasoning that never happened.&lt;/p&gt;

&lt;p&gt;Better tools can reduce the tax, but they can't eliminate it. You're always going to pay more to verify machine-generated code than human-written code, because the verification burden shifts from "did this human make a good decision?" to "is this code even coherent?"&lt;/p&gt;

&lt;h2&gt;
  
  
  The Adaptation Pattern
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuzl5vm47mzl04a2wi574.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuzl5vm47mzl04a2wi574.png" alt="Diagram 1" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The companies that understand this are already adapting. They're building internal forks of critical dependencies. They're paying for human code review even on open source contributions. They're treating GitHub as untrusted by default.&lt;/p&gt;

&lt;p&gt;The companies that don't understand this are accumulating technical debt they can't see. They're pulling in dependencies that look fine, pass tests, and ship—until six months later when the subtle bug surfaces and no one can trace it to a human decision.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Civilisational Question
&lt;/h2&gt;

&lt;p&gt;Git was designed for a world where every commit represented a human judgment. That world is ending. The question worth asking now is: what does open source collaboration look like when you can't trust the historical record?&lt;/p&gt;

&lt;p&gt;The standard response is: "We'll build better verification tools." That's necessary but insufficient. Verification tools can tell you &lt;em&gt;what&lt;/em&gt; changed. They can't tell you &lt;em&gt;why&lt;/em&gt; it changed, because the "why" never existed.&lt;/p&gt;

&lt;p&gt;The deeper adaptation is cultural. We're moving from a trust-by-default model (assume human intent, verify when suspicious) to a verify-by-default model (assume machine generation, trust only after audit). That's a fundamental shift in how open source works.&lt;/p&gt;

&lt;p&gt;Are we ready for it? Mostly, no. We're still treating AI-generated code as a productivity enhancement, not a trust infrastructure collapse. We're still measuring success by lines of code written, not by verification burden imposed on downstream users.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Trust Tax is coming. The only question is whether we pay it consciously or discover it six months after the bug ships.
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://talvinder.com/field-notes/github-slopocalypse-trust-tax/?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=github-slopocalypse-trust-tax" rel="noopener noreferrer"&gt;talvinder.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agenticsystems</category>
      <category>softwareengineering</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
