Forem: Sriramprabhu Rajendran

Why Story Points Don’t Work in the AI Era, And What Should Take Their Place Instead.

Sriramprabhu Rajendran — Sun, 24 May 2026 18:34:17 +0000

Imagine this scenario at your next sprint review meeting: You're looking good on your velocity graph. But half your team is struggling in their own little hell. Estimations have devolved into Russian roulette. You get a "2-point" done in 45 minutes. You have another "2-point" that takes 4 days because AI-written code introduced a nasty bug in staging that was missed.

This isn't an effort issue; it’s a problem of predictability. Story points, meant for predicting the performance of stable humans, don’t even know what to do with this situation.

I believe it’s time to ditch story points. Let me explain how.

Why I Think the Old System Is Defective

Story points commit you to the idea that "that task is about twice as hard as this task." That comparison only works under two assumptions:

The developer has approximately the same capacity
The level of complexity correlates well to duration

AI breaks both assumptions.

Throughput is no longer fixed. Same developer, same work, same day: make one tweak to the model and suddenly they’re getting things done in 40% less time. Or something goes wrong with the model and it takes them two days to fix a hallucinated abstraction.

Complexity is no longer linked to task duration. A very difficult task can become very easy if the AI gets it right, and a very simple task can become very difficult if the AI gets it wrong. The variance is now much larger than the mean, and story points capture only the mean.

Here’s what I believe a normal team would be seeing after a couple of sprints using AI:

Task Type         | Pre-AI Avg  | Post-AI Range
------------------|-------------|----------------
API endpoint      | 2 days      | 3 hours – 2.5 days
DB migration      | 3 days      | 1 day – 4 days  
UI component      | 1.5 days    | 30 min – 3 days
Legacy refactor   | 5 days      | 2 days – 8 days

If the range is wider than the estimate itself, the estimate is noise.

The Hidden Cost That No One Is Estimating

This is my hypothesis on what most teams are failing to estimate: verifying the AI output is the highest-cost activity for almost any task.

A realistic breakdown of where the time goes would look something like this:

20% of the time: figuring out what to build
15% of the time: the AI building it
65% of the time: going through it, looking for problems

And that third one—the curation tax—falls into an unseen category for everyone. They're estimating the time spent on construction and not on curation. That would be like planning a home renovation based only on how fast someone bangs a hammer.

If teams started treating review and validation as a first-class part of their estimates, I'm convinced their accuracy would improve drastically.

What I Suggest to Replace Story Points

1. Confidence-Tagged Estimates

What I suggest: Every ticket must be provided with two things—a time estimate and a confidence tag.

ticket: INDE-002 — Migrate auth service to new SDK
estimate: 1.5 days
confidence: low
reason: "New SDK, AI hasn't seen our auth patterns before"
action: spike first (0.5 day), then re-estimate

The confidence tag is the important part. It tells the PM whether to trust the number or treat it as a hypothesis.

Three bands:

| Tag | What it means | How to plan |
|-----|---------------|-------------|
| `high` | Done this before with AI, know the variance | Plan on the estimate |
| `medium` | Familiar territory, some unknowns | Buffer 2x |
| `low` | Novel task or AI-unfriendly domain | Spike first, don't commit |

My recommendation for the rule: always spike a ticket marked low before adding it to a sprint. My guess is that this one rule can prevent almost all sprint meltdowns in AI-driven teams.
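To make the three bands concrete, here is a minimal Python sketch of how a planning script might apply them. The `Ticket` class, the `plan_sprint` helper, and the two extra tickets are hypothetical and exist only for illustration; the multipliers come straight from the bands above.

```python
from dataclasses import dataclass

# Planning multipliers from the three bands. "low" has no multiplier
# because the rule is to spike first rather than commit.
BUFFER = {"high": 1.0, "medium": 2.0}


@dataclass
class Ticket:
    key: str
    estimate_days: float
    confidence: str  # "high", "medium", or "low"


def plan_sprint(tickets):
    """Return (committed_days, keys_of_tickets_that_need_a_spike)."""
    committed = 0.0
    needs_spike = []
    for ticket in tickets:
        if ticket.confidence == "low":
            needs_spike.append(ticket.key)
        else:
            committed += ticket.estimate_days * BUFFER[ticket.confidence]
    return committed, needs_spike


tickets = [
    Ticket("INDE-001", 1.0, "high"),    # hypothetical ticket
    Ticket("INDE-002", 1.5, "low"),     # the auth SDK migration above
    Ticket("INDE-003", 2.0, "medium"),  # hypothetical ticket
]
committed, needs_spike = plan_sprint(tickets)
print(committed, needs_spike)  # 5.0 ['INDE-002']
```

The low-confidence ticket never contributes to the committed total, which is the whole point of the rule: it only enters the plan after the spike has produced a new estimate and a new tag.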

2. "Free Second Time" Paradigm

I keep noticing a pattern: the first time you do any task with the help of AI, it’s very costly. The second time, the cost drops by about 60%. By the third time, it’s pretty much negligible.

Why? The first attempt forces you to develop a specific workflow: a good prompt structure, the right context window configuration, and a map of everything that might go wrong during execution.

That being said, I believe the cost estimation should be done in a different way:

First instance of task type:  estimate × 2 (you're building the workflow)
Second instance:              estimate × 0.8 (refining the workflow)
Third+ instance:              estimate × 0.4 (executing the workflow)

Example: consider migration of 12 microservices using a new observability SDK.

Service 1: 6 hours (thinking about how to do it, writing prompts)
Service 2: 2.5 hours (fine-tuning, dealing with edge cases)
Services 3-12: ~45 minutes each (batches in bulk using known procedure)

Old estimate: 12 x 4 hours = 48 hours. This approach: ~16 hours. But only if you put in the effort on service 1 rather than rushing through.
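The arithmetic behind these numbers can be checked with a short Python sketch. The function names are hypothetical; the multipliers and hours are the ones given above. Note that applying the multipliers to a flat 4-hour base is more conservative than the hours observed in the example, but still well below the flat estimate.

```python
def novelty_multiplier(instance: int) -> float:
    """Multiplier for the n-th (1-based) instance of a task type."""
    if instance == 1:
        return 2.0  # building the workflow
    if instance == 2:
        return 0.8  # refining the workflow
    return 0.4      # executing the workflow


def adjusted_hours(base_hours: float, count: int) -> float:
    """Total hours for `count` instances of the same task type."""
    return sum(base_hours * novelty_multiplier(i) for i in range(1, count + 1))


flat_estimate = 12 * 4                  # old approach: every service costs the same
observed = 6 + 2.5 + 10 * 0.75          # hours from the 12-service example
by_multiplier = adjusted_hours(4, 12)   # multipliers applied to a 4-hour base

print(flat_estimate, observed, round(by_multiplier, 1))  # 48 16.0 27.2
```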

3. Review-Weighted Sizing

I don’t believe that one should size by "how difficult will it be to develop." Instead, one must size by "how difficult will it be to verify?"

The easiest pieces to create are often very difficult to review (large refactors, verbose migrations), while difficult pieces to generate are simple to review (small algorithmic fixes with explicit test cases).

This sizing rubric must be inverted:
| Old thinking | New thinking |
|-------------|-------------|
| "Lots of code = big ticket" | "Lots of code to review = big ticket" |
| "Complex logic = big ticket" | "Ambiguous correctness criteria = big ticket" |
| "New framework = big ticket" | "AI-unfamiliar patterns = big ticket" |

A 500-line boilerplate migration should be sized large, not because it’s difficult to generate (an AI can do that within minutes), but because checking 500 lines of code for subtle differences is truly costly.

How This Changes The PM Conversation

The hardest part of any estimation paradigm shift isn’t technical. It’s explaining the change.

Old conversation:

"This epic is 34 story points. At our velocity of 21/sprint, it’ll take ~1.6 sprints."

Where I think this discussion needs to go:

"This epic has 8 tickets. 5 are high-confidence (we’re going to meet the estimates here). 2 are medium-confidence (double our estimates). 1 is low-confidence (we need a spike day to be sure about that). Optimistic estimate: 1 sprint. Pessimistic: 1.5 sprints. And if the low-confidence ticket turns out to be a problem? 2.5 sprints."

Longer sentences? Yes. More useful? Absolutely. PMs now have a real choice to make: "Pull out the low-confidence ticket and ship everything else on time" is a discussion you can actually have.

Metrics Worth Tracking Instead

Velocity should be scrapped as a metric in favor of more useful measures. I would propose tracking the following:

Curation ratio – ratio of review time to creation time. Goal: below 3:1.
Confidence success rate – proportion of 'high' tickets that land within their estimate.
Process reuse rate – how often an existing process is reused for a second similar task instead of being created anew.
Spike conversion rate – how often a 'low' ticket turns into 'medium' or 'high' after a spike.

These measures tell you how well the team is collaborating with the AI, not just how 'fast' it is going.
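Three of these metrics are simple ratios, and a rough Python sketch shows how they could be computed from ticket data. Everything here is an assumption for illustration: the field names, the sample tickets, and the choice to count a 'high' ticket as a success when its actual time is at or below the estimate.

```python
def curation_ratio(review_hours, creation_hours):
    """Review time divided by creation time; the stated goal is below 3."""
    return review_hours / creation_hours


def confidence_success_rate(tickets):
    """Share of 'high' tickets whose actual time stayed within the estimate."""
    high = [t for t in tickets if t["confidence"] == "high"]
    if not high:
        return None
    return sum(t["actual"] <= t["estimate"] for t in high) / len(high)


def spike_conversion_rate(post_spike_tags):
    """Share of spiked 'low' tickets that came out as 'medium' or 'high'."""
    if not post_spike_tags:
        return None
    converted = sum(tag in ("medium", "high") for tag in post_spike_tags)
    return converted / len(post_spike_tags)


# Hypothetical sprint data, for illustration only.
tickets = [
    {"confidence": "high", "estimate": 1.0, "actual": 0.8},
    {"confidence": "high", "estimate": 2.0, "actual": 2.0},
    {"confidence": "high", "estimate": 1.0, "actual": 1.5},
    {"confidence": "high", "estimate": 0.5, "actual": 0.4},
    {"confidence": "medium", "estimate": 1.0, "actual": 3.0},
]

# The 65% review / 15% generation split from earlier is about 4.33:1,
# which is above the 3:1 goal.
print(round(curation_ratio(65, 15), 2))
print(confidence_success_rate(tickets))
print(spike_conversion_rate(["medium", "low", "high", "medium"]))
```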

TL;DR – The Replacement Kit

If you are still Fibonacci-estimating stories in 2026 and asking why sprints are akin to playing Russian roulette, here’s my suggestion:

Confidence tagging for estimates — confidence level will matter more than estimate itself
Curation effort estimation vs Construction effort estimation — curation will be the hard work
Novelty tracking for each task — new tasks are 2-3 times costlier than recurring tasks
Task size based on difficulty of reviewing and not generation — reverse the complexity paradigm
Spike before undertaking high uncertainty tasks — one simple rule to massively reduce blowups

In my mind, however, the fundamental change that needs to happen is as follows: before, estimation was focused on how much time it takes to build something. Today, it’s time to focus on how much time it takes to validate it. Change the paradigm, and suddenly, sprint planning starts reflecting reality.

Obviously just one way of seeing things—there are many brilliant minds out there who have figured out how to make story points work using AI-driven adjustments. Where do you stand: adding to the old system or building a new one from scratch?

Mutation Testing: The Missing Safety Net for AI-Generated Code

Sriramprabhu Rajendran — Tue, 31 Mar 2026 01:28:59 +0000

92% code coverage. No SonarQube criticals. All green. And an AI-generated deduplication bug made it to production because not a single test had challenged the logic.

Code coverage tells you what ran. Mutation testing tells you what your tests would actually catch if the code were wrong. And in the AI world, that's the only thing that matters.

Here is an analogy. Walking through a building, coverage means we visited every room. Mutation testing means we would notice if a wall were missing. One measures presence, the other measures resistance.

The Bug That Coverage Could Not See

I've seen this occur in the wild. An AI agent produced the service layer for a critical reconciliation workflow. 140 unit tests. 92% line coverage. It looked good on the PR.

But two days after deployment, the reconciliation started silently duplicating line items. The AI had used reference equality on objects, not business key equality. For 98% of records, the two were functionally the same. For the 2% that were reconstructed from a database query, it was catastrophically wrong.

All the tests ensured that the deduplication happened, not how:

assertEquals(3, result.size()); // passes with either implementation
assertTrue(result.containsAll(expected)); // passes — same objects in test setup

Change .equals() to ==, and all tests pass. This is exactly the kind of gap mutation testing is designed to expose.

From an observability point of view, every one of these surviving mutants is a "silent failure" just waiting to happen: a problem your logging and monitoring won't detect until a downstream reconciliation report blows up 48 hours later. Mutation testing can actually reduce your Mean Time to Detect by catching these problems before they ever hit production.

What Mutation Testing Actually Does

The idea is deceptively simple. Take your code, introduce small deliberate breaks, and see if your tests notice.

Original:    if (a.getBusinessKey().equals(b.getBusinessKey()))
Mutant 1:    if (a.getBusinessKey().equals(b.toString()))
Mutant 2:    if (a == b)
Mutant 3:    if (true)
Mutant 4:    if (!a.getBusinessKey().equals(b.getBusinessKey()))

If a test fails when run against a mutant, that mutant is killed. If the tests pass anyway, the mutant survived, which means you have found a blind spot in your tests. Note that a survivor is a problem with your tests, not necessarily with the code itself. Counting survivors gives you a single number:

Mutation score = killed mutants / total mutants. If your score is 60%, then 40% of the injected faults went unnoticed by your tests, regardless of your line coverage.
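The score can be demonstrated with a hand-rolled Python sketch. This is not how PIT works internally (PIT mutates bytecode); it simply runs the four mutants listed above, written as Python functions, against two test suites. `Item`, the suites, and `mutation_score` are hypothetical names, and Python's `is` stands in for Java's `==`.

```python
class Item:
    def __init__(self, business_key):
        self.business_key = business_key

    def __str__(self):
        return f"Item({self.business_key})"


def original(a, b):
    return a.business_key == b.business_key


# Python versions of Mutants 1 to 4 from the listing above.
MUTANTS = {
    "key vs str(b)": lambda a, b: a.business_key == str(b),
    "identity": lambda a, b: a is b,
    "always true": lambda a, b: True,
    "negated": lambda a, b: not (a.business_key == b.business_key),
}


def weak_suite(same):
    x, y = Item("A-1"), Item("B-2")
    assert same(x, x)        # same object on both sides
    assert not same(x, y)    # different keys


def strong_suite(same):
    weak_suite(same)
    # Equal keys on distinct objects, as when rows are reloaded from a database.
    assert same(Item("A-1"), Item("A-1"))


def mutation_score(suite):
    killed = 0
    for mutant in MUTANTS.values():
        try:
            suite(mutant)
        except AssertionError:
            killed += 1  # at least one test failed: mutant killed
    return killed / len(MUTANTS)


weak_suite(original)    # the real implementation passes both suites
strong_suite(original)
print(mutation_score(weak_suite), mutation_score(strong_suite))  # 0.75 1.0
```

The weak suite executes every line of the comparison, yet the identity mutant survives because no test ever compares two distinct objects with the same key.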

Why AI Makes This Worse

Our tests have been improving over the years to cover the kinds of errors that humans tend to make: typos, off-by-one errors, and the like. AI-generated code tends to fail differently: structurally correct but semantically drifted.

The LLM has no concept of your domain. It has no idea that dedup in this system means business key equality, not object identity. It has no idea that null in this system means "skip," not "default." The code compiles, tests pass, and logic is subtly incorrect.

Mutation tests detect this because they're based on mechanisms, not intent. They don't care how you write your code. All they ask is: "If this particular piece of logic were wrong, would any tests fail?"

In my experience, and from what early adopter teams have told me, survival rates are 15-25% higher on AI-generated code at equivalent coverage levels. Same coverage number, weaker tests.

Setting Up PIT for Java

The go-to tool for Java mutation testing is PIT (pitest.org). Here is a minimal configuration for Maven:

<plugin>
    <groupId>org.pitest</groupId>
    <artifactId>pitest-maven</artifactId>
    <version>1.15.3</version>
    <configuration>
        <targetClasses>
            <param>com.sri.recon.*</param>
        </targetClasses>
        <targetTests>
            <param>com.sri.recon.*Test</param>
        </targetTests>
        <mutators>
            <mutator>DEFAULTS</mutator>
            <mutator>REMOVE_CONDITIONALS</mutator>
            <mutator>RETURN_VALS</mutator>
        </mutators>
        <timestampedReports>false</timestampedReports>
        <outputFormats>
            <param>HTML</param>
            <param>XML</param>
        </outputFormats>
    </configuration>
</plugin>

mvn org.pitest:pitest-maven:mutationCoverage

The HTML report displays all mutants, whether they were killed or survived, and which statement they targeted. The surviving mutants are your action items.

The Continuous Integration Pipeline

The workflow is simple. The AI produces or changes code, a developer looks at the pull request, and the pipeline runs the usual unit tests. If those pass, the pipeline runs mutation tests on the changed files. Surviving mutants are flagged on the pull request so the developer can write tests to kill them. The mutation score threshold determines whether the pull request can merge.

One thing worth calling out: avoid running mutations on the entire code base. PIT supports SCM-based Git integration that lets you target only the code that changed in a PR. This is known as differential mutation testing, and it is what makes mutation testing feasible: run time drops from hours to minutes, and you're targeting exactly what the AI just created. This is done via the scmMutationCoverage goal:

mvn org.pitest:pitest-maven:scmMutationCoverage -Dpit.target.tests=com.sri.*Test

As for mutation thresholds, be reasonable. I'd recommend a mutation score of at least 80% on newly created AI code, and that the score not decrease when the AI modifies existing code. For critical domains such as authentication and data integrity, I'd recommend 90%. Don't aim for 100%: you'll never get there, and equivalent mutants bring diminishing returns.
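PIT's Maven plugin has a `mutationThreshold` setting that can fail the build directly. If you want survivors listed on the pull request as well, a small script over the XML report is one option. The Python sketch below assumes the report contains `<mutation>` elements with `detected` and `status` attributes and `sourceFile` / `lineNumber` children; verify that against the `mutations.xml` your PIT version actually produces. The sample report and the `gate` function are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the assumed shape of a PIT XML report.
SAMPLE_REPORT = """
<mutations>
  <mutation detected="true" status="KILLED"><sourceFile>ReconService.java</sourceFile><lineNumber>42</lineNumber></mutation>
  <mutation detected="true" status="KILLED"><sourceFile>ReconService.java</sourceFile><lineNumber>43</lineNumber></mutation>
  <mutation detected="false" status="SURVIVED"><sourceFile>ReconService.java</sourceFile><lineNumber>57</lineNumber></mutation>
  <mutation detected="true" status="KILLED"><sourceFile>DedupFilter.java</sourceFile><lineNumber>12</lineNumber></mutation>
  <mutation detected="true" status="TIMED_OUT"><sourceFile>DedupFilter.java</sourceFile><lineNumber>30</lineNumber></mutation>
</mutations>
"""


def gate(report_xml, threshold):
    """Return (score, passed, survivors) for one report."""
    root = ET.fromstring(report_xml.strip())
    mutations = root.findall("mutation")
    detected = sum(1 for m in mutations if m.get("detected") == "true")
    survivors = [
        (m.findtext("sourceFile"), int(m.findtext("lineNumber")))
        for m in mutations
        if m.get("status") == "SURVIVED"
    ]
    # With no mutants at all there is nothing to fail on.
    score = detected / len(mutations) if mutations else 1.0
    return score, score >= threshold, survivors


score, passed, survivors = gate(SAMPLE_REPORT, threshold=0.80)
print(score, passed, survivors)
```

With this sample, four of five mutants are detected, so the 80% gate passes and the 90% gate for critical code would not.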

Example: Catching What Coverage Missed

Here is a concrete one. Say an AI generates this discount calculation:

public BigDecimal applyDiscount(BigDecimal amount, DiscountType type) {
    if (type == DiscountType.PERCENTAGE) {
        return amount.multiply(BigDecimal.ONE.subtract(type.getValue()));
    } else if (type == DiscountType.FLAT) {
        return amount.subtract(type.getValue()); }
    return amount;
}

Existing tests (100% line coverage):

@Test void percentage_discount() {
    assertEquals(new BigDecimal("90.00"),
        service.applyDiscount(new BigDecimal("100.00"), DiscountType.PERCENTAGE_10));
}

@Test void flat_discount() {
    assertEquals(new BigDecimal("90.00"),
        service.applyDiscount(new BigDecimal("100.00"), DiscountType.FLAT_10));
}

PIT report — two survivors:

>> Line 4: removed conditional (else-if always executes) → SURVIVED
>> Line 6: replaced return amount with return null  → SURVIVED

The survivor for Line 4 is a bit cunning. The test cases for Line 4 happen to have the same numerical answer (100 - 10 = 90 and 100 * 0.9 = 90), so the two discount methods are indistinguishable by these test cases. The survivor for Line 6 is a bit more obvious. The default return statement is not actually executed, so a new unhandled DiscountType will return the original amount without any test case noticing.

Tests that kill these mutants:

@Test void percentage_discount_differs_from_flat() {
    BigDecimal amount = new BigDecimal("200.00");
    BigDecimal result = service.applyDiscount(amount, DiscountType.PERCENTAGE_10);
    // 200 * 0.9 = 180, NOT 200 - 10 = 190
    assertEquals(new BigDecimal("180.00"), result);
}

@Test void unknown_discount_type_returns_original() {
    BigDecimal amount = new BigDecimal("100.00");
    BigDecimal result = service.applyDiscount(amount, DiscountType.NONE);
    assertEquals(amount, result);
}

Both mutants are killed. The tests are now verifying intent rather than execution.
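The reason the amount matters is plain arithmetic, shown here as a small Python check with `Decimal`. The helper names are hypothetical, and a 10% rate and a flat value of 10 are assumed from the test names. One caveat when carrying this back to Java: Python's `Decimal` compares by numeric value, whereas `BigDecimal.equals` also compares scale, so Java assertions on computed amounts may need `compareTo` or an explicit scale.

```python
from decimal import Decimal


def percentage(amount, rate):
    return amount * (Decimal("1") - rate)


def flat(amount, value):
    return amount - value


rate, value = Decimal("0.10"), Decimal("10")

for amount in (Decimal("100.00"), Decimal("200.00")):
    p, f = percentage(amount, rate), flat(amount, value)
    verdict = "indistinguishable" if p == f else "distinguishable"
    print(amount, p, f, verdict)
```

At 100, both branches produce 90, so a test cannot tell which branch ran. At 200 they produce 180 and 190, and the mutant that swaps the branches is killed.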

Non-Java

Python has mutmut and cosmic-ray
JavaScript and TypeScript developers should use Stryker (stryker-mutator.io)
For the Go language, go-mutesting is available

Of these tools, PIT and Stryker seem to be the most mature. However, the basic principle is the same for all languages.

When Mutation Testing Is Overkill?

Not every situation is a good fit for mutation testing. When working on small scripts or prototyping, the overhead is not justified for throwaway code. When working on stable legacy code bases that do not have a high change frequency, mutation testing is mostly a source of noise. When your team does not have a good unit test foundation yet, focus on writing those first. Mutation testing is a measure of test strength. What is the point of measuring if there are no tests? When working on a project in a phase of rapid experimentation where interfaces are changing daily, wait until the design stabilizes.

The Pushback?

"It is slow."
Scoped to PR-changed files: 2-5 minutes. Cheaper than a production bug your tests were too shallow to detect.

"We already have high coverage."
Coverage tells you what code your tests ran. Mutation score tells you what faults your tests detected.

"Some mutants may not be meaningful."
This is true. There are equivalent mutants. Most are handled by PIT. Ignore the rest and move on.

Where This Is Headed?

In the world of AI, passing tests is no longer enough. The real question is: Would my tests fail if the code were wrong?

This is where mutation testing comes in. More and more, it might be the only thing standing between "all green" and silent failure.

Looking forward, I see this working naturally within an agentic model. A surviving mutant will spawn a secondary "Test Generator" agent to create a test case that kills it, before a PR is even reviewed by a human. The mutation testing loop will be fully autonomous: AI generates code, mutation testing identifies the gaps to be filled, and another AI agent fills them. The human reviewer will only be concerned with intent, not coverage.

Have you tried running mutation testing on the code generated by AI or agentic coding tools? Please comment below about the survival rates of your projects or code.

Why Your Next Enterprise Chatbot Should Write Its Own GraphQL Queries (Safely)

Sriramprabhu Rajendran — Mon, 30 Mar 2026 02:31:23 +0000

Your chatbot needs to query live business data. Here is why GraphQL may be a safer, more controllable interface for LLM-generated queries.

The Real Question: How Should Your Agent Talk to Your Data?

If you are building an agentic chatbot that needs access to tools (calling APIs, querying systems of record, and performing other tasks that require coordination), you have already solved the hard conceptual part. Your agent thinks, selects tools, and assembles responses.

However, there is another issue that is not receiving as much consideration as it should: how should your agent interact with your business data?

Picture this: the agent needs to query your sales schema, retrieve customer churn information, and cross-check support tickets. That is three or four tool calls that will run live queries against your backend. These queries need to be safe, typed, auditable, and constrained, as no human will review them prior to execution.

Most teams will default to using REST endpoints. Some will even consider using agents that write SQL against transactional databases, which I would strongly advise against. I think there is a much better way.

The Architecture Pattern

Here is the architecture I keep coming back to. The interesting piece is not the orchestration itself; it is the selection of GraphQL as the data interface between the agent’s tools and the backend.

The GraphQL tool server is the data path: it is the main way the agent accesses business data. The other tools implement specific operations and do not all share the same backend.

With regards to the orchestration piece, it is simple in the sense that the agent selects the tools, runs them, and continues running them in a loop until it has enough information. However, the interesting part is inside those tool runs, and that is where the agent is constructing GraphQL queries against your backend. This is where the interface selection makes or breaks the system.

Why GraphQL Over SQL or REST?

This is my strongest conviction: for enterprise AAL use cases, GraphQL is a safer default interface than REST for LLM-based queries. SQL is appropriate for analytics, but it should not be used as an interface between an agent and your transactional data.

Introspection: Realistically, you expose a curated subset of the schema, or pre-fetch the minified SDL, rather than allowing the agent to freely introspect in production. This ensures the schema remains small enough to be included in the prompt without consuming your context window, while still allowing the agent to discover the data available.
Type safety as a security boundary: There is no DROP TABLE in GraphQL. The schema is a whitelist, and bad queries will not reach your data.
Reduced hallucination surface: While GraphQL removes an entire category of hallucinations (invalid joins, non-existent tables), there can still be queries for non-existent or improperly used relationships, which is why you'll still want to use validation layers.
Guardrails baked in: Complexity analysis, depth limiting, and field-level authorization are all first-class citizens in the GraphQL toolchain.

This is what an agent-generated query looks like in the real world:

query {
  sales(week: "2026-W1") {
    region
    revenue
  }
  churnedCustomers(week: "2026-W2") {
    count
    reason
  }
}

The agent only requests what it needs. Fewer tokens in the response, less load on your backend.

Let’s look at the alternative of allowing an agent to query your data in SQL against your relational data store. What if the agent’s WHERE clause is not quite right and returns the wrong data? What if the agent forgot to include the LIMIT clause and now your entire table is being scanned? What if the agent’s JOIN is not quite right and locks up your data or slows down every other user of your system? GraphQL is the reverse of this problem. The model only sees what you’ve made available and nothing more.

However, to be fair: GraphQL is not without its own set of problems. N+1 query problems will result if resolvers are not implemented properly. Also, with GraphQL, we are moving the complexity to resolver performance and cost management, especially in the case of queries coming from autonomous agents. For offline analytics queries that involve complex aggregation, SQL against a read-only data warehouse is indeed the correct approach. However, that is a fundamentally different scenario from an agent querying your live application data in real time. At the application level, which is where most enterprise chatbots live, GraphQL is indeed a more controllable and auditable interface. That is a trade-off that is worth making for most of the use cases that I see in the wild.

The Code: Two Key Components

Here are the two components that make this pattern work. Everything else is standard boilerplate, which we’re sure you already have in place. Spring Boot is an excellent choice here. Its type-safe support for GraphQL, its maturity, and its support for Spring AI make it an excellent choice for building agent-facing APIs.

1. MCP Tool Server with Guardrails (Java)

The MCP tool server is essentially a safety wrapper for your GraphQL API. The agent sends in its query, which is then checked by the MCP tool before it is run.

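import java.util.Map;

import org.springframework.stereotype.Service;

// Sketch: McpTool, ToolResult, GraphQLClient, SchemaValidator, ValidationResult,
// QueryComplexityAnalyzer, and GraphQLResponse are assumed to be defined
// elsewhere in the application.
@Service
public class GraphQLQueryTool implements McpTool {

    private static final int MAX_QUERY_DEPTH = 4;
    private static final int MAX_QUERY_COMPLEXITY = 100;

    private final GraphQLClient graphQLClient;
    private final SchemaValidator schemaValidator;
    private final QueryComplexityAnalyzer complexityAnalyzer;

    // Constructor injection; the final fields must be assigned here.
    public GraphQLQueryTool(GraphQLClient graphQLClient,
                            SchemaValidator schemaValidator,
                            QueryComplexityAnalyzer complexityAnalyzer) {
        this.graphQLClient = graphQLClient;
        this.schemaValidator = schemaValidator;
        this.complexityAnalyzer = complexityAnalyzer;
    }

    @Override
    public String name() { return "query_business_data"; }

    @Override
    public ToolResult execute(Map<String, Object> parameters) {
        String query = (String) parameters.get("query");

        // Safety Layer 1: Schema validation
        ValidationResult validation = schemaValidator.validate(query);
        if (!validation.isValid()) {
            return ToolResult.error("Invalid query: " + validation.errors());
        }

        // Safety Layer 2: Complexity analysis
        int complexity = complexityAnalyzer.calculate(query);
        if (complexity > MAX_QUERY_COMPLEXITY) {
            return ToolResult.error("Complexity " + complexity
                + " exceeds limit of " + MAX_QUERY_COMPLEXITY);
        }

        // Safety Layer 3: Depth limiting
        int depth = complexityAnalyzer.calculateDepth(query);
        if (depth > MAX_QUERY_DEPTH) {
            return ToolResult.error("Depth " + depth
                + " exceeds limit of " + MAX_QUERY_DEPTH);
        }

        // All checks passed: run the query against the GraphQL backend
        GraphQLResponse response = graphQLClient.execute(query);
        return ToolResult.success(response.toJson());
    }
}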

Three levels of validation before anything touches your data. This is defense in depth. This is important when the model is actually making decisions and sending queries on its own.
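To show what the depth check is measuring, here is a deliberately naive Python sketch that counts nesting by tracking braces outside of string literals. It is an illustration, not a substitute for a real GraphQL parser: it does not handle block strings, fragments, or input-object arguments, and both function names are hypothetical. The limit of 4 matches `MAX_QUERY_DEPTH` above.

```python
def query_depth(query: str) -> int:
    """Maximum brace nesting of a query, ignoring braces inside "strings"."""
    depth = max_depth = 0
    in_string = False
    previous = ""
    for ch in query:
        if ch == '"' and previous != "\\":
            in_string = not in_string
        elif not in_string:
            if ch == "{":
                depth += 1
                max_depth = max(max_depth, depth)
            elif ch == "}":
                depth -= 1
        previous = ch
    return max_depth


def check_depth(query: str, limit: int = 4):
    """Return (allowed, depth) for the given limit."""
    depth = query_depth(query)
    return depth <= limit, depth


SALES_QUERY = """
query {
  sales(week: "2026-W1") {
    region
    revenue
  }
  churnedCustomers(week: "2026-W2") {
    count
    reason
  }
}
"""
DEEP_QUERY = "{ a { b { c { d { e } } } } }"

print(check_depth(SALES_QUERY))  # (True, 2)
print(check_depth(DEEP_QUERY))   # (False, 5)
```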

2. Agentic Orchestration with LangGraph (Python)

LangGraph controls the reasoning loop. The model suggests which tools to invoke, and the orchestration layer controls and corrects the loop until the agent has enough information.

from typing import TypedDict

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import ToolMessage
from langgraph.graph import StateGraph, END


class AgentState(TypedDict):
    # Full conversation so far, including tool calls and tool results
    messages: list


def create_agent():
    model = ChatAnthropic(
        model="claude-opus-x",
        max_tokens=4096, temperature=0
    )
    # The four tools are assumed to be LangChain tools defined elsewhere
    tools = [
        graphql_query_tool, customer_data_tool,
        analytics_tool, notification_tool
    ]
    tool_registry = {tool.name: tool for tool in tools}
    model_with_tools = model.bind_tools(tools)

    def reasoning_node(state):
        # Ask the model for the next step: a final answer or more tool calls
        response = model_with_tools.invoke(state["messages"])
        return {"messages": state["messages"] + [response]}

    def tool_execution_node(state):
        # Run every tool call the model requested and append the results
        last_msg = state["messages"][-1]
        results = [
            ToolMessage(
                content=str(tool_registry[tc["name"]].invoke(tc["args"])),
                tool_call_id=tc["id"]
            ) for tc in last_msg.tool_calls
        ]
        return {"messages": state["messages"] + results}

    def should_continue(state):
        # Keep looping through the tools until the model stops requesting them
        last_msg = state["messages"][-1]
        if hasattr(last_msg, "tool_calls") and last_msg.tool_calls:
            return "execute_tools"
        return END

    graph = StateGraph(AgentState)
    graph.add_node("reason", reasoning_node)
    graph.add_node("execute_tools", tool_execution_node)
    graph.set_entry_point("reason")
    graph.add_conditional_edges("reason", should_continue)
    graph.add_edge("execute_tools", "reason")
    return graph.compile()

The model offers a plan; it is up to the orchestration layer to constrain, execute, and correct it. There is no need to think through all possible questions or create intricate routing logic.

The Hard-Won Opinions

Guardrails are the product. The hardest part is not making agentic chatbots work; it is making them safe. My personal stack includes schema validation, complexity limits, depth limits, tool call budgets (max 8 per turn), query deduplication to prevent loops, hard timeouts (10s/tool, 60s/turn), read-only by default, and field-level authorization for PCI or sensitive data. One other thing that many teams overlook: every query must execute within the authorization context of the requesting user. The agent should not have greater access rights than the user it is acting on behalf of. This is a lot, but an agent with unfettered rights to your business systems is not something you want.

Transparency is the key to building trust. Users need to see the logic, the generated queries, and the raw data. The answer is a black box if it simply says "The answer is..." rather than "I queried the sales data for weeks 12 and 13. I saw that there was a 12% drop in the Northeast. I cross-referenced that with 47 lost enterprise accounts and also looked at the support tickets that came in with billing complaints." Transparency is what will get you adoption. Without it, the project will fail.

Tool calls can get out of control. Agents will get caught in an infinite loop calling the same tool over and over with slightly different parameters. I've seen it happen in one of my prototypes. The agent made 30 nearly identical tool calls in under 15 seconds before timing out. The combination of the budget and the deduplication and the timeout is the bare minimum.
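The budget, deduplication, and turn timeout can live in one small guard object that the tool-execution step consults before every call. This Python sketch is one possible shape, not the author's implementation; `ToolCallGuard` and its method names are hypothetical, and the defaults mirror the limits quoted above (8 calls and 60 seconds per turn). The per-tool 10-second timeout is not covered here.

```python
import json
import time


class ToolCallGuard:
    """Per-turn guard: call budget, exact-duplicate detection, turn timeout."""

    def __init__(self, max_calls=8, turn_timeout_s=60.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.turn_timeout_s = turn_timeout_s
        self.clock = clock
        self.started = clock()
        self.calls = 0
        self.seen = set()

    def check(self, tool_name, args):
        """Return None if the call may proceed, else a rejection reason."""
        if self.clock() - self.started > self.turn_timeout_s:
            return "turn timeout exceeded"
        if self.calls >= self.max_calls:
            return "tool call budget exhausted"
        # Canonical JSON so that key order does not defeat deduplication.
        fingerprint = (tool_name, json.dumps(args, sort_keys=True))
        if fingerprint in self.seen:
            return "duplicate tool call"
        self.seen.add(fingerprint)
        self.calls += 1
        return None


guard = ToolCallGuard(max_calls=2)
print(guard.check("query_business_data", {"week": "2026-W1"}))  # None
print(guard.check("query_business_data", {"week": "2026-W1"}))  # duplicate tool call
print(guard.check("query_business_data", {"week": "2026-W2"}))  # None
print(guard.check("query_business_data", {"week": "2026-W3"}))  # tool call budget exhausted
```

Exact-match deduplication will not catch calls that differ slightly in their parameters, which is why the budget and the timeout are needed as backstops.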

When GraphQL Is Not the Right Fit? Thoughts?

GraphQL is not always the right query interface for agent data. For offline analytics workloads with complex aggregations, SQL against a read-only warehouse is the correct answer. That is analytics, however, not agents querying live data. For high-risk write operations, where the outcome is catastrophic if the query is incorrect, human approval is the answer regardless of the query language. And in some cases, your backend is simply a good set of REST endpoints with well-defined contracts, and the cost of switching is likely not worth the benefit. GraphQL pays off most where agents must query many types of entities with differing field sets, which has been the case in all of the enterprise projects I have been involved with.

Where This Is Heading

The tooling is already mature: current-gen models like Claude Sonnet and GPT-4o have decent native tool usage capabilities, MCP is becoming the de facto standard for tool integration, and GraphQL has been around for over a decade now. The orchestration frameworks are in place. What’s lacking is the organizational willingness to put a typed and guarded query interface between the agent and the data, rather than just making raw REST calls and crossing our fingers.

The advice I’d give is: **Start with one read-only tool against a non-sensitive data set.** Get the reasoning loop right. Make sure stakeholders can see the agent’s output. Then iterate.

Once you have those tools, the real decision is how those tools interact with your data. REST works, and GraphQL offers typed schema definition, introspection, and query constraints. These are much more important when your caller is a model instead of a human.

Is GraphQL the right abstraction layer for LLM-generated queries, or is there a better approach? Drop your thoughts in the comments below.

Beyond the Single Prompt: Orchestrating Parallel Context Isolation (PCI) with Claude Code

Sriramprabhu Rajendran — Sun, 15 Mar 2026 20:24:22 +0000

Executive Summary

As of March 2026, the bottleneck in AI-assisted development is not how intelligent a model is. It is Context Rot. This article introduces Parallel Context Isolation (PCI), a distributed systems approach to running multiple instances of Claude simultaneously to execute complex, production-grade refactors without hallucinations.

The 2026 Reality: From Chatting to Orchestrating

The days of chatting with AI to get a few snippets are over. As we face more complex system refactors, we have crossed the Complexity Threshold. When you give a single AI instance 50+ files to refactor, its context window suffers, and it begins to "hallucinate" API signatures or miss edge cases.

The answer is to move to Parallel Context Isolation (PCI). Instead of a single "God Agent" that attempts to keep all of your architecture in its context, you treat your entire system as a distributed system and each of your agents as separate processes.

🏗️ The Pattern: Parallel Context Isolation (PCI)

Parallel Context Isolation is the pattern of launching several independent Claude Code agents, each working concurrently on the same codebase but within a separate context silo.

The Scenario: Decoupling a Payment Module

Suppose we're tasked with modernizing a legacy "Order Reconciliation" module into a new, asynchronous service-based architecture. In a PCI-based workflow, we create a new terminal, spawn a new project, and create three separate Claude Code agents, each with a specific responsibility:

| Role | Responsibility | Scope |
|------|----------------|-------|
| Business Logic Specialist | Domain Logic & Service layer | `src/main/core/` |
| Schema Modeller | SQL, DTOs, Migrations | `src/main/resources/db/` |
| Quality Shadow | Proactive test generation | `src/test/` |

🛠️ The Blueprint: Coordination via CLAUDE.md

In order to manage multiple agents, multiple merges, and the inevitable chaos of merge conflicts, we need a governance structure. As of 2026, the standard solution is a special "CLAUDE.md" file at the root of your project, serving as your "Shared Memory."

**CLAUDE.md template for reference:**

# Project Rules: Parallel/Multiple Agent Coordination

## 🤖 Multi-Agent Protocol
- **Isolation:** Ensure that agents only write to their designated directory scopes.
- **Locking:** Always check `.claude/tasks/` for a `.lock` file prior to writing a file.
- **State Sync:** If an API contract is updated, please update @ARCHITECTURE.md immediately.

## 📝 Coding Standards
- **Records:** DTOs should always utilize Records to guarantee immutability during agent handoffs.
- **OpenTelemetry:** All new endpoints should include OpenTelemetry tracing.
- **Project Loom:** Consider utilizing Project Loom virtual threads for I/O-bound operations.
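
The Isolation and Locking rules in the template are instructions to the agents; a wrapper script or hook could also enforce them mechanically. Below is a minimal Python sketch of both checks. It assumes one `<task>.lock` file per task under `.claude/tasks/`; that naming and all three helper names are hypothetical conventions, not a Claude Code feature.

```python
import os
import tempfile


def in_scope(path, scope):
    """True if `path` lies inside the agent's designated directory scope."""
    path, scope = os.path.abspath(path), os.path.abspath(scope)
    return os.path.commonpath([path, scope]) == scope


def try_acquire(lock_dir, task_name, agent):
    """Atomically create <task_name>.lock; return True if this agent got it."""
    os.makedirs(lock_dir, exist_ok=True)
    lock_path = os.path.join(lock_dir, f"{task_name}.lock")
    try:
        # O_EXCL makes creation fail if another agent already holds the lock.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(agent)
    return True


def release(lock_dir, task_name):
    os.remove(os.path.join(lock_dir, f"{task_name}.lock"))


print(in_scope("src/main/core/PaymentService.java", "src/main/core/"))  # True
print(in_scope("src/test/PaymentRegTest.java", "src/main/core/"))       # False

with tempfile.TemporaryDirectory() as lock_dir:
    print(try_acquire(lock_dir, "payment-dto", "Schema-Agent"))  # True
    print(try_acquire(lock_dir, "payment-dto", "Logic-Agent"))   # False
    release(lock_dir, "payment-dto")
    print(try_acquire(lock_dir, "payment-dto", "Logic-Agent"))   # True
```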

🛰️ Synchronization: The "Mailbox" Pattern

Since the agents operate independently, we need a mechanism to enable handover. Instead of copy-paste operations, we'll employ an agent-to-agent (A2A) log, A2A_MESSAGES.log:

[2026-03-15 14:02] FROM: Schema-Agent | TO: Logic-Agent
ACTION: Updated 'payment_records' table schema. 
CHANGE: 'amount' field is now BigDecimal (18,2). 
IMPACT: Update 'PaymentDTO.java' to avoid precision loss.

[2026-03-15 14:10] FROM: Logic-Agent | TO: QA-Agent
ACTION: Logic refactor complete in 'PaymentService.java'.
REQUEST: Execute 'PaymentRegTest.java' regression suite.
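
A log like this is easy to read mechanically, which lets each agent (or a supervising script) pull only the messages addressed to it. The Python sketch below parses the format shown above; the header layout is inferred from these two sample entries, and `parse_mailbox` and `inbox` are hypothetical helper names.

```python
import re

# Header line: [timestamp] FROM: sender | TO: recipient
HEADER = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] FROM: (?P<sender>\S+) \| TO: (?P<recipient>\S+)$"
)

SAMPLE_LOG = """
[2026-03-15 14:02] FROM: Schema-Agent | TO: Logic-Agent
ACTION: Updated 'payment_records' table schema.
CHANGE: 'amount' field is now BigDecimal (18,2).
IMPACT: Update 'PaymentDTO.java' to avoid precision loss.

[2026-03-15 14:10] FROM: Logic-Agent | TO: QA-Agent
ACTION: Logic refactor complete in 'PaymentService.java'.
REQUEST: Execute 'PaymentRegTest.java' regression suite.
"""


def parse_mailbox(text):
    """Split the log into messages with header fields and KEY: value lines."""
    messages = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        header = HEADER.match(line)
        if header:
            messages.append({**header.groupdict(), "fields": {}})
        elif messages and ": " in line:
            key, value = line.split(": ", 1)
            messages[-1]["fields"][key] = value
    return messages


def inbox(messages, agent):
    """Messages addressed to one agent, in log order."""
    return [m for m in messages if m["recipient"] == agent]


messages = parse_mailbox(SAMPLE_LOG)
for message in inbox(messages, "QA-Agent"):
    print(message["sender"], message["fields"]["REQUEST"])
```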

📈 The Professional Take: Why This Works

The key advantages of PCI, if you are working with systems that have tight production constraints, are:

Context Hygiene: By keeping the focus of the agent's attention narrow (e.g., only the DB access layer), you eliminate noise in the prompt, resulting in 40% fewer hallucinations within complex enterprise repos.
Concurrency = Velocity bump: You are no longer just coding faster; you are running work concurrently. A 3-day sequential refactor becomes a 4-hour orchestration.
Curation Over Construction: You are no longer the "writer." You are now the Lead Architect. You can now write the plan, direct the concurrent execution, and then perform the integration review.

Final Thoughts

We are shifting from a world where we manage code to a world where we manage context. Parallel Context Isolation is the bridge between "vibe coding" and professional-grade software engineering.

Are you still using a single chat window, or have you moved to a concurrent squad? Let's discuss your multi-agent setups in the comments.

What's your experience with running multiple AI agents? Share your workflows below! 👇