<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Elena Romanova</title>
    <description>The latest articles on Forem by Elena Romanova (@elena_romanova_pl).</description>
    <link>https://forem.com/elena_romanova_pl</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3867070%2F7515d935-53da-4f02-a6be-729121743e97.jpeg</url>
      <title>Forem: Elena Romanova</title>
      <link>https://forem.com/elena_romanova_pl</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/elena_romanova_pl"/>
    <language>en</language>
    <item>
      <title>AI-Assisted Development: When orchestration starts collapsing under its own weight</title>
      <dc:creator>Elena Romanova</dc:creator>
      <pubDate>Tue, 14 Apr 2026 16:23:01 +0000</pubDate>
      <link>https://forem.com/elena_romanova_pl/ai-assisted-development-when-orchestration-starts-collapsing-under-its-own-weight-40k8</link>
      <guid>https://forem.com/elena_romanova_pl/ai-assisted-development-when-orchestration-starts-collapsing-under-its-own-weight-40k8</guid>
      <description>&lt;p&gt;I thought adding more control would make my AI-assisted coding workflow more reliable. Instead, it made the whole system heavier, slower, and more fragile. This article is about that failure — and about the redesign that finally made the workflow stable enough to keep using.&lt;/p&gt;

&lt;p&gt;This is the second article in the series. For more context about the experiment, see &lt;a href="https://dev.to/elena_romanova_pl/from-ai-hype-to-controlled-enterprise-ai-assisted-development-22o0"&gt;AI-Assisted Development: Part I&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;After my first attempts to build an agentic workflow that could be used not only for nice demos, but for something closer to real enterprise development, it became clear that moving forward “as is” made no sense.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/for-alisia/delivery-flow/tree/feature-1-v1" rel="noopener noreferrer"&gt;first working version&lt;/a&gt; showed that orchestrating roles really can improve the result compared to one big generation. The code became more structured, intermediate artifacts started to appear, handoffs (passing results and context from one agent to another) became part of the flow, and there was at least some process discipline. But the limits also became visible: requirements were getting lost on the way, the review gate was still too soft, and confidence in the result was often higher than the result itself deserved.&lt;/p&gt;

&lt;p&gt;That is why the &lt;a href="https://github.com/for-alisia/delivery-flow/tree/feature-1-v2" rel="noopener noreferrer"&gt;next iteration&lt;/a&gt; was not just “one more version” for me. It was an attempt to make the workflow more controllable. At this stage, I was not trying to improve code generation itself as much as I was trying to strengthen control around it: preserve requirements, add more independent validation, and make the process itself easier to understand from one version to another.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrzd60sk2kq3xw4jjwdn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrzd60sk2kq3xw4jjwdn.png" alt="Agentic flow planned for this iteration"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I was trying to fix after the first version
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Drifting requirements
&lt;/h3&gt;

&lt;p&gt;One of the most visible problems in the first version was that some requirements simply disappeared on the way. The original prompt, the user story, and the final implementation no longer fully matched each other. The system could generate fairly clean code, but that still did not mean it preserved the actual meaning of the task all the way through.&lt;/p&gt;

&lt;p&gt;To reduce this drift, I added a &lt;strong&gt;requirement lock&lt;/strong&gt; — a separate layer of explicitly fixed constraints that the &lt;strong&gt;Team Lead&lt;/strong&gt; had to pass further down the chain. The idea was simple: models should not “fill in the blanks” for things that look unimportant to them, and they should not silently narrow the task into a shape that is easier for them to implement.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Requirement Lock
Before delegating, extract and pass a requirement lock that captures:
- Source-of-truth inputs and configuration dependencies
- Required request/response contract constraints
- Default behaviors and mandatory validations
- Explicit exclusions and non-goals
- Unresolved questions that must not be guessed away
Read `documentation/project-overview.md` and scan `documentation/constitution.md` principle titles for constraint relevance. Ask clarification only when docs do not resolve ambiguity.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
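&lt;p&gt;For the GitLab issues feature, a filled-in lock might look roughly like this (the content below is an illustrative sketch, not an artifact from the repo):&lt;/p&gt;

```
## Requirement Lock: first-issues-api (illustrative sketch)
- Source of truth: GitLab REST API; base URL comes from application config
- Contract: GET endpoint returns a JSON list of issues, never null
- Defaults: closed issues are excluded unless explicitly requested
- Validations: unknown query parameters are rejected with a 400 response
- Non-goals: no pagination, no caching in this iteration
- Open question (do not guess): which GitLab project IDs are in scope
```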



&lt;h3&gt;
  
  
  Validation was still too soft
&lt;/h3&gt;

&lt;p&gt;In the previous stage, responsibility for judging the quality of the result still fell mostly on the &lt;strong&gt;Team Lead&lt;/strong&gt;, and that was not enough.&lt;/p&gt;

&lt;p&gt;So I added a separate &lt;strong&gt;&lt;a href="https://github.com/for-alisia/delivery-flow/blob/feature-1-v2/.github/agents/reviewer.agent.md" rel="noopener noreferrer"&gt;Reviewer&lt;/a&gt;&lt;/strong&gt; agent whose job was to validate the &lt;a href="https://github.com/for-alisia/delivery-flow/blob/feature-1-v2/artifacts/implementation-reports/first-issues-api.md" rel="noopener noreferrer"&gt;implementation plan&lt;/a&gt; written by the &lt;strong&gt;&lt;a href="https://github.com/for-alisia/delivery-flow/blob/feature-1-v2/.github/agents/java-architect.agent.md" rel="noopener noreferrer"&gt;Architect&lt;/a&gt;&lt;/strong&gt; against the requirements and project documents, and then review the final implementation itself: check whether it followed the plan, whether it respected the general project rules, and whether the code actually worked instead of just looking believable.&lt;/p&gt;

&lt;p&gt;At the same time, I decided to strengthen the final quality gate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;add &lt;strong&gt;Sonar&lt;/strong&gt; as a required part of final validation&lt;/li&gt;
&lt;li&gt;introduce mandatory &lt;strong&gt;smoke tests&lt;/strong&gt; (a quick check that the application starts and the main scenario actually works) before passing the implementation further down the chain&lt;/li&gt;
&lt;/ul&gt;
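&lt;p&gt;A smoke test at this level can stay very small. Here is a minimal sketch of the idea; the base URL and endpoint paths are assumptions for illustration, not the project's actual configuration:&lt;/p&gt;

```python
# Hypothetical smoke check: hit the happy path and one error path of the
# issues API, and fail fast if the application is not even running.
# The base URL and paths are assumptions, not the repo's real config.
import urllib.error
import urllib.request


def smoke(base, path, expected_status):
    """Return True if GET base+path answers with the expected HTTP status."""
    try:
        with urllib.request.urlopen(base + path) as resp:
            return resp.status == expected_status
    except urllib.error.HTTPError as err:
        # Non-2xx responses arrive as HTTPError; the status may still match.
        return err.code == expected_status
    except OSError:
        return False  # connection refused: the application is not up


if __name__ == "__main__":
    base = "http://localhost:8080"
    ok = smoke(base, "/api/issues", 200) and smoke(base, "/api/issues/0", 404)
    print("SMOKE PASS" if ok else "SMOKE FAIL")
```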

&lt;h3&gt;
  
  
  No transparency between versions
&lt;/h3&gt;

&lt;p&gt;Once the workflow configuration started changing quickly, I myself started losing track of what was actually helping and what was not. So I added &lt;strong&gt;&lt;a href="https://github.com/for-alisia/delivery-flow/tree/feature-1-v2/.github/agentic-flow/logs" rel="noopener noreferrer"&gt;run logs&lt;/a&gt;&lt;/strong&gt; for each iteration — a way to record not only what changed in the setup, but also what really happened during each run.&lt;/p&gt;

&lt;p&gt;You can check the final code for this iteration on &lt;a href="https://github.com/for-alisia/delivery-flow/tree/feature-1-v2" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The workflow looked stricter — but became heavier
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5w53a73udzf4yzwj4aql.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5w53a73udzf4yzwj4aql.webp" alt="Scorecard for version v2.0.0"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you only looked at the final code, the next iteration really was a step forward.&lt;/p&gt;

&lt;p&gt;The generated implementation matched the technical task much better. The task itself was quite simple: an API that returns a list of GitLab issues. This time the result was noticeably closer to the original intent. The code was more structured and covered with tests, the application started successfully, and the endpoint really returned a valid result.&lt;/p&gt;

&lt;p&gt;Compared to the previous version, this was already a meaningful improvement.&lt;/p&gt;

&lt;p&gt;But the main failure of this iteration was not in the code itself. It was in the process.&lt;/p&gt;

&lt;h3&gt;
  
  
  What went wrong
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;the full cycle took &lt;strong&gt;more than two hours&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;there were four return loops between &lt;strong&gt;Team Lead&lt;/strong&gt; and &lt;strong&gt;Coder&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;the &lt;a href="https://github.com/for-alisia/delivery-flow/blob/feature-1-v2/artifacts/implementation-signoffs/first-issues-api.signoff.md" rel="noopener noreferrer"&gt;final review artifact&lt;/a&gt; ended up corrupted&lt;/li&gt;
&lt;li&gt;some of the control mechanisms added more friction than actual stability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And at that point something became much more obvious: good, or at least acceptable, code does not automatically mean that the orchestration system itself is reliable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Context rotting became the first inevitable cost of complex orchestration
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23gjvcuvy7wkpyp8bkuv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23gjvcuvy7wkpyp8bkuv.png" alt="Execution flow for v2.0.0"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The main problem turned out to be &lt;strong&gt;context rotting&lt;/strong&gt; — the degradation of quality caused by overloaded context, too many artifacts, and compaction inside a long session.&lt;/p&gt;

&lt;p&gt;As the workflow became more complex, the number of artifacts that agents had to read, pass to each other, and generate again also grew. On top of that there was chat history. By the time I reached the second call to the Coder, I could already see signs of compaction, and it was clearly starting to affect agent performance.&lt;/p&gt;

&lt;p&gt;Formally, there were now more control points. But that did not make the system more stable — it made it heavier.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why formal control did not create stability
&lt;/h3&gt;

&lt;p&gt;The problem was not only the size of the context.&lt;/p&gt;

&lt;p&gt;Yes, &lt;strong&gt;Sonar&lt;/strong&gt; turned out to be too expensive and too noisy as a quality gate for this kind of process. Yes, the wrong model was again picked up for the &lt;strong&gt;Architect&lt;/strong&gt; role. Yes, the implementation plan was still not specific enough and did not help the &lt;strong&gt;Coder&lt;/strong&gt; as much as it should have. Yes, the Coder remained the weakest part of the system: it returned invalid results, skipped the required structured handoff (a structured batch summary with changed files, checks, and status), got lost in terminal commands, and in general consumed a disproportionate amount of time compared to the real value of the feature.&lt;/p&gt;

&lt;p&gt;But these were no longer separate small failures. They were symptoms of a much bigger problem: I was trying to strengthen the workflow by adding more roles, more artifacts, and more review stages. As a result, the system became not more reliable, but heavier.&lt;/p&gt;

&lt;p&gt;By the end of that run, one thing was clear to me: &lt;em&gt;reasonably valid code alone does not mean the orchestration is successful&lt;/em&gt;. Generation time is a resource just like tokens are. If a simple feature requires more than two hours, several repeat loops, and still does not produce a stable final review, then the problem is no longer in one specific agent. It is in the process design itself.&lt;/p&gt;




&lt;h2&gt;
  
  
  The next step was not “more control”, but a lighter system
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://github.com/for-alisia/delivery-flow/tree/feature-1-v2.1.0" rel="noopener noreferrer"&gt;next iteration&lt;/a&gt; was no longer just a series of small fixes. It was an attempt to redesign the workflow so that it would consume less context, create fewer chances for drift, and rely on more deterministic validation.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Lighter handoff artifacts
&lt;/h3&gt;

&lt;p&gt;The main goal was to reduce the risk of context rotting.&lt;/p&gt;

&lt;p&gt;To do that, I:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;replaced heavy markdown &lt;a href="https://github.com/for-alisia/delivery-flow/tree/feature-1-v2.1.0/artifacts" rel="noopener noreferrer"&gt;handoff artifacts&lt;/a&gt; (artifacts passed between agents) with JSON wherever possible&lt;/li&gt;
&lt;li&gt;kept markdown only where the artifact really had to be readable as a document by a human — for example, user story and implementation plan&lt;/li&gt;
&lt;li&gt;shortened and simplified the agent prompts&lt;/li&gt;
&lt;li&gt;cleaned up part of the context documentation&lt;/li&gt;
&lt;/ul&gt;
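&lt;p&gt;As a rough illustration, a JSON handoff can stay very small while still carrying the changed files, checks, and status that the structured batch summary requires. The field names below are hypothetical, not the repo's actual schema:&lt;/p&gt;

```json
{
  "agent": "coder",
  "batch": 2,
  "status": "ready-for-review",
  "changed_files": [
    "src/main/java/app/IssueController.java",
    "src/test/java/app/IssueControllerTest.java"
  ],
  "checks": { "build": "pass", "unit_tests": "pass", "smoke": "pass" },
  "open_questions": []
}
```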

&lt;h3&gt;
  
  
  2. Circuit breaker and red card logic
&lt;/h3&gt;

&lt;p&gt;Another important change was a circuit breaker — a mechanism that prevents the same fix loop from running forever.&lt;/p&gt;

&lt;p&gt;Now the &lt;strong&gt;&lt;a href="https://github.com/for-alisia/delivery-flow/blob/feature-1-v2.1.0/.github/agents/team-lead.agent.md" rel="noopener noreferrer"&gt;Team Lead&lt;/a&gt;&lt;/strong&gt; had to re-check the &lt;strong&gt;&lt;a href="https://github.com/for-alisia/delivery-flow/blob/feature-1-v2.1.0/.github/agents/java-coder.agent.md" rel="noopener noreferrer"&gt;Coder’s&lt;/a&gt;&lt;/strong&gt; results, and the system got a &lt;strong&gt;red card mechanism&lt;/strong&gt;: if an agent returned a false-positive or incomplete result several times in a row, the task would not just go through the same loop again — it would go back to the &lt;strong&gt;Architect&lt;/strong&gt; for plan revision.&lt;/p&gt;

&lt;p&gt;This was a simple change, but it was the first time the workflow started behaving more like a real engineering escalation process instead of an endless retry loop.&lt;/p&gt;
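&lt;p&gt;The routing logic described above can be sketched as a small counter. The names and the threshold here are illustrative, and the real workflow tracks this state in checkpoint files rather than in code:&lt;/p&gt;

```python
# Minimal sketch of the red card escalation: repeated bad handoffs from the
# Coder stop triggering the same retry loop and instead send the task back
# to the Architect for plan revision. Names and threshold are illustrative.
RED_CARD_LIMIT = 3  # consecutive bad handoffs before escalating


class CircuitBreaker:
    def __init__(self, limit=RED_CARD_LIMIT):
        self.limit = limit
        self.failures = 0

    def record(self, handoff_valid):
        """Return the Team Lead's next routing decision for this task."""
        if handoff_valid:
            self.failures = 0  # a good handoff resets the counter
            return "accept"
        self.failures += 1
        if self.failures == self.limit:
            self.failures = 0  # red card: the loop itself is the problem
            return "escalate-to-architect"
        return "retry-coder"
```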

&lt;h3&gt;
  
  
  3. Local tooling instead of heavy external gates
&lt;/h3&gt;

&lt;p&gt;Tooling changed a lot too. I moved away from a heavy external quality gate and switched to a local toolset:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Checkstyle&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;PMD/CPD&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SpotBugs&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;JaCoCo check&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I also created separate &lt;a href="https://github.com/for-alisia/delivery-flow/blob/feature-1-v2.1.0/.github/instructions/test-instructions.instructions.md" rel="noopener noreferrer"&gt;test instructions&lt;/a&gt; with clearer test types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;unit&lt;/li&gt;
&lt;li&gt;component&lt;/li&gt;
&lt;li&gt;integration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And to reduce terminal chaos, I added local scripts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/for-alisia/delivery-flow/blob/feature-1-v2.1.0/scripts/verify-quick.sh" rel="noopener noreferrer"&gt;&lt;code&gt;verify-quick.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/for-alisia/delivery-flow/blob/feature-1-v2.1.0/scripts/quality-check.sh" rel="noopener noreferrer"&gt;&lt;code&gt;quality-check.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Their job was very practical: replace a stream of random terminal commands with more predictable and repeatable validation steps.&lt;/p&gt;
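&lt;p&gt;The actual scripts are shell wrappers around the local tools, but the idea behind them fits in a few lines. In this sketch the exact Maven goals and the runner itself are assumptions for illustration:&lt;/p&gt;

```python
# Hypothetical sketch of a verify-quick runner: execute a fixed, ordered
# set of local checks and stop at the first failure, so agents run one
# predictable command instead of improvising in the terminal.
import subprocess

# The Maven goals below are illustrative stand-ins for the local toolset.
QUICK_CHECKS = [
    ["mvn", "-q", "checkstyle:check"],
    ["mvn", "-q", "pmd:check", "pmd:cpd-check"],
    ["mvn", "-q", "spotbugs:check"],
    ["mvn", "-q", "jacoco:check"],
]


def run_checks(checks):
    """Run each check in order; return the first failing command, or None."""
    for cmd in checks:
        try:
            result = subprocess.run(cmd)
        except FileNotFoundError:
            return " ".join(cmd)  # missing tool counts as a failed check
        if result.returncode != 0:
            return " ".join(cmd)
    return None


if __name__ == "__main__":
    failed = run_checks(QUICK_CHECKS)
    print("All quick checks passed" if failed is None else "FAILED: " + failed)
```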

&lt;h3&gt;
  
  
  4. A more useful plan for the Coder
&lt;/h3&gt;

&lt;p&gt;Finally, I strengthened the requirements for the &lt;strong&gt;&lt;a href="https://github.com/for-alisia/delivery-flow/blob/feature-1-v2.1.0/.github/agents/java-architect.agent.md" rel="noopener noreferrer"&gt;Architect&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Now I expected not just “a plan in general terms”, but something actually useful for the &lt;strong&gt;Coder&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;task split into slices&lt;/li&gt;
&lt;li&gt;required payload examples&lt;/li&gt;
&lt;li&gt;expected class structure&lt;/li&gt;
&lt;li&gt;logging expectations&lt;/li&gt;
&lt;li&gt;test coverage expectations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal was to reduce guesswork before coding even started.&lt;/p&gt;
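&lt;p&gt;To make this concrete, a single slice of such a plan might look roughly like this (an illustrative sketch, not a plan from the repo):&lt;/p&gt;

```
## Slice 1: GET /issues endpoint (illustrative sketch)
- Classes: IssueController, IssueService, GitLabClient
- Payload example (response body):
  [{"id": 101, "title": "Fix login", "state": "opened"}]
- Logging: one INFO line per upstream GitLab call, including duration
- Tests: unit tests for the mapping, component test for the controller
  (happy path plus one error path)
```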




&lt;h2&gt;
  
  
  What actually stabilized the process
&lt;/h2&gt;

&lt;p&gt;This iteration did not make the workflow perfect. But it became the first one where the main problem from the previous run — overloaded and degrading context — stopped being the biggest risk.&lt;/p&gt;

&lt;p&gt;In the previous run, the system looked stricter on paper: more artifacts, more steps, more checks. But in practice it produced the opposite effect. The context kept growing, handoff artifacts became too heavy, Reviewer Phase 2 broke, and the Coder spent too much time in loops and unstable validation.&lt;/p&gt;

&lt;p&gt;After simplifying handoffs, reducing agent instructions, and moving toward more local and deterministic checks, the workflow finally stopped collapsing halfway through the run. Reviewer artifacts were no longer damaged. JSON handoffs really did stabilize state transfer. Context usage improved noticeably. In this run, the Coder stopped giving false-positive reports. And the total run time dropped by roughly half.&lt;/p&gt;

&lt;p&gt;For a quick comparison, it looked like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fstr283mi47cgg63uwjl7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fstr283mi47cgg63uwjl7.png" alt="Comparison between v2.0.0 and v2.1.0"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The main conclusion here was simple: the problem was not only the quality of specific models, but how much operational noise the workflow itself was creating. Once the handoff became lighter and the checks became closer, simpler, and cheaper, orchestration stopped degrading so early.&lt;/p&gt;

&lt;h2&gt;
  
  
  But process stability still did not mean system maturity
&lt;/h2&gt;

&lt;p&gt;At the same time, this new iteration did not solve everything.&lt;/p&gt;

&lt;h3&gt;
  
  
  What was still a problem
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The plan was still too long.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Even a good plan should not blow up downstream context. If the implementation plan itself becomes a heavy artifact, it starts hurting the very stability it was supposed to support. A harder limit was needed here — no more than 200 lines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There was still no mandatory early smoke check for the API.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Coder still tended to treat the work as finished after tests. But green tests are not the same as a working API. Happy path and error path should be checked before Reviewer Phase 2, not later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Handoff discipline was still not fully there.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each batch still needed a proper structured handoff, and agents still should not be allowed to create side files outside the scope of the plan and checkpoint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Red card logic existed, but was not fully wired into execution.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The counters and checkpoint state were already there, but the mechanism still had to become automatic and formal instead of situational.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tooling still missed some code-style issues.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;repeated string literals&lt;/li&gt;
&lt;li&gt;inconsistent constructor styles&lt;/li&gt;
&lt;li&gt;non-standardized use of Mockito matchers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the process became more stable, but code-shape quality still needed separate attention.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why I decided to keep this code
&lt;/h3&gt;

&lt;p&gt;This was the point where something important changed for me.&lt;/p&gt;

&lt;p&gt;This iteration was not perfect. But it was the first version that I decided not just to document and move on from, but to keep as the basis for future runs. Not as a final result, but as the first truly usable &lt;strong&gt;seed version&lt;/strong&gt; for further generation.&lt;/p&gt;

&lt;p&gt;And maybe the main lesson was not even about agents themselves. It was about the engineering practice around them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local tools&lt;/strong&gt; — the ones we almost started forgetting in the era of heavy IDE mega-platforms and all-in-one systems — suddenly felt useful again. They were simple, local, understandable, and cheap to run. And that was exactly what gave the workflow the stability it had been missing.&lt;/p&gt;

&lt;p&gt;After that, the focus shifted naturally.&lt;/p&gt;

&lt;p&gt;Once the main context problem was significantly reduced, the next question became different: not whether the orchestration would collapse halfway through, but &lt;strong&gt;what kind of code it produces&lt;/strong&gt; now — and whether that code looks like something I would actually want to maintain further.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>softwareengineering</category>
      <category>softwaredevelopment</category>
      <category>agents</category>
    </item>
    <item>
      <title>From AI Hype to Controlled Enterprise AI-Assisted Development</title>
      <dc:creator>Elena Romanova</dc:creator>
      <pubDate>Wed, 08 Apr 2026 13:58:54 +0000</pubDate>
      <link>https://forem.com/elena_romanova_pl/from-ai-hype-to-controlled-enterprise-ai-assisted-development-22o0</link>
      <guid>https://forem.com/elena_romanova_pl/from-ai-hype-to-controlled-enterprise-ai-assisted-development-22o0</guid>
      <description>&lt;p&gt;Most AI coding demos answer the easiest question: can a model generate something that runs?&lt;/p&gt;

&lt;p&gt;This experiment asks a harder one: &lt;strong&gt;can a multi-agent workflow produce code that stays aligned with requirements, respects project constraints, and remains reviewable enough for a long-lived enterprise codebase&lt;/strong&gt;?&lt;/p&gt;

&lt;p&gt;This article covers the first iteration of that experiment: a four-role orchestration setup, the artifacts around it, and the lessons that came from requirement drift, weak review gates, and over-optimistic test confidence.&lt;/p&gt;




&lt;p&gt;The AI wave of the last few years did not pass my company by.&lt;/p&gt;

&lt;p&gt;Presentations, workshops, ambitious plans to improve productivity by double-digit percentages before the end of the year through AI adoption — all of that quickly became part of everyday engineering life. We recently got access to GitHub Copilot, and for me that was the final signal: there is no going back. These tools are here, and I need to adapt to this new reality seriously.&lt;/p&gt;

&lt;p&gt;I work in a large financial enterprise — thousands of engineers across the globe, strict regulatory and security constraints, projects that live for decades, and release processes that can take a week to prepare and involve a pile of tickets, approvals, and governance steps. In that world, Java 8 is still a perfectly normal tool, and getting a project on Java 17 already feels like a gift. I am a full-stack developer (Java + React) and a Delivery Lead in a 13-person team, so for me it matters a lot that our processes stay effective and our releases remain predictable.&lt;/p&gt;

&lt;p&gt;When I started digging deeper into AI-assisted development, one problem became obvious very quickly: almost every tutorial and demo looks the same. Usually it is a React TODO app, a portfolio page, or some neat little proof of concept that a model or a swarm of agents generates from scratch. Then the video author checks whether it runs, and that is treated as success.&lt;/p&gt;

&lt;p&gt;But almost nobody looks deeper:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how maintainable the code is&lt;/li&gt;
&lt;li&gt;whether it actually fits the architecture&lt;/li&gt;
&lt;li&gt;what the test coverage looks like&lt;/li&gt;
&lt;li&gt;how it behaves in edge cases&lt;/li&gt;
&lt;li&gt;whether it can be evolved further without a full rewrite&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The “&lt;em&gt;if it doesn’t work, delete it and generate again&lt;/em&gt;” approach works perfectly well for throwaway code: tiny demos, pet projects, one-page apps, or presentation-ready POCs. It is a good starting point if you want to get familiar with AI code generation. &lt;strong&gt;But it has very little to do with real enterprise development.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And this is where things got interesting: finding something genuinely useful for the day-to-day life of a working developer turned out to be much harder than I expected. The experiments my team started trying in practice were often chaotic and, quite often, brought more new problems than real value.&lt;/p&gt;




&lt;h2&gt;
  
  
  What problems we ran into
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Unpredictable generation
&lt;/h4&gt;

&lt;p&gt;Very often, debugging and fixing generated code — or simply figuring out why it did not work or behaved differently than expected — took more time than implementing the same functionality manually.&lt;/p&gt;

&lt;p&gt;That is the moment when AI stops being an accelerator and starts becoming an expensive detour. Instead of saving time, the team gets a new layer of noise that still has to be untangled by hand.&lt;/p&gt;

&lt;h4&gt;
  
  
  Constant violations of project rules and engineering principles
&lt;/h4&gt;

&lt;p&gt;We were slowly turning into &lt;strong&gt;code police&lt;/strong&gt;: adding more and more restrictions to instructions, refining wording, documenting anti-patterns — and the model still kept either ignoring the rules or finding a new creative way to break them.&lt;/p&gt;

&lt;p&gt;General phrases like “&lt;em&gt;write production-ready code&lt;/em&gt;” or “&lt;em&gt;follow clean architecture&lt;/em&gt;” sound nice, but in practice they guarantee almost nothing. Without concrete boundaries and checks, they remain far too vague.&lt;/p&gt;

&lt;h4&gt;
  
  
  No structured project context
&lt;/h4&gt;

&lt;p&gt;Developers inside a team know a lot of things that are never explicitly written down. For example, a specific Elasticsearch query used by Flowable may not exist in the codebase as a normal .ftl file at all, but instead live inside a .zip archive in a specific directory or be accessible only through a designer UI. To a human, that is normal working knowledge. To the model, it is a blank spot.&lt;/p&gt;

&lt;p&gt;So the model starts hallucinating, and the team compensates by piling on more context files, more instructions, more notes, and more workarounds.&lt;/p&gt;

&lt;p&gt;The problem got worse because there was no unified team-level approach. Everyone kept their own local context files, prompt snippets, and notes based on what they personally thought was important. As a result, &lt;strong&gt;outputs became unpredictable and inconsistent from one developer to another&lt;/strong&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Uncontrolled merge request style
&lt;/h4&gt;

&lt;p&gt;The code started looking so unlike its supposed author that it was obvious it had been generated almost end-to-end, which meant the extra burden immediately shifted to the reviewer.&lt;/p&gt;

&lt;p&gt;The reviewer no longer just had to assess logic and task fit. First, they had to convince themselves that the code was even safe to merge into the project, that it did not hide surprises, and that it would not introduce additional technical debt.&lt;/p&gt;

&lt;p&gt;Very quickly it became clear that developers get especially annoyed reviewing AI-generated code when the author has not cleaned it up, adapted it, and made it feel like part of the existing codebase before opening the MR. As a result, &lt;strong&gt;tension inside the team grew, and trust in each other’s code dropped noticeably&lt;/strong&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Unrealistic expectations of AI as a “magic box”
&lt;/h4&gt;

&lt;p&gt;Some people — especially in management, though not only there — started treating AI as a kind of magic box: put a good prompt in, and a great result will automatically come out.&lt;/p&gt;

&lt;p&gt;That mindset says a lot not only about inflated expectations, but also about how poorly many people still understand how models actually work, where their limits are, and why &lt;strong&gt;good results do not appear on their own without context, verification, and engineering control&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Sometimes this became almost absurd. At one point I was asked to take a 35-page SAD (Software Architecture Document) template in Word, run it through Opus, and generate documentation for a two-year-old project with thousands of classes.&lt;/p&gt;

&lt;p&gt;Requests like that make it painfully obvious how little many people still understand about the real cost of context, verification, and engineering accountability.&lt;/p&gt;

&lt;h4&gt;
  
  
  No clear model for a large real-world project
&lt;/h4&gt;

&lt;p&gt;And this was probably the biggest problem of all. Even in engineering environments, there often was no clear understanding of how to adapt all this AI enthusiasm to a large real project — one with architecture, standards, code reviews, security constraints, support burden, and a long history of decisions.&lt;/p&gt;

&lt;p&gt;We had all seen success stories about React TODO apps. &lt;strong&gt;We had seen far fewer examples of how AI-assisted development could actually work in a serious team environment&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The experiment: trying to build an enterprise-friendly approach
&lt;/h2&gt;

&lt;p&gt;I do not have a silver bullet.&lt;/p&gt;

&lt;p&gt;I am not going to pretend I found some universal solution that removes all of these problems in a few hours. Quite the opposite: my goal is to show the path honestly — the hypotheses I am testing, the decisions I am making, and the mistakes I am making along the way — so others can reuse what works and avoid at least some of what does not.&lt;/p&gt;

&lt;p&gt;At its core, I want to test whether it is possible to build an AI-assisted workflow that is reliable enough for a serious project.&lt;/p&gt;

&lt;h3&gt;
  
  
  What “good” looks like for me
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Code&lt;/strong&gt; that is as &lt;strong&gt;indistinguishable&lt;/strong&gt; as possible from something written by me or a teammate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasonably stable output&lt;/strong&gt; from one iteration to the next&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Strong test coverage&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Code that &lt;strong&gt;actually works&lt;/strong&gt;, not just compiles&lt;/li&gt;
&lt;li&gt;Null-safe implementations that handle edge cases properly&lt;/li&gt;
&lt;li&gt;No need for complicated prompt engineering before every generation&lt;/li&gt;
&lt;li&gt;Developer control preserved throughout the process: final responsibility for quality stays with a human, and there is &lt;strong&gt;always a human in the loop&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This part matters a lot: I am not trying to build a magical system that “writes everything by itself.” I am trying to build a controlled process where AI helps accelerate development, while the developer remains the one who sets boundaries, validates the outcome, and owns quality.&lt;/p&gt;

&lt;h3&gt;Why I chose a pet project&lt;/h3&gt;

&lt;p&gt;I did not want to run these experiments directly on a work project.&lt;/p&gt;

&lt;p&gt;First, the cost of mistakes there is too high. Second, I was interested in a harder challenge: starting from a blank slate.&lt;/p&gt;

&lt;p&gt;In my experience, models struggle the most when they have to produce good code with no real project context at all. When there are no established conventions, no patterns to anchor to, and no domain-specific structure yet, the output becomes especially unpredictable. Quite often, it is the kind of code you want to throw away immediately, without even trying to refactor it.&lt;/p&gt;

&lt;p&gt;So I started building my own project — &lt;strong&gt;Delivery Orchestrator&lt;/strong&gt;, a Java application with an MCP server acting as its client layer.&lt;/p&gt;

&lt;p&gt;This is not a toy application and not yet another TODO list. I deliberately chose a project that is expected to become complex, scalable, and architecture-sensitive over time. It will grow step by step — story by story, generation by generation. Readability, refactorability, package structure, class boundaries, and architecture matter from the very beginning.&lt;/p&gt;

&lt;p&gt;But at the first stage, it serves another purpose too: it is my playground for answering one bigger question — whether AI-assisted development can be turned from a pretty demo into a &lt;strong&gt;controlled, reviewable, and engineering-sound process&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;Where I started: the initial orchestration model&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftsgzuvcpfb7ullfte3bn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftsgzuvcpfb7ullfte3bn.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At that point, I already knew that a great prompt plus a shared instruction file would not be enough. Yes, that improves the output noticeably, and I still think it should be the first step in any project. But for a large backend codebase, it is nowhere near sufficient.&lt;/p&gt;

&lt;p&gt;That is why I became interested in the idea of agent orchestration, which I first picked up from Burke Holland. What I liked was the core idea itself: instead of trying to squeeze everything out of one “smart” request, split the work into several roles with clear responsibility and explicit handoffs between them.&lt;/p&gt;

&lt;h3&gt;The initial idea&lt;/h3&gt;

&lt;p&gt;The first working model was fairly simple:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Team Lead&lt;/strong&gt; — takes in the task, sets the boundaries, and assembles the final result&lt;br&gt;
&lt;strong&gt;Product Manager&lt;/strong&gt; — formulates the user story&lt;br&gt;
&lt;strong&gt;Java Architect&lt;/strong&gt; — proposes the implementation plan&lt;br&gt;
&lt;strong&gt;Java Coder&lt;/strong&gt; — writes the code, tests, and verification evidence&lt;/p&gt;

&lt;p&gt;The point was not to “give agents freedom,” but the opposite — to narrow their area of responsibility. Each step had to produce a clear artifact that could be reviewed with human eyes, instead of just passing along vague context.&lt;/p&gt;
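&lt;p&gt;The handoff idea can be sketched as a tiny pipeline. This is a plain-Java illustration of the shape of the flow, not the real agent setup: the class, method names, and artifact contents are invented for the example.&lt;/p&gt;

```java
// Illustrative sketch: each role consumes the previous artifact and
// produces a new, human-reviewable one. Role names mirror the article;
// everything else is a placeholder.
public class OrchestrationSketch {

    static String productManager(String task) {
        return "User story for: " + task;
    }

    static String architect(String story) {
        return "Implementation plan based on: " + story;
    }

    static String coder(String plan) {
        return "Code + tests implementing: " + plan;
    }

    // The Team Lead assembles the chain and keeps every intermediate
    // artifact, so a human can review each handoff, not just the result.
    static String[] run(String task) {
        String story = productManager(task);
        String plan = architect(story);
        String code = coder(plan);
        return new String[] { story, plan, code };
    }

    public static void main(String[] args) {
        for (String artifact : run("Add delivery status endpoint")) {
            System.out.println(artifact);
        }
    }
}
```

&lt;p&gt;The point of the sketch is the trail: every stage returns something inspectable, so review can happen at each handoff rather than only at the end.&lt;/p&gt;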

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft0te41hblqbpli8mrplh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft0te41hblqbpli8mrplh.png" alt=" " width="800" height="184"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;What I added right away&lt;/h3&gt;

&lt;p&gt;At this stage, I already had the basic ingredients that, to me, are required before any serious work even starts.&lt;/p&gt;

&lt;p&gt;First, there was a shared layer of project instructions — &lt;a href="https://github.com/for-alisia/delivery-flow/blob/feature-1-v1/.github/copilot-instructions.md" rel="noopener noreferrer"&gt;copilot-instructions.md&lt;/a&gt;, defining the baseline architectural and process context.&lt;/p&gt;

&lt;p&gt;Second, there was a dedicated &lt;a href="https://github.com/for-alisia/delivery-flow/blob/feature-1-v1/artifacts/code-guidance.md" rel="noopener noreferrer"&gt;code-guidance.md&lt;/a&gt; file with rules for code style, structure, and quality.&lt;/p&gt;

&lt;p&gt;But the most important layer from the start was &lt;a href="https://github.com/for-alisia/delivery-flow/blob/feature-1-v1/artifacts/constitution.md" rel="noopener noreferrer"&gt;constitution.md&lt;/a&gt;, a set of &lt;strong&gt;technology-agnostic rules&lt;/strong&gt; that should not be violated regardless of which part of the system an agent is working on. In practice, this was a layer of invariants: not “nice-to-have recommendations,” but constraints that should survive changes in models, prompts, and even the workflow itself. And interestingly, this layer has remained one of the most resilient so far: when these rules are explicit and live in the repository, agents tend to check them first.&lt;/p&gt;

&lt;p&gt;On top of that, I already had:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;intermediate artifacts between stages&lt;/li&gt;
&lt;li&gt;expectations not only around code, but also tests, documentation, and signs of verification&lt;/li&gt;
&lt;li&gt;dedicated folders for &lt;a href="https://github.com/for-alisia/delivery-flow/tree/feature-1-v1/artifacts/implementation-reports" rel="noopener noreferrer"&gt;implementation reports&lt;/a&gt;, agent setup in &lt;a href="https://github.com/for-alisia/delivery-flow/tree/feature-1-v1/.github/agents" rel="noopener noreferrer"&gt;.github/agents&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So even at an early stage, the goal was not “generate an endpoint,” but create a &lt;strong&gt;minimally disciplined delivery flow&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;The first meaningful run: what worked&lt;/h3&gt;

&lt;p&gt;One of the early setups already produced something useful.&lt;/p&gt;

&lt;p&gt;What came out was not random code, but a fairly clean, distinctly enterprise-style skeleton:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a layered Spring structure&lt;/li&gt;
&lt;li&gt;basic validation&lt;/li&gt;
&lt;li&gt;error mapping&lt;/li&gt;
&lt;li&gt;API documentation&lt;/li&gt;
&lt;li&gt;unit tests and &lt;code&gt;@WebMvcTest&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;overall, a much more disciplined result than one-shot generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That was an important moment: it became clear that &lt;strong&gt;orchestration really can improve the quality&lt;/strong&gt; of the first output compared to the classic “here is a prompt, generate the whole thing” approach.&lt;/p&gt;

&lt;p&gt;If you want to see that exact slice of the project, it is available in the &lt;a href="https://github.com/for-alisia/delivery-flow/tree/feature-1-v1" rel="noopener noreferrer"&gt;feature-1-v1 branch&lt;/a&gt;. It already contains the early repository structure, flow-orchestrator, mcp-server skeleton, project artifacts, and the full layer of constraints this run was built on.&lt;/p&gt;

&lt;h3&gt;What this run exposed immediately&lt;/h3&gt;

&lt;p&gt;At the same time, this run also showed the limitations of the approach.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fipjbm2iztkoeskco1g8u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fipjbm2iztkoeskco1g8u.png" alt=" " width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;Requirements started shrinking along the way&lt;/h4&gt;

&lt;p&gt;This was probably the most important observation.&lt;/p&gt;

&lt;p&gt;Even though the original idea and the intermediate artifacts were broader, the final implementation ended up narrower and simpler than expected. Some requirements simply disappeared during the handoffs.&lt;/p&gt;

&lt;p&gt;So the system could already produce neat artifacts and decent code — but it still &lt;strong&gt;could not reliably preserve the meaning of the task all the way to the end&lt;/strong&gt;.&lt;/p&gt;

&lt;h4&gt;There was no independent technical review&lt;/h4&gt;

&lt;p&gt;At that stage, there was no dedicated reviewer or audit layer between the plan and final acceptance. As a result, all drift detection and mismatch checking effectively fell back onto the Team Lead.&lt;/p&gt;

&lt;p&gt;That exposed a weak spot very quickly: &lt;strong&gt;splitting work across roles does not automatically create quality control&lt;/strong&gt;.&lt;/p&gt;

&lt;h4&gt;Quality gates were too soft&lt;/h4&gt;

&lt;p&gt;The tests were already there, but they created a stronger sense of reliability than the code actually had.&lt;/p&gt;

&lt;p&gt;Yes, the code was checked better than in a one-shot setup. But for integration-heavy logic, it was still not enough:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;there were not enough realistic integration tests&lt;/li&gt;
&lt;li&gt;there was no full end-to-end check of the whole chain&lt;/li&gt;
&lt;li&gt;there was no proper contract verification for external calls&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;The instructions existed, but they were not “locked in”&lt;/h4&gt;

&lt;p&gt;This became another key lesson that shaped the next setup.&lt;/p&gt;

&lt;p&gt;Even if the rules already live in the repository — in &lt;a href="https://github.com/for-alisia/delivery-flow/blob/feature-1-v1/.github/copilot-instructions.md" rel="noopener noreferrer"&gt;copilot-instructions.md&lt;/a&gt;, &lt;a href="https://github.com/for-alisia/delivery-flow/blob/feature-1-v1/artifacts/code-guidance.md" rel="noopener noreferrer"&gt;code-guidance.md&lt;/a&gt;, or &lt;a href="https://github.com/for-alisia/delivery-flow/blob/feature-1-v1/artifacts/constitution.md" rel="noopener noreferrer"&gt;constitution.md&lt;/a&gt; — that does not mean they are truly enforced. &lt;strong&gt;Until a rule becomes part of an acceptance or review gate, it is still closer to a wish than to a constraint&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And this was exactly the point where I realized &lt;strong&gt;I needed to log not only the result, but the run itself&lt;/strong&gt;: what the setup was, which instructions were used, where drift started, which rules actually worked, and which ones only existed on paper.&lt;/p&gt;
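&lt;p&gt;To make the “rule becomes a gate” point concrete, here is a minimal sketch of an executable check. The rule itself (“no field injection”) and the substring matching are invented for illustration; a real gate would run in CI and use AST-level tooling such as ArchUnit or Checkstyle instead of string matching.&lt;/p&gt;

```java
// Illustrative sketch of turning a written rule into an executable gate.
// The rule ("no field injection") is an invented example, not a quote
// from the real constitution.md.
public class ConstitutionGate {

    // Returns true when the source passes the gate, i.e. the forbidden
    // pattern does not appear in it. A real gate would use AST analysis;
    // a substring check is enough to show the shape of the idea.
    static boolean passes(String source) {
        return !source.contains("@Autowired private");
    }

    public static void main(String[] args) {
        String offending = "class A { @Autowired private B b; }";
        String clean = "class A { private final B b; A(B b) { this.b = b; } }";
        System.out.println("offending passes the gate: " + passes(offending));
        System.out.println("clean passes the gate: " + passes(clean));
    }
}
```

&lt;p&gt;Once a rule has this form, violating it fails the run instead of quietly surviving review, which is exactly the difference between a wish and a constraint.&lt;/p&gt;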

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjdhij3xrtxve6ryk7hve.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjdhij3xrtxve6ryk7hve.png" alt=" " width="800" height="629"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is what I would now consider mandatory even for an early agentic flow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Requirement lock&lt;/strong&gt; — a clear record of what must not be lost along the way&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intermediate artifacts&lt;/strong&gt; between stages, not just final code&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;audit step&lt;/strong&gt; before final acceptance&lt;/li&gt;
&lt;li&gt;Quality gates that validate more than just compilation&lt;/li&gt;
&lt;li&gt;Explicit &lt;strong&gt;project context&lt;/strong&gt; stored in the repository&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An invariant layer&lt;/strong&gt;, like constitution.md, that is not tied to a specific model or stack and defines non-negotiable rules&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logs for every meaningful run&lt;/strong&gt;, so setups can actually be compared against each other&lt;/li&gt;
&lt;/ul&gt;
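&lt;p&gt;The “requirement lock” from the list above can be reduced to a very small executable idea: record the requirement IDs before implementation starts, then refuse to accept a final report that never mentions one of them. Everything below (the IDs, the report text, the method name) is hypothetical:&lt;/p&gt;

```java
// Illustrative sketch of a "requirement lock": requirement IDs are
// recorded up front, and the final report is rejected if any of them
// is never mentioned. IDs and report text are invented for the example.
public class RequirementLock {

    // Returns a comma-separated list of locked IDs the report never
    // mentions; an empty string means the lock holds.
    static String missing(String[] lockedIds, String report) {
        StringBuilder dropped = new StringBuilder();
        for (String id : lockedIds) {
            if (!report.contains(id)) {
                if (dropped.length() > 0) {
                    dropped.append(", ");
                }
                dropped.append(id);
            }
        }
        return dropped.toString();
    }

    public static void main(String[] args) {
        String[] locked = { "REQ-1", "REQ-2", "REQ-3" };
        String report = "Implemented REQ-1 and REQ-3; tests green.";
        // REQ-2 silently disappeared during the handoffs; the lock catches it.
        System.out.println("Dropped requirements: " + missing(locked, report));
        // prints: Dropped requirements: REQ-2
    }
}
```

&lt;p&gt;In a real flow the check would match acceptance criteria against intermediate artifacts, not just scan for ID strings, but the failure mode it targets is the same one described above: requirements quietly shrinking between stages.&lt;/p&gt;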

&lt;h3&gt;The main takeaway from this stage&lt;/h3&gt;

&lt;p&gt;My first useful conclusion was very simple:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Good orchestration is not about having many roles and pretty artifacts&lt;/strong&gt;. Good orchestration is about a system that does not lose requirements on the way and can hold quality on its own, instead of pushing all control back onto the human.&lt;/p&gt;

&lt;p&gt;That was the moment I started treating each new setup as a separate experiment: tracking the rules, the result, the points of degradation, and looking not only at the code, but at &lt;strong&gt;how exactly the system got there&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;What I decided to change in the second approach&lt;/h3&gt;

&lt;p&gt;After this setup, it became clear that the next step was not “add more instructions,” but &lt;strong&gt;strengthen the control loop&lt;/strong&gt; itself.&lt;/p&gt;

&lt;p&gt;So in the second approach, I moved toward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tighter requirement preservation between stages&lt;/li&gt;
&lt;li&gt;a dedicated review / audit step&lt;/li&gt;
&lt;li&gt;more explicit quality gates&lt;/li&gt;
&lt;li&gt;better observability of the run itself, so I could understand not only what came out, but also &lt;strong&gt;where exactly the system started losing quality&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That became the next important turning point — no longer just role orchestration, but an attempt to make the &lt;strong&gt;workflow itself reviewable and reproducible&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The experiment is open in the &lt;a href="https://github.com/for-alisia/delivery-flow" rel="noopener noreferrer"&gt;delivery-flow repository&lt;/a&gt;. If you want to dig deeper, you will find not only code there, but also the orchestration layer itself: instructions, artifacts, logs, and the evolution of the workflow. &lt;/p&gt;

&lt;p&gt;In the next article, I will show what I changed in the second setup — and what those changes actually improved in practice.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>softwareengineering</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
