<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Esteban S. Abait</title>
    <description>The latest articles on Forem by Esteban S. Abait (@eabait).</description>
    <link>https://forem.com/eabait</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F367241%2F6166e7b2-5b75-4c72-9bb8-d1d53d268d56.jpeg</url>
      <title>Forem: Esteban S. Abait</title>
      <link>https://forem.com/eabait</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/eabait"/>
    <language>en</language>
    <item>
      <title>The Four Modalities for Coding with Agents</title>
      <dc:creator>Esteban S. Abait</dc:creator>
      <pubDate>Wed, 28 Jan 2026 13:21:51 +0000</pubDate>
      <link>https://forem.com/eabait/the-four-modalities-for-coding-with-agents-4cdf</link>
      <guid>https://forem.com/eabait/the-four-modalities-for-coding-with-agents-4cdf</guid>
      <description>&lt;p&gt;It is 2026, and many software engineers around the world are realizing that coding agents are capable of generating high-quality outputs. Yet adopting these tools involves trade-offs. Teams vary in how much effort they invest up front in specifying design/requirements versus afterwards in reviewing and testing the AI’s output.&lt;/p&gt;

&lt;p&gt;Developing functional software with agents involves a three-stage process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Upfront specification&lt;/strong&gt;. Developers initiate the task by defining its goal, outlining a plan, and furnishing essential context. This might include providing agents with specific rules (e.g., via an &lt;code&gt;AGENT.MD&lt;/code&gt; file, &lt;a href="https://docs.cline.bot/prompting/cline-memory-bank" rel="noopener noreferrer"&gt;memory banks&lt;/a&gt;, and other Markdown files that fit within the agent’s context) and detailed instructions, which the agents will use as the foundation for implementation.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code generation&lt;/strong&gt;. Agents use the provided context, rules, knowledge banks, and plan to generate code that fulfills the objective set by the developer.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output revision&lt;/strong&gt;. In the final stage, the developer is responsible for reviewing the generated code, testing the software to ensure it runs correctly, and verifying that it functions as intended.&lt;/li&gt;
&lt;/ol&gt;
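&lt;p&gt;As a concrete illustration of the first stage, a rules file might look like the following. This is a minimal, hypothetical sketch; the project stack, paths, and commands are invented for illustration:&lt;/p&gt;

```markdown
# AGENT.MD (hypothetical example)

## Project context
- TypeScript monorepo; UI in `apps/web`, API in `apps/api`.

## Rules
- Follow the repository ESLint config; do not disable rules inline.
- Every new module needs unit tests under `__tests__/`.
- Never commit secrets; configuration comes from environment variables.

## Workflow
- Run `npm test` before proposing a change.
- Keep each change small enough to review in one sitting.
```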

&lt;p&gt;We can classify four distinct “modalities” of using coding agents, based on whether Upfront Specification Effort is low or high, and whether Output Revision Effort is low or high.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Output Revision Effort&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Upfront Specification Effort&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Vibe coding&lt;/strong&gt;. Loosely specified input prompt. The output is only functionally validated.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Guided Prototyping&lt;/strong&gt;. Loosely specified input prompt. The output is functionally validated and also its implementation details.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Autopilot Coding&lt;/strong&gt;. Uses coding agents to produce product increments with a high level of trust in the Agent outputs. The input is well specified in terms of functionality and technical design, but the output is loosely reviewed as it is considered of good quality.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Agentic Engineering&lt;/strong&gt;. Using Agents to generate code following a software engineering process. Each line or the majority of lines of generated code is reviewed. There is a testing strategy in place that ensures software fulfills its intended use.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Below, I describe each modality, its pros and cons, and recommendations for when to use it.&lt;/p&gt;

&lt;h1&gt;
  
  
  Vibe Coding
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://x.com/karpathy/status/1886192184808149383?s=20" rel="noopener noreferrer"&gt;&lt;em&gt;Vibe coding&lt;/em&gt; (term introduced by Andrej Karpathy)&lt;/a&gt; refers to using an AI coding assistant with minimal upfront specification and minimal code review. You “follow your vibes” by providing a loosely specified natural-language prompt for the feature or program you want, letting the AI generate the code, and then running it to see if it works. Crucially, you do &lt;strong&gt;not&lt;/strong&gt; meticulously review the code; you validate it only by testing its functionality (does the app run and do what you asked?). &lt;/p&gt;

&lt;p&gt;In Karpathy’s words, vibe coding means &lt;em&gt;“fully give in to the vibes, embrace exponentials, and forget that the code even exists”&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;The human acts more as a product manager or tester, focusing on describing goals and trying the software, rather than reading or structuring the code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pros
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Enables non-programmers to create software&lt;/strong&gt;. Even people with little coding experience can produce working applications by describing what they want in plain English. For example, an accountant or designer could use tools like Replit’s Ghostwriter or Cursor’s natural-language interface to build simple apps, whereas before they might be limited to Excel macros. In short, vibe coding democratizes software creation by making English “the hottest new programming language”.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Encourages rapid experimentation&lt;/strong&gt;. Because you’re not spending time on detailed specs or boilerplate coding, you can quickly try out new feature ideas or product prototypes. Product managers and UX designers can spin up proofs-of-concept with AI to demonstrate an idea in code rather than writing a PRD or drawing a Figma mockup (see &lt;a href="https://x.com/ryolu_/status/2011808864153280563?s=20" rel="noopener noreferrer"&gt;here&lt;/a&gt;). This fast, “just try it” approach aligns with the idea of &lt;a href="https://itrevolution.com/articles/vibe-coding-the-revolutionary-approach-transforming-software-development/#:~:text=with%20unfamiliar%20libraries%2C%20and%20writing,allowed%20for%20only%20one%20attempt" rel="noopener noreferrer"&gt;optionality&lt;/a&gt;, which involves exploring multiple solutions in parallel since the cost to attempt each is low.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lowest upfront cost and development time.&lt;/strong&gt; For a hobby project, hackathon, or pre-seed startup, vibe coding can deliver a minimum viable product extremely quickly and cheaply. Entire weekend projects or MVPs can be built in days or hours rather than weeks. This acceleration has already been observed in practice (e.g., Y Combinator’s CEO noted that in their Winter 2025 batch, 25% of startups had &lt;strong&gt;95% of their code generated by AI&lt;/strong&gt;, heralding that &lt;a href="https://x.com/garrytan/status/1897303270311489931" rel="noopener noreferrer"&gt;“the age of vibe coding is here”&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cons
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No control over code quality or architecture&lt;/strong&gt;. Since you rely on the AI’s outputs without deep inspection, the codebase can be inconsistent or poorly structured, accumulating significant technical debt. One developer’s 27-day AI-coding experiment found that as the project grew, similar functions ended up implemented differently across the codebase, and components lacked awareness of each other. This kind of hidden complexity becomes costly later.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Higher risk of bugs and security issues&lt;/strong&gt;. Lack of code review means vulnerabilities can slip through. There have been incidents of AI-generated code doing dangerous things unexpectedly. For example, an &lt;a href="https://x.com/jasonlk/status/1946066422477529487?s=20" rel="noopener noreferrer"&gt;AI coding assistant for one user deleted an entire database&lt;/a&gt; despite explicit instructions not to. Other vibe coder founders have been victims of their own &lt;a href="https://x.com/leojrr/status/1901560276488511759?s=20" rel="noopener noreferrer"&gt;unsecured creations&lt;/a&gt;, leading to “&lt;em&gt;maxed out usage on api keys, people bypassing the subscription, creating random shit on db&lt;/em&gt;”.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;“Works now, breaks later” maintenance challenges&lt;/strong&gt;. Because the human doesn’t fully understand the generated code, future modifications or debugging become daunting. As one engineer quipped, many vibe-coded apps hit a “maintenance wall” – everything looks great in a demo, but when a bug arises or a new feature is needed, the AI’s fixes often introduce new problems, since the AI has no memory of why it made certain architectural decisions. Without an engineer who groks the code, each fix can become a frustrating game of whack-a-mole.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Recommendation
&lt;/h2&gt;

&lt;p&gt;Vibe coding can be recommended for non-developers or novice coders who want to build small, non-critical applications. It allows people with domain knowledge (but not coding skills) to automate tasks or create tools they otherwise couldn’t; for instance, a finance analyst might vibe-code a custom report generator.&lt;br&gt;&lt;br&gt;
In industry, some product and design teams use this modality to prototype ideas and create working demos to hand off to engineering, instead of just static specs. It’s also useful for early-stage startups or solo developers trying to validate a product idea quickly on a shoestring budget.&lt;br&gt;&lt;br&gt;
In these scenarios, the lack of code quality is acceptable because if the idea proves valuable, the team can rewrite or heavily refactor the code later (and doing so is now cheaper with AI assistance anyway). However, teams should avoid shipping vibe-coded prototypes to production in any long-lived or critical system. As one GitLab principal engineer put it, &lt;a href="https://itrevolution.com/articles/vibe-coding-the-revolutionary-approach-transforming-software-development/#:~:text=%E2%80%9CNo%20vibe%20coding%20while%20I'm%20on%20call!%E2%80%9D,encapsulating%20the%20fierce%20debate%20dividing%20the%20software" rel="noopener noreferrer"&gt;“No vibe coding while I’m on call!”&lt;/a&gt;.&lt;br&gt;&lt;br&gt;
Use vibe coding to explore and validate ideas, but plan to invest in a proper engineering pass if the project needs to be maintained.&lt;/p&gt;

&lt;h1&gt;
  
  
  Guided Prototyping
&lt;/h1&gt;

&lt;p&gt;Developers, or even non-technical people, run a proof of concept to see how the AI would implement a feature, then review the output to judge whether the implementation makes sense and is suitable to build on.&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;Guided Prototyping&lt;/em&gt;, the developer still provides a relatively loose natural-language prompt or goal to the coding agent (similar to vibe coding), but &lt;strong&gt;with an important twist&lt;/strong&gt;: the developer thoroughly &lt;strong&gt;reviews the implementation details&lt;/strong&gt; of the AI’s output and verifies that the approach makes sense. &lt;/p&gt;

&lt;p&gt;Essentially, the AI quickly produces a &lt;em&gt;proof-of-concept implementation&lt;/em&gt;, and a human then inspects that code (and likely runs it) to evaluate its correctness, quality, and suitability. This modality is akin to what Gene Kim and Steve Yegge (in their &lt;a href="https://itrevolution.com/product/vibe-coding-book/" rel="noopener noreferrer"&gt;&lt;em&gt;Vibe Coding&lt;/em&gt; book&lt;/a&gt;) refer to as &lt;em&gt;“tracer bullet testing.”&lt;/em&gt; In software, a tracer bullet is a &lt;strong&gt;thin, working end-to-end slice&lt;/strong&gt; of the system that touches all critical components (UI, backend, database, APIs) but only implements the bare basics. It’s a skeletal but functional version of the feature that shows the architecture will work. &lt;/p&gt;

&lt;p&gt;Guided prototyping uses AI to fire off one or more of these tracer bullets: the AI writes a minimal implementation across the stack, and the developer examines this &lt;em&gt;tracer code&lt;/em&gt; to see if it’s on the right track.&lt;/p&gt;

&lt;p&gt;By reviewing and possibly refining the AI’s code, the engineer ensures the prototype’s &lt;strong&gt;implementation&lt;/strong&gt; (not just functionality) meets their expectations. This might involve checking that coding best practices were followed, that the approach aligns with the intended architecture, or that the code is extensible. Essentially, guided prototyping uses the AI as a rapid coder to generate prototypes, but keeps a human in the loop for technical validation (as opposed to vibe coding’s pure “just run it” approach).&lt;/p&gt;
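&lt;p&gt;To make the idea concrete, here is a minimal sketch of what a tracer-bullet slice might look like. It is hypothetical (the notes domain, function names, and SQLite storage are invented for illustration), but it shows the shape: a storage layer, a bit of business logic, and a retrieval path, all touched end to end while implementing only the bare basics:&lt;/p&gt;

```python
# Hypothetical tracer-bullet slice: one thin path through storage,
# business logic, and retrieval, proving the architecture works before
# the feature is fleshed out.
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    # Database layer: the one table the slice needs, nothing more.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT)"
    )


def create_note(conn: sqlite3.Connection, body: str) -> int:
    # Business-logic layer: minimal validation only (breadth over depth).
    if not body.strip():
        raise ValueError("note body must not be empty")
    cur = conn.execute("INSERT INTO notes (body) VALUES (?)", (body,))
    conn.commit()
    return cur.lastrowid


def get_note(conn: sqlite3.Connection, note_id: int) -> str:
    # Retrieval layer: fetch and return; no pagination, no auth yet.
    row = conn.execute(
        "SELECT body FROM notes WHERE id = ?", (note_id,)
    ).fetchone()
    if row is None:
        raise KeyError(note_id)
    return row[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    init_db(conn)
    note_id = create_note(conn, "hello tracer bullet")
    print(get_note(conn, note_id))  # prints "hello tracer bullet"
```

&lt;p&gt;The reviewer’s job in guided prototyping is to inspect exactly this kind of slice: does the layering make sense, is the data model sound, and can the skeleton be fleshed out into the real feature?&lt;/p&gt;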

&lt;h2&gt;
  
  
  Pros
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fast feedback on architecture &amp;amp; feasibility.&lt;/strong&gt; Because tracer-bullet prototypes are working end-to-end, stakeholders get immediate visibility into a feature’s implementation. You can validate early whether the chosen tech stack, APIs, and integrations will actually support the requirements. This reduces risk by exposing architectural or interoperability problems &lt;strong&gt;earlier&lt;/strong&gt; rather than after weeks of development.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mitigates risk through early testing.&lt;/strong&gt; Instead of building parts in isolation, the thin vertical slice will reveal if any critical component (database, external API, etc.) is going to be a blocker. This “system smoke test” catches high-risk issues when they’re easier (and cheaper) to fix. In other words, you prove the &lt;em&gt;path&lt;/em&gt; works before investing heavily.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Team alignment and technical calibration.&lt;/strong&gt; Having a working end-to-end prototype ensures that frontend and backend developers, QA, etc., share a common understanding of how the pieces connect. It’s a concrete reference implementation to discuss. Everyone sees the same thin slice, which promotes &lt;strong&gt;synchronization&lt;/strong&gt; on design decisions early (avoiding big surprises later).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rapid course-correction and learning.&lt;/strong&gt; If the tracer implementation is off-target, you find out quickly and can adjust the design or requirements with minimal wasted effort. The team can iterate on the prototype or try an alternate approach. In fact, with AI generation being so fast, you could have &lt;strong&gt;multiple prototype implementations&lt;/strong&gt; of a complex feature created in parallel, exploring different approaches. This is an extension of the “optionality” benefit of AI – you might let the agent build two or three variant solutions and then pick the best one after review.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tracer code often becomes the foundation.&lt;/strong&gt; Unlike throwaway prototypes, tracer bullet code is meant to evolve into the final product. So the effort isn’t wasted; the initial AI-guided slice provides a base to gradually flesh out with full functionality, with confidence that the foundations are solid.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cons
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Can create a false sense of completeness:&lt;/strong&gt; Since it touches all components, the "thin slice" might be mistaken for a production-ready feature.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Requires high alignment on architecture first:&lt;/strong&gt; If the underlying architecture is flawed, the tracer bullet will expose it, but potentially after significant effort has been invested.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overhead for non-trivial systems:&lt;/strong&gt; Setting up a fully functional end-to-end slice, even a thin one, can be time-consuming for very complex systems with many dependencies.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Focus on breadth over depth:&lt;/strong&gt; The implementation might lack the robustness, error handling, and performance tuning required for production code.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Recommendation
&lt;/h2&gt;

&lt;p&gt;I would recommend using Guided Prototyping when the team faces uncertainty in the implementation and you want to de-risk those unknowns early.&lt;br&gt;&lt;br&gt;
It’s essentially the AI-era equivalent of a technical spike or proof-of-concept, but more formalized and kept runnable. Teams can leverage coding agents to spin up multiple alternative tracer bullet implementations in parallel, then compare which approach is best.&lt;br&gt;&lt;br&gt;
This modality is valuable for projects where the architecture is not proven, or there’s debate about the best design: you can test the waters with minimal cost. Afterward, the knowledge gained should inform the real implementation.&lt;br&gt;&lt;br&gt;
In practice, organizations that already embrace Agile “spikes” or proofs-of-concept will find guided prototyping with AI a natural fit. It’s a way to harness AI’s speed to get concrete answers early in the development cycle, improving decision-making and reducing costly late-stage changes. &lt;/p&gt;

&lt;h1&gt;
  
  
  Autopilot Coding
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Autopilot Coding&lt;/em&gt; is a modality where the team (or often a solo developer) provides a &lt;strong&gt;well-specified prompt and context&lt;/strong&gt; to the coding agent: the desired functionality, and even the high-level technical design, are clearly described, and the agent is then left to generate a substantial chunk of the codebase &lt;em&gt;with minimal human intervention or review&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;In other words, the developers put the coding agent on “autopilot,” trusting it to produce production-quality code from a solid spec, and they do only a cursory review (if any) of the output. This approach has emerged in some cutting-edge small teams and individual projects, and it’s being pushed to extremes in research experiments.&lt;/p&gt;

&lt;p&gt;In practice, autopilot coding might look like “outsourcing the development” to the agent by providing a detailed functional spec (and perhaps a software design document) and asking it to implement the whole thing (or a large portion of it) while you monitor progress occasionally. &lt;/p&gt;

&lt;p&gt;It assumes &lt;strong&gt;high confidence in the AI’s capabilities&lt;/strong&gt; and output quality. In fact, the team behind Cursor (an AI-augmented IDE) recently did an &lt;a href="https://cursor.com/blog/scaling-agents" rel="noopener noreferrer"&gt;experiment&lt;/a&gt; very much like this: they ran hundreds of AI agents for a week &lt;em&gt;building a browser from scratch&lt;/em&gt;, which resulted in over &lt;strong&gt;1 million lines of code&lt;/strong&gt; across 1,000 files. Simon Willison has also &lt;a href="https://simonwillison.net/2026/Jan/19/scaling-long-running-autonomous-coding" rel="noopener noreferrer"&gt;written&lt;/a&gt; about the Cursor experiment.&lt;/p&gt;

&lt;p&gt;I believe this approach has a future as long as the software to be built is very well specified and can be verified in some way by the agent. If you want to build a web browser, for example, your agents can implement against the published specifications for HTML, CSS, and JavaScript. &lt;/p&gt;

&lt;p&gt;This modality is unorthodox, but it’s attractive to some because it promises &lt;em&gt;unprecedented development speed&lt;/em&gt;. If one agent can write code 10× faster than a human, what about 10 agents working in parallel? Autopilot coding is essentially the “move fast” approach: &lt;strong&gt;spec well, then let the AI rip&lt;/strong&gt;, and only later worry about fixes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pros
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Potential order-of-magnitude productivity gains.&lt;/strong&gt; Advocates claim that with the right prompts and context, AI agents can generate features or even entire products extremely quickly. There are &lt;a href="https://x.com/rauchg/status/2014372677667205606?s=20" rel="noopener noreferrer"&gt;anecdotal reports&lt;/a&gt; of individual developers becoming “100×” or even “1000×” engineers by using AI coding assistants end-to-end.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scales beyond a single agent (or human).&lt;/strong&gt; Autopilot coding can involve &lt;em&gt;multiple AI agents working in parallel&lt;/em&gt;, further boosting throughput. Tools like &lt;a href="https://www.conductor.build/" rel="noopener noreferrer"&gt;Conductor&lt;/a&gt; might facilitate this task. Also, &lt;a href="https://linear.app/agents" rel="noopener noreferrer"&gt;Linear&lt;/a&gt; started to offer a new modality to run agents as independent coworkers.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extreme velocity for well-defined problems.&lt;/strong&gt; If the problem is very well-specified (e.g., implementing a known standard or algorithm), coding agents can churn out a correct solution quickly. For self-contained, verifiable (there is an existing test suite that can validate behavior), specification-driven projects, autopilot coding essentially lets you “skip ahead” in time.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cons
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unproven and high-risk.&lt;/strong&gt; This methodology is very new and &lt;strong&gt;largely experimental&lt;/strong&gt;. There are few real-world success stories of an AI-developed codebase at scale that didn’t require major human fixes. As the &lt;a href="https://dora.dev/research/2025/dora-report/" rel="noopener noreferrer"&gt;2025 DORA research&lt;/a&gt; noted, AI tends to &lt;em&gt;amplify&lt;/em&gt; whatever setup you have: it can boost high-performing teams but also &lt;strong&gt;magnify dysfunctions in struggling teams&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Uncertainty, lack of control, and security issues.&lt;/strong&gt; When you trust the agent output without thorough review, you inherently accept a lot of uncertainty. The code may run initially, but hidden bugs or suboptimal choices can lurk beneath the surface. Security vulnerabilities, inefficient algorithms, or simply incorrect edge-case handling might only surface later (possibly in production). Indeed, &lt;a href="https://www.faros.ai/blog/key-takeaways-from-the-dora-report-2025" rel="noopener noreferrer"&gt;data from early adopters&lt;/a&gt; shows that teams using a lot of AI without adapting their process have seen &lt;strong&gt;bug rates increase (by ~9%) and longer review times, with no overall improvement in delivery speed&lt;/strong&gt;. These teams were able to speed up coding, but just created a bottleneck elsewhere or quality issues that neutralized the gains.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accumulated technical and comprehension debt due to “black box” code.&lt;/strong&gt; A codebase produced largely by AI, without human oversight, can become difficult for developers to understand or maintain. This opacity &lt;strong&gt;is a new kind of tech debt&lt;/strong&gt;: the &lt;em&gt;AI’s design&lt;/em&gt; might be suboptimal, but changing it later is extremely costly when no one holds a mental model of the code. Some practitioners have called this effect “&lt;a href="https://codemanship.wordpress.com/2025/09/30/comprehension-debt-the-ticking-time-bomb-of-llm-generated-code/" rel="noopener noreferrer"&gt;comprehension debt&lt;/a&gt;”.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loss of competitiveness&lt;/strong&gt;. If an organization’s competitive advantage is its software, treating the code as a mysterious artifact created by AI (rather than something the team deeply understands) is dangerous. The company’s intellectual property isn’t just the final code output, but the knowledge of why it’s built that way. By letting AI write everything, you risk turning your own codebase into a foreign legacy system from day one.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bugs and incidents can be harder to resolve.&lt;/strong&gt; In autopilot mode, you might encounter the nightmare scenario described earlier in vibe coding, but at a larger scale: something breaks in production and now &lt;em&gt;no engineer is intimately familiar with that part of the system&lt;/em&gt;. Debugging is much harder when you have to ask an AI to explain code it wrote (and that AI might not even have the full context anymore).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Requires exceptionally clear specifications.&lt;/strong&gt; To even attempt autopilot coding, you must provide very detailed, precise requirements and technical guidelines to the agent(s). Some software components we use every day are heavily documented, and they must also comply with strict specifications. In those cases, this approach might work. For the rest of the software landscape, building such detailed upfront specifications has been proven problematic (remember the problems associated with waterfall processes?).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Recommendation
&lt;/h2&gt;

&lt;p&gt;I think this modality is best suited for developers embarking on &lt;strong&gt;greenfield projects&lt;/strong&gt;, where they are building something entirely new and unconstrained by legacy code. It should also be viable for developers tackling &lt;strong&gt;small-scale brownfield projects&lt;/strong&gt; that they are intimately familiar with, where the existing codebase is manageable and well-understood.&lt;/p&gt;

&lt;p&gt;However, a critical caveat must be applied: &lt;strong&gt;this approach is not recommended for mission-critical projects&lt;/strong&gt;. Relying heavily on agents in such high-stakes environments significantly increases the risk of unforeseen bugs, complex technical debt, and system failures, the kind that might trigger your pager at the most inconvenient hour, such as 3 AM. The cost of a failure in a mission-critical system far outweighs the productivity gains.&lt;/p&gt;

&lt;p&gt;For &lt;strong&gt;solo developers and small teams&lt;/strong&gt;, this agent-centric coding methodology can deliver &lt;strong&gt;good initial outcomes&lt;/strong&gt; and accelerate the early development phase of a project. The rapid generation of functional code provides a strong starting momentum. Nevertheless, this advantage is often counterbalanced by substantial risks as the project inevitably grows in size and complexity. The primary danger lies in the unreviewed technical debt and structural inconsistencies that agents can inadvertently introduce. This "unreviewed" code starts to &lt;strong&gt;creep over the codebase&lt;/strong&gt;, acting like a slow poison.&lt;/p&gt;

&lt;p&gt;This process significantly increases the &lt;a href="https://en.wikipedia.org/wiki/Software_rot" rel="noopener noreferrer"&gt;&lt;strong&gt;code entropy&lt;/strong&gt;&lt;/a&gt;—the measure of disorder and degradation in a codebase. High code entropy has a dual negative effect:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Human Comprehension and Maintainability:&lt;/strong&gt; It drastically affects a human developer's ability to understand, navigate, and therefore maintain the code. Complex, poorly structured, or inconsistent code slows down debugging and feature development.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Performance:&lt;/strong&gt; Ironically, the agents themselves begin to suffer. As the codebase becomes more chaotic and less coherent, the coding agent's own ability to accurately understand the existing context and generate correct, high-quality new code diminishes. The agent is effectively generating code based on a deteriorating foundation.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Furthermore, if you are engaged in building &lt;strong&gt;something novel and innovative&lt;/strong&gt;, the resulting code constitutes your &lt;strong&gt;Intellectual Property (IP)&lt;/strong&gt; and is a core business asset.&lt;br&gt;&lt;br&gt;
In this scenario, &lt;strong&gt;you must maintain a deep, granular understanding of how the code works, including all the architectural decisions and underlying trade-offs&lt;/strong&gt;. This deep knowledge is essential for strategic reasons: it allows you to &lt;strong&gt;respond quickly and effectively to market feedback, competitive pressures, and unexpected technical challenges&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
By &lt;strong&gt;outsourcing this foundational knowledge and control to coding agents&lt;/strong&gt;, the business inherently places itself at significant risk. The loss of direct, human-expert IP ownership can compromise the ability to innovate and adapt, fundamentally putting the future of the product and the business itself in jeopardy.&lt;/p&gt;

&lt;h1&gt;
  
  
  Agentic Engineering
&lt;/h1&gt;

&lt;p&gt;In this modality, teams have adopted not only coding agents but AI throughout the whole SDLC. They invest in context engineering so that, when asked to create a plan, the coding agent has access to internal documents and specifications that steer the generated implementation toward the organization’s best practices: coding rules, documented architecture, documented design patterns, and best practices are all available for agents to use. When the developer generates a plan, it is usually very detailed in terms of functional and non-functional requirements, and the chunk of work it specifies is kept as small as possible so that the generated implementation remains reviewable by a human (&lt;a href="https://dora.dev/research/2025/dora-report/" rel="noopener noreferrer"&gt;as the 2025 DORA report shows&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Teams also leverage the coding agent to review code, or deploy &lt;a href="https://www.coderabbit.ai/" rel="noopener noreferrer"&gt;specific reviewer agents&lt;/a&gt;; once the PR is created, human reviewers give the final approval. For this approach to work and be production-ready, teams implement several guardrails: rigorous test automation, high test coverage, TDD, CI/CD, and linting rules, among other practices.&lt;/p&gt;

&lt;p&gt;Another recommendation from the &lt;a href="https://dora.dev/research/2025/dora-report/" rel="noopener noreferrer"&gt;DORA report&lt;/a&gt; is &lt;a href="https://en.wikipedia.org/wiki/Value-stream_mapping" rel="noopener noreferrer"&gt;Value Stream Management (VSM)&lt;/a&gt;: the practice of visualizing, analyzing, and improving the flow of work from idea to customer. VSM helps organizations track how AI affects lead time, rework, and deployment frequency (&lt;a href="https://dev.to/eabait/how-ai-coding-agents-are-reshaping-developer-workflows-3249"&gt;more here&lt;/a&gt;).&lt;/p&gt;
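&lt;p&gt;Guardrails like these are typically enforced in CI. The following is a hypothetical sketch (the workflow name, &lt;code&gt;make&lt;/code&gt; targets, and coverage threshold are invented; any CI system would work) of gating every PR, AI-generated or not, on linting and a coverage floor before human review:&lt;/p&gt;

```yaml
# Hypothetical GitHub Actions workflow: every pull request must pass
# lint and a test-coverage gate before human reviewers approve it.
name: pr-guardrails
on: pull_request
jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Lint
        run: make lint                     # assumed project lint target
      - name: Tests with coverage gate
        run: make test COVERAGE_MIN=90     # assumed target and threshold
```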

&lt;p&gt;Essentially, it’s traditional software engineering supercharged with AI, rather than letting AI run wild.&lt;/p&gt;

&lt;p&gt;Key characteristics of agentic engineering include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extensive context and prompt engineering.&lt;/strong&gt; Teams invest heavily in providing the AI with the right context: up-to-date internal documentation, architectural guidelines, coding standards, and even &lt;em&gt;organization-specific knowledge bases&lt;/em&gt;. The agent is not coding in a vacuum; it’s informed by the company’s best practices and the project’s design specs.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Small, incremental tasks.&lt;/strong&gt; Rather than asking the agent to build an entire feature in one go, work is broken into small chunks (perhaps a few lines to a few dozen lines of code) that fit within the AI’s attention span and can be reviewed easily. The &lt;a href="https://dora.dev/research/2025/dora-report/" rel="noopener noreferrer"&gt;&lt;strong&gt;2025 DORA report&lt;/strong&gt;&lt;/a&gt; found that working in &lt;strong&gt;small batches&lt;/strong&gt; is still crucial even with AI. Teams that maintained incremental change discipline reaped more benefits, whereas &lt;a href="https://www.faros.ai/blog/key-takeaways-from-the-dora-report-2025" rel="noopener noreferrer"&gt;AI tended to increase PR sizes by 154%&lt;/a&gt; when unmanaged.
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AI-assisted code review and testing.&lt;/strong&gt; In agentic engineering, AI doesn’t just write code but also helps &lt;strong&gt;review&lt;/strong&gt; code. For instance, after an AI generates code, a separate AI agent (or the same agent with a different prompt) might statically analyze that code, suggest improvements, or point out potential bugs. The human developer then reviews both the code and the AI’s review comments. Lately, there has been a &lt;a href="https://www.greptile.com/blog/ai-code-review-bubble" rel="noopener noreferrer"&gt;lot of innovation in this space&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Human oversight and final approval.&lt;/strong&gt; Unlike autopilot coding, here &lt;em&gt;every line of code is ultimately reviewed by a human&lt;/em&gt; (or at least the vast majority of lines, with trivial changes possibly an exception). This ensures that knowledge of the code is internalized by the team and nothing unintelligible slips in. As &lt;a href="https://itrevolution.com/product/vibe-coding-book/" rel="noopener noreferrer"&gt;Kim and Yegge’s book&lt;/a&gt; emphasizes, “delegation of implementation doesn’t mean delegation of responsibility”.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Comprehensive guardrails in the SDLC.&lt;/strong&gt; Agentic engineering often requires that the organization have strong engineering practices already in place: continuous integration/deployment, high test coverage (often &amp;gt;90%), linting and static analysis, security scans, etc. These guardrails catch mistakes, whether made by humans or AI. In fact, &lt;a href="https://dora.dev/research/2025/dora-report/" rel="noopener noreferrer"&gt;DORA’s 2025 research&lt;/a&gt; found that &lt;a href="https://www.faros.ai/blog/key-takeaways-from-the-dora-report-2025#:~:text=Both%20studies%20converge%20on%20a,not%20a%20universal%20productivity%20booster" rel="noopener noreferrer"&gt;AI acts as an &lt;strong&gt;amplifier&lt;/strong&gt;&lt;/a&gt; where good practices yield even better results with AI, and poor practices just get amplified into bigger problems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Value Stream Management (VSM) and metrics.&lt;/strong&gt; Because of the propensity of AI to shift bottlenecks, top teams use &lt;a href="https://en.wikipedia.org/wiki/Value-stream_mapping" rel="noopener noreferrer"&gt;VSM&lt;/a&gt; to track the flow from idea to production. The &lt;a href="https://dora.dev/research/2025/dora-report/" rel="noopener noreferrer"&gt;DORA 2025 report&lt;/a&gt; specifically highlights VSM as critical to turn individual AI productivity into organizational performance.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
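&lt;p&gt;As a concrete illustration, a rules file of the kind described above might look like the following. This is a hypothetical sketch: the filename, section names, paths, and thresholds are invented and vary by tool and organization.&lt;/p&gt;

```markdown
# AGENT.MD: project rules for coding agents (hypothetical example)

## Architecture
- This service follows a hexagonal architecture; new code belongs in
  `domain/`, `ports/`, or `adapters/` (paths are illustrative).

## Coding standards
- TypeScript strict mode; avoid `any`.
- Every new module ships with unit tests; target coverage is 90%+.

## Workflow
- Propose a plan and wait for approval before writing code.
- Keep each change small (roughly one reviewable PR's worth of work).
- Run the linter and the full test suite before opening a PR.
```

&lt;p&gt;Files like this act as standing context: the agent reads them at the start of every task, so organizational conventions do not have to be restated in each prompt.&lt;/p&gt;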

&lt;h2&gt;
  
  
  Pros
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Significant productivity boost without sacrificing quality.&lt;/strong&gt; Organizations practicing this modality report substantial improvements in throughput while maintaining or even improving code quality. For instance, Booking.com found that after introducing specialized AI coding agents (within a robust dev process), &lt;a href="https://itrevolution.com/articles/vibe-coding-the-revolutionary-approach-transforming-software-development/#:~:text=Meanwhile%2C%20at%20Booking,and%20reduced%20time%20to%20delivery" rel="noopener noreferrer"&gt;they achieved a ~30% productivity gain&lt;/a&gt;, &lt;em&gt;lighter code reviews&lt;/em&gt;, and faster deliveries. Unlike vibe coding, where speed comes at the cost of quality, agentic engineering strives for &lt;strong&gt;both&lt;/strong&gt; speed and quality by catching issues early and often.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human oversight ensures maintainability and shared knowledge.&lt;/strong&gt; By having developers review AI contributions, the team remains in control of the architecture and understands the codebase. This avoids the “black box code” problem, ensuring that knowledge of how the system works is not lost to the AI.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Faster iteration from plan to working code.&lt;/strong&gt; When AI can handle the boilerplate and rote coding tasks, developers spend more time on higher-level design and polishing.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Focus on higher-value work for humans.&lt;/strong&gt; Since the AI is churning out the basic code, human developers can focus on what humans do best: making judgment calls on architecture, tackling particularly tricky algorithmic or edge-case problems, and handling creative design tasks.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Robust, high-quality codebase.&lt;/strong&gt; With strong guardrails (CI, tests, linting) and careful review in place, the final code that reaches production can actually be &lt;strong&gt;more robust&lt;/strong&gt; than before, because the team can afford to enforce stricter quality standards.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cons
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Requires process maturity and upfront investment.&lt;/strong&gt; Not every organization is ready to implement this modality. If your CI/CD is flaky, tests are lacking, or your documentation is poor, trying to add AI into the mix can backfire. &lt;a href="https://dora.dev/research/2025/dora-report/" rel="noopener noreferrer"&gt;DORA’s research&lt;/a&gt; indicates that &lt;em&gt;only the most mature teams currently see strong benefits from AI&lt;/em&gt;, whereas many teams see little to no improvement because their bottlenecks simply move elsewhere.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slower than pure AI in the short run.&lt;/strong&gt; By insisting on human review and small batches, you naturally throttle the AI’s raw speed. For startups in very early stages, this overhead might feel stifling.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Needs high-quality prompts and context engineering.&lt;/strong&gt; The effectiveness of the coding agents depends heavily on the quality of the input they’re given. A &lt;em&gt;poorly guided&lt;/em&gt; AI can waste time, requiring multiple redos, negating the efficiency gains.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintaining developer skills and engagement.&lt;/strong&gt; If the AI handles most of the routine coding, developers might lose practice in those areas. Some have expressed concern that &lt;a href="https://ehandbook.com/software-developers-are-literally-losing-their-minds-to-ai-76af358971b6" rel="noopener noreferrer"&gt;over-reliance on AI for low-level coding could, over time, erode engineers’ ability to code without AI or to dive deep into debugging complex issues&lt;/a&gt;. There is also a cultural shift: developers have to embrace a more meta-level role (like a conductor), which not everyone may enjoy or excel at. There can be initial resistance (“Am I just here to babysit the AI?”). Engineering leaders need to manage this change and ensure team members still feel ownership and pride in their work.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;New failure modes and complexity.&lt;/strong&gt; Introducing AI into the workflow creates new ways things can go wrong: a security hole that all reviewers miss because it &lt;em&gt;looks&lt;/em&gt; fine, or a suggested refactor that breaks something non-obvious. Debugging an issue in partially AI-generated code can be tricky if the code is written in an unfamiliar style. Moreover, coordinating multiple AI agents (for coding, reviewing, etc.) adds complexity of its own; it’s like adding new team members who work blazingly fast but need training. The processes for managing AI output (like deciding when to trust it vs. override it) are still evolving.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Recommendation
&lt;/h2&gt;

&lt;p&gt;Agentic engineering is emerging as a &lt;strong&gt;suitable modality for professional software teams&lt;/strong&gt; that need to balance velocity with reliability. This is especially true for enterprise environments, organizations building long-term proprietary systems, and any product where understanding and maintaining the code is critical. &lt;/p&gt;

&lt;p&gt;If your software is core IP for your business (your competitive advantage), you can’t afford to treat it as throwaway; you need to deeply understand it, which means humans in the loop. &lt;/p&gt;

&lt;p&gt;This approach is well-suited for teams in &lt;strong&gt;regulated industries or mission-critical domains&lt;/strong&gt; (finance, healthcare, etc.) where there’s zero tolerance for unchecked code. It’s also appropriate for &lt;em&gt;large codebases&lt;/em&gt; and legacy systems: places where uncontrolled AI codegen could wreak havoc, but where guided AI use could significantly improve developer productivity (for example, using an agent to safely refactor a legacy module under close tests and review).&lt;/p&gt;

&lt;p&gt;To implement this modality, organizations should &lt;strong&gt;lay the groundwork&lt;/strong&gt;: invest in automated testing (high coverage), continuous delivery, and developer tooling. They should also train the team in how to work with AI (the “head chef mindset” of orchestrating AI assistants). The &lt;a href="https://dora.dev/research/2025/dora-report/" rel="noopener noreferrer"&gt;2025 DORA report&lt;/a&gt; suggests adopting &lt;strong&gt;“clear AI policies”&lt;/strong&gt; and providing training and playbooks for AI usage. Everyone on the team should know how and when to use the coding agent, what the review protocols are, and what the acceptance criteria for AI-generated code are. Essentially, treat the AI as an ultra-fast junior developer who needs mentorship and strict code review.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Coding agents are a reality in 2026, and teams are widely adopting them. This article defines four distinct modalities for adopting these agents, each with its own pros, cons, and recommended use case.&lt;/p&gt;

&lt;p&gt;It's important to note that a single team can adopt multiple modalities across different phases of their Software Development Life Cycle (SDLC). For instance, teams using Agentic Engineering might also employ:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vibe Coding&lt;/strong&gt; for developing small utilities or dashboards.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guided Prototypes&lt;/strong&gt; to evaluate the trade-offs of a feature before starting analysis.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autopilot Coding&lt;/strong&gt; for running controlled migrations in specific cases.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The greatest value from this technology will be realized by teams that effectively integrate it to solve business problems while carefully weighing its long-term benefits against its potential challenges.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>coding</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Beyond Code Generation: LLMs for Code Understanding</title>
      <dc:creator>Esteban S. Abait</dc:creator>
      <pubDate>Fri, 02 Jan 2026 12:40:06 +0000</pubDate>
      <link>https://forem.com/eabait/beyond-code-generation-llms-for-code-understanding-3ldn</link>
      <guid>https://forem.com/eabait/beyond-code-generation-llms-for-code-understanding-3ldn</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; &lt;br&gt;
Engineers spend more time understanding code than writing it.&lt;br&gt;&lt;br&gt;
LLM-based tools help, but in different ways.&lt;br&gt;&lt;br&gt;
This article compares modern AI tools for code understanding and explains when to use which one, based on cognitive bottlenecks rather than features.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For most software engineers, the dominant cost of development is not writing new code but &lt;em&gt;understanding existing systems&lt;/em&gt;: navigating large codebases, reconstructing intent, tracing behavior across layers, and assessing the impact of change. &lt;/p&gt;

&lt;p&gt;Empirical studies consistently show that developers spend a substantial portion of their time on comprehension, information seeking, and coordination rather than coding itself, based on large-scale field studies of professional developers’ daily activities and measured comprehension tasks (&lt;a href="https://ieeexplore.ieee.org/document/7829407" rel="noopener noreferrer"&gt;here&lt;/a&gt;, &lt;a href="https://ieeexplore.ieee.org/document/7997917" rel="noopener noreferrer"&gt;and here&lt;/a&gt;). &lt;a href="https://ieeexplore.ieee.org/document/5332232" rel="noopener noreferrer"&gt;Program-comprehension research&lt;/a&gt; further demonstrates that understanding code imposes significant cognitive load and is influenced by factors such as vocabulary, structure, and “naturalness”. &lt;/p&gt;

&lt;p&gt;That is why new tools for understanding large codebases matter so much, especially at big enterprises and organizations with large portfolios of applications and services. Modern enterprise environments present unique challenges: they operate multi-decade legacy systems, polyglot architectures, security and compliance constraints, and large-scale team structures that demand far more rigor than consumer-grade AI assistants provide.&lt;/p&gt;

&lt;p&gt;In 2025, we have seen a rise in the popularity of tools using LLMs to explain how an application's source code works or to auto-document it. Unlike traditional static analysis, which relies on explicit rules, models, and predefined abstractions, LLM-based analysis leverages learned representations to interpret code semantics and intent. This allows context-aware code analysis (e.g., detecting insecure patterns or summarizing complex logic) with a flexibility beyond hard-coded rules.&lt;/p&gt;

&lt;p&gt;This article explores why LLMs and agent-based tools are a natural fit for program comprehension—and where they still fall short. Based on my own research and hands-on experimentation, I compare a selected set of commercial and open-source tools using a qualitative lens. Rather than reviewing outputs line by line, the goal is to clarify each tool’s strengths and trade-offs and offer practical guidance on when to use which tool, depending on the stage of understanding and the developer’s workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advantages of using LLMs and agents for code understanding
&lt;/h2&gt;

&lt;p&gt;LLMs are particularly effective at reducing the friction of &lt;em&gt;initial comprehension&lt;/em&gt;: summarizing functions and modules, explaining unfamiliar APIs or idioms, and translating low-level implementation details into higher-level intent. &lt;/p&gt;

&lt;p&gt;When combined with agents, their impact extends beyond isolated snippets to repository-scale understanding. Agentic workflows characterized by iterative planning, tool use, retrieval, and validation enable incremental, multi-step exploration of codebases, rather than single-pass analysis. &lt;/p&gt;

&lt;p&gt;Research and early industrial practice show that &lt;strong&gt;structure-aware context&lt;/strong&gt; (e.g., symbols, call graphs, dependencies, and history) &lt;strong&gt;significantly improves the relevance and usefulness of explanations compared to flat context windows,&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/2402.03630" rel="noopener noreferrer"&gt;as demonstrated in repository-level and IDE-integrated code understanding studies&lt;/a&gt;. These capabilities are especially valuable for onboarding, legacy modernization, bug localization, and code migrations, where the primary challenge is knowing &lt;em&gt;where to look&lt;/em&gt; rather than &lt;em&gt;what to type&lt;/em&gt;.&lt;/p&gt;
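&lt;p&gt;To make “structure-aware context” more concrete, the sketch below extracts a simple call graph, one of the structures such a retrieval layer could feed to an agent. It is a minimal illustration using only the Python standard library; the sample functions are invented.&lt;/p&gt;

```python
import ast
from collections import defaultdict

def build_call_graph(source: str) -> dict[str, set[str]]:
    """Map each function to the names it calls (plain-name calls only)."""
    tree = ast.parse(source)
    graph = defaultdict(set)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            for call in ast.walk(node):
                # Only resolve plain-name calls; attribute calls like
                # obj.method() would need real symbol resolution.
                if isinstance(call, ast.Call) and isinstance(call.func, ast.Name):
                    graph[node.name].add(call.func.id)
    return dict(graph)

code = """
def load(path):
    return open(path).read()

def summarize(path):
    text = load(path)
    return text[:100]
"""
print(build_call_graph(code))
```

&lt;p&gt;A flat context window hands the model raw text; a graph like this lets a tool answer “who calls &lt;code&gt;load&lt;/code&gt;?” before deciding which files to show the model.&lt;/p&gt;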

&lt;h2&gt;
  
  
  Disadvantages and limitations
&lt;/h2&gt;

&lt;p&gt;Despite their fluency, LLMs do not reliably demonstrate deep semantic understanding of code. Several empirical studies show that models often rely on surface-level lexical or syntactic cues and can fail under small, semantics-preserving transformations, &lt;a href="https://arxiv.org/html/2504.04372v2" rel="noopener noreferrer"&gt;a limitation demonstrated in semantic fault localization and robustness evaluations of code-focused models&lt;/a&gt;. This creates a risk of false confidence: explanations may sound convincing while being subtly incorrect or incomplete.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/" rel="noopener noreferrer"&gt;Controlled experiments with experienced professional developers&lt;/a&gt; have shown that AI-assisted workflows can increase task completion time due to verification overhead and context mismatch, even when participants report higher perceived productivity.&lt;/p&gt;

&lt;p&gt;As a result, comprehension gains are highly sensitive to retrieval quality, grounding mechanisms, and task context.&lt;/p&gt;

&lt;h2&gt;
  
  
  LLM-based Tools for Code Understanding
&lt;/h2&gt;

&lt;p&gt;My exploration of LLM-based tools for code understanding started with &lt;a href="https://deepwiki.com/" rel="noopener noreferrer"&gt;DeepWiki&lt;/a&gt;, which I have been using since its early release. As my interest shifted toward analyzing private repositories and experimenting more deeply with the underlying mechanics, I began looking for open-source alternatives. This led me to &lt;a href="https://github.com/sopaco/deepwiki-rs" rel="noopener noreferrer"&gt;deepwiki-rs&lt;/a&gt; and later &lt;a href="https://github.com/AIDotNet/OpenDeepWiki" rel="noopener noreferrer"&gt;OpenDeepWiki&lt;/a&gt;. After starring OpenDeepWiki on GitHub, one of the authors of &lt;a href="https://github.com/davialabs/davia" rel="noopener noreferrer"&gt;Davia&lt;/a&gt; reached out, which introduced me to a different, more collaborative approach to AI-assisted documentation. I later encountered &lt;a href="https://github.com/The-Pocket/PocketFlow-Tutorial-Codebase-Knowledge" rel="noopener noreferrer"&gt;PocketFlow Tutorial Codebase Knowledge&lt;/a&gt; through a technical report, and finally &lt;a href="https://codewiki.google/" rel="noopener noreferrer"&gt;Google Code Wiki&lt;/a&gt; when it was publicly announced, which I followed closely given its enterprise positioning.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;What it produces&lt;/th&gt;
&lt;th&gt;How it’s used&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DeepWiki&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Wiki-style pages, diagrams, chat&lt;/td&gt;
&lt;td&gt;Read-only exploration of a repository snapshot&lt;/td&gt;
&lt;td&gt;Fast orientation; requires re-indexing to stay fresh&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Google Code Wiki&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Continuously regenerated wiki, diagrams, chat&lt;/td&gt;
&lt;td&gt;Living documentation synchronized with code&lt;/td&gt;
&lt;td&gt;Strong grounding and freshness; enterprise-oriented&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Davia&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Editable docs, visual boards, agent outputs&lt;/td&gt;
&lt;td&gt;Interactive, human-in-the-loop workspace&lt;/td&gt;
&lt;td&gt;Grounding depends on agent integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;deepwiki-rs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Structured docs, C4 diagrams&lt;/td&gt;
&lt;td&gt;Architecture analysis and reasoning&lt;/td&gt;
&lt;td&gt;Batch-generated; favors correctness over speed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenDeepWiki&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Structured docs, knowledge graphs&lt;/td&gt;
&lt;td&gt;Queryable knowledge layer for humans &amp;amp; agents&lt;/td&gt;
&lt;td&gt;Can act as infrastructure (MCP server)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PocketFlow Tutorial Codebase Knowledge&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Guided tutorials, explanations&lt;/td&gt;
&lt;td&gt;Learning and onboarding&lt;/td&gt;
&lt;td&gt;Optimizes clarity over completeness&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Although all of these tools aim to reduce the cost of understanding large codebases, they approach the problem from different angles. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://deepwiki.com/" rel="noopener noreferrer"&gt;DeepWiki&lt;/a&gt; and &lt;a href="https://codewiki.google/" rel="noopener noreferrer"&gt;Google Code Wiki&lt;/a&gt; focus on automatically generating structured, navigable wikis from repositories, optimizing for rapid orientation and high-level understanding. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/sopaco/deepwiki-rs" rel="noopener noreferrer"&gt;deepwiki-rs&lt;/a&gt; emphasizes architecture-first documentation, producing explicit &lt;a href="https://c4model.com/" rel="noopener noreferrer"&gt;C4 models&lt;/a&gt; and structural views that support reasoning about system boundaries and change impact. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/AIDotNet/OpenDeepWiki" rel="noopener noreferrer"&gt;OpenDeepWiki&lt;/a&gt; takes a more infrastructure-oriented approach, positioning itself as a structured code knowledge base that can be queried by both humans and agents and integrated into broader tooling ecosystems. &lt;/p&gt;

&lt;p&gt;In contrast, &lt;a href="https://github.com/davialabs/davia" rel="noopener noreferrer"&gt;Davia&lt;/a&gt; acts as an interactive, human-in-the-loop workspace, where AI agents help generate and evolve documentation collaboratively rather than producing a static artifact. &lt;/p&gt;

&lt;p&gt;Finally, &lt;a href="https://github.com/The-Pocket/PocketFlow-Tutorial-Codebase-Knowledge" rel="noopener noreferrer"&gt;PocketFlow Tutorial Codebase Knowledge&lt;/a&gt; reframes repositories as pedagogical artifacts, prioritizing approachability and onboarding through tutorial-style explanations.&lt;/p&gt;

&lt;p&gt;Together, these tools form a representative cross-section of current approaches to AI-assisted code comprehension, making them well-suited for a qualitative comparison across dimensions such as mental model formation, grounding and trust, freshness over time, and workflow fit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Qualitative dimensions comparison
&lt;/h2&gt;

&lt;p&gt;When comparing AI tools for code understanding, it helps to step back and ask a simple question: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;What part of the thinking process does this tool actually make easier?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Reading and understanding code is not a single activity but a sequence of cognitive steps: from getting oriented, to building confidence, to keeping that understanding up to date as the system evolves. &lt;/p&gt;

&lt;p&gt;The qualitative dimensions below reflect those realities and explain why different tools shine in different situations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mental model formation&lt;/strong&gt; is about how quickly a tool helps you answer the big-picture questions: What is this system? How is it structured? What are the main responsibilities and flows? Tools that excel here reduce the initial cognitive load by externalizing architecture and intent, allowing engineers to move from confusion to clarity without reading every file. This is especially valuable when joining a new project or revisiting a codebase after time away.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Grounding and trust&lt;/strong&gt; address a different concern: Can I rely on what this tool is telling me? Clear explanations are useful, but they only become actionable when they are tied back to concrete code: files, symbols, and implementation details that can be inspected and verified. Tools with strong grounding make it easy to validate claims, while weaker grounding forces engineers to double-check everything manually, reducing trust and limiting real productivity gains.&lt;/p&gt;
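&lt;p&gt;One way to reason about grounding is as a verification step: every symbol an explanation cites should be checkable against the actual source. The sketch below is a minimal illustration in Python (standard library only); the class and the “hallucinated” helper are invented examples.&lt;/p&gt;

```python
import ast

def verify_cited_symbols(cited: set[str], source: str) -> dict[str, bool]:
    """Check whether each symbol cited by an explanation is defined in the source."""
    tree = ast.parse(source)
    defined = {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    }
    return {symbol: symbol in defined for symbol in cited}

source = """
class Invoice:
    def total(self):
        return sum(self.lines)
"""

# An explanation citing one real class and one non-existent helper:
print(verify_cited_symbols({"Invoice", "apply_discount"}, source))
```

&lt;p&gt;Tools with strong grounding effectively bake this check in by linking every claim to a file or symbol; with weaker grounding, the engineer performs it by hand.&lt;/p&gt;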

&lt;p&gt;&lt;strong&gt;Freshness over time&lt;/strong&gt; reflects the reality that code changes constantly. Even the best explanation loses value if it no longer matches the current state of the system. Some tools provide powerful snapshots of understanding, while others focus on keeping documentation and explanations synchronized with ongoing code changes. This dimension matters most in fast-moving teams, where stale understanding can be more dangerous than no documentation at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workflow fit&lt;/strong&gt; recognizes that developers ask different questions at different moments. Early on, they want orientation; later, they want precision; sometimes they want learning, other times impact analysis or review support. Tools differ not in overall quality, but in which stage of understanding they optimize. A good fit aligns the tool with the user’s context: new contributor, experienced engineer, architect, or platform team rather than assuming one-size-fits-all comprehension.&lt;/p&gt;

&lt;p&gt;Taken together, I hope these dimensions help explain why no single AI tool “wins” across all scenarios. Each makes deliberate trade-offs to reduce a specific kind of cognitive friction, and understanding those trade-offs is key to choosing and using the right tool effectively.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mental model formation
&lt;/h3&gt;

&lt;p&gt;Mental model formation is about how quickly and accurately a tool helps a developer answer the fundamental question: “What is this system, and how does it fit together?” Different tools approach this problem in distinct ways.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DeepWiki&lt;/strong&gt; excels at &lt;em&gt;orientation speed&lt;/em&gt;: it synthesizes structure, responsibilities, and flow into wiki-style narratives and diagrams with almost zero setup. Ideal for “what is this repo?” moments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Google Code Wiki&lt;/strong&gt; goes further by maintaining architectural summaries that stay synchronized with code changes, reducing documentation drift.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;deepwiki-rs&lt;/strong&gt; is strongest when architecture matters more than narrative: &lt;a href="https://c4model.com/" rel="noopener noreferrer"&gt;C4 models&lt;/a&gt; and explicit component relationships help senior engineers reason about system boundaries and change impact.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Davia&lt;/strong&gt; and &lt;strong&gt;OpenDeepWiki&lt;/strong&gt; emphasize &lt;em&gt;semantic structure&lt;/em&gt; (entities, relations, graphs) over prose, which supports deeper, iterative understanding rather than instant summaries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;PocketFlow&lt;/strong&gt; deliberately simplifies architecture into tutorials, trading completeness for approachability.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In a nutshell: DeepWiki and Google Code Wiki optimize time-to-orientation, deepwiki-rs and OpenDeepWiki emphasize structural correctness, and PocketFlow prioritizes learnability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Grounding and trust
&lt;/h3&gt;

&lt;p&gt;Grounding and trust determine whether an engineer can act on what an AI tool says. Is the output yielded by the tool instantly actionable and linked to specific source code files and line numbers? Can a modernization architect trust the architectural diagrams generated by the tool?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Google Code Wiki&lt;/strong&gt; places a strong emphasis on grounding by design: its chat answers and wiki pages are explicitly linked to current repository artifacts—files, symbols, and definitions—and are regenerated as the code evolves. This tight coupling between explanations and source code reinforces trust and helps reduce hallucination risk, particularly in fast-moving codebases where stale documentation is a common failure mode.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;OpenDeepWiki&lt;/strong&gt; also scores highly on grounding, primarily through its use of structured representations such as knowledge graphs and its ability to act as an MCP (Model Context Protocol) server. Rather than presenting explanations in isolation, it is designed to expose explicit relationships between code elements, making it well-suited as a grounded context provider for downstream agents and tools.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DeepWiki&lt;/strong&gt; provides stronger grounding than a purely narrative system: its generated pages explicitly reference relevant source files and often include line-level citations, enabling engineers to verify architectural claims against the actual implementation. However, because DeepWiki represents a snapshot of the repository at indexing time, its output is best treated as a grounded but time-bound hypothesis: accurate and traceable, yet requiring awareness of potential drift as the codebase changes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;deepwiki-rs&lt;/strong&gt; approaches grounding through explicit, architecture-first artifacts rather than conversational explanations. Its outputs, such as C4 diagrams, component boundaries, and cross-references, are derived directly from static analysis of the source code, which makes their grounding relatively strong and inspectable. This tool implements a 4-step pipeline to generate documentation that includes &lt;a href="https://c4model.com/" rel="noopener noreferrer"&gt;C4 models&lt;/a&gt; of the codebase.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Davia&lt;/strong&gt; exhibits variable grounding characteristics that depend largely on the underlying AI agent and integration context (e.g., Copilot, Cursor). When paired with agents that perform structured retrieval and symbol-level navigation, Davia can support strong traceability; when used with weaker or less-contextual agents, grounding quality correspondingly degrades.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;PocketFlow&lt;/strong&gt; is intentionally weaker on grounding by design. Its primary goal is pedagogical clarity and onboarding, favoring simplified explanations and conceptual walkthroughs over exhaustive traceability to every file or symbol, which makes it effective for learning but less suitable for verification-heavy engineering tasks.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Freshness &amp;amp; evolution
&lt;/h3&gt;

&lt;p&gt;Freshness and evolution capture how well a tool preserves understanding as a codebase changes over time. For enterprises, this is a critical factor where yesterday’s explanation can quickly become misleading.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Google Code Wiki&lt;/strong&gt; is explicitly designed to regenerate content continuously as code changes, which is its defining advantage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;deepwiki-rs&lt;/strong&gt; and &lt;strong&gt;OpenDeepWiki&lt;/strong&gt; can be re-run to refresh docs, but this is typically batch-driven.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DeepWiki&lt;/strong&gt; reflects repo state at analysis time; freshness depends on re-indexing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Davia&lt;/strong&gt; shines in &lt;em&gt;interactive evolution&lt;/em&gt;: docs can be edited, refined, and co-created alongside agents.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;PocketFlow&lt;/strong&gt; outputs static tutorials unless the pipeline is rerun.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If “docs rot” is your core pain point, &lt;strong&gt;Google Code Wiki&lt;/strong&gt; is uniquely positioned. Documentation becomes outdated when three conditions combine: 1) code changes frequently (PRs, refactors, dependency updates); 2) docs are regenerated only manually or periodically; and 3) no automatic coupling exists between code deltas and documentation updates.&lt;/p&gt;

&lt;p&gt;Even AI-generated docs rot if they are snapshot-based, re-run manually, and detached from the CI/repo lifecycle. DeepWiki, Davia, deepwiki-rs, and OpenDeepWiki all operate as snapshots, even if they are very good snapshots.&lt;/p&gt;

&lt;p&gt;On the other hand, Google’s Code Wiki is designed to continuously update a structured wiki for codebases, where each wiki section and chat answer is hyperlinked to code files, classes, and functions (&lt;a href="https://developers.googleblog.com/introducing-code-wiki-accelerating-your-code-understanding" rel="noopener noreferrer"&gt;Code Wiki, 2025&lt;/a&gt;).&lt;/p&gt;
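&lt;p&gt;To make that coupling concrete, the check itself is small. The following TypeScript sketch (all names are illustrative, not any tool’s actual API) flags generated pages whose referenced source files have changed since the page was last regenerated, the kind of step a CI pipeline could run on every push:&lt;/p&gt;

```typescript
// Map each generated doc page to the source files it cites,
// plus the timestamp at which the page was last regenerated.
// (Illustrative types; not any tool's real schema.)
interface DocPage {
  path: string;
  sources: string[];   // source files the page references
  generatedAt: number; // unix seconds of last regeneration
}

// Given files touched since regeneration (file -> change timestamp),
// return the pages that are now potentially stale.
function stalePages(pages: DocPage[], changed: Map<string, number>): string[] {
  return pages
    .filter((page) =>
      page.sources.some((src) => {
        const changedAt = changed.get(src);
        return changedAt !== undefined && changedAt > page.generatedAt;
      })
    )
    .map((page) => page.path);
}

// Example: arch.md cites core.ts, which changed after regeneration.
const pages: DocPage[] = [
  { path: "arch.md", sources: ["src/core.ts"], generatedAt: 100 },
  { path: "api.md", sources: ["src/api.ts"], generatedAt: 100 },
];
const changed = new Map([["src/core.ts", 150]]);
console.log(stalePages(pages, changed)); // logs ["arch.md"]
```

&lt;p&gt;A snapshot tool leaves running this check to you; Code Wiki’s pitch is that the equivalent coupling happens automatically on every change.&lt;/p&gt;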

&lt;h3&gt;
  
  
  Workflow fit
&lt;/h3&gt;

&lt;p&gt;Workflow fit describes how well a tool aligns with the &lt;strong&gt;moment&lt;/strong&gt; an engineer is in and the type of question they are trying to answer, whether they are onboarding, validating changes, reviewing code, or planning modernization.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Persona&lt;/th&gt;
&lt;th&gt;Best Fit&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;New codebase contributor&lt;/td&gt;
&lt;td&gt;DeepWiki, PocketFlow&lt;/td&gt;
&lt;td&gt;New contributors need fast orientation, not exhaustive correctness. Their primary questions are what this repo is, how it is structured, and where to start. DeepWiki accelerates this by generating an immediate, structured mental model with minimal setup, while PocketFlow goes further by turning the codebase into guided, beginner-friendly tutorials. Both reduce the initial cognitive barrier to entry.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enterprise engineer in fast-moving repo&lt;/td&gt;
&lt;td&gt;Google Code Wiki&lt;/td&gt;
&lt;td&gt;In enterprise environments, the dominant risk is documentation drift. Engineers often understand the domain but struggle to keep up with what has changed and whether existing documentation is still accurate. Google Code Wiki addresses this directly by treating documentation as a continuously regenerated artifact, keeping summaries, diagrams, and explanations synchronized with the codebase.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI-native team using agents daily&lt;/td&gt;
&lt;td&gt;Davia, OpenDeepWiki&lt;/td&gt;
&lt;td&gt;AI-native teams want tools that integrate into agent workflows rather than static documentation. Davia provides an interactive, human-in-the-loop workspace where documentation evolves alongside agent reasoning. OpenDeepWiki complements this with structured, machine-readable knowledge that agents can query, reuse, and extend across tasks.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Architect/modernization team&lt;/td&gt;
&lt;td&gt;deepwiki-rs&lt;/td&gt;
&lt;td&gt;Architects prioritize system boundaries, dependencies, and change impact over narrative explanations. deepwiki-rs supports this by producing architecture-first outputs such as C4 diagrams and explicit component relationships, enabling more reliable reasoning about refactoring, modernization, and system decomposition.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Platform/tooling team&lt;/td&gt;
&lt;td&gt;OpenDeepWiki (as infra)&lt;/td&gt;
&lt;td&gt;Platform teams focus on enabling others at scale. OpenDeepWiki can act as a centralized code knowledge layer, ingesting repositories and exposing structured context to IDEs, agents, and internal tools. Its extensibility makes it suitable as shared infrastructure rather than a single-user productivity tool.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Taken together, these dimensions and personas show that adopting LLM-based tools for code understanding is less about choosing the ‘best’ tool and more about choosing the right one for a given moment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;p&gt;LLMs and agent-based tools are best understood as &lt;strong&gt;cognitive amplifiers for code comprehension&lt;/strong&gt;, not as replacements for human judgment or engineering expertise. Across tools like DeepWiki, Google Code Wiki, Davia, and OpenDeepWiki, their strongest and most defensible value is not in producing “answers,” but in &lt;strong&gt;compressing the early phases of understanding&lt;/strong&gt;: helping engineers orient themselves, explore structure, and form testable hypotheses about how a system works. &lt;/p&gt;

&lt;p&gt;In practice, &lt;strong&gt;&lt;em&gt;these tools help engineers move faster from ‘I don’t know this system’ to ‘I know where to look and what to verify&lt;/em&gt;&lt;/strong&gt;. They do this by externalizing structure and intent: surfacing architectures, highlighting key files and relationships, and guiding engineers toward “where to look next.” This aligns with broader DevOps and software delivery research showing that practices which improve team flow and feedback loops (such as shorter lead times, faster deployment frequency, and effective collaboration) correlate strongly with organizational performance and developer productivity beyond raw coding speed (as documented in the yearly &lt;a href="https://dora.dev/" rel="noopener noreferrer"&gt;DORA reports&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;However, sustainable impact depends on &lt;strong&gt;how these tools are integrated&lt;/strong&gt;, not on model capability alone. The most effective setups ground explanations in concrete code artifacts (files, symbols, line ranges), leverage structure-aware context (architecture, dependencies, knowledge graphs), and explicitly frame AI output as a &lt;em&gt;starting point for validation&lt;/em&gt; rather than an authoritative source. Snapshot-based wikis, continuously regenerated documentation, and agent-driven knowledge layers each solve different parts of the problem and must be chosen deliberately based on workflow and organizational needs.&lt;/p&gt;

&lt;p&gt;Finally, teams that succeed with LLMs for code understanding are those that &lt;strong&gt;measure the right outcomes&lt;/strong&gt;. Metrics such as time-to-locate relevant code, time-to-explain a subsystem, onboarding speed, and review latency better reflect real comprehension gains than lines of code generated or tasks automated. When adoption is guided by these understanding-centric outcomes, LLMs and agents can deliver durable, compounding benefits rather than short-lived productivity illusions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources note
&lt;/h2&gt;

&lt;p&gt;This article draws on peer-reviewed program comprehension research, recent empirical studies on AI-assisted development, and primary documentation from the tools discussed. All claims are supported by publicly available sources linked inline.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>codeunderstanding</category>
      <category>agents</category>
    </item>
    <item>
      <title>How AI Coding Agents Are Reshaping Developer Workflows</title>
      <dc:creator>Esteban S. Abait</dc:creator>
      <pubDate>Wed, 05 Nov 2025 13:47:38 +0000</pubDate>
      <link>https://forem.com/eabait/how-ai-coding-agents-are-reshaping-developer-workflows-3249</link>
      <guid>https://forem.com/eabait/how-ai-coding-agents-are-reshaping-developer-workflows-3249</guid>
      <description>&lt;p&gt;&lt;em&gt;By Esteban S. Abait — November 2025&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This article is, in effect, a summary of my own research. Throughout 2025, I have been digging into how &lt;em&gt;AI Agents&lt;/em&gt; are changing the Software Development Life Cycle (SDLC) and how they are transforming the workflows developers use to &lt;em&gt;deliver&lt;/em&gt; software. The article also mixes in some of my opinions, based on my own side-project experiences (&lt;a href="https://dev.to/eabait/designing-agentic-workflows-lessons-from-orchestration-context-and-ux-13j"&gt;like this one&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;There is a common agreement in the industry that &lt;strong&gt;new best practices for coding and software engineering are emerging&lt;/strong&gt; as a result of adopting AI to accelerate the SDLC.&lt;/p&gt;

&lt;p&gt;Ever since I read the seminal book &lt;a href="https://itrevolution.com/product/accelerate/" rel="noopener noreferrer"&gt;&lt;em&gt;Accelerate&lt;/em&gt;&lt;/a&gt;, I have closely followed the &lt;a href="https://dora.dev/research/2025/dora-report/" rel="noopener noreferrer"&gt;DORA (DevOps Research and Assessment)&lt;/a&gt; reports to identify which practices actually work for software organizations. The DORA reports are based on rigorous statistical analysis of large-scale surveys of thousands of technology professionals worldwide. Their mission is to identify practices that predict software delivery performance and other organizational outcomes.&lt;/p&gt;

&lt;p&gt;The goal of this article is to review a few real-world documented experiences from developers experimenting with AI coding agents and to &lt;strong&gt;contrast them with the latest DORA 2025 State of AI-Assisted Software Development Report&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The selected articles reveal how developers are changing their day-to-day workflows through &lt;em&gt;agentic, parallel, and vibe-coding loops&lt;/em&gt;. Comparing the findings of these articles against DORA’s insights sheds light on what is truly working and what new ideas are emerging from the field that can be leveraged in enterprise settings.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Developers are building &lt;em&gt;agentic loops&lt;/em&gt;, not just prompts
&lt;/h2&gt;

&lt;p&gt;In his post &lt;em&gt;&lt;a href="https://simonwillison.net/2025/Sep/30/designing-agentic-loops/" rel="noopener noreferrer"&gt;Designing Agentic Loops&lt;/a&gt;&lt;/em&gt;, Simon Willison argues that the future of AI-assisted development lies in &lt;em&gt;goal-driven loops&lt;/em&gt;, not one-off prompts:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Analyze → Plan → Implement → Test → Review → Iterate&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Rather than treating AI as autocomplete, developers now design &lt;em&gt;closed feedback loops&lt;/em&gt; where agents can reason, test, and refine outputs.&lt;br&gt;&lt;br&gt;
DORA 2025 backs this: throughput rises when feedback loops exist — but stability drops if they’re missing or unmanaged.&lt;/p&gt;

&lt;p&gt;The Kim &amp;amp; Yegge essay &lt;em&gt;&lt;a href="https://itrevolution.com/articles/the-vibe-coding-loop" rel="noopener noreferrer"&gt;The Vibe Coding Loop&lt;/a&gt;&lt;/em&gt; extends this idea: vibe-coding reframes development as &lt;strong&gt;continuous conversation and orchestration&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
The developer’s role becomes less about typing code and more about &lt;em&gt;directing agents&lt;/em&gt; through intent, goals, and feedback.&lt;br&gt;&lt;br&gt;
It’s still engineering — but at a higher altitude.&lt;/p&gt;
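&lt;p&gt;The loop is simple enough to sketch in code. In the TypeScript outline below, the agent methods are stand-ins for LLM calls (the names are mine, not Willison’s); the point is that the loop itself, with its test and review gates, is ordinary deterministic code:&lt;/p&gt;

```typescript
// One pass of Analyze → Plan → Implement → Test → Review → Iterate.
// The agent callbacks are hypothetical stand-ins for LLM calls.
interface LoopAgent {
  plan(goal: string): string;
  implement(plan: string): string;
  test(change: string): boolean;   // e.g., run the test suite
  review(change: string): boolean; // human or agent review gate
}

function agenticLoop(agent: LoopAgent, goal: string, maxIterations = 5): string | null {
  for (let i = 0; i < maxIterations; i++) {
    const plan = agent.plan(goal);
    const change = agent.implement(plan);
    // Closed feedback loop: a failed test or review sends the agent around again.
    if (agent.test(change) && agent.review(change)) {
      return change; // accepted output
    }
  }
  return null; // budget exhausted without an accepted change
}
```

&lt;p&gt;Note that without the &lt;code&gt;test&lt;/code&gt; and &lt;code&gt;review&lt;/code&gt; checks this degenerates into one-shot prompting, which is exactly the missing-feedback-loop case DORA associates with reduced stability.&lt;/p&gt;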




&lt;h2&gt;
  
  
  2. Quality internal platforms are the new accelerators
&lt;/h2&gt;

&lt;p&gt;The DORA report emphasizes that &lt;strong&gt;Quality Internal Platforms&lt;/strong&gt; (QIP) amplify AI’s benefits. So, what is a QIP? A QIP is defined as a set of shared systems, services, and code artifacts that standardize and abstract best practices, enabling developers to build and deliver reliable, secure applications quickly and independently. Typically, this platform consists of shared CI/CD, standard pipelines, self-service tools, guardrails, observability tools, and access to development environments, among others.&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;&lt;a href="https://blog.joemag.dev/2025/10/the-new-calculus-of-ai-based-coding.html" rel="noopener noreferrer"&gt;The New Calculus of AI-based Coding&lt;/a&gt;&lt;/em&gt;, Joe Magerramov explains how his team achieves ~80% AI-generated commits through a strictly controlled environment with “steering rules,” review gates, and fast pipelines.&lt;br&gt;&lt;br&gt;
Freedom without structure leads to chaos; structure with autonomy enables safe speed.  &lt;/p&gt;

&lt;p&gt;Magerramov’s experience matches what DORA considers a QIP, and he draws a similar conclusion: the developer platform boosts AI’s positive impact and creates psychological safety for teams to experiment responsibly.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Context is the new fuel
&lt;/h2&gt;

&lt;p&gt;Every modern reflection on agentic coding echoes the same point: context beats clever prompting.&lt;br&gt;&lt;br&gt;
In &lt;em&gt;&lt;a href="https://steipete.me/posts/just-talk-to-it" rel="noopener noreferrer"&gt;Just Talk To It&lt;/a&gt;&lt;/em&gt;, Peter Steinberger notes that agents should “read the code, docs, and tests” instead of being micromanaged through verbose prompts.&lt;br&gt;&lt;br&gt;
Magerramov adds that internal test harnesses, mocks, and dependency fakes provide the “scaffolding” that lets agents iterate confidently.&lt;/p&gt;

&lt;p&gt;The vibe-coding framework described in &lt;em&gt;&lt;a href="https://itrevolution.com/articles/the-key-vibe-coding-practices" rel="noopener noreferrer"&gt;The Key Vibe Coding Practices&lt;/a&gt;&lt;/em&gt; extends this: developers maintain a shared context across agents (what Kim &amp;amp; Yegge call the &lt;em&gt;“vibe space”&lt;/em&gt;) so that multiple agents operate within the same state of understanding.  &lt;/p&gt;

&lt;p&gt;That shared context is exactly what DORA categorizes as &lt;strong&gt;AI-Accessible Internal Data&lt;/strong&gt;. As stated in the DORA report: &lt;em&gt;Connect your AI tools to your internal systems to move beyond generic assistance and unlock boosts in individual effectiveness and code quality. This means going beyond simply procuring licenses and investing the engineering effort to give your AI tools secure access to internal documentation, codebases, and other data sources. This provides the company-specific context necessary for the tools to be maximally effective.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Clarity builds trust (and reduces friction)
&lt;/h2&gt;

&lt;p&gt;Willison advocates sandboxing and “blast radius” controls; Magerramov requires every AI commit to be reviewed by a human.&lt;br&gt;&lt;br&gt;
This is not bureaucracy — it’s how teams build trust.&lt;br&gt;&lt;br&gt;
DORA calls this a &lt;strong&gt;Clear + Communicated AI Stance&lt;/strong&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A “clear and communicated AI stance” refers to the comprehensibility and awareness of an organization’s official position regarding how its developers are expected and permitted to use AI-assisted development tools.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In a nutshell, organizations must publish policies for tools, scope, and security, and officially foster their adoption. When developers know the rules, they experiment more boldly.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Parallel agents and small batches: the new flow unit
&lt;/h2&gt;

&lt;p&gt;The next frontier in agentic practice is &lt;strong&gt;parallelism&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
In &lt;em&gt;&lt;a href="https://simonwillison.net/2025/Oct/5/parallel-coding-agents/" rel="noopener noreferrer"&gt;Embracing the Parallel Coding Agent Lifestyle&lt;/a&gt;&lt;/em&gt;, Willison describes running multiple agents side-by-side, each attempting a feature or fix, then comparing and merging results.&lt;br&gt;&lt;br&gt;
The &lt;em&gt;Pragmatic Engineer&lt;/em&gt; newsletter calls this &lt;em&gt;“programming by kicking off parallel agents”&lt;/em&gt; where one agent explores approach A, another tests approach B, and the developer acts as reviewer/orchestrator.&lt;/p&gt;

&lt;p&gt;This fits beautifully with DORA’s findings: &lt;strong&gt;Working in Small Batches&lt;/strong&gt; and &lt;strong&gt;Strong Version Control Practices&lt;/strong&gt; improve delivery speed and stability.  &lt;/p&gt;

&lt;p&gt;Parallel agents could potentially &lt;em&gt;scale the small-batch principle horizontally&lt;/em&gt;: multiple independent changes, fast feedback, safe integration.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;However, more research will be needed to ensure parallel agents provide a real gain in productivity.&lt;/p&gt;
&lt;/blockquote&gt;
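&lt;p&gt;Mechanically, the pattern is a fan-out with independent failures: several agents attempt the same task, and the developer reviews whatever survives. A minimal TypeScript sketch (the agent functions are hypothetical stand-ins for real coding agents) might look like this:&lt;/p&gt;

```typescript
// Run several agent attempts concurrently and collect candidates for review.
type Attempt = (task: string) => Promise<string>;

async function parallelAttempts(task: string, agents: Attempt[]): Promise<string[]> {
  // Promise.allSettled keeps attempts independent:
  // one failed agent does not block or abort the others.
  const results = await Promise.allSettled(agents.map((run) => run(task)));
  const candidates: string[] = [];
  for (const r of results) {
    if (r.status === "fulfilled") candidates.push(r.value);
  }
  return candidates; // the developer reviews, compares, and merges
}
```

&lt;p&gt;Each candidate is still a small batch; the parallelism only widens how many small batches are in flight at once, which is why review capacity, not agent count, becomes the bottleneck.&lt;/p&gt;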




&lt;h2&gt;
  
  
  6. UX and developer experience define adoption
&lt;/h2&gt;

&lt;p&gt;Kim &amp;amp; Yegge’s &lt;em&gt;Vibe Coding Loop&lt;/em&gt; positions the developer as a &lt;strong&gt;conductor of agents&lt;/strong&gt;, using language, emotion, and intent to shape outcomes.&lt;br&gt;&lt;br&gt;
Steinberger’s observations about cognitive load reinforce this: for creative tasks, AI should respond to &lt;em&gt;intent&lt;/em&gt;, not flood you with completions.&lt;br&gt;&lt;br&gt;
Magerramov adds mode-switching with chat for ideation, completions for boilerplate, and agentic commits for execution. This pattern is what DORA labels as &lt;strong&gt;User-Centric Focus&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Together, vibe-coding and parallel-agents illustrate a new UX layer: AI is not a single assistant but an ensemble you conduct.  &lt;/p&gt;




&lt;h2&gt;
  
  
  7. Value appears when the loop connects to the business
&lt;/h2&gt;

&lt;p&gt;At the enterprise level, DORA stresses &lt;strong&gt;Value Stream Management (VSM)&lt;/strong&gt;. VSM is described as the practice of visualizing, analyzing, and improving the flow of work from idea to customer. This practice involves charting the entire software delivery lifecycle, which covers: product discovery, design, development, testing, deployment, and operations.&lt;/p&gt;

&lt;p&gt;The most significant finding regarding VSM in the context of the State of AI-assisted Software Development (2025) is that it acts as the force multiplier that ensures AI investment delivers a competitive advantage. The use of VSM should help organizations to track how AI affects lead time, rework, and deployment frequency. In essence, VSM provides the necessary clarity and system view required to unlock new technologies like AI, preventing them from creating "disconnected local optimizations" and instead ensuring they translate into significant organizational advantage.&lt;/p&gt;

&lt;p&gt;This finding is also supported by other experiences. Magerramov calls this the “cost-benefit rebalance”: as agentic loops increase velocity, you must upgrade testing and infrastructure to maintain trust.  &lt;/p&gt;

&lt;p&gt;Vibe-coding adds a human dimension here: teams should measure not just velocity but &lt;em&gt;quality of flow&lt;/em&gt; — how collaboration feels and how developers experience creative momentum.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. The Unified Playbook
&lt;/h2&gt;

&lt;p&gt;The following table summarizes the findings from all surveyed sources on successful AI agent adoption.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Focus Area&lt;/th&gt;
&lt;th&gt;Practical Action&lt;/th&gt;
&lt;th&gt;Expected Effect&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Policy &amp;amp; Trust&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Publish clear AI usage policy; require human review for agent commits.&lt;/td&gt;
&lt;td&gt;Builds confidence and reduces risk.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context Grounding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Connect AI tools to internal repos, tests, and “vibe spaces.”&lt;/td&gt;
&lt;td&gt;Improves accuracy and code quality.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Platform Engineering&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Invest in strong internal platforms and fast feedback loops.&lt;/td&gt;
&lt;td&gt;Amplifies AI impact, reduces friction.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Flow Efficiency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Adopt agentic loops and parallel agents; prefer small batches.&lt;/td&gt;
&lt;td&gt;Improves throughput and stability.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Human Oversight&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Keep reviewers in the loop; instrument metrics of trust.&lt;/td&gt;
&lt;td&gt;Controls instability and builds learning.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Developer UX&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Embrace vibe-coding principles — intent-driven orchestration, mode switching.&lt;/td&gt;
&lt;td&gt;Reduces cognitive load, enhances flow.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Value Measurement&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Use VSM metrics and developer experience surveys to measure impact.&lt;/td&gt;
&lt;td&gt;Converts speed into sustainable value.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Final Takeaway
&lt;/h2&gt;

&lt;p&gt;AI coding agents aren’t replacing developers; they’re reshaping &lt;em&gt;how developers work&lt;/em&gt;.&lt;br&gt;&lt;br&gt;
The modern workflow is a blend of &lt;strong&gt;agentic loops&lt;/strong&gt; that are best leveraged by internal quality platforms, grounded in the seven DORA capabilities that make AI effective in real teams.&lt;br&gt;&lt;br&gt;
The teams winning with AI aren’t chasing full autonomy; they’re mastering value streams, working in small chunks, creating feedback loops, and using mature internal development platforms with clear policies and agents grounded in a shared internal context.  &lt;/p&gt;




&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2025/Sep/30/designing-agentic-loops/" rel="noopener noreferrer"&gt;Designing Agentic Loops — Simon Willison&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2025/Oct/5/parallel-coding-agents/" rel="noopener noreferrer"&gt;Embracing the Parallel Coding Agent Lifestyle — Simon Willison&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://steipete.me/posts/just-talk-to-it" rel="noopener noreferrer"&gt;Just Talk To It — Peter Steinberger&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://itrevolution.com/articles/the-vibe-coding-loop" rel="noopener noreferrer"&gt;The Vibe Coding Loop — IT Revolution&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://itrevolution.com/articles/the-key-vibe-coding-practices" rel="noopener noreferrer"&gt;The Key Vibe Coding Practices — IT Revolution&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://newsletter.pragmaticengineer.com/p/new-trend-programming-by-kicking" rel="noopener noreferrer"&gt;Programming by Kicking Off Parallel Agents — The Pragmatic Engineer&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://blog.joemag.dev/2025/10/the-new-calculus-of-ai-based-coding.html" rel="noopener noreferrer"&gt;The New Calculus of AI-based Coding — Joe Magerramov&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dora.dev/research/ai/2025" rel="noopener noreferrer"&gt;2025 DORA State of AI-Assisted Software Development Report&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Acknowledgment: This article was researched and drafted in collaboration with ChatGPT (GPT-5), used as a co-writer and technical synthesis partner.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>softwareengineering</category>
      <category>productivity</category>
      <category>agents</category>
    </item>
    <item>
      <title>Designing Agentic Workflows: Lessons from Orchestration, Context, and UX</title>
      <dc:creator>Esteban S. Abait</dc:creator>
      <pubDate>Wed, 22 Oct 2025 01:14:41 +0000</pubDate>
      <link>https://forem.com/eabait/designing-agentic-workflows-lessons-from-orchestration-context-and-ux-13j</link>
      <guid>https://forem.com/eabait/designing-agentic-workflows-lessons-from-orchestration-context-and-ux-13j</guid>
      <description>&lt;p&gt;Many challenges in AI products stem less from choosing frameworks and more from how user experience (UX) and architecture shape each other. &lt;/p&gt;

&lt;p&gt;I first noticed this while using ChatGPT to draft and maintain product requirement documents (PRDs) — reusing prompt variants, manually curating context, and constantly tweaking outputs to stay aligned. The workflow technically worked, but it felt brittle and overly manual.&lt;/p&gt;

&lt;p&gt;That experience raised a question: &lt;strong&gt;What might it take for an &lt;em&gt;agentic workflow&lt;/em&gt; — a coordinated system of specialized LLM sub-agents orchestrated by code rather than a single prompt — to produce and maintain a complex artifact like a PRD without so much manual prompting, context oversight, and guesswork?&lt;/strong&gt; More broadly, how could changes in architecture and UX design improve usability, predictability, and trust in such systems?&lt;/p&gt;

&lt;p&gt;This write-up shares early &lt;strong&gt;technical and UX explorations behind building an agentic workflow for structured artifacts&lt;/strong&gt;, using PRDs as the initial testbed. The goal wasn’t to ship a product but to experiment — testing hypotheses about orchestration, context design, and agent-centric UX. Although exploratory, the lessons may apply to multi-step, artifact-producing AI workflows in general.&lt;/p&gt;

&lt;p&gt;Terminology and scope: This is an agentic workflow—LLMs and tools orchestrated through predefined code paths—rather than an autonomous agent that directs its own tools. The hypotheses here are exploratory and intended to inform agents and AI‑powered products more broadly; this project is a working lab, not a shipped product.&lt;/p&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Audience and Stack&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Iteration 0 – Initial Architecture&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Iteration 1 – Context Awareness&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iteration 2 – UX-Driven Agentic Workflow&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Key UX Principles and Screens&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Architectural Foundations Behind the UX&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Design Principles for Agentic Workflows&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;strong&gt;Creation vs. Editing&lt;/strong&gt;
&lt;/li&gt;

&lt;li&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/li&gt;

&lt;li&gt;&lt;strong&gt;Working Hypotheses&lt;/strong&gt;&lt;/li&gt;

&lt;li&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  Audience and Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Audience:&lt;/strong&gt; AI engineers, architects, product designers, and UX practitioners working on multi-step or long-running agentic workflows and AI-powered products.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stack:&lt;/strong&gt; TypeScript monorepo, Vercel AI SDK, Zod schemas, Next.js frontend, OpenRouter integration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/eabait/product-agents" rel="noopener noreferrer"&gt;github.com/eabait/product-agents&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Influences:&lt;/strong&gt;
Anthropic on orchestration and context engineering,&lt;sup&gt;&lt;a href="https://www.anthropic.com/engineering/building-effective-agents" rel="noopener noreferrer"&gt;1&lt;/a&gt;, &lt;a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents" rel="noopener noreferrer"&gt;2&lt;/a&gt;&lt;/sup&gt;
Breunig on context failures and curation,&lt;sup&gt;&lt;a href="https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-how-to-fix-them.html" rel="noopener noreferrer"&gt;3&lt;/a&gt;, &lt;a href="https://www.dbreunig.com/2025/06/26/how-to-fix-your-context.html" rel="noopener noreferrer"&gt;4&lt;/a&gt;&lt;/sup&gt;
Luke Wroblewski on UX patterns in AI,&lt;sup&gt;&lt;a href="https://lukew.com/ff/entry.asp?2107" rel="noopener noreferrer"&gt;5&lt;/a&gt;, &lt;a href="https://lukew.com/ff/entry.asp?2113" rel="noopener noreferrer"&gt;6&lt;/a&gt;&lt;/sup&gt;
Jakob Nielsen on wait-time transparency and “Slow AI.”&lt;sup&gt;&lt;a href="https://jakobnielsenphd.substack.com/p/slow-ai" rel="noopener noreferrer"&gt;7&lt;/a&gt;&lt;/sup&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Iteration 0 – Initial Architecture
&lt;/h2&gt;

&lt;p&gt;The first version established the core shape of the workflow, introducing three key roles within the agentic system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Analyzer&lt;/strong&gt; - a subagent that extracts or classifies structured information.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Writer&lt;/strong&gt; - a subagent responsible for generating a specific section of the PRD.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestrator&lt;/strong&gt; - the controller that coordinates analyzers and writers across the workflow. This is implemented directly in the code without using an LLM (as opposed to the definition given by Anthropic&lt;sup&gt;&lt;a href="https://www.anthropic.com/engineering/building-effective-agents" rel="noopener noreferrer"&gt;1&lt;/a&gt;&lt;/sup&gt;).
&lt;/li&gt;
&lt;/ul&gt;
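&lt;p&gt;These three roles can be captured in a few lines of TypeScript. This is a sketch in the spirit of the project’s stack, not the repository’s actual source; all names are illustrative:&lt;/p&gt;

```typescript
// The three roles of the workflow. Analyzer and Writer would wrap LLM calls;
// the Orchestrator is deterministic code, not an LLM.
interface Analyzer {
  analyze(input: string): Record<string, string>; // structured extraction
}

interface Writer {
  section: string;
  write(analysis: Record<string, string>): string;
}

// Plain-code orchestrator: run analysis once, fan out to every writer,
// and merge sections deterministically in declaration order.
function orchestrate(analyzer: Analyzer, writers: Writer[], input: string): Record<string, string> {
  const analysis = analyzer.analyze(input);
  const doc: Record<string, string> = {};
  for (const writer of writers) {
    doc[writer.section] = writer.write(analysis);
  }
  return doc;
}
```

&lt;p&gt;Keeping the orchestrator as plain code is what makes the workflow predictable: the control flow never depends on an LLM’s interpretation of the plan.&lt;/p&gt;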

&lt;p&gt;Together, these components formed the foundation of the agentic workflow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Clarification subagent&lt;/strong&gt; acts as a gate before analysis — Ensures there’s enough grounding to write anything; the system asks 0–3 targeted questions, then proceeds or returns with gaps. Example: if “target users” is missing, it prompts for personas and primary jobs-to-be-done before any writer runs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Centralized analyzers&lt;/strong&gt;: Context, Risks, Summary — Consolidates one-pass extraction into a structured bundle reused by all writers, avoiding repeated reasoning and drift. Example: the risk list produced once is consumed by multiple sections needing constraints/assumptions.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple section writers&lt;/strong&gt; (e.g., context, problem, assumptions, metrics) — Decouples generation so sections can evolve independently and merge deterministically. Example: in a PRD or similar structured artifact, only the Metrics writer reruns when you request “tighten success metrics.”
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dual paths&lt;/strong&gt;: full PRD generation and targeted edits — Selects creation vs. editing based on inputs and the presence of an existing PRD to improve efficiency and stability. Example: if a prior PRD is supplied and the request says “update constraints,” only that slice is scheduled.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared-analysis&lt;/strong&gt; caching to avoid redundant analyzer runs — Keys analyzer outputs by inputs so subsequent edits reuse results without recomputing. Example: copy tweaks reuse the same context summary instead of re-extracting from scratch.&lt;/li&gt;
&lt;/ul&gt;
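&lt;p&gt;The clarification gate above can be sketched as a small pure function. The field names and questions below are illustrative assumptions, not the project’s actual schema; the real subagent uses an LLM, but the contract is the same: ask at most three targeted questions or let the writers proceed.&lt;/p&gt;

```typescript
// Sketch of the clarification gate; field names and question text are
// hypothetical placeholders for the real LLM-backed subagent.

interface PrdRequest {
  goal?: string;
  targetUsers?: string;
  constraints?: string;
}

interface GateResult {
  proceed: boolean;
  questions: string[]; // at most three, one per missing grounding field
}

const REQUIRED: Array<[keyof PrdRequest, string]> = [
  ["goal", "What problem should this PRD solve?"],
  ["targetUsers", "Who are the target users (personas, jobs-to-be-done)?"],
  ["constraints", "Are there known constraints (latency, budget, compliance)?"],
];

function clarificationGate(req: PrdRequest): GateResult {
  const questions = REQUIRED
    .filter(([field]) => !req[field]?.trim())
    .map(([, question]) => question)
    .slice(0, 3); // the gate never asks more than three questions
  return { proceed: questions.length === 0, questions };
}
```

&lt;p&gt;Calling the gate with only a goal, for example, returns the two questions about target users and constraints instead of letting any writer run.&lt;/p&gt;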

&lt;p&gt;Early issues still surfaced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Analyzer overhead and coupling increased latency
&lt;/li&gt;
&lt;li&gt;Early UI offered limited visibility into the workflow’s steps
&lt;/li&gt;
&lt;li&gt;The editing path existed but lacked the confidence signals and telemetry to guide edits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These friction points informed the next iteration, which focused on improving context handling and reducing latency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxhu38abgw2j4dpdyi6w4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxhu38abgw2j4dpdyi6w4.png" alt="Mermaid diagram for v0" width="800" height="1225"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Iteration 1 – Context Awareness
&lt;/h2&gt;

&lt;p&gt;Applying ideas from Anthropic and Breunig, the workflow evolved toward &lt;strong&gt;planned cognition and curated context&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Five-section PRD redesign&lt;/strong&gt; (Target Users, Solution, Features, Metrics, Constraints) — Aligns the artifact with audience-facing sections and reduces ambiguity about ownership. Example: “add OKRs” maps cleanly to the Metrics section.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel section writers&lt;/strong&gt; atop a shared ContextAnalyzer result — Lowers latency and coupling by letting independent writers run concurrently on the same structured inputs. Example: Solution and Features complete in parallel.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SectionDetectionAnalyzer subagent&lt;/strong&gt; for edit routing and autodetecting existing PRDs — Interprets requests and selects affected sections, and if a PRD is present, defaults to editing. Example: “tighten constraints about latency” routes only to Constraints.
&lt;/li&gt;
&lt;li&gt;Confidence metadata per section to aid UX transparency — Each output carries a confidence hint so the UI can flag fragile changes. Example: low confidence on Personas nudges the user to review.
&lt;/li&gt;
&lt;li&gt;Modularized pipeline helpers (clarification check, shared analysis, parallel sections, assembly) — Improves maintainability and testability; responsibilities are isolated so new writers or analyzers slot in without side effects.&lt;/li&gt;
&lt;/ul&gt;
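&lt;p&gt;The parallel section writers can be sketched as a fan-out over a shared analysis bundle. The writer bodies and section names below are illustrative placeholders, not the repository’s API; the point is that analysis runs once and independent writers consume it concurrently.&lt;/p&gt;

```typescript
// Sketch of parallel section writers over one shared analysis result.
// Writer implementations here are stand-ins for LLM-backed writers.

interface AnalysisBundle {
  context: string;
  risks: string[];
}

type SectionWriter = (analysis: AnalysisBundle) => Promise<string>;

// Every writer consumes the same bundle, so the analyzer runs only once.
const writers: Record<string, SectionWriter> = {
  solution: async (a) => `Solution grounded in: ${a.context}`,
  constraints: async (a) => `Constraints derived from risks: ${a.risks.join(", ")}`,
};

async function writeSections(
  analysis: AnalysisBundle,
  sections: string[] = Object.keys(writers),
): Promise<Record<string, string>> {
  // Writers are independent, so they can run concurrently and merge
  // deterministically by section name.
  const entries = await Promise.all(
    sections.map(async (name) => [name, await writers[name](analysis)] as const),
  );
  return Object.fromEntries(entries);
}
```

&lt;p&gt;Passing a subset such as &lt;code&gt;["constraints"]&lt;/code&gt; reruns only that writer, which is the same mechanism the edit path relies on.&lt;/p&gt;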

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F81u637zot51l0y8lgrik.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F81u637zot51l0y8lgrik.png" alt="Mermaid diagram for v1" width="800" height="1182"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the workflow became context-aware, the next challenge was making its cognition visible to users — bringing UX principles directly into orchestration.&lt;/p&gt;




&lt;h2&gt;
  
  
  Iteration 2 – UX-Driven Agentic Workflow
&lt;/h2&gt;

&lt;p&gt;With context reliability established, the project entered a new phase: aligning architectural choices with the UX principles that would make those inner workings transparent.&lt;/p&gt;

&lt;p&gt;At this point, the work shifted toward &lt;strong&gt;system legibility&lt;/strong&gt;. Wroblewski highlights recurring UX gaps in AI around &lt;strong&gt;context awareness, capability awareness, and readability&lt;/strong&gt;,&lt;sup&gt;&lt;a href="https://lukew.com/ff/entry.asp?2107" rel="noopener noreferrer"&gt;5&lt;/a&gt;&lt;/sup&gt; and Nielsen emphasizes transparency around wait time for “Slow AI.”&lt;sup&gt;&lt;a href="https://jakobnielsenphd.substack.com/p/slow-ai" rel="noopener noreferrer"&gt;7&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;These insights suggested that UX requirements should shape orchestration decisions, not just react to them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgh796xdwudqv1hroxvi4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgh796xdwudqv1hroxvi4.png" alt="Mermaid diagram for v2" width="800" height="1697"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What changed and why:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Streaming progress events&lt;/strong&gt; — Reduces “Slow AI” ambiguity by emitting real-time updates (see &lt;em&gt;Principle #1: Visible Capabilities + Streaming&lt;/em&gt;).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configurable workers&lt;/strong&gt; (per-writer runtime settings) — Allows specialization (e.g., different models/temperatures) while enforcing streaming capability for observability. Example: a concise, extractive analyzer vs. a creative section writer.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Usage and cost accounting&lt;/strong&gt; — Surfaces telemetry and burn-rate transparency (see &lt;em&gt;Principle #6: Cost Visibility&lt;/em&gt;).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Editing parity and fallbacks&lt;/strong&gt; — Heuristics prevent silent no-op edits, and top-level fields stay in sync with partial updates to avoid stale PRDs. Example: editing Constraints also updates the summary if needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key UX Principles and Screens
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Visible Capabilities + Streaming (addressing “Slow AI”)&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Starter affordances clarify what the agentic workflow can do. Streaming exposes long-running steps to reduce ambiguity.&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6sxl9fqtlz112vjoh9md.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6sxl9fqtlz112vjoh9md.gif" alt="Streaming agentic workflow steps UI" width="760" height="738"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Context Awareness and Control&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Users can inspect, pin, or exclude context items before generation.&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiqb0pplrkt1szym1wvca.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiqb0pplrkt1szym1wvca.gif" alt="Context inspector UI" width="223" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Structured Output Instead of “Walls of Text”&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Structured components allow partial edits and reduce cognitive load.&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmdftd52wbucbxpzz9yyg.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmdftd52wbucbxpzz9yyg.gif" alt="Structured PRD renderer" width="592" height="494"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Inspectability and Control (Configuration Drawer)&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Exposes subagent toggles (temperature, model choice, context filters) without forcing a detour into config files.&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F58wy8uvng0mn3tf8s59o.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F58wy8uvng0mn3tf8s59o.gif" alt="Configuration drawer UI" width="561" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Localized Updates (Section-level Editing)&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;When someone says, “Change user personas to be from LATAM,” the system routes only the Personas writer, preserving other sections.&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7polrlltctla8m2bmi1a.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7polrlltctla8m2bmi1a.gif" alt="Section-level editing UI" width="296" height="246"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Cost Visibility&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Surfaces estimated token usage and dollar cost per run. Engineers care about the burn.&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwgj0sz1y560vpexuvylz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwgj0sz1y560vpexuvylz.png" alt="Token cost meter UI" width="796" height="61"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each UI principle emerges directly from the system’s underlying architecture — together, they form a feedback loop between technical design and user experience.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architectural Foundations Behind the UX
&lt;/h3&gt;

&lt;p&gt;The UI work only “clicked” once the agent runtime supported it. Every visible affordance required a corresponding architectural move. Implementation details are available in the &lt;a href="https://github.com/eabait/product-agents" rel="noopener noreferrer"&gt;open-source repository&lt;/a&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;UX Principle&lt;/th&gt;
&lt;th&gt;Architecture Foundation&lt;/th&gt;
&lt;th&gt;How It Works&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Localized edits&lt;/td&gt;
&lt;td&gt;Section-level writers&lt;/td&gt;
&lt;td&gt;Enables partial regeneration of sections (demonstrated in &lt;em&gt;Principle #5: Localized Updates&lt;/em&gt;).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Explainability&lt;/td&gt;
&lt;td&gt;Orchestrator hooks for intermediate artifacts&lt;/td&gt;
&lt;td&gt;The orchestrator emits progress events and returns analyzer payloads before final assembly. These feed the status stream that the UI renders as visible steps, making the system’s reasoning legible.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Streaming transparency&lt;/td&gt;
&lt;td&gt;Event-based updates&lt;/td&gt;
&lt;td&gt;Progress callbacks stream over Server-Sent Events, letting the interface update the timeline and status indicators as each subagent completes — no more opaque spinners while the model “thinks.”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inspectable context&lt;/td&gt;
&lt;td&gt;Shared analysis bundle + context registry&lt;/td&gt;
&lt;td&gt;Powers the context inspector UI (see &lt;em&gt;Principle #2: Context Awareness and Control)&lt;/em&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Repeatability&lt;/td&gt;
&lt;td&gt;Audit logs and metadata&lt;/td&gt;
&lt;td&gt;Each run captures usage metrics, cost estimates, and section-level metadata. The audit trail can be replayed so users can trace what changed, which model handled it, and how many tokens it consumed.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Configurable workers&lt;/td&gt;
&lt;td&gt;Per-worker runtime settings&lt;/td&gt;
&lt;td&gt;Each analyzer and writer can run with its own model configuration, temperature, or parameters, as long as it supports streaming for progress visibility.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Editing parity &amp;amp; fallbacks&lt;/td&gt;
&lt;td&gt;Heuristic coverage + field sync&lt;/td&gt;
&lt;td&gt;Heuristics prevent silent no-op edits; top-level PRD fields stay consistent with edited sections, ensuring partial updates don’t produce stale data.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
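&lt;p&gt;The per-worker runtime settings row can be sketched as a small config resolver. The field names and defaults are assumptions; the one hard rule mirrors the table: streaming cannot be disabled, because progress visibility depends on it.&lt;/p&gt;

```typescript
// Sketch of per-worker runtime settings; model name and defaults are
// illustrative, not the project's actual configuration.

interface WorkerConfig {
  model: string;
  temperature: number;
  streaming: boolean;
}

const DEFAULTS: WorkerConfig = {
  model: "default-model",
  temperature: 0.7,
  streaming: true,
};

function resolveConfig(overrides: Partial<WorkerConfig>): WorkerConfig {
  const config = { ...DEFAULTS, ...overrides };
  if (!config.streaming) {
    // Progress visibility depends on streaming, so it is enforced.
    throw new Error("workers must support streaming for progress events");
  }
  return config;
}
```

&lt;p&gt;This is how an extractive analyzer can run cold (low temperature) while a creative section writer runs warmer, without either opting out of the progress stream.&lt;/p&gt;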

&lt;p&gt;Together, these shifts align the system’s internals with what the UI promises — when the interface says “only this section will change” or “here’s the context you’re about to send,” the architecture makes that statement true.&lt;/p&gt;




&lt;h3&gt;
  
  
  Design Principles for Agentic Workflows
&lt;/h3&gt;

&lt;p&gt;These exploratory principles emerged while iterating on an agentic workflow — intended to be useful for agents and AI-powered products more broadly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Expose System Cognition&lt;/strong&gt; — When the system runs/thinks, show its phases (streaming, intermediate artifacts).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Let Users Curate Context&lt;/strong&gt; — Treat context as a user-visible surface.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structure the Artifact&lt;/strong&gt; — Use sections and diffs, not monolithic text.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Localize Change&lt;/strong&gt; — Architect so edits update only what changed.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Make Capabilities Legible&lt;/strong&gt; — Provide affordances and visible configuration.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduce Waiting Ambiguity&lt;/strong&gt; — If the system must be slow, it should not be silent.&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Creation vs. Editing
&lt;/h2&gt;

&lt;p&gt;There’s no toggle between “create” and “edit” in the UI. Instead, the orchestrator inspects the request—and the presence (or absence) of an existing PRD—to decide whether it should synthesize an entire document or focus on specific sections. That inference is handled by the same subagents we’ve already seen: the clarification analyzer checks if the agent has enough information to write anything, and the section-detection analyzer decides which slices of the artifact need attention.&lt;/p&gt;

&lt;p&gt;Confidence signals from section detection are surfaced to help users decide when to intervene.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Detected Workflow&lt;/th&gt;
&lt;th&gt;System Behavior&lt;/th&gt;
&lt;th&gt;UX Goal&lt;/th&gt;
&lt;th&gt;Typical UX Affordances&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Full PRD generation&lt;/td&gt;
&lt;td&gt;Multi-step synthesis across every section&lt;/td&gt;
&lt;td&gt;Transparency&lt;/td&gt;
&lt;td&gt;Clarification loop (up to three passes), context preview, streaming timeline, cost meter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Targeted update&lt;/td&gt;
&lt;td&gt;Regenerate only the sections flagged by the analyzer&lt;/td&gt;
&lt;td&gt;Precision&lt;/td&gt;
&lt;td&gt;Section highlights, diff view, rollback controls, warnings when edits ripple into adjacent sections&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  How the Orchestrator Makes the Call
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Clarification acts as a gatekeeper:&lt;/strong&gt; When no prior PRD exists, the orchestrator will loop with the clarifier (up to three times) to gather personas, goals, and constraints before any section writers run. If the user supplies an existing PRD, the clarifier usually stands down because the grounding context is already available.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Section detection scopes the work:&lt;/strong&gt; The &lt;code&gt;section-detection-analyzer&lt;/code&gt; infers intent (“update the LATAM personas”) and hands the orchestrator a targeted section list. Only those section writers get invoked unless the analyzer indicates the request touches multiple areas.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared analysis keeps context in sync:&lt;/strong&gt; Both scenarios reuse cached analyzer outputs whenever possible. A targeted update will draw from the existing analysis bundle and current PRD text instead of regenerating everything from scratch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit logs reflect the path taken:&lt;/strong&gt; When the orchestrator opts for full generation, the audit trail captures every section output and the clarifier’s reasoning. For targeted updates it records before/after diffs, confidence scores, and the sections that actually changed—mirroring what the UI presents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Editing parity and fallbacks:&lt;/strong&gt; Heuristics prevent silent no-op edits and keep top-level PRD fields consistent during partial updates.&lt;/li&gt;
&lt;/ul&gt;
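&lt;p&gt;The decision logic above can be condensed into a sketch. The heuristics, thresholds, and type names here are simplified stand-ins for the clarifier and &lt;code&gt;section-detection-analyzer&lt;/code&gt;, not the actual implementation.&lt;/p&gt;

```typescript
// Sketch of the creation-vs-editing routing call; thresholds and the
// "all" fallback are illustrative assumptions.

interface RoutingInput {
  existingPrd?: string;       // present -> default to editing
  detectedSections: string[]; // from the section-detection analyzer
  confidence: number;         // 0..1, surfaced to the UI
}

type Route =
  | { mode: "create" }
  | { mode: "edit"; sections: string[]; needsReview: boolean };

function route(input: RoutingInput): Route {
  // No prior PRD means full generation (after the clarification loop).
  if (!input.existingPrd) return { mode: "create" };
  // An empty section list would be a silent no-op edit; fall back to
  // treating the request as touching the whole document.
  const sections =
    input.detectedSections.length > 0 ? input.detectedSections : ["all"];
  // Low confidence is flagged so the user can intervene.
  return { mode: "edit", sections, needsReview: input.confidence < 0.5 };
}
```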

&lt;p&gt;So while users don’t flip between modes, the system has a working theory about which workflow they expect. Making that inference explicit—and surfacing it through the UX affordances—has reduced surprises when moving between drafting and maintenance tasks.&lt;/p&gt;




&lt;h2&gt;
  
  
  Working Hypotheses
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Context is a user-facing product surface.&lt;/strong&gt; Expose it.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming is not cosmetic.&lt;/strong&gt; It is trust-preserving UX for “thinking systems.”
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent-driven interactive UI and structured outputs outperform walls of text.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Creation and editing require different mental models.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UX and agent orchestration must co-evolve.&lt;/strong&gt; One cannot be downstream of the other.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This exploration began with a practical frustration described in the introduction: using general-purpose agents like ChatGPT to create and maintain complex documents (like PRDs) required repeating prompts, managing context by hand, and working through long, opaque generation cycles. The core friction wasn’t just in the model, but in the &lt;em&gt;UX around the workflow&lt;/em&gt; — hidden state, unclear progress, and outputs that were difficult to iterate on.&lt;/p&gt;

&lt;p&gt;Building a domain-specific PRD agent became a way to investigate whether orchestration patterns, context design, and UX choices could reduce that friction. The current version now includes structured outputs, context controls, streaming transparency, and targeted editing — enough functionality that, for this specific use case, it feels like a more effective alternative to a general-purpose chat interface.&lt;/p&gt;

&lt;p&gt;The project is still evolving, but the journey so far suggests that &lt;strong&gt;UX and architecture designed together—from the first iteration—can meaningfully improve how people collaborate with AI on complex, evolving artifacts&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The next steps will focus on validating these ideas with real users, refining orchestration stability, and exploring new mechanisms for consistency and context evolution. While this implementation centers on PRDs, the underlying principles—legibility, localized change, and user-visible cognition—apply broadly to agentic systems and AI-powered products that coordinate multi-step work.&lt;/p&gt;

&lt;p&gt;If you find this useful or want to explore the code, star or contribute to the open-source project at &lt;a href="https://github.com/eabait/product-agents" rel="noopener noreferrer"&gt;https://github.com/eabait/product-agents&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Anthropic, &lt;em&gt;“Building Effective Agents.”&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Anthropic, &lt;em&gt;“Effective Context Engineering for AI Agents.”&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Dan Breunig, &lt;em&gt;“How Contexts Fail (and How to Fix Them).”&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Dan Breunig, &lt;em&gt;“How to Fix Your Context.”&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Luke Wroblewski, &lt;em&gt;“Common AI Product Issues.”&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Luke Wroblewski, &lt;em&gt;“Context Management UI in AI Products.”&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Jakob Nielsen, &lt;em&gt;“Slow AI.”&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;




</description>
      <category>ai</category>
      <category>agents</category>
      <category>ux</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
