<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Alex Metelli</title>
    <description>The latest articles on Forem by Alex Metelli (@alex_metelli_f22d28dae8de).</description>
    <link>https://forem.com/alex_metelli_f22d28dae8de</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3862991%2F3d35f220-4ca0-4ace-a4ec-d55b93230843.png</url>
      <title>Forem: Alex Metelli</title>
      <link>https://forem.com/alex_metelli_f22d28dae8de</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/alex_metelli_f22d28dae8de"/>
    <language>en</language>
    <item>
      <title>From AI Demo to Production: How to Ship Quality Agentic Applications</title>
      <dc:creator>Alex Metelli</dc:creator>
      <pubDate>Sat, 02 May 2026 09:35:57 +0000</pubDate>
      <link>https://forem.com/alex_metelli_f22d28dae8de/from-ai-demo-to-production-how-to-ship-quality-agentic-applications-403f</link>
      <guid>https://forem.com/alex_metelli_f22d28dae8de/from-ai-demo-to-production-how-to-ship-quality-agentic-applications-403f</guid>
      <description>&lt;p&gt;AI developers are getting very good at building demos.&lt;/p&gt;

&lt;p&gt;A prompt, a model call, maybe a tool call or two, and suddenly you have something that looks impressive in a workshop, a hackathon, or an internal prototype.&lt;/p&gt;

&lt;p&gt;But production is where the illusion breaks.&lt;/p&gt;

&lt;p&gt;The hard part is no longer proving that an LLM can answer a question. The hard part is knowing whether your AI system behaves correctly across messy real-world inputs, edge cases, model changes, tool failures, latency constraints, cost pressure, and business-specific policies.&lt;/p&gt;

&lt;p&gt;That was the core theme of a recent Braintrust and Trainline workshop in London: &lt;strong&gt;shipping AI applications requires operational rigor, not just better prompts.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The prototype trap
&lt;/h2&gt;

&lt;p&gt;A lot of AI applications start the same way:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-5-mini&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;system&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;You are a helpful support agent.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ticketText&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a proof of concept, this is fine.&lt;/p&gt;

&lt;p&gt;You pass in a ticket, get a structured response, and the output looks plausible. Maybe it categorizes the issue, assigns a severity, and drafts a customer reply.&lt;/p&gt;

&lt;p&gt;The problem is that plausibility is not correctness.&lt;/p&gt;

&lt;p&gt;A single prompt can work beautifully on three demo cases and fail badly once exposed to production inputs. This is especially true in business workflows where the model needs to understand implicit priority, policies, customer tiers, refunds, SLAs, billing impact, or escalation rules.&lt;/p&gt;

&lt;p&gt;Traditional software has deterministic failure modes. AI systems do not.&lt;/p&gt;

&lt;p&gt;In normal software, &lt;code&gt;1 + 1 = 2&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In LLM systems, the equivalent is closer to: “usually 2, unless the context, prompt, model, tool result, temperature, or previous step nudges it somewhere else.”&lt;/p&gt;

&lt;p&gt;That does not mean AI systems are unusable. It means they need a different quality model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production AI needs both software engineering and ML discipline
&lt;/h2&gt;

&lt;p&gt;Trainline described this well.&lt;/p&gt;

&lt;p&gt;They operate both traditional ML systems and agentic AI systems at large scale. For example, they use machine learning to predict train disruptions, but they also run a travel assistant that can help users with refunds, alternative trains, and support workflows.&lt;/p&gt;

&lt;p&gt;Those two worlds have different quality practices.&lt;/p&gt;

&lt;p&gt;Classic software engineering gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;type checks&lt;/li&gt;
&lt;li&gt;unit tests&lt;/li&gt;
&lt;li&gt;integration tests&lt;/li&gt;
&lt;li&gt;CI/CD&lt;/li&gt;
&lt;li&gt;structured logging&lt;/li&gt;
&lt;li&gt;service observability&lt;/li&gt;
&lt;li&gt;release discipline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Machine learning gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;datasets&lt;/li&gt;
&lt;li&gt;offline evaluation&lt;/li&gt;
&lt;li&gt;online evaluation&lt;/li&gt;
&lt;li&gt;model comparison&lt;/li&gt;
&lt;li&gt;data quality checks&lt;/li&gt;
&lt;li&gt;drift monitoring&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Agentic AI sits in the middle.&lt;/p&gt;

&lt;p&gt;Part of the system is deterministic: API calls, database lookups, validation, routing, tool execution.&lt;/p&gt;

&lt;p&gt;Part of the system is nondeterministic: reasoning, classification, language generation, judgment, summarization.&lt;/p&gt;

&lt;p&gt;So the correct quality model is not “just write tests” and not “just vibe-check the outputs.”&lt;/p&gt;

&lt;p&gt;It is both.&lt;/p&gt;

&lt;h2&gt;
  
  
  Break the monolithic prompt into stages
&lt;/h2&gt;

&lt;p&gt;One of the most useful engineering patterns from the workshop was taking a single prompt-based support agent and breaking it into a staged workflow.&lt;/p&gt;

&lt;p&gt;Instead of one large model call that does everything, the system was split into clearer responsibilities:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Context collection&lt;/strong&gt;&lt;br&gt;
Gather deterministic context: account information, previous tickets, relevant help articles, customer tier, billing state, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Triage&lt;/strong&gt;&lt;br&gt;
Classify the issue, infer severity, identify the affected domain, and decide whether more information is needed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Policy review&lt;/strong&gt;&lt;br&gt;
Check whether the proposed action follows company policy, SLAs, refund rules, escalation rules, or compliance constraints.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reply writing&lt;/strong&gt;&lt;br&gt;
Draft a customer-facing response in the correct tone.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Final packaging&lt;/strong&gt;&lt;br&gt;
Emit structured output for downstream systems, including escalation flags, internal notes, and customer reply.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is not just cleaner architecture. It improves debuggability.&lt;/p&gt;

&lt;p&gt;When a single prompt fails, you often do not know why. Was the categorization wrong? Did it miss account context? Did it misunderstand policy? Did the final reply hallucinate?&lt;/p&gt;

&lt;p&gt;When the workflow is staged, each part can be traced, evaluated, and improved independently.&lt;/p&gt;

&lt;p&gt;This is the same instinct software engineers already use when decomposing a monolith. The AI version is: do not put all reasoning, policy, tool usage, and writing into one giant prompt unless the problem is genuinely trivial.&lt;/p&gt;
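
&lt;p&gt;As a sketch, the staged shape is just typed functions composed in order. Everything below is illustrative (the stage logic, ticket fields, and tier rule are invented for the example); in a real system the triage, policy, and reply stages would be separate model calls while context collection stays deterministic:&lt;/p&gt;

```typescript
// Hypothetical staged support workflow: each stage has one job and a typed
// input/output, so a failure can be localized to a single step.
type Ticket = { id: string; text: string; customerId: string };
type Context = { ticket: Ticket; tier: "free" | "pro" | "enterprise"; history: string[] };
type Triage = { category: string; severity: "low" | "medium" | "high"; needsInfo: boolean };
type PolicyResult = { allowed: boolean; violations: string[] };

// Stand-in implementations: in production, triage/policy/reply are LLM calls.
function collectContext(ticket: Ticket): Context {
  return { ticket, tier: "pro", history: [] };
}

function triage(ctx: Context): Triage {
  // Crude keyword heuristic standing in for a model call.
  const urgent = /cfo|board|outage|invoice/i.test(ctx.ticket.text);
  return { category: "billing", severity: urgent ? "high" : "low", needsInfo: false };
}

function reviewPolicy(ctx: Context, t: Triage): PolicyResult {
  // Example rule: free-tier billing issues go to a human instead.
  const allowed = ctx.tier !== "free" || t.category !== "billing";
  return { allowed, violations: allowed ? [] : ["free-tier-billing"] };
}

function writeReply(ctx: Context, t: Triage): string {
  return `We have classified your issue as ${t.category} (${t.severity}) and are on it.`;
}

function runWorkflow(ticket: Ticket) {
  const ctx = collectContext(ticket);
  const t = triage(ctx);
  const policy = reviewPolicy(ctx, t);
  const reply = policy.allowed ? writeReply(ctx, t) : "Escalated for human review.";
  return { triage: t, policy, reply, escalate: !policy.allowed || t.severity === "high" };
}
```

&lt;p&gt;The payoff is that each function can be traced and evaluated on its own, which is exactly what the monolithic prompt cannot offer.&lt;/p&gt;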

&lt;h2&gt;
  
  
  Tool calls improve capability, but increase failure surface
&lt;/h2&gt;

&lt;p&gt;Adding tools makes an AI application more useful.&lt;/p&gt;

&lt;p&gt;A support agent can call tools to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;retrieve help articles&lt;/li&gt;
&lt;li&gt;inspect account metadata&lt;/li&gt;
&lt;li&gt;check previous incidents&lt;/li&gt;
&lt;li&gt;look up billing state&lt;/li&gt;
&lt;li&gt;fetch policy rules&lt;/li&gt;
&lt;li&gt;create an escalation&lt;/li&gt;
&lt;li&gt;draft or update a ticket&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But every tool call adds another possible failure mode.&lt;/p&gt;

&lt;p&gt;The tool may return stale data. The model may choose the wrong tool. The tool result may be incomplete. The model may ignore the result. The tool may succeed but the final answer may still misinterpret it.&lt;/p&gt;

&lt;p&gt;This is why AI observability becomes mandatory.&lt;/p&gt;

&lt;p&gt;If you cannot see which tools were called, with what inputs, what outputs they returned, and how the model used those outputs, you are effectively debugging production behavior blind.&lt;/p&gt;

&lt;p&gt;Logs are not enough.&lt;/p&gt;

&lt;p&gt;Logs tell you what happened at a shallow level. Tracing tells you how the system behaved internally.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trace the full execution path
&lt;/h2&gt;

&lt;p&gt;For production AI systems, tracing should capture the entire workflow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;parent request&lt;/li&gt;
&lt;li&gt;child spans for each stage&lt;/li&gt;
&lt;li&gt;model inputs&lt;/li&gt;
&lt;li&gt;model outputs&lt;/li&gt;
&lt;li&gt;tool calls&lt;/li&gt;
&lt;li&gt;tool results&lt;/li&gt;
&lt;li&gt;token usage&lt;/li&gt;
&lt;li&gt;latency&lt;/li&gt;
&lt;li&gt;cost&lt;/li&gt;
&lt;li&gt;metadata&lt;/li&gt;
&lt;li&gt;final structured output&lt;/li&gt;
&lt;li&gt;scores or evaluation results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The important detail is nesting.&lt;/p&gt;

&lt;p&gt;A lot of teams instrument only the top-level model call. That is not enough for agentic systems.&lt;/p&gt;

&lt;p&gt;You want a trace that looks more like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;support-ticket-run
  ├── collect-context
  │   ├── fetch-account
  │   ├── search-help-articles
  │   └── fetch-ticket-history
  ├── triage-specialist
  │   └── llm-call
  ├── policy-reviewer
  │   └── llm-call
  ├── reply-writer
  │   └── llm-call
  └── finalize-result
      └── maybe-escalate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This lets you answer the questions that matter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which step failed?&lt;/li&gt;
&lt;li&gt;Did the model receive the right context?&lt;/li&gt;
&lt;li&gt;Did the tool return the expected data?&lt;/li&gt;
&lt;li&gt;Did latency come from retrieval, reasoning, or final generation?&lt;/li&gt;
&lt;li&gt;Did the model output violate policy?&lt;/li&gt;
&lt;li&gt;Did a cheaper model behave differently?&lt;/li&gt;
&lt;li&gt;Did a prompt change improve one case but regress another?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without tracing, you are guessing.&lt;/p&gt;

&lt;p&gt;And guessing is not an engineering process.&lt;/p&gt;
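
&lt;p&gt;The nested shape does not require anything exotic. Here is a deliberately minimal homemade tracer, not any vendor's SDK, just to show how parent and child spans compose into the tree above:&lt;/p&gt;

```typescript
// Minimal nested-span tracer (illustrative only): each span records its
// name, duration, metadata, and children, producing a tree per request.
type Span = {
  name: string;
  startMs: number;
  durationMs?: number;
  metadata: Record<string, unknown>;
  children: Span[];
};

function startSpan(name: string, parent?: Span): Span {
  const span: Span = { name, startMs: Date.now(), metadata: {}, children: [] };
  parent?.children.push(span); // nesting is just parent-child linkage
  return span;
}

function endSpan(span: Span, metadata: Record<string, unknown> = {}): void {
  span.durationMs = Date.now() - span.startMs;
  Object.assign(span.metadata, metadata);
}

// Usage: the parent request owns one child span per stage, and LLM calls
// hang off the stage that made them (model name and numbers are made up).
const root = startSpan("support-ticket-run");
const ctxSpan = startSpan("collect-context", root);
endSpan(ctxSpan, { articlesFound: 3 });
const triageSpan = startSpan("triage-specialist", root);
const llm = startSpan("llm-call", triageSpan);
endSpan(llm, { model: "gpt-5-mini", tokens: 412, costUsd: 0.0004 });
endSpan(triageSpan);
endSpan(root);
```

&lt;p&gt;A real observability tool adds storage, sampling, and UI on top, but the core data model is this tree.&lt;/p&gt;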

&lt;h2&gt;
  
  
  Build a golden dataset before you trust the system
&lt;/h2&gt;

&lt;p&gt;A production AI application needs evaluation data.&lt;/p&gt;

&lt;p&gt;At the beginning, this can be small. It does not need to be perfect. But it needs to exist.&lt;/p&gt;

&lt;p&gt;For the workshop support agent, the golden dataset contained representative support tickets with expected properties, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;correct category&lt;/li&gt;
&lt;li&gt;expected severity&lt;/li&gt;
&lt;li&gt;whether escalation is required&lt;/li&gt;
&lt;li&gt;whether the output schema is valid&lt;/li&gt;
&lt;li&gt;whether the reply follows policy&lt;/li&gt;
&lt;li&gt;whether the customer-facing response is appropriate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This dataset becomes the baseline for safe iteration.&lt;/p&gt;

&lt;p&gt;Every time you change the prompt, model, tool behavior, or workflow, you can rerun the evaluation and ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Did this change improve the system, or did it just look better on one example?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That distinction matters.&lt;/p&gt;

&lt;p&gt;Prompt editing without evaluations is just production gambling.&lt;/p&gt;
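
&lt;p&gt;A golden dataset can start as a plain list of cases with expected properties, plus a runner that reports a pass rate. The field names and cases below are illustrative, not a specific eval framework:&lt;/p&gt;

```typescript
// A golden case pairs an input ticket with the expected properties
// from the list above (category, severity, escalation decision).
type GoldenCase = {
  input: string;
  expected: { category: string; severity: "low" | "medium" | "high"; shouldEscalate: boolean };
};

const goldenDataset: GoldenCase[] = [
  {
    input: "This isn't urgent, but our CFO can't export the invoices before the board meeting.",
    expected: { category: "billing", severity: "high", shouldEscalate: true },
  },
  {
    input: "How do I change my avatar?",
    expected: { category: "account", severity: "low", shouldEscalate: false },
  },
];

type AgentOutput = { category: string; severity: string; escalate: boolean };

// Rerun after every prompt/model/workflow change and compare pass rates
// against the previous run.
function evaluate(run: (input: string) => AgentOutput) {
  let passed = 0;
  for (const c of goldenDataset) {
    const out = run(c.input);
    const ok =
      out.category === c.expected.category &&
      out.severity === c.expected.severity &&
      out.escalate === c.expected.shouldEscalate;
    if (ok) passed++;
  }
  return { passed, total: goldenDataset.length, passRate: passed / goldenDataset.length };
}
```

&lt;p&gt;Two cases is obviously not enough, but the structure is the point: every change gets measured against the same baseline.&lt;/p&gt;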

&lt;h2&gt;
  
  
  Use deterministic scores where possible
&lt;/h2&gt;

&lt;p&gt;Not every evaluation needs an LLM judge.&lt;/p&gt;

&lt;p&gt;Some checks should be deterministic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;hasValidSchema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;unknown&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;SupportTicketOutputSchema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;safeParse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;success&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Useful deterministic checks include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;schema validity&lt;/li&gt;
&lt;li&gt;required fields present&lt;/li&gt;
&lt;li&gt;escalation reason exists when escalation is required&lt;/li&gt;
&lt;li&gt;severity is within allowed enum values&lt;/li&gt;
&lt;li&gt;category matches expected label&lt;/li&gt;
&lt;li&gt;reply is not empty&lt;/li&gt;
&lt;li&gt;internal-only data is not present in customer reply&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These checks are cheap, fast, and stable, and they should run often.&lt;/p&gt;

&lt;p&gt;They are the AI equivalent of unit tests and type checks.&lt;/p&gt;
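
&lt;p&gt;Several of the checks above need nothing more than plain predicates. The output shape here is illustrative (your schema will differ), but each function maps directly to one item in the list:&lt;/p&gt;

```typescript
// Deterministic checks as plain predicates: no LLM involved.
type TicketOutput = {
  category?: string;
  severity?: string;
  escalate?: boolean;
  escalationReason?: string;
  customerReply?: string;
  internalNotes?: string;
};

const ALLOWED_SEVERITIES = ["low", "medium", "high"];

// "severity is within allowed enum values"
function severityIsValid(o: TicketOutput): boolean {
  return o.severity !== undefined && ALLOWED_SEVERITIES.includes(o.severity);
}

// "escalation reason exists when escalation is required"
function escalationReasonPresent(o: TicketOutput): boolean {
  return !o.escalate || (o.escalationReason ?? "").trim().length > 0;
}

// "reply is not empty"
function replyNotEmpty(o: TicketOutput): boolean {
  return (o.customerReply ?? "").trim().length > 0;
}

// "internal-only data is not present in customer reply" (verbatim check only;
// a paraphrased leak would need a stronger detector)
function noInternalLeaks(o: TicketOutput): boolean {
  if (!o.internalNotes || !o.customerReply) return true;
  return !o.customerReply.includes(o.internalNotes);
}
```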

&lt;h2&gt;
  
  
  Use LLM-as-judge for subjective quality
&lt;/h2&gt;

&lt;p&gt;Some things cannot be reliably checked with simple assertions.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is this reply helpful?&lt;/li&gt;
&lt;li&gt;Is the tone appropriate?&lt;/li&gt;
&lt;li&gt;Does it follow the refund policy?&lt;/li&gt;
&lt;li&gt;Does it avoid overpromising?&lt;/li&gt;
&lt;li&gt;Does it correctly reason about the customer’s situation?&lt;/li&gt;
&lt;li&gt;Is the escalation decision justified?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For those, LLM-as-judge can be useful.&lt;/p&gt;

&lt;p&gt;The key is to use it deliberately. Write clear rubrics. Score specific dimensions. Avoid vague prompts like “is this good?”&lt;/p&gt;

&lt;p&gt;A better judge prompt asks for concrete criteria:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Evaluate whether the support response:
1. Correctly identifies the user’s issue.
2. Does not promise actions outside company policy.
3. Gives a clear next step.
4. Uses an appropriate customer-facing tone.
5. Escalates when business impact is high.

Return a score from 0 to 1 and a short rationale.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LLM judges are not magic. They are another probabilistic component. But when paired with deterministic checks and real production traces, they give you a scalable way to evaluate nuance.&lt;/p&gt;
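
&lt;p&gt;One practical consequence: whatever model runs the judge, its output still needs defensive parsing before the score can be stored against a trace. A sketch, assuming the judge follows the &amp;ldquo;score from 0 to 1 and a short rationale&amp;rdquo; format requested above:&lt;/p&gt;

```typescript
// Parse a judge response of the form "Score: 0.8. <rationale>" into a
// structured verdict. The judge is probabilistic, so reject anything that
// does not contain a usable in-range score instead of guessing.
type JudgeVerdict = { score: number; rationale: string };

function parseJudgeResponse(text: string): JudgeVerdict | null {
  // Accept "Score: 0.8", "score = 0.8", or a bare leading number.
  const match = text.match(/(?:score\s*[:=]?\s*)?(\d(?:\.\d+)?)/i);
  if (!match) return null;
  const score = Number(match[1]);
  if (Number.isNaN(score) || score < 0 || score > 1) return null;
  // Treat everything after the score as the rationale.
  const rationale = text
    .slice(match.index! + match[0].length)
    .replace(/^[\s,.:;-]+/, "")
    .trim();
  return { score, rationale };
}
```

&lt;p&gt;Returning &lt;code&gt;null&lt;/code&gt; on malformed output matters: a judge that silently defaults to some score would quietly corrupt your evaluation data.&lt;/p&gt;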

&lt;h2&gt;
  
  
  Offline evaluation is not enough
&lt;/h2&gt;

&lt;p&gt;A golden dataset gives you confidence before deployment.&lt;/p&gt;

&lt;p&gt;But production data is where the real failures appear.&lt;/p&gt;

&lt;p&gt;Users will phrase things oddly. They will omit important details. They will create conflicting signals. They will say something is “not urgent” while describing a CFO blocked before a board meeting.&lt;/p&gt;

&lt;p&gt;That example came up in the workshop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"This isn't urgent, but our CFO can't export the invoices before the board meeting."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A weak triage agent may classify this as low severity because the user said “not urgent.”&lt;/p&gt;

&lt;p&gt;A better system understands the business context:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CFO&lt;/li&gt;
&lt;li&gt;invoices&lt;/li&gt;
&lt;li&gt;board meeting&lt;/li&gt;
&lt;li&gt;likely time-sensitive&lt;/li&gt;
&lt;li&gt;high business impact&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is exactly the kind of failure that does not always appear in initial test data.&lt;/p&gt;

&lt;p&gt;So production traces should feed back into the evaluation loop.&lt;/p&gt;

&lt;p&gt;When you find a failure:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Capture the trace.&lt;/li&gt;
&lt;li&gt;Add the case to the dataset.&lt;/li&gt;
&lt;li&gt;Write or update the scoring rule.&lt;/li&gt;
&lt;li&gt;Fix the prompt, tool, policy, or workflow.&lt;/li&gt;
&lt;li&gt;Rerun evaluations.&lt;/li&gt;
&lt;li&gt;Compare against previous runs.&lt;/li&gt;
&lt;li&gt;Deploy only if the fix does not introduce regressions.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That loop is the real production AI workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model switching needs evaluation discipline
&lt;/h2&gt;

&lt;p&gt;Trainline also described a very practical problem: model cost.&lt;/p&gt;

&lt;p&gt;At scale, LLM bills can become painful fast. Teams naturally want to switch models, use cheaper models, reduce token usage, or route simpler requests to smaller models.&lt;/p&gt;

&lt;p&gt;But model switching without evaluation is dangerous.&lt;/p&gt;

&lt;p&gt;A cheaper model may perform similarly on simple tickets and fail on high-impact edge cases. A newer model may improve reasoning but change tone. A faster model may reduce latency but increase escalation mistakes.&lt;/p&gt;

&lt;p&gt;The right question is not:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can we switch to a cheaper model?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The right question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can we switch to a cheaper model without degrading the scores that matter?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That requires offline evaluation, online monitoring, and trace-level comparison.&lt;/p&gt;
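
&lt;p&gt;&amp;ldquo;Without degrading the scores that matter&amp;rdquo; can be made concrete as a gate in the eval pipeline: compare per-score averages between the current and candidate model, and block the switch if any tracked score regresses beyond a tolerance. Score names and the tolerance below are illustrative:&lt;/p&gt;

```typescript
// Map of score name -> average score (0..1) from an evaluation run.
type ScoreSummary = Record<string, number>;

// Returns whether the candidate model may replace the current one, and
// which scores regressed if not. A score missing from the candidate run
// counts as a regression rather than a pass.
function canSwitchModels(
  current: ScoreSummary,
  candidate: ScoreSummary,
  tolerance = 0.02,
): { ok: boolean; regressions: string[] } {
  const regressions: string[] = [];
  for (const [name, baseline] of Object.entries(current)) {
    const next = candidate[name];
    if (next === undefined || baseline - next > tolerance) regressions.push(name);
  }
  return { ok: regressions.length === 0, regressions };
}
```

&lt;p&gt;The gate only answers the offline half of the question; online monitoring still has to confirm the cheaper model holds up on production traffic.&lt;/p&gt;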

&lt;h2&gt;
  
  
  Managed prompts help cross-functional teams
&lt;/h2&gt;

&lt;p&gt;Another practical point: prompts often become collaboration bottlenecks.&lt;/p&gt;

&lt;p&gt;Engineers own the codebase, but product managers, support leads, legal reviewers, and domain experts often understand the desired behavior better than the engineering team.&lt;/p&gt;

&lt;p&gt;If every prompt change requires a code change, review, deploy, and engineering handoff, iteration slows down.&lt;/p&gt;

&lt;p&gt;Managed prompts and parameters solve part of this.&lt;/p&gt;

&lt;p&gt;They allow teams to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;version prompts&lt;/li&gt;
&lt;li&gt;track who changed what&lt;/li&gt;
&lt;li&gt;compare prompt versions&lt;/li&gt;
&lt;li&gt;update model parameters&lt;/li&gt;
&lt;li&gt;collaborate with non-engineers&lt;/li&gt;
&lt;li&gt;test changes before rollout&lt;/li&gt;
&lt;li&gt;keep production behavior reproducible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This does not mean abandoning Git or software discipline.&lt;/p&gt;

&lt;p&gt;The better pattern is to keep prompts, tools, and configuration synchronized with code where needed, while still giving teams a managed operational layer for experimentation and review.&lt;/p&gt;

&lt;p&gt;For regulated industries, this matters even more. You need to know what changed, when it changed, who changed it, and what effect it had.&lt;/p&gt;

&lt;h2&gt;
  
  
  The production AI flywheel
&lt;/h2&gt;

&lt;p&gt;The workshop’s core operating model can be summarized as a flywheel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Build → Trace → Evaluate → Find failures → Remediate → Deploy → Monitor → Repeat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;More concretely:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with a working agent&lt;/strong&gt;&lt;br&gt;
Even if it is simple.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Break it into explicit stages&lt;/strong&gt;&lt;br&gt;
Separate context gathering, reasoning, policy checks, reply generation, and final output.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Add tracing&lt;/strong&gt;&lt;br&gt;
Capture model calls, tool calls, latency, token usage, cost, inputs, outputs, and metadata.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Create a golden dataset&lt;/strong&gt;&lt;br&gt;
Start with representative examples and known edge cases.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Add deterministic scores&lt;/strong&gt;&lt;br&gt;
Validate structure, required fields, categories, escalation rules, and other objective behavior.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Add LLM judges where needed&lt;/strong&gt;&lt;br&gt;
Evaluate tone, helpfulness, policy compliance, and reasoning quality.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Run offline evaluations&lt;/strong&gt;&lt;br&gt;
Before shipping prompt, model, or workflow changes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Score production traces&lt;/strong&gt;&lt;br&gt;
Use online evaluation and sampling to detect real-world failures.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Turn failures into tests&lt;/strong&gt;&lt;br&gt;
Every production failure should improve your dataset.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compare experiments over time&lt;/strong&gt;&lt;br&gt;
Do not trust a change unless you can see its effect.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The main lesson for AI developers
&lt;/h2&gt;

&lt;p&gt;The future of AI engineering is not just better models.&lt;/p&gt;

&lt;p&gt;It is better systems around models.&lt;/p&gt;

&lt;p&gt;A production-grade AI application needs the same seriousness we already expect from software systems: observability, testing, versioning, review, deployment discipline, and feedback loops.&lt;/p&gt;

&lt;p&gt;But it also needs ML-style evaluation: datasets, scoring, model comparison, judge rubrics, and continuous monitoring.&lt;/p&gt;

&lt;p&gt;The teams that win will not be the ones with the fanciest demo.&lt;/p&gt;

&lt;p&gt;They will be the ones that can answer, with evidence:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What changed?&lt;/li&gt;
&lt;li&gt;Did quality improve?&lt;/li&gt;
&lt;li&gt;Did cost increase?&lt;/li&gt;
&lt;li&gt;Did latency regress?&lt;/li&gt;
&lt;li&gt;Which failure modes remain?&lt;/li&gt;
&lt;li&gt;Which users are affected?&lt;/li&gt;
&lt;li&gt;Can we reproduce the issue?&lt;/li&gt;
&lt;li&gt;Can we safely ship the fix?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the difference between an AI prototype and an AI product.&lt;/p&gt;

&lt;p&gt;And for agentic systems, that difference is everything.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>testing</category>
    </item>
    <item>
      <title>Building Better Software with AI Agents: Why Fundamentals Still Matter</title>
      <dc:creator>Alex Metelli</dc:creator>
      <pubDate>Mon, 27 Apr 2026 20:39:12 +0000</pubDate>
      <link>https://forem.com/alex_metelli_f22d28dae8de/building-better-software-with-ai-agents-why-fundamentals-still-matter-22fd</link>
      <guid>https://forem.com/alex_metelli_f22d28dae8de/building-better-software-with-ai-agents-why-fundamentals-still-matter-22fd</guid>
      <description>&lt;p&gt;AI coding tools are changing how software gets built, but they do not remove the need for software engineering discipline. In practice, they make fundamentals more important.&lt;/p&gt;

&lt;p&gt;This post is a condensed write-up of a workshop by Matt Pocock on building better software with AI agents. The original workshop is available here: &lt;a href="https://youtu.be/-QFHIoCo-Ko?si=9qyQKxnid9sE_ehc" rel="noopener noreferrer"&gt;https://youtu.be/-QFHIoCo-Ko?si=9qyQKxnid9sE_ehc&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The core mistake many developers make is treating AI as a “spec-to-code compiler”: write a vague requirement, hand it to an agent, and expect production-ready software to appear. That works for demos. It breaks down in real codebases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A better model is this&lt;/strong&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Use AI agents to accelerate implementation, but use software engineering fundamentals to control alignment, architecture, feedback loops, and quality.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What follows distills that workshop into a concrete developer workflow: agentic planning, PRDs, vertical slicing, TDD, and codebase design.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Two Constraints of LLM Coding Agents
&lt;/h2&gt;

&lt;p&gt;Before designing an AI-assisted workflow, you need to understand two constraints.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Agents have a “smart zone” and a “dumb zone”
&lt;/h3&gt;

&lt;p&gt;An LLM performs best when the context is clean, focused, and not overloaded.&lt;/p&gt;

&lt;p&gt;As the conversation grows, the agent has to reason across more tokens, more decisions, more previous mistakes, and more irrelevant detail. Eventually, it starts making worse decisions.&lt;/p&gt;

&lt;p&gt;This is why giant context windows are not a free lunch. They are useful for retrieval, but not always for coding. A 1M-token window does not mean the agent stays sharp for 1M tokens.&lt;/p&gt;

&lt;p&gt;For coding, the &lt;strong&gt;practical strategy&lt;/strong&gt; is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep tasks small.&lt;/li&gt;
&lt;li&gt;Keep context clean.&lt;/li&gt;
&lt;li&gt;Avoid long, drifting conversations.&lt;/li&gt;
&lt;li&gt;Prefer fresh sessions for focused work.&lt;/li&gt;
&lt;li&gt;Do not let the agent accumulate too much conversational sediment.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Agents forget unless you externalize state
&lt;/h3&gt;

&lt;p&gt;LLMs are stateless between sessions unless you give them state explicitly.&lt;/p&gt;

&lt;p&gt;That means important decisions should not live only in the chat history. You need artifacts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Product requirements documents.&lt;/li&gt;
&lt;li&gt;Local issue files.&lt;/li&gt;
&lt;li&gt;GitHub issues.&lt;/li&gt;
&lt;li&gt;Architecture notes.&lt;/li&gt;
&lt;li&gt;Test cases.&lt;/li&gt;
&lt;li&gt;Commit history.&lt;/li&gt;
&lt;li&gt;Review summaries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trick is not to preserve everything. It is to preserve the useful shape of the work.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Start with Alignment, Not Implementation
&lt;/h2&gt;

&lt;p&gt;When a vague feature request arrives, the wrong move is to immediately ask the agent to code.&lt;/p&gt;

&lt;p&gt;Example request:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Retention is bad. Students sign up, do a few lessons, then drop off. Let’s add gamification.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That sounds simple. It is not.&lt;/p&gt;

&lt;p&gt;Before coding, you need to clarify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What actions earn points?&lt;/li&gt;
&lt;li&gt;Are points retroactive?&lt;/li&gt;
&lt;li&gt;Do streaks earn points?&lt;/li&gt;
&lt;li&gt;Where does the UI live?&lt;/li&gt;
&lt;li&gt;What counts as lesson completion?&lt;/li&gt;
&lt;li&gt;What is the progression curve?&lt;/li&gt;
&lt;li&gt;What is out of scope?&lt;/li&gt;
&lt;li&gt;What data model supports this?&lt;/li&gt;
&lt;li&gt;How will this be tested?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The useful pattern here is a &lt;strong&gt;grilling session&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of saying:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Create a plan.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Ask the agent to interrogate the requirement:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Interview me relentlessly about every aspect of this feature until we reach a shared understanding. Ask one question at a time. For each question, give your recommended answer.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This changes the interaction.&lt;/p&gt;

&lt;p&gt;The goal is not to produce a plan immediately. The goal is to reach a shared design concept between the human and the agent.&lt;/p&gt;

&lt;p&gt;That matters because most AI coding failures are not syntax failures. They are alignment failures.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Turn the Conversation into a Destination Document
&lt;/h2&gt;

&lt;p&gt;Once the agent and human have converged on the feature, convert the alignment conversation into a PRD.&lt;/p&gt;

&lt;p&gt;The PRD should not be a bloated corporate artifact. It should capture the destination.&lt;/p&gt;

&lt;p&gt;A useful PRD contains:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Feature: Gamification System&lt;/span&gt;

&lt;span class="gu"&gt;## Problem&lt;/span&gt;

Students start courses but do not consistently return or complete lessons.

&lt;span class="gu"&gt;## Solution&lt;/span&gt;

Add a lightweight gamification system with points, levels, and streaks to increase visible progress and motivation.

&lt;span class="gu"&gt;## User Stories&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; As a student, I can earn points when I complete lessons.
&lt;span class="p"&gt;-&lt;/span&gt; As a student, I can see my current points on the dashboard.
&lt;span class="p"&gt;-&lt;/span&gt; As a student, I can see my level progression.
&lt;span class="p"&gt;-&lt;/span&gt; As an instructor/admin, I can trust that points are derived from real completion events.

&lt;span class="gu"&gt;## Implementation Decisions&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; Points are awarded for lesson completion.
&lt;span class="p"&gt;-&lt;/span&gt; Video watch events are excluded because they are noisy and gameable.
&lt;span class="p"&gt;-&lt;/span&gt; Existing completion records may be backfilled.
&lt;span class="p"&gt;-&lt;/span&gt; Streaks are tracked separately from points.

&lt;span class="gu"&gt;## Out of Scope&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; Leaderboards.
&lt;span class="p"&gt;-&lt;/span&gt; Social sharing.
&lt;span class="p"&gt;-&lt;/span&gt; Complex achievements.
&lt;span class="p"&gt;-&lt;/span&gt; Manual admin point editing.

&lt;span class="gu"&gt;## Testing Decisions&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; Core point logic is tested in a dedicated gamification service.
&lt;span class="p"&gt;-&lt;/span&gt; Integration tests cover lesson completion triggering point awards.
&lt;span class="p"&gt;-&lt;/span&gt; UI smoke tests verify dashboard visibility.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The PRD is not the implementation. It is the destination.&lt;/p&gt;

&lt;p&gt;The point is to move from “vague intent” to “clear target.”&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: Do Not Use Linear Phase Plans by Default
&lt;/h2&gt;

&lt;p&gt;A common AI workflow is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Phase 1: Database schema
Phase 2: Backend services
Phase 3: API routes
Phase 4: Frontend UI
Phase 5: Tests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This looks organized, but it has a major flaw: it is horizontal.&lt;/p&gt;

&lt;p&gt;The agent builds layer by layer, but you do not get useful feedback until late in the process. The database may be done, the backend may be done, and the UI may be partially done before you discover that the full flow does not actually work.&lt;/p&gt;

&lt;p&gt;That is bad engineering.&lt;/p&gt;

&lt;p&gt;A better approach is to break work into &lt;strong&gt;vertical slices&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 4: Prefer Vertical Slices / Tracer Bullets
&lt;/h2&gt;

&lt;p&gt;A vertical slice crosses the full stack and produces something testable.&lt;/p&gt;

&lt;p&gt;Bad first task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Create the gamification database schema and service.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Better first task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Award points when a student completes a lesson and show the points on the dashboard.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That first slice may include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A minimal database change.&lt;/li&gt;
&lt;li&gt;A gamification service.&lt;/li&gt;
&lt;li&gt;A lesson completion hook.&lt;/li&gt;
&lt;li&gt;A dashboard display.&lt;/li&gt;
&lt;li&gt;A test proving points are awarded.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is more valuable because the system becomes testable immediately.&lt;/p&gt;

&lt;p&gt;The agent gets feedback earlier. The human gets something visible earlier. The architecture gets pressure-tested earlier.&lt;/p&gt;

&lt;p&gt;This is the same idea as tracer bullets from &lt;em&gt;The Pragmatic Programmer&lt;/em&gt;: build a thin, end-to-end path through the system so you can see where you are aiming.&lt;/p&gt;
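&lt;p&gt;As a rough sketch, the core of that first slice could be as small as one write path and one read path. The names (&lt;code&gt;onLessonCompleted&lt;/code&gt;, &lt;code&gt;getDashboardPoints&lt;/code&gt;) and the in-memory store are illustrative assumptions, not code from the article:&lt;/p&gt;

```typescript
// A minimal tracer-bullet slice: award points on lesson completion,
// read them back for the dashboard. In-memory storage stands in for
// the real database; names and point values are placeholders.
const POINTS_PER_LESSON = 10;

const awarded = new Set<string>();            // "userId:lessonId" keys already awarded
const pointsByUser = new Map<string, number>();

// The lesson completion hook: the single write path for points.
function onLessonCompleted(userId: string, lessonId: string): void {
  const key = `${userId}:${lessonId}`;
  if (awarded.has(key)) return;               // completing twice must not double-award
  awarded.add(key);
  pointsByUser.set(userId, (pointsByUser.get(userId) ?? 0) + POINTS_PER_LESSON);
}

// The dashboard read path: what the student sees.
function getDashboardPoints(userId: string): number {
  return pointsByUser.get(userId) ?? 0;
}
```

&lt;p&gt;Thin as it is, this already gives you something to test end to end: complete a lesson, see points change on the dashboard.&lt;/p&gt;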




&lt;h2&gt;
  
  
  Step 5: Convert the PRD into a Kanban Board, Not a Sequential Script
&lt;/h2&gt;

&lt;p&gt;Instead of one long plan, convert the PRD into independently grabbable issues.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Issue 1: Award lesson completion points and display them on dashboard
Blocks: none
Type: AFK

Issue 2: Track student streaks
Blocks: Issue 1
Type: AFK

Issue 3: Add level progression based on accumulated points
Blocks: Issue 1
Type: AFK

Issue 4: Backfill points for existing lesson completions
Blocks: Issue 1
Type: AFK

Issue 5: Add dashboard polish and empty states
Blocks: Issues 1, 2, 3
Type: Human review
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives you a directed acyclic graph of work.&lt;/p&gt;

&lt;p&gt;That matters because agents can work in parallel only when dependencies are clear.&lt;/p&gt;

&lt;p&gt;A linear plan can usually be executed by one agent. A Kanban-style graph can be executed by multiple agents safely.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 6: Separate Human-in-the-Loop Work from AFK Work
&lt;/h2&gt;

&lt;p&gt;Not all tasks should be delegated equally.&lt;/p&gt;

&lt;p&gt;Some work needs humans:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Product alignment.&lt;/li&gt;
&lt;li&gt;Domain decisions.&lt;/li&gt;
&lt;li&gt;Architecture boundaries.&lt;/li&gt;
&lt;li&gt;UX judgment.&lt;/li&gt;
&lt;li&gt;QA.&lt;/li&gt;
&lt;li&gt;Final code review.&lt;/li&gt;
&lt;li&gt;Tradeoff decisions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some work can be AFK:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implementing a well-scoped issue.&lt;/li&gt;
&lt;li&gt;Adding tests.&lt;/li&gt;
&lt;li&gt;Running type checks.&lt;/li&gt;
&lt;li&gt;Fixing straightforward failures.&lt;/li&gt;
&lt;li&gt;Refactoring within a clear boundary.&lt;/li&gt;
&lt;li&gt;Generating boilerplate.&lt;/li&gt;
&lt;li&gt;Applying known patterns.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The practical split is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Human-in-the-loop:
Idea → Grilling → PRD → Issue breakdown → QA → Review

AFK:
Issue implementation → Tests → Type checks → Automated review → Commit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the “day shift / night shift” model.&lt;/p&gt;

&lt;p&gt;Humans prepare the backlog and define quality. Agents execute scoped tasks.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 7: Use TDD as an Agent Control Mechanism
&lt;/h2&gt;

&lt;p&gt;TDD is not just a human discipline. It is especially useful for AI agents.&lt;/p&gt;

&lt;p&gt;The pattern is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="p"&gt;1.&lt;/span&gt; Write a failing test.
&lt;span class="p"&gt;2.&lt;/span&gt; Confirm it fails for the right reason.
&lt;span class="p"&gt;3.&lt;/span&gt; Implement the smallest change.
&lt;span class="p"&gt;4.&lt;/span&gt; Run the test.
&lt;span class="p"&gt;5.&lt;/span&gt; Refactor.
&lt;span class="p"&gt;6.&lt;/span&gt; Run full feedback loops.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why this works well with agents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It prevents the agent from coding blind.&lt;/li&gt;
&lt;li&gt;It gives the agent immediate feedback.&lt;/li&gt;
&lt;li&gt;It makes cheating harder.&lt;/li&gt;
&lt;li&gt;It forces the agent to encode expected behavior before implementation.&lt;/li&gt;
&lt;li&gt;It leaves the codebase better tested after each task.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without tests, agents tend to hallucinate correctness. With tests, they have a feedback loop.&lt;/p&gt;

&lt;p&gt;Bad codebases produce bad agents partly because they lack feedback loops.&lt;/p&gt;
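&lt;p&gt;To make the loop concrete, here is a toy red-green cycle for a points-to-level rule. The test encodes the expected behavior first; the function is the smallest change that satisfies it. The rule itself (one level per 100 points) is an illustrative assumption:&lt;/p&gt;

```typescript
// Step 1: encode expected behavior before implementing it.
// Run against a stub first, this throws -- failing for the right reason.
function testLevelForPoints(levelFn: (points: number) => number): void {
  if (levelFn(0) !== 1) throw new Error("0 points should be level 1");
  if (levelFn(99) !== 1) throw new Error("99 points should still be level 1");
  if (levelFn(100) !== 2) throw new Error("100 points should reach level 2");
}

// Step 3: the smallest implementation that makes the test pass.
function levelForPoints(points: number): number {
  return Math.floor(points / 100) + 1;
}

// Step 4: run the test.
testLevelForPoints(levelForPoints);
```

&lt;p&gt;The agent cannot claim the feature works until this stops throwing, which is exactly the feedback loop a prose-only prompt lacks.&lt;/p&gt;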




&lt;h2&gt;
  
  
  Step 8: Improve the Codebase for Agents by Deepening Modules
&lt;/h2&gt;

&lt;p&gt;A codebase made of many tiny, shallow modules is hard for both humans and agents to reason about.&lt;/p&gt;

&lt;p&gt;Shallow modules often look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;function A depends on helper B
helper B depends on utility C
utility C depends on config D
service E calls A, B, and C directly
tests mock half the graph
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The dependency graph is hard to understand.&lt;/li&gt;
&lt;li&gt;Test boundaries are unclear.&lt;/li&gt;
&lt;li&gt;Agents modify the wrong layer.&lt;/li&gt;
&lt;li&gt;Small changes cause unexpected breakage.&lt;/li&gt;
&lt;li&gt;The agent has to inspect too many files to understand one behavior.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A better structure uses &lt;strong&gt;deep modules&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A deep module has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A small public interface.&lt;/li&gt;
&lt;li&gt;Significant internal functionality.&lt;/li&gt;
&lt;li&gt;Clear ownership of behavior.&lt;/li&gt;
&lt;li&gt;A natural test boundary.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;AwardLessonCompletionPointsInput&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;
  &lt;span class="na"&gt;lessonId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;
  &lt;span class="na"&gt;completedAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;GamificationService&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;awardLessonCompletionPoints&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;AwardLessonCompletionPointsInput&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nf"&gt;getStudentProgress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="na"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;StudentGamificationProgress&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Internally, the service may do many things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check whether points were already awarded.&lt;/li&gt;
&lt;li&gt;Insert a point event.&lt;/li&gt;
&lt;li&gt;Update streaks.&lt;/li&gt;
&lt;li&gt;Recalculate level.&lt;/li&gt;
&lt;li&gt;Return dashboard data.&lt;/li&gt;
&lt;/ul&gt;
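&lt;p&gt;A hedged sketch of what could live behind that small interface, keeping all of those concerns internal. Storage is in-memory here and the point value is a placeholder; a real service would talk to the database:&lt;/p&gt;

```typescript
type StudentGamificationProgress = { points: number; level: number };

// One factory, small public surface, all internals hidden from callers.
function createGamificationService() {
  const awarded = new Set<string>();        // idempotency: one award per lesson
  const points = new Map<string, number>();

  return {
    async awardLessonCompletionPoints(input: {
      userId: string;
      lessonId: string;
      completedAt: Date;
    }): Promise<void> {
      const key = `${input.userId}:${input.lessonId}`;
      if (awarded.has(key)) return;         // already awarded: do nothing
      awarded.add(key);
      points.set(input.userId, (points.get(input.userId) ?? 0) + 10);
    },

    async getStudentProgress(userId: string): Promise<StudentGamificationProgress> {
      const p = points.get(userId) ?? 0;
      return { points: p, level: Math.floor(p / 100) + 1 }; // level derived, not stored
    },
  };
}
```

&lt;p&gt;The idempotency check, the level formula, and the storage choice can all change without touching a single caller.&lt;/p&gt;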

&lt;p&gt;But callers do not need to know that.&lt;/p&gt;

&lt;p&gt;This is good for humans and good for agents.&lt;/p&gt;

&lt;p&gt;The human owns the interface. The agent can implement the internals.&lt;/p&gt;

&lt;p&gt;That is the right abstraction boundary.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 9: Use Push vs Pull Context Deliberately
&lt;/h2&gt;

&lt;p&gt;Do not dump every rule into every prompt.&lt;/p&gt;

&lt;p&gt;There are two ways to provide context to an agent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Push context
&lt;/h3&gt;

&lt;p&gt;You always include it.&lt;/p&gt;

&lt;p&gt;Examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Follow these coding standards.
Use strict TypeScript.
Do not introduce new dependencies.
Run tests before committing.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Push context is useful for reviewers and critical constraints.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pull context
&lt;/h3&gt;

&lt;p&gt;You make information available, and the agent retrieves it when needed.&lt;/p&gt;

&lt;p&gt;Examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;/skills/react-patterns.md
/skills/database-migrations.md
/skills/testing-guidelines.md
/architecture/gamification.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pull context is useful for implementation guidance that is not always needed.&lt;/p&gt;

&lt;p&gt;A good rule:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Push constraints to reviewers.&lt;/li&gt;
&lt;li&gt;Let implementers pull guidance when needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The reviewer should be stricter than the implementer.&lt;/p&gt;
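&lt;p&gt;One way to picture the split in code: pushed constraints go into every prompt verbatim, while pull context is only surfaced as pointers the agent may follow. The keyword index and &lt;code&gt;buildPrompt&lt;/code&gt; helper below are illustrative, not any real framework's API:&lt;/p&gt;

```typescript
// Push context: always included, word for word.
const pushContext = [
  "Follow these coding standards.",
  "Use strict TypeScript.",
  "Do not introduce new dependencies.",
  "Run tests before committing.",
];

// Pull context: an index of skill files, surfaced only when relevant.
const pullIndex: Record<string, string> = {
  migration: "/skills/database-migrations.md",
  test: "/skills/testing-guidelines.md",
  react: "/skills/react-patterns.md",
};

function buildPrompt(task: string): string {
  // List only the pull files whose topic appears in the task;
  // the agent decides whether to actually open them.
  const relevant = Object.entries(pullIndex)
    .filter(([keyword]) => task.toLowerCase().includes(keyword))
    .map(([, path]) => path);

  return [
    ...pushContext,                          // always pushed
    `Task: ${task}`,
    ...(relevant.length ? ["Relevant skill files:", ...relevant] : []),
  ].join("\n");
}
```

&lt;p&gt;The constraints cost a few tokens on every call; the skill files cost nothing until a task actually needs them.&lt;/p&gt;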




&lt;h2&gt;
  
  
  Step 10: Always Review in a Fresh Context
&lt;/h2&gt;

&lt;p&gt;If the same agent implements and reviews in one long session, the review often happens in the “dumb zone.”&lt;/p&gt;

&lt;p&gt;Better:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Session 1:
Implement issue.

Clear context.

Session 2:
Review the diff against the issue, coding standards, and architecture rules.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This keeps the reviewer sharper.&lt;/p&gt;

&lt;p&gt;It also reduces self-justification. Agents are less likely to catch their own mistakes when they are still carrying the implementation history.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 11: QA Is Where Taste Re-enters the System
&lt;/h2&gt;

&lt;p&gt;Automated tests are necessary, but they are not enough.&lt;/p&gt;

&lt;p&gt;Human QA is where you impose taste.&lt;/p&gt;

&lt;p&gt;This is especially true for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frontend behavior.&lt;/li&gt;
&lt;li&gt;UX quality.&lt;/li&gt;
&lt;li&gt;Product feel.&lt;/li&gt;
&lt;li&gt;Naming.&lt;/li&gt;
&lt;li&gt;Edge cases.&lt;/li&gt;
&lt;li&gt;“Does this actually solve the problem?”&lt;/li&gt;
&lt;li&gt;“Would I be happy merging this?”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you automate everything from idea to QA, you often get software that technically exists but lacks judgment.&lt;/p&gt;

&lt;p&gt;That is how teams produce AI slop.&lt;/p&gt;

&lt;p&gt;The human role is not disappearing. It is moving upward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Less typing.&lt;/li&gt;
&lt;li&gt;More shaping.&lt;/li&gt;
&lt;li&gt;More reviewing.&lt;/li&gt;
&lt;li&gt;More boundary-setting.&lt;/li&gt;
&lt;li&gt;More taste enforcement.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  A Practical Workflow You Can Steal
&lt;/h2&gt;

&lt;p&gt;Here is the full loop.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="p"&gt;1.&lt;/span&gt; Start with a vague idea or client brief.
&lt;span class="p"&gt;
2.&lt;/span&gt; Run a grilling session.
   Goal: reach shared understanding.
&lt;span class="p"&gt;
3.&lt;/span&gt; Convert the conversation into a PRD.
   Goal: define the destination.
&lt;span class="p"&gt;
4.&lt;/span&gt; Convert the PRD into vertical-slice issues.
   Goal: create independently grabbable tasks.
&lt;span class="p"&gt;
5.&lt;/span&gt; Mark each issue:
&lt;span class="p"&gt;   -&lt;/span&gt; Human-in-the-loop
&lt;span class="p"&gt;   -&lt;/span&gt; AFK
&lt;span class="p"&gt;   -&lt;/span&gt; Blocked by X
&lt;span class="p"&gt;   -&lt;/span&gt; Blocks Y
&lt;span class="p"&gt;
6.&lt;/span&gt; Run one agent per available AFK issue.
   Goal: scoped implementation.
&lt;span class="p"&gt;
7.&lt;/span&gt; Require TDD and feedback loops.
   Goal: prevent blind coding.
&lt;span class="p"&gt;
8.&lt;/span&gt; Run automated review in a fresh context.
   Goal: catch obvious problems.
&lt;span class="p"&gt;
9.&lt;/span&gt; Human QA and code review.
   Goal: enforce correctness and taste.
&lt;span class="p"&gt;
10.&lt;/span&gt; Add new issues from QA findings.
    Goal: keep the Kanban board alive.
&lt;span class="p"&gt;
11.&lt;/span&gt; Merge only when the slice is coherent.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Bigger Lesson
&lt;/h2&gt;

&lt;p&gt;AI coding is not replacing software engineering fundamentals.&lt;/p&gt;

&lt;p&gt;It is punishing teams that ignored them.&lt;/p&gt;

&lt;p&gt;If your codebase has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Poor tests.&lt;/li&gt;
&lt;li&gt;Shallow modules.&lt;/li&gt;
&lt;li&gt;Unclear boundaries.&lt;/li&gt;
&lt;li&gt;Weak architecture.&lt;/li&gt;
&lt;li&gt;Vague requirements.&lt;/li&gt;
&lt;li&gt;No review discipline.&lt;/li&gt;
&lt;li&gt;No product taste.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then agents will amplify the mess.&lt;/p&gt;

&lt;p&gt;If your codebase has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clear modules.&lt;/li&gt;
&lt;li&gt;Strong feedback loops.&lt;/li&gt;
&lt;li&gt;Small vertical slices.&lt;/li&gt;
&lt;li&gt;Explicit requirements.&lt;/li&gt;
&lt;li&gt;Testable behavior.&lt;/li&gt;
&lt;li&gt;Good review practices.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then agents can move extremely fast.&lt;/p&gt;

&lt;p&gt;The future of software development is not “write specs and ignore code.”&lt;/p&gt;

&lt;p&gt;It is closer to this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Developers design the system, define the boundaries, create the feedback loops, and delegate scoped implementation to agents.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is a much stronger model than vibe coding.&lt;/p&gt;

&lt;p&gt;And it is much closer to real engineering.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
