<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Alex Metelli</title>
    <description>The latest articles on Forem by Alex Metelli (@alex_metelli_f22d28dae8de).</description>
    <link>https://forem.com/alex_metelli_f22d28dae8de</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3862991%2F3d35f220-4ca0-4ace-a4ec-d55b93230843.png</url>
      <title>Forem: Alex Metelli</title>
      <link>https://forem.com/alex_metelli_f22d28dae8de</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/alex_metelli_f22d28dae8de"/>
    <language>en</language>
    <item>
      <title>From AI Demo to Production: How to Ship Quality Agentic Applications</title>
      <dc:creator>Alex Metelli</dc:creator>
      <pubDate>Sat, 02 May 2026 09:35:57 +0000</pubDate>
      <link>https://forem.com/alex_metelli_f22d28dae8de/from-ai-demo-to-production-how-to-ship-quality-agentic-applications-403f</link>
      <guid>https://forem.com/alex_metelli_f22d28dae8de/from-ai-demo-to-production-how-to-ship-quality-agentic-applications-403f</guid>
      <description>&lt;p&gt;AI developers are getting very good at building demos.&lt;/p&gt;

&lt;p&gt;A prompt, a model call, maybe a tool call or two, and suddenly you have something that looks impressive in a workshop, a hackathon, or an internal prototype.&lt;/p&gt;

&lt;p&gt;But production is where the illusion breaks.&lt;/p&gt;

&lt;p&gt;The hard part is no longer proving that an LLM can answer a question. The hard part is knowing whether your AI system behaves correctly across messy real-world inputs, edge cases, model changes, tool failures, latency constraints, cost pressure, and business-specific policies.&lt;/p&gt;

&lt;p&gt;That was the core theme of a recent Braintrust and Trainline workshop in London: &lt;strong&gt;shipping AI applications requires operational rigor, not just better prompts.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The prototype trap
&lt;/h2&gt;

&lt;p&gt;A lot of AI applications start the same way:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-5-mini&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;system&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;You are a helpful support agent.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ticketText&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a proof of concept, this is fine.&lt;/p&gt;

&lt;p&gt;You pass in a ticket, get a structured response, and the output looks plausible. Maybe it categorizes the issue, assigns a severity, and drafts a customer reply.&lt;/p&gt;

&lt;p&gt;The problem is that plausibility is not correctness.&lt;/p&gt;

&lt;p&gt;A single prompt can work beautifully on three demo cases and fail badly once exposed to production inputs. This is especially true in business workflows where the model needs to understand implicit priority, policies, customer tiers, refunds, SLAs, billing impact, or escalation rules.&lt;/p&gt;

&lt;p&gt;Traditional software has deterministic failure modes. AI systems do not.&lt;/p&gt;

&lt;p&gt;In normal software, &lt;code&gt;1 + 1 = 2&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In LLM systems, the equivalent is closer to: “usually 2, unless the context, prompt, model, tool result, temperature, or previous step nudges it somewhere else.”&lt;/p&gt;

&lt;p&gt;That does not mean AI systems are unusable. It means they need a different quality model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production AI needs both software engineering and ML discipline
&lt;/h2&gt;

&lt;p&gt;Trainline described this well.&lt;/p&gt;

&lt;p&gt;They operate both traditional ML systems and agentic AI systems at large scale. For example, they use machine learning to predict train disruptions, but they also run a travel assistant that can help users with refunds, alternative trains, and support workflows.&lt;/p&gt;

&lt;p&gt;Those two worlds have different quality practices.&lt;/p&gt;

&lt;p&gt;Classic software engineering gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;type checks&lt;/li&gt;
&lt;li&gt;unit tests&lt;/li&gt;
&lt;li&gt;integration tests&lt;/li&gt;
&lt;li&gt;CI/CD&lt;/li&gt;
&lt;li&gt;structured logging&lt;/li&gt;
&lt;li&gt;service observability&lt;/li&gt;
&lt;li&gt;release discipline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Machine learning gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;datasets&lt;/li&gt;
&lt;li&gt;offline evaluation&lt;/li&gt;
&lt;li&gt;online evaluation&lt;/li&gt;
&lt;li&gt;model comparison&lt;/li&gt;
&lt;li&gt;data quality checks&lt;/li&gt;
&lt;li&gt;drift monitoring&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Agentic AI sits in the middle.&lt;/p&gt;

&lt;p&gt;Part of the system is deterministic: API calls, database lookups, validation, routing, tool execution.&lt;/p&gt;

&lt;p&gt;Part of the system is nondeterministic: reasoning, classification, language generation, judgment, summarization.&lt;/p&gt;

&lt;p&gt;So the correct quality model is not “just write tests” and not “just vibe-check the outputs.”&lt;/p&gt;

&lt;p&gt;It is both.&lt;/p&gt;

&lt;h2&gt;
  
  
  Break the monolithic prompt into stages
&lt;/h2&gt;

&lt;p&gt;One of the most useful engineering patterns from the workshop was taking a single prompt-based support agent and breaking it into a staged workflow.&lt;/p&gt;

&lt;p&gt;Instead of one large model call that does everything, the system was split into clearer responsibilities:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Context collection&lt;/strong&gt;&lt;br&gt;
Gather deterministic context: account information, previous tickets, relevant help articles, customer tier, billing state, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Triage&lt;/strong&gt;&lt;br&gt;
Classify the issue, infer severity, identify the affected domain, and decide whether more information is needed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Policy review&lt;/strong&gt;&lt;br&gt;
Check whether the proposed action follows company policy, SLAs, refund rules, escalation rules, or compliance constraints.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reply writing&lt;/strong&gt;&lt;br&gt;
Draft a customer-facing response in the correct tone.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Final packaging&lt;/strong&gt;&lt;br&gt;
Emit structured output for downstream systems, including escalation flags, internal notes, and customer reply.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is not just cleaner architecture. It improves debuggability.&lt;/p&gt;

&lt;p&gt;When a single prompt fails, you often do not know why. Was the categorization wrong? Did it miss account context? Did it misunderstand policy? Did the final reply hallucinate?&lt;/p&gt;

&lt;p&gt;When the workflow is staged, each part can be traced, evaluated, and improved independently.&lt;/p&gt;

&lt;p&gt;This is the same instinct software engineers already use when decomposing a monolith. The AI version is: do not put all reasoning, policy, tool usage, and writing into one giant prompt unless the problem is genuinely trivial.&lt;/p&gt;
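
&lt;p&gt;As a sketch, the staged shape is just typed functions composed in order. Everything below is illustrative (the stage logic, ticket fields, and tier rule are invented for the example); in a real system the triage, policy, and reply stages would be separate model calls while context collection stays deterministic:&lt;/p&gt;

```typescript
// Hypothetical staged support workflow: each stage has one job and a typed
// input/output, so a failure can be localized to a single step.
type Ticket = { id: string; text: string; customerId: string };
type Context = { ticket: Ticket; tier: "free" | "pro" | "enterprise"; history: string[] };
type Triage = { category: string; severity: "low" | "medium" | "high"; needsInfo: boolean };
type PolicyResult = { allowed: boolean; violations: string[] };

// Stand-in implementations: in production, triage/policy/reply are LLM calls.
function collectContext(ticket: Ticket): Context {
  return { ticket, tier: "pro", history: [] };
}

function triage(ctx: Context): Triage {
  // Crude keyword heuristic standing in for a model call.
  const urgent = /cfo|board|outage|invoice/i.test(ctx.ticket.text);
  return { category: "billing", severity: urgent ? "high" : "low", needsInfo: false };
}

function reviewPolicy(ctx: Context, t: Triage): PolicyResult {
  // Example rule: free-tier billing issues go to a human instead.
  const allowed = ctx.tier !== "free" || t.category !== "billing";
  return { allowed, violations: allowed ? [] : ["free-tier-billing"] };
}

function writeReply(ctx: Context, t: Triage): string {
  return `We have classified your issue as ${t.category} (${t.severity}) and are on it.`;
}

function runWorkflow(ticket: Ticket) {
  const ctx = collectContext(ticket);
  const t = triage(ctx);
  const policy = reviewPolicy(ctx, t);
  const reply = policy.allowed ? writeReply(ctx, t) : "Escalated for human review.";
  return { triage: t, policy, reply, escalate: !policy.allowed || t.severity === "high" };
}
```

&lt;p&gt;The payoff is that each function can be traced and evaluated on its own, which is exactly what the monolithic prompt cannot offer.&lt;/p&gt;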

&lt;h2&gt;
  
  
  Tool calls improve capability, but increase failure surface
&lt;/h2&gt;

&lt;p&gt;Adding tools makes an AI application more useful.&lt;/p&gt;

&lt;p&gt;A support agent can call tools to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;retrieve help articles&lt;/li&gt;
&lt;li&gt;inspect account metadata&lt;/li&gt;
&lt;li&gt;check previous incidents&lt;/li&gt;
&lt;li&gt;look up billing state&lt;/li&gt;
&lt;li&gt;fetch policy rules&lt;/li&gt;
&lt;li&gt;create an escalation&lt;/li&gt;
&lt;li&gt;draft or update a ticket&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But every tool call adds another possible failure mode.&lt;/p&gt;

&lt;p&gt;The tool may return stale data. The model may choose the wrong tool. The tool result may be incomplete. The model may ignore the result. The tool may succeed but the final answer may still misinterpret it.&lt;/p&gt;

&lt;p&gt;This is why AI observability becomes mandatory.&lt;/p&gt;

&lt;p&gt;If you cannot see which tools were called, with what inputs, what outputs they returned, and how the model used those outputs, you are effectively debugging production behavior blind.&lt;/p&gt;

&lt;p&gt;Logs are not enough.&lt;/p&gt;

&lt;p&gt;Logs tell you what happened at a shallow level. Tracing tells you how the system behaved internally.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trace the full execution path
&lt;/h2&gt;

&lt;p&gt;For production AI systems, tracing should capture the entire workflow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;parent request&lt;/li&gt;
&lt;li&gt;child spans for each stage&lt;/li&gt;
&lt;li&gt;model inputs&lt;/li&gt;
&lt;li&gt;model outputs&lt;/li&gt;
&lt;li&gt;tool calls&lt;/li&gt;
&lt;li&gt;tool results&lt;/li&gt;
&lt;li&gt;token usage&lt;/li&gt;
&lt;li&gt;latency&lt;/li&gt;
&lt;li&gt;cost&lt;/li&gt;
&lt;li&gt;metadata&lt;/li&gt;
&lt;li&gt;final structured output&lt;/li&gt;
&lt;li&gt;scores or evaluation results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The important detail is nesting.&lt;/p&gt;

&lt;p&gt;A lot of teams instrument only the top-level model call. That is not enough for agentic systems.&lt;/p&gt;

&lt;p&gt;You want a trace that looks more like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;support-ticket-run
  ├── collect-context
  │   ├── fetch-account
  │   ├── search-help-articles
  │   └── fetch-ticket-history
  ├── triage-specialist
  │   └── llm-call
  ├── policy-reviewer
  │   └── llm-call
  ├── reply-writer
  │   └── llm-call
  └── finalize-result
      └── maybe-escalate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This lets you answer the questions that matter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which step failed?&lt;/li&gt;
&lt;li&gt;Did the model receive the right context?&lt;/li&gt;
&lt;li&gt;Did the tool return the expected data?&lt;/li&gt;
&lt;li&gt;Did latency come from retrieval, reasoning, or final generation?&lt;/li&gt;
&lt;li&gt;Did the model output violate policy?&lt;/li&gt;
&lt;li&gt;Did a cheaper model behave differently?&lt;/li&gt;
&lt;li&gt;Did a prompt change improve one case but regress another?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without tracing, you are guessing.&lt;/p&gt;

&lt;p&gt;And guessing is not an engineering process.&lt;/p&gt;
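
&lt;p&gt;The nested shape does not require anything exotic. Here is a deliberately minimal homemade tracer, not any vendor's SDK, just to show how parent and child spans compose into the tree above:&lt;/p&gt;

```typescript
// Minimal nested-span tracer (illustrative only): each span records its
// name, duration, metadata, and children, producing a tree per request.
type Span = {
  name: string;
  startMs: number;
  durationMs?: number;
  metadata: Record<string, unknown>;
  children: Span[];
};

function startSpan(name: string, parent?: Span): Span {
  const span: Span = { name, startMs: Date.now(), metadata: {}, children: [] };
  parent?.children.push(span); // nesting is just parent-child linkage
  return span;
}

function endSpan(span: Span, metadata: Record<string, unknown> = {}): void {
  span.durationMs = Date.now() - span.startMs;
  Object.assign(span.metadata, metadata);
}

// Usage: the parent request owns one child span per stage, and LLM calls
// hang off the stage that made them (model name and numbers are made up).
const root = startSpan("support-ticket-run");
const ctxSpan = startSpan("collect-context", root);
endSpan(ctxSpan, { articlesFound: 3 });
const triageSpan = startSpan("triage-specialist", root);
const llm = startSpan("llm-call", triageSpan);
endSpan(llm, { model: "gpt-5-mini", tokens: 412, costUsd: 0.0004 });
endSpan(triageSpan);
endSpan(root);
```

&lt;p&gt;A real observability tool adds storage, sampling, and UI on top, but the core data model is this tree.&lt;/p&gt;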

&lt;h2&gt;
  
  
  Build a golden dataset before you trust the system
&lt;/h2&gt;

&lt;p&gt;A production AI application needs evaluation data.&lt;/p&gt;

&lt;p&gt;At the beginning, this can be small. It does not need to be perfect. But it needs to exist.&lt;/p&gt;

&lt;p&gt;For the workshop support agent, the golden dataset contained representative support tickets with expected properties, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;correct category&lt;/li&gt;
&lt;li&gt;expected severity&lt;/li&gt;
&lt;li&gt;whether escalation is required&lt;/li&gt;
&lt;li&gt;whether the output schema is valid&lt;/li&gt;
&lt;li&gt;whether the reply follows policy&lt;/li&gt;
&lt;li&gt;whether the customer-facing response is appropriate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This dataset becomes the baseline for safe iteration.&lt;/p&gt;

&lt;p&gt;Every time you change the prompt, model, tool behavior, or workflow, you can rerun the evaluation and ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Did this change improve the system, or did it just look better on one example?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That distinction matters.&lt;/p&gt;

&lt;p&gt;Prompt editing without evaluations is just production gambling.&lt;/p&gt;
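
&lt;p&gt;A golden dataset can start as a plain list of cases with expected properties, plus a runner that reports a pass rate. The field names and cases below are illustrative, not a specific eval framework:&lt;/p&gt;

```typescript
// A golden case pairs an input ticket with the expected properties
// from the list above (category, severity, escalation decision).
type GoldenCase = {
  input: string;
  expected: { category: string; severity: "low" | "medium" | "high"; shouldEscalate: boolean };
};

const goldenDataset: GoldenCase[] = [
  {
    input: "This isn't urgent, but our CFO can't export the invoices before the board meeting.",
    expected: { category: "billing", severity: "high", shouldEscalate: true },
  },
  {
    input: "How do I change my avatar?",
    expected: { category: "account", severity: "low", shouldEscalate: false },
  },
];

type AgentOutput = { category: string; severity: string; escalate: boolean };

// Rerun after every prompt/model/workflow change and compare pass rates
// against the previous run.
function evaluate(run: (input: string) => AgentOutput) {
  let passed = 0;
  for (const c of goldenDataset) {
    const out = run(c.input);
    const ok =
      out.category === c.expected.category &&
      out.severity === c.expected.severity &&
      out.escalate === c.expected.shouldEscalate;
    if (ok) passed++;
  }
  return { passed, total: goldenDataset.length, passRate: passed / goldenDataset.length };
}
```

&lt;p&gt;Two cases is obviously not enough, but the structure is the point: every change gets measured against the same baseline.&lt;/p&gt;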

&lt;h2&gt;
  
  
  Use deterministic scores where possible
&lt;/h2&gt;

&lt;p&gt;Not every evaluation needs an LLM judge.&lt;/p&gt;

&lt;p&gt;Some checks should be deterministic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;hasValidSchema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;unknown&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;SupportTicketOutputSchema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;safeParse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;success&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Useful deterministic checks include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;schema validity&lt;/li&gt;
&lt;li&gt;required fields present&lt;/li&gt;
&lt;li&gt;escalation reason exists when escalation is required&lt;/li&gt;
&lt;li&gt;severity is within allowed enum values&lt;/li&gt;
&lt;li&gt;category matches expected label&lt;/li&gt;
&lt;li&gt;reply is not empty&lt;/li&gt;
&lt;li&gt;internal-only data is not present in customer reply&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These checks are cheap, fast, and stable, and they should run often.&lt;/p&gt;

&lt;p&gt;They are the AI equivalent of unit tests and type checks.&lt;/p&gt;
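
&lt;p&gt;Several of the checks above need nothing more than plain predicates. The output shape here is illustrative (your schema will differ), but each function maps directly to one item in the list:&lt;/p&gt;

```typescript
// Deterministic checks as plain predicates: no LLM involved.
type TicketOutput = {
  category?: string;
  severity?: string;
  escalate?: boolean;
  escalationReason?: string;
  customerReply?: string;
  internalNotes?: string;
};

const ALLOWED_SEVERITIES = ["low", "medium", "high"];

// "severity is within allowed enum values"
function severityIsValid(o: TicketOutput): boolean {
  return o.severity !== undefined && ALLOWED_SEVERITIES.includes(o.severity);
}

// "escalation reason exists when escalation is required"
function escalationReasonPresent(o: TicketOutput): boolean {
  return !o.escalate || (o.escalationReason ?? "").trim().length > 0;
}

// "reply is not empty"
function replyNotEmpty(o: TicketOutput): boolean {
  return (o.customerReply ?? "").trim().length > 0;
}

// "internal-only data is not present in customer reply" (verbatim check only;
// a paraphrased leak would need a stronger detector)
function noInternalLeaks(o: TicketOutput): boolean {
  if (!o.internalNotes || !o.customerReply) return true;
  return !o.customerReply.includes(o.internalNotes);
}
```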

&lt;h2&gt;
  
  
  Use LLM-as-judge for subjective quality
&lt;/h2&gt;

&lt;p&gt;Some things cannot be reliably checked with simple assertions.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is this reply helpful?&lt;/li&gt;
&lt;li&gt;Is the tone appropriate?&lt;/li&gt;
&lt;li&gt;Does it follow the refund policy?&lt;/li&gt;
&lt;li&gt;Does it avoid overpromising?&lt;/li&gt;
&lt;li&gt;Does it correctly reason about the customer’s situation?&lt;/li&gt;
&lt;li&gt;Is the escalation decision justified?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For those, LLM-as-judge can be useful.&lt;/p&gt;

&lt;p&gt;The key is to use it deliberately. Write clear rubrics. Score specific dimensions. Avoid vague prompts like “is this good?”&lt;/p&gt;

&lt;p&gt;A better judge prompt asks for concrete criteria:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Evaluate whether the support response:
1. Correctly identifies the user’s issue.
2. Does not promise actions outside company policy.
3. Gives a clear next step.
4. Uses an appropriate customer-facing tone.
5. Escalates when business impact is high.

Return a score from 0 to 1 and a short rationale.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LLM judges are not magic. They are another probabilistic component. But when paired with deterministic checks and real production traces, they give you a scalable way to evaluate nuance.&lt;/p&gt;
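
&lt;p&gt;One practical consequence: whatever model runs the judge, its output still needs defensive parsing before the score can be stored against a trace. A sketch, assuming the judge follows the &amp;ldquo;score from 0 to 1 and a short rationale&amp;rdquo; format requested above:&lt;/p&gt;

```typescript
// Parse a judge response of the form "Score: 0.8. <rationale>" into a
// structured verdict. The judge is probabilistic, so reject anything that
// does not contain a usable in-range score instead of guessing.
type JudgeVerdict = { score: number; rationale: string };

function parseJudgeResponse(text: string): JudgeVerdict | null {
  // Accept "Score: 0.8", "score = 0.8", or a bare leading number.
  const match = text.match(/(?:score\s*[:=]?\s*)?(\d(?:\.\d+)?)/i);
  if (!match) return null;
  const score = Number(match[1]);
  if (Number.isNaN(score) || score < 0 || score > 1) return null;
  // Treat everything after the score as the rationale.
  const rationale = text
    .slice(match.index! + match[0].length)
    .replace(/^[\s,.:;-]+/, "")
    .trim();
  return { score, rationale };
}
```

&lt;p&gt;Returning &lt;code&gt;null&lt;/code&gt; on malformed output matters: a judge that silently defaults to some score would quietly corrupt your evaluation data.&lt;/p&gt;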

&lt;h2&gt;
  
  
  Offline evaluation is not enough
&lt;/h2&gt;

&lt;p&gt;A golden dataset gives you confidence before deployment.&lt;/p&gt;

&lt;p&gt;But production data is where the real failures appear.&lt;/p&gt;

&lt;p&gt;Users will phrase things oddly. They will omit important details. They will create conflicting signals. They will say something is “not urgent” while describing a CFO blocked before a board meeting.&lt;/p&gt;

&lt;p&gt;That example came up in the workshop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"This isn't urgent, but our CFO can't export the invoices before the board meeting."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A weak triage agent may classify this as low severity because the user said “not urgent.”&lt;/p&gt;

&lt;p&gt;A better system understands the business context:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CFO&lt;/li&gt;
&lt;li&gt;invoices&lt;/li&gt;
&lt;li&gt;board meeting&lt;/li&gt;
&lt;li&gt;likely time-sensitive&lt;/li&gt;
&lt;li&gt;high business impact&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is exactly the kind of failure that does not always appear in initial test data.&lt;/p&gt;

&lt;p&gt;So production traces should feed back into the evaluation loop.&lt;/p&gt;

&lt;p&gt;When you find a failure:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Capture the trace.&lt;/li&gt;
&lt;li&gt;Add the case to the dataset.&lt;/li&gt;
&lt;li&gt;Write or update the scoring rule.&lt;/li&gt;
&lt;li&gt;Fix the prompt, tool, policy, or workflow.&lt;/li&gt;
&lt;li&gt;Rerun evaluations.&lt;/li&gt;
&lt;li&gt;Compare against previous runs.&lt;/li&gt;
&lt;li&gt;Deploy only if the fix does not introduce regressions.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That loop is the real production AI workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model switching needs evaluation discipline
&lt;/h2&gt;

&lt;p&gt;Trainline also described a very practical problem: model cost.&lt;/p&gt;

&lt;p&gt;At scale, LLM bills can become painful fast. Teams naturally want to switch models, use cheaper models, reduce token usage, or route simpler requests to smaller models.&lt;/p&gt;

&lt;p&gt;But model switching without evaluation is dangerous.&lt;/p&gt;

&lt;p&gt;A cheaper model may perform similarly on simple tickets and fail on high-impact edge cases. A newer model may improve reasoning but change tone. A faster model may reduce latency but increase escalation mistakes.&lt;/p&gt;

&lt;p&gt;The right question is not:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can we switch to a cheaper model?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The right question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can we switch to a cheaper model without degrading the scores that matter?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That requires offline evaluation, online monitoring, and trace-level comparison.&lt;/p&gt;
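
&lt;p&gt;&amp;ldquo;Without degrading the scores that matter&amp;rdquo; can be made concrete as a gate in the eval pipeline: compare per-score averages between the current and candidate model, and block the switch if any tracked score regresses beyond a tolerance. Score names and the tolerance below are illustrative:&lt;/p&gt;

```typescript
// Map of score name -> average score (0..1) from an evaluation run.
type ScoreSummary = Record<string, number>;

// Returns whether the candidate model may replace the current one, and
// which scores regressed if not. A score missing from the candidate run
// counts as a regression rather than a pass.
function canSwitchModels(
  current: ScoreSummary,
  candidate: ScoreSummary,
  tolerance = 0.02,
): { ok: boolean; regressions: string[] } {
  const regressions: string[] = [];
  for (const [name, baseline] of Object.entries(current)) {
    const next = candidate[name];
    if (next === undefined || baseline - next > tolerance) regressions.push(name);
  }
  return { ok: regressions.length === 0, regressions };
}
```

&lt;p&gt;The gate only answers the offline half of the question; online monitoring still has to confirm the cheaper model holds up on production traffic.&lt;/p&gt;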

&lt;h2&gt;
  
  
  Managed prompts help cross-functional teams
&lt;/h2&gt;

&lt;p&gt;Another practical point: prompts often become collaboration bottlenecks.&lt;/p&gt;

&lt;p&gt;Engineers own the codebase, but product managers, support leads, legal reviewers, and domain experts often understand the desired behavior better than the engineering team.&lt;/p&gt;

&lt;p&gt;If every prompt change requires a code change, review, deploy, and engineering handoff, iteration slows down.&lt;/p&gt;

&lt;p&gt;Managed prompts and parameters solve part of this.&lt;/p&gt;

&lt;p&gt;They allow teams to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;version prompts&lt;/li&gt;
&lt;li&gt;track who changed what&lt;/li&gt;
&lt;li&gt;compare prompt versions&lt;/li&gt;
&lt;li&gt;update model parameters&lt;/li&gt;
&lt;li&gt;collaborate with non-engineers&lt;/li&gt;
&lt;li&gt;test changes before rollout&lt;/li&gt;
&lt;li&gt;keep production behavior reproducible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This does not mean abandoning Git or software discipline.&lt;/p&gt;

&lt;p&gt;The better pattern is to keep prompts, tools, and configuration synchronized with code where needed, while still giving teams a managed operational layer for experimentation and review.&lt;/p&gt;

&lt;p&gt;For regulated industries, this matters even more. You need to know what changed, when it changed, who changed it, and what effect it had.&lt;/p&gt;

&lt;h2&gt;
  
  
  The production AI flywheel
&lt;/h2&gt;

&lt;p&gt;The workshop’s core operating model can be summarized as a flywheel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Build → Trace → Evaluate → Find failures → Remediate → Deploy → Monitor → Repeat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;More concretely:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with a working agent&lt;/strong&gt;&lt;br&gt;
Even if it is simple.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Break it into explicit stages&lt;/strong&gt;&lt;br&gt;
Separate context gathering, reasoning, policy checks, reply generation, and final output.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Add tracing&lt;/strong&gt;&lt;br&gt;
Capture model calls, tool calls, latency, token usage, cost, inputs, outputs, and metadata.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Create a golden dataset&lt;/strong&gt;&lt;br&gt;
Start with representative examples and known edge cases.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Add deterministic scores&lt;/strong&gt;&lt;br&gt;
Validate structure, required fields, categories, escalation rules, and other objective behavior.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Add LLM judges where needed&lt;/strong&gt;&lt;br&gt;
Evaluate tone, helpfulness, policy compliance, and reasoning quality.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Run offline evaluations&lt;/strong&gt;&lt;br&gt;
Before shipping prompt, model, or workflow changes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Score production traces&lt;/strong&gt;&lt;br&gt;
Use online evaluation and sampling to detect real-world failures.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Turn failures into tests&lt;/strong&gt;&lt;br&gt;
Every production failure should improve your dataset.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compare experiments over time&lt;/strong&gt;&lt;br&gt;
Do not trust a change unless you can see its effect.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The main lesson for AI developers
&lt;/h2&gt;

&lt;p&gt;The future of AI engineering is not just better models.&lt;/p&gt;

&lt;p&gt;It is better systems around models.&lt;/p&gt;

&lt;p&gt;A production-grade AI application needs the same seriousness we already expect from software systems: observability, testing, versioning, review, deployment discipline, and feedback loops.&lt;/p&gt;

&lt;p&gt;But it also needs ML-style evaluation: datasets, scoring, model comparison, judge rubrics, and continuous monitoring.&lt;/p&gt;

&lt;p&gt;The teams that win will not be the ones with the fanciest demo.&lt;/p&gt;

&lt;p&gt;They will be the ones that can answer, with evidence:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What changed?&lt;/li&gt;
&lt;li&gt;Did quality improve?&lt;/li&gt;
&lt;li&gt;Did cost increase?&lt;/li&gt;
&lt;li&gt;Did latency regress?&lt;/li&gt;
&lt;li&gt;Which failure modes remain?&lt;/li&gt;
&lt;li&gt;Which users are affected?&lt;/li&gt;
&lt;li&gt;Can we reproduce the issue?&lt;/li&gt;
&lt;li&gt;Can we safely ship the fix?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the difference between an AI prototype and an AI product.&lt;/p&gt;

&lt;p&gt;And for agentic systems, that difference is everything.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>testing</category>
    </item>
    <item>
      <title>Building Better Software with AI Agents: Why Fundamentals Still Matter</title>
      <dc:creator>Alex Metelli</dc:creator>
      <pubDate>Mon, 27 Apr 2026 20:39:12 +0000</pubDate>
      <link>https://forem.com/alex_metelli_f22d28dae8de/building-better-software-with-ai-agents-why-fundamentals-still-matter-22fd</link>
      <guid>https://forem.com/alex_metelli_f22d28dae8de/building-better-software-with-ai-agents-why-fundamentals-still-matter-22fd</guid>
      <description>&lt;p&gt;AI coding tools are changing how software gets built, but they do not remove the need for software engineering discipline. In practice, they make fundamentals more important.&lt;/p&gt;

&lt;p&gt;This post is a condensed write-up of a workshop by Matt Pocock on building better software with AI agents. The original workshop is available here: &lt;a href="https://youtu.be/-QFHIoCo-Ko?si=9qyQKxnid9sE_ehc" rel="noopener noreferrer"&gt;https://youtu.be/-QFHIoCo-Ko?si=9qyQKxnid9sE_ehc&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The core mistake many developers make is treating AI as a “spec-to-code compiler”: write a vague requirement, hand it to an agent, and expect production-ready software to appear. That works for demos. It breaks down in real codebases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A better model is this&lt;/strong&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Use AI agents to accelerate implementation, but use software engineering fundamentals to control alignment, architecture, feedback loops, and quality.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What follows distills that workshop into a concrete developer workflow: agentic planning, PRDs, vertical slicing, TDD, and codebase design.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Two Constraints of LLM Coding Agents
&lt;/h2&gt;

&lt;p&gt;Before designing an AI-assisted workflow, you need to understand two constraints.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Agents have a “smart zone” and a “dumb zone”
&lt;/h3&gt;

&lt;p&gt;An LLM performs best when the context is clean, focused, and not overloaded.&lt;/p&gt;

&lt;p&gt;As the conversation grows, the agent has to reason across more tokens, more decisions, more previous mistakes, and more irrelevant detail. Eventually, it starts making worse decisions.&lt;/p&gt;

&lt;p&gt;This is why giant context windows are not a free lunch. They are useful for retrieval, but not always for coding. A 1M-token window does not mean the agent stays sharp for 1M tokens.&lt;/p&gt;

&lt;p&gt;For coding, the &lt;strong&gt;practical strategy&lt;/strong&gt; is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep tasks small.&lt;/li&gt;
&lt;li&gt;Keep context clean.&lt;/li&gt;
&lt;li&gt;Avoid long, drifting conversations.&lt;/li&gt;
&lt;li&gt;Prefer fresh sessions for focused work.&lt;/li&gt;
&lt;li&gt;Do not let the agent accumulate too much conversational sediment.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Agents forget unless you externalize state
&lt;/h3&gt;

&lt;p&gt;LLMs are stateless between sessions unless you give them state explicitly.&lt;/p&gt;

&lt;p&gt;That means important decisions should not live only in the chat history. You need artifacts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Product requirements documents.&lt;/li&gt;
&lt;li&gt;Local issue files.&lt;/li&gt;
&lt;li&gt;GitHub issues.&lt;/li&gt;
&lt;li&gt;Architecture notes.&lt;/li&gt;
&lt;li&gt;Test cases.&lt;/li&gt;
&lt;li&gt;Commit history.&lt;/li&gt;
&lt;li&gt;Review summaries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trick is not to preserve everything. It is to preserve the useful shape of the work.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Start with Alignment, Not Implementation
&lt;/h2&gt;

&lt;p&gt;When a vague feature request arrives, the wrong move is to immediately ask the agent to code.&lt;/p&gt;

&lt;p&gt;Example request:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Retention is bad. Students sign up, do a few lessons, then drop off. Let’s add gamification.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That sounds simple. It is not.&lt;/p&gt;

&lt;p&gt;Before coding, you need to clarify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What actions earn points?&lt;/li&gt;
&lt;li&gt;Are points retroactive?&lt;/li&gt;
&lt;li&gt;Do streaks earn points?&lt;/li&gt;
&lt;li&gt;Where does the UI live?&lt;/li&gt;
&lt;li&gt;What counts as lesson completion?&lt;/li&gt;
&lt;li&gt;What is the progression curve?&lt;/li&gt;
&lt;li&gt;What is out of scope?&lt;/li&gt;
&lt;li&gt;What data model supports this?&lt;/li&gt;
&lt;li&gt;How will this be tested?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The useful pattern here is a &lt;strong&gt;grilling session&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of saying:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Create a plan.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Ask the agent to interrogate the requirement:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Interview me relentlessly about every aspect of this feature until we reach a shared understanding. Ask one question at a time. For each question, give your recommended answer.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This changes the interaction.&lt;/p&gt;

&lt;p&gt;The goal is not to produce a plan immediately. The goal is to reach a shared design concept between the human and the agent.&lt;/p&gt;

&lt;p&gt;That matters because most AI coding failures are not syntax failures. They are alignment failures.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Turn the Conversation into a Destination Document
&lt;/h2&gt;

&lt;p&gt;Once the agent and human have converged on the feature, convert the alignment conversation into a PRD.&lt;/p&gt;

&lt;p&gt;The PRD should not be a bloated corporate artifact. It should capture the destination.&lt;/p&gt;

&lt;p&gt;A useful PRD contains:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Feature: Gamification System&lt;/span&gt;

&lt;span class="gu"&gt;## Problem&lt;/span&gt;

Students start courses but do not consistently return or complete lessons.

&lt;span class="gu"&gt;## Solution&lt;/span&gt;

Add a lightweight gamification system with points, levels, and streaks to increase visible progress and motivation.

&lt;span class="gu"&gt;## User Stories&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; As a student, I can earn points when I complete lessons.
&lt;span class="p"&gt;-&lt;/span&gt; As a student, I can see my current points on the dashboard.
&lt;span class="p"&gt;-&lt;/span&gt; As a student, I can see my level progression.
&lt;span class="p"&gt;-&lt;/span&gt; As an instructor/admin, I can trust that points are derived from real completion events.

&lt;span class="gu"&gt;## Implementation Decisions&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; Points are awarded for lesson completion.
&lt;span class="p"&gt;-&lt;/span&gt; Video watch events are excluded because they are noisy and gameable.
&lt;span class="p"&gt;-&lt;/span&gt; Existing completion records may be backfilled.
&lt;span class="p"&gt;-&lt;/span&gt; Streaks are tracked separately from points.

&lt;span class="gu"&gt;## Out of Scope&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; Leaderboards.
&lt;span class="p"&gt;-&lt;/span&gt; Social sharing.
&lt;span class="p"&gt;-&lt;/span&gt; Complex achievements.
&lt;span class="p"&gt;-&lt;/span&gt; Manual admin point editing.

&lt;span class="gu"&gt;## Testing Decisions&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; Core point logic is tested in a dedicated gamification service.
&lt;span class="p"&gt;-&lt;/span&gt; Integration tests cover lesson completion triggering point awards.
&lt;span class="p"&gt;-&lt;/span&gt; UI smoke tests verify dashboard visibility.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The PRD is not the implementation. It is the destination.&lt;/p&gt;

&lt;p&gt;The point is to move from “vague intent” to “clear target.”&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: Do Not Use Linear Phase Plans by Default
&lt;/h2&gt;

&lt;p&gt;A common AI workflow is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Phase 1: Database schema
Phase 2: Backend services
Phase 3: API routes
Phase 4: Frontend UI
Phase 5: Tests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This looks organized, but it has a major flaw: it is horizontal.&lt;/p&gt;

&lt;p&gt;The agent builds layer by layer, but you do not get useful feedback until late in the process. The database may be done, the backend may be done, and the UI may be partially done before you discover that the full flow does not actually work.&lt;/p&gt;

&lt;p&gt;That is bad engineering.&lt;/p&gt;

&lt;p&gt;A better approach is to break work into &lt;strong&gt;vertical slices&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 4: Prefer Vertical Slices / Tracer Bullets
&lt;/h2&gt;

&lt;p&gt;A vertical slice crosses the full stack and produces something testable.&lt;/p&gt;

&lt;p&gt;Bad first task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Create the gamification database schema and service.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Better first task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Award points when a student completes a lesson and show the points on the dashboard.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That first slice may include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A minimal database change.&lt;/li&gt;
&lt;li&gt;A gamification service.&lt;/li&gt;
&lt;li&gt;A lesson completion hook.&lt;/li&gt;
&lt;li&gt;A dashboard display.&lt;/li&gt;
&lt;li&gt;A test proving points are awarded.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is more valuable because the system becomes testable immediately.&lt;/p&gt;

&lt;p&gt;The agent gets feedback earlier. The human gets something visible earlier. The architecture gets pressure-tested earlier.&lt;/p&gt;

&lt;p&gt;This is the same idea as tracer bullets from &lt;em&gt;The Pragmatic Programmer&lt;/em&gt;: build a thin, end-to-end path through the system so you can see where you are aiming.&lt;/p&gt;
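&lt;p&gt;As a rough sketch, the core of that first slice could be as small as one write path and one read path. The names (&lt;code&gt;onLessonCompleted&lt;/code&gt;, &lt;code&gt;getDashboardPoints&lt;/code&gt;) and the in-memory store are illustrative assumptions, not code from the article:&lt;/p&gt;

```typescript
// A minimal tracer-bullet slice: award points on lesson completion,
// read them back for the dashboard. In-memory storage stands in for
// the real database; names and point values are placeholders.
const POINTS_PER_LESSON = 10;

const awarded = new Set<string>();            // "userId:lessonId" keys already awarded
const pointsByUser = new Map<string, number>();

// The lesson completion hook: the single write path for points.
function onLessonCompleted(userId: string, lessonId: string): void {
  const key = `${userId}:${lessonId}`;
  if (awarded.has(key)) return;               // completing twice must not double-award
  awarded.add(key);
  pointsByUser.set(userId, (pointsByUser.get(userId) ?? 0) + POINTS_PER_LESSON);
}

// The dashboard read path: what the student sees.
function getDashboardPoints(userId: string): number {
  return pointsByUser.get(userId) ?? 0;
}
```

&lt;p&gt;Thin as it is, this already gives you something to test end to end: complete a lesson, see points change on the dashboard.&lt;/p&gt;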




&lt;h2&gt;
  
  
  Step 5: Convert the PRD into a Kanban Board, Not a Sequential Script
&lt;/h2&gt;

&lt;p&gt;Instead of one long plan, convert the PRD into independently grabbable issues.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Issue 1: Award lesson completion points and display them on dashboard
Blocks: none
Type: AFK

Issue 2: Track student streaks
Blocks: Issue 1
Type: AFK

Issue 3: Add level progression based on accumulated points
Blocks: Issue 1
Type: AFK

Issue 4: Backfill points for existing lesson completions
Blocks: Issue 1
Type: AFK

Issue 5: Add dashboard polish and empty states
Blocks: Issues 1, 2, 3
Type: Human review
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives you a directed acyclic graph of work.&lt;/p&gt;

&lt;p&gt;That matters because agents can work in parallel only when dependencies are clear.&lt;/p&gt;

&lt;p&gt;A linear plan can usually be executed by one agent. A Kanban-style graph can be executed by multiple agents safely.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 6: Separate Human-in-the-Loop Work from AFK Work
&lt;/h2&gt;

&lt;p&gt;Not all tasks should be delegated equally.&lt;/p&gt;

&lt;p&gt;Some work needs humans:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Product alignment.&lt;/li&gt;
&lt;li&gt;Domain decisions.&lt;/li&gt;
&lt;li&gt;Architecture boundaries.&lt;/li&gt;
&lt;li&gt;UX judgment.&lt;/li&gt;
&lt;li&gt;QA.&lt;/li&gt;
&lt;li&gt;Final code review.&lt;/li&gt;
&lt;li&gt;Tradeoff decisions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some work can be AFK:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implementing a well-scoped issue.&lt;/li&gt;
&lt;li&gt;Adding tests.&lt;/li&gt;
&lt;li&gt;Running type checks.&lt;/li&gt;
&lt;li&gt;Fixing straightforward failures.&lt;/li&gt;
&lt;li&gt;Refactoring within a clear boundary.&lt;/li&gt;
&lt;li&gt;Generating boilerplate.&lt;/li&gt;
&lt;li&gt;Applying known patterns.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The practical split is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Human-in-the-loop:
Idea → Grilling → PRD → Issue breakdown → QA → Review

AFK:
Issue implementation → Tests → Type checks → Automated review → Commit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the “day shift / night shift” model.&lt;/p&gt;

&lt;p&gt;Humans prepare the backlog and define quality. Agents execute scoped tasks.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 7: Use TDD as an Agent Control Mechanism
&lt;/h2&gt;

&lt;p&gt;TDD is not just a human discipline. It is especially useful for AI agents.&lt;/p&gt;

&lt;p&gt;The pattern is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="p"&gt;1.&lt;/span&gt; Write a failing test.
&lt;span class="p"&gt;2.&lt;/span&gt; Confirm it fails for the right reason.
&lt;span class="p"&gt;3.&lt;/span&gt; Implement the smallest change.
&lt;span class="p"&gt;4.&lt;/span&gt; Run the test.
&lt;span class="p"&gt;5.&lt;/span&gt; Refactor.
&lt;span class="p"&gt;6.&lt;/span&gt; Run full feedback loops.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why this works well with agents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It prevents the agent from coding blind.&lt;/li&gt;
&lt;li&gt;It gives the agent immediate feedback.&lt;/li&gt;
&lt;li&gt;It makes cheating harder.&lt;/li&gt;
&lt;li&gt;It forces the agent to encode expected behavior before implementation.&lt;/li&gt;
&lt;li&gt;It leaves the codebase better tested after each task.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without tests, agents tend to hallucinate correctness. With tests, they have a feedback loop.&lt;/p&gt;

&lt;p&gt;Bad codebases produce bad agents partly because they lack feedback loops.&lt;/p&gt;
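&lt;p&gt;To make the loop concrete, here is a toy red-green cycle for a points-to-level rule. The test encodes the expected behavior first; the function is the smallest change that satisfies it. The rule itself (one level per 100 points) is an illustrative assumption:&lt;/p&gt;

```typescript
// Step 1: encode expected behavior before implementing it.
// Run against a stub first, this throws -- failing for the right reason.
function testLevelForPoints(levelFn: (points: number) => number): void {
  if (levelFn(0) !== 1) throw new Error("0 points should be level 1");
  if (levelFn(99) !== 1) throw new Error("99 points should still be level 1");
  if (levelFn(100) !== 2) throw new Error("100 points should reach level 2");
}

// Step 3: the smallest implementation that makes the test pass.
function levelForPoints(points: number): number {
  return Math.floor(points / 100) + 1;
}

// Step 4: run the test.
testLevelForPoints(levelForPoints);
```

&lt;p&gt;The agent cannot claim the feature works until this stops throwing, which is exactly the feedback loop a prose-only prompt lacks.&lt;/p&gt;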




&lt;h2&gt;
  
  
  Step 8: Improve the Codebase for Agents by Deepening Modules
&lt;/h2&gt;

&lt;p&gt;A codebase made of many tiny, shallow modules is hard for both humans and agents to reason about.&lt;/p&gt;

&lt;p&gt;Shallow modules often look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;function A depends on helper B
helper B depends on utility C
utility C depends on config D
service E calls A, B, and C directly
tests mock half the graph
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The dependency graph is hard to understand.&lt;/li&gt;
&lt;li&gt;Test boundaries are unclear.&lt;/li&gt;
&lt;li&gt;Agents modify the wrong layer.&lt;/li&gt;
&lt;li&gt;Small changes cause unexpected breakage.&lt;/li&gt;
&lt;li&gt;The agent has to inspect too many files to understand one behavior.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A better structure uses &lt;strong&gt;deep modules&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A deep module has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A small public interface.&lt;/li&gt;
&lt;li&gt;Significant internal functionality.&lt;/li&gt;
&lt;li&gt;Clear ownership of behavior.&lt;/li&gt;
&lt;li&gt;A natural test boundary.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;AwardLessonCompletionPointsInput&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;
  &lt;span class="na"&gt;lessonId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;
  &lt;span class="na"&gt;completedAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;GamificationService&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;awardLessonCompletionPoints&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;AwardLessonCompletionPointsInput&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nf"&gt;getStudentProgress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="na"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;StudentGamificationProgress&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Internally, the service may do many things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check whether points were already awarded.&lt;/li&gt;
&lt;li&gt;Insert a point event.&lt;/li&gt;
&lt;li&gt;Update streaks.&lt;/li&gt;
&lt;li&gt;Recalculate level.&lt;/li&gt;
&lt;li&gt;Return dashboard data.&lt;/li&gt;
&lt;/ul&gt;
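&lt;p&gt;A hedged sketch of what could live behind that small interface, keeping all of those concerns internal. Storage is in-memory here and the point value is a placeholder; a real service would talk to the database:&lt;/p&gt;

```typescript
type StudentGamificationProgress = { points: number; level: number };

// One factory, small public surface, all internals hidden from callers.
function createGamificationService() {
  const awarded = new Set<string>();        // idempotency: one award per lesson
  const points = new Map<string, number>();

  return {
    async awardLessonCompletionPoints(input: {
      userId: string;
      lessonId: string;
      completedAt: Date;
    }): Promise<void> {
      const key = `${input.userId}:${input.lessonId}`;
      if (awarded.has(key)) return;         // already awarded: do nothing
      awarded.add(key);
      points.set(input.userId, (points.get(input.userId) ?? 0) + 10);
    },

    async getStudentProgress(userId: string): Promise<StudentGamificationProgress> {
      const p = points.get(userId) ?? 0;
      return { points: p, level: Math.floor(p / 100) + 1 }; // level derived, not stored
    },
  };
}
```

&lt;p&gt;The idempotency check, the level formula, and the storage choice can all change without touching a single caller.&lt;/p&gt;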

&lt;p&gt;But callers do not need to know that.&lt;/p&gt;

&lt;p&gt;This is good for humans and good for agents.&lt;/p&gt;

&lt;p&gt;The human owns the interface. The agent can implement the internals.&lt;/p&gt;

&lt;p&gt;That is the right abstraction boundary.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 9: Use Push vs Pull Context Deliberately
&lt;/h2&gt;

&lt;p&gt;Do not dump every rule into every prompt.&lt;/p&gt;

&lt;p&gt;There are two ways to provide context to an agent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Push context
&lt;/h3&gt;

&lt;p&gt;You always include it.&lt;/p&gt;

&lt;p&gt;Examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Follow these coding standards.
Use strict TypeScript.
Do not introduce new dependencies.
Run tests before committing.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Push context is useful for reviewers and critical constraints.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pull context
&lt;/h3&gt;

&lt;p&gt;You make information available, and the agent retrieves it when needed.&lt;/p&gt;

&lt;p&gt;Examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;/skills/react-patterns.md
/skills/database-migrations.md
/skills/testing-guidelines.md
/architecture/gamification.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pull context is useful for implementation guidance that is not always needed.&lt;/p&gt;

&lt;p&gt;A good rule:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Push constraints to reviewers.&lt;/li&gt;
&lt;li&gt;Let implementers pull guidance when needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The reviewer should be stricter than the implementer.&lt;/p&gt;
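&lt;p&gt;One way to picture the split in code: pushed constraints go into every prompt verbatim, while pull context is only surfaced as pointers the agent may follow. The keyword index and &lt;code&gt;buildPrompt&lt;/code&gt; helper below are illustrative, not any real framework's API:&lt;/p&gt;

```typescript
// Push context: always included, word for word.
const pushContext = [
  "Follow these coding standards.",
  "Use strict TypeScript.",
  "Do not introduce new dependencies.",
  "Run tests before committing.",
];

// Pull context: an index of skill files, surfaced only when relevant.
const pullIndex: Record<string, string> = {
  migration: "/skills/database-migrations.md",
  test: "/skills/testing-guidelines.md",
  react: "/skills/react-patterns.md",
};

function buildPrompt(task: string): string {
  // List only the pull files whose topic appears in the task;
  // the agent decides whether to actually open them.
  const relevant = Object.entries(pullIndex)
    .filter(([keyword]) => task.toLowerCase().includes(keyword))
    .map(([, path]) => path);

  return [
    ...pushContext,                          // always pushed
    `Task: ${task}`,
    ...(relevant.length ? ["Relevant skill files:", ...relevant] : []),
  ].join("\n");
}
```

&lt;p&gt;The constraints cost a few tokens on every call; the skill files cost nothing until a task actually needs them.&lt;/p&gt;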




&lt;h2&gt;
  
  
  Step 10: Always Review in a Fresh Context
&lt;/h2&gt;

&lt;p&gt;If the same agent implements and reviews in one long session, the review often happens in the “dumb zone.”&lt;/p&gt;

&lt;p&gt;Better:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Session 1:
Implement issue.

Clear context.

Session 2:
Review the diff against the issue, coding standards, and architecture rules.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This keeps the reviewer sharper.&lt;/p&gt;

&lt;p&gt;It also reduces self-justification. Agents are less likely to catch their own mistakes when they are still carrying the implementation history.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 11: QA Is Where Taste Re-enters the System
&lt;/h2&gt;

&lt;p&gt;Automated tests are necessary, but they are not enough.&lt;/p&gt;

&lt;p&gt;Human QA is where you impose taste.&lt;/p&gt;

&lt;p&gt;This is especially true for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frontend behavior.&lt;/li&gt;
&lt;li&gt;UX quality.&lt;/li&gt;
&lt;li&gt;Product feel.&lt;/li&gt;
&lt;li&gt;Naming.&lt;/li&gt;
&lt;li&gt;Edge cases.&lt;/li&gt;
&lt;li&gt;“Does this actually solve the problem?”&lt;/li&gt;
&lt;li&gt;“Would I be happy merging this?”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you automate everything from idea to QA, you often get software that technically exists but lacks judgment.&lt;/p&gt;

&lt;p&gt;That is how teams produce AI slop.&lt;/p&gt;

&lt;p&gt;The human role is not disappearing. It is moving upward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Less typing.&lt;/li&gt;
&lt;li&gt;More shaping.&lt;/li&gt;
&lt;li&gt;More reviewing.&lt;/li&gt;
&lt;li&gt;More boundary-setting.&lt;/li&gt;
&lt;li&gt;More taste enforcement.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  A Practical Workflow You Can Steal
&lt;/h2&gt;

&lt;p&gt;Here is the full loop.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="p"&gt;1.&lt;/span&gt; Start with a vague idea or client brief.
&lt;span class="p"&gt;
2.&lt;/span&gt; Run a grilling session.
   Goal: reach shared understanding.
&lt;span class="p"&gt;
3.&lt;/span&gt; Convert the conversation into a PRD.
   Goal: define the destination.
&lt;span class="p"&gt;
4.&lt;/span&gt; Convert the PRD into vertical-slice issues.
   Goal: create independently grabbable tasks.
&lt;span class="p"&gt;
5.&lt;/span&gt; Mark each issue:
&lt;span class="p"&gt;   -&lt;/span&gt; Human-in-the-loop
&lt;span class="p"&gt;   -&lt;/span&gt; AFK
&lt;span class="p"&gt;   -&lt;/span&gt; Blocked by X
&lt;span class="p"&gt;   -&lt;/span&gt; Blocks Y
&lt;span class="p"&gt;
6.&lt;/span&gt; Run one agent per available AFK issue.
   Goal: scoped implementation.
&lt;span class="p"&gt;
7.&lt;/span&gt; Require TDD and feedback loops.
   Goal: prevent blind coding.
&lt;span class="p"&gt;
8.&lt;/span&gt; Run automated review in a fresh context.
   Goal: catch obvious problems.
&lt;span class="p"&gt;
9.&lt;/span&gt; Human QA and code review.
   Goal: enforce correctness and taste.
&lt;span class="p"&gt;
10.&lt;/span&gt; Add new issues from QA findings.
    Goal: keep the Kanban board alive.
&lt;span class="p"&gt;
11.&lt;/span&gt; Merge only when the slice is coherent.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Bigger Lesson
&lt;/h2&gt;

&lt;p&gt;AI coding is not replacing software engineering fundamentals.&lt;/p&gt;

&lt;p&gt;It is punishing teams that ignored them.&lt;/p&gt;

&lt;p&gt;If your codebase has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Poor tests.&lt;/li&gt;
&lt;li&gt;Shallow modules.&lt;/li&gt;
&lt;li&gt;Unclear boundaries.&lt;/li&gt;
&lt;li&gt;Weak architecture.&lt;/li&gt;
&lt;li&gt;Vague requirements.&lt;/li&gt;
&lt;li&gt;No review discipline.&lt;/li&gt;
&lt;li&gt;No product taste.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then agents will amplify the mess.&lt;/p&gt;

&lt;p&gt;If your codebase has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clear modules.&lt;/li&gt;
&lt;li&gt;Strong feedback loops.&lt;/li&gt;
&lt;li&gt;Small vertical slices.&lt;/li&gt;
&lt;li&gt;Explicit requirements.&lt;/li&gt;
&lt;li&gt;Testable behavior.&lt;/li&gt;
&lt;li&gt;Good review practices.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then agents can move extremely fast.&lt;/p&gt;

&lt;p&gt;The future of software development is not “write specs and ignore code.”&lt;/p&gt;

&lt;p&gt;It is closer to this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Developers design the system, define the boundaries, create the feedback loops, and delegate scoped implementation to agents.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is a much stronger model than vibe coding.&lt;/p&gt;

&lt;p&gt;And it is much closer to real engineering.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
