<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Raphael Porto</title>
    <description>The latest articles on Forem by Raphael Porto (@raphaelpor).</description>
    <link>https://forem.com/raphaelpor</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F210664%2F4c461eaa-94d1-427a-8296-e9ac448e403b.jpg</url>
      <title>Forem: Raphael Porto</title>
      <link>https://forem.com/raphaelpor</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/raphaelpor"/>
    <language>en</language>
    <item>
      <title>Self-improving Coding Agents</title>
      <dc:creator>Raphael Porto</dc:creator>
      <pubDate>Fri, 27 Mar 2026 23:51:31 +0000</pubDate>
      <link>https://forem.com/raphaelpor/self-improving-coding-agents-435l</link>
      <guid>https://forem.com/raphaelpor/self-improving-coding-agents-435l</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;In this article, I demonstrate how a more capable coding agent (Codex/GPT-5.4) successfully refined the prompt for a less capable agent (GitHub Copilot CLI/GPT-5-mini) using a custom evaluation tool called &lt;a href="https://github.com/raphaelpor/katt" rel="noopener noreferrer"&gt;Katt&lt;/a&gt;. This "Test-Driven Agentic Workflow" (TDAW) increased passing evals from 0/3 to 3/3, but also significantly increased runtime and token usage, demonstrating that an agent can improve a prompt's effectiveness by clarifying instructions and fixing the evaluation process itself.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why this article now?
&lt;/h2&gt;

&lt;p&gt;I realized I haven't said much about my AI research on coding agents and the ability to extract accurate outputs, even from low-capability models. I've spent the last 3 years experimenting with ML and AI, and the last 1.5 years with coding agents, using my engineering skills to build different harness strategies on top of them. With this recent experiment, I felt it was time to be more outspoken and share some of my learnings and my journey with AI.&lt;/p&gt;

&lt;p&gt;So here comes my first article in many years. Let's start by talking about &lt;a href="https://github.com/raphaelpor/katt" rel="noopener noreferrer"&gt;Katt&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://github.com/raphaelpor/katt" rel="noopener noreferrer"&gt;Katt&lt;/a&gt;: A benchmark/evals tool for testing coding agents
&lt;/h2&gt;

&lt;p&gt;Recently, I decided to create a simple CLI tool, inspired by unit testing libraries like Jest and Vitest, to run on top of coding agents like Claude Code, Codex, and GitHub Copilot. It aims for a more deterministic way of working with automated/autonomous coding workflows while keeping a syntax very similar to the unit testing tools we already know.&lt;/p&gt;

&lt;p&gt;It can do a lot in terms of benchmarking and evaluation-harness efficiency. But my favorite value it creates is the ability to hand the tool to one agent and task it with improving another coding agent, in a quick loop that aims for the most efficient and accurate output.&lt;/p&gt;
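&lt;p&gt;I haven't shown Katt's syntax here, so as a rough illustration, here is what a Jest-style eval harness could look like. The names below (&lt;code&gt;evalCase&lt;/code&gt;, &lt;code&gt;runAgent&lt;/code&gt;) are hypothetical stand-ins of my own, not Katt's actual API:&lt;/p&gt;

```javascript
// Hypothetical sketch of a Jest-style eval harness.
// evalCase and runAgent are illustrative stand-ins, NOT Katt's real API.
const results = [];

// A tiny stand-in "test runner": run a case, record pass/fail and timing.
function evalCase(name, fn) {
  const start = Date.now();
  try {
    fn();
    results.push({ name, pass: true, ms: Date.now() - start });
  } catch (err) {
    results.push({ name, pass: false, ms: Date.now() - start, error: String(err) });
  }
}

// Stand-in for invoking a coding agent; a real harness would shell out
// to Claude Code, Codex, or Copilot CLI here and parse the response.
function runAgent(prompt) {
  return { issues: [{ id: 1, title: "Example issue" }] };
}

evalCase("lists issues as JSON", () => {
  const output = runAgent("List the most recent issues as JSON.");
  if (!Array.isArray(output.issues)) throw new Error("missing issues array");
});

console.log(results[0].pass); // true
```

&lt;p&gt;The point is only the shape: a named case, an agent invocation, and an assertion on the output, just like a unit test.&lt;/p&gt;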

&lt;p&gt;So let's break down this experiment.&lt;/p&gt;

&lt;h2&gt;
  
  
  The experiment
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;The main idea was simple: Can a coding agent improve the effectiveness of a prompt running on a less capable model?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For this to happen, Codex got access to a pre-existing test that used &lt;code&gt;gpt-5-mini&lt;/code&gt; and a simple prompt. The test expected a valid output that was verified against a snapshot, a method very similar to Jest's well-known &lt;code&gt;toMatchSnapshot()&lt;/code&gt;.&lt;/p&gt;
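&lt;p&gt;The snapshot idea itself is simple. Here is a minimal sketch of the underlying check (my own illustration, not Katt's implementation): serialize the agent's output and compare it to a previously recorded value.&lt;/p&gt;

```javascript
// Minimal sketch of snapshot-style verification of an agent's output.
// Not Katt's implementation, just the underlying idea: serialize the
// output deterministically and compare it against a stored snapshot.
function matchesSnapshot(output, snapshot) {
  return JSON.stringify(output) === JSON.stringify(snapshot);
}

// A stored snapshot and a fresh agent response with the same shape.
const snapshot = { issues: [{ id: 1, title: "First issue" }] };
const agentOutput = { issues: [{ id: 1, title: "First issue" }] };

console.log(matchesSnapshot(agentOutput, snapshot)); // true
```

&lt;p&gt;This also hints at why the eval later broke against live data: a snapshot freezes a moment in time, while GitHub issues keep changing.&lt;/p&gt;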

&lt;h2&gt;
  
  
  Tooling used
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Codex (&lt;code&gt;gpt-5.4&lt;/code&gt;, reasoning &lt;code&gt;xhigh&lt;/code&gt;) is the improver.&lt;/li&gt;
&lt;li&gt;GitHub Copilot CLI (&lt;code&gt;gpt-5-mini&lt;/code&gt;) is the agent under test.&lt;/li&gt;
&lt;li&gt;GitHub MCP tools provide live repository issues.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/raphaelpor/katt" rel="noopener noreferrer"&gt;Katt&lt;/a&gt; runs the eval three times and records pass/fail, time, and token usage.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The initial prompt for &lt;code&gt;GPT-5-mini&lt;/code&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Main goal: Using the Copilot MCP tools, list the 5 most recent issues in this repo.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The loop
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Codex follows the self-improvement instructions.&lt;/li&gt;
&lt;li&gt;Runs the Katt eval.&lt;/li&gt;
&lt;li&gt;Reads the failure.&lt;/li&gt;
&lt;li&gt;Makes one focused change.&lt;/li&gt;
&lt;li&gt;Commits it.&lt;/li&gt;
&lt;li&gt;Runs the eval again.&lt;/li&gt;
&lt;li&gt;Keeps what helps. Fixes what does not.&lt;/li&gt;
&lt;/ol&gt;
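&lt;p&gt;In code form, the loop looks roughly like this. Both &lt;code&gt;runEval&lt;/code&gt; and &lt;code&gt;improvePrompt&lt;/code&gt; are mocked stand-ins for Katt and Codex; the real harness shells out to both tools.&lt;/p&gt;

```javascript
// Sketch of the improvement loop. runEval and improvePrompt are mocked
// stand-ins for Katt and Codex respectively, not the real tools.
function improveUntilPassing(prompt, runEval, improvePrompt, maxRounds = 5) {
  const history = [];
  for (let round = 1; round <= maxRounds; round++) {
    const result = runEval(prompt);           // run the Katt eval
    history.push({ round, prompt, passed: result.passed });
    if (result.passed) break;                 // keep what helps
    prompt = improvePrompt(prompt, result);   // one focused change, then retry
  }
  return history;
}

// Mock: the eval passes only once the prompt demands strict JSON output,
// mirroring the first failure mode from the experiment.
const runEval = (p) => ({ passed: p.includes("JSON"), failure: "conversational output" });
const improvePrompt = (p) => p + " Return exactly one valid JSON object.";

const history = improveUntilPassing("List the 5 most recent issues.", runEval, improvePrompt);
console.log(history.length, history[history.length - 1].passed); // 2 true
```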

&lt;h2&gt;
  
  
  What GPT-5.4 changed along the way
&lt;/h2&gt;

&lt;p&gt;The interesting part is that Codex did not just keep piling on more prompt text. It discovered that there were two different problems:&lt;br&gt;
The tested agent needed clearer instructions.&lt;br&gt;
The evaluator itself was brittle because it compared live GitHub data against stale snapshots.&lt;/p&gt;

&lt;p&gt;Here is the breakdown of changes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Commit&lt;/th&gt;
&lt;th&gt;What changed&lt;/th&gt;
&lt;th&gt;Why it mattered&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bc096d4&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Baseline experiment setup&lt;/td&gt;
&lt;td&gt;A minimal prompt created a clean starting point.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;df9761c&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Forced exact JSON output, exact fields, exact count, and no extra text&lt;/td&gt;
&lt;td&gt;This attacked the first failure mode: Copilot answered conversationally instead of matching the expected schema.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;5ca4ce4&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Added explicit fetch depth, sorting, and tie-breaking rules&lt;/td&gt;
&lt;td&gt;This reduced ranking mistakes when issues had very similar timestamps.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;422c325&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Switched the eval from stale snapshots to live GitHub issue validation and pinned the repo source&lt;/td&gt;
&lt;td&gt;This was the big insight: sometimes the test is wrong, not just the agent.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;93df064&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Forced live MCP issue data, blocked local-file shortcuts, and required unlabeled issues to be included&lt;/td&gt;
&lt;td&gt;This stopped the model from falling back to stale examples inside the repo.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fca11b8&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Saved the before/after runs and Codex reasoning log&lt;/td&gt;
&lt;td&gt;This preserved the experiment as evidence instead of just a final state.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The final prompt
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Main goal: Using the Copilot MCP tools, list the 5 most recent issues in this repo.

Instructions:
&lt;span class="p"&gt;-&lt;/span&gt; Use the Copilot MCP tools to read issues from the GitHub repository declared by this project: &lt;span class="sb"&gt;`nentgroup/self-improving-agentic-workflow`&lt;/span&gt;.
&lt;span class="p"&gt;-&lt;/span&gt; Use the live Copilot MCP issue data as the only source of truth for the issue list.
&lt;span class="p"&gt;-&lt;/span&gt; Do not use local files in this repository, snapshots, previous results, or examples in this prompt as the source of issue data.
&lt;span class="p"&gt;-&lt;/span&gt; List issues only. Exclude pull requests.
&lt;span class="p"&gt;-&lt;/span&gt; Do not filter by labels. Include unlabeled issues if they are among the 5 most recent.
&lt;span class="p"&gt;-&lt;/span&gt; Fetch enough issues to determine the top 5 reliably. Do not rely on the tool's default ordering alone.
&lt;span class="p"&gt;-&lt;/span&gt; Sort the fetched issues yourself using this exact rule:
&lt;span class="p"&gt;  1.&lt;/span&gt; Newer &lt;span class="sb"&gt;`created_at`&lt;/span&gt; first.
&lt;span class="p"&gt;  2.&lt;/span&gt; If &lt;span class="sb"&gt;`created_at`&lt;/span&gt; is identical, higher issue number first.
&lt;span class="p"&gt;-&lt;/span&gt; Sort by &lt;span class="sb"&gt;`created_at`&lt;/span&gt; descending. If two issues have the same &lt;span class="sb"&gt;`created_at`&lt;/span&gt;, put the higher issue number first.
&lt;span class="p"&gt;-&lt;/span&gt; Return exactly 5 issues.
&lt;span class="p"&gt;-&lt;/span&gt; Return exactly one valid JSON object and nothing else. Do not use Markdown. Do not use code fences. Do not add explanations.
&lt;span class="p"&gt;-&lt;/span&gt; The JSON must match this shape exactly:
{
  "issues": [
    {
      "id": 123,
      "title": "Issue title",
      "labels": ["label-1", "label-2"],
      "created_at": "2026-03-25T21:51:52Z"
    }
  ]
}
&lt;span class="p"&gt;-&lt;/span&gt; Use the GitHub issue number as &lt;span class="sb"&gt;`id`&lt;/span&gt;.
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`labels`&lt;/span&gt; must be an array of label names only.
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`created_at`&lt;/span&gt; must be the ISO 8601 creation timestamp for the issue.
&lt;span class="p"&gt;-&lt;/span&gt; If the MCP tool returns no issues or an obviously incomplete result, retry before answering. Do not return an empty &lt;span class="sb"&gt;`issues`&lt;/span&gt; array unless the repository truly has no issues.
&lt;span class="p"&gt;-&lt;/span&gt; Before answering, verify the response is valid JSON, contains exactly 5 issue objects, includes only the keys &lt;span class="sb"&gt;`id`&lt;/span&gt;, &lt;span class="sb"&gt;`title`&lt;/span&gt;, &lt;span class="sb"&gt;`labels`&lt;/span&gt;, and &lt;span class="sb"&gt;`created_at`&lt;/span&gt;, and that no omitted live issue should come before the fifth issue under the sorting rule above.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
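&lt;p&gt;The sorting and verification rules in this prompt are deterministic, so they are easy to express, and sanity-check, in plain code. A sketch of my own (not part of the experiment):&lt;/p&gt;

```javascript
// Sketch of the prompt's sorting rule: newer created_at first;
// on identical timestamps, higher issue number first. Illustrative only.
function topFiveIssues(issues) {
  return [...issues]
    .sort((a, b) => {
      const byDate = Date.parse(b.created_at) - Date.parse(a.created_at);
      return byDate !== 0 ? byDate : b.id - a.id; // tie-break: higher number
    })
    .slice(0, 5);
}

// Sketch of the final verification step: exactly 5 issues, and each
// object carries only the four allowed keys.
function isValidResponse(response) {
  const allowed = ["id", "title", "labels", "created_at"];
  return (
    Array.isArray(response.issues) &&
    response.issues.length === 5 &&
    response.issues.every(
      (i) =>
        Object.keys(i).length === allowed.length &&
        allowed.every((k) => k in i)
    )
  );
}
```

&lt;p&gt;Pinning the tie-break to the issue number is what targeted the ranking flakiness when two issues shared a timestamp.&lt;/p&gt;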



&lt;h2&gt;
  
  
  Before vs After
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Passing evals&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0 / 3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;3 / 3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;+3&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total runtime&lt;/td&gt;
&lt;td&gt;&lt;code&gt;286,552 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;364,460 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;+27%&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Average runtime per eval&lt;/td&gt;
&lt;td&gt;&lt;code&gt;95.5 s&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;121.5 s&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;+26.0 s&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total tokens used&lt;/td&gt;
&lt;td&gt;&lt;code&gt;329,625&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;893,151&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;+171%&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Average tokens per eval&lt;/td&gt;
&lt;td&gt;&lt;code&gt;109,875&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;297,717&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;+171%&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Bonus stat: Codex spent about &lt;code&gt;58k&lt;/code&gt; tokens and &lt;code&gt;18m 48s&lt;/code&gt; to reach the final passing setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  The result
&lt;/h2&gt;

&lt;p&gt;Codex was able to generate a prompt that made GPT-5-mini use an MCP tool to retrieve information and return it in the expected output format, while staying accurate and avoiding hallucinations along the way. Another interesting touch was making the agent retry in case of a connection problem.&lt;/p&gt;

&lt;p&gt;It shows that it is possible to create a harness around agents that is very similar to TDD, but in this case, &lt;strong&gt;TDAW (Test-Driven Agentic Workflows)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Check out the details and the repo used in this project:&lt;br&gt;
&lt;a href="https://github.com/raphaelpor/self-improving-agentic-workflow" rel="noopener noreferrer"&gt;https://github.com/raphaelpor/self-improving-agentic-workflow&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next?
&lt;/h2&gt;

&lt;p&gt;I will continue to evaluate self-improving agents. As with every experiment I do, I don’t know whether they will really bring value to the real world. But my job is to scale them a bit further at every step and evaluate the current limits.&lt;/p&gt;

&lt;p&gt;See you in the next one.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>harness</category>
      <category>ai</category>
      <category>evals</category>
    </item>
    <item>
      <title>My humble tips for interviews in selection processes abroad</title>
      <dc:creator>Raphael Porto</dc:creator>
      <pubDate>Mon, 03 Feb 2020 08:00:57 +0000</pubDate>
      <link>https://forem.com/raphaelpor/minhas-humildes-dicas-para-entrevistas-em-processos-seletivos-para-o-exterior-4ljl</link>
      <guid>https://forem.com/raphaelpor/minhas-humildes-dicas-para-entrevistas-em-processos-seletivos-para-o-exterior-4ljl</guid>
      <description>&lt;p&gt;Olá! Faz tempo que não posto algo novo. Mas como um presente de ano novo, gostaria de compartilhar algumas coisas que aprendi em processos seletivos.&lt;/p&gt;

&lt;p&gt;Eu fiz algumas entrevistas ao longo da minha vida. Uma coisa que aprendi nisso tudo, é que você pode fazer uma entrevista por inúmeros motivos, além do óbvio, que é conseguir um emprego. Seja para praticar o inglês, conhecer outras empresas e como elas trabalham, entender seu valor no mercado… No fim, seja qual for o motivo, entrevistas para o exterior te fazem crescer como pessoa e como profissional.&lt;/p&gt;

&lt;p&gt;Portanto, essas são minha humildes dicas para caso você queira fazer um processo para o exterior.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try to schedule interviews in the morning.
&lt;/h2&gt;

&lt;p&gt;Morning interviews will help you organize your day better.&lt;/p&gt;

&lt;h2&gt;
  
  
  Be at the computer where you'll take the call 10 minutes before the agreed time.
&lt;/h2&gt;

&lt;p&gt;This tip complements the previous one. Wake up at least an hour early, take a shower, and have a light breakfast. Wear comfortable clothes, like a t-shirt, and always check that the room where you'll do the interview is cool so you don't start sweating during it (yes, it has happened to me hahaha).&lt;/p&gt;

&lt;h2&gt;
  
  
  Always keep a Google Translate tab open to help with specific words.
&lt;/h2&gt;

&lt;p&gt;Your mind always goes blank at some point, doesn't it?&lt;/p&gt;

&lt;h2&gt;
  
  
  Always pay attention to the time zone.
&lt;/h2&gt;

&lt;p&gt;Different time zones and daylight saving time are always a gotcha. Paying attention to this when scheduling the meeting is very important to avoid surprises.&lt;/p&gt;

&lt;h2&gt;
  
  
  Write and practice beforehand.
&lt;/h2&gt;

&lt;p&gt;If you're feeling a bit insecure, write down possible questions and what you would answer, in the language the interview will be held in.&lt;/p&gt;

&lt;p&gt;Write down common questions such as: “tell me a bit about your experience” or “why would you like to move to country X?”.&lt;/p&gt;

&lt;h2&gt;
  
  
  Be yourself.
&lt;/h2&gt;

&lt;p&gt;Don't try to act or be someone else during the interview. Recruiters interview people constantly and can often tell when you're not being genuine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Be open to tips and feedback.
&lt;/h2&gt;

&lt;p&gt;Even if you end up not passing the interview, ask for tips and/or feedback at the end. For example, if you were unsure about your level of English, ask whether it was good enough for the recruiter to understand you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Always reply to emails as soon as possible, even if it's a negative answer.
&lt;/h2&gt;

&lt;p&gt;Keeping an open channel with a recruiter is very important. You never know what might happen, right?&lt;/p&gt;

&lt;p&gt;That's it for now… If you like this, I can write about tips for technical challenges too.&lt;br&gt;
Big hug and happy 2020! ❤&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
