<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Andrew Cappelli</title>
    <description>The latest articles on Forem by Andrew Cappelli (@andrewmostlikely).</description>
    <link>https://forem.com/andrewmostlikely</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3858315%2F59167117-756c-40fe-8a35-9203e65b1566.jpeg</url>
      <title>Forem: Andrew Cappelli</title>
      <link>https://forem.com/andrewmostlikely</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/andrewmostlikely"/>
    <language>en</language>
    <item>
      <title>I kept breaking my AI features by accident so I built unit tests for prompts</title>
      <dc:creator>Andrew Cappelli</dc:creator>
      <pubDate>Thu, 02 Apr 2026 20:38:09 +0000</pubDate>
      <link>https://forem.com/andrewmostlikely/i-kept-breaking-my-ai-features-by-accident-so-i-built-unit-tests-for-prompts-2la6</link>
      <guid>https://forem.com/andrewmostlikely/i-kept-breaking-my-ai-features-by-accident-so-i-built-unit-tests-for-prompts-2la6</guid>
      <description>&lt;h2&gt;
  
  
  We test our code. Why don’t we test our AI?
&lt;/h2&gt;

&lt;p&gt;When I started shipping LLM-powered features, I ran into a problem nobody warned me about: &lt;strong&gt;prompt drift&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I'd make a small tweak to a prompt, or swap to a newer model, and the outputs would silently change.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wrong format
&lt;/li&gt;
&lt;li&gt;Different tone
&lt;/li&gt;
&lt;li&gt;Missing information
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And I’d only find out a week later from a user report.&lt;/p&gt;

&lt;p&gt;The fix in traditional software is obvious: &lt;strong&gt;unit tests&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But nobody had done the simple version for prompts.&lt;/p&gt;

&lt;p&gt;So I built it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Introducing &lt;code&gt;prompt-ci&lt;/code&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;prompt-ci
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It works in three commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;prompt-ci init      &lt;span class="c"&gt;# create a config file&lt;/span&gt;
prompt-ci record    &lt;span class="c"&gt;# run your prompts, save outputs as golden files&lt;/span&gt;
prompt-ci check     &lt;span class="c"&gt;# compare current outputs to golden, fail if they drift&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What it actually does
&lt;/h2&gt;

&lt;p&gt;You define your prompts and test inputs in a YAML config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropic&lt;/span&gt;
&lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claude-haiku-4-5-20251001&lt;/span&gt;
&lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.80&lt;/span&gt;

&lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;summarize_bullets&lt;/span&gt;
    &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;exactly&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;3&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;bullet&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;points:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{input}}"&lt;/span&gt;
    &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Your&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;article&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;here..."&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sentiment_check&lt;/span&gt;
    &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reply&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;one&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;word&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;positive,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;negative,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;or&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;neutral:"&lt;/span&gt;
    &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;love&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;this&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;product!"&lt;/span&gt;
    &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.95&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;prompt-ci record
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This saves outputs to &lt;code&gt;.golden/&lt;/code&gt; as JSON.&lt;/p&gt;

&lt;p&gt;Commit that directory -&amp;gt; it becomes your &lt;strong&gt;locked expected behavior&lt;/strong&gt;.&lt;/p&gt;
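&lt;p&gt;A golden file is just the recorded output plus enough metadata to re-run the test. The exact schema is up to the tool; the shape below is purely illustrative:&lt;/p&gt;

```json
{
  "name": "sentiment_check",
  "model": "claude-haiku-4-5-20251001",
  "output": "positive"
}
```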




&lt;h2&gt;
  
  
  Catch regressions in CI
&lt;/h2&gt;

&lt;p&gt;On every PR:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;prompt-ci check
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This re-runs the prompts and scores how similar the new outputs are to the golden files.&lt;/p&gt;

&lt;p&gt;If the score drops below your threshold:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exit code 1
&lt;/li&gt;
&lt;li&gt;CI fails
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Semantic similarity, not string matching
&lt;/h2&gt;

&lt;p&gt;This is the important part.&lt;/p&gt;

&lt;p&gt;Exact string matching would be useless -&amp;gt; LLM outputs vary naturally from run to run.&lt;/p&gt;

&lt;p&gt;And a fuzzy string diff misses the point too: what matters is whether the meaning and format survived, not the exact wording.&lt;/p&gt;

&lt;p&gt;Instead, &lt;code&gt;prompt-ci&lt;/code&gt; uses &lt;strong&gt;LLM-as-a-judge&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;It sends both outputs to your model and asks it to rate &lt;strong&gt;semantic equivalence on a 0.0–1.0 scale&lt;/strong&gt;.&lt;/p&gt;
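&lt;p&gt;The judge step can be sketched in a few lines of Python. This is illustrative only, not &lt;code&gt;prompt-ci&lt;/code&gt;'s actual internals: the template wording and the &lt;code&gt;parse_score&lt;/code&gt; helper are hypothetical, and the API call itself is omitted.&lt;/p&gt;

```python
# Sketch of LLM-as-a-judge scoring (illustrative, not prompt-ci's internals):
# build a judge prompt from the two outputs, then parse the model's reply
# into a 0.0-1.0 score.
import re

JUDGE_TEMPLATE = """You are grading two LLM outputs for semantic equivalence.
Consider meaning, format, and completeness.

Expected output:
{expected}

Actual output:
{actual}

Reply with a single number between 0.0 and 1.0."""


def build_judge_prompt(expected: str, actual: str) -> str:
    """Fill the judge template with the golden and current outputs."""
    return JUDGE_TEMPLATE.format(expected=expected, actual=actual)


def parse_score(reply: str) -> float:
    """Pull the first number out of the judge's reply and clamp it to [0, 1]."""
    match = re.search(r"\d*\.?\d+", reply)
    if match is None:
        raise ValueError(f"no score found in judge reply: {reply!r}")
    return max(0.0, min(1.0, float(match.group())))
```

&lt;p&gt;The score from &lt;code&gt;parse_score&lt;/code&gt; is what gets compared against the &lt;code&gt;threshold&lt;/code&gt; in the config.&lt;/p&gt;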

&lt;p&gt;This catches real regressions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FAIL summarize_bullets  score=0.61  threshold=0.80

Expected:
- Revenue grew 23% YoY
- Margins expanded to 18%
- Guidance raised for Q4

Actual:
Revenue increased significantly year over year,
with notable margin improvements...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same facts, completely different format.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That’s a regression.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  GitHub Actions in one step
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prompt regression tests&lt;/span&gt;
  &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.ANTHROPIC_API_KEY }}&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prompt-ci check&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Try it without an API key
&lt;/h2&gt;

&lt;p&gt;Use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mock&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This enables a local dry run -&amp;gt; no API key needed.&lt;/p&gt;

&lt;p&gt;It uses &lt;strong&gt;token overlap scoring&lt;/strong&gt; instead.&lt;/p&gt;
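&lt;p&gt;Token overlap can be as simple as Jaccard similarity over whitespace-split tokens. A minimal sketch of the idea (my own illustration of the technique, not the mock provider's exact code):&lt;/p&gt;

```python
# Sketch of token-overlap scoring: Jaccard similarity over the sets of
# lowercased whitespace-split tokens in the two outputs.
def token_overlap(expected: str, actual: str) -> float:
    a = set(expected.lower().split())
    b = set(actual.lower().split())
    if not a and not b:
        return 1.0  # two empty outputs are trivially identical
    return len(a & b) / len(a | b)
```

&lt;p&gt;It's a much cruder signal than an LLM judge, but it's deterministic and free, which is exactly what you want for a local dry run.&lt;/p&gt;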




&lt;h2&gt;
  
  
  Get it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;prompt-ci
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GitHub:&lt;br&gt;
&lt;a href="https://github.com/Andrew-most-likely/prompt-ci" rel="noopener noreferrer"&gt;https://github.com/Andrew-most-likely/prompt-ci&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Final thought
&lt;/h2&gt;

&lt;p&gt;If you're shipping anything with LLMs, you need this.&lt;/p&gt;

&lt;p&gt;Curious what prompt testing workflows others are using -&amp;gt; drop them in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
