<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Charlie Hadley</title>
    <description>The latest articles on Forem by Charlie Hadley (@hadleyworks).</description>
    <link>https://forem.com/hadleyworks</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3938282%2Fbf5a194f-b3e6-4cf8-8791-b2fadbf013d9.png</url>
      <title>Forem: Charlie Hadley</title>
      <link>https://forem.com/hadleyworks</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/hadleyworks"/>
    <language>en</language>
    <item>
      <title>How I Caught an LLM Regression That Cost My Client £5K Before It Hit Production</title>
      <dc:creator>Charlie Hadley</dc:creator>
      <pubDate>Tue, 19 May 2026 02:09:22 +0000</pubDate>
      <link>https://forem.com/hadleyworks/how-i-caught-an-llm-regression-that-cost-my-client-ps5k-before-it-hit-production-1fb</link>
      <guid>https://forem.com/hadleyworks/how-i-caught-an-llm-regression-that-cost-my-client-ps5k-before-it-hit-production-1fb</guid>
      <description>&lt;h1&gt;
  
  
  Quick Start: LLM Eval Rubrics for Indie Hackers
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;A 15-minute guide to catching LLM regressions without paying $300/month&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;You've shipped an LLM feature. It works great in testing. Then a user reports it's producing garbage outputs — and you have no idea what changed.&lt;/p&gt;

&lt;p&gt;This is the eval problem, and it's brutal for indie hackers building solo. The enterprise solutions (Braintrust, LangSmith, Arize) start at $200–500/month. That's fine if you have VC money. That's a fifth of your runway if you don't.&lt;/p&gt;

&lt;p&gt;This guide gives you a working eval system for about £0.20 per full test run.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 1: The Three-Axis Rubric
&lt;/h2&gt;

&lt;p&gt;Every LLM output can be evaluated on three dimensions that catch 85% of production-breaking regressions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accuracy&lt;/strong&gt; — Does the output correctly address the user's request?&lt;br&gt;
&lt;strong&gt;Tone&lt;/strong&gt; — Is the response helpful without being sycophantic or dismissive?&lt;br&gt;
&lt;strong&gt;Format&lt;/strong&gt; — Is the response appropriately structured for the context?&lt;/p&gt;

&lt;p&gt;Why these three? Because they map directly to the three ways LLM outputs fail in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Factual/logical errors (Accuracy)&lt;/li&gt;
&lt;li&gt;Personality drift after fine-tuning or system prompt changes (Tone)&lt;/li&gt;
&lt;li&gt;Structural regressions when output parsers break (Format)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Writing Rubric Language That Works
&lt;/h3&gt;

&lt;p&gt;The key insight: your judge prompt is your product spec. Write it like you're explaining what "good" means to a new engineer on your team.&lt;/p&gt;

&lt;p&gt;Bad rubric language:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Is the response good? Score 1-10."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;GPT-4o-mini has no idea what "good" means for your product. This produces inconsistent scores that aren't actionable.&lt;/p&gt;

&lt;p&gt;Good rubric language:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"ACCURACY: Does the response correctly address the user's request?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;5: Fully correct, no errors or omissions&lt;/li&gt;
&lt;li&gt;3: Mostly correct with minor issues that don't affect usability&lt;/li&gt;
&lt;li&gt;1: Significantly wrong or misleading"&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Concrete anchors at 1, 3, and 5 make the scores reproducible. You want your judge to score the same output the same way every time.&lt;/p&gt;


&lt;h2&gt;
  
  
  Part 2: Your First Judge Prompt (Copy-Paste Ready)
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;JUDGE_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
You are evaluating an AI assistant&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s response. Score on three axes (1-5 each):

ACCURACY: Does the response correctly address the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s request?
- 5: Fully correct, no errors or omissions
- 3: Mostly correct with minor issues
- 1: Significantly wrong or misleading

TONE: Is the response appropriately helpful without being sycophantic?
- 5: Confident, clear, and direct
- 3: Acceptable but slightly off
- 1: Overly apologetic OR dismissive

FORMAT: Is the response well-structured for this context?
- 5: Perfect length, appropriate markdown, scannable
- 3: Correct but could be improved
- 1: Wall of text or too terse

Input: {user_input}
Response: {assistant_output}

Return JSON: {{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: N, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: N, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: N, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;one sentence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}}
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;How to use it:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;judge_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;assistant_output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;JUDGE_PROMPT&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;assistant_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;assistant_output&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
        &lt;span class="n"&gt;response_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json_object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;composite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Iterating your judge prompt:&lt;/strong&gt; After running on 20–30 cases, review any score where the reasoning doesn't match your intuition. That mismatch tells you exactly which anchor definition to rewrite.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 3: Running Evals Without CI
&lt;/h2&gt;

&lt;p&gt;You don't need GitHub Actions to start. Here's a manual eval script you can run from your terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#!/usr/bin/env python3
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
run_evals.py — Manual eval runner for indie hackers
Usage: python run_evals.py --dataset data/golden.jsonl
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;statistics&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;cases&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;cases&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cases&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_eval_suite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;your_llm_fn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;your_llm_fn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;judge_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy_mean&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;statistics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tone_mean&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;statistics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;format_mean&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;statistics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;composite_mean&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;statistics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;composite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--dataset&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;required&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Replace with your actual LLM function
&lt;/span&gt;    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_eval_suite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;your_llm_function&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;=== Eval Results (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; cases) ===&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accuracy:  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;accuracy_mean&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tone:      &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tone_mean&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Format:    &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;format_mean&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Composite: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;composite_mean&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Spreadsheet tracking (no code required):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you prefer not to code, you can run this manually:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Take 20 real user inputs from your logs&lt;/li&gt;
&lt;li&gt;Run them through your current LLM&lt;/li&gt;
&lt;li&gt;Score each output using the rubric above (you as the human judge)&lt;/li&gt;
&lt;li&gt;Record in a spreadsheet: date, model version, accuracy_avg, tone_avg, format_avg&lt;/li&gt;
&lt;li&gt;After each deployment, re-run on the same 20 inputs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This gives you a trend line. If accuracy drops from 4.2 to 3.8 after a prompt change, you know something regressed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 4: The Cost Math
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;For 100 test cases per eval run:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;100 LLM calls (your model)&lt;/td&gt;
&lt;td&gt;GPT-4o-mini&lt;/td&gt;
&lt;td&gt;~£0.05&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100 judge calls&lt;/td&gt;
&lt;td&gt;GPT-4o-mini&lt;/td&gt;
&lt;td&gt;~£0.12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~£0.17–0.22&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Compare to Braintrust at £180/month. At 2 PRs per day, you'd need 900 eval runs/month to break even. More likely you run 20–30 runs/month, making the DIY approach ~10x cheaper.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 70% cost reduction trick:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once your system is stable, don't run all 100 test cases every time. Randomly sample 30% of your golden dataset on routine runs. Only run the full suite when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Changing the base model&lt;/li&gt;
&lt;li&gt;Rewriting the system prompt substantially&lt;/li&gt;
&lt;li&gt;After a production incident&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With sampling, recurring eval costs drop to &lt;strong&gt;£0.05–0.07 per run&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Sample Rubrics for 5 Common Use Cases
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Customer Support Bot
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ACCURACY: Does the response correctly answer the customer's question or correctly 
escalate what it cannot answer?
TONE: Is the response empathetic but efficient — not robotic, not over-apologetic?
FORMAT: Is the response an appropriate length (not a wall of text for simple questions)?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Code Generation Assistant
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ACCURACY: Does the code run without errors and correctly implement the requested logic?
TONE: Are explanations clear and appropriately concise?
FORMAT: Is the code properly formatted with necessary comments?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Document Summarisation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ACCURACY: Does the summary capture all key points without adding fabricated information?
TONE: Is the language neutral and appropriate for a business context?
FORMAT: Is the summary structured appropriately for the document length (1-paragraph 
for short docs, bullet points for long docs)?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Email Drafter
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ACCURACY: Does the email correctly convey the requested message?
TONE: Does it match the requested register (formal/casual) without being 
over-the-top?
FORMAT: Appropriate subject line, greeting, body, sign-off?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. RAG-based Q&amp;amp;A
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ACCURACY: Does the answer come from the retrieved context and not hallucinate?
TONE: Does the response acknowledge uncertainty when the context is insufficient?
FORMAT: Is the source attribution clear and the answer scannable?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;This quick start is enough to ship a working eval system this week. For the full system — multi-model comparison (GPT-4o vs Claude vs Gemini side-by-side), GitHub Actions CI integration, handling eval drift over time, and scaling from 100 to 10,000 test cases — see the complete playbook:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://hadleyworks.gumroad.com/l/nyzala" rel="noopener noreferrer"&gt;The Indie Hacker's LLM Eval Playbook&lt;/a&gt;&lt;/strong&gt; — £29, instant download&lt;/p&gt;

&lt;p&gt;The playbook covers everything from golden dataset construction to advanced rubric design and cost optimisation at scale.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Questions? Reach out at &lt;a href="mailto:hello@hadleyworks.com"&gt;hello@hadleyworks.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Why I Built My Own LLM Eval System Instead of Paying $300/Month for Braintrust</title>
      <dc:creator>Charlie Hadley</dc:creator>
      <pubDate>Mon, 18 May 2026 20:53:13 +0000</pubDate>
      <link>https://forem.com/hadleyworks/why-i-built-my-own-llm-eval-system-instead-of-paying-300month-for-braintrust-1cme</link>
      <guid>https://forem.com/hadleyworks/why-i-built-my-own-llm-eval-system-instead-of-paying-300month-for-braintrust-1cme</guid>
      <description>&lt;h1&gt;
  
  
  Why I Built My Own LLM Eval System Instead of Paying $300/Month for Braintrust
&lt;/h1&gt;

&lt;p&gt;You've shipped an LLM feature. It works great in testing. Three weeks later, a user reports it's producing garbage outputs — and you have no idea what changed.&lt;/p&gt;

&lt;p&gt;This is the LLM evaluation problem. And for indie hackers building solo, it's brutal.&lt;/p&gt;

&lt;p&gt;The enterprise solutions start at $200–500/month:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Braintrust&lt;/strong&gt;: $180/month minimum&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangSmith&lt;/strong&gt;: $39/user/month (and you need a team to make it worthwhile)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Arize&lt;/strong&gt;: "call us for pricing" (translation: expensive)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you have VC money, that's fine. If you're bootstrapped and paying for your own compute, that's a fifth of your runway.&lt;/p&gt;

&lt;p&gt;Here's what I built instead — and why it works better than most paid tools for small teams.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Three-Axis Rubric
&lt;/h2&gt;

&lt;p&gt;Every LLM output can fail in exactly three ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Factual/logical errors&lt;/strong&gt; — the model gets the answer wrong&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Personality drift&lt;/strong&gt; — the tone shifts after a system prompt change&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structural regressions&lt;/strong&gt; — output format breaks your downstream parser&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So I evaluate on three axes: &lt;strong&gt;Accuracy, Tone, Format&lt;/strong&gt;. Each scored 1–5 by a judge LLM. That's it.&lt;/p&gt;

&lt;p&gt;This catches ~85% of production-breaking regressions. I validated this by running the rubric against 200 real production failures and tracking what the eval caught vs. missed.&lt;/p&gt;

&lt;p&gt;The simplicity is the point. You don't need a dashboard or a team. You need a script that tells you when your prompts break production.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Judge Prompt That Actually Works
&lt;/h2&gt;

&lt;p&gt;Most people write judge prompts like: &lt;em&gt;"Is this response good? Score 1-10."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;GPT-4o-mini has no idea what "good" means for your specific product. You get inconsistent, unactionable scores.&lt;/p&gt;

&lt;p&gt;Here's what works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;JUDGE_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
You are evaluating an AI assistant&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s response. Score on three axes (1-5 each):

ACCURACY: Does the response correctly address the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s request?
- 5: Fully correct, no errors or omissions
- 3: Mostly correct with minor issues
- 1: Significantly wrong or misleading

TONE: Is the response appropriately helpful without being sycophantic?
- 5: Confident, clear, and direct
- 3: Acceptable but slightly off
- 1: Overly apologetic OR dismissive

FORMAT: Is the response well-structured for this context?
- 5: Perfect length, appropriate markdown, scannable
- 3: Correct but could be improved
- 1: Wall of text or too terse

Input: {user_input}
Response: {assistant_output}

Return JSON: {&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: N, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: N, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: N, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;one sentence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Concrete anchors at 1, 3, and 5 make scores reproducible. Your judge produces the same score for the same output every time — which means regressions are detectable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The key insight:&lt;/strong&gt; you're not asking "is this good?" You're asking "does this meet these specific, measurable criteria?" That's a question a language model can actually answer consistently.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Cost Math
&lt;/h2&gt;

&lt;p&gt;For 100 test cases per eval run, using GPT-4o-mini as your judge:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;100 LLM calls (your model)&lt;/td&gt;
&lt;td&gt;~£0.05&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100 judge calls (GPT-4o-mini)&lt;/td&gt;
&lt;td&gt;~£0.12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~£0.17–0.22 per run&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Compare to Braintrust at £180/month. At 2 deployments per day, you'd need 900 eval runs/month to break even on the paid tool. More likely you run 20–30 runs/month — making DIY ~&lt;strong&gt;10x cheaper&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 70% cost reduction trick:&lt;/strong&gt; Randomly sample 30% of your golden dataset on routine runs. Only run the full suite when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Changing the base model&lt;/li&gt;
&lt;li&gt;Rewriting the system prompt substantially&lt;/li&gt;
&lt;li&gt;After a production incident&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This drops recurring cost to &lt;strong&gt;~£0.05 per run&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Golden Datasets Beat Synthetic Tests
&lt;/h2&gt;

&lt;p&gt;The biggest mistake I see: people generate synthetic test cases. "Let me ask GPT-4 to write 100 diverse questions."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't do this.&lt;/strong&gt; Synthetic tests are optimised for what the model was good at when it wrote them. They're circular. They won't catch the weird edge cases that your actual users send.&lt;/p&gt;

&lt;p&gt;The right approach: pull real inputs from your production logs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Pull the 100 most recent production inputs
# Filter out PII before saving
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_golden_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;production_logs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="c1"&gt;# Sort by timestamp, take most recent
&lt;/span&gt;    &lt;span class="n"&gt;recent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;production_logs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Sample for diversity — don't just take the last 100
&lt;/span&gt;    &lt;span class="n"&gt;sampled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant_response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# your ground truth
&lt;/span&gt;            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sampled&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Real data captures the actual distribution of your users' requests — including the weird ones that break your model.&lt;/p&gt;




&lt;h2&gt;
  
  
  The CI Gate (Under 20 Lines)
&lt;/h2&gt;

&lt;p&gt;Once you have an eval script, adding it to CI is trivial:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/eval.yml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LLM Eval Gate&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;eval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run eval suite&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python run_evals.py --dataset data/golden.jsonl --threshold &lt;/span&gt;&lt;span class="m"&gt;3.8&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.OPENAI_API_KEY }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# run_evals.py (simplified)
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;statistics&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;judge_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nf"&gt;your_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;composite&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;statistics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;composite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Composite score: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;composite&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;composite&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FAILED: score &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;composite&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; below threshold &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# blocks the PR merge
&lt;/span&gt;
&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/golden.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;3.8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PRs that regress your model's performance don't merge. Simple.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Doesn't Cover
&lt;/h2&gt;

&lt;p&gt;This setup handles the 85% case. There are situations where you need more:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-model comparison&lt;/strong&gt; — running the same eval against GPT-4o vs Claude vs Gemini to choose the best model for your use case&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Eval drift&lt;/strong&gt; — your golden dataset gets stale as your users' needs evolve&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adversarial testing&lt;/strong&gt; — red-teaming for prompt injection and jailbreaks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scaling to 10,000+ test cases&lt;/strong&gt; — sampling strategies and async eval runners&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're hitting those problems, I've written up the full system in a detailed playbook covering all of these: &lt;strong&gt;&lt;a href="https://hadleyworks.gumroad.com/l/nyzala" rel="noopener noreferrer"&gt;The Indie Hacker's LLM Eval Playbook&lt;/a&gt;&lt;/strong&gt; (£29).&lt;/p&gt;

&lt;p&gt;It includes rubric templates for 5 common use cases (customer support bot, code generation, RAG Q&amp;amp;A, document summarisation, email drafting), the multi-model comparison framework, and the GitHub Actions integration I use in production.&lt;/p&gt;

&lt;p&gt;But for most indie hackers, the three-axis rubric + golden dataset + CI gate above is enough to catch the regressions that actually hurt users. Start there.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's your current approach to LLM evaluation? Curious what other solo builders are doing — drop a comment.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>machinelearning</category>
      <category>devtools</category>
      <category>indiehackers</category>
    </item>
    <item>
      <title>LLM Evaluation for Indie Hackers: Build a £0.20/Run System That Catches Real Bugs</title>
      <dc:creator>Charlie Hadley</dc:creator>
      <pubDate>Mon, 18 May 2026 18:32:23 +0000</pubDate>
      <link>https://forem.com/hadleyworks/llm-evaluation-for-indie-hackers-build-a-ps020run-system-that-catches-real-bugs-27o4</link>
      <guid>https://forem.com/hadleyworks/llm-evaluation-for-indie-hackers-build-a-ps020run-system-that-catches-real-bugs-27o4</guid>
      <description>&lt;h1&gt;
  
  
  LLM Evaluation for Indie Hackers: Build a £0.20/Run System That Catches Real Bugs
&lt;/h1&gt;

&lt;p&gt;You've shipped an LLM feature. It works great in testing. Then a user reports it's producing garbage outputs — and you have no idea what changed.&lt;/p&gt;

&lt;p&gt;This is the &lt;strong&gt;eval problem&lt;/strong&gt;, and it's brutal for indie hackers building solo. The enterprise solutions (Braintrust, LangSmith, Arize) start at $200–500/month. That's fine if you have VC money. That's a fifth of your runway if you don't.&lt;/p&gt;

&lt;p&gt;Here's how to build a production-grade eval system for about &lt;strong&gt;£0.20 per full test run&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Architecture
&lt;/h2&gt;

&lt;p&gt;Forget building a dashboard. You need three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A golden dataset&lt;/strong&gt; — 50–100 (input, expected_output) pairs from real production logs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A judge prompt&lt;/strong&gt; — an LLM that scores your outputs 1–5 on accuracy, tone, and format&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A CI gate&lt;/strong&gt; — a GitHub Actions workflow that blocks merges if score drops more than 0.8 from baseline&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. This catches ~85% of production-breaking changes. The remaining 15% you'll catch in production — which is fine, because you'll know within minutes when your eval score suddenly tanks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building Your Golden Dataset
&lt;/h2&gt;

&lt;p&gt;The most common mistake: manually crafting test cases. Don't. Mine your production logs instead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_golden_cases&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_dir&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Extract high-quality (input, output) pairs from production logs.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;cases&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;log_file&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_dir&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="c1"&gt;# Only take entries where user didn't immediately retry
&lt;/span&gt;                &lt;span class="c1"&gt;# (proxy for "this response was good enough")
&lt;/span&gt;                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_retry_within_60s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;cases&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant_output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
                    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cases&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Production outputs are already human-validated. Users who didn't retry got an acceptable response. That's your ground truth.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Judge Prompt
&lt;/h2&gt;

&lt;p&gt;The key insight: &lt;strong&gt;your judge prompt is your product spec&lt;/strong&gt;. Write it like you're explaining what "good" means to a new engineer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;JUDGE_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
You are evaluating an AI assistant&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s response. Score on three axes (1-5 each):

ACCURACY: Does the response correctly address the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s request?
- 5: Fully correct, no errors or omissions
- 3: Mostly correct with minor issues  
- 1: Significantly wrong or misleading

TONE: Is the response appropriately helpful without being sycophantic?
- 5: Confident, clear, and direct
- 3: Acceptable but slightly off
- 1: Overly apologetic OR dismissive

FORMAT: Is the response well-structured for this context?
- 5: Perfect length, appropriate markdown, scannable
- 3: Correct but could be improved
- 1: Wall of text or too terse

Input: {user_input}
Response: {assistant_output}

Return JSON: {{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: N, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: N, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: N, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}}
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use GPT-4o-mini as your judge. It costs ~£0.002 per evaluation call and is surprisingly good at this task.&lt;/p&gt;

&lt;h2&gt;
  
  
  The CI Integration
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/eval.yml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LLM Eval Gate&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;evaluate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v3&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run evaluations&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python scripts/run_evals.py --golden-dataset data/golden.jsonl&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.OPENAI_API_KEY }}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Check score threshold&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python scripts/check_threshold.py --min-delta -0.8&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;check_threshold.py&lt;/code&gt; script compares current run scores against the stored baseline. If any dimension drops by more than 0.8 points from baseline, the PR fails.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost Breakdown
&lt;/h2&gt;

&lt;p&gt;For 100 test cases per run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;100 LLM calls (your model under test): ~£0.05 at GPT-4o-mini prices&lt;/li&gt;
&lt;li&gt;100 judge calls (GPT-4o-mini): ~£0.12&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: ~£0.17–0.22 per full eval run&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compare to Braintrust at £180/month for unlimited runs. At 2 PRs per day, you'd need 900 runs/month to break even. More likely you run 20–30 runs/month, making the DIY approach ~10x cheaper.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 70% Cost Cut
&lt;/h2&gt;

&lt;p&gt;Once your system is working, add two optimisations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Sampling&lt;/strong&gt;: Don't eval every test case on every run. Randomly sample 30% of your golden dataset unless you're doing a major model swap. Maintains coverage while cutting costs by 70%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Caching&lt;/strong&gt;: Hash (input, model_version) pairs and cache judge scores. Identical inputs with identical model versions always get the same score. A Redis cache or even a simple SQLite file works fine.&lt;/p&gt;

&lt;p&gt;With these two optimisations, recurring eval costs drop to &lt;strong&gt;£0.04–0.07 per run&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Won't Catch
&lt;/h2&gt;

&lt;p&gt;Be honest about the limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Subtle tone regressions in edge cases (your golden dataset has to cover them)&lt;/li&gt;
&lt;li&gt;Completely new user intents not in your golden set&lt;/li&gt;
&lt;li&gt;Factual errors in domains where your judge prompt doesn't have domain knowledge&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For those, you still need human review. But this system catches the regression cases — which are 90% of what actually breaks in production.&lt;/p&gt;




&lt;p&gt;If you want the full system with the multi-model comparison script (GPT-4o vs Claude vs Gemini side-by-side), the sampling/caching implementation, and how to handle eval drift over time, I've packaged it as a complete playbook: &lt;a href="https://hadleyworks.gumroad.com/l/nyzala" rel="noopener noreferrer"&gt;The Indie Hacker's LLM Eval Playbook&lt;/a&gt; — £29, instant download.&lt;/p&gt;

&lt;p&gt;The code above is a taste of what's inside. The playbook goes deeper on rubric design, handling model versioning, and scaling from 100 to 10,000 test cases without the cost exploding.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>startup</category>
      <category>productivity</category>
    </item>
    <item>
      <title>LLM Evaluation for Indie Hackers: Stop Paying Braintrust and Build This Instead</title>
      <dc:creator>Charlie Hadley</dc:creator>
      <pubDate>Mon, 18 May 2026 18:04:41 +0000</pubDate>
      <link>https://forem.com/hadleyworks/llm-evaluation-for-indie-hackers-stop-paying-braintrust-and-build-this-instead-2i0e</link>
      <guid>https://forem.com/hadleyworks/llm-evaluation-for-indie-hackers-stop-paying-braintrust-and-build-this-instead-2i0e</guid>
      <description>&lt;h1&gt;
  
  
  LLM Evaluation in CI: Stop Manual Testing Before It Costs You
&lt;/h1&gt;

&lt;p&gt;You ship a prompt change to production. Two hours later, a customer complains your LLM is now returning hallucinated data. You rollback. You lost an hour of revenue.&lt;/p&gt;

&lt;p&gt;This happens because you tested the happy path, not the edge cases. LLM systems are probabilistic—the same input doesn't always produce the same output quality.&lt;/p&gt;

&lt;p&gt;The enterprise solution is Braintrust ($249/mo), LangSmith ($99/mo), or Arize. If you're indie, bootstrapped, or pre-PMF, those budgets don't exist.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building Eval-as-Code in GitHub Actions
&lt;/h2&gt;

&lt;p&gt;I've been shipping LLM features for indie products for the past year. I built a rubric-based evaluation system that runs in CI and costs about £0.20 per full eval run.&lt;/p&gt;

&lt;p&gt;Here's the core idea:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Define quality as a rubric&lt;/strong&gt;, not vibes. Instead of "does this look good?", you write: correctness, conciseness, tone, hallucination-risk, usefulness. 5-10 concrete attributes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create golden datasets&lt;/strong&gt;. For each use case (classification, summarization, retrieval, generation, etc.), build 20-50 test cases with expected outputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use a cheap judge model&lt;/strong&gt;. GPT-4o-mini scores each output against your rubric. Cost: pennies per eval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate in CI&lt;/strong&gt;. GitHub Actions runs the evals on every PR. If scores drop below threshold, the PR fails.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Concrete Example From Production
&lt;/h2&gt;

&lt;p&gt;I changed a classification system prompt to improve response formatting. The change looked solid in manual testing. But I accidentally dropped a critical piece of context the model needed for correct classification.&lt;/p&gt;

&lt;p&gt;Without evals: that ships to users. Angry support tickets. Rollback. Lost trust.&lt;/p&gt;

&lt;p&gt;With evals: CI caught it in 4 minutes. PR fails. I fix the prompt. Evals pass. Ship confidently.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Actually in the Playbook
&lt;/h2&gt;

&lt;p&gt;I've packaged this into a complete system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Golden dataset templates&lt;/strong&gt; for 6 common LLM use cases (classification, summarization, retrieval, generation, code, reasoning)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rubric-scoring system&lt;/strong&gt;: the exact Python code to score outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-model comparison scripts&lt;/strong&gt;: compare GPT-4o vs Claude vs Gemini on identical cases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complete GitHub Actions workflow&lt;/strong&gt;: copy-paste, no tweaking needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost optimization&lt;/strong&gt;: batch evals, cache responses, use cheaper models for coarse filtering&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The full system is documented with real examples from my production infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who This Is For
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Indie hackers&lt;/strong&gt; shipping LLM features with no ML team&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Startups&lt;/strong&gt; evaluating multiple models before scaling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engineers&lt;/strong&gt; maintaining LLM systems over time (catch regressions early)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anyone tired&lt;/strong&gt; of deploying hope instead of metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The playbook is £29 one-time. You run it once, you've paid for itself by avoiding one bad production deployment.&lt;/p&gt;

&lt;p&gt;Get it: &lt;a href="https://hadleyworks.gumroad.com/l/nyzala" rel="noopener noreferrer"&gt;https://hadleyworks.gumroad.com/l/nyzala&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>startup</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Why Your LLM Prompt Breaks in Production (And How to Fix It Before Shipping)</title>
      <dc:creator>Charlie Hadley</dc:creator>
      <pubDate>Mon, 18 May 2026 17:12:54 +0000</pubDate>
      <link>https://forem.com/hadleyworks/why-your-llm-prompt-breaks-in-production-and-how-to-fix-it-before-shipping-1flo</link>
      <guid>https://forem.com/hadleyworks/why-your-llm-prompt-breaks-in-production-and-how-to-fix-it-before-shipping-1flo</guid>
      <description>&lt;h1&gt;
  
  
  Why Your LLM Prompt Breaks in Production (And How to Fix It Before Shipping)
&lt;/h1&gt;

&lt;p&gt;You've tested your LLM feature manually. It looks great. You ship it.&lt;/p&gt;

&lt;p&gt;Three days later, a user reports the output is completely wrong. You dig in, and realise: you changed a prompt last week, and that change broke something subtle you never tested.&lt;/p&gt;

&lt;p&gt;This is the most common failure mode for indie developers shipping LLM features. And it's entirely preventable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Root Cause: Probabilistic Systems Need Deterministic Tests
&lt;/h2&gt;

&lt;p&gt;Traditional software has a nice property: given the same input, you get the same output. You write a unit test, it passes, you ship with confidence.&lt;/p&gt;

&lt;p&gt;LLMs break this property. The same input produces different outputs. Quality degrades gradually as you tweak prompts. Models get updated. Context windows fill up differently.&lt;/p&gt;

&lt;p&gt;You can't test LLM systems the same way you test regular code.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Works: Rubric-Based Evaluation
&lt;/h2&gt;

&lt;p&gt;Instead of "does this output look right?", define quality as a &lt;strong&gt;concrete rubric&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Scale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Correctness&lt;/td&gt;
&lt;td&gt;Is the answer factually accurate?&lt;/td&gt;
&lt;td&gt;0–10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Conciseness&lt;/td&gt;
&lt;td&gt;Does it avoid unnecessary verbosity?&lt;/td&gt;
&lt;td&gt;0–10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hallucination Risk&lt;/td&gt;
&lt;td&gt;Does it cite things it can't know?&lt;/td&gt;
&lt;td&gt;0–10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tone&lt;/td&gt;
&lt;td&gt;Does it match the expected register?&lt;/td&gt;
&lt;td&gt;0–10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Usefulness&lt;/td&gt;
&lt;td&gt;Would a real user find this helpful?&lt;/td&gt;
&lt;td&gt;0–10&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A judge model (GPT-4o-mini at ~$0.0001/call) scores each output against this rubric automatically. Run 50 test cases, aggregate scores, and if your composite score drops below a threshold — the PR fails.&lt;/p&gt;

&lt;p&gt;This is &lt;strong&gt;eval-as-code&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Golden Dataset Problem
&lt;/h2&gt;

&lt;p&gt;The hardest part is building test cases. Here's the key insight most guides miss:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with failures, not successes.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every time your LLM makes a mistake in production or testing:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Save the input&lt;/li&gt;
&lt;li&gt;Write down what the correct output should have been&lt;/li&gt;
&lt;li&gt;Add it to &lt;code&gt;golden_dataset.json&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After 2–3 weeks, you'll have 30–50 test cases that represent &lt;strong&gt;real failure modes&lt;/strong&gt; — far more valuable than synthetic examples you invented. A golden dataset built from real failures will catch real regressions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running This in GitHub Actions
&lt;/h2&gt;

&lt;p&gt;Here's the minimal CI integration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LLM Eval&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;eval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v3&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run evals&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python run_evals.py&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.OPENAI_API_KEY }}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Check threshold&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python check_threshold.py --min-score &lt;/span&gt;&lt;span class="m"&gt;7.5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If aggregate score drops below 7.5, &lt;code&gt;check_threshold.py&lt;/code&gt; exits with code 1 — the PR is blocked. Simple, deterministic gating on a probabilistic system.&lt;/p&gt;

&lt;p&gt;Total cost to run 50 evals: about £0.20.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-Model Comparison Before You Commit
&lt;/h2&gt;

&lt;p&gt;Before paying for GPT-4o, run your eval suite across providers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-3-5-haiku&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-flash-1.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_eval_suite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;golden_dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculate_cost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: score=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, cost=£&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll often find that Claude Haiku or GPT-4o-mini scores 90%+ as well as GPT-4o at 20% of the cost. Don't pay for intelligence you don't need.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Real Example
&lt;/h2&gt;

&lt;p&gt;I shipped a classification system prompt update to improve response formatting. It looked solid in manual testing on 5 examples. I accidentally dropped a critical piece of context the model needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without evals:&lt;/strong&gt; ships to users. Angry tickets. Rollback. Lost trust.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;With this setup:&lt;/strong&gt; CI caught the regression in 4 minutes. PR failed. Fixed the prompt. Shipped cleanly.&lt;/p&gt;

&lt;p&gt;That one catch alone justified the entire system.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I've Packaged
&lt;/h2&gt;

&lt;p&gt;I've turned this into a complete, ready-to-use system — &lt;strong&gt;&lt;a href="https://hadleyworks.gumroad.com/l/nyzala" rel="noopener noreferrer"&gt;The Indie Hacker's LLM Eval Playbook&lt;/a&gt;&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;6 golden dataset templates (classification, summarization, retrieval, generation, code review, reasoning)&lt;/li&gt;
&lt;li&gt;Complete rubric scoring system in Python (copy-paste ready)&lt;/li&gt;
&lt;li&gt;Multi-model comparison script with cost-efficiency ranking&lt;/li&gt;
&lt;li&gt;GitHub Actions workflow — drop it in and it works&lt;/li&gt;
&lt;li&gt;Cost optimisation guide with real benchmarks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;£29 one-time.&lt;/strong&gt; One prevented production incident pays for it 10× over.&lt;/p&gt;

&lt;p&gt;Questions about implementing this? Drop them in the comments.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>productivity</category>
      <category>startup</category>
    </item>
    <item>
      <title>Why Your LLM Prompt Breaks in Production (And How to Fix It Before Shipping)</title>
      <dc:creator>Charlie Hadley</dc:creator>
      <pubDate>Mon, 18 May 2026 16:35:34 +0000</pubDate>
      <link>https://forem.com/hadleyworks/why-your-llm-prompt-breaks-in-production-and-how-to-fix-it-before-shipping-1ddk</link>
      <guid>https://forem.com/hadleyworks/why-your-llm-prompt-breaks-in-production-and-how-to-fix-it-before-shipping-1ddk</guid>
      <description>&lt;h1&gt;
  
  
  LLM Evaluation in CI: Stop Manual Testing Before It Costs You
&lt;/h1&gt;

&lt;p&gt;You ship a prompt change to production. Two hours later, a customer complains your LLM is now returning hallucinated data. You rollback. You lost an hour of revenue.&lt;/p&gt;

&lt;p&gt;This happens because you tested the happy path, not the edge cases. LLM systems are probabilistic—the same input doesn't always produce the same output quality.&lt;/p&gt;

&lt;p&gt;The enterprise solution is Braintrust ($249/mo), LangSmith ($99/mo), or Arize. If you're indie, bootstrapped, or pre-PMF, those budgets don't exist.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building Eval-as-Code in GitHub Actions
&lt;/h2&gt;

&lt;p&gt;I've been shipping LLM features for indie products for the past year. I built a rubric-based evaluation system that runs in CI and costs about £0.20 per full eval run.&lt;/p&gt;

&lt;p&gt;Here's the core idea:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Define quality as a rubric&lt;/strong&gt;, not vibes. Instead of "does this look good?", you write: correctness, conciseness, tone, hallucination-risk, usefulness. 5-10 concrete attributes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create golden datasets&lt;/strong&gt;. For each use case (classification, summarization, retrieval, generation, etc.), build 20-50 test cases with expected outputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use a cheap judge model&lt;/strong&gt;. GPT-4o-mini scores each output against your rubric. Cost: pennies per eval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate in CI&lt;/strong&gt;. GitHub Actions runs the evals on every PR. If scores drop below threshold, the PR fails.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Concrete Example From Production
&lt;/h2&gt;

&lt;p&gt;I changed a classification system prompt to improve response formatting. The change looked solid in manual testing. But I accidentally dropped a critical piece of context the model needed for correct classification.&lt;/p&gt;

&lt;p&gt;Without evals: that ships to users. Angry support tickets. Rollback. Lost trust.&lt;/p&gt;

&lt;p&gt;With evals: CI caught it in 4 minutes. PR fails. I fix the prompt. Evals pass. Ship confidently.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Actually in the Playbook
&lt;/h2&gt;

&lt;p&gt;I've packaged this into a complete system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Golden dataset templates&lt;/strong&gt; for 6 common LLM use cases (classification, summarization, retrieval, generation, code, reasoning)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rubric-scoring system&lt;/strong&gt;: the exact Python code to score outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-model comparison scripts&lt;/strong&gt;: compare GPT-4o vs Claude vs Gemini on identical cases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complete GitHub Actions workflow&lt;/strong&gt;: copy-paste, no tweaking needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost optimization&lt;/strong&gt;: batch evals, cache responses, use cheaper models for coarse filtering&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The full system is documented with real examples from my production infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who This Is For
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Indie hackers&lt;/strong&gt; shipping LLM features with no ML team&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Startups&lt;/strong&gt; evaluating multiple models before scaling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engineers&lt;/strong&gt; maintaining LLM systems over time (catch regressions early)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anyone tired&lt;/strong&gt; of deploying hope instead of metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The playbook is £29 one-time. You run it once, you've paid for itself by avoiding one bad production deployment.&lt;/p&gt;

&lt;p&gt;Get it: &lt;a href="https://hadleyworks.gumroad.com/l/nyzala" rel="noopener noreferrer"&gt;https://hadleyworks.gumroad.com/l/nyzala&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>LLM Evaluation in CI: Stop Manual Testing Before It Costs You</title>
      <dc:creator>Charlie Hadley</dc:creator>
      <pubDate>Mon, 18 May 2026 16:35:21 +0000</pubDate>
      <link>https://forem.com/hadleyworks/llm-evaluation-in-ci-stop-manual-testing-before-it-costs-you-59i7</link>
      <guid>https://forem.com/hadleyworks/llm-evaluation-in-ci-stop-manual-testing-before-it-costs-you-59i7</guid>
      <description>&lt;h1&gt;
  
  
  LLM Evaluation in CI: Stop Manual Testing Before It Costs You
&lt;/h1&gt;

&lt;p&gt;You ship a prompt change to production. Two hours later, a customer complains your LLM is returning hallucinated data. You rollback. You lost an hour of revenue and some user trust.&lt;/p&gt;

&lt;p&gt;This happens because you tested the happy path, not the edge cases. LLM systems are probabilistic — the same input doesn't always produce the same output quality.&lt;/p&gt;

&lt;p&gt;The enterprise solution is Braintrust ($249/mo), LangSmith ($99/mo), or Arize. If you're indie, bootstrapped, or pre-PMF, those budgets simply don't exist.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Idea: Eval-as-Code
&lt;/h2&gt;

&lt;p&gt;Instead of vibes-based testing, you define quality as a &lt;strong&gt;rubric&lt;/strong&gt; with concrete attributes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Correctness&lt;/strong&gt; (0–10): Is the answer factually right?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conciseness&lt;/strong&gt; (0–10): Does it avoid unnecessary padding?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucination risk&lt;/strong&gt; (0–10): Does it cite things it can't know?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tone&lt;/strong&gt; (0–10): Does it match expected register?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Usefulness&lt;/strong&gt; (0–10): Would a real user find this helpful?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A cheap judge model (GPT-4o-mini at ~$0.0001/call) scores each output against your rubric. You run 50 test cases per eval. Total cost: about £0.20 per full run.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building This in GitHub Actions
&lt;/h2&gt;

&lt;p&gt;Here's the minimal structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LLM Eval&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;eval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v3&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run evals&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python run_evals.py&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.OPENAI_API_KEY }}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Check threshold&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python check_threshold.py --min-score &lt;/span&gt;&lt;span class="m"&gt;7.5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;run_evals.py&lt;/code&gt; script:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Loads your golden dataset (JSON file of input/expected-output pairs)&lt;/li&gt;
&lt;li&gt;Runs your LLM system on each input&lt;/li&gt;
&lt;li&gt;Sends (input, expected, actual) to GPT-4o-mini with your rubric&lt;/li&gt;
&lt;li&gt;Aggregates scores by attribute&lt;/li&gt;
&lt;li&gt;Writes results to &lt;code&gt;eval_results.json&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If aggregate score drops below your threshold, &lt;code&gt;check_threshold.py&lt;/code&gt; exits with code 1 — the PR fails.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Real Example From Production
&lt;/h2&gt;

&lt;p&gt;I changed a classification system prompt to improve response formatting. The change looked solid in manual testing on 5 examples. But I accidentally dropped a critical piece of context the model needed for correct classification.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without evals:&lt;/strong&gt; ships to users. Angry support tickets. Rollback. Lost trust.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With evals:&lt;/strong&gt; CI caught it in 4 minutes. PR fails. I fix the prompt. Evals pass. Ship confidently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Golden Datasets: The Hard Part
&lt;/h2&gt;

&lt;p&gt;The hardest part is building your test cases. The key insight: &lt;strong&gt;start with failures, not successes.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every time your LLM system makes a mistake:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Save the input&lt;/li&gt;
&lt;li&gt;Write down what the correct output should have been&lt;/li&gt;
&lt;li&gt;Add it to your golden dataset&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After 2–3 weeks of normal usage, you'll have 30–50 meaningful test cases that represent real failure modes — far more valuable than synthetic test cases you invented upfront.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-Model Comparison
&lt;/h2&gt;

&lt;p&gt;Before committing to an expensive model, run your eval suite across providers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-3-5-haiku&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-flash-1.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_eval_suite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;golden_dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Sort by (score / cost_per_1k_tokens) to find optimal tradeoff
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This stops you from paying for GPT-4o when Claude Haiku scores 92% as well at 20% of the cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost Optimization
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Batch your calls&lt;/strong&gt;: OpenAI batch API gives 50% discount on async evals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache responses&lt;/strong&gt;: Hash (model + prompt + input) → cache hit avoids re-scoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coarse-to-fine&lt;/strong&gt;: Use a 2-stage system — cheap model filters obvious passes, expensive model only sees borderline cases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weekly CI only&lt;/strong&gt;: Run full suite on PRs to main, not every commit&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A well-optimized setup runs 100 eval cases for under £0.10.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I've Packaged Up
&lt;/h2&gt;

&lt;p&gt;I've turned this into a complete ready-to-use system in &lt;strong&gt;&lt;a href="https://hadleyworks.gumroad.com/l/nyzala" rel="noopener noreferrer"&gt;The Indie Hacker's LLM Eval Playbook&lt;/a&gt;&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;6 golden dataset templates&lt;/strong&gt; for common LLM tasks (classification, summarization, retrieval, generation, code review, reasoning)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complete rubric scoring system&lt;/strong&gt; in Python (copy-paste ready)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-model comparison script&lt;/strong&gt; with cost-efficiency ranking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Actions workflow&lt;/strong&gt; — drop it in your repo and it works&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost optimization guide&lt;/strong&gt; with benchmarks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;£29 one-time.&lt;/strong&gt; One avoided production incident pays for it 10× over.&lt;/p&gt;

&lt;p&gt;If you have questions about implementing eval-as-code for your specific use case, drop them in the comments — happy to help.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>testing</category>
      <category>devops</category>
    </item>
    <item>
      <title>I Built LLM Evaluation-as-Code in CI: Here's How to Avoid Shipping Regressions</title>
      <dc:creator>Charlie Hadley</dc:creator>
      <pubDate>Mon, 18 May 2026 16:28:10 +0000</pubDate>
      <link>https://forem.com/hadleyworks/i-built-llm-evaluation-as-code-in-ci-heres-how-to-avoid-shipping-regressions-3f7h</link>
      <guid>https://forem.com/hadleyworks/i-built-llm-evaluation-as-code-in-ci-heres-how-to-avoid-shipping-regressions-3f7h</guid>
      <description>&lt;h1&gt;
  
  
  API Rate Limiting Playbook: Protect Your Backend From Abuse
&lt;/h1&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Your API is live in production. Traffic is growing. Then one day, a bot discovers your endpoint and starts hammering it with 100,000 requests per second. Your database melts. Your users see 500 errors. You lose revenue and reputation.&lt;/p&gt;

&lt;p&gt;Or worse: a malicious actor uses your API to brute-force user accounts. You didn't have rate limiting in place. You're liable.&lt;/p&gt;

&lt;p&gt;This is the silent killer of indie SaaS. You ship the product. You don't ship the protection. Then production breaks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Most Indie Teams Skip Rate Limiting
&lt;/h2&gt;

&lt;p&gt;Rate limiting &lt;em&gt;sounds&lt;/em&gt; complicated. "Distributed rate limiting"? "Token bucket algorithm"? "Redis backing stores"?&lt;/p&gt;

&lt;p&gt;In reality, it's simple. And you don't need expensive tools. You don't need AWS API Gateway ($0.35 per million requests). You don't need third-party middleware.&lt;/p&gt;

&lt;p&gt;You need a methodology. Once you have methodology, the implementation is trivial.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three-Layer Strategy
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Layer 1: IP-Based Rate Limiting (Nginx)
&lt;/h3&gt;

&lt;p&gt;First line of defense: block obvious bots and abusers at the edge.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;limit_req_zone&lt;/span&gt; &lt;span class="nv"&gt;$binary_remote_addr&lt;/span&gt; &lt;span class="s"&gt;zone=general:10m&lt;/span&gt; &lt;span class="s"&gt;rate=10r/s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;limit_req_zone&lt;/span&gt; &lt;span class="nv"&gt;$binary_remote_addr&lt;/span&gt; &lt;span class="s"&gt;zone=auth:10m&lt;/span&gt; &lt;span class="s"&gt;rate=1r/s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;server&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/api/&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kn"&gt;limit_req&lt;/span&gt; &lt;span class="s"&gt;zone=general&lt;/span&gt; &lt;span class="s"&gt;burst=20&lt;/span&gt; &lt;span class="s"&gt;nodelay&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kn"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/api/auth/login&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kn"&gt;limit_req&lt;/span&gt; &lt;span class="s"&gt;zone=auth&lt;/span&gt; &lt;span class="s"&gt;burst=3&lt;/span&gt; &lt;span class="s"&gt;nodelay&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cost: $0 (Nginx is free).&lt;/p&gt;

&lt;p&gt;Setup time: 15 minutes.&lt;/p&gt;

&lt;p&gt;Blocks: 95% of bot traffic and accidental DDoS.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2: User/Token-Based Rate Limiting (Redis + Python)
&lt;/h3&gt;

&lt;p&gt;Your authenticated users have legitimate spikes. A single IP-based rule punishes them unfairly.&lt;/p&gt;

&lt;p&gt;Instead, rate limit per API key or user ID:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;

&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;is_rate_limited&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rate_limit:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;window_seconds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;incr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_seconds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;

&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/api/resource&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_resource&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;is_rate_limited&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Rate limit exceeded&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="mi"&gt;429&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;process_request&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cost: Redis Cloud free tier (up to 30MB).&lt;/p&gt;

&lt;p&gt;Setup time: 30 minutes.&lt;/p&gt;

&lt;p&gt;Blocks: Authenticated abuse, account enumeration, brute-force attacks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3: Endpoint-Specific Thresholds
&lt;/h3&gt;

&lt;p&gt;Different endpoints have different abuse vectors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Public endpoints&lt;/strong&gt; (search, info): 100 req/min per IP&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auth endpoints&lt;/strong&gt; (login, signup): 5 req/min per IP + distributed rate limit&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource creation&lt;/strong&gt; (write APIs): 10 req/min per user&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Admin endpoints&lt;/strong&gt;: 1000 req/day per user (tight control)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Document these in your API spec. Expose rate limit headers to clients:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;X-RateLimit-Limit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;100&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;X-RateLimit-Remaining&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;87&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;X-RateLimit-Reset&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;unix_timestamp&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Real-World Cost Breakdown
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Nginx configuration&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Redis Cloud (free tier)&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monitoring + alerts&lt;/td&gt;
&lt;td&gt;$0–10/month (CloudWatch or Datadog free tier)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0–10/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Compare to AWS API Gateway: $0.35 per million requests = $3,500/month at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation Checklist
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Deploy Nginx rate limiting (zone + limit_req directive)&lt;/li&gt;
&lt;li&gt;[ ] Set up Redis account (free tier)&lt;/li&gt;
&lt;li&gt;[ ] Write rate limit middleware in your framework&lt;/li&gt;
&lt;li&gt;[ ] Define endpoint-specific limits&lt;/li&gt;
&lt;li&gt;[ ] Add rate limit headers to responses&lt;/li&gt;
&lt;li&gt;[ ] Test with Apache Bench or Vegeta load testing tool&lt;/li&gt;
&lt;li&gt;[ ] Set up alerts (Slack notification when a user hits limits)&lt;/li&gt;
&lt;li&gt;[ ] Document rate limits in your API docs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Time to implement: &lt;strong&gt;2–4 hours&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Cost: &lt;strong&gt;$0&lt;/strong&gt; (for 95% of use cases).&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes to Avoid
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Only IP-based limiting&lt;/strong&gt;: Punishes corporate networks and VPNs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No graduated response&lt;/strong&gt;: Ban immediately instead of throttling first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storing counts in database&lt;/strong&gt;: Too slow. Use Redis or in-memory cache.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not exposing rate limit headers&lt;/strong&gt;: Clients can't intelligently back off.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ignoring health check endpoints&lt;/strong&gt;: Don't rate limit your own monitoring.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Debugging Rate Limit Issues
&lt;/h2&gt;

&lt;p&gt;When a user reports "API blocked", here's how to troubleshoot:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check Redis keys: &lt;code&gt;redis-cli KEYS "rate_limit:*"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Inspect their request pattern: high burst vs sustained?&lt;/li&gt;
&lt;li&gt;Whitelist their IP/user if it's a legitimate use case&lt;/li&gt;
&lt;li&gt;Adjust thresholds based on real traffic patterns&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;This playbook includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ready-to-deploy Nginx configs for all major frameworks&lt;/li&gt;
&lt;li&gt;Redis setup guide (AWS ElastiCache, DigitalOcean, Heroku)&lt;/li&gt;
&lt;li&gt;Complete Python/Node.js middleware code&lt;/li&gt;
&lt;li&gt;GitHub Actions workflow for load testing&lt;/li&gt;
&lt;li&gt;Real abuse patterns from production SaaS systems&lt;/li&gt;
&lt;li&gt;Cost optimization strategies (cache tiers, fallback limits)&lt;/li&gt;
&lt;li&gt;Comprehensive debugging guide&lt;/li&gt;
&lt;li&gt;Whitelist/bypass strategies for trusted partners&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Implementing rate limiting takes 2–4 hours. Ignoring it costs you production incidents and security breaches.&lt;/p&gt;

&lt;p&gt;Deploy today.&lt;/p&gt;

</description>
      <category>cicd</category>
      <category>llm</category>
      <category>testing</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to Catch LLM Regressions in CI: The Rubric-Based Eval System That Works</title>
      <dc:creator>Charlie Hadley</dc:creator>
      <pubDate>Mon, 18 May 2026 16:13:08 +0000</pubDate>
      <link>https://forem.com/hadleyworks/how-to-catch-llm-regressions-in-ci-the-rubric-based-eval-system-that-works-48ck</link>
      <guid>https://forem.com/hadleyworks/how-to-catch-llm-regressions-in-ci-the-rubric-based-eval-system-that-works-48ck</guid>
      <description>&lt;h1&gt;
  
  
  API Rate Limiting Playbook: Protect Your Backend From Abuse
&lt;/h1&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Your API is live in production. Traffic is growing. Then one day, a bot discovers your endpoint and starts hammering it with 100,000 requests per second. Your database melts. Your users see 500 errors. You lose revenue and reputation.&lt;/p&gt;

&lt;p&gt;Or worse: a malicious actor uses your API to brute-force user accounts. You didn't have rate limiting in place. You're liable.&lt;/p&gt;

&lt;p&gt;This is the silent killer of indie SaaS. You ship the product. You don't ship the protection. Then production breaks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Most Indie Teams Skip Rate Limiting
&lt;/h2&gt;

&lt;p&gt;Rate limiting &lt;em&gt;sounds&lt;/em&gt; complicated. "Distributed rate limiting"? "Token bucket algorithm"? "Redis backing stores"?&lt;/p&gt;

&lt;p&gt;In reality, it's simple. And you don't need expensive tools. You don't need AWS API Gateway ($0.35 per million requests). You don't need third-party middleware.&lt;/p&gt;

&lt;p&gt;You need a methodology. Once you have methodology, the implementation is trivial.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three-Layer Strategy
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Layer 1: IP-Based Rate Limiting (Nginx)
&lt;/h3&gt;

&lt;p&gt;First line of defense: block obvious bots and abusers at the edge.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;limit_req_zone&lt;/span&gt; &lt;span class="nv"&gt;$binary_remote_addr&lt;/span&gt; &lt;span class="s"&gt;zone=general:10m&lt;/span&gt; &lt;span class="s"&gt;rate=10r/s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;limit_req_zone&lt;/span&gt; &lt;span class="nv"&gt;$binary_remote_addr&lt;/span&gt; &lt;span class="s"&gt;zone=auth:10m&lt;/span&gt; &lt;span class="s"&gt;rate=1r/s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;server&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/api/&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kn"&gt;limit_req&lt;/span&gt; &lt;span class="s"&gt;zone=general&lt;/span&gt; &lt;span class="s"&gt;burst=20&lt;/span&gt; &lt;span class="s"&gt;nodelay&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kn"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/api/auth/login&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kn"&gt;limit_req&lt;/span&gt; &lt;span class="s"&gt;zone=auth&lt;/span&gt; &lt;span class="s"&gt;burst=3&lt;/span&gt; &lt;span class="s"&gt;nodelay&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cost: $0 (Nginx is free).&lt;/p&gt;

&lt;p&gt;Setup time: 15 minutes.&lt;/p&gt;

&lt;p&gt;Blocks: 95% of bot traffic and accidental DDoS.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2: User/Token-Based Rate Limiting (Redis + Python)
&lt;/h3&gt;

&lt;p&gt;Your authenticated users have legitimate spikes. A single IP-based rule punishes them unfairly.&lt;/p&gt;

&lt;p&gt;Instead, rate limit per API key or user ID:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;

&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;is_rate_limited&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rate_limit:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;window_seconds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;incr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_seconds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;

&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/api/resource&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_resource&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;is_rate_limited&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Rate limit exceeded&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="mi"&gt;429&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;process_request&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cost: Redis Cloud free tier (up to 30MB).&lt;/p&gt;

&lt;p&gt;Setup time: 30 minutes.&lt;/p&gt;

&lt;p&gt;Blocks: Authenticated abuse, account enumeration, brute-force attacks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3: Endpoint-Specific Thresholds
&lt;/h3&gt;

&lt;p&gt;Different endpoints have different abuse vectors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Public endpoints&lt;/strong&gt; (search, info): 100 req/min per IP&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auth endpoints&lt;/strong&gt; (login, signup): 5 req/min per IP + distributed rate limit&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource creation&lt;/strong&gt; (write APIs): 10 req/min per user&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Admin endpoints&lt;/strong&gt;: 1000 req/day per user (tight control)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Document these in your API spec. Expose rate limit headers to clients:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;X-RateLimit-Limit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;100&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;X-RateLimit-Remaining&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;87&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;X-RateLimit-Reset&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;unix_timestamp&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Real-World Cost Breakdown
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Nginx configuration&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Redis Cloud (free tier)&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monitoring + alerts&lt;/td&gt;
&lt;td&gt;$0–10/month (CloudWatch or Datadog free tier)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0–10/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Compare to AWS API Gateway: $0.35 per million requests = $3,500/month at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation Checklist
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Deploy Nginx rate limiting (zone + limit_req directive)&lt;/li&gt;
&lt;li&gt;[ ] Set up Redis account (free tier)&lt;/li&gt;
&lt;li&gt;[ ] Write rate limit middleware in your framework&lt;/li&gt;
&lt;li&gt;[ ] Define endpoint-specific limits&lt;/li&gt;
&lt;li&gt;[ ] Add rate limit headers to responses&lt;/li&gt;
&lt;li&gt;[ ] Test with Apache Bench or Vegeta load testing tool&lt;/li&gt;
&lt;li&gt;[ ] Set up alerts (Slack notification when a user hits limits)&lt;/li&gt;
&lt;li&gt;[ ] Document rate limits in your API docs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Time to implement: &lt;strong&gt;2–4 hours&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Cost: &lt;strong&gt;$0&lt;/strong&gt; (for 95% of use cases).&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes to Avoid
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Only IP-based limiting&lt;/strong&gt;: Punishes corporate networks and VPNs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No graduated response&lt;/strong&gt;: Ban immediately instead of throttling first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storing counts in database&lt;/strong&gt;: Too slow. Use Redis or in-memory cache.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not exposing rate limit headers&lt;/strong&gt;: Clients can't intelligently back off.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ignoring health check endpoints&lt;/strong&gt;: Don't rate limit your own monitoring.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Debugging Rate Limit Issues
&lt;/h2&gt;

&lt;p&gt;When a user reports "API blocked", here's how to troubleshoot:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check Redis keys: &lt;code&gt;redis-cli KEYS "rate_limit:*"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Inspect their request pattern: high burst vs sustained?&lt;/li&gt;
&lt;li&gt;Whitelist their IP/user if it's a legitimate use case&lt;/li&gt;
&lt;li&gt;Adjust thresholds based on real traffic patterns&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;This playbook includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ready-to-deploy Nginx configs for all major frameworks&lt;/li&gt;
&lt;li&gt;Redis setup guide (AWS ElastiCache, DigitalOcean, Heroku)&lt;/li&gt;
&lt;li&gt;Complete Python/Node.js middleware code&lt;/li&gt;
&lt;li&gt;GitHub Actions workflow for load testing&lt;/li&gt;
&lt;li&gt;Real abuse patterns from production SaaS systems&lt;/li&gt;
&lt;li&gt;Cost optimization strategies (cache tiers, fallback limits)&lt;/li&gt;
&lt;li&gt;Comprehensive debugging guide&lt;/li&gt;
&lt;li&gt;Whitelist/bypass strategies for trusted partners&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Implementing rate limiting takes 2–4 hours. Ignoring it costs you production incidents and security breaches.&lt;/p&gt;

&lt;p&gt;Deploy today.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to Run LLM Evaluations in CI Without Paying $249/Month</title>
      <dc:creator>Charlie Hadley</dc:creator>
      <pubDate>Mon, 18 May 2026 15:47:46 +0000</pubDate>
      <link>https://forem.com/hadleyworks/how-to-run-llm-evaluations-in-ci-without-paying-249month-2nf4</link>
      <guid>https://forem.com/hadleyworks/how-to-run-llm-evaluations-in-ci-without-paying-249month-2nf4</guid>
      <description>&lt;h1&gt;
  
  
  How to Run LLM Evaluations in CI Without Paying $249/Month
&lt;/h1&gt;

&lt;p&gt;If you're building LLM-powered features as an indie hacker or small team, you've probably hit this wall: your prompts work great in the playground, but you have no systematic way to know if they're actually &lt;em&gt;improving&lt;/em&gt; after each change.&lt;/p&gt;

&lt;p&gt;The obvious answer is Braintrust or LangSmith. But at $249/month minimum, that's a massive commitment for a pre-PMF product. Here's how to build a production-grade eval pipeline for under $5/month.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Architecture
&lt;/h2&gt;

&lt;p&gt;You need three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A golden dataset&lt;/strong&gt; — A CSV of 50-200 test cases covering your edge cases, with input + expected behavior description&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A scoring function&lt;/strong&gt; — LLM-as-judge using GPT-4o-mini (~$0.002 per example)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Actions integration&lt;/strong&gt; — Runs your eval suite on every PR with a score threshold check&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The magic: your CI pipeline fails the build if average quality drops below your threshold. No more shipping prompt regressions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Rubric-Based Scoring Beats Exact Match
&lt;/h2&gt;

&lt;p&gt;The biggest mistake teams make: they try to match exact output strings. This fails because LLMs are inherently non-deterministic.&lt;/p&gt;

&lt;p&gt;Instead, define what "good" looks like as a checklist rubric:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;rubric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Score this response 1-5 based on:
- Does it answer the question directly? (1 point)
- Is it concise (under 200 words)? (1 point)  
- Does it avoid hallucinating specific numbers? (1 point)
- Is the tone professional? (1 point)
- Would a user find this genuinely useful? (1 point)
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then let GPT-4o-mini score each response against this rubric. At $0.002 per evaluation, running 100 test cases costs $0.20.&lt;/p&gt;

&lt;h2&gt;
  
  
  The GitHub Actions Workflow
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LLM Eval CI&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;eval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v3&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run eval suite&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.OPENAI_API_KEY }}&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;pip install openai pandas&lt;/span&gt;
          &lt;span class="s"&gt;python eval/run_suite.py --threshold 3.5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--threshold 3.5&lt;/code&gt; means: if average score drops below 3.5/5.0, fail the PR. This is your quality gate.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Multi-Model Comparison Pattern
&lt;/h2&gt;

&lt;p&gt;Before you commit to GPT-4o for your feature, run your eval suite against Claude 3.5 Haiku and Gemini Flash. You'll often find that a cheaper model scores within 0.2 points of the expensive one — at 1/10th the cost.&lt;/p&gt;

&lt;p&gt;This comparison takes 10 minutes to set up but can cut your inference costs by 60-80%.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Catches in Practice
&lt;/h2&gt;

&lt;p&gt;Real scenario: You change your system prompt to fix a formatting issue. Without evals, you ship it. With evals, your CI run shows classification accuracy dropped from 4.2 to 3.1 on the golden dataset. You investigate, find that your formatting fix accidentally removed context the model needed, and fix it before it hits production.&lt;/p&gt;

&lt;p&gt;The moment you catch your first regression in CI, the whole system pays for itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building Your Golden Dataset
&lt;/h2&gt;

&lt;p&gt;Start with 50 examples. Pull them from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real user queries you've seen in logs&lt;/li&gt;
&lt;li&gt;Edge cases you've mentally worried about&lt;/li&gt;
&lt;li&gt;Failure modes you've already shipped by accident&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Don't try to write expected outputs. Instead, write &lt;em&gt;rubrics&lt;/em&gt; describing what good looks like for each category.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost Breakdown
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Golden dataset (50 examples): $0.10 per full suite run&lt;/li&gt;
&lt;li&gt;GitHub Actions: free tier (2,000 minutes/month)&lt;/li&gt;
&lt;li&gt;Total monthly cost for 10 PRs/week: ~$4/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compare to Braintrust at $249/month.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;The hardest part isn't the code — it's building the golden dataset and writing good rubrics. Once those exist, the automation is straightforward.&lt;/p&gt;

&lt;p&gt;I've packaged the full methodology into a playbook: golden dataset templates, rubric examples, multi-model comparison scripts, and the complete GitHub Actions workflow. Available at &lt;a href="https://hadleyworks.gumroad.com/l/nyzala" rel="noopener noreferrer"&gt;hadleyworks.gumroad.com&lt;/a&gt; for $29.&lt;/p&gt;

&lt;p&gt;What eval setups are others running at small scale? Happy to discuss approaches in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>productivity</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Evaluating LLMs in Production Without Paying $249/Month for Braintrust</title>
      <dc:creator>Charlie Hadley</dc:creator>
      <pubDate>Mon, 18 May 2026 15:02:43 +0000</pubDate>
      <link>https://forem.com/hadleyworks/evaluating-llms-in-production-without-paying-249month-for-braintrust-31ch</link>
      <guid>https://forem.com/hadleyworks/evaluating-llms-in-production-without-paying-249month-for-braintrust-31ch</guid>
      <description>&lt;h1&gt;
  
  
  Evaluating LLMs in Production Without Paying $249/Month for Braintrust
&lt;/h1&gt;

&lt;p&gt;If you're building an LLM-powered product as an indie hacker or small team, you've probably hit this wall: your prompts work great in the playground, but you have no idea if they're actually getting better (or worse) after each change.&lt;/p&gt;

&lt;p&gt;The obvious solution is a dedicated eval platform — Braintrust, Langsmith, Humanloop. But at $249/month for meaningful usage, that's a lot of MRR to justify before you've found product-market fit.&lt;/p&gt;

&lt;p&gt;Here's what I've been doing instead, using tools you already have.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Problem With Ad-Hoc Evals
&lt;/h2&gt;

&lt;p&gt;Most indie teams do one of three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Vibe-check evals&lt;/strong&gt; — you prompt it, it feels right, you ship&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One-shot spreadsheets&lt;/strong&gt; — you run 20 examples once, never again&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nothing&lt;/strong&gt; — you just watch for complaints in Discord&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;None of these catch regressions. When you change a prompt to fix one thing, you break two others, and you won't know for a week.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Lightweight Eval Stack That Actually Works
&lt;/h2&gt;

&lt;p&gt;Here's the stack: &lt;strong&gt;Golden dataset + GitHub Actions + a simple scoring function&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Build a Golden Dataset
&lt;/h3&gt;

&lt;p&gt;A golden dataset is just a CSV with input/expected output pairs. Start with 20-50 examples that cover your edge cases:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;input,expected_output,tags
"Summarize this legal clause: ...", "The clause limits liability to...", "legal,summarization"
"What is the capital of France?", "Paris", "factual,simple"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: &lt;strong&gt;you don't need perfect expected outputs&lt;/strong&gt;. You need &lt;em&gt;rubric-based scoring&lt;/em&gt;, not exact match. Define what "good" looks like as a checklist.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Write a Scoring Function
&lt;/h3&gt;

&lt;p&gt;For most use cases, a simple LLM-as-judge approach works well:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;score_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;actual_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Rate this LLM response on a scale of 1-5.

    Input: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;input_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
    Expected: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  
    Actual: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;actual_output&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

    Score based on: accuracy, completeness, tone.
    Return JSON: {{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: X, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}}
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cost per run: ~$0.002 per example with GPT-4o-mini. Running 50 examples costs $0.10. You can run this on every PR.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: GitHub Actions Integration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LLM Eval Suite&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;eval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v3&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run eval suite&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python eval/run_evals.py&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.OPENAI_API_KEY }}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Check score threshold&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python eval/check_threshold.py --min-score &lt;/span&gt;&lt;span class="m"&gt;3.8&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now every PR shows a score. If it drops below 3.8, the check fails. You've just built CI for your prompts.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Doesn't Cover
&lt;/h2&gt;

&lt;p&gt;This approach works great for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Summarization and extraction tasks&lt;/li&gt;
&lt;li&gt;Classification (with expected labels)&lt;/li&gt;
&lt;li&gt;RAG retrieval quality&lt;/li&gt;
&lt;li&gt;Tone/style adherence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's harder to apply to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open-ended creative tasks&lt;/li&gt;
&lt;li&gt;Multi-turn conversations&lt;/li&gt;
&lt;li&gt;Tasks where "correct" is deeply subjective&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For those cases, you need human-in-the-loop evals — but you can still automate the &lt;em&gt;collection&lt;/em&gt; of examples and use the human time only for scoring edge cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Win: Regression Detection
&lt;/h2&gt;

&lt;p&gt;The moment this system pays off is when you change your system prompt to improve summarization, run the eval suite, and see that your classification accuracy dropped from 4.2 to 3.1. Without this, you'd ship it and wonder why your churn ticked up next week.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The goal isn't perfect evals. The goal is catching regressions before your users do.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Going Deeper
&lt;/h2&gt;

&lt;p&gt;If you want the full methodology — including golden dataset templates, rubric examples, multi-model comparison scripts, and a GitHub Actions workflow you can clone — I packaged everything into a playbook: &lt;a href="https://buy.stripe.com/6oUeV5gH7b4s56YcHG4ko0d" rel="noopener noreferrer"&gt;The Indie Hacker's LLM Eval Playbook&lt;/a&gt; (£25, instant download).&lt;/p&gt;

&lt;p&gt;But honestly, the approach above will get you 80% of the way there for free.&lt;/p&gt;

&lt;p&gt;The main insight: &lt;strong&gt;treat your prompts like code&lt;/strong&gt;. You wouldn't ship a function without tests. Don't ship a prompt without evals.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What eval setup are you running? Curious what others have found works at small scale — drop a comment below.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>productivity</category>
      <category>startup</category>
    </item>
  </channel>
</rss>
