<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: HybridTechie</title>
    <description>The latest articles on Forem by HybridTechie (@hybridtechie).</description>
    <link>https://forem.com/hybridtechie</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F172076%2Fb76184b9-9723-4f16-a022-099866cb35f7.png</url>
      <title>Forem: HybridTechie</title>
      <link>https://forem.com/hybridtechie</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/hybridtechie"/>
    <language>en</language>
    <item>
      <title>AI Regression Tests Written in Markdown, Not Code</title>
      <dc:creator>HybridTechie</dc:creator>
      <pubDate>Sun, 08 Mar 2026 14:30:23 +0000</pubDate>
      <link>https://forem.com/hybridtechie/ai-regression-tests-written-in-markdown-not-code-5b09</link>
      <guid>https://forem.com/hybridtechie/ai-regression-tests-written-in-markdown-not-code-5b09</guid>
      <description>&lt;p&gt;As AI agents write more of our production code, I started asking a simple question: who's testing the code the AI just wrote?&lt;/p&gt;

&lt;p&gt;Not unit tests. We still have those. Not Playwright E2E suites. Those too. I'm talking about a new layer that sits alongside everything else. One that the AI itself writes as it builds features.&lt;/p&gt;

&lt;p&gt;Here's what we built.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;AI agents already write E2E tests. Playwright, Cypress. The agent generates the code, it runs, it passes. Job done, right?&lt;/p&gt;

&lt;p&gt;Not quite. Those tests are deterministic. They assert on exact selectors, exact text, exact DOM structure. The agent that wrote &lt;code&gt;page.locator('#sidebar-nav &amp;gt; ul &amp;gt; li:nth-child(3)')&lt;/code&gt; has baked in an assumption about the HTML that will break the moment another agent (or a human) touches that component. The test didn't get worse. The UI moved on and the test couldn't keep up.&lt;/p&gt;

&lt;p&gt;This is the real problem: deterministic tests written by AI are still brittle tests. The AI just writes them faster. It doesn't make them less fragile.&lt;/p&gt;

&lt;p&gt;What we actually needed was a test that behaves the way a human tester does. Look at the screen, find the login button (wherever it is), click it, check what happens. Not "find element with ID &lt;code&gt;btn-submit&lt;/code&gt;" but "find the button that says Sign In."&lt;/p&gt;

&lt;h2&gt;
  
  
  The solution: markdown test files
&lt;/h2&gt;

&lt;p&gt;Each regression test is a markdown file. Plain English. Structured steps. No code.&lt;/p&gt;

&lt;p&gt;Here's a real test from our suite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Test 001: Login as SuperAdmin&lt;/span&gt;

| Field        | Value                    |
|--------------|--------------------------|
| &lt;span class="gs"&gt;**ID**&lt;/span&gt;       | AI-REG-001               |
| &lt;span class="gs"&gt;**Priority**&lt;/span&gt; | P0 (Critical)            |
| &lt;span class="gs"&gt;**Area**&lt;/span&gt;     | Authentication           |
| &lt;span class="gs"&gt;**Requires**&lt;/span&gt; | testData/LoginCreds.json |

&lt;span class="gu"&gt;## Steps&lt;/span&gt;

&lt;span class="gu"&gt;### Step 1: Navigate to the app&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; &lt;span class="gs"&gt;**Action**&lt;/span&gt;: Open browser and navigate to the app URL
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Expected**&lt;/span&gt;: Since the user is not authenticated, the app redirects to /login
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Verify**&lt;/span&gt;: URL ends with /login

&lt;span class="gu"&gt;### Step 2: Verify login page elements&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; &lt;span class="gs"&gt;**Action**&lt;/span&gt;: Take a snapshot of the login page
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Expected**&lt;/span&gt;: The login form is visible with an email input, password input, and Sign In button
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Verify**&lt;/span&gt;: All three elements are present

&lt;span class="gu"&gt;### Step 3: Enter SuperAdmin email&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; &lt;span class="gs"&gt;**Action**&lt;/span&gt;: Fill the email input with the email from test data
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Expected**&lt;/span&gt;: The email appears in the input field

&lt;span class="gu"&gt;### Step 4: Enter SuperAdmin password&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; &lt;span class="gs"&gt;**Action**&lt;/span&gt;: Fill the password input with the password from test data
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Expected**&lt;/span&gt;: The password field shows masked characters

&lt;span class="gu"&gt;### Step 5: Submit the login form&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; &lt;span class="gs"&gt;**Action**&lt;/span&gt;: Click the Sign In button
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Expected**&lt;/span&gt;: Page redirects to the dashboard
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Verify**&lt;/span&gt;: URL is now / (no longer /login)

&lt;span class="gu"&gt;### Step 6: Verify SuperAdmin nav items&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; &lt;span class="gs"&gt;**Action**&lt;/span&gt;: Inspect the sidebar navigation
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Expected**&lt;/span&gt;: SuperAdmin-only items are visible: Manage DB, People Management, AI Models, AI Logs, AI Dashboard
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Verify**&lt;/span&gt;: At least 3 of the 5 SuperAdmin menu items are present

&lt;span class="gu"&gt;### Step 7: Verify user role badge&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; &lt;span class="gs"&gt;**Action**&lt;/span&gt;: Click on the user avatar to open the dropdown
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Expected**&lt;/span&gt;: Dropdown shows full name, email, and role badge
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Verify**&lt;/span&gt;: Role badge text contains "SUPERADMIN"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No selectors. No XPath. No &lt;code&gt;page.locator('#email-input')&lt;/code&gt;. Just descriptions of what a human would do and see.&lt;/p&gt;
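&lt;p&gt;A runner needs to turn those steps into something it can iterate over. Here's a minimal sketch (not our actual runner) of parsing the Action / Expected / Verify structure, assuming the heading and bullet conventions shown above:&lt;/p&gt;

```python
import re

def parse_steps(markdown: str):
    """Parse '### Step N: ...' sections into dicts with action/expected/verify.

    Assumes the Action/Expected/Verify bullet convention shown above;
    a real runner would also read the metadata table at the top of the file.
    """
    steps = []
    for block in re.split(r"^### Step ", markdown, flags=re.M)[1:]:
        header, _, body = block.partition("\n")
        num, _, title = header.partition(": ")
        step = {"num": int(num), "title": title.strip()}
        for key in ("Action", "Expected", "Verify"):
            m = re.search(rf"\*\*{key}\*\*: (.+)", body)
            if m:
                step[key.lower()] = m.group(1).strip()
        steps.append(step)
    return steps

test_md = """\
### Step 1: Navigate to the app

- **Action**: Open browser and navigate to the app URL
- **Expected**: The app redirects to /login
- **Verify**: URL ends with /login
"""
steps = parse_steps(test_md)
```

&lt;p&gt;The agent doesn't strictly need a parser, since it reads the markdown directly, but having the steps as structured data makes the results log trivial to produce.&lt;/p&gt;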

&lt;h2&gt;
  
  
  How it runs: agent-browser
&lt;/h2&gt;

&lt;p&gt;The execution engine is &lt;a href="https://github.com/vercel-labs/agent-browser" rel="noopener noreferrer"&gt;agent-browser&lt;/a&gt;, an open-source CLI from Vercel built for AI agents to automate browsers.&lt;/p&gt;

&lt;p&gt;An AI agent reads the markdown file, then translates each step into agent-browser commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Step 1: Navigate&lt;/span&gt;
agent-browser open &lt;span class="s2"&gt;"https://myapp.azurestaticapps.net/"&lt;/span&gt;
agent-browser &lt;span class="nb"&gt;wait&lt;/span&gt; &lt;span class="nt"&gt;--load&lt;/span&gt; networkidle

&lt;span class="c"&gt;# Step 2: Discover what's on the page (accessibility snapshot)&lt;/span&gt;
agent-browser snapshot &lt;span class="nt"&gt;-i&lt;/span&gt;
&lt;span class="c"&gt;# Returns: @e1 heading "Sign In", @e2 textbox "Email", @e3 textbox "Password", @e4 button "Sign In"&lt;/span&gt;

&lt;span class="c"&gt;# Step 3: Fill email&lt;/span&gt;
agent-browser fill @e2 &lt;span class="s2"&gt;"admin@example.com"&lt;/span&gt;

&lt;span class="c"&gt;# Step 4: Fill password&lt;/span&gt;
agent-browser fill @e3 &lt;span class="s2"&gt;"secretpassword"&lt;/span&gt;

&lt;span class="c"&gt;# Step 5: Click sign in&lt;/span&gt;
agent-browser click @e4
agent-browser &lt;span class="nb"&gt;wait&lt;/span&gt; &lt;span class="nt"&gt;--load&lt;/span&gt; networkidle

&lt;span class="c"&gt;# Step 6: Check what's on the dashboard&lt;/span&gt;
agent-browser snapshot &lt;span class="nt"&gt;-i&lt;/span&gt;
&lt;span class="c"&gt;# Returns the full accessibility tree, agent checks for nav items&lt;/span&gt;

&lt;span class="c"&gt;# Step 7: Open user dropdown (Radix UI popover)&lt;/span&gt;
agent-browser click @e15
agent-browser snapshot &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="s2"&gt;"[data-radix-popper-content-wrapper]"&lt;/span&gt;
&lt;span class="c"&gt;# Agent checks for "SUPERADMIN" in the popover content&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important bit: &lt;code&gt;snapshot -i&lt;/code&gt; returns an accessibility tree with reference IDs (&lt;code&gt;@e1&lt;/code&gt;, &lt;code&gt;@e2&lt;/code&gt;). The agent finds elements by their accessible name and role, not by CSS selectors. If someone renames a CSS class or reorders the DOM, the test still passes. It's testing what the user sees, not how the HTML is structured.&lt;/p&gt;
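&lt;p&gt;The role-and-name lookup is easy to sketch. The snapshot below is a simplified stand-in for what &lt;code&gt;snapshot -i&lt;/code&gt; returns, not its actual output format:&lt;/p&gt;

```python
def find_ref(snapshot, role, name):
    """Pick an element from an accessibility snapshot by role and accessible
    name, the way the agent does, instead of by CSS selector.

    `snapshot` is a simplified stand-in for agent-browser's output:
    a list of (ref, role, accessible-name) tuples.
    """
    matches = [ref for ref, r, n in snapshot
               if r == role and n.lower() == name.lower()]
    if len(matches) != 1:
        # Ambiguity is a finding in itself: if the agent can't tell two
        # "Sign In" buttons apart, a screen-reader user probably can't either.
        raise LookupError(f"expected exactly one {role} named {name!r}, "
                          f"got {len(matches)}")
    return matches[0]

snapshot = [
    ("@e1", "heading", "Sign In"),
    ("@e2", "textbox", "Email"),
    ("@e3", "textbox", "Password"),
    ("@e4", "button", "Sign In"),
]
ref = find_ref(snapshot, "button", "Sign In")  # -> "@e4"
```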

&lt;h2&gt;
  
  
  Writing tests as features get built
&lt;/h2&gt;

&lt;p&gt;Here's where it gets interesting. We set up our AI coding workflow so that when the agent builds a new feature, it also writes a regression test for that feature. Same session. Same context.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;/create-ai-test&lt;/code&gt; command walks the agent through:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read the React source for the UI component being tested&lt;/li&gt;
&lt;li&gt;Identify the user journey&lt;/li&gt;
&lt;li&gt;Write the structured markdown test&lt;/li&gt;
&lt;li&gt;Validate against our test principles&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So the agent doesn't just ship code. It writes the test coverage in the same session. Each feature comes with its own regression test, automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running tests in parallel
&lt;/h2&gt;

&lt;p&gt;Tests are independent by design, so we run them in parallel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/run-ai-tests P0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This launches one AI agent per test, each with its own isolated browser session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Test 001 runs in its own session&lt;/span&gt;
agent-browser &lt;span class="nt"&gt;--session&lt;/span&gt; test-001 open &lt;span class="s2"&gt;"https://..."&lt;/span&gt;
agent-browser &lt;span class="nt"&gt;--session&lt;/span&gt; test-001 snapshot &lt;span class="nt"&gt;-i&lt;/span&gt;

&lt;span class="c"&gt;# Test 002 runs simultaneously in a separate session&lt;/span&gt;
agent-browser &lt;span class="nt"&gt;--session&lt;/span&gt; test-002 open &lt;span class="s2"&gt;"https://..."&lt;/span&gt;
agent-browser &lt;span class="nt"&gt;--session&lt;/span&gt; test-002 snapshot &lt;span class="nt"&gt;-i&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
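&lt;p&gt;The fan-out itself is ordinary parallelism. A rough sketch of the orchestration, with illustrative test IDs and URL (in the real workflow the agents issue the commands step by step, not a script):&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess

def run_test(test_id: str, url: str, dry_run: bool = True):
    """Drive one markdown test in an isolated agent-browser session.

    Each test gets its own --session name, so parallel runs never share
    cookies, storage, or tabs. With dry_run=True this just returns the
    planned commands; the test IDs and URL are illustrative.
    """
    session = f"test-{test_id}"
    cmds = [
        ["agent-browser", "--session", session, "open", url],
        ["agent-browser", "--session", session, "snapshot", "-i"],
    ]
    if dry_run:
        return cmds
    return all(subprocess.run(c).returncode == 0 for c in cmds)

# One worker per test, like `/run-ai-tests P0` fanning out agents.
with ThreadPoolExecutor(max_workers=4) as pool:
    plans = list(pool.map(lambda t: run_test(t, "https://myapp.example/"),
                          ["001", "002"]))
```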



&lt;p&gt;Results get recorded back into each test file and into a central test log:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Last Run&lt;/span&gt;

| Field          | Value              |
|----------------|--------------------|
| &lt;span class="gs"&gt;**Timestamp**&lt;/span&gt;  | 2026-03-08 23:35   |
| &lt;span class="gs"&gt;**Result**&lt;/span&gt;     | PASS               |
| &lt;span class="gs"&gt;**Steps**&lt;/span&gt;      | 8/8 passed         |

&lt;span class="gu"&gt;### Step Results&lt;/span&gt;

| Step | Description                | Result | Notes                                   |
|------|----------------------------|--------|-----------------------------------------|
| 1    | Navigate to the app        | PASS   | Redirected to /login as expected        |
| 2    | Verify login page elements | PASS   | Email, Password, Sign in button present |
| 3    | Enter SuperAdmin email     | PASS   | Filled via fill @e2                     |
| 5    | Submit the login form      | PASS   | Redirected to / after networkidle wait  |
| 7    | Verify user role badge     | PASS   | Radix popover shows "SUPERADMIN" badge  |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
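&lt;p&gt;The write-back step is mechanical. Something along these lines could render the table above (a sketch, not our actual recorder):&lt;/p&gt;

```python
from datetime import datetime

def last_run_table(results):
    """Render a 'Last Run' section for a test file.

    `results` is a list of (step_no, description, passed, notes) tuples;
    the layout mirrors the table the runner writes back.
    """
    passed = sum(1 for _, _, ok, _ in results if ok)
    lines = [
        "## Last Run",
        "",
        "| Field | Value |",
        "|-------|-------|",
        f"| **Timestamp** | {datetime.now():%Y-%m-%d %H:%M} |",
        f"| **Result** | {'PASS' if passed == len(results) else 'FAIL'} |",
        f"| **Steps** | {passed}/{len(results)} passed |",
        "",
        "| Step | Description | Result | Notes |",
        "|------|-------------|--------|-------|",
    ]
    for no, desc, ok, notes in results:
        lines.append(f"| {no} | {desc} | {'PASS' if ok else 'FAIL'} | {notes} |")
    return "\n".join(lines)
```

&lt;p&gt;Because the log lives in the same file as the test, a reviewer sees the spec and its latest result side by side.&lt;/p&gt;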



&lt;h2&gt;
  
  
  Where this sits in the test stack
&lt;/h2&gt;

&lt;p&gt;This doesn't replace anything. It's an additional layer.&lt;/p&gt;

&lt;p&gt;Unit tests still test logic in isolation. AI writes these too. They're deterministic, and that's fine because they're testing pure functions, not UI.&lt;/p&gt;

&lt;p&gt;E2E tests (Playwright/Cypress): AI writes these as well. They're precise, fast, and good at catching exact regressions. But they're tightly coupled to the DOM. Every refactor risks breaking them.&lt;/p&gt;

&lt;p&gt;AI regression tests (markdown): this new layer. Tests that behave like a human. Find the button by what it looks like, not what it's called in the code. They survive refactors and component library swaps because they test the experience, not the implementation.&lt;/p&gt;

&lt;p&gt;The question each layer answers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Unit tests&lt;/td&gt;
&lt;td&gt;Does the logic work?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;E2E tests&lt;/td&gt;
&lt;td&gt;Does the code work exactly as written?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI regression tests&lt;/td&gt;
&lt;td&gt;Does the app work the way a user expects?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The elephant in the room (what QA will correctly point out)
&lt;/h2&gt;

&lt;p&gt;If you've been in testing for a while, your alarm bells are probably ringing. Yes, there are pros and cons. Let's address the valid criticisms:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. "You haven't solved brittleness, you just reinvented Cucumber/BDD."&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Writing tests in plain English isn't new. But in old BDD workflows, developers still had to write the glue code to explicitly map those English steps to CSS selectors. Here, there is no glue code. The agent translates the English intent directly against the live accessibility tree at runtime.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. "These tests are going to be slow and expensive."&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Yes, they are. Playwright runs a 100-step test in 3 seconds. An AI interpreting the DOM and making decisions takes minutes and costs API tokens. That's why these &lt;em&gt;don't&lt;/em&gt; replace E2E suites. They don't run on every single PR commit. They run on nightly builds or release branches to validate the human-experience layer, and at build time as a feedback loop for the builder agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. "Determinism is a feature! I want the test to fail if the DOM structure changes."&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Absolutely. If a developer accidentally strips out ARIA tags or blocks elements, your standard E2E suite will (and should) catch it instantly. But when the app is functionally healthy yet visually or structurally reorganised, AI tests survive the refactor where deterministic tests shatter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. "AI hallucinates. That means flaky tests."&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
LLMs aren't entirely deterministic, and they can struggle when multiple elements share identical accessible names. But here's the twist: if an AI navigating the accessibility tree gets confused by your UI, a human using an assistive device probably will too. It forces you to build genuinely accessible, semantic interfaces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. "Natural language is ambiguous. How do you know it's testing what you think it's testing?"&lt;/strong&gt;&lt;br&gt;
Fair point, and this is why the tests use a structured format (Action / Expected / Verify) rather than freeform prose. "Verify: URL ends with /login" is not ambiguous. "Verify: at least 3 of the 5 SuperAdmin menu items are present" is not ambiguous. The structure constrains the language enough that the agent knows exactly what to check. Freeform "test the login page" would be a problem. We don't do that.&lt;/p&gt;
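&lt;p&gt;To make that concrete: a structured Verify clause is constrained enough that common shapes map to mechanical checks. A toy dispatcher (not how the agent actually evaluates clauses) might look like this:&lt;/p&gt;

```python
def check_verify(clause: str, state: dict) -> bool:
    """Evaluate a structured Verify clause against the current page state.

    Deliberately tiny: the real agent interprets clauses with an LLM, but
    the structured phrasing keeps each one mechanically checkable.
    `state` is a hypothetical dict of facts from the last snapshot.
    """
    clause = clause.strip()
    if clause.startswith("URL ends with "):
        return state["url"].endswith(clause.removeprefix("URL ends with "))
    if clause.startswith("Role badge text contains "):
        needle = clause.split('"')[1]
        return needle in state.get("badge_text", "")
    raise ValueError(f"unhandled clause: {clause!r}")

state = {"url": "https://myapp.example/login"}
ok = check_verify("URL ends with /login", state)  # True
```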

&lt;p&gt;&lt;strong&gt;6. "The real problem is bad test architecture, not bad tools."&lt;/strong&gt;&lt;br&gt;
Agreed. If your existing tests use copy-pasted CSS selectors from the browser inspector, AI won't fix that. We still use proper test IDs, roles, and aria attributes in our E2E suite. The markdown tests are a separate layer for a separate question: does the app behave the way a user expects after a refactor?&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters now
&lt;/h2&gt;

&lt;p&gt;AI agents are writing more production code every month. The more autonomous the coding becomes, the more you need quality loops that are written at build time, not bolted on after the fact. Tests that a PM can review without knowing JavaScript. Tests that check behavior, not implementation. Tests where the file itself documents what the feature should do.&lt;/p&gt;

&lt;p&gt;The agents that ship reliable software won't just be the ones that write good code. They'll be the ones that write their own quality checks as they go.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Install agent-browser: &lt;code&gt;brew install agent-browser&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Create a markdown test file describing a user journey in your app&lt;/li&gt;
&lt;li&gt;Have your AI coding agent execute it step by step&lt;/li&gt;
&lt;li&gt;Record the results back into the file&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Start with login. It's the simplest journey and immediately proves the concept.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm a developer who can't stop tinkering with how AI agents build software. Early adopter of GenAI tooling, currently obsessed with the feedback loops between AI-written code and AI-driven quality. I've recently started sharing my thoughts on what I'm learning.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Fav Quote Today: The early adopters always look reckless. They also always have a head start.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>playwright</category>
      <category>regression</category>
    </item>
  </channel>
</rss>
