<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Daniel Castillo</title>
    <description>The latest articles on Forem by Daniel Castillo (@soydanicg).</description>
    <link>https://forem.com/soydanicg</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3812313%2F4f1fc6a4-a488-4b0e-b0a1-76663c455f1c.jpg</url>
      <title>Forem: Daniel Castillo</title>
      <link>https://forem.com/soydanicg</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/soydanicg"/>
    <language>en</language>
    <item>
      <title>I let Claude Code iterate on reCAPTCHA until it figured it out — here's what the skill looks like</title>
      <dc:creator>Daniel Castillo</dc:creator>
      <pubDate>Thu, 19 Mar 2026 00:25:29 +0000</pubDate>
      <link>https://forem.com/soydanicg/i-built-a-skill-that-solves-recaptcha-with-an-llm-heres-how-it-actually-works-3c25</link>
      <guid>https://forem.com/soydanicg/i-built-a-skill-that-solves-recaptcha-with-an-llm-heres-how-it-actually-works-3c25</guid>
      <description>&lt;h2&gt;
  
  
  First: what's a "skill"?
&lt;/h2&gt;

&lt;p&gt;In Claude Code, a &lt;strong&gt;skill&lt;/strong&gt; is a markdown file (&lt;code&gt;SKILL.md&lt;/code&gt;) with structured instructions and helper scripts. It's not a program — it's a playbook. When the agent encounters a situation the skill covers, it loads the instructions and follows them.&lt;/p&gt;

&lt;p&gt;So when I say "I built a skill that solves reCAPTCHA," what I really mean is: &lt;strong&gt;I set up a feedback loop where Claude Code tried to solve a CAPTCHA, failed, and I captured what it learned into a document it could reference next time.&lt;/strong&gt; The model does the heavy lifting. The skill just makes sure it doesn't repeat the same mistakes.&lt;/p&gt;
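
&lt;p&gt;For reference, a minimal &lt;code&gt;SKILL.md&lt;/code&gt; skeleton looks roughly like this — the frontmatter fields follow the public Claude Code skill format, but the body is an illustrative outline, not this skill's actual contents:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;---
name: recaptcha-solver
description: Playbook for solving reCAPTCHA challenges via chrome-devtools-mcp
---

# reCAPTCHA solving playbook

## When to use
A reCAPTCHA checkbox or image grid blocks the current browser flow.

## Steps
1. Click the checkbox via evaluate_script (the accessibility tree can't see it).
2. Detect the challenge text and grid size.
3. Analyze the tiles, select matches, verify, and check the result.

## Known failure modes
- Screenshotting tiles one by one exhausts the 2-minute timer.
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;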

&lt;p&gt;With that framing — here's what happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it started
&lt;/h2&gt;

&lt;p&gt;I was testing &lt;a href="https://docs.anthropic.com/en/docs/claude-code" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt; with the &lt;a href="https://github.com/ChromeDevTools/chrome-devtools-mcp" rel="noopener noreferrer"&gt;Chrome DevTools MCP server&lt;/a&gt; for browser automation. A reCAPTCHA popped up mid-flow. I asked Claude to solve it.&lt;/p&gt;

&lt;p&gt;It sort of worked — but it was slow, unreliable, and frequently timed out. Instead of moving on, I kept iterating: let Claude try, watch it fail, figure out &lt;em&gt;why&lt;/em&gt; it failed, and update the skill with that knowledge.&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/rsh8bfHssmA"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the naive approach fails
&lt;/h2&gt;

&lt;p&gt;The obvious way to automate a CAPTCHA with a browser agent is: take a snapshot of the accessibility tree → get the element UID → &lt;code&gt;click(uid)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Through iteration, three structural problems became clear:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. iframes kill the accessibility tree.&lt;/strong&gt; reCAPTCHA renders everything inside cross-origin iframes. Elements inside these iframes show up as "ignored" in the accessibility tree with no assignable UIDs. The standard &lt;code&gt;click(uid)&lt;/code&gt; approach simply can't see them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The timer is brutal.&lt;/strong&gt; reCAPTCHA gives you 2 minutes. Each tool call — screenshot, script evaluation, LLM analysis — takes 1-10 seconds. An unoptimized flow (e.g., taking individual screenshots of each of the 9 tiles) can exhaust the timer before you even click verify.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The tiles are too small.&lt;/strong&gt; At native size, tiles in a 3×3 grid are ~100px. At that resolution, the vision model confuses visually similar objects — buses vs. cars vs. motorcycles — often enough to fail and force a restart.&lt;/p&gt;

&lt;p&gt;None of this was obvious upfront. Each lesson came from watching the agent fail and asking "why?"&lt;/p&gt;

&lt;h2&gt;
  
  
  How the skill evolved
&lt;/h2&gt;

&lt;p&gt;The first version tried the accessibility tree approach and hit the iframe wall. The second switched to &lt;code&gt;evaluate_script&lt;/code&gt; for direct DOM access but kept timing out on 3×3 grids because it was screenshotting tiles one by one. The third introduced a zoom trick after I noticed the agent was failing because tiles were too small to classify reliably.&lt;/p&gt;

&lt;p&gt;Each time something failed, I'd update the &lt;code&gt;SKILL.md&lt;/code&gt; with what worked and what didn't. The skill is essentially a written-down memory of every mistake — so the agent doesn't have to rediscover them every time.&lt;/p&gt;

&lt;p&gt;In the video, &lt;code&gt;CLAUDE.md&lt;/code&gt; references the skill so it loads automatically, which makes it look like the agent just figures it out on the fly. But behind the scenes, there's a well-tested playbook that took many failed attempts to build.&lt;/p&gt;

&lt;h2&gt;
  
  
  The resulting flow: 5 rounds
&lt;/h2&gt;

&lt;p&gt;The flow the skill documents is optimized around one constraint: &lt;strong&gt;minimize total tool calls&lt;/strong&gt; to stay within the 2-minute timer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Round 1 — Click the checkbox
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;evaluate_script&lt;/code&gt; accesses &lt;code&gt;iframe[0].contentDocument&lt;/code&gt; directly and calls &lt;code&gt;.click()&lt;/code&gt; on &lt;code&gt;#recaptcha-anchor&lt;/code&gt;. This bypasses the accessibility tree entirely — it's the only reliable way to interact with cross-origin iframe content.&lt;/p&gt;
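
&lt;p&gt;As a sketch, the script body might look like this — the frame index and selector come from above; treat it as illustrative, not the skill's exact code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Illustrative evaluate_script body (frame order and selector as described above)
const anchorFrame = document.querySelectorAll('iframe')[0];
const doc = anchorFrame &amp;amp;&amp;amp; anchorFrame.contentDocument;
const anchor = doc &amp;amp;&amp;amp; doc.querySelector('#recaptcha-anchor');
if (anchor) anchor.click();
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;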

&lt;h3&gt;
  
  
  Round 2 — Detect the challenge
&lt;/h3&gt;

&lt;p&gt;Another &lt;code&gt;evaluate_script&lt;/code&gt;, this time on the challenge iframe (usually &lt;code&gt;iframe[2]&lt;/code&gt;), reads &lt;code&gt;.rc-imageselect-desc&lt;/code&gt; to get the challenge text ("Select all images with traffic lights") and counts &lt;code&gt;td[role="button"]&lt;/code&gt; to determine grid size: 9 tiles = 3×3, 16 tiles = 4×4. If it returns &lt;code&gt;state: 'loading'&lt;/code&gt;, it waits 1 second and retries.&lt;/p&gt;
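
&lt;p&gt;A sketch of that detection step, run as the body of an &lt;code&gt;evaluate_script&lt;/code&gt; function (frame index and selectors as described above; the return shape is an assumption for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Illustrative evaluate_script body for challenge detection
const frame = document.querySelectorAll('iframe')[2]; // challenge iframe, usually index 2
const doc = frame &amp;amp;&amp;amp; frame.contentDocument;
const desc = doc &amp;amp;&amp;amp; doc.querySelector('.rc-imageselect-desc');
if (!desc) return { state: 'loading' };          // caller waits 1 second and retries
const tiles = doc.querySelectorAll('td[role="button"]').length;
return {
  state: 'ready',
  challenge: desc.textContent.trim(),            // e.g. "Select all images with traffic lights"
  grid: tiles === 16 ? '4x4' : '3x3',            // 9 tiles = 3×3, 16 tiles = 4×4
};
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;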

&lt;h3&gt;
  
  
  Round 3 — Analyze the images
&lt;/h3&gt;

&lt;p&gt;This is where the strategy diverges based on grid size:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For 3×3 grids:&lt;/strong&gt; Apply &lt;code&gt;iframe.style.transform = 'scale(2)'&lt;/code&gt; via &lt;code&gt;evaluate_script&lt;/code&gt;, doubling tile size from ~100px to ~200px. Take a single &lt;code&gt;fullPage=true&lt;/code&gt; screenshot. Launch one sub-agent that receives the image + challenge text and returns matching indices (e.g., &lt;code&gt;MATCHES: 3, 5, 8&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Why not fetch individual tile images? The 9 tiles are actually a single CSS sprite with different &lt;code&gt;background-position&lt;/code&gt; values. Fetching the image URL gives you the complete sprite, not individual tiles. Individual screenshots by UID would cost 9 tool calls. One zoomed full-page screenshot solves both problems. This was one of those discoveries from iteration — the first attempt tried fetching tile URLs and got the same image 9 times.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For 4×4 grids:&lt;/strong&gt; Zooming doesn't help enough — 16 tiles are still too dense in a single screenshot. Instead: &lt;code&gt;take_snapshot(verbose=true)&lt;/code&gt; to get UIDs for all 16 tiles, then launch &lt;strong&gt;4 sub-agents in parallel&lt;/strong&gt; (one per row). Each agent screenshots its 4 tiles individually by UID and reports matches. Four rows analyzed simultaneously instead of 16 sequential tool calls. Estimated time: 30-45 seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Round 4 — Select and verify
&lt;/h3&gt;

&lt;p&gt;A single &lt;code&gt;evaluate_script&lt;/code&gt; that clicks all matching tiles and then clicks &lt;code&gt;#recaptcha-verify-button&lt;/code&gt;. Before clicking, it resets the zoom — otherwise click coordinates don't map correctly to the unscaled element positions.&lt;/p&gt;

&lt;p&gt;Combining tile selection and verify into one script call saves an entire round. Early versions did these as separate calls and kept hitting the timer.&lt;/p&gt;
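
&lt;p&gt;A sketch of that combined call — reset, select, verify in one script body (selectors from above; &lt;code&gt;matches&lt;/code&gt; stands in for the indices returned by Round 3):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Illustrative select+verify body — one evaluate_script call, as described
const frame = document.querySelectorAll('iframe')[2];
frame.style.transform = '';                      // reset zoom before clicking
const doc = frame.contentDocument;
const tiles = doc.querySelectorAll('td[role="button"]');
for (const i of matches) tiles[i].click();       // indices from Round 3
doc.querySelector('#recaptcha-verify-button').click();
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;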

&lt;h3&gt;
  
  
  Round 5 — Detect the result
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;evaluate_script&lt;/code&gt; interrogates the DOM to determine the outcome:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;State&lt;/th&gt;
&lt;th&gt;DOM Signal&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SUCCESS&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;#recaptcha-anchor[aria-checked="true"]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Done — submit the form&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;NEW_IMAGES&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;.rc-imageselect-error-dynamic-more&lt;/code&gt; visible&lt;/td&gt;
&lt;td&gt;Analyze only the replaced tiles&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WRONG_ANSWER&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;.rc-imageselect-incorrect-response&lt;/code&gt; visible&lt;/td&gt;
&lt;td&gt;Back to Round 2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SELECT_MORE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;.rc-imageselect-error-select-more&lt;/code&gt; visible&lt;/td&gt;
&lt;td&gt;Analyze unselected tiles&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;EXPIRED&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;.rc-anchor-error-msg&lt;/code&gt; contains "expired"&lt;/td&gt;
&lt;td&gt;Back to Round 1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ERROR&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;.rc-anchor-error-msg&lt;/code&gt; contains "error"&lt;/td&gt;
&lt;td&gt;Reload page&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The trickiest state is &lt;code&gt;NEW_IMAGES&lt;/code&gt;: reCAPTCHA sometimes replaces only the selected tiles with new images and asks you to evaluate those too. The skill documents how to detect this, snapshot only the changed tiles, and run Round 3 specifically for those — without restarting the entire challenge.&lt;/p&gt;

&lt;p&gt;Visibility detection uses &lt;code&gt;offsetParent !== null&lt;/code&gt; as a proxy, since Google keeps all error elements in the DOM but hides inactive ones with &lt;code&gt;display: none&lt;/code&gt;.&lt;/p&gt;
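
&lt;p&gt;The table above collapses into one decision function. This is an illustrative sketch using the selectors listed, with visibility checked via &lt;code&gt;offsetParent&lt;/code&gt;; which iframe document owns each selector is my assumption from the class-name prefixes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Illustrative result classifier (selectors from the table above)
function detectResult(anchorDoc, challengeDoc) {
  const visible = (doc, sel) =&amp;gt; {
    const el = doc.querySelector(sel);
    return el !== null &amp;amp;&amp;amp; el.offsetParent !== null; // hidden elements stay in the DOM
  };
  if (anchorDoc.querySelector('#recaptcha-anchor[aria-checked="true"]')) return 'SUCCESS';
  if (visible(challengeDoc, '.rc-imageselect-error-dynamic-more')) return 'NEW_IMAGES';
  if (visible(challengeDoc, '.rc-imageselect-incorrect-response')) return 'WRONG_ANSWER';
  if (visible(challengeDoc, '.rc-imageselect-error-select-more')) return 'SELECT_MORE';
  const err = anchorDoc.querySelector('.rc-anchor-error-msg');
  if (err &amp;amp;&amp;amp; /expired/i.test(err.textContent)) return 'EXPIRED';
  if (err &amp;amp;&amp;amp; /error/i.test(err.textContent)) return 'ERROR';
  return 'UNKNOWN';
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;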

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;p&gt;Happy path for a 3×3 grid: &lt;strong&gt;5 tool calls + 1 sub-agent analysis&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;checkbox → detect → zoom+screenshot+agent → select+verify → detect_result
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Total estimated time: &lt;strong&gt;20-30 seconds&lt;/strong&gt;. For a 4×4 grid with parallel agents: &lt;strong&gt;30-45 seconds&lt;/strong&gt;. Both well within the 2-minute timer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude Code&lt;/strong&gt; as the main agent (&lt;code&gt;claude-sonnet-4-6&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;chrome-devtools-mcp&lt;/strong&gt; (official MCP server) — exposes the browser as a tool&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools used:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;evaluate_script&lt;/code&gt; — iframe interaction (the only reliable method)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;take_screenshot&lt;/code&gt; with &lt;code&gt;fullPage: true&lt;/code&gt; — zoomed 3×3 grid capture&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;take_screenshot&lt;/code&gt; with &lt;code&gt;uid&lt;/code&gt; — individual tile capture for 4×4&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;take_snapshot&lt;/code&gt; with &lt;code&gt;verbose: true&lt;/code&gt; — UID discovery for 4×4&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;navigate_page&lt;/code&gt; with &lt;code&gt;type: reload&lt;/code&gt; — error recovery&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;new_page&lt;/code&gt; with &lt;code&gt;isolatedContext&lt;/code&gt; — cookie-free sessions for testing&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Sub-agents&lt;/strong&gt; launched via the &lt;code&gt;Agent&lt;/code&gt; tool (&lt;code&gt;claude-sonnet-4-6&lt;/code&gt;) — parallel visual analysis&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  What I actually find interesting about this
&lt;/h2&gt;

&lt;p&gt;This isn't novel — the security community has known for a while that visual CAPTCHAs are on borrowed time. reCAPTCHA v3 moved away from visual challenges toward behavioral scoring back in 2018, which tells you Google saw this coming too.&lt;/p&gt;

&lt;p&gt;What's interesting to me isn't that an LLM can solve a CAPTCHA — it's the &lt;strong&gt;process of building the skill&lt;/strong&gt;. The pattern of "let the agent try → watch it fail → encode the lessons → try again" turned out to be a surprisingly effective way to develop automation playbooks. The agent finds edge cases you wouldn't think of, and the skill prevents it from forgetting them.&lt;/p&gt;

&lt;p&gt;If you're still relying on visual CAPTCHAs as a primary bot mitigation layer, it's probably worth revisiting. Rate limiting, device fingerprinting, anomaly detection, and proof-of-work challenges don't depend on the assumption that machines can't solve perceptual puzzles.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The skill was built iteratively — each failure taught the agent something new about iframes, timing, tile resolution, and reCAPTCHA's state machine. No CAPTCHA-solving APIs, no external dependencies. Just an LLM with a browser, a set of instructions, and enough failed attempts to figure it out.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Have you run into similar patterns building skills or automations with LLM agents? I'd love to hear about it in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>discuss</category>
      <category>claude</category>
    </item>
    <item>
      <title>TracePact: Catch AI agent tool-call regressions before production</title>
      <dc:creator>Daniel Castillo</dc:creator>
      <pubDate>Sun, 08 Mar 2026 08:03:52 +0000</pubDate>
      <link>https://forem.com/soydanicg/tracepact-catch-ai-agent-tool-call-regressions-before-production-4f5m</link>
      <guid>https://forem.com/soydanicg/tracepact-catch-ai-agent-tool-call-regressions-before-production-4f5m</guid>
      <description>&lt;p&gt;You changed a prompt. The output still looks fine. But your agent stopped reading the config before deploying and switched from running tests to running builds.&lt;/p&gt;

&lt;p&gt;Nobody noticed until production broke.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;Most agent failures aren't bad text — they're &lt;strong&gt;bad behavior&lt;/strong&gt;. The agent calls the wrong tools, in the wrong order, with the wrong arguments. Output evals don't catch this because the final response still looks plausible.&lt;/p&gt;

&lt;p&gt;Teams try to catch it manually: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reviewing traces in agent UIs&lt;/li&gt;
&lt;li&gt;parsing raw session logs&lt;/li&gt;
&lt;li&gt;comparing old vs new runs by hand&lt;/li&gt;
&lt;li&gt;debugging regressions only after users report them&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  What TracePact does
&lt;/h2&gt;

&lt;p&gt;TracePact is a behavioral testing framework for AI agents. It works at the &lt;strong&gt;tool-call level&lt;/strong&gt;, not the text level. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Write behavior contracts:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;TraceBuilder&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@tracepact/vitest&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;trace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;TraceBuilder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addCall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;read_file&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;src/service.ts&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addCall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;write_file&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;src/service.ts&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addCall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;run_tests&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;PASS&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; 

&lt;span class="c1"&gt;// Did it read before writing?&lt;/span&gt;
&lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toHaveCalledToolsInOrder&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;read_file&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;write_file&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;run_tests&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;]);&lt;/span&gt; 

&lt;span class="c1"&gt;// Did it avoid shell?&lt;/span&gt;
&lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toNotHaveCalledTool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;bash&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No API calls. No tokens. Runs in milliseconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Record &amp;amp; replay:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Record a baseline (one-time, live)&lt;/span&gt;
npx tracepact run &lt;span class="nt"&gt;--live&lt;/span&gt; &lt;span class="nt"&gt;--record&lt;/span&gt;

&lt;span class="c"&gt;# Replay without API calls (instant, deterministic)&lt;/span&gt;
npx tracepact run &lt;span class="nt"&gt;--replay&lt;/span&gt; ./cassettes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Diff runs to catch drift:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx tracepact diff baseline.json latest.json &lt;span class="nt"&gt;--fail-on&lt;/span&gt; warn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;3 changes detected: 

- read_file (seq 1) (removed)
+ write_file (seq 3) (added)
~ bash.cmd: "npm test" -&amp;gt; "npm run build"

Summary: 1 removed, 1 added, 1 arg changed [BLOCK]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Filter noisy args and irrelevant tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx tracepact diff baseline.json latest.json &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--ignore-keys&lt;/span&gt; timestamp,requestId &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--ignore-tools&lt;/span&gt; read_file
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Severity levels: &lt;code&gt;none&lt;/code&gt; (identical), &lt;code&gt;warn&lt;/code&gt; (args changed), &lt;code&gt;block&lt;/code&gt; (tools added/removed). Use &lt;code&gt;--fail-on&lt;/code&gt; in CI to gate deployments. &lt;/p&gt;

&lt;h2&gt;
  
  
  Good fit
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Coding agents&lt;/strong&gt; — read before write, run tests before finishing, never edit restricted files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ops agents&lt;/strong&gt; — inspect before restarting, check evidence before acting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workflow agents&lt;/strong&gt; — validate before mutation, avoid duplicate side effects&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Internal assistants&lt;/strong&gt; — route each task to the correct system&lt;/li&gt;
&lt;/ul&gt;
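
&lt;p&gt;As an example of the ops-agent pattern, a contract might look like this — the tool names are hypothetical, invented for illustration, but the matchers are the ones shown above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { TraceBuilder } from '@tracepact/vitest';

const trace = new TraceBuilder()
  .addCall('get_service_status', { service: 'api' }, 'unhealthy')
  .addCall('read_logs', { service: 'api', lines: 100 }, '...')
  .addCall('restart_service', { service: 'api' }, 'ok')
  .build();

// Inspect before restarting: evidence first, action second
expect(trace).toHaveCalledToolsInOrder([
  'get_service_status', 'read_logs', 'restart_service',
]);

// Never escalate to destructive operations
expect(trace).toNotHaveCalledTool('delete_service');
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;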

&lt;h2&gt;
  
  
  Less useful for
&lt;/h2&gt;

&lt;p&gt;Pure chatbots, style evaluation, creative tasks, or systems where only text output matters. TracePact is for &lt;strong&gt;behavioral guarantees&lt;/strong&gt;, not response quality. &lt;/p&gt;

&lt;h2&gt;
  
  
  MCP server for IDEs
&lt;/h2&gt;

&lt;p&gt;TracePact ships an MCP server that works with Claude Code, Cursor, and Windsurf:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="nl"&gt;"tracepact"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"@tracepact/mcp-server"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tools: &lt;code&gt;tracepact_audit&lt;/code&gt;, &lt;code&gt;tracepact_run&lt;/code&gt;, &lt;code&gt;tracepact_capture&lt;/code&gt;, &lt;code&gt;tracepact_replay&lt;/code&gt;, &lt;code&gt;tracepact_diff&lt;/code&gt;, &lt;code&gt;tracepact_list_tests&lt;/code&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Get started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; @tracepact/core @tracepact/vitest @tracepact/cli
npx tracepact init
npx tracepact 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GitHub: &lt;a href="https://github.com/dcdeve/tracepact" rel="noopener noreferrer"&gt;https://github.com/dcdeve/tracepact&lt;/a&gt; &lt;/p&gt;




&lt;p&gt;We built this because we kept running into the same problem: prompt or model changes that silently break agent behavior while the output still looks fine. If you're testing AI agents, I'd love to hear how you're handling tool-call regressions today.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>devtools</category>
      <category>opensource</category>
    </item>
    <item>
      <title>TracePact: Catch AI agent tool-call regressions before production</title>
      <dc:creator>Daniel Castillo</dc:creator>
      <pubDate>Sun, 08 Mar 2026 08:03:52 +0000</pubDate>
      <link>https://forem.com/soydanicg/tracepact-catch-ai-agent-tool-call-regressions-before-production-5mh</link>
      <guid>https://forem.com/soydanicg/tracepact-catch-ai-agent-tool-call-regressions-before-production-5mh</guid>
      <description>&lt;p&gt;You changed a prompt. The output still looks fine. But your agent stopped reading the config before deploying and switched from running tests to running builds.&lt;/p&gt;

&lt;p&gt;Nobody noticed until production broke.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;Most agent failures aren't bad text — they're &lt;strong&gt;bad behavior&lt;/strong&gt;. The agent calls the wrong tools, in the wrong order, with the wrong arguments. Output evals don't catch this because the final response &lt;br&gt;
still looks plausible.&lt;/p&gt;

&lt;p&gt;Teams try to catch it manually: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reviewing traces in agent UIs&lt;/li&gt;
&lt;li&gt;parsing raw session logs&lt;/li&gt;
&lt;li&gt;comparing old vs new runs by hand&lt;/li&gt;
&lt;li&gt;debugging regressions only after users report them&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  What TracePact does
&lt;/h2&gt;

&lt;p&gt;TracePact is a behavioral testing framework for AI agents. It works at the &lt;strong&gt;tool-call level&lt;/strong&gt;, not the text level. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Write behavior contracts:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;TraceBuilder&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@tracepact/vitest&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;trace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;TraceBuilder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addCall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;read_file&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;src/service.ts&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addCall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;write_file&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;src/service.ts&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addCall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;run_tests&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;PASS&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; 

&lt;span class="c1"&gt;// Did it read before writing?&lt;/span&gt;
&lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toHaveCalledToolsInOrder&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;read_file&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;write_file&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;run_tests&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;]);&lt;/span&gt; 

&lt;span class="c1"&gt;// Did it avoid shell?&lt;/span&gt;
&lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toNotHaveCalledTool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;bash&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No API calls. No tokens. Runs in milliseconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Record &amp;amp; replay:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Record a baseline (one-time, live)&lt;/span&gt;
npx tracepact run &lt;span class="nt"&gt;--live&lt;/span&gt; &lt;span class="nt"&gt;--record&lt;/span&gt;

&lt;span class="c"&gt;# Replay without API calls (instant, deterministic)&lt;/span&gt;
npx tracepact run &lt;span class="nt"&gt;--replay&lt;/span&gt; ./cassettes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Diff runs to catch drift:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx tracepact diff baseline.json latest.json &lt;span class="nt"&gt;--fail-on&lt;/span&gt; warn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;3 changes detected: 

- read_file (seq 1) (removed)
+ write_file (seq 3) (added)
~ bash.cmd: "npm test" -&amp;gt; "npm run build"

Summary: 1 removed, 1 added, 1 arg changed [BLOCK]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Filter noisy args and irrelevant tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx tracepact diff baseline.json latest.json &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--ignore-keys&lt;/span&gt; timestamp,requestId &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--ignore-tools&lt;/span&gt; read_file
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
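&lt;p&gt;&lt;code&gt;--ignore-keys&lt;/code&gt; amounts to stripping the named keys from each call's arguments before comparing. A hypothetical sketch of that filtering step (not TracePact's actual code):&lt;/p&gt;

```javascript
// Hypothetical arg filter: drop ignored keys so volatile values
// (timestamps, request IDs) don't register as drift in the diff.
function stripIgnoredKeys(args, ignored) {
  return Object.fromEntries(
    Object.entries(args).filter(([key]) => !ignored.includes(key))
  );
}

const args = { path: 'a.txt', timestamp: 1710800000, requestId: 'abc123' };
console.log(stripIgnoredKeys(args, ['timestamp', 'requestId']));
// { path: 'a.txt' }
```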



&lt;p&gt;Severity levels: &lt;code&gt;none&lt;/code&gt; (identical), &lt;code&gt;warn&lt;/code&gt; (args changed), &lt;code&gt;block&lt;/code&gt; (tools added/removed). Use &lt;code&gt;--fail-on&lt;/code&gt; in CI to gate deployments. &lt;/p&gt;
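&lt;p&gt;These severity rules map directly onto the diff summary; a hypothetical sketch of the classification logic (the counts object is an assumption, not TracePact's real data model):&lt;/p&gt;

```javascript
// Hypothetical severity classifier mirroring the documented levels:
// tools added/removed -> block, arg changes only -> warn, else none.
function classifyDiff({ added, removed, argChanges }) {
  if (added > 0 || removed > 0) return 'block';
  if (argChanges > 0) return 'warn';
  return 'none';
}

console.log(classifyDiff({ added: 1, removed: 1, argChanges: 1 })); // block
console.log(classifyDiff({ added: 0, removed: 0, argChanges: 2 })); // warn
console.log(classifyDiff({ added: 0, removed: 0, argChanges: 0 })); // none
```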

&lt;h2&gt;
  
  
  Good fit
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Coding agents&lt;/strong&gt; — read before write, run tests before finishing, never edit restricted files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ops agents&lt;/strong&gt; — inspect before restarting, check evidence before acting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workflow agents&lt;/strong&gt; — validate before mutation, avoid duplicate side effects&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Internal assistants&lt;/strong&gt; — route each task to the correct backend system&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Less useful for
&lt;/h2&gt;

&lt;p&gt;Pure chatbots, style evaluation, creative tasks, or systems where only text output matters. TracePact is for &lt;strong&gt;behavioral guarantees&lt;/strong&gt;, not response quality. &lt;/p&gt;

&lt;h2&gt;
  
  
  MCP server for IDEs
&lt;/h2&gt;

&lt;p&gt;TracePact ships an MCP server that works with Claude Code, Cursor, and Windsurf:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="nl"&gt;"tracepact"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"@tracepact/mcp-server"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tools: &lt;code&gt;tracepact_audit&lt;/code&gt;, &lt;code&gt;tracepact_run&lt;/code&gt;, &lt;code&gt;tracepact_capture&lt;/code&gt;, &lt;code&gt;tracepact_replay&lt;/code&gt;, &lt;code&gt;tracepact_diff&lt;/code&gt;, &lt;code&gt;tracepact_list_tests&lt;/code&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Get started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; @tracepact/core @tracepact/vitest @tracepact/cli
npx tracepact init
npx tracepact 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GitHub: &lt;a href="https://github.com/dcdeve/tracepact" rel="noopener noreferrer"&gt;https://github.com/dcdeve/tracepact&lt;/a&gt; &lt;/p&gt;




&lt;p&gt;We built this because we kept running into the same problem: prompt or model changes that silently break agent behavior while the output still looks fine. If you're testing AI agents, I'd love to hear how you're handling tool-call regressions today.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>devtools</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
