<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: MiniKao</title>
    <description>The latest articles on Forem by MiniKao (@kao273183).</description>
    <link>https://forem.com/kao273183</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3935400%2F6be1a77a-c14f-41c6-ba1c-d37a516718ba.jpeg</url>
      <title>Forem: MiniKao</title>
      <link>https://forem.com/kao273183</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/kao273183"/>
    <language>en</language>
    <item>
      <title>From mock-only-works to real-world-works: 48 hours of reCAPTCHA debugging</title>
      <dc:creator>MiniKao</dc:creator>
      <pubDate>Mon, 25 May 2026 02:46:50 +0000</pubDate>
      <link>https://forem.com/kao273183/from-mock-only-works-to-real-world-works-48-hours-of-recaptcha-debugging-d6e</link>
      <guid>https://forem.com/kao273183/from-mock-only-works-to-real-world-works-48-hours-of-recaptcha-debugging-d6e</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Honest framing first&lt;/strong&gt;: &lt;code&gt;mk-qa-master&lt;/code&gt; is an open-source MCP server for QA engineers. The reCAPTCHA solver in it is a &lt;strong&gt;Tier 3 fallback&lt;/strong&gt; for testing your own apps when Tier 1 (Google's official test keys) and Tier 2 (feature flags / IP allowlist) aren't available. It is &lt;strong&gt;not&lt;/strong&gt; a "beat captcha" tool. It refuses to run on Google / Apple / Microsoft / Discord login pages regardless of consent flag. With that out of the way…&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is a diary about the 48 hours it took to go from "shipped a reCAPTCHA solver, all unit tests green" to "it actually works against the real Google demo." Five versions (&lt;code&gt;v0.7.0&lt;/code&gt; → &lt;code&gt;v0.7.4&lt;/code&gt;), three of them broken against the real thing, and a bunch of lessons that I want to write down before I forget.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;The idea behind the solver is simple. Two atomic MCP tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;inspect_visual_challenge&lt;/code&gt; — finds the captcha iframe on the current page, screenshots it, returns the tile grid coordinates + a screenshot.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;solve_visual_challenge&lt;/code&gt; — accepts the AI client's tile selection (which tiles contain buses, which contain crosswalks, etc.), clicks them, presses Verify, returns the token.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The AI client (Claude Code, Cursor, etc.) sees the screenshot, decides which tiles match the prompt, and calls solve. The server is the eyes and hands; the AI is the brain. Multimodal models like Claude 4.7 are surprisingly good at this — they were trained on the open web, which has a lot of bus pictures.&lt;/p&gt;

&lt;p&gt;So far so good in theory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Day 1 — v0.7.0 ships
&lt;/h2&gt;

&lt;p&gt;The first version landed Monday. It detected the reCAPTCHA &lt;code&gt;iframe[src*="bframe"]&lt;/code&gt;, screenshotted it, computed tile coordinates by dividing the iframe's bounding box into a 3×3 or 4×4 grid, and clicked at the center of each selected tile.&lt;/p&gt;

&lt;p&gt;Unit tests passed. The bundled mock fixture (a self-contained HTML page that mimics reCAPTCHA's structure) round-tripped end-to-end. I wrote a PRD, shipped a release, posted a &lt;code&gt;Dev.to&lt;/code&gt; walkthrough. Felt great.&lt;/p&gt;

&lt;p&gt;The mock fixture's structure was:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;table&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"rc-imageselect-table"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;&lt;/span&gt;...&lt;span class="nt"&gt;&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;&lt;/span&gt;...&lt;span class="nt"&gt;&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;&lt;/span&gt;...&lt;span class="nt"&gt;&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;/span&gt;
  ...
&lt;span class="nt"&gt;&amp;lt;/table&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Selectors in the fingerprint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tile_table_selector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.rc-imageselect-table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tile_cell_selector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;td&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What could go wrong?&lt;/p&gt;

&lt;h2&gt;
  
  
  Day 2 — v0.7.1 adds hCaptcha
&lt;/h2&gt;

&lt;p&gt;The next day I extended the fingerprint table to support hCaptcha. Same architecture — different selectors. No new MCP tools. Tests stayed green. I felt good about the design: when a vendor changes, you add a row to the fingerprint table, you're done.&lt;/p&gt;

&lt;p&gt;I didn't run a real-world dogfood for hCaptcha either. (We'll come back to this.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Day 3 — v0.7.2: the first "fix"
&lt;/h2&gt;

&lt;p&gt;I wrote a tiny dogfood script — open Chromium, navigate to &lt;code&gt;https://www.google.com/recaptcha/api2/demo&lt;/code&gt;, click the anchor to trigger an image challenge, call &lt;code&gt;inspect_visual_challenge&lt;/code&gt;, save the screenshot, ask the AI for tile indices, call &lt;code&gt;solve_visual_challenge&lt;/code&gt;, see if a token comes back.&lt;/p&gt;

&lt;p&gt;The first run came back with status &lt;code&gt;failed&lt;/code&gt;. I asked the user (in this case: me) what they saw in the browser. The answer was unsettling: &lt;em&gt;"I told it to click 2, 5, and 8 — only 5 and 8 actually got highlighted."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I dug into the coordinate math. The iframe-divide approach split the full iframe into rows × cols cells. But the iframe contains a header banner (the prompt text) above the grid and a footer (the Verify button) below. So:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For a 3×3 grid in a 400×580 iframe with header ~130px and footer ~130px:

&lt;ul&gt;
&lt;li&gt;The actual grid is 320px tall, ~107px per row.&lt;/li&gt;
&lt;li&gt;Naive iframe-divide gives 193px per row.&lt;/li&gt;
&lt;li&gt;Row 0's computed center lands in the header banner.&lt;/li&gt;
&lt;li&gt;Row 2's computed center lands in the footer.&lt;/li&gt;
&lt;li&gt;Only row 1 happens to be roughly correct.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
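&lt;p&gt;The arithmetic above is easy to check by hand. Here is a minimal Python sketch of it, using the approximate 130px header and footer heights quoted in the list. The helper names are mine for illustration, not the project's:&lt;/p&gt;

```python
from bisect import bisect_right

# Figures quoted in the list above; header and footer heights are approximate.
IFRAME_H = 580
HEADER_H = 130
FOOTER_H = 130
ROWS = 3


def naive_row_centers(iframe_h, rows):
    """Vertical click targets when the whole iframe is split into equal rows."""
    row_h = iframe_h / rows
    return [row_h * (row + 0.5) for row in range(rows)]


def region_at(y, iframe_h, header_h, footer_h):
    """Name the part of the iframe that a y coordinate falls in."""
    boundaries = [header_h, iframe_h - footer_h]  # grid top, grid bottom
    return ["header", "grid", "footer"][bisect_right(boundaries, y)]


centers = naive_row_centers(IFRAME_H, ROWS)
hits = [region_at(y, IFRAME_H, HEADER_H, FOOTER_H) for y in centers]
for row, (y, hit) in enumerate(zip(centers, hits)):
    print(f"row {row}: naive center y={y:.0f} lands in the {hit}")
```

&lt;p&gt;With these numbers only the middle row's click target falls inside the grid, which is the failure pattern the list describes.&lt;/p&gt;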

&lt;p&gt;I wrote v0.7.2 to fix this. Instead of dividing the iframe, I'd read each cell's real &lt;code&gt;bounding_box()&lt;/code&gt; from the DOM via Playwright:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tile_count&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;bb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cells&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;nth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;bounding_box&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;is_real_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bb&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# fall back to iframe-divide for mock fixtures
&lt;/span&gt;        &lt;span class="k"&gt;break&lt;/span&gt;
    &lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;viewport_x&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;bb&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;...})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The unit test (against the mock fixture) immediately confirmed the fix. I bumped to v0.7.2, opened a PR, merged, released, published to PyPI. Done.&lt;/p&gt;

&lt;h2&gt;
  
  
  Day 4 morning — wait, it's still broken
&lt;/h2&gt;

&lt;p&gt;Next morning, ran the dogfood again. Console output for &lt;code&gt;inspect&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"tiles"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"viewport_x"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"viewport_y"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;92&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;"w"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;133&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"h"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;193&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"viewport_x"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;218&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"viewport_y"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;92&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"w"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;133&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"h"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;193&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;133 × 193&lt;/code&gt; rectangles. &lt;strong&gt;The exact dimensions you'd get from dividing a 400×580 iframe by 3×3.&lt;/strong&gt; Which meant the per-cell &lt;code&gt;bounding_box()&lt;/code&gt; path was returning &lt;code&gt;None&lt;/code&gt; on every cell in real reCAPTCHA, silently falling back to the same broken iframe-divide math.&lt;/p&gt;

&lt;p&gt;Looked at the code path: it had a &lt;code&gt;try / except&lt;/code&gt; swallowing the error. I added a debug field &lt;code&gt;_coord_method&lt;/code&gt; so the inspect response would show which path actually fired:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_coord_method&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iframe_divide&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# ← v0.7.2's "fix" never ran
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So &lt;code&gt;v0.7.2&lt;/code&gt; fixed the mock fixture and shipped to PyPI. In production, against real Google reCAPTCHA, it behaved identically to &lt;code&gt;v0.7.0&lt;/code&gt;. The unit test was green because the mock fixture's &lt;code&gt;&amp;lt;td&amp;gt;&lt;/code&gt; elements had real CSS dimensions; in real reCAPTCHA the tiles aren't &lt;code&gt;&amp;lt;td&amp;gt;&lt;/code&gt;. I just didn't know that yet.&lt;/p&gt;
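&lt;p&gt;To make the two code paths concrete, here is a rough sketch of a geometry helper that always reports which path produced its numbers. It is not the project's actual code. The function name and the &lt;code&gt;cell_bounding_box&lt;/code&gt; label are hypothetical, and the iframe origin (85, 92) is inferred from the tile 0 output above. The per-cell boxes use the &lt;code&gt;x&lt;/code&gt; / &lt;code&gt;y&lt;/code&gt; / &lt;code&gt;width&lt;/code&gt; / &lt;code&gt;height&lt;/code&gt; dict shape that Playwright's &lt;code&gt;bounding_box()&lt;/code&gt; returns:&lt;/p&gt;

```python
def tile_geometry(cell_boxes, iframe_box, rows, cols):
    """Return (tiles, coord_method) so the caller can see which path fired.

    cell_boxes holds one entry per tile: a dict with x, y, width, height,
    or None when the element could not be measured.
    """
    expected = rows * cols
    measured = len(cell_boxes) == expected and all(
        isinstance(box, dict) for box in cell_boxes
    )
    if measured:
        tiles = [
            {"index": i, "viewport_x": box["x"], "viewport_y": box["y"],
             "w": box["width"], "h": box["height"]}
            for i, box in enumerate(cell_boxes)
        ]
        return tiles, "cell_bounding_box"  # hypothetical label for the DOM path

    # Fallback: split the iframe evenly. Tagged so it is never silent.
    w = iframe_box["width"] / cols
    h = iframe_box["height"] / rows
    tiles = [
        {"index": i,
         "viewport_x": iframe_box["x"] + (i % cols) * w,
         "viewport_y": iframe_box["y"] + (i // cols) * h,
         "w": w, "h": h}
        for i in range(expected)
    ]
    return tiles, "iframe_divide"


# Every cell unmeasurable, as happened against the real reCAPTCHA.
iframe = {"x": 85, "y": 92, "width": 400, "height": 580}
tiles, method = tile_geometry([None] * 9, iframe, rows=3, cols=3)
print(method, round(tiles[0]["w"]), round(tiles[0]["h"]))
```

&lt;p&gt;Returning the method next to the tiles means a test can assert on the path taken, not just on the shape of the output.&lt;/p&gt;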

&lt;h2&gt;
  
  
  Day 4 afternoon — going DOM-spelunking
&lt;/h2&gt;

&lt;p&gt;Wrote a one-off debug script that opened the real reCAPTCHA bframe and ran arbitrary JavaScript inside it. The first query was: &lt;em&gt;"does &lt;code&gt;.rc-imageselect-table&lt;/code&gt; even exist?"&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tableExists"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"altSelectors"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"table[class*=\"rc-imageselect\"]"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;".rc-imageselect-target"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;".rc-image-tile-wrapper"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;false&lt;/code&gt;. The class I'd been targeting since v0.7.0 doesn't exist in production.&lt;/p&gt;

&lt;p&gt;The real DOM looks like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Element&lt;/th&gt;
&lt;th&gt;Mock fixture&lt;/th&gt;
&lt;th&gt;Real Google reCAPTCHA&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Table class&lt;/td&gt;
&lt;td&gt;&lt;code&gt;rc-imageselect-table&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;rc-imageselect-table-33&lt;/code&gt; (or &lt;code&gt;-44&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tile element&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;&amp;lt;td&amp;gt;&lt;/code&gt; with real CSS&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;&amp;lt;div class="rc-image-tile-wrapper"&amp;gt;&lt;/code&gt; (the &lt;code&gt;&amp;lt;td&amp;gt;&lt;/code&gt; is 0×0 because tiles are absolutely-positioned)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Challenge text&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.rc-imageselect-desc&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;.rc-imageselect-desc-no-canonical&lt;/code&gt; (in dynamic-replace mode)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The whole fingerprint table had been wrong all along. Unit tests passed because &lt;em&gt;I wrote the mock fixture&lt;/em&gt; to match the selectors I'd hardcoded. Tautology. The mock fixture lied because the person who wrote it was the same person who wrote the selectors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Day 4 evening — v0.7.3 actually fixes it
&lt;/h2&gt;

&lt;p&gt;I rewrote the fingerprint so each entry is a CSS selector list: comma-separated selectors, where an element matches if it matches any one of them. That lets a single entry cover both the real and the mock DOM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;challenge_text_selector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.rc-imageselect-desc-no-canonical, .rc-imageselect-desc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tile_table_selector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;table[class*=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rc-imageselect-table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;], &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.rc-imageselect-target, .rc-imageselect-table&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tile_cell_selector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.rc-image-tile-wrapper, .rc-imageselect-table td&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the same fingerprint matches both production reCAPTCHA AND the mock fixture. The per-cell &lt;code&gt;bounding_box()&lt;/code&gt; path finally runs against real DOM, returning real 95×95 squares instead of distorted 133×193 rectangles. Tile 0 sits at &lt;code&gt;y=211&lt;/code&gt; (just below the 200px header), not &lt;code&gt;y=92&lt;/code&gt; (inside the header banner).&lt;/p&gt;

&lt;p&gt;I also fixed a different UX problem in the same release. The MCP server was returning the screenshot as a base64 string embedded in a JSON &lt;code&gt;TextContent&lt;/code&gt;. Multimodal AI clients can't "see" base64 — they see a giant string of &lt;code&gt;iVBORw0KG...&lt;/code&gt;. The fix: return the screenshot as a native MCP &lt;code&gt;ImageContent&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nc"&gt;ImageContent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;b64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mimeType&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image/png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;TextContent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now Claude Code receives the screenshot as if you'd dragged it into the chat. No manual screenshot juggling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Day 4 night — v0.7.4 closes the multi-round gap
&lt;/h2&gt;

&lt;p&gt;One more dogfood run, this time the challenge text was different: &lt;em&gt;"Select all images with buses. Click verify once there are none left."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is reCAPTCHA's &lt;strong&gt;dynamic-replace mode&lt;/strong&gt;. Click a matching tile, the tile gets replaced with a new image. You have to keep selecting until no buses remain, then click Verify. v0.7.3 always clicked Verify after the first round, so it always failed against this mode even with perfect tile judgment.&lt;/p&gt;

&lt;p&gt;v0.7.4 added a new return status: &lt;code&gt;"continue"&lt;/code&gt;. When solve detects dynamic mode (the prompt contains "none left" / "確定沒有遺漏" (Traditional Chinese, roughly "make sure none are missed") / equivalent), it does the clicks, waits for the replace animation, re-screenshots the iframe, and returns &lt;code&gt;status: "continue"&lt;/code&gt; with a fresh screenshot + new tile geometry. The AI client looks at the new grid, finds any remaining matches, and calls solve again. When the AI sees no more matches, it passes an empty &lt;code&gt;selected_tile_indices: []&lt;/code&gt; to signal "click Verify now."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json-doc"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Round 1 response&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"continue"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"rounds_used"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"screenshot_base64"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"...new grid..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tiles"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="c1"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"hint"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Dynamic-replace round 1/5. Look at the new screenshot and call solve again."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;// Round 2 AI sees no more buses&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"selected_tile_indices"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"confirm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;// Round 2 response&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"passed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"token"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"03AGdBq25..."&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hard cap of 5 rounds prevents infinite loops on pathological challenges. Static mode (no marker phrase) is unchanged — legacy flow runs verbatim.&lt;/p&gt;
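&lt;p&gt;The round-trip can be sketched as a small driver loop. This is an illustration under assumptions, not the server's implementation: &lt;code&gt;drive_rounds&lt;/code&gt;, &lt;code&gt;fake_solve&lt;/code&gt; and &lt;code&gt;fake_pick&lt;/code&gt; are hypothetical stand-ins for the MCP tool and the AI client, the marker list is reduced to the one English phrase, and the &lt;code&gt;"failed"&lt;/code&gt; result on hitting the cap is my guess at a sensible outcome:&lt;/p&gt;

```python
MAX_ROUNDS = 5  # the hard cap described above
DYNAMIC_MARKERS = ("none left",)  # assumed; the real list covers more locales


def is_dynamic_prompt(prompt):
    """True when the challenge text asks for repeated selection."""
    text = prompt.lower()
    return any(marker in text for marker in DYNAMIC_MARKERS)


def drive_rounds(solve, pick_tiles, screenshot, max_rounds=MAX_ROUNDS):
    """Call solve() until it stops answering 'continue' or the cap is hit."""
    for _ in range(max_rounds):
        selection = pick_tiles(screenshot)  # [] means "nothing left, click Verify"
        result = solve(selected_tile_indices=selection, confirm=True)
        if result["status"] != "continue":
            return result
        screenshot = result["screenshot_base64"]
    return {"status": "failed", "reason": "round cap reached"}


# Stand-ins for the MCP tool and the AI client, for illustration only.
def fake_solve(selected_tile_indices, confirm):
    if selected_tile_indices:
        return {"status": "continue", "screenshot_base64": "round-2-grid"}
    return {"status": "passed", "token": "fake-token"}


def fake_pick(screenshot):
    return [2, 5] if screenshot == "round-1-grid" else []


outcome = drive_rounds(fake_solve, fake_pick, "round-1-grid")
print(outcome["status"])
```

&lt;p&gt;Bounding the loop with &lt;code&gt;range(max_rounds)&lt;/code&gt; guarantees termination even if the picker keeps finding matches forever.&lt;/p&gt;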

&lt;p&gt;And — the lesson from this entire saga — I added a &lt;strong&gt;weekly GitHub Action that runs the dogfood script against the real Google reCAPTCHA demo&lt;/strong&gt; and asserts &lt;code&gt;_coord_method != "iframe_divide"&lt;/code&gt;. If Google ships a DOM change next week that breaks the fingerprint again, I'll get a CI failure email within seven days instead of finding out from a user issue six months later.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cron&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;2&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;0"&lt;/span&gt;  &lt;span class="c1"&gt;# Sunday 02:00 UTC&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
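&lt;p&gt;The assertion step of such a job could look roughly like the following. This is a sketch only: &lt;code&gt;check_inspect_output&lt;/code&gt; is a hypothetical helper, and treating a missing &lt;code&gt;_coord_method&lt;/code&gt; field as a failure is my assumption, not necessarily what the real workflow does:&lt;/p&gt;

```python
import json


def check_inspect_output(raw_json):
    """Return a process exit code: 0 when per-cell geometry was used, 1 otherwise."""
    response = json.loads(raw_json)
    method = response.get("_coord_method")
    if method is None or method == "iframe_divide":
        print(f"selector drift suspected: _coord_method={method!r}")
        return 1
    print(f"ok: _coord_method={method!r}")
    return 0


# A response shaped like the broken v0.7.2 output.
sample = json.dumps({"_coord_method": "iframe_divide", "tiles": []})
exit_code = check_inspect_output(sample)
```

&lt;p&gt;A CI step would pass the result to &lt;code&gt;sys.exit()&lt;/code&gt; so that a non-zero code fails the scheduled run.&lt;/p&gt;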



&lt;h2&gt;
  
  
  What works now
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;✅ reCAPTCHA v2 image-grid (3×3 + 4×4) — verified against the real Google demo&lt;/li&gt;
&lt;li&gt;✅ hCaptcha image-select — same fingerprint infrastructure, &lt;strong&gt;fixture verified, real-vendor TBD&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;✅ Multi-round dynamic-replace — unit-test verified, &lt;strong&gt;end-to-end real-vendor TBD&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;✅ MCP &lt;code&gt;ImageContent&lt;/code&gt; — multimodal clients see screenshots natively&lt;/li&gt;
&lt;li&gt;✅ Consent gate, domain allowlist, hard-stop blacklist (Google / Apple / Microsoft / Discord login pages)&lt;/li&gt;
&lt;li&gt;✅ Weekly real-world CI guard&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What doesn't (yet)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;❌ Mobile WebView — &lt;code&gt;v0.8.0&lt;/code&gt; mini-PRD drafted, ~6 working days of implementation ahead&lt;/li&gt;
&lt;li&gt;❌ reCAPTCHA v3 — pure behavior scoring, no visible challenge, &lt;strong&gt;out of scope by design&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;❌ Cloudflare Turnstile — same reason&lt;/li&gt;
&lt;li&gt;❌ Audio captcha fallback — accessibility tier, low usage in QA context&lt;/li&gt;
&lt;li&gt;❌ The dynamic-replace loop on real Google reCAPTCHA with AI in the loop — that's my next dogfood session&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Lessons I want to remember
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Mock fixtures can lie.&lt;/strong&gt; When the same person writes both the production selectors and the mock that tests them, the mock matches by construction. There's no signal. The fix is to dogfood against the real thing. If you can't dogfood, at minimum save a snapshot of the real DOM (or replay a recorded HAR of the real page) and assert against that.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Silent fallbacks are the worst kind of bug.&lt;/strong&gt; v0.7.2's &lt;code&gt;try / except&lt;/code&gt; swallowed the failure of every per-cell &lt;code&gt;bounding_box()&lt;/code&gt; and quietly fell back to broken math. A &lt;code&gt;_coord_method&lt;/code&gt; debug field that surfaces which path actually fired would have caught this in minutes. I now add a debug field every time I have more than one code path for the same output.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-round is UX, not a bug.&lt;/strong&gt; reCAPTCHA's "Click verify once there are none left" isn't an edge case — it's the dominant mode on hard challenges. I built the static-only solver, said "ship it," and was surprised when most real-world challenges fell into the dynamic-replace bucket I hadn't designed for.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Weekly CI catches what unit tests can't.&lt;/strong&gt; The dogfood workflow runs once a week against a third-party demo. It's noisy, it depends on a vendor's continued cooperation, and it'd be wrong to depend on it for blocking merges. But as a &lt;em&gt;background&lt;/em&gt; signal that catches selector drift, it's exactly the right level of investment.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



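&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;mk-qa-master&lt;span class="o"&gt;==&lt;/span&gt;0.7.4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In your MCP host config (Claude Code, Cursor, etc.):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"qa-master"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"python"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-m"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mk_qa_master.server"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"QA_VISUAL_CHALLENGE_CONSENT"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"QA_VISUAL_CHALLENGE_AUTHORIZED_DOMAINS"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"your-staging.example.com"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;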



&lt;p&gt;Then ask Claude: &lt;em&gt;"Test the signup flow on staging. If you hit a captcha, solve it."&lt;/em&gt; The MCP tools take it from there.&lt;/p&gt;

&lt;p&gt;Repo + walkthrough: &lt;code&gt;https://github.com/kao273183/mk-qa-master&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;v0.8.0&lt;/code&gt; — mobile WebView captcha via Maestro CLI (same fingerprint table, new driver). PRD is up in the repo. Probably another diary entry when that one ships.&lt;/p&gt;

&lt;p&gt;If you find a bug, the dogfood script lives at &lt;code&gt;scripts/dogfood-inspect-only.py&lt;/code&gt; — run it against the page that broke and the inspect output will tell you exactly which coordinate path fired. Beats debugging blind.&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>captcha</category>
      <category>playwright</category>
      <category>testing</category>
    </item>
    <item>
      <title>I open-sourced 24 QA skills for Claude Code — from spec to release</title>
      <dc:creator>MiniKao</dc:creator>
      <pubDate>Fri, 22 May 2026 03:16:52 +0000</pubDate>
      <link>https://forem.com/kao273183/i-open-sourced-24-qa-skills-for-claude-code-from-spec-to-release-2d57</link>
      <guid>https://forem.com/kao273183/i-open-sourced-24-qa-skills-for-claude-code-from-spec-to-release-2d57</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — I just open-sourced &lt;strong&gt;QA Claude Skill&lt;/strong&gt; — 24 production-grade QA skills for &lt;a href="https://claude.ai/code" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt; covering test design, automation, performance, security, mutation testing, and more. MIT for non-commercial use. &lt;strong&gt;&lt;a href="https://github.com/kao273183/qa-claude-skill" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;For two years I've been iterating on a personal Claude Code workspace for QA work: bug reports, test plans, review checklists, regression matrices. It saved me hours every week.&lt;/p&gt;

&lt;p&gt;But every time a colleague asked "how do you write a test plan that fast?", handing over my workspace meant giving them dozens of files hard-coded with my JIRA project key, my Slack user ID, and my AWS bucket. Useless to anyone else.&lt;/p&gt;

&lt;p&gt;So I spent the last two weeks extracting &lt;strong&gt;24 skills&lt;/strong&gt; into a properly generalized, open-source repo. Drop in your team's IDs via &lt;code&gt;config.json&lt;/code&gt; and it works for any team, any stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's in the box
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;24 skills across 8 categories:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Skills&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Test Design&lt;/strong&gt; (8)&lt;/td&gt;
&lt;td&gt;test-master · flutter-test-master · test-review · regression-test · speckit-to-tc · tc-version-diff · sheet-md-sync · smoke-test-analyzer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Automation&lt;/strong&gt; (3)&lt;/td&gt;
&lt;td&gt;test-automation · flutter-test-automation · tc-to-pytest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Bug Management&lt;/strong&gt; (1)&lt;/td&gt;
&lt;td&gt;bug-report&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Quality Quantification&lt;/strong&gt; (2)&lt;/td&gt;
&lt;td&gt;mutation-testing · property-based-test-gen&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Reporting&lt;/strong&gt; (1)&lt;/td&gt;
&lt;td&gt;publish-regression&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Performance &amp;amp; Security&lt;/strong&gt; (3)&lt;/td&gt;
&lt;td&gt;performance-test-gen · security-scan · api-contract-test&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;CI Health&lt;/strong&gt; (2)&lt;/td&gt;
&lt;td&gt;visual-regression-gen · flaky-test-hunter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Quality Specialties&lt;/strong&gt; (4)&lt;/td&gt;
&lt;td&gt;a11y-audit · localization-test · push-notification-test · test-data-factory&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What it actually does
&lt;/h2&gt;

&lt;p&gt;Each skill activates on natural language triggers. Some examples:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. "I want to file a bug"
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;bug-report&lt;/code&gt; skill walks you through RIDER format (Reproduction / Impact / Device / Expected vs Actual / References), checks JIRA for duplicates, does root-cause analysis from git history, creates the ticket with the right priority, and sends a Slack DM — in one conversation.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. "Plan tests for this new feature"
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;test-master&lt;/code&gt; reads your JIRA ticket (or your description), scans both iOS and Android repos for affected modules, designs a test pyramid (70% Unit / 20% Integration / 10% UI), generates &lt;strong&gt;black-box + white-box test cases&lt;/strong&gt; in Google Sheets, identifies coverage gaps against existing tests, and builds an automation ROI roadmap.&lt;/p&gt;

&lt;p&gt;It also enforces a11y must-checks per UI feature (Dynamic Type / VoiceOver / contrast / touch targets) — no more "we forgot accessibility" at the end of the sprint.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. "Are my tests actually catching bugs?"
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;mutation-testing&lt;/code&gt; runs &lt;code&gt;mutmut&lt;/code&gt; on your Python backend. It makes small changes to the code, such as turning &lt;code&gt;&amp;lt;&lt;/code&gt; into &lt;code&gt;&amp;lt;=&lt;/code&gt;, flipping &lt;code&gt;True&lt;/code&gt; to &lt;code&gt;False&lt;/code&gt;, or altering a numeric literal, then re-runs your pytest suite. If the tests still pass against the broken code, that mutation &lt;strong&gt;survived&lt;/strong&gt;, which means your test cases are giving you fake coverage.&lt;/p&gt;

&lt;p&gt;Then &lt;code&gt;property-based-test-gen&lt;/code&gt; takes those surviving mutations and generates &lt;code&gt;hypothesis&lt;/code&gt; strategies that fuzz 200 inputs per test to close the gap.&lt;/p&gt;
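
&lt;p&gt;To make "survived" concrete, here is a minimal hand-rolled sketch. It is not what &lt;code&gt;mutmut&lt;/code&gt; does internally: the function &lt;code&gt;is_adult&lt;/code&gt;, its mutant, and the two tiny suites are all made up for illustration. The mutant swaps &lt;code&gt;&amp;gt;=&lt;/code&gt; for &lt;code&gt;&amp;gt;&lt;/code&gt;, the same kind of boundary change as the one described above.&lt;/p&gt;

```python
def is_adult(age: int) -> bool:
    """Original code under test (hypothetical example)."""
    return age >= 18


def is_adult_mutant(age: int) -> bool:
    """Hand-written mutant: the boundary operator >= was changed to >."""
    return age > 18


def weak_suite(fn) -> bool:
    """Only checks values far away from the boundary."""
    return fn(30) is True and fn(5) is False


def strong_suite(fn) -> bool:
    """Adds the boundary value, which is where the mutant differs."""
    return weak_suite(fn) and fn(18) is True


def mutant_survives(suite) -> bool:
    """A mutant survives when the suite passes on the original and on the mutant."""
    return suite(is_adult) and suite(is_adult_mutant)


if __name__ == "__main__":
    print("weak suite, mutant survives:", mutant_survives(weak_suite))
    print("strong suite, mutant survives:", mutant_survives(strong_suite))
```

&lt;p&gt;The weak suite passes on both versions, so the mutant survives. Adding the boundary value 18 is enough to kill it.&lt;/p&gt;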

&lt;h3&gt;
  
  
  4. "Which tests should run on every PR?"
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;smoke-test-analyzer&lt;/code&gt; scans your existing test suite (iOS XCUITest / Android Espresso / pytest), scores each test on 5 weighted criteria (criticality / speed / stability / independence / coverage value), and tiers them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;T0 PR Smoke&lt;/strong&gt; (&amp;lt; 3 min) — runs every PR&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;T1 Daily&lt;/strong&gt; (&amp;lt; 10 min) — runs nightly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;T2 Release&lt;/strong&gt; (&amp;lt; 60 min) — pre-release full regression&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;T3 Manual&lt;/strong&gt; — exploratory, visual, a11y&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then it generates &lt;code&gt;.xctestplan&lt;/code&gt; for iOS or Gradle filters for Android.&lt;/p&gt;
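
&lt;p&gt;The scoring and tiering idea can be sketched in a few lines. This is a rough illustration under my own assumptions, not the skill's actual logic: the weights, the &lt;code&gt;ScoredTest&lt;/code&gt; class, and the greedy selection in &lt;code&gt;pick_pr_smoke&lt;/code&gt; are hypothetical. Only the five criteria and the 3-minute T0 budget come from the description above.&lt;/p&gt;

```python
from dataclasses import dataclass

# Hypothetical weights for the five criteria; they sum to 1.0.
WEIGHTS = {
    "criticality": 0.35,
    "speed": 0.20,
    "stability": 0.20,
    "independence": 0.10,
    "coverage_value": 0.15,
}


@dataclass
class ScoredTest:
    name: str
    seconds: float   # measured runtime of the test
    ratings: dict    # criterion name mapped to a rating between 0 and 1


def score(test: ScoredTest) -> float:
    """Weighted sum of the five ratings."""
    return sum(WEIGHTS[k] * test.ratings[k] for k in WEIGHTS)


def pick_pr_smoke(tests: list, budget_seconds: float = 180.0) -> list:
    """Greedily take the highest-scoring tests while staying under the T0 budget."""
    chosen, used = [], 0.0
    for t in sorted(tests, key=score, reverse=True):
        if used + t.seconds >= budget_seconds:
            continue  # would reach the budget, leave it for a lower tier
        chosen.append(t.name)
        used += t.seconds
    return chosen


if __name__ == "__main__":
    def uniform(r):
        return {k: r for k in WEIGHTS}

    suite = [
        ScoredTest("login", 60, uniform(0.9)),
        ScoredTest("checkout", 100, uniform(0.8)),
        ScoredTest("full_sync", 150, uniform(0.7)),
        ScoredTest("profile", 15, uniform(0.5)),
    ]
    print(pick_pr_smoke(suite))
```

&lt;p&gt;A greedy pick is the simplest option and is easy to explain in a review. A real analyzer would probably also weigh test dependencies and flakiness history before it drops a slow but critical test from T0.&lt;/p&gt;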

&lt;h2&gt;
  
  
  Three modes for any tool stack
&lt;/h2&gt;

&lt;p&gt;Not every team has the same MCP servers installed. Same skills, three modes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;When to use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;full-mcp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;You have Atlassian + Slack + Google Workspace MCPs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;partial-mcp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Some MCPs missing — skills degrade gracefully&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;markdown-only&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Solo dev / no MCP / pure documentation flow&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;markdown-only&lt;/code&gt; mode is what makes this actually portable — every skill can still produce useful Markdown reports under &lt;code&gt;.claude/testing/&lt;/code&gt; without external dependencies. Solo developers can use the full suite without setting up anything.&lt;/p&gt;

&lt;h2&gt;
  
  
  6 ready-to-use presets
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp &lt;/span&gt;config/presets/full-stack.json     config/config.json   &lt;span class="c"&gt;# All MCPs&lt;/span&gt;
&lt;span class="nb"&gt;cp &lt;/span&gt;config/presets/jira-only.json      config/config.json   &lt;span class="c"&gt;# JIRA only&lt;/span&gt;
&lt;span class="nb"&gt;cp &lt;/span&gt;config/presets/markdown-only.json  config/config.json   &lt;span class="c"&gt;# Pure docs&lt;/span&gt;
&lt;span class="nb"&gt;cp &lt;/span&gt;config/presets/startup.json        config/config.json   &lt;span class="c"&gt;# Small startup&lt;/span&gt;
&lt;span class="nb"&gt;cp &lt;/span&gt;config/presets/enterprise.json     config/config.json   &lt;span class="c"&gt;# 5 team boards&lt;/span&gt;
&lt;span class="nb"&gt;cp &lt;/span&gt;config/presets/government.json     config/config.json   &lt;span class="c"&gt;# High-compliance&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why I made it bilingual + Simplified Chinese
&lt;/h2&gt;

&lt;p&gt;I'm Taiwanese, and most of the test-engineering content out there is English-first. So every skill ships with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;SKILL.md&lt;/code&gt; — Traditional Chinese (primary)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SKILL.en.md&lt;/code&gt; — English mirror&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;concept-zh.md&lt;/code&gt; — Beginner intros for unfamiliar concepts (mutation testing, property-based testing, spec-driven dev, test tiering)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The README is in English (primary), Traditional Chinese, and Simplified Chinese.&lt;/p&gt;

&lt;h2&gt;
  
  
  The license model
&lt;/h2&gt;

&lt;p&gt;I went with a &lt;strong&gt;dual license&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🟢 &lt;strong&gt;MIT&lt;/strong&gt; — Personal use / education / research / non-profits / 30-day evaluation / open-source contributions&lt;/li&gt;
&lt;li&gt;🔴 &lt;strong&gt;Commercial&lt;/strong&gt; — For-profit company internal use, paid products, SaaS, paid consulting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://github.com/kao273183/qa-claude-skill/blob/main/LICENSE-COMMERCIAL.md" rel="noopener noreferrer"&gt;See LICENSE-COMMERCIAL.md&lt;/a&gt; for how to obtain a commercial license. I'm doing this case-by-case via GitHub Issues — the goal isn't to monetize aggressively, but to leave space for sustainable enterprise support if it grows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/kao273183/qa-claude-skill.git
&lt;span class="nb"&gt;cd &lt;/span&gt;qa-claude-skill
&lt;span class="nb"&gt;cp &lt;/span&gt;config/config.example.json config/config.json   &lt;span class="c"&gt;# Edit your IDs&lt;/span&gt;
./install.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In Claude Code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Generate test plan for a user login feature
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;test-master&lt;/code&gt; skill activates and walks you through it. Or try:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;"I want to file a bug — the checkout crashes on Android"&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;"Review these test cases [Google Sheet URL]"&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;"Check if my tests actually catch bugs in src/auth/"&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Windows users — there's a PowerShell version (&lt;code&gt;install.ps1&lt;/code&gt;) as of v1.3.0.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's still missing
&lt;/h2&gt;

&lt;p&gt;This is v1.6.2. The roadmap still has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Japanese translation&lt;/li&gt;
&lt;li&gt;Web UI for editing config.json visually&lt;/li&gt;
&lt;li&gt;More skills (test-impact-analyzer, oauth-flow-test, websocket-realtime-test, llm-quality-eval...)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PRs welcome. The CONTRIBUTING.md has the template for adding a new skill.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/kao273183/qa-claude-skill" rel="noopener noreferrer"&gt;&lt;strong&gt;GitHub: kao273183/qa-claude-skill&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'd love to hear what skills are missing for your team's stack — drop an issue or comment below.&lt;/p&gt;

&lt;p&gt;If this saves your team time, you can &lt;a href="https://buymeacoffee.com/minikao" rel="noopener noreferrer"&gt;buy me a coffee&lt;/a&gt; ☕ — but a ⭐ on the repo helps more.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is a community / personal project for Claude Code users — NOT an official Anthropic product.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>qa</category>
      <category>testing</category>
      <category>opensource</category>
    </item>
    <item>
      <title>The 10% CAPTCHA problem in QA — and why your AI solver should refuse Google login</title>
      <dc:creator>MiniKao</dc:creator>
      <pubDate>Tue, 19 May 2026 08:34:00 +0000</pubDate>
      <link>https://forem.com/kao273183/the-10-captcha-problem-in-qa-and-why-your-ai-solver-should-refuse-google-login-3aoe</link>
      <guid>https://forem.com/kao273183/the-10-captcha-problem-in-qa-and-why-your-ai-solver-should-refuse-google-login-3aoe</guid>
      <description>&lt;h2&gt;
  
  
  The 10% that ruins QA day
&lt;/h2&gt;

&lt;p&gt;You've automated the login flow. Your Playwright suite hums along. Then a CAPTCHA shows up and the whole thing collapses.&lt;/p&gt;

&lt;p&gt;The honest answer from any QA engineer who's done this for more than six months is: &lt;strong&gt;stop trying to solve the CAPTCHA&lt;/strong&gt;. Configure the test environment so it never appears. Test mode keys. Backend bypass tokens. Feature flags. IP allowlist on staging. The list of "right ways" is long and almost all of them are boring.&lt;/p&gt;

&lt;p&gt;That works for ninety percent of testing. Then there's the remaining ten percent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A B2B integration test where the third party owns the CAPTCHA and won't change their config for you&lt;/li&gt;
&lt;li&gt;A client engagement with written authorization to test the production system, but no access to the backend&lt;/li&gt;
&lt;li&gt;A staging environment that intentionally mirrors prod CAPTCHA behavior to catch UX regressions&lt;/li&gt;
&lt;li&gt;A mobile WebView test that an IP allowlist doesn't reach&lt;/li&gt;
&lt;li&gt;An accessibility audit that needs to actually &lt;em&gt;see&lt;/em&gt; the challenge to test screen-reader behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For those, every shortcut violates someone's terms of service or your engagement contract. So we built &lt;code&gt;mk-qa-master&lt;/code&gt; v0.7.0: a pair of MCP tools that let an AI client read a reCAPTCHA v2 image grid and click the right tiles — but only after a consent gate, never against third-party login portals, and never retaining the screenshot beyond the active cycle.&lt;/p&gt;

&lt;p&gt;This post is about &lt;em&gt;why&lt;/em&gt; the safety design matters more than the AI magic.&lt;/p&gt;




&lt;h2&gt;
  
  
  The three-tier CAPTCHA strategy
&lt;/h2&gt;

&lt;p&gt;The strategy lives in the built-in QA knowledge layer (&lt;code&gt;get_qa_context(section="CAPTCHA")&lt;/code&gt;) so every test the AI generates respects the same hierarchy:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;When to use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1 — bypass&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;reCAPTCHA test keys, feature flags, IP allowlist, test-mode headers&lt;/td&gt;
&lt;td&gt;Default. Covers ~90% of cases.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2 — degrade&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mark as &lt;code&gt;external_dependency&lt;/code&gt;, skip downstream assertions&lt;/td&gt;
&lt;td&gt;When you can't change the backend but the test isn't about the CAPTCHA itself.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3 — AI visual judgment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;This feature.&lt;/td&gt;
&lt;td&gt;Only when 1 + 2 don't fit.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Tier 1 is the "boring" answer and it's right almost every time. Google publishes &lt;a href="https://developers.google.com/recaptcha/docs/faq#id-like-to-run-automated-tests-with-recaptcha.-what-should-i-do" rel="noopener noreferrer"&gt;test keys&lt;/a&gt; that always return success. Cloudflare Turnstile does the same. hCaptcha does the same. Your staging env can use them in seconds.&lt;/p&gt;

&lt;p&gt;Tier 2 is for when the CAPTCHA is on the way to what you're really testing — say, you want to verify the post-login dashboard, not the auth flow. Mark the auth step as &lt;code&gt;external_dependency&lt;/code&gt;, prove independently that the dashboard renders correctly with a seeded session, and you've decoupled the concern.&lt;/p&gt;

&lt;p&gt;Tier 3 is what this release is about. It's the last resort, and we designed it like one.&lt;/p&gt;




&lt;h2&gt;
  
  
  What v0.7.0 actually ships
&lt;/h2&gt;

&lt;p&gt;Two atomic MCP tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;inspect_visual_challenge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;confirm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c1"&gt;# Returns: screenshot of the challenge frame (base64),
&lt;/span&gt;  &lt;span class="c1"&gt;# challenge text, 3x3 or 4x4 tile grid metadata.
&lt;/span&gt;  &lt;span class="c1"&gt;# Refuses on forbidden domains.
&lt;/span&gt;  &lt;span class="c1"&gt;# Requires QA_VISUAL_CHALLENGE_CONSENT=true.
&lt;/span&gt;
&lt;span class="nf"&gt;solve_visual_challenge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;tile_indices&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;   &lt;span class="c1"&gt;# AI client's tile selection
&lt;/span&gt;    &lt;span class="n"&gt;confirm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c1"&gt;# Executes the click chain for the chosen tiles + Verify.
&lt;/span&gt;  &lt;span class="c1"&gt;# Returns: status (passed/failed), token, hint.
&lt;/span&gt;  &lt;span class="c1"&gt;# Same gates as inspect.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AI client (Claude, Gemini, GPT-4V, whichever) is the actual solver. &lt;code&gt;mk-qa-master&lt;/code&gt; is just &lt;strong&gt;eyes and hands&lt;/strong&gt;: it screenshots, it accepts a list of indices, it clicks. The intelligence about &lt;em&gt;which&lt;/em&gt; tiles contain a bicycle lives in the multimodal model.&lt;/p&gt;

&lt;p&gt;That separation matters: it means the QA tool doesn't ship a CAPTCHA-solving ML model, doesn't compete with services like 2Captcha, doesn't accumulate know-how about how to beat specific challenge types. It just enables an AI client that already has vision to do its job inside a Playwright session.&lt;/p&gt;




&lt;h2&gt;
  
  
  The safety design
&lt;/h2&gt;

&lt;p&gt;When you read the implementation, ~40% of the code is feature logic. The other 60% is restraint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consent gate.&lt;/strong&gt; Default off. Nothing happens until you set:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;QA_VISUAL_CHALLENGE_CONSENT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And every tool call requires &lt;code&gt;confirm=true&lt;/code&gt; on top of that. Two locks, deliberately.&lt;/p&gt;
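
&lt;p&gt;The two-lock idea is small enough to sketch. This is an illustration under assumptions, not the code that ships: the function name &lt;code&gt;consent_refusal&lt;/code&gt;, the refusal messages, and the case-insensitive comparison are hypothetical. Only the environment variable name and the &lt;code&gt;confirm&lt;/code&gt; flag come from the tool itself.&lt;/p&gt;

```python
import os
from typing import Mapping, Optional


def consent_refusal(confirm: bool,
                    env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return a refusal reason, or None when both locks are open."""
    env = os.environ if env is None else env
    # Lock 1: the operator opted in once, at configuration time.
    flag = env.get("QA_VISUAL_CHALLENGE_CONSENT", "").strip().lower()
    if flag != "true":
        return "set QA_VISUAL_CHALLENGE_CONSENT=true to enable this tool"
    # Lock 2: the caller opts in again on every single call.
    if not confirm:
        return "pass confirm=true on this call"
    return None


if __name__ == "__main__":
    print(consent_refusal(True, {}))
    print(consent_refusal(False, {"QA_VISUAL_CHALLENGE_CONSENT": "true"}))
    print(consent_refusal(True, {"QA_VISUAL_CHALLENGE_CONSENT": "true"}))
```

&lt;p&gt;Checking the environment flag first means a misconfigured host gets the same answer no matter what the AI client sends.&lt;/p&gt;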

&lt;p&gt;&lt;strong&gt;Per-call disclaimer.&lt;/strong&gt; The first call surfaces the acceptable-use text in the error message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ACCEPTABLE USE
This tool is intended for QA testing on:
- Sites you own
- Client sites where you have explicit written authorization
- Test environments where Tier 1 bypass is unavailable

DO NOT USE THIS TOOL ON:
- Third-party sites you do not own
- Production sites without explicit authorization
- Sites where automated access violates TOS or local law
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're the kind of engineer who'd skip a disclaimer, you'll see it three times before you can call this thing for real.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hard-stop domains.&lt;/strong&gt; Some places are refused regardless of consent flag:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_FORBIDDEN_DOMAINS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;frozenset&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accounts.google.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;login.microsoftonline.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id.apple.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;appleid.apple.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;facebook.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;login.live.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;login.yahoo.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;twitter.com/login&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x.com/login&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Third-party identity portals. There is no legitimate QA reason to script a CAPTCHA solver against someone else's login portal. The match is suffix-based on &lt;code&gt;host&lt;/code&gt;, so &lt;code&gt;accounts.google.com.evil&lt;/code&gt; does not accidentally pass.&lt;/p&gt;
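
&lt;p&gt;For readers who want to see what a suffix match on the host can look like, here is a minimal sketch. It is my reading of the rule, not the shipped implementation: &lt;code&gt;is_forbidden&lt;/code&gt; and &lt;code&gt;FORBIDDEN_HOSTS&lt;/code&gt; are hypothetical names, and the set is a host-only subset of the list above. Entries that carry a path, such as the &lt;code&gt;/login&lt;/code&gt; ones, would need a separate path check that this sketch does not attempt.&lt;/p&gt;

```python
from urllib.parse import urlsplit

# Host-only subset of the list above; entries with a path are left out
# because this sketch compares hostnames only.
FORBIDDEN_HOSTS = frozenset({
    "accounts.google.com",
    "login.microsoftonline.com",
    "appleid.apple.com",
    "facebook.com",
})


def is_forbidden(url: str) -> bool:
    """True when the URL's host is a listed host or a subdomain of one."""
    # hostname is None for URLs without a scheme, so fall back to "".
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    # Matching on a dot boundary keeps lookalike hosts from matching.
    return any(host == d or host.endswith("." + d) for d in FORBIDDEN_HOSTS)


if __name__ == "__main__":
    for url in (
        "https://accounts.google.com/signin",
        "https://m.facebook.com/login",
        "https://accounts.google.com.evil/signin",
        "https://notfacebook.com/",
    ):
        print(url, is_forbidden(url))
```

&lt;p&gt;Comparing on a dot boundary is the important detail: a plain substring or prefix test would treat &lt;code&gt;accounts.google.com.evil&lt;/code&gt; as Google, and a bare &lt;code&gt;endswith&lt;/code&gt; would treat &lt;code&gt;notfacebook.com&lt;/code&gt; as Facebook.&lt;/p&gt;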

&lt;p&gt;&lt;strong&gt;Optional authorized-domains allowlist.&lt;/strong&gt; For added discipline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;QA_VISUAL_CHALLENGE_AUTHORIZED_DOMAINS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;client-staging.example.com,internal-app.example.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When set, the tool refuses on any host that isn't on this list. Recommended for client engagements where you want a hard contract trail.&lt;/p&gt;
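
&lt;p&gt;A sketch of how such an allowlist can be read and applied, assuming a comma-separated value and exact host matches. Both helper names are hypothetical, and the real tool may normalize hosts differently.&lt;/p&gt;

```python
def parse_authorized(raw: str) -> frozenset:
    """Split the comma-separated value into a set of lowercase hosts."""
    return frozenset(h.strip().lower() for h in raw.split(",") if h.strip())


def host_allowed(host: str, raw_allowlist: str) -> bool:
    """An empty allowlist permits every host; otherwise only exact matches pass."""
    allowed = parse_authorized(raw_allowlist)
    return not allowed or host.strip().lower() in allowed


if __name__ == "__main__":
    raw = "client-staging.example.com, internal-app.example.com"
    print(host_allowed("client-staging.example.com", raw))
    print(host_allowed("other.example.com", raw))
    print(host_allowed("other.example.com", ""))
```

&lt;p&gt;Exact matching is the stricter choice here. If the allowlist accepted subdomains, one broad entry could quietly authorize hosts that were never part of the engagement.&lt;/p&gt;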

&lt;p&gt;&lt;strong&gt;Privacy.&lt;/strong&gt; Screenshots live only during the active &lt;code&gt;inspect → solve&lt;/code&gt; cycle. Telemetry logs the boolean outcome — never the screenshot, never the challenge text, never the tile selection. You don't accumulate a corpus of solved CAPTCHAs.&lt;/p&gt;




&lt;h2&gt;
  
  
  What a real session looks like
&lt;/h2&gt;

&lt;p&gt;Inside any MCP-compatible client (Claude Desktop, Cursor, Codex CLI, Gemini CLI, Cline...):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You: "Run the checkout suite. If a CAPTCHA blocks the test, resolve it
      so the rest of the flow can continue."

Claude:
  → run_tests()
  → ✗ failed at step 'click Checkout' — CAPTCHA modal detected

  → inspect_visual_challenge(confirm=true)
  → returns: screenshot + grid metadata + challenge text
            ("Select all images with traffic lights")

  → [Claude looks at the image, identifies tiles 0, 2, 5]

  → solve_visual_challenge(tile_indices=[0, 2, 5], confirm=true)
  → returns: { status: "passed", token: "...", hint: "CAPTCHA verified.
              Resume your test." }

  → run_failed()   # retry the failed step
  → ✓ checkout completes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The shape mirrors how MCP composes other capabilities — analyze, generate, run, advise. The AI orchestrates; the server just runs each step.&lt;/p&gt;




&lt;h2&gt;
  
  
  Scope, on purpose
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;v0.7.0&lt;/code&gt; covers &lt;strong&gt;reCAPTCHA v2 image-grid only&lt;/strong&gt;. That's deliberate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;reCAPTCHA v3&lt;/strong&gt; has no visible challenge — it's a behavioral risk score. There's nothing to inspect.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloudflare Turnstile&lt;/strong&gt; mostly runs invisibly. Same story.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;hCaptcha&lt;/strong&gt; lands in &lt;code&gt;v0.7.1&lt;/code&gt; once the same safety machinery is fully ported over to its tile layout.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Behavioral CAPTCHA&lt;/strong&gt; (mouse pattern, keystroke timing) is permanently out of scope. That's an anti-bot arms race we have no interest in feeding.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a feature designed to retire as the web does. When test keys become universally available and behavioral risk scoring takes over, this entire module should become unnecessary. We're fine with that.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why MCP
&lt;/h2&gt;

&lt;p&gt;A few people have asked why we packaged this as an MCP tool instead of a pytest fixture or a Playwright plugin. Two reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The intelligence lives in the AI client, not in the server.&lt;/strong&gt; MCP is the only protocol that makes that clean — the server exposes capabilities, the client (which already has vision and reasoning) decides how to use them. A pytest fixture would have to choose a vision provider, manage credentials, run inference. None of that is the test runner's job.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Composition with the rest of the QA loop.&lt;/strong&gt; &lt;code&gt;mk-qa-master&lt;/code&gt; already exposes &lt;code&gt;analyze_url&lt;/code&gt;, &lt;code&gt;generate_test&lt;/code&gt;, &lt;code&gt;run_tests&lt;/code&gt;, &lt;code&gt;get_optimization_plan&lt;/code&gt;. Putting the visual solver on the same MCP surface means the AI can chain it naturally: detect failure → inspect → solve → re-run. No glue code.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you want the longer pitch on why MCP is the right shape for QA tooling, the &lt;a href="https://github.com/kao273183/mcp-test-runner" rel="noopener noreferrer"&gt;README&lt;/a&gt; walks through it. Short version: AI clients should orchestrate testing the way a senior engineer would, and MCP is the cleanest way to give them the building blocks.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;mk-qa-master  &lt;span class="c"&gt;# or: uvx mk-qa-master&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In your MCP client config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mk-qa-master"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"uvx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"mk-qa-master"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"QA_RUNNER"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pytest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"QA_PROJECT_ROOT"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/path/to/your/tests"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"QA_VISUAL_CHALLENGE_CONSENT"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The repo includes &lt;a href="https://github.com/kao273183/mcp-test-runner/tree/main/examples/sample_captcha_fixture" rel="noopener noreferrer"&gt;&lt;code&gt;examples/sample_captcha_fixture/&lt;/code&gt;&lt;/a&gt; — a local HTML page wired up with Google's public reCAPTCHA test keys so you can verify the end-to-end inspect/solve loop without ever touching a real production CAPTCHA.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;v0.7.1&lt;/strong&gt; — hCaptcha support, same safety machinery&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.8.0&lt;/strong&gt; — &lt;code&gt;get_optimization_plan&lt;/code&gt; gains a "CAPTCHA pressure" metric that tells you when your suite is leaning too hard on Tier 3 and should be moved back to Tier 1&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Always&lt;/strong&gt; — no telemetry export of challenge content; no centralized solver model; no third-party identity portal support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you find a domain that should be on the hard-stop list and isn't, open an issue. If you find a use case where Tier 1 / Tier 2 &lt;em&gt;should&lt;/em&gt; work but the docs don't make it obvious, that's the higher-impact bug — the goal is for this feature to be used less, not more.&lt;/p&gt;




&lt;p&gt;If &lt;code&gt;mk-qa-master&lt;/code&gt; saved your QA flow, &lt;a href="https://www.buymeacoffee.com/minikao" rel="noopener noreferrer"&gt;a coffee keeps the late-night CAPTCHA debugging going&lt;/a&gt;. Star the repo, file an issue, send a Maestro flow that broke — they're all the same to me.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repo&lt;/strong&gt;: &lt;a href="https://github.com/kao273183/mcp-test-runner" rel="noopener noreferrer"&gt;&lt;code&gt;kao273183/mcp-test-runner&lt;/code&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;PyPI&lt;/strong&gt;: &lt;a href="https://pypi.org/project/mk-qa-master/" rel="noopener noreferrer"&gt;&lt;code&gt;mk-qa-master&lt;/code&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Glama&lt;/strong&gt;: &lt;a href="https://glama.ai/mcp/servers/kao273183/mcp-test-runner" rel="noopener noreferrer"&gt;glama.ai/mcp/servers/kao273183/mcp-test-runner&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Landing page&lt;/strong&gt;: &lt;a href="https://mcp.chenjundigital.com" rel="noopener noreferrer"&gt;mcp.chenjundigital.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>testing</category>
      <category>mcp</category>
      <category>python</category>
      <category>automation</category>
    </item>
    <item>
      <title>Claude can drive Schemathesis + Postman through one MCP — I shipped both runners in one day</title>
      <dc:creator>MiniKao</dc:creator>
      <pubDate>Sun, 17 May 2026 14:18:06 +0000</pubDate>
      <link>https://forem.com/kao273183/claude-can-drive-schemathesis-postman-through-one-mcp-i-shipped-both-runners-in-one-day-4m4</link>
      <guid>https://forem.com/kao273183/claude-can-drive-schemathesis-postman-through-one-mcp-i-shipped-both-runners-in-one-day-4m4</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — Today I shipped &lt;a href="https://pypi.org/project/mk-qa-master/" rel="noopener noreferrer"&gt;&lt;code&gt;mk-qa-master&lt;/code&gt;&lt;/a&gt; &lt;strong&gt;v0.6.0 (Schemathesis)&lt;/strong&gt; in the morning and &lt;strong&gt;v0.6.1 (Newman / Postman)&lt;/strong&gt; in the afternoon. Same MCP tool surface (still 16 tools), same &lt;code&gt;report.json&lt;/code&gt; / history / flake / coach pipeline, two new ways to drive API tests from Claude / Cursor / Codex. Total code: ~300 lines across two runners. Total elapsed: about 6 hours. This post is the architecture story.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;I'm a QA engineer building &lt;a href="https://mcp.chenjundigital.com" rel="noopener noreferrer"&gt;&lt;code&gt;mk-*&lt;/code&gt;&lt;/a&gt;, an open-source family of MCP servers for the AI dev pipeline. Last week I shipped v0.5.1 of mk-qa-master with five runners — pytest / Jest / Cypress / Go test / Maestro for mobile.&lt;/p&gt;

&lt;p&gt;Two days ago, while updating the family-site copy, I added a line that said &lt;em&gt;"mk-qa-master tests web + mobile + API"&lt;/em&gt;. The first two were honest. The third was a stretch — yes, your existing pytest-with-&lt;code&gt;httpx&lt;/code&gt; tests would run, but there was no &lt;strong&gt;dedicated API runner&lt;/strong&gt;. A QA reader could install it expecting OpenAPI ingestion or Postman support, and find neither.&lt;/p&gt;

&lt;p&gt;I had two options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Walk the marketing copy back to "we drive web + mobile, your existing API tests ride along"&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Make the copy true&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I picked option 2. Two API runners, same day.&lt;/p&gt;

&lt;p&gt;This post is how that played out.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why MCP makes "ship two runners in one day" plausible
&lt;/h2&gt;

&lt;p&gt;The mk-qa-master architecture has a runner abstraction that already shipped with five frameworks. Each runner is a Python class implementing the same interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TestRunner&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;list_tests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_tests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_failed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_report_summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Whatever framework the runner wraps, the MCP tool surface is the same 16 tools. The AI client (Claude, Cursor, Codex, Gemini) calls &lt;code&gt;run_tests&lt;/code&gt; / &lt;code&gt;get_optimization_plan&lt;/code&gt; / &lt;code&gt;get_failure_details&lt;/code&gt; the same way regardless of whether you're testing a React app, a Go service, an iOS Simulator, or an API. The runner translates.&lt;/p&gt;

&lt;p&gt;Adding a new runner = ~150 lines of Python + an entry in &lt;code&gt;REGISTRY&lt;/code&gt; + a sample project + a version bump. That's it.&lt;/p&gt;
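&lt;p&gt;A minimal sketch of that registration step, assuming &lt;code&gt;REGISTRY&lt;/code&gt; is a plain dict keyed by runner name and the active runner is chosen from &lt;code&gt;QA_RUNNER&lt;/code&gt;. &lt;code&gt;EchoRunner&lt;/code&gt;, &lt;code&gt;get_runner&lt;/code&gt;, and the dict shape are hypothetical names for illustration, not the project's actual code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os


class TestRunner:
    """Interface every runner implements (method names taken from the post)."""
    name = "base"

    def list_tests(self):
        raise NotImplementedError

    def run_tests(self, filter=None, **kwargs):
        raise NotImplementedError


class EchoRunner(TestRunner):
    """Hypothetical runner that reports two hard-coded tests as passed."""
    name = "echo"

    def list_tests(self):
        return "test_a\ntest_b"

    def run_tests(self, filter=None, **kwargs):
        tests = [t for t in self.list_tests().splitlines()
                 if filter is None or filter in t]
        return {"passed": len(tests), "failed": 0, "tests": tests}


# Assumed registry shape: runner name mapped to runner class.
REGISTRY = {EchoRunner.name: EchoRunner}


def get_runner(env=None):
    """Look up the runner named by QA_RUNNER; fail loudly on unknown names."""
    env = os.environ if env is None else env
    runner_name = env.get("QA_RUNNER", "echo")
    if runner_name not in REGISTRY:
        raise KeyError("unknown runner: " + runner_name)
    return REGISTRY[runner_name]()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In this sketch the tool layer only ever calls &lt;code&gt;get_runner().run_tests(...)&lt;/code&gt;, so a new framework needs one class and one dict entry.&lt;/p&gt;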

&lt;p&gt;This is the MCP-level value claim: &lt;strong&gt;the AI doesn't relearn your stack. You add a runner; the AI's tool surface inherits the new capability automatically.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So shipping API testing was less "design a new product" and more "fill in the runner slot the abstraction was waiting for."&lt;/p&gt;




&lt;h2&gt;
  
  
  v0.6.0 — Schemathesis (OpenAPI / Swagger)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://schemathesis.readthedocs.io/" rel="noopener noreferrer"&gt;Schemathesis&lt;/a&gt; reads an OpenAPI 3.x or Swagger 2.0 schema and fuzzes every operation with property-based tests that check response schema conformance, status-code conformance, and server errors. Hand it a URL or file path, and it reports results in 30–60 seconds.&lt;/p&gt;

&lt;p&gt;The runner wraps the &lt;code&gt;schemathesis run&lt;/code&gt; CLI. User-facing config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mk-qa-master"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"uvx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"mk-qa-master[api]"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"QA_RUNNER"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"schemathesis"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"QA_OPENAPI_URL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://api.example.com/openapi.json"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Restart your client. Then in any session:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Test the API at &lt;a href="https://api.example.com/openapi.json" rel="noopener noreferrer"&gt;https://api.example.com/openapi.json&lt;/a&gt; — find anything broken, then give me a prioritized action plan."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Real-feeling session transcript:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;you ▸ Test https://api.example.com/openapi.json — find anything broken
       and give me a prioritized action plan.

  → get_runner_info ✓ schemathesis · OpenAPI 3.0.3 detected
  → list_tests ✓ 24 endpoints × 5 checks = 120 cases
  → run_tests ⚠ 112 passed, 6 failed, 2 errored (47s)
  → get_optimization_plan ✓ next priorities:

      🔴 broken  · POST /users :: status_code_conformance
        Same Schemathesis signature × 3 → "status 500, expected 201|400"
        Action: the schema doesn't document a 500 here; either fix the
        validation bug or add 500 to the schema's responses block

      🔴 broken  · GET /search :: not_a_server_error
        Crashes under `?q=null` and `?limit=-1`
        Action: missing input validation on the search handler

      🟡 warn    · DELETE /users/{id} returned 204 when schema says 200
        Likely safe to update the schema; verify with PM

      🟢 stable  · 18 endpoints, no findings
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The advisor's classification is the same logic the suite uses for UI tests — 3 consecutive failures with the same error signature = &lt;code&gt;broken&lt;/code&gt;. A test that's red-green-red across runs = &lt;code&gt;flaky&lt;/code&gt;. mk-qa-master doesn't differentiate "the API is broken" from "the UI is broken" — same flake-score, same broken classification, same advisor.&lt;/p&gt;

&lt;p&gt;That's the abstraction paying off.&lt;/p&gt;
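&lt;p&gt;The classification rule described above can be sketched in a few lines. This is a simplified illustration, not the advisor's real code: the history entry shape, the &lt;code&gt;classify&lt;/code&gt; name, and the &lt;code&gt;unclassified&lt;/code&gt; catch-all are assumptions, and "red-green-red" is checked only as three adjacent runs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def classify(history):
    """Classify one test from its run history, oldest run first.

    Each entry is a dict with "outcome" ("passed" or "failed") and, for
    failures, an error "signature". The dict shape is an assumption.
    """
    recent = history[-3:]
    all_failed = len(recent) == 3 and all(r["outcome"] == "failed" for r in recent)
    if all_failed and len({r["signature"] for r in recent}) == 1:
        return "broken"   # 3 consecutive failures, same error signature
    outcomes = "".join("F" if r["outcome"] == "failed" else "P" for r in history)
    if "FPF" in outcomes:
        return "flaky"    # red-green-red somewhere in the history
    if "F" not in outcomes:
        return "stable"   # no failures at all
    return "unclassified"  # hypothetical catch-all for everything else
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Nothing in the function knows whether the nodeid came from a UI test or an API check, which is why the same rule can serve every runner.&lt;/p&gt;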




&lt;h2&gt;
  
  
  The one CLI-flag mistake that cost me 20 minutes
&lt;/h2&gt;

&lt;p&gt;Here's the part that was &lt;em&gt;not&lt;/em&gt; smooth.&lt;/p&gt;

&lt;p&gt;The PRD I wrote in the morning said the runner would invoke schemathesis like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ⚠ the --report-json flag below does not exist&lt;/span&gt;
schemathesis run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--checks&lt;/span&gt; all &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--report-json&lt;/span&gt; /tmp/report.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--junit-xml&lt;/span&gt; /tmp/junit.xml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--hypothesis-database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;none &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The subagent implementing the runner followed the spec faithfully. CI choked instantly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error: No such option '--report-json'. Did you mean '--report'?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Schemathesis 3.x has &lt;strong&gt;no JSON-report flag&lt;/strong&gt;. The PRD assumed one based on... I'm not sure what. Maybe an older version, maybe wishful thinking, maybe just a hallucination in my own design doc.&lt;/p&gt;

&lt;p&gt;Fix: rewrite &lt;code&gt;_normalize_report&lt;/code&gt; to parse &lt;code&gt;--junit-xml&lt;/code&gt; output instead — JUnit XML is stdlib-parseable (&lt;code&gt;xml.etree.ElementTree&lt;/code&gt;) and standard across every test runner I've ever touched. Took 20 minutes.&lt;/p&gt;
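&lt;p&gt;For a sense of what JUnit-based normalization involves, here is a minimal sketch using only the standard library. It is not the project's &lt;code&gt;_normalize_report&lt;/code&gt;: the function name, the output dict shape, and the nodeid format are assumptions for illustration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import xml.etree.ElementTree as ET

# Child element of a testcase mapped to the outcome it signals.
OUTCOMES = {"failure": "failed", "error": "errored", "skipped": "skipped"}


def normalize_junit(xml_text):
    """Turn JUnit XML text into a list of {nodeid, outcome, message} dicts."""
    results = []
    root = ET.fromstring(xml_text)
    # iter() finds testcase elements at any depth, so this works whether the
    # root is a testsuites wrapper or a single testsuite.
    for case in root.iter("testcase"):
        outcome, message = "passed", ""
        for tag, label in OUTCOMES.items():
            node = case.find(tag)
            if node is not None:
                outcome = label
                message = node.get("message", "")
                break
        nodeid = case.get("classname", "") + " :: " + case.get("name", "")
        results.append({"nodeid": nodeid, "outcome": outcome, "message": message})
    return results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A test case with no &lt;code&gt;failure&lt;/code&gt;, &lt;code&gt;error&lt;/code&gt;, or &lt;code&gt;skipped&lt;/code&gt; child counts as passed, which is the usual JUnit convention.&lt;/p&gt;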

&lt;p&gt;&lt;strong&gt;Lesson&lt;/strong&gt;: when writing a PRD that hardcodes CLI flags, run &lt;code&gt;&amp;lt;tool&amp;gt; --help&lt;/code&gt; on the actual installed version before committing. The spec is only worth what the underlying tool actually supports.&lt;/p&gt;

&lt;p&gt;I'll be repeating this to myself for v0.7.&lt;/p&gt;




&lt;h2&gt;
  
  
  v0.6.1 — Newman (Postman collections)
&lt;/h2&gt;

&lt;p&gt;After lunch I shipped the second runner.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://learning.postman.com/docs/collections/using-newman-cli/command-line-integration-with-newman/" rel="noopener noreferrer"&gt;Newman&lt;/a&gt; is the official CLI for running Postman collections. Postman has ~30M users; a huge chunk of them have collections in version control already. Newman + that collection JSON = headless replay of every request and &lt;code&gt;pm.test(...)&lt;/code&gt; assertion.&lt;/p&gt;

&lt;p&gt;Runner shape, same as Schemathesis but for Postman:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mk-qa-master"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"uvx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"mk-qa-master"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"QA_RUNNER"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"newman"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"QA_POSTMAN_COLLECTION"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/path/to/your-api.postman_collection.json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"QA_POSTMAN_ENVIRONMENT"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/path/to/staging.postman_environment.json"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Newman is npm-side, not pip-side, so it's a &lt;strong&gt;system prerequisite&lt;/strong&gt; rather than a Python optional dep:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; newman
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This was a small choice that took 2 minutes to settle: do you bundle Newman into the Python optional dep group somehow? You can't — &lt;code&gt;pyproject.toml&lt;/code&gt; only knows about Python. So Newman gets the &lt;code&gt;npm install -g&lt;/code&gt; treatment, the runner does &lt;code&gt;shutil.which("newman")&lt;/code&gt;, and if it's missing the user sees a clear &lt;code&gt;ImportError&lt;/code&gt; pointing at the install command.&lt;/p&gt;
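&lt;p&gt;A minimal sketch of that prerequisite check. &lt;code&gt;shutil.which&lt;/code&gt; is the real standard-library call; the &lt;code&gt;require_binary&lt;/code&gt; helper and the message wording are hypothetical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import shutil


def require_binary(name, install_hint):
    """Return the absolute path of a CLI tool, or raise with an install hint."""
    path = shutil.which(name)
    if path is None:
        # The post says the user sees an ImportError; this wording is illustrative.
        raise ImportError(name + " not found on PATH. Install it with: " + install_hint)
    return path
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Calling &lt;code&gt;require_binary("newman", "npm install -g newman")&lt;/code&gt; before the first run surfaces the missing prerequisite up front instead of as a subprocess failure later.&lt;/p&gt;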

&lt;p&gt;The runner translates Newman's JSON report (&lt;code&gt;run.executions[]&lt;/code&gt; + &lt;code&gt;run.failures[]&lt;/code&gt;) into mk-qa-master's &lt;code&gt;report.json&lt;/code&gt; shape. One nodeid per assertion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET {{baseUrl}}/books :: Books :: List books
POST {{baseUrl}}/books :: Books :: Create book
GET {{baseUrl}}/books/{{bookId}} :: Books :: Get book by id
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
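&lt;p&gt;The per-assertion flattening can be sketched as below. Treat it as an illustration under assumptions: the field names (&lt;code&gt;item.name&lt;/code&gt;, &lt;code&gt;assertions[].assertion&lt;/code&gt;, &lt;code&gt;assertions[].error.message&lt;/code&gt;) reflect my reading of Newman's JSON reporter output and should be checked against a real report, and the nodeid format here is simpler than the one shown above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def newman_to_results(report):
    """Flatten a Newman JSON report into one result per pm.test assertion."""
    results = []
    for execution in report["run"]["executions"]:
        request_name = execution["item"]["name"]
        for assertion in execution.get("assertions", []):
            # An assertion entry carries an "error" object only when it failed.
            error = assertion.get("error")
            results.append({
                "nodeid": request_name + " :: " + assertion["assertion"],
                "outcome": "failed" if error else "passed",
                "message": error["message"] if error else "",
            })
    return results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping one entry per assertion, rather than one per request, is what lets the flake logic single out a specific unstable check.&lt;/p&gt;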



&lt;p&gt;Same history / flake / coach pipeline as before.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No CLI-flag mistake this time&lt;/strong&gt; — I ran &lt;code&gt;newman run --help&lt;/code&gt; first, sketched the flag list, then started implementation. Lesson learned from the morning.&lt;/p&gt;




&lt;h2&gt;
  
  
  Schemathesis vs Newman — when to use which
&lt;/h2&gt;

&lt;p&gt;I get asked this every time I show the two runners. Here's the call I make:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;You have…&lt;/th&gt;
&lt;th&gt;Use…&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;An OpenAPI 3.x / Swagger 2.0 schema and you want generated tests across the whole surface&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Schemathesis&lt;/strong&gt; — fuzz-driven, finds bugs you didn't think to write tests for&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A Postman collection your team already curates by hand&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Newman&lt;/strong&gt; — re-uses your existing investment, runs the assertions you already wrote&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Both (a schema for breadth + a collection for happy paths)&lt;/td&gt;
&lt;td&gt;Run both in the same session — Schemathesis catches schema drift, Newman catches business-logic regressions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Neither, but you have pytest tests hitting your API&lt;/td&gt;
&lt;td&gt;Stay on &lt;code&gt;QA_RUNNER=pytest&lt;/code&gt;, no migration needed — your existing tests already ride the same pipeline&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The point of having both isn't to replace either ecosystem. It's that the &lt;strong&gt;AI doesn't need to know which one is active&lt;/strong&gt;. From Claude's perspective, &lt;code&gt;run_tests&lt;/code&gt; returns the same shape. The runner does the translation.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;Things I'd change on a redo:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Run &lt;code&gt;--help&lt;/code&gt; first&lt;/strong&gt; on every CLI before writing the PRD. (See above.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single PRD covering Phase 1 + Phase 2&lt;/strong&gt; instead of writing Phase 2 ratification as an appendix. Mid-sized features deserve a single design doc, not a doc + amendment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bundle the sample Postman collection with a Prism mock script&lt;/strong&gt; so users can &lt;code&gt;prism mock openapi.yaml &amp;amp;&lt;/code&gt; and immediately have something live to point Newman at. Right now the sample is correct but a bit lonely until the user provides a target.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Things I'd keep:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Optional deps for Python-side, system prereq for npm-side&lt;/strong&gt;. Forcing schemathesis onto every install would bloat the base package. Forcing newman as a pip dep doesn't even work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--junit-xml&lt;/code&gt; as the normalization source&lt;/strong&gt; for Schemathesis. Standard format, stdlib parseable, future-proof.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-assertion nodeids for Newman, per-check nodeids for Schemathesis&lt;/strong&gt;. Finer granularity than "this endpoint passed" — the flake-score logic needs to know which assertion within an endpoint is unstable.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Quick start
&lt;/h2&gt;

&lt;p&gt;If you want to try it right now:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Schemathesis path (OpenAPI / Swagger)&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s1"&gt;'mk-qa-master[api]'&lt;/span&gt;

&lt;span class="c"&gt;# Newman path (Postman)&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; newman
pip &lt;span class="nb"&gt;install &lt;/span&gt;mk-qa-master
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then drop the snippet from above into your Claude Desktop / Claude Code / Cursor / Codex config. Restart your client. Ask Claude to test your API. That's the whole UX.&lt;/p&gt;

&lt;p&gt;The bundled sample at &lt;code&gt;examples/sample_api_project/&lt;/code&gt; has both an &lt;code&gt;openapi.yaml&lt;/code&gt; and a &lt;code&gt;postman-collection.json&lt;/code&gt; for the same fictional Library API — same 3 endpoints, two different runner paths, identical AI-side workflow. Drop a mock server (Prism, Mockoon, whatever) in front and you can dogfood the whole loop in ~5 minutes.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;v0.7.0&lt;/strong&gt; is planned to add Pact provider verification + an &lt;code&gt;analyze_api&lt;/code&gt; tool (OpenAPI introspection → candidate test scenarios). Whether it ships depends on whether v0.6.0 / v0.6.1 produce real adoption signal. If 6 weeks from now nobody's filed an issue about Pact, I'll skip it and focus on something the community is actually asking for.&lt;/p&gt;

&lt;p&gt;This is the discipline I'm trying to learn — ship two runners on the same day the architecture allows it; &lt;em&gt;don't&lt;/em&gt; speculate a third just because the abstraction would still hold.&lt;/p&gt;




&lt;h2&gt;
  
  
  Family
&lt;/h2&gt;

&lt;p&gt;mk-qa-master is one of three open-source MCP servers I'm building:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/kao273183/mk-plan-master" rel="noopener noreferrer"&gt;mk-plan-master&lt;/a&gt;&lt;/strong&gt; — idea triage + RICE scoring + spec-draft bridge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/kao273183/mk-spec-master" rel="noopener noreferrer"&gt;mk-spec-master&lt;/a&gt;&lt;/strong&gt; — specs → scenarios + coverage matrix&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/kao273183/mk-qa-master" rel="noopener noreferrer"&gt;mk-qa-master&lt;/a&gt;&lt;/strong&gt; (this) — drives the test runner across web / mobile / API&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together they form: &lt;code&gt;Idea → Plan → Spec → Code (your IDE) → Test → Coverage → Coach&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Family site: &lt;a href="https://mcp.chenjundigital.com" rel="noopener noreferrer"&gt;mcp.chenjundigital.com&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;If your team is QA-heavy and you've been frustrated by AI tools that either write &lt;code&gt;# TODO&lt;/code&gt; for API tests or charge $50k/year to run them — give the v0.6 line a try. If you find anything weird, the issue tracker is the right place.&lt;/p&gt;

&lt;p&gt;A star helps the algorithm find people like you. Feedback helps more.&lt;/p&gt;

&lt;p&gt;— Jack Kao, building solo.&lt;/p&gt;

</description>
      <category>api</category>
      <category>testing</category>
      <category>mcp</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I'm a QA engineer. After Claude wrote # TODO in my 100th test, I built an MCP server.</title>
      <dc:creator>MiniKao</dc:creator>
      <pubDate>Sat, 16 May 2026 19:32:44 +0000</pubDate>
      <link>https://forem.com/kao273183/im-a-qa-engineer-after-claude-wrote-todo-in-my-100th-test-i-built-an-mcp-server-3c4l</link>
      <guid>https://forem.com/kao273183/im-a-qa-engineer-after-claude-wrote-todo-in-my-100th-test-i-built-an-mcp-server-3c4l</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — &lt;a href="https://github.com/kao273183/mk-qa-master" rel="noopener noreferrer"&gt;mk-qa-master&lt;/a&gt; is an open-source MCP server that lets Claude / Cursor / Codex / Gemini &lt;strong&gt;drive your real test suite&lt;/strong&gt; — pytest, Jest, Cypress, Go test, and Maestro for mobile. 16 tools, 5 categories, a three-layer QA knowledge architecture. &lt;code&gt;uvx&lt;/code&gt;-installable. MIT.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The moment I stopped blaming the model
&lt;/h2&gt;

&lt;p&gt;The 5th time Claude wrote &lt;code&gt;# TODO: add real selector here&lt;/code&gt; in a generated test, I tried a smarter prompt. The 20th time, I switched models. The 100th time, I stopped blaming the LLM.&lt;/p&gt;

&lt;p&gt;I'm a QA engineer. I've watched LLMs write beautiful-looking test scaffolds for two years now, and every one of them collapses at the same place:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The model can &lt;strong&gt;read your code&lt;/strong&gt;. It cannot see your &lt;strong&gt;live DOM&lt;/strong&gt;, your &lt;strong&gt;mobile view hierarchy&lt;/strong&gt;, your &lt;strong&gt;last 10 test runs&lt;/strong&gt;, or that &lt;code&gt;checkout-flow.spec.ts&lt;/code&gt; has been red 7 times in 14 days.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So it guesses. Guesses are how you get &lt;code&gt;# TODO&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The fix isn't a smarter prompt. It's giving the LLM &lt;strong&gt;access&lt;/strong&gt; to the things it's currently guessing about.&lt;/p&gt;

&lt;p&gt;That's what the Model Context Protocol (MCP) is for. And that's why I built &lt;strong&gt;mk-qa-master&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What "AI for QA" usually means
&lt;/h2&gt;

&lt;p&gt;Most AI-for-testing products today fall into one of three buckets:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;IDE plugins that emit test files&lt;/strong&gt; — Copilot Tests, Cursor's test generator. Great in a screenshot. They write the file, &lt;em&gt;you&lt;/em&gt; fix the selectors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Just prompt ChatGPT" tutorials&lt;/strong&gt; — works for one test, falls apart at ten. No persistence, no awareness of what's actually flaky, no runtime feedback.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;End-to-end AI testing SaaS&lt;/strong&gt; — record-and-playback wrappers. They own your test infrastructure, charge per seat, and you're locked in.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What's missing from all three: &lt;strong&gt;the AI never touches the runner&lt;/strong&gt;. It writes code; you run; you debug; you tell the AI what broke. It's a chatbot pretending to be an engineer.&lt;/p&gt;

&lt;p&gt;The reframe: stop asking AI to &lt;em&gt;write&lt;/em&gt; tests. &lt;strong&gt;Make it drive your test runner.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What MCP changes
&lt;/h2&gt;

&lt;p&gt;MCP (introduced by Anthropic in late 2024, now adopted by Cursor, Codex CLI, Gemini CLI, Zed, Cline and others) lets an AI client call &lt;strong&gt;tools&lt;/strong&gt; — not just see text, but trigger actions, read structured responses, chain them.&lt;/p&gt;

&lt;p&gt;An MCP server is just a process that exposes tools. Drop it into your client config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mk-qa-master"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"uvx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"mk-qa-master"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"QA_RUNNER"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pytest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"QA_PROJECT_ROOT"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/path/to/your/project"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;…and now Claude has 16 new things it can do in your project: probe the DOM of a live URL, list your existing tests, generate new ones with &lt;strong&gt;real selectors&lt;/strong&gt;, run them, read JUnit XML, write an optimization plan based on the last N runs.&lt;/p&gt;

&lt;p&gt;Your runner just became part of the AI's tool surface.&lt;/p&gt;




&lt;h2&gt;
  
  
  mk-qa-master in 60 seconds
&lt;/h2&gt;

&lt;p&gt;16 tools across 5 categories. You don't need to memorize names; the README has a cookbook of natural-language prompts that map to each chain.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Tools&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Discover&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;get_runner_info&lt;/code&gt; · &lt;code&gt;list_tests&lt;/code&gt; · &lt;code&gt;analyze_url&lt;/code&gt; · &lt;code&gt;analyze_screen&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Which framework is active. What tests exist. Probe a URL or a live mobile screen for form / nav / CTA modules with real selectors.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Generate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;generate_test&lt;/code&gt; · &lt;code&gt;auto_generate_tests&lt;/code&gt; · &lt;code&gt;codegen&lt;/code&gt; · &lt;code&gt;init_qa_knowledge&lt;/code&gt; · &lt;code&gt;get_qa_context&lt;/code&gt;
&lt;/td&gt;
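&lt;td&gt;Emit runnable pytest &lt;code&gt;.py&lt;/code&gt; or Maestro &lt;code&gt;.yaml&lt;/code&gt;, not &lt;code&gt;# TODO&lt;/code&gt; placeholders. &lt;code&gt;init_qa_knowledge&lt;/code&gt; scaffolds the project's &lt;code&gt;qa-knowledge.md&lt;/code&gt;.&lt;/td&gt;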
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Run&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;run_tests&lt;/code&gt; · &lt;code&gt;run_failed&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Drive pytest / Jest / Cypress / Go test / Maestro. Auto-retry, JUnit XML, screenshots, Playwright &lt;code&gt;trace.zip&lt;/code&gt;, Maestro recordings.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Report&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;get_test_report&lt;/code&gt; · &lt;code&gt;get_failure_details&lt;/code&gt; · &lt;code&gt;generate_html_report&lt;/code&gt; · &lt;code&gt;get_test_history&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Outcome history, error signatures, per-test flake scores.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Advise&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;get_optimization_plan&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Three lenses: suite quality (flaky vs broken vs slow), MCP usability, AI effectiveness. Output is a ranked action list — &lt;em&gt;what to fix next, with evidence.&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Switch frameworks with a single env var: &lt;code&gt;QA_RUNNER=pytest | jest | cypress | go | maestro&lt;/code&gt;. Web and mobile share the same MCP surface — &lt;code&gt;analyze_screen&lt;/code&gt; works on iOS Simulator, Android Emulator, real devices, and (yes) BlueStacks via &lt;code&gt;adb connect&lt;/code&gt;.&lt;/p&gt;
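
&lt;p&gt;As a minimal sketch of that switch, only the &lt;code&gt;env&lt;/code&gt; part of the server entry changes. The fragment below assumes a Maestro project; the project path is a placeholder, and the rest of the entry stays exactly as it is for pytest.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;"env": {
  "QA_RUNNER": "maestro",
  "QA_PROJECT_ROOT": "/path/to/your/mobile/project"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;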




&lt;h2&gt;
  
  
  The part nobody else builds: a three-layer QA knowledge architecture
&lt;/h2&gt;

&lt;p&gt;This is what separates mk-qa-master from monkey testing.&lt;/p&gt;

&lt;p&gt;A DOM-only analyzer produces "empty field should error" for every form on the internet. That's not testing, it's noise. To produce a test that means anything, the generator needs &lt;strong&gt;domain context&lt;/strong&gt;. So I layered three sources of it:&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1 — Built-in
&lt;/h3&gt;

&lt;p&gt;ISTQB's seven principles, equivalence partitioning, decision tables, state transitions, the test pyramid, shift-left, mobile testing checklists, QA metrics — baked into the server. The AI gets methodology by default, not by accident.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2 — Your project's &lt;code&gt;qa-knowledge.md&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Drop a file at your project root with your business rules, historical bugs, standard assertion copy, user-journey snippets, technical constraints. &lt;code&gt;init_qa_knowledge&lt;/code&gt; scaffolds one. The MCP loads it on every relevant tool call. &lt;strong&gt;This is where the "AI doesn't know my business" problem actually gets solved.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3 — Per-test inline
&lt;/h3&gt;

&lt;p&gt;Pass a &lt;code&gt;business_context&lt;/code&gt; slice into &lt;code&gt;generate_test&lt;/code&gt;. It gets printed as a &lt;code&gt;# Business context:&lt;/code&gt; block inside the generated test, so the next reviewer sees &lt;em&gt;why&lt;/em&gt; this test exists without leaving the file.&lt;/p&gt;
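
&lt;p&gt;Your AI client builds the tool call for you, but it can help to see roughly what goes over the wire. The sketch below uses the standard MCP &lt;code&gt;tools/call&lt;/code&gt; request shape. Only the &lt;code&gt;business_context&lt;/code&gt; argument name comes from this post; passing it as a plain string, the example rule text, and the request &lt;code&gt;id&lt;/code&gt; are assumptions, and any other arguments &lt;code&gt;generate_test&lt;/code&gt; expects are left out.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "generate_test",
    "arguments": {
      "business_context": "Coupons never stack; a cart accepts at most one discount code."
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With a call like this, the rule text is what would end up in the &lt;code&gt;# Business context:&lt;/code&gt; block of the generated test.&lt;/p&gt;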

&lt;p&gt;Three layers of context. One MCP. Pile them up and the AI stops producing "click the button, see something happen" garbage.&lt;/p&gt;




&lt;h2&gt;
  
  
  A real session
&lt;/h2&gt;

&lt;p&gt;Here's what a Monday morning with this looks like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;you ▸ Test https://your-site/login — one runnable case per module

  → analyze_url ✓ 4 modules · 12 endpoints · 18 candidate cases
  → generate_test ✓ tests/test_login.py (4 cases)
  → run_tests ⚠ 3 passed, 1 failed
  → get_optimization_plan ✓ next priorities:
      🔴 broken  · checkout-coupon-rule (same signature × 3 runs = real bug)
      🟡 flaky   · login-with-2fa (PFPFP outcome string, 60% flake score)
      🟢 stable  · all 12 nav-menu cases

you ▸ Fix the broken one first. Show me the failure.

  → get_failure_details ✓ checkout-coupon-rule:
      Expected: "Discount applied: $5.00"
      Got:      "Discount applied: NaN"
      First failed: 3 runs ago, on PR #142
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice what's happening here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The AI &lt;strong&gt;doesn't ask which test is flaky&lt;/strong&gt; — it pulls flake history from &lt;code&gt;tests-history/&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The AI &lt;strong&gt;doesn't guess selectors&lt;/strong&gt; — &lt;code&gt;analyze_url&lt;/code&gt; gave it real selectors from the live page.&lt;/li&gt;
&lt;li&gt;The AI &lt;strong&gt;doesn't just run tests&lt;/strong&gt; — it returns a ranked action list. "This is broken, this is flaky, this is stable." Evidence, not gut feel.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't AI writing tests. This is &lt;strong&gt;AI doing QA&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What this deliberately is not
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;mk-qa-master is not&lt;/th&gt;
&lt;th&gt;What happens instead&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;A test framework&lt;/td&gt;
&lt;td&gt;You bring pytest / Jest / Cypress / Go test / Maestro — mk-qa-master drives them&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;An LLM&lt;/td&gt;
&lt;td&gt;Your AI client (Claude / Cursor / Codex / Gemini) does the reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A CI runner&lt;/td&gt;
&lt;td&gt;Runs locally, produces JUnit XML; pipe to GitHub Actions / Jenkins as usual&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A source-code analyzer&lt;/td&gt;
&lt;td&gt;Looks at live DOM and view hierarchy, not your repo's source&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A SaaS dashboard&lt;/td&gt;
&lt;td&gt;MCP-native, lives in your AI client. HTML reports are self-contained &lt;code&gt;.html&lt;/code&gt; files&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Knowing what a tool &lt;em&gt;isn't&lt;/em&gt; is half of trust.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uvx mk-qa-master
&lt;span class="c"&gt;# or: pip install mk-qa-master&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude Desktop config lives at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;macOS&lt;/strong&gt;: &lt;code&gt;~/Library/Application Support/Claude/claude_desktop_config.json&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Windows&lt;/strong&gt;: &lt;code&gt;%APPDATA%\Claude\claude_desktop_config.json&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Linux&lt;/strong&gt;: &lt;code&gt;~/.config/Claude/claude_desktop_config.json&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mk-qa-master"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"uvx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"mk-qa-master"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"QA_RUNNER"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pytest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"QA_PROJECT_ROOT"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/path/to/your/project"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart your client. Then in any AI session, say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Test &lt;code&gt;https://your-site/login&lt;/code&gt; — one runnable case per module, then tell me which existing test is most likely flaky."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's the whole UX. No menus. No buttons. The AI chains the tools.&lt;/p&gt;




&lt;h2&gt;
  
  
  This is one of three
&lt;/h2&gt;

&lt;p&gt;mk-qa-master is the &lt;strong&gt;execution end&lt;/strong&gt; of a family I'm building solo:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;mk-plan-master&lt;/strong&gt; — turns a pile of 30–200 raw ideas into RICE-scored, spec-draft-ready initiatives. Hands off to ↓&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mk-spec-master&lt;/strong&gt; — parses specs into scenarios, keeps a live spec ↔ test coverage matrix, grades the specs themselves. Hands off to ↓&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mk-qa-master&lt;/strong&gt; — drives the runner, generates tests, advises on what's broken vs flaky vs slow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together they form an end-to-end AI dev pipeline:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Idea → Plan → Spec → Code (your IDE) → Test → Coverage → Coach
       mk-plan mk-spec your IDE       mk-qa  mk-spec     both
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The family provides the &lt;strong&gt;rails&lt;/strong&gt; around the code; the code-writing itself stays in your IDE (Claude Code / Cursor / Copilot). I deliberately don't try to rebuild what your IDE already does well.&lt;/p&gt;

&lt;p&gt;The other two MCPs get their own posts. Follow if that pipeline sounds useful.&lt;/p&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/kao273183/mk-qa-master" rel="noopener noreferrer"&gt;https://github.com/kao273183/mk-qa-master&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyPI&lt;/strong&gt;: &lt;a href="https://pypi.org/project/mk-qa-master/" rel="noopener noreferrer"&gt;https://pypi.org/project/mk-qa-master/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Family site&lt;/strong&gt;: &lt;a href="https://mcp.chenjundigital.com" rel="noopener noreferrer"&gt;https://mcp.chenjundigital.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License&lt;/strong&gt;: MIT&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Family&lt;/strong&gt;: &lt;code&gt;mk-qa-master&lt;/code&gt; · &lt;code&gt;mk-spec-master&lt;/code&gt; · &lt;code&gt;mk-plan-master&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your team is QA-heavy and you've been frustrated by AI tools that write &lt;code&gt;# TODO&lt;/code&gt; instead of real tests — give it a try. If you've found a better way to do this, &lt;strong&gt;I'd genuinely love to hear about it in the comments&lt;/strong&gt;. This is an opinionated tool and I'm still iterating.&lt;/p&gt;

&lt;p&gt;A star helps the algorithm find people like you. Feedback helps more.&lt;/p&gt;

&lt;p&gt;— Jack Kao, building solo.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>opensource</category>
      <category>mcp</category>
    </item>
  </channel>
</rss>
