Forem: MiniKao

From mock-only-works to real-world-works: 48 hours of reCAPTCHA debugging

MiniKao — Mon, 25 May 2026 02:46:50 +0000

Honest framing first: mk-qa-master is an open-source MCP server for QA engineers. The reCAPTCHA solver in it is a Tier 3 fallback for testing your own apps when Tier 1 (Google's official test keys) and Tier 2 (feature flags / IP allowlist) aren't available. It is not a "beat captcha" tool. It refuses to run on Google / Apple / Microsoft / Discord login pages regardless of consent flag. With that out of the way…

This is a diary about the 48 hours it took to go from "shipped a reCAPTCHA solver, all unit tests green" to "it actually works against the real Google demo." Five versions (v0.7.0 → v0.7.4), three of them broken, and a bunch of lessons that I want to write down before I forget.

The setup

The idea behind the solver is simple. Two atomic MCP tools:

inspect_visual_challenge — finds the captcha iframe on the current page, screenshots it, returns the tile grid coordinates + a screenshot.
solve_visual_challenge — accepts the AI client's tile selection (which tiles contain buses, which contain crosswalks, etc.), clicks them, presses Verify, returns the token.

The AI client (Claude Code, Cursor, etc.) sees the screenshot, decides which tiles match the prompt, and calls solve. The server is the eyes and hands; the AI is the brain. Multimodal models like Claude 4.7 are surprisingly good at this — they were trained on the open web, which has a lot of bus pictures.

So far so good in theory.

Day 1 — v0.7.0 ships

The first version landed Monday. It located the reCAPTCHA challenge iframe with the selector iframe[src*="bframe"], screenshotted it, computed tile coordinates by dividing the iframe's bounding box into a 3×3 or 4×4 grid, and clicked at the center of each selected tile.

Unit tests passed. The bundled mock fixture (a self-contained HTML page that mimics reCAPTCHA's structure) round-tripped end-to-end. I wrote a PRD, shipped a release, posted a Dev.to walkthrough. Felt great.

The mock fixture's structure was:

<table class="rc-imageselect-table">
  <tr><td>...</td><td>...</td><td>...</td></tr>
  ...
</table>

Selectors in the fingerprint:

"tile_table_selector": ".rc-imageselect-table",
"tile_cell_selector": "td",

What could go wrong?

Day 2 — v0.7.1 adds hCaptcha

The next day I extended the fingerprint table to support hCaptcha. Same architecture — different selectors. No new MCP tools. Tests stayed green. I felt good about the design: when a vendor changes, you add a row to the fingerprint table, you're done.

I didn't run a real-world dogfood for hCaptcha either. (We'll come back to this.)

Day 3 — v0.7.2: the first "fix"

I wrote a tiny dogfood script — open Chromium, navigate to https://www.google.com/recaptcha/api2/demo, click the anchor to trigger an image challenge, call inspect_visual_challenge, save the screenshot, ask the AI for tile indices, call solve_visual_challenge, see if a token comes back.

The first run came back with status failed. I asked the user (in this case: me) what they saw in the browser. The answer was unsettling: "I told it to click 2, 5, and 8 — only 5 and 8 actually got highlighted."

I dug into the coordinate math. The iframe-divide approach split the full iframe into rows × cols cells. But the iframe contains a header banner (the prompt text) above the grid and a footer (the Verify button) below. So:

For a 3×3 grid in a 400×580 iframe with header ~130px and footer ~130px:
- The actual grid is 320px tall, ~106px per row.
- Naive iframe-divide gives 193px per row.
- Row 0's computed center lands in the header banner.
- Row 2's computed center lands in the footer.
- Only row 1 happens to be roughly correct.
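
The arithmetic in the list above can be checked in a few lines. This is a minimal sketch using the approximate numbers quoted there (400×580 iframe, roughly 130px of header and footer); the function name is hypothetical and is not taken from the project.

```python
def naive_row_centers(iframe_height: float, rows: int) -> list[float]:
    """Row centers when the whole iframe height is split evenly (the v0.7.0 approach)."""
    row_height = iframe_height / rows
    return [row_height * (r + 0.5) for r in range(rows)]


IFRAME_HEIGHT = 580  # approximate, from the example above
HEADER = 130         # approximate height of the prompt banner
FOOTER = 130         # approximate height of the Verify footer
ROWS = 3

grid_top = HEADER
grid_bottom = IFRAME_HEIGHT - FOOTER
true_row_height = (grid_bottom - grid_top) / ROWS  # about 106.7px

for row, center in enumerate(naive_row_centers(IFRAME_HEIGHT, ROWS)):
    if center < grid_top:
        where = "header"
    elif center > grid_bottom:
        where = "footer"
    else:
        where = f"grid row {int((center - grid_top) // true_row_height)}"
    print(f"row {row}: naive center y={center:.0f} lands in {where}")
```

With these numbers only the middle row's click lands inside the grid, which matches the "only 5 and 8 got highlighted" symptom being limited to a single row.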

I wrote v0.7.2 to fix this. Instead of dividing the iframe, I'd read each cell's real bounding_box() from the DOM via Playwright:

for index in range(tile_count):
    bb = cells.nth(index).bounding_box()
    if not is_real_dict(bb):
        # fall back to iframe-divide for mock fixtures
        break
    candidate.append({"viewport_x": bb["x"], ...})

The unit test (against the mock fixture) immediately confirmed the fix. I bumped to v0.7.2, opened a PR, merged, released, published to PyPI. Done.

Day 4 morning — wait, it's still broken

Next morning, ran the dogfood again. Console output for inspect:

"tiles": [
  {"index": 0, "viewport_x": 85, "viewport_y": 92,  "w": 133, "h": 193},
  {"index": 1, "viewport_x": 218, "viewport_y": 92, "w": 133, "h": 193},
  ...
]

133 × 193 rectangles. The exact dimensions you'd get from dividing a 400×580 iframe by 3×3. Which meant the per-cell bounding_box() path was returning None on every cell in real reCAPTCHA, silently falling back to the same broken iframe-divide math.

Looked at the code path: it had a try / except swallowing the error. I added a debug field _coord_method so the inspect response would show which path actually fired:

"_coord_method": "iframe_divide"  // ← v0.7.2's "fix" never ran

So v0.7.2 fixed the mock fixture and shipped to PyPI. In production, against real Google reCAPTCHA, it behaved identically to v0.7.0. The unit test was green because the mock fixture's <td> elements had real CSS dimensions; in real reCAPTCHA the tiles aren't <td>. I just didn't know that yet.
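
One way to keep a fallback like this from staying silent is to make the geometry function report which path produced its result. The sketch below is an illustration under assumptions, not the project's code: `tile_geometry` and the `per_cell_bbox` label are hypothetical, and the boxes are plain dicts shaped like Playwright's `bounding_box()` result (`x`, `y`, `width`, `height`, or `None`).

```python
from typing import Optional

Box = dict[str, float]


def tile_geometry(
    cell_boxes: list[Optional[Box]],
    iframe_box: Box,
    rows: int,
    cols: int,
) -> dict:
    """Return tile rectangles plus the name of the code path that produced them."""
    # Use per-cell boxes only if every cell has a real, non-zero rectangle.
    if cell_boxes and all(
        b is not None and b["width"] > 0 and b["height"] > 0 for b in cell_boxes
    ):
        return {"_coord_method": "per_cell_bbox", "tiles": list(cell_boxes)}

    # Fallback: split the iframe evenly. Still available, but now visible in the output.
    w = iframe_box["width"] / cols
    h = iframe_box["height"] / rows
    tiles = [
        {"x": iframe_box["x"] + c * w, "y": iframe_box["y"] + r * h, "width": w, "height": h}
        for r in range(rows)
        for c in range(cols)
    ]
    return {"_coord_method": "iframe_divide", "tiles": tiles}
```

A zero-sized cell is treated the same as a missing one, since a 0×0 rectangle is no more clickable than `None`.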

Day 4 afternoon — going DOM-spelunking

Wrote a one-off debug script that opened the real reCAPTCHA bframe and ran arbitrary JavaScript inside it. The first query was: "does .rc-imageselect-table even exist?"

{
  "tableExists": false,
  "altSelectors": {
    "table[class*=\"rc-imageselect\"]": true,
    ".rc-imageselect-target": true,
    ".rc-image-tile-wrapper": true
  }
}

false. The class I'd been targeting since v0.7.0 doesn't exist in production.

The real DOM looks like this:

Element	Mock fixture	Real Google reCAPTCHA
Table class	`rc-imageselect-table`	`rc-imageselect-table-33` (or `-44`)
Tile element	`<td>` with real CSS	`<div class="rc-image-tile-wrapper">` (the `<td>` is 0×0 because tiles are absolutely-positioned)
Challenge text	`.rc-imageselect-desc`	`.rc-imageselect-desc-no-canonical` (in dynamic-replace mode)

The whole fingerprint table had been wrong all along. Unit tests passed because I wrote the mock fixture to match the selectors I'd hardcoded. Tautology. The mock fixture lied because the person who wrote it was the same person who wrote the selectors.
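
The suffixed table class also carries the grid size. As a small sketch, and assuming the two trailing digits encode rows and columns (`-33` for 3×3, `-44` for 4×4, which the table above suggests but does not state outright), a hypothetical helper could read the grid shape from the class attribute instead of guessing it:

```python
import re

# Assumption: the suffix is two single digits, rows then columns.
_GRID_CLASS = re.compile(r"rc-imageselect-table-(\d)(\d)\b")


def grid_size_from_class(class_attr: str):
    """Return (rows, cols) parsed from a class like 'rc-imageselect-table-33', else None."""
    match = _GRID_CLASS.search(class_attr)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

The unsuffixed mock class returns `None`, which is a useful signal in itself that the page under test is not the production DOM.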

Day 4 evening — v0.7.3 actually fixes it

I rewrote the fingerprint to chain both real and mock selectors in a CSS selector list (comma-separated selectors, where an element matches if any one of them matches):

"challenge_text_selector": (
    ".rc-imageselect-desc-no-canonical, .rc-imageselect-desc"
),
"tile_table_selector": (
    'table[class*="rc-imageselect-table"], '
    '.rc-imageselect-target, .rc-imageselect-table'
),
"tile_cell_selector": ".rc-image-tile-wrapper, .rc-imageselect-table td",

Now the same fingerprint matches both production reCAPTCHA AND the mock fixture. The per-cell bounding_box() path finally runs against real DOM, returning real 95×95 squares instead of distorted 133×193 rectangles. Tile 0 sits at y=211 (just below the 200px header), not y=92 (inside the header banner).

I also fixed a different UX problem in the same release. The MCP server was returning the screenshot as a base64 string embedded in a JSON TextContent. Multimodal AI clients can't "see" base64 — they see a giant string of iVBORw0KG.... The fix: return the screenshot as a native MCP ImageContent:

return [
    ImageContent(type="image", data=b64, mimeType="image/png"),
    TextContent(type="text", text=json.dumps(metadata)),
]

Now Claude Code receives the screenshot as if you'd dragged it into the chat. No manual screenshot juggling.

Day 4 night — v0.7.4 closes the multi-round gap

One more dogfood run, this time the challenge text was different: "Select all images with buses. Click verify once there are none left."

This is reCAPTCHA's dynamic-replace mode. Click a matching tile, the tile gets replaced with a new image. You have to keep selecting until no buses remain, then click Verify. v0.7.3 always clicked Verify after the first round, so it always failed against this mode even with perfect tile judgment.

v0.7.4 added a new return status: "continue". When solve detects dynamic mode (the prompt contains "none left" / "確定沒有遺漏" / equivalent), it does the clicks, waits for the replace animation, re-screenshots the iframe, and returns status: "continue" with a fresh screenshot + new tile geometry. The AI client looks at the new grid, finds any remaining matches, calls solve again. When the AI sees no more matches, it passes an empty selected_tile_indices: [] to signal "click Verify now."

// Round 1 response
{
  "status": "continue",
  "rounds_used": 1,
  "screenshot_base64": "...new grid...",
  "tiles": [...],
  "hint": "Dynamic-replace round 1/5. Look at the new screenshot and call solve again."
}

// Round 2 AI sees no more buses
{ "selected_tile_indices": [], "confirm": true }

// Round 2 response
{ "status": "passed", "token": "03AGdBq25..." }

Hard cap of 5 rounds prevents infinite loops on pathological challenges. Static mode (no marker phrase) is unchanged — legacy flow runs verbatim.
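
The round logic described above can be sketched as a small decision function. This is an illustration under assumptions: the function names and the returned action labels are hypothetical, and the marker list contains only the two phrases quoted in this post.

```python
# Marker phrases quoted in the post; the real list may contain more.
DYNAMIC_MARKERS = ("none left", "確定沒有遺漏")
MAX_ROUNDS = 5


def is_dynamic(prompt: str) -> bool:
    """True when the challenge text asks to keep clicking until nothing matches."""
    lowered = prompt.lower()
    return any(marker in lowered for marker in DYNAMIC_MARKERS)


def next_action(prompt: str, selected: list[int], rounds_used: int) -> str:
    """Decide what one solve call does.

    Returns "click_and_verify" (static mode), "click_and_continue",
    "verify" (dynamic mode, nothing left to select), or "give_up".
    """
    if not is_dynamic(prompt):
        return "click_and_verify"
    if not selected:
        return "verify"  # an empty selection is the client's "click Verify now" signal
    if rounds_used >= MAX_ROUNDS:
        return "give_up"  # hard cap reached
    return "click_and_continue"
```

Keeping the decision in one pure function makes the cap and the empty-selection rule easy to unit test without a browser.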

And — the lesson from this entire saga — I added a weekly GitHub Action that runs the dogfood script against the real Google reCAPTCHA demo and asserts _coord_method != "iframe_divide". If Google ships a DOM change next week that breaks the fingerprint again, I'll get a CI failure email within seven days instead of finding out from a user issue six months later.

on:
  schedule:
    - cron: "0 2 * * 0"  # Sunday 02:00 UTC
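
The assertion step of that workflow could look roughly like the sketch below. It assumes the dogfood script writes the inspect response as JSON; the function name is hypothetical, and only the `_coord_method` field and the `iframe_divide` value come from this post.

```python
import json


def assert_real_coordinates(inspect_output: str) -> None:
    """Fail loudly if the inspect response fell back to iframe-divide geometry."""
    payload = json.loads(inspect_output)
    method = payload.get("_coord_method")
    if method is None:
        raise AssertionError("inspect output has no _coord_method field")
    if method == "iframe_divide":
        raise AssertionError("selector drift: inspect fell back to iframe_divide")
```

A missing field is treated as a failure too, so that removing the debug field by accident does not turn the guard into a no-op.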

What works now

✅ reCAPTCHA v2 image-grid (3×3 + 4×4) — verified against the real Google demo
✅ hCaptcha image-select — same fingerprint infrastructure, fixture verified, real-vendor TBD
✅ Multi-round dynamic-replace — unit-test verified, end-to-end real-vendor TBD
✅ MCP ImageContent — multimodal clients see screenshots natively
✅ Consent gate, domain allowlist, hard-stop blacklist (Google / Apple / Microsoft / Discord login pages)
✅ Weekly real-world CI guard

What doesn't (yet)

❌ Mobile WebView — v0.8.0 mini-PRD drafted, ~6 working days of implementation ahead
❌ reCAPTCHA v3 — pure behavior scoring, no visible challenge, out of scope by design
❌ Cloudflare Turnstile — same reason
❌ Audio captcha fallback — accessibility tier, low usage in QA context
❌ The dynamic-replace loop on real Google reCAPTCHA with AI in the loop — that's my next dogfood session

Lessons I want to remember

Mock fixtures can lie. When the same person writes both the production selectors and the mock that tests them, the mock matches by construction. There's no signal. The fix is to dogfood against the real thing. If you can't dogfood, at minimum capture the real page (a saved DOM snapshot, or a HAR recording replayed so the browser renders the real markup) and assert against that.
Silent fallbacks are the worst kind of bug. v0.7.2's try / except swallowed the failure of every per-cell bounding_box() and quietly fell back to broken math. A _coord_method debug field that surfaces which path actually fired would have caught this in minutes. I now add a debug field every time I have more than one code path for the same output.
Multi-round is UX, not a bug. reCAPTCHA's "Click verify once there are none left" isn't an edge case — it's the dominant mode on hard challenges. I built the static-only solver, said "ship it," and was surprised when most real-world challenges fell into the dynamic-replace bucket I hadn't designed for.
Weekly CI catches what unit tests can't. The dogfood workflow runs once a week against a third-party demo. It's noisy, it depends on a vendor's continued cooperation, and it'd be wrong to depend on it for blocking merges. But as a background signal that catches selector drift, it's exactly the right level of investment.

Try it

pip install mk-qa-master==0.7.4

# In your MCP host (Claude Code config, Cursor, etc.)
{
  "mcpServers": {
    "qa-master": {
      "command": "python",
      "args": ["-m", "mk_qa_master.server"],
      "env": {
        "QA_VISUAL_CHALLENGE_CONSENT": "true",
        "QA_VISUAL_CHALLENGE_AUTHORIZED_DOMAINS": "your-staging.example.com"
      }
    }
  }
}

Then ask Claude: "Test the signup flow on staging. If you hit a captcha, solve it." The MCP tools take it from there.

Repo + walkthrough: https://github.com/kao273183/mk-qa-master.

What's next

v0.8.0 — mobile WebView captcha via Maestro CLI (same fingerprint table, new driver). PRD is up in the repo. Probably another diary entry when that one ships.

If you find a bug, the dogfood script lives at scripts/dogfood-inspect-only.py — run it against the page that broke and the inspect output will tell you exactly which coordinate path fired. Beats debugging blind.

I open-sourced 24 QA skills for Claude Code — from spec to release

MiniKao — Fri, 22 May 2026 03:16:52 +0000

TL;DR — I just open-sourced QA Claude Skill — 24 production-grade QA skills for Claude Code covering test design, automation, performance, security, mutation testing, and more. MIT for non-commercial use. GitHub repo.

The problem

For two years I've been iterating a personal Claude Code workspace for QA work — bug reports, test plans, review checklists, regression matrices. It saved me hours every week.

But every time a colleague asked "how do you write a test plan that fast?" — handing them my workspace meant they got dozens of files hard-coded with my JIRA project key, my Slack user ID, my AWS bucket. Useless to anyone else.

So I spent the last two weeks extracting 24 skills into a properly generalized, open-source repo. Drop in your team's IDs via config.json and it works for any team, any stack.

What's in the box

24 skills across 8 categories:

Category	Skills
Test Design (8)	test-master · flutter-test-master · test-review · regression-test · speckit-to-tc · tc-version-diff · sheet-md-sync · smoke-test-analyzer
Automation (3)	test-automation · flutter-test-automation · tc-to-pytest
Bug Management (1)	bug-report
Quality Quantification (2)	mutation-testing · property-based-test-gen
Reporting (1)	publish-regression
Performance & Security (3)	performance-test-gen · security-scan · api-contract-test
CI Health (2)	visual-regression-gen · flaky-test-hunter
Quality Specialties (4)	a11y-audit · localization-test · push-notification-test · test-data-factory

What it actually does

Each skill activates on natural language triggers. Some examples:

1. "I want to file a bug"

The bug-report skill walks you through RIDER format (Reproduction / Impact / Device / Expected vs Actual / References), checks JIRA for duplicates, does root-cause analysis from git history, creates the ticket with the right priority, and sends a Slack DM — in one conversation.

2. "Plan tests for this new feature"

test-master reads your JIRA ticket (or your description), scans both iOS and Android repos for affected modules, designs a test pyramid (70% Unit / 20% Integration / 10% UI), generates black-box + white-box test cases in Google Sheets, identifies coverage gaps against existing tests, and builds an automation ROI roadmap.

It also enforces a11y must-checks per UI feature (Dynamic Type / VoiceOver / contrast / touch targets) — no more "we forgot accessibility" at the end of the sprint.

3. "Are my tests actually catching bugs?"

mutation-testing runs mutmut on your Python backend. It changes < to <=, True to False, or numeric literals — then re-runs your pytest. If your tests still pass with the broken code, that mutation survived = your TCs have fake coverage.

Then property-based-test-gen takes those survived mutations and generates hypothesis strategies that fuzz 200 inputs per test to close the gap.
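
To make "survived mutation" concrete, here is a hand-rolled illustration of the idea. It does not use mutmut or hypothesis; the function, the mutant, and both "suites" are hypothetical examples written for this sketch.

```python
def is_minor(age: int) -> bool:
    return age < 18


def is_minor_mutant(age: int) -> bool:
    return age <= 18  # the kind of operator swap a mutation tool introduces


def weak_suite(fn) -> bool:
    """Passes for both versions: no input sits on the boundary."""
    return fn(5) is True and fn(30) is False


def strong_suite(fn) -> bool:
    """Adds the boundary values, which is where the two versions differ."""
    return weak_suite(fn) and fn(17) is True and fn(18) is False


print("weak suite, mutant survives:", weak_suite(is_minor_mutant))
print("strong suite, mutant survives:", strong_suite(is_minor_mutant))
```

The weak suite reports the mutant as surviving and the strong one kills it. Both suites execute the same single line of code, so line coverage alone cannot tell them apart.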

4. "Which tests should run on every PR?"

smoke-test-analyzer scans your existing test suite (iOS XCUITest / Android Espresso / pytest), scores each test on 5 weighted criteria (criticality / speed / stability / independence / coverage value), and tiers them:

T0 PR Smoke (< 3 min) — runs every PR
T1 Daily (< 10 min) — runs nightly
T2 Release (< 60 min) — pre-release full regression
T3 Manual — exploratory, visual, a11y

Then it generates .xctestplan for iOS or Gradle filters for Android.
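
As a rough sketch of how weighted scoring and a time budget can interact: the five criteria names and the 3-minute T0 budget come from the description above, but the weights, the 0-to-1 ratings, and the greedy selection are all assumptions made for illustration. The skill's actual scoring is not shown in this post.

```python
from dataclasses import dataclass

# Hypothetical weights; only the criteria names come from the post.
WEIGHTS = {
    "criticality": 0.35,
    "speed": 0.20,
    "stability": 0.20,
    "independence": 0.10,
    "coverage_value": 0.15,
}


@dataclass
class ScoredTest:
    name: str
    seconds: float
    ratings: dict  # criterion name -> rating in [0, 1]

    @property
    def score(self) -> float:
        return sum(WEIGHTS[k] * self.ratings[k] for k in WEIGHTS)


def pick_pr_smoke(tests: list, budget_seconds: float = 180.0) -> list:
    """Greedily fill the T0 time budget with the highest-scoring tests."""
    chosen, used = [], 0.0
    for test in sorted(tests, key=lambda t: t.score, reverse=True):
        if used + test.seconds <= budget_seconds:
            chosen.append(test.name)
            used += test.seconds
    return chosen
```

Tests that do not fit the T0 budget would then be candidates for the slower tiers.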

Three modes for any tool stack

Not every team has the same MCP servers installed. Same skills, three modes:

Mode	When to use
`full-mcp`	You have Atlassian + Slack + Google Workspace MCPs
`partial-mcp`	Some MCPs missing — skills degrade gracefully
`markdown-only`	Solo dev / no MCP / pure documentation flow

The markdown-only mode is what makes this actually portable — every skill can still produce useful Markdown reports under .claude/testing/ without external dependencies. Solo developers can use the full suite without setting up anything.

6 ready-to-use presets

cp config/presets/full-stack.json     config/config.json   # All MCPs
cp config/presets/jira-only.json      config/config.json   # JIRA only
cp config/presets/markdown-only.json  config/config.json   # Pure docs
cp config/presets/startup.json        config/config.json   # Small startup
cp config/presets/enterprise.json     config/config.json   # 5 team boards
cp config/presets/government.json     config/config.json   # High-compliance

Why I made it bilingual + Simplified Chinese

I'm Taiwanese, and most of the test-engineering content out there is English-first. So every skill ships with:

SKILL.md — Traditional Chinese (primary)
SKILL.en.md — English mirror
concept-zh.md — Beginner intros for unfamiliar concepts (mutation testing, property-based testing, spec-driven dev, test tiering)

The README is in English (primary), Traditional Chinese, and Simplified Chinese.

The license model

I went with a dual license:

🟢 MIT — Personal use / education / research / non-profits / 30-day evaluation / open-source contributions
🔴 Commercial — For-profit company internal use, paid products, SaaS, paid consulting

See LICENSE-COMMERCIAL.md for how to obtain a commercial license. I'm doing this case-by-case via GitHub Issues — the goal isn't to monetize aggressively, but to leave space for sustainable enterprise support if it grows.

Quick start

git clone https://github.com/kao273183/qa-claude-skill.git
cd qa-claude-skill
cp config/config.example.json config/config.json   # Edit your IDs
./install.sh

In Claude Code:

Generate test plan for a user login feature

The test-master skill activates and walks you through. Or try:

"I want to file a bug — the checkout crashes on Android"
"Review these test cases [Google Sheet URL]"
"Check if my tests actually catch bugs in src/auth/"

Windows users — there's a PowerShell version (install.ps1) as of v1.3.0.

What's still missing

This is v1.6.2. The roadmap still has:

Japanese translation
Web UI for editing config.json visually
More skills (test-impact-analyzer, oauth-flow-test, websocket-realtime-test, llm-quality-eval...)

PRs welcome. The CONTRIBUTING.md has the template for adding a new skill.

Try it

GitHub: kao273183/qa-claude-skill

I'd love to hear what skills are missing for your team's stack — drop an issue or comment below.

If this saves your team time, you can buy me a coffee ☕ — but a ⭐ on the repo helps more.

This is a community / personal project for Claude Code users — NOT an official Anthropic product.

The 10% CAPTCHA problem in QA — and why your AI solver should refuse Google login

MiniKao — Tue, 19 May 2026 08:34:00 +0000

The 10% that ruins QA day

You've automated the login flow. Your Playwright suite hums along. Then a CAPTCHA shows up and the whole thing collapses.

The honest answer from any QA engineer who's done this for more than six months is: stop trying to solve the CAPTCHA. Configure the test environment so it never appears. Test mode keys. Backend bypass tokens. Feature flags. IP allowlist on staging. The list of "right ways" is long and almost all of them are boring.

That works for ninety percent of testing. Then there's the remaining ten percent:

A B2B integration test where the third party owns the CAPTCHA and won't change their config for you
A client engagement with written authorization to test the production system, but no access to the backend
A staging environment that intentionally mirrors prod CAPTCHA behavior to catch UX regressions
A mobile webview test where IP allowlist doesn't reach
An accessibility audit that needs to actually see the challenge to test screen-reader behavior

For those, every shortcut violates someone's terms of service or your engagement contract. So we built mk-qa-master v0.7.0: a pair of MCP tools that let an AI client read a reCAPTCHA v2 image grid and click the right tiles — but only after a consent gate, never against third-party login portals, and never retaining the screenshot beyond the active cycle.

This post is about why the safety design matters more than the AI magic.

The three-tier CAPTCHA strategy

The strategy lives in the built-in QA knowledge layer (get_qa_context(section="CAPTCHA")) so every test the AI generates respects the same hierarchy:

Tier	Approach	When to use
1 — bypass	reCAPTCHA test keys, feature flags, IP allowlist, test-mode headers	Default. Covers ~90% of cases.
2 — degrade	Mark as `external_dependency`, skip downstream assertions	When you can't change the backend but the test isn't about the CAPTCHA itself.
3 — AI visual judgment	This feature.	Only when 1 + 2 don't fit.

Tier 1 is the "boring" answer and it's right almost every time. Google publishes test keys that always return success. Cloudflare Turnstile does the same. hCaptcha does the same. Your staging env can use them in seconds.

Tier 2 is for when the CAPTCHA is on the way to what you're really testing — say, you want to verify the post-login dashboard, not the auth flow. Mark the auth step as external_dependency, prove independently that the dashboard renders correctly with a seeded session, and you've decoupled the concern.

Tier 3 is what this release is about. It's the last resort, and we designed it like one.

What v0.7.0 actually ships

Two atomic MCP tools:

inspect_visual_challenge(confirm: bool = False)
  # Returns: screenshot of the challenge frame (base64),
  # challenge text, 3x3 or 4x4 tile grid metadata.
  # Refuses on forbidden domains.
  # Requires QA_VISUAL_CHALLENGE_CONSENT=true.

solve_visual_challenge(
    tile_indices: list[int],   # AI client's tile selection
    confirm: bool = False
)
  # Executes the click chain for the chosen tiles + Verify.
  # Returns: status (passed/failed), token, hint.
  # Same gates as inspect.

The AI client (Claude, Gemini, GPT-4V, whichever) is the actual solver. mk-qa-master is just eyes and hands: it screenshots, it accepts a list of indices, it clicks. The intelligence about which tiles contain a bicycle lives in the multimodal model.

That separation matters: it means the QA tool doesn't ship a CAPTCHA-solving ML model, doesn't compete with services like 2Captcha, doesn't accumulate know-how about how to beat specific challenge types. It just enables an AI client that already has vision to do its job inside a Playwright session.

The safety design

When you read the implementation, ~40% of the code is feature logic. The other 60% is restraint.

Consent gate. Default off. Nothing happens until you set:

QA_VISUAL_CHALLENGE_CONSENT=true

And every tool call requires confirm=true on top of that. Two locks, deliberately.
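
A minimal sketch of the two-lock check, assuming the environment value is compared case-insensitively against "true". The exception class and function name are hypothetical; only the variable name and the `confirm` argument come from this post.

```python
import os


class ConsentError(PermissionError):
    """Raised when either lock is missing."""


def require_consent(confirm: bool, env=None) -> None:
    """Raise unless both the environment flag and the per-call confirm are set."""
    env = os.environ if env is None else env
    # Lock 1: operator-level opt-in, off by default.
    if env.get("QA_VISUAL_CHALLENGE_CONSENT", "").strip().lower() != "true":
        raise ConsentError("QA_VISUAL_CHALLENGE_CONSENT is not set to true")
    # Lock 2: the caller must confirm on every single call.
    if not confirm:
        raise ConsentError("call must pass confirm=true")
```

Passing `env` explicitly keeps the check testable without mutating the real process environment.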

Per-call disclaimer. The first call surfaces the acceptable-use text in the error message:

ACCEPTABLE USE
This tool is intended for QA testing on:
- Sites you own
- Client sites where you have explicit written authorization
- Test environments where Tier 1 bypass is unavailable

DO NOT USE THIS TOOL ON:
- Third-party sites you do not own
- Production sites without explicit authorization
- Sites where automated access violates TOS or local law

If you're the kind of engineer who'd skip a disclaimer, you'll see it three times before you can call this thing for real.

Hard-stop domains. Some places are refused regardless of consent flag:

_FORBIDDEN_DOMAINS = frozenset({
    "accounts.google.com",
    "login.microsoftonline.com",
    "id.apple.com",
    "appleid.apple.com",
    "facebook.com",
    "login.live.com",
    "login.yahoo.com",
    "twitter.com/login",
    "x.com/login",
})

Third-party identity portals. There is no legitimate QA reason to script a CAPTCHA solver against someone else's login portal. The match is suffix-based on the host, so subdomains of a listed host are refused too, while a lookalike such as accounts.google.com.evil is not mistaken for the real portal.
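
A suffix-based host match can be sketched as follows. This is an illustration, not the project's implementation: the function name is hypothetical, the set is a host-only subset of the list above, and entries that carry a path (such as twitter.com/login) would need an additional path check that is not shown here.

```python
from urllib.parse import urlsplit

# Host-only subset of the list above, for illustration.
FORBIDDEN_HOSTS = frozenset({"accounts.google.com", "login.live.com", "facebook.com"})


def is_forbidden(url: str) -> bool:
    """True if the URL's host is a listed host or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    # Match on a label boundary: "m.facebook.com" matches, "notfacebook.com" does not.
    return any(host == d or host.endswith("." + d) for d in FORBIDDEN_HOSTS)
```

Requiring the leading dot in the suffix comparison is what separates a true subdomain from a hostname that merely ends with the same characters.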

Optional authorized-domains allowlist. For added discipline:

QA_VISUAL_CHALLENGE_AUTHORIZED_DOMAINS=client-staging.example.com,internal-app.example.com

When set, the tool refuses on any host that isn't on this list. Recommended for client engagements where you want a hard contract trail.
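
The allowlist check might look like the sketch below. It assumes exact host matching and treats an unset variable as "not enforced"; both choices and both function names are assumptions for illustration.

```python
from urllib.parse import urlsplit


def parse_allowlist(raw):
    """Split the comma-separated env value into a set of lowercase hosts."""
    if not raw:
        return None  # unset or empty: the allowlist is not enforced
    return frozenset(h.strip().lower() for h in raw.split(",") if h.strip())


def is_authorized(url: str, allowlist) -> bool:
    """True if no allowlist is set, or the URL's host is on it exactly."""
    if allowlist is None:
        return True
    host = (urlsplit(url).hostname or "").lower()
    return host in allowlist
```

Exact matching is the stricter choice: a sibling subdomain of an authorized host is still refused unless it is listed.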

Privacy. Screenshots live only during the active inspect → solve cycle. Telemetry logs the boolean outcome — never the screenshot, never the challenge text, never the tile selection. You don't accumulate a corpus of solved CAPTCHAs.

What a real session looks like

Inside any MCP-compatible client (Claude Desktop, Cursor, Codex CLI, Gemini CLI, Cline...):

You: "Run the checkout suite. If a CAPTCHA blocks the test, resolve it
      so the rest of the flow can continue."

Claude:
  → run_tests()
  → ✗ failed at step 'click Checkout' — CAPTCHA modal detected

  → inspect_visual_challenge(confirm=true)
  → returns: screenshot + grid metadata + challenge text
            ("Select all images with traffic lights")

  → [Claude looks at the image, identifies tiles 0, 2, 5]

  → solve_visual_challenge(tile_indices=[0, 2, 5], confirm=true)
  → returns: { status: "passed", token: "...", hint: "CAPTCHA verified.
              Resume your test." }

  → run_failed()   # retry the failed step
  → ✓ checkout completes

The shape mirrors how MCP composes other capabilities — analyze, generate, run, advise. The AI orchestrates; the server just runs each step.

Scope, on purpose

v0.7.0 covers reCAPTCHA v2 image-grid only. That's deliberate:

reCAPTCHA v3 has no visible challenge — it's a behavioral risk score. There's nothing to inspect.
Cloudflare Turnstile mostly runs invisibly. Same story.
hCaptcha lands in v0.7.1 once the same safety machinery is fully ported over to its tile layout.
Behavioral CAPTCHA (mouse pattern, keystroke timing) is permanently out of scope. That's an anti-bot arms race we have no interest in feeding.

This is a feature designed to retire as the web does. When test keys become universally available and behavioral risk scoring takes over, this entire module should become unnecessary. We're fine with that.

Why MCP

A few people have asked why we packaged this as an MCP tool instead of a pytest fixture or a Playwright plugin. Two reasons:

The intelligence lives in the AI client, not in the server. MCP is the only protocol that makes that clean — the server exposes capabilities, the client (which already has vision and reasoning) decides how to use them. A pytest fixture would have to choose a vision provider, manage credentials, run inference. None of that is the test runner's job.
Composition with the rest of the QA loop. mk-qa-master already exposes analyze_url, generate_test, run_tests, get_optimization_plan. Putting the visual solver on the same MCP surface means the AI can chain it naturally: detect failure → inspect → solve → re-run. No glue code.

If you want the longer pitch on why MCP is the right shape for QA tooling, the README walks through it. Short version: AI clients should orchestrate testing the way a senior engineer would, and MCP is the cleanest way to give them the building blocks.

Try it

pip install mk-qa-master  # or: uvx mk-qa-master

In your MCP client config:

{
  "mcpServers": {
    "mk-qa-master": {
      "command": "uvx",
      "args": ["mk-qa-master"],
      "env": {
        "QA_RUNNER": "pytest",
        "QA_PROJECT_ROOT": "/path/to/your/tests",
        "QA_VISUAL_CHALLENGE_CONSENT": "true"
      }
    }
  }
}

The repo includes examples/sample_captcha_fixture/ — a local HTML page wired up with Google's public reCAPTCHA test keys so you can verify the end-to-end inspect/solve loop without ever touching a real production CAPTCHA.

What's next

v0.7.1 — hCaptcha support, same safety machinery
v0.8.0 — get_optimization_plan gains a "CAPTCHA pressure" metric that tells you when your suite is leaning too hard on Tier 3 and should be moved back to Tier 1
Always — no telemetry export of challenge content; no centralized solver model; no third-party identity portal support

If you find a domain that should be on the hard-stop list and isn't, open an issue. If you find a use case where Tier 1 / Tier 2 should work but the docs don't make it obvious, that's the higher-impact bug — the goal is for this feature to be used less, not more.

If mk-qa-master saved your QA flow, a coffee keeps the late-night CAPTCHA debugging going. Star the repo, file an issue, send a Maestro flow that broke — they're all the same to me.

Repo: kao273183/mcp-test-runner
PyPI: mk-qa-master
Glama: glama.ai/mcp/servers/kao273183/mcp-test-runner
Landing page: mcp.chenjundigital.com

Claude can drive Schemathesis + Postman through one MCP — I shipped both runners in one day

MiniKao — Sun, 17 May 2026 14:18:06 +0000

TL;DR — Today I shipped mk-qa-master v0.6.0 (Schemathesis) in the morning and v0.6.1 (Newman / Postman) in the afternoon. Same MCP tool surface (still 16 tools), same report.json / history / flake / coach pipeline, two new ways to drive API tests from Claude / Cursor / Codex. Total code: ~300 lines across two runners. Total elapsed: about 6 hours. This post is the architecture story.

The setup

I'm a QA engineer building mk-*, an open-source family of MCP servers for the AI dev pipeline. Last week I shipped v0.5.1 of mk-qa-master with five runners — pytest / Jest / Cypress / Go test / Maestro for mobile.

Two days ago, while updating the family-site copy, I added a line that said "mk-qa-master tests web + mobile + API". The first two were honest. The third was a stretch — yes, your existing pytest-with-httpx tests would run, but there was no dedicated API runner. A QA reader could install it expecting OpenAPI ingestion or Postman support, and find neither.

I had two options:

Walk the marketing copy back to "we drive web + mobile, your existing API tests ride along"
Make the copy true

I picked option 2. Two API runners, same day.

This post is how that played out.

Why MCP makes "ship two runners in one day" plausible

The mk-qa-master architecture has a runner abstraction that already shipped with five frameworks. Each runner is a Python class implementing the same interface:

class TestRunner:
    name: str

    def list_tests(self) -> str: ...
    def run_tests(self, filter: str | None = None, **kwargs) -> dict: ...
    def run_failed(self) -> dict: ...
    def get_report_summary(self) -> dict: ...

Whatever framework the runner wraps, the MCP tool surface is the same 16 tools. The AI client (Claude, Cursor, Codex, Gemini) calls run_tests / get_optimization_plan / get_failure_details the same way regardless of whether you're testing a React app, a Go service, an iOS Simulator, or an API. The runner translates.

Adding a new runner = ~150 lines of Python + register in REGISTRY + write a sample + bump version. That's it.
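
To show the shape of "implement the interface, then register", here is a toy sketch. The interface methods come from the class above; the `register` decorator, the dict-based `REGISTRY`, `get_runner`, and `EchoRunner` are hypothetical stand-ins, since the post does not show how the real registry is built.

```python
REGISTRY: dict = {}


def register(cls):
    """Class decorator: add a runner class to the registry under its name."""
    REGISTRY[cls.name] = cls
    return cls


class TestRunner:
    name: str = "base"

    def list_tests(self) -> str:
        raise NotImplementedError

    def run_tests(self, filter=None, **kwargs) -> dict:
        raise NotImplementedError

    def run_failed(self) -> dict:
        raise NotImplementedError

    def get_report_summary(self) -> dict:
        raise NotImplementedError


@register
class EchoRunner(TestRunner):
    """Toy runner that wraps nothing; a real one would shell out to a CLI."""
    name = "echo"

    def list_tests(self) -> str:
        return "echo::always_passes"

    def run_tests(self, filter=None, **kwargs) -> dict:
        return {"passed": 1, "failed": 0, "filter": filter}

    def run_failed(self) -> dict:
        return {"passed": 0, "failed": 0}

    def get_report_summary(self) -> dict:
        return {"runner": self.name}


def get_runner(name: str) -> TestRunner:
    """Look up a runner by the name a setting like QA_RUNNER would carry."""
    return REGISTRY[name]()
```

Nothing on the tool side changes when a class like this is added, which is the point the paragraph above makes.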

This is the MCP-level value claim: the AI doesn't relearn your stack. You add a runner; the AI's tool surface inherits the new capability automatically.

So shipping API testing was less "design a new product" and more "fill in the runner slot the abstraction was waiting for."

v0.6.0 — Schemathesis (OpenAPI / Swagger)

Schemathesis reads an OpenAPI 3.x or Swagger 2.0 schema and fuzzes every operation with property-based tests — response schema conformance, status-code conformance, server-error detection. Hand it a URL or file path, it spits out coverage in 30–60 seconds.

The runner wraps the schemathesis run CLI. User-facing config:

{
  "mcpServers": {
    "mk-qa-master": {
      "command": "uvx",
      "args": ["mk-qa-master[api]"],
      "env": {
        "QA_RUNNER": "schemathesis",
        "QA_OPENAPI_URL": "https://api.example.com/openapi.json"
      }
    }
  }
}

That's it. Restart your client. Then in any session:

"Test the API at https://api.example.com/openapi.json — find anything broken, then give me a prioritized action plan."

Real-feeling session transcript:

you ▸ Test https://api.example.com/openapi.json — find anything broken
       and give me a prioritized action plan.

  → get_runner_info ✓ schemathesis · OpenAPI 3.0.3 detected
  → list_tests ✓ 24 endpoints × 5 checks = 120 cases
  → run_tests ⚠ 112 passed, 6 failed, 2 errored (47s)
  → get_optimization_plan ✓ next priorities:

      🔴 broken  · POST /users :: response_schema_conformance
        Same Schemathesis signature × 3 → "status 500, expected 201|400"
        Action: response schema doesn't allow 500; either fix the
        validation bug or add 500 to the schema's responses block

      🔴 broken  · GET /search :: not_a_server_error
        Crashes under `?q=null` and `?limit=-1`
        Action: missing input validation on the search handler

      🟡 warn    · DELETE /users/{id} returned 204 when schema says 200
        Likely safe to update the schema; verify with PM

      🟢 stable  · 18 endpoints, no findings

The advisor's classification is the same logic the suite uses for UI tests — 3 consecutive failures with the same error signature = broken. A test that's red-green-red across runs = flaky. mk-qa-master doesn't differentiate "the API is broken" from "the UI is broken" — same flake-score, same broken classification, same advisor.

That's the abstraction paying off.
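The two rules above (three consecutive failures with one signature = broken, red-green-red = flaky) can be sketched roughly as follows. The (outcome, signature) tuple shape and the "failing" / "unknown" labels are assumptions for illustration, not the advisor's actual data model.

```python
from __future__ import annotations


def classify(history: list[tuple[str, str]]) -> str:
    """Classify one test from its run history, oldest run first.

    Each entry is (outcome, error_signature): outcome is "P" or "F",
    and the signature is "" for a pass. This shape is an assumption.
    """
    if not history:
        return "unknown"

    outcomes = [outcome for outcome, _ in history]
    last_three = history[-3:]

    # Broken: three consecutive failures sharing one error signature.
    if len(last_three) == 3 and all(o == "F" for o, _ in last_three):
        if len({sig for _, sig in last_three}) == 1:
            return "broken"

    # Collapse repeated outcomes so "FFPPFF" becomes "FPF".
    compressed = "".join(
        o for i, o in enumerate(outcomes) if i == 0 or o != outcomes[i - 1]
    )
    # Flaky: a failure, then a pass, then a failure again (red-green-red).
    if "FPF" in compressed:
        return "flaky"

    return "stable" if outcomes[-1] == "P" else "failing"
```

Because the function only sees outcomes and signatures, it does not care whether the history came from a UI test or an API check.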

The one CLI-flag mistake that cost me 20 minutes

Here's the part that was not smooth.

The PRD I wrote in the morning said the runner would invoke schemathesis like this:

# ⚠ The PRD's invocation as written; --report-json does not exist in Schemathesis 3.x
schemathesis run \
  --checks all \
  --report-json /tmp/report.json \
  --junit-xml /tmp/junit.xml \
  --hypothesis-database=none \
  "$URL"

The subagent implementing the runner followed the spec faithfully. CI choked instantly:

Error: No such option '--report-json'. Did you mean '--report'?

Schemathesis 3.x has no JSON-report flag. The PRD assumed one based on... I'm not sure what. Maybe an older version, maybe wishful thinking, maybe just a hallucination in my own design doc.

Fix: rewrite _normalize_report to parse --junit-xml output instead — JUnit XML is stdlib-parseable (xml.etree.ElementTree) and standard across every test runner I've ever touched. Took 20 minutes.
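For a sense of how small the JUnit route is, here is a minimal sketch using only the standard library. The function name normalize_junit and the returned dict shape are assumptions for illustration; the real _normalize_report and report.json shape are not shown in this post.

```python
import xml.etree.ElementTree as ET


def normalize_junit(xml_text: str) -> dict:
    """Reduce JUnit XML to outcome counts plus one entry per test case."""
    root = ET.fromstring(xml_text)
    cases = []
    # iter() finds <testcase> at any depth, so both a <testsuites> root
    # and a bare <testsuite> root are handled.
    for case in root.iter("testcase"):
        if case.find("failure") is not None:
            outcome = "failed"
        elif case.find("error") is not None:
            outcome = "errored"
        elif case.find("skipped") is not None:
            outcome = "skipped"
        else:
            outcome = "passed"
        cases.append({"nodeid": case.get("name", ""), "outcome": outcome})

    summary = {"passed": 0, "failed": 0, "errored": 0, "skipped": 0}
    for entry in cases:
        summary[entry["outcome"]] += 1
    return {"summary": summary, "cases": cases}
```

The same parser works for any runner that emits JUnit XML, which is part of why it is a convenient normalization source.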

Lesson: when writing a PRD that hardcodes CLI flags, run <tool> --help on the actual installed version before committing. The spec is only worth what the underlying tool actually supports.

I'll be repeating this to myself for v0.7.
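One way to turn that lesson into a habit is a small pre-flight check that compares the flags a spec mentions against the installed tool's --help output. This is a sketch under assumptions: missing_flags and read_help are hypothetical helper names, and plain text matching on help output is a heuristic, not a guarantee.

```python
import re
import subprocess


def missing_flags(help_text: str, flags: list[str]) -> list[str]:
    """Return the flags that never appear as whole options in --help output."""
    missing = []
    for flag in flags:
        # Match the flag only when it is not part of a longer option name,
        # so "--report" in the help text does not satisfy "--report-json".
        pattern = r"(?<![\w-])" + re.escape(flag) + r"(?![\w-])"
        if not re.search(pattern, help_text):
            missing.append(flag)
    return missing


def read_help(command: list[str]) -> str:
    """Run `<tool> --help` on the installed version and return its output."""
    result = subprocess.run(
        command + ["--help"], capture_output=True, text=True, check=False
    )
    return result.stdout + result.stderr
```

Something like missing_flags(read_help(["schemathesis", "run"]), flags_from_prd) could run before the PRD is committed, where flags_from_prd is whatever list the spec hardcodes.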

v0.6.1 — Newman (Postman collections)

After lunch I shipped the second runner.

Newman is the official CLI for running Postman collections. Postman has ~30M users; a huge chunk of them have collections in version control already. Newman + that collection JSON = headless replay of every request and pm.test(...) assertion.

Runner shape, same as Schemathesis but for Postman:

{
  "mcpServers": {
    "mk-qa-master": {
      "command": "uvx",
      "args": ["mk-qa-master"],
      "env": {
        "QA_RUNNER": "newman",
        "QA_POSTMAN_COLLECTION": "/path/to/your-api.postman_collection.json",
        "QA_POSTMAN_ENVIRONMENT": "/path/to/staging.postman_environment.json"
      }
    }
  }
}

Newman is npm-side, not pip-side, so it's a system prerequisite rather than a Python optional dep:

npm install -g newman

This was a small choice that took 2 minutes to settle: do you bundle Newman into the Python optional dep group somehow? You can't — pyproject.toml only knows about Python. So Newman gets the npm install -g treatment, the runner does shutil.which("newman"), and if it's missing the user sees a clear ImportError pointing at the install command.
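The prerequisite check described above can be sketched in a few lines. The function name require_newman and the injectable which parameter are assumptions added so the check can be exercised without Node.js installed; the error type follows what the post describes.

```python
import shutil


def require_newman(which=shutil.which) -> str:
    """Return the path to the newman binary, or raise with install help."""
    path = which("newman")
    if path is None:
        # The post says the runner surfaces an ImportError in this case.
        raise ImportError(
            "newman not found on PATH; install it with: npm install -g newman"
        )
    return path
```

Failing at startup with the install command in the message is cheaper than letting the first run_tests call die inside a subprocess.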

The runner translates Newman's JSON report (run.executions[] + run.failures[]) into mk-qa-master's report.json shape. One nodeid per assertion:

GET {{baseUrl}}/books :: Books :: List books
POST {{baseUrl}}/books :: Books :: Create book
GET {{baseUrl}}/books/{{bookId}} :: Books :: Get book by id

Same history / flake / coach pipeline as before.
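As a rough sketch of the per-assertion translation, the function below walks run.executions[] and emits one case per assertion. The input is a trimmed, assumed subset of Newman's JSON report, the name newman_to_cases is hypothetical, and the nodeid format here is deliberately simpler than the one listed above.

```python
def newman_to_cases(report: dict) -> list[dict]:
    """Flatten a Newman-style JSON report into one case per assertion."""
    cases = []
    for execution in report.get("run", {}).get("executions", []):
        request_name = execution.get("item", {}).get("name", "<unnamed>")
        for assertion in execution.get("assertions", []):
            # An assertion carrying an "error" object is treated as failed.
            error = assertion.get("error")
            label = assertion.get("assertion", "<unnamed>")
            cases.append(
                {
                    "nodeid": f"{request_name} :: {label}",
                    "outcome": "failed" if error else "passed",
                    "message": (error or {}).get("message", ""),
                }
            )
    return cases
```

Keeping one case per assertion is what lets the flake logic single out an unstable assertion inside an otherwise healthy request.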

No CLI-flag mistake this time — I ran newman run --help first, sketched the flag list, then started implementation. Lesson learned from the morning.

Schemathesis vs Newman — when to use which

I get asked this every time I show the two runners. Here's the call I make:

You have…	Use…
An OpenAPI 3.x / Swagger 2.0 schema and you want generated tests across the whole surface	Schemathesis — fuzz-driven, finds bugs you didn't think to write tests for
A Postman collection your team already curates by hand	Newman — re-uses your existing investment, runs the assertions you already wrote
Both (a schema for breadth + a collection for happy paths)	Run both in the same session — Schemathesis catches schema drift, Newman catches business-logic regressions
Neither, but you have pytest tests hitting your API	Stay on `QA_RUNNER=pytest`, no migration needed — your existing tests already ride the same pipeline

The point of having both isn't to replace either ecosystem. It's that the AI doesn't need to know which one is active. From Claude's perspective, run_tests returns the same shape. The runner does the translation.

What I'd do differently

Things I'd change on a redo:

Run --help first on every CLI before writing the PRD. (See above.)
Single PRD covering Phase 1 + Phase 2 instead of writing Phase 2 ratification as an appendix. Mid-sized features deserve a single design doc, not a doc + amendment.
Bundle the sample Postman collection with a Prism mock script so users can run prism mock openapi.yaml in the background and immediately have something live to point Newman at. Right now the sample is correct but a bit lonely until the user provides a target.

Things I'd keep:

Optional deps for Python-side, system prereq for npm-side. Forcing schemathesis onto every install would bloat the base package. Forcing newman as a pip dep doesn't even work.
--junit-xml as the normalization source for Schemathesis. Standard format, stdlib parseable, future-proof.
Per-assertion nodeids for Newman, per-check nodeids for Schemathesis. Finer granularity than "this endpoint passed" — the flake-score logic needs to know which assertion within an endpoint is unstable.

Quick start

If you want to try it right now:

# Schemathesis path (OpenAPI / Swagger)
pip install 'mk-qa-master[api]'

# Newman path (Postman)
npm install -g newman
pip install mk-qa-master

Then drop the snippet from above into your Claude Desktop / Claude Code / Cursor / Codex config. Restart your client. Ask Claude to test your API. That's the whole UX.

The bundled sample at examples/sample_api_project/ has both an openapi.yaml and a postman-collection.json for the same fictional Library API — same 3 endpoints, two different runner paths, identical AI-side workflow. Drop a mock server (Prism, Mockoon, whatever) in front and you can dogfood the whole loop in ~5 minutes.

What's next

The current plan for v0.7.0 is Pact provider verification + an analyze_api tool (OpenAPI introspection → candidate test scenarios). Whether it ships depends on whether v0.6.0 / v0.6.1 produce real adoption signal. If 6 weeks from now nobody's filed an issue about Pact, I'll skip it and focus on something the community is actually asking for.

This is the discipline I'm trying to learn — ship two runners on the same day the architecture allows it; don't speculate a third just because the abstraction would still hold.

Family

mk-qa-master is one of three open-source MCP servers I'm building:

mk-plan-master — idea triage + RICE scoring + spec-draft bridge
mk-spec-master — specs → scenarios + coverage matrix
mk-qa-master (this) — drives the test runner across web / mobile / API

Together they form: Idea → Plan → Spec → Code (your IDE) → Test → Coverage → Coach.

Family site: mcp.chenjundigital.com

If your team is QA-heavy and you've been frustrated by AI tools that either write # TODO for API tests or charge $50k/year to run them — give the v0.6 line a try. If you find anything weird, the issue tracker is the right place.

A star helps the algorithm find people like you. Feedback helps more.

— Jack Kao, building solo.

I'm a QA engineer. After Claude wrote # TODO in my 100th test, I built an MCP server.

MiniKao — Sat, 16 May 2026 19:32:44 +0000

TL;DR — mk-qa-master is an open-source MCP server that lets Claude / Cursor / Codex / Gemini drive your real test suite — pytest, Jest, Cypress, Go test, and Maestro for mobile. 16 tools, 5 categories, a three-layer QA knowledge architecture. uvx-installable. MIT.

The moment I stopped blaming the model

The 5th time Claude wrote # TODO: add real selector here in a generated test, I tried a smarter prompt. The 20th time, I switched models. The 100th time, I stopped blaming the LLM.

I'm a QA engineer. I've watched LLMs write beautiful-looking test scaffolds for two years now, and every one of them collapses at the same place:

The model can read your code. It cannot see your live DOM, your mobile view hierarchy, your last 10 test runs, or that checkout-flow.spec.ts has been red 7 times in 14 days.

So it guesses. Guesses are how you get # TODO.

The fix isn't a smarter prompt. It's giving the LLM access to the things it's currently guessing about.

That's what the Model Context Protocol (MCP) is for. And that's why I built mk-qa-master.

What "AI for QA" usually means

Most AI-for-testing products today fall into one of three buckets:

IDE plugins that emit test files — Copilot Tests, Cursor's test generator. Great in a screenshot. They write the file, you fix the selectors.
"Just prompt ChatGPT" tutorials — works for one test, falls apart at ten. No persistence, no awareness of what's actually flaky, no runtime feedback.
End-to-end AI testing SaaS — record-and-playback wrappers. They own your test infrastructure, charge per seat, and you're locked in.

What's missing from all three: the AI never touches the runner. It writes code; you run; you debug; you tell the AI what broke. It's a chatbot pretending to be an engineer.

The reframe: stop asking AI to write tests. Make it drive your test runner.

What MCP changes

MCP (introduced by Anthropic in late 2024, now adopted by Cursor, Codex CLI, Gemini CLI, Zed, Cline and others) lets an AI client call tools — not just see text, but trigger actions, read structured responses, chain them.

An MCP server is just a process that exposes tools. Drop it into your client config:

{
  "mcpServers": {
    "mk-qa-master": {
      "command": "uvx",
      "args": ["mk-qa-master"],
      "env": {
        "QA_RUNNER": "pytest",
        "QA_PROJECT_ROOT": "/path/to/your/project"
      }
    }
  }
}

…and now Claude has 16 new things it can do in your project: probe the DOM of a live URL, list your existing tests, generate new ones with real selectors, run them, read JUnit XML, write an optimization plan based on the last N runs.

Your runner just became part of the AI's tool surface.

mk-qa-master in 60 seconds

16 tools across 5 categories. You don't need to memorize names; the README has a cookbook of natural-language prompts that map to each chain.

Category	Tools	What it does
Discover	`get_runner_info` · `list_tests` · `analyze_url` · `analyze_screen`	Which framework is active. What tests exist. Probe a URL or a live mobile screen for form / nav / CTA modules with real selectors.
Generate	`generate_test` · `auto_generate_tests` · `codegen` · `init_qa_knowledge` · `get_qa_context`	Emit runnable pytest `.py` or Maestro `.yaml`. Not `# TODO` placeholders.
Run	`run_tests` · `run_failed`	Drive pytest / Jest / Cypress / Go test / Maestro. Auto-retry, JUnit XML, screenshots, Playwright `trace.zip`, Maestro recordings.
Report	`get_test_report` · `get_failure_details` · `generate_html_report` · `get_test_history`	Outcome history, error signatures, per-test flake scores.
Advise	`get_optimization_plan`	Three lenses: suite quality (flaky vs broken vs slow), MCP usability, AI effectiveness. Output is a ranked action list — what to fix next, with evidence.

Switch frameworks with a single env var: QA_RUNNER=pytest | jest | cypress | go | maestro. Web and mobile share the same MCP surface — analyze_screen works on iOS Simulator, Android Emulator, real devices, and (yes) BlueStacks via adb connect.

The part nobody else builds: a three-layer QA knowledge architecture

This is what makes mk-qa-master not monkey-testing.

A DOM-only analyzer produces "empty field should error" for every form on the internet. That's not testing, it's noise. To produce a test that means anything, the generator needs domain context. So I layered three:

Layer 1 — Built-in

ISTQB's seven principles, equivalence partitioning, decision tables, state transitions, the test pyramid, shift-left, mobile testing checklists, QA metrics — baked into the server. The AI gets methodology by default, not by accident.

Layer 2 — Your project's `qa-knowledge.md`

Drop a file at your project root with your business rules, historical bugs, standard assertion copy, user-journey snippets, technical constraints. init_qa_knowledge scaffolds one. The MCP loads it on every relevant tool call. This is where the "AI doesn't know my business" problem actually gets solved.

Layer 3 — Per-test inline

Pass a business_context slice into generate_test. It gets printed as a # Business context: block inside the generated test, so the next reviewer sees why this test exists without leaving the file.

Three layers of context. One MCP. Pile them up and the AI stops producing "click the button, see something happen" garbage.
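Layers 2 and 3 are simple enough to sketch. The file name qa-knowledge.md and the "# Business context:" header come from the post; the function names, the indentation of the comment body, and the empty-string fallback are assumptions for illustration only.

```python
from pathlib import Path


def load_project_knowledge(project_root: str) -> str:
    """Layer 2: return qa-knowledge.md from the project root, or '' if absent."""
    path = Path(project_root) / "qa-knowledge.md"
    return path.read_text(encoding="utf-8") if path.is_file() else ""


def render_business_context(context: str) -> str:
    """Layer 3: render free text as a '# Business context:' comment block."""
    lines = [line.strip() for line in context.strip().splitlines()]
    # Blank lines inside the context become bare comment markers.
    body = [f"#   {line}" if line else "#" for line in lines]
    return "\n".join(["# Business context:", *body])
```

Writing the context into the generated file as comments keeps the reasoning next to the assertions it justifies.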

A real session

Here's what a Monday morning with this looks like:

you ▸ Test https://your-site/login — one runnable case per module

  → analyze_url ✓ 4 modules · 12 endpoints · 18 candidate cases
  → generate_test ✓ tests/test_login.py (4 cases)
  → run_tests ⚠ 3 passed, 1 failed
  → get_optimization_plan ✓ next priorities:
      🔴 broken  · checkout-coupon-rule (same signature × 3 runs = real bug)
      🟡 flaky   · login-with-2fa (PFPFP outcome string, 60% flake score)
      🟢 stable  · all 12 nav-menu cases

you ▸ Fix the broken one first. Show me the failure.

  → get_failure_details ✓ checkout-coupon-rule:
      Expected: "Discount applied: $5.00"
      Got:      "Discount applied: NaN"
      First failed: 3 runs ago, on PR #142

Notice what's happening here:

The AI doesn't ask which test is flaky — it pulls flake history from tests-history/.
The AI doesn't guess selectors — analyze_url gave it real selectors from the live page.
The AI doesn't just run tests — it returns a ranked action list. "This is broken, this is flaky, this is stable." Evidence, not gut feel.

This isn't AI writing tests. This is AI doing QA.

What this deliberately is not

Not	Use this instead
A test framework	You bring pytest / Jest / Cypress / Go test / Maestro — mk-qa-master drives them
An LLM	Your AI client (Claude / Cursor / Codex / Gemini) does the reasoning
A CI runner	Runs locally, produces JUnit XML; pipe to GitHub Actions / Jenkins as usual
A source-code analyzer	Looks at live DOM and view hierarchy, not your repo's source
A SaaS dashboard	MCP-native, lives in your AI client. HTML reports are self-contained `.html` files

Knowing what a tool isn't is half of trust.

Quick start

uvx mk-qa-master
# or: pip install mk-qa-master

Claude Desktop config lives at:

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json
Linux: ~/.config/Claude/claude_desktop_config.json

{
  "mcpServers": {
    "mk-qa-master": {
      "command": "uvx",
      "args": ["mk-qa-master"],
      "env": {
        "QA_RUNNER": "pytest",
        "QA_PROJECT_ROOT": "/path/to/your/project"
      }
    }
  }
}

Restart your client. Then in any AI session, say:

"Test https://your-site/login — one runnable case per module, then tell me which existing test is most likely flaky."

That's the whole UX. No menus. No buttons. The AI chains the tools.

This is one of three

mk-qa-master is the execution end of a family I'm building solo:

mk-plan-master — turns a pile of 30–200 raw ideas into RICE-scored, spec-draft-ready initiatives. Hands off to ↓
mk-spec-master — parses specs into scenarios, keeps a live spec ↔ test coverage matrix, grades the specs themselves. Hands off to ↓
mk-qa-master — drives the runner, generates tests, advises on what's broken vs flaky vs slow.

Together they form an end-to-end AI dev pipeline:

Idea → Plan    → Spec    → Code (your IDE) → Test  → Coverage → Coach
       mk-plan   mk-spec   your IDE          mk-qa   mk-spec    both

The family wraps the rails; code-writing stays in your IDE (Claude Code / Cursor / Copilot). I deliberately don't try to rebuild what your IDE already does well.

The other two MCPs get their own posts. Follow if that pipeline sounds useful.

Links

GitHub: https://github.com/kao273183/mk-qa-master
PyPI: https://pypi.org/project/mk-qa-master/
Family site: https://mcp.chenjundigital.com
License: MIT
Family: mk-qa-master · mk-spec-master · mk-plan-master

If your team is QA-heavy and you've been frustrated by AI tools that write # TODO instead of real tests — give it a try. If you've found a better way to do this, I'd genuinely love to hear about it in the comments. This is an opinionated tool and I'm still iterating.

A star helps the algorithm find people like you. Feedback helps more.

— Jack Kao, building solo.
