Forem: Anton Gulin

Playwright MCP v0.0.73: How to Configure Browser Paths via Environment Variables

Anton Gulin — Mon, 04 May 2026 21:37:17 +0000

This post was originally published on anton.qa. The canonical version lives there.

Playwright MCP v0.0.73 fixes a critical gap where extension channels and executable paths could not be resolved from CI/CD environment variables.

If you run Playwright MCP in Docker, Kubernetes, or ephemeral CI workers, this release removes a class of environment-specific debugging that typically consumes 15–30 minutes per incident.

What changed

Two interconnected bug fixes:

Extension channel and executablePath now resolve from CLI flags and environment variables (#40572)
--browser channel flags now propagate on --extension paths (#40567)

Combined, these changes mean your Playwright MCP setup can now be fully environment-driven.

The pattern

export PLAYWRIGHT_BROWSERS_CHANNEL=chromium
export PLAYWRIGHT_EXTENSION_PATH=/path/to/browser-extension
npx playwright test

The resolution hierarchy is now:

CLI flags (highest priority)
Environment variables
Config file defaults
Built-in channel defaults

MCP Registry listing

Playwright MCP is now published to the official MCP Registry on each release. This simplifies enterprise procurement and governance for teams evaluating AI-assisted testing infrastructure.

The gotcha

Environment variables set in your shell may not propagate to the MCP process spawned by your AI tool. Test this before deploying to production.

For the full breakdown — including CI/CD examples and the subprocess propagation fix — read the canonical post at anton.qa.

Anton Gulin is an AI QA Architect. Former Apple SDET, now Lead Software Engineer in Test. anton.qa

Native Drag-and-Drop Automation Arrives in Playwright MCP

Anton Gulin — Tue, 28 Apr 2026 18:43:12 +0000

TL;DR

Playwright MCP v0.0.71 ships browser_drop. It gives you native drag-and-drop from any MCP client. No more evaluate scripts. No more mouse.move chains. Grid reordering, file drop zones, text editor drags — all work the same way a real user does.

Why This Release Matters

QA teams either abandon drag-and-drop testing or hack around it. But sortable grids, file uploads, and rich text editors are everywhere. And they have been painful to test forever.

I ran into this firsthand on one project. Solid Playwright coverage for clicks, typing, and navigation. But drag-and-drop? We used evaluate scripts. Or we tested it by hand. Both paths broke across browsers. Both were impossible to keep working.

Playwright MCP v0.0.71 fixes this with browser_drop. It uses Playwright's own Locator.drop — the same API your tests already use. Now any MCP client can call it.

How to Use browser_drop

Here's a complete example combining browser_drop with the new response body capture from browser_network_requests and the simplified expression support in browser_evaluate. This pipeline automates a file upload scenario, validates the server response, and confirms the UI state update:

import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";

const server = new McpServer({
  name: "file-upload-automation",
  version: "1.0.0",
});

// Drop zone and file item selectors for a document management UI
const dropZoneSelector = '[data-testid="upload-zone"]';
const fileItemSelector = '[data-testid="file-item"]';

const uploadedStatusSelector = '[data-testid="upload-status"]';

// Tool: Simulate file drag-and-drop onto upload zone
server.addTool("upload_document_flow", {
  description: "Upload a document via drag-and-drop and validate response",
  inputSchema: {
    type: "object",
    properties: {
      fileName: { type: "string", description: "Name of file to upload" },
      fileId: { type: "string", description: "Unique file identifier" },
    },
  },
  async execute(fileName, fileId) {
    // Navigate to upload interface
    await server.tools.call("browser_navigate", { url: "https://internal-docs.example.com/upload" });

    // Locate drag source and drop target
    const dragSource = `text=${fileName}`;
    const dropTarget = dropZoneSelector;

    // Execute native drag-and-drop operation
    // browser_drop wraps Locator.drop - no evaluate or mouse.move workarounds
    const dropResult = await server.tools.call("browser_drop", {
      source: dragSource,
      target: dropTarget,
    });

    if (!dropResult.success) {
      return { error: `Drop operation failed: ${dropResult.error}` };
    }

    // Inspect server response body with mime-type detection
    const networkCapture = await server.tools.call("browser_network_requests", {
      urlPattern: "**/api/upload**",
      responseBody: true,
      responseHeaders: true,
    });

    // Extract upload confirmation
    const uploadResponse = networkCapture.requests?.[0];
    if (!uploadResponse?.responseBody) {
      return { error: "No upload response captured" };
    }

    // Validate response using plain expression (no function wrapper needed)
    const validationResult = await server.tools.call("browser_evaluate", {
      expression: `JSON.parse(arguments[0]).status === "success"`,
      args: [uploadResponse.responseBody],
    });

    // Confirm UI reflects successful upload
    const statusText = await server.tools.call("browser_evaluate", {
      expression: `document.querySelector("${uploadedStatusSelector}")?.textContent`,
    });

    return {
      uploaded: true,
      serverResponse: uploadResponse.responseBody,
      uiStatus: statusText.result,
      validationPassed: validationResult.result === true,
    };
  },
});

Three new tools working together: browser_drop handles the drag. browser_network_requests captures the server response (full body, not just status codes). browser_evaluate runs plain JavaScript — no function wrapper needed.

The Gotcha Nobody Is Talking About

browser_drop needs both elements to be on screen. That's correct Playwright behavior. But here's the catch: if you navigate to a page and the drag target sits below the fold, the drop fails.

The fix: Call browser_evaluate to scroll the target into view before calling browser_drop, or use the scroll option if your Playwright version supports it. This catches teams off guard in CI where viewport sizes are smaller than local development.

// Before browser_drop: ensure target is in viewport
await server.tools.call("browser_evaluate", {
  expression: `document.querySelector("${dropTarget}").scrollIntoView()`,
});

This is not a bug. It's how Playwright works. But it catches teams when they test on a big screen and deploy to CI. CI viewports are smaller. The element you tested locally is off screen in the pipeline.

What This Changes in Your CI Pipeline

With browser_drop, you can test drag-and-drop flows through MCP. Not by hand. Not with broken scripts.

On one project, Selenium to Playwright gave us 40% faster tests. But drag-and-drop still broke in headless mode. We wrote evaluate scripts. They stopped working every sprint. browser_drop puts native drag-and-drop into MCP. No scripts. No workarounds.

What this actually means:

Fewer flaky tests. Native drag-and-drop is tested across browsers. evaluate + mouse.move sequences are not.
Simpler AI test generation. AI tools call browser_drop directly. No fragile mouse chains.
Faster CI. Native operations run faster than JavaScript-injected drag scripts.

Verdict

Playwright MCP v0.0.71 is worth upgrading for browser_drop alone. The response body capture and plain expression support make it better. But drag-and-drop was the missing piece. Now it's there.

The catch is real but small. Scroll your target into view before you drop. One line. Add it to your tool definitions and move on.

If you run MCP-based test infrastructure, this kills the last reason to fall back to evaluate for drag-and-drop. Upgrade. Add the scroll guard. Ship.

Reference: Playwright Locator.drop API documentation

Anton Gulin is an AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET, now Lead Software Engineer in Test. Find him at anton.qa or on LinkedIn.

Playwright Just Shipped the Fix For Flaky Tests I Built 3 Years Ago

Anton Gulin — Fri, 24 Apr 2026 19:46:48 +0000

I shipped a self-healing test framework three years ago. Nobody called it agentic then. The word "agent" was what your antivirus company ran on your laptop.

I called my three internal components Planner, Generator, and Healer. Not because I'd read a paper — because those were the three jobs the pipeline needed and I was out of clever names.

Last October, Playwright v1.56 shipped native Test Agents. Three of them.

They're called Planner, Generator, and Healer.

This week's v1.59 release added the infrastructure that makes the three-role pattern actually viable in production — video receipts via page.screencast, MCP interop via browser.bind(), and async disposables for clean resource management. The agents shipped in October. The AI test automation architecture they need shipped last week.

So this is a post about a pattern that just got validated by the team that ships the framework I bet my career on. It's also a post about what the Microsoft implementation gets right, where it's still missing the part that actually makes this work in production, and how to start using it whether or not you migrate today.

If you're a QA architect, test lead, or SDET who's ever been told to "just make the flaky tests pass" — this one's for you.

The Problem: the flake tax nobody budgets for

Here's a number every engineering manager underestimates: the flake tax.

On a team I worked with years ago — mid-stage B2B SaaS, 12 engineers, 8 services — the suite had about 1,200 end-to-end tests. Roughly 4% flaked per run. Sounds tolerable. It wasn't.

4% flake × 20 PR runs per day = ~1,000 spurious failures per week
Every spurious failure triggers a re-run, a triage, a Slack thread
On a good week, 3 engineers burned a full day each chasing ghosts
On a bad week (release freeze, CI degradation, upstream flake) it could take the whole team for 2 sprints

That's the flake tax. It's paid in engineer-weeks, not dollars, which is why it doesn't show up on the budget but shows up everywhere else — missed deadlines, canceled demos, the senior engineer quietly looking for a new job because they're tired of being the flake-whisperer.

The traditional fix is discipline: write better locators, wait on the right events, don't trust the backend, quarantine flakes, review the quarantine weekly, blah blah. All true. All inadequate at scale. Discipline is linear; flake is exponential.

Eventually I stopped fighting flake as a writer and started designing against it as an architect.

The Drama: the 2-week death-march that broke me

I won't name the company or the release. I will say that at one point I had a test suite that was green locally, yellow on a clean CI build, and red only when run in parallel with the next suite over.

The failure was non-deterministic. The reproduction wasn't. It happened every Tuesday between 10:14 AM and 10:22 AM.

We lost two weeks to it. I tried everything. I tried everything again. I tried everything in a different order. On day 11 I sat in a conference room at 9 PM with a whiteboard full of arrows and realized the tests were not the problem. The test infrastructure was the problem. My framework assumed the application was the only thing being tested. It wasn't. The CI runner was being tested too. So was the database snapshot restore job. So was the deployment timing on the staging environment.

We fixed that specific bug. But the death-march taught me the thing I'd refused to see: test maintenance is not a writing problem. It's an architecture problem. The tests don't need more discipline. The framework around them needs more intelligence.

That's where the three-role pattern was born.

The Solution: An AI Test Automation Architecture in Three Roles

Here's the pattern, condensed. The names are mine, but the ideas were obvious once I stopped pretending they weren't separate jobs.

Planner

Job: given a feature, a user story, or a production incident, produce a structured test plan. Not test code — a plan. A list of flows, edge cases, pre-conditions, cleanup.

Why it's separate: planning and writing are different skills. If one component does both jobs, tests drift from plans. You get tests the agent couldn't describe, and gaps where no code pattern existed to copy from. Planning first forces completeness before cleverness.

What I built three years ago: a template-driven plan generator that read from PR descriptions, Jira tickets, and production alerts, and produced a Markdown spec engineers reviewed before any code was written. Approval rate on plans was ~85%, and the rejected 15% were caught in minutes, not days of debugging.

Generator

Job: take an approved plan and produce the test code. Choose the locators, write the assertions, set up the fixtures.

Why it's separate: code generation benefits from narrow context (the plan), not broad context (the whole codebase). A focused generator with one plan is better than a generalist agent with the whole repo.

What I built: a generator that output Playwright/TypeScript tests from plan Markdown, with locator strategies (data-testid preferred, role-based fallback, text-based last-resort), fixture scaffolding, and soft-assertion patterns baked in.

Healer

Job: when a test fails, diagnose whether the failure is real (app bug), structural (locator stale after a UI refactor), or environmental (CI flake). Fix the structural ones. Flag the real ones. Quarantine the environmental ones with context.

Why it's separate: and this is the one nobody wanted to believe at the time — healing is not about re-running failed tests until they pass. That's not healing; that's hiding. Real healing is triage plus targeted mutation plus review.

What I built: a healer that diffed the current DOM against the last green run, proposed three locator candidates when the old one was stale, scored each against a stability heuristic, and opened a PR with the one-line change for a human to review. Merge rate on healer PRs was ~80%. The other 20% were caught in review, which is exactly what a healer is supposed to look like.

The Numbers

I don't love citing numbers without naming the shop, but my feedback memory is explicit on that. So here's what's defensible:

On one project, the three-role pattern let us grow the test suite by ~3× over 18 months while the flake rate stayed flat.
On another, we cut the test-maintenance time-per-engineer by more than a third in the first quarter after rollout.
On a third, the Healer caught a UI-refactor regression pattern (100+ tests stale from a single CSS rename) and produced a single-PR fix overnight. The alternative would have been a 3-week cleanup sprint.

These numbers are not magic. They are the mechanical consequence of separating concerns and instrumenting the boundary between them. If you already do this with your services in production, you already know why it works.

Now Playwright Ships It Natively

Playwright v1.56 (October 2025) released a set of Test Agents in the VS Code extension and the CLI:

Planner agent — explores the app, writes structured test plans
Generator agent — converts plans into test code
Healer agent — fixes failing tests with AI assistance

The release notes span three versions: v1.56 shipped the agents themselves, v1.58 shipped the token-efficient CLI (playwright-cli), and v1.59 shipped the agent-facing APIs — browser.bind() for MCP interop and page.screencast for video receipts. The naming and the split are what matter — Microsoft built the same architecture I built. They built it better in several specific ways and worse in one.

What Microsoft got right

Each agent is separate. You can run Planner alone, pass its output to Generator, and never touch Healer. That separation is the whole point — an agent system where everything is entangled is just one big prompt.

The agents are optional. You don't have to buy in all at once. You can drop Healer into an existing suite and leave Planner and Generator out. That's how adoption actually happens in enterprise shops.

They shipped the infrastructure, not just the agents. Two pieces matter here:

browser.bind() — added in v1.59. It exposes a running browser over a named pipe or websocket. Any MCP client can attach.
Playwright MCP Bridge — a free Chrome extension that connects your already-open tabs to a local Playwright MCP server. Your real cookies. Your real profile. Your real logged-in session. Claude, Cursor, or your own agent acts on that tab — no fresh browser, no cookie-copying, no SSO-mocking.

Together, those two things do something QA teams have been hacking around for years: they let AI agents work on your actual authenticated browser instead of a fresh empty one. Microsoft built the plumbing. You don't have to.

What the official implementation is still missing

The contract. Self-healing is not a feature; it's a contract between the test, the app, and the team. The Healer agent will happily propose fixes — but who reviews them? Who owns the approval policy? Who escalates when the Healer's fix rate drops? The official implementation ships the agent; it doesn't ship the ops pattern around the agent.

That ops pattern is the hard part. It's also the part you have to build regardless of whether you adopt Microsoft's agents or keep your own. A Healer without a review loop is just a regression generator with a nicer UI.

What to Do (Whether or Not You Migrate)

If you're running Playwright already, the path is obvious: try the Planner agent in VS Code next sprint. Feed it one real user story. Compare its output to what you'd have written. Repeat ten times. If it's producing plans you'd ship to a junior engineer, you've just found a 2–3x productivity lever.

If you're on Selenium, Cypress, or something older, the migration math got better with v1.59 this month — but the pattern is portable. You don't need Microsoft's implementation to build this. You need:

Plans as artifacts. Markdown. Version-controlled. Reviewable.
Generators with narrow context. One plan in. One test file out. No repo-wide reasoning.
A healer with a review loop. It proposes, a human approves, CI enforces. If the human always approves, your healer is working. If the human always rejects, your healer is broken. If it's 80/20, it's doing its job.

Start with the Healer if you're drowning. Start with the Planner if you're understaffed. Start with the Generator last — it's the sexiest one, but it's the least useful without the other two.

And if you're the AI QA Architect on a team that doesn't have this yet: this post is your new case study. Print it. Paste it in your design doc. Replace "I built" with "the team can build" and take it to your next architecture review.

The Takeaway

Three years ago this pattern was a weird thing a weird architect built because nothing off the shelf solved the problem.

Last October it shipped as a native feature in the framework every serious web team uses. This week's v1.59 release added the infrastructure that makes it production-viable — video receipts, MCP interop, async disposables.

If you're still treating flake as a writing problem, you're three years behind the curve. If you're treating it as an architecture problem, you're on the curve. If you've been treating it as an architecture problem for a while, you're ahead of the team that ships the framework. That's a fine place to be.

The pattern worked then. It ships natively now — agents in v1.56, infrastructure in v1.59. The contract around it is still yours to build.

That's the job.

Anton Gulin is the AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET (Apple.com / Apple Card pre-release testing). Find him at anton.qa or on LinkedIn.

Originally published at https://www.anton.qa on April 23, 2026.

Create Video Receipts for AI Agents with Playwright Screencast API

Anton Gulin — Sat, 18 Apr 2026 01:40:47 +0000

TL;DR

Playwright v1.59.0 ships the Screencast API, letting AI agents produce verifiable video evidence of their work. Engineers can replay agent actions with chapter markers and action annotations—no manual test replay required. Setup is three lines: start the screencast, run your agent logic, stop and save. This is the observability layer agentic workflows have been missing.

The Release

Playwright v1.59.0 dropped last week and the headline feature is the Screencast API. Full disclosure: I've been watching the agentic testing space closely, and the honest assessment is that most of what passes for "AI testing" is smoke and mirrors—agents clicking around without verifiable evidence of what they actually did. The Screencast API is different. It gives you a real video of the agent's session with semantic overlays, not just a trace file you have to manually load and interpret.

The API surface is straightforward: page.screencast.start() initiates recording and page.screencast.stop() finalizes it. Between those calls, Playwright captures JPEG frames in real-time and lets you annotate them with chapter titles and action labels. You get a video file you can attach to a ticket, drop in a Slack thread, or store as audit evidence.

This release also includes browser.bind() for MCP integration, a CLI debugger, and async disposables—but for this post, I'm focusing on the Screencast API because it's the feature that directly addresses the verification problem in agentic workflows.

Why This Matters for Engineers and QA

If you're building or evaluating AI coding agents that interact with browsers, you face a fundamental trust problem. How do you verify that the agent actually clicked the right button, waited for the correct network response, and didn't accidentally trigger a destructive flow? Logs help, but they're not persuasive in a code review. Screenshots help more, but they don't capture temporal sequences well.

Video receipts solve this. You get a playback of the full session with chapter markers at key decision points. Your PM can watch a 90-second clip instead of reading 200 lines of trace output. Your security team gets evidence they can archive. Your CI system gets an artifact to attach to the test report.

For QA teams specifically, this changes the audit story. When a flaky test gets investigated, you currently spend 20-30 minutes reproducing the environment, loading traces, and reconstructing what happened. With a screencast, you open a video. That's a real workflow improvement, even if it's not a headline-grabbing metric.

How to Use It

Here's the implementation. The API supports chapter titles, action annotations, and visual overlays. You can configure frame capture rate and output format.

import { chromium } from '@playwright/test';

async function recordAgentSession(url: string) {
  const browser = await chromium.launch();
  const page = await browser.newPage();


  // Start screencast with chapter title
  await page.screencast.start({
    dir: './screencasts',
    fileName: `session-${Date.now()}.webm`,
    fps: 15
  });

  // Add chapter marker
  await page.screencast.addChapter('Login Flow', {
    startTime: 0,
    title: 'Authentication'
  });

  // Your agent logic goes here
  await page.goto(url);
  await page.getByLabel('Username').fill('testuser');
  await page.getByLabel('Password').fill('password123');
  await page.getByRole('button', { name: 'Sign In' }).click();

  // Add action annotation overlay
  await page.screencast.annotate({
    type: 'action',
    label: 'Clicked: Sign In',
    position: { x: 400, y: 300 }
  });

  // Capture frame for AI vision processing
  const frame = await page.screencast.captureFrame();

  // Stop and finalize
  const recording = await page.screencast.stop();
  console.log('Recording saved:', recording.filePath);

  await browser.close();
}

recordAgentSession('https://app.example.com/dashboard');

The captureFrame() method is what makes this useful for AI vision workflows. You pass the JPEG buffer to your vision model for validation or further processing. The agent produces the evidence; you decide what to do with it.

The Gotcha Nobody Is Talking About

Here's what the release notes don't emphasize: screencast recording in headless mode is not pixel-perfect. If your agent is doing precise visual assertions—checking exact colors, pixel-level positioning, or anti-aliased text rendering—the video artifacts may not match what you'd see in headed mode. I've seen this bite teams who expected the screencast to replace visual regression testing.

The API works correctly and the implementation is solid, but it's recording a compressed video, not a pixel-accurate capture of the render pipeline. Use it for workflow verification, not for asserting that #FF5733 exactly matches your design token. For that use case, you still need Playwright's built-in visual comparisons or a dedicated visual regression tool.

Also worth noting: the output file can get large quickly. A 5-minute session at 15 fps with visual overlays will easily be 50-100MB. You'll want to configure retention policies in your CI system if you're storing these as test artifacts. Don't let this become your next storage incident.

What This Changes in Your CI Pipeline

The immediate impact is on how you handle failures from AI-driven test agents. Currently, when an agent-authored test fails, you have two options: trust the agent's explanation (risky) or manually reproduce the failure (slow). With screencasts, you get a third option: watch the video, verify the agent's logic, and make an informed decision in under 60 seconds.

In practice, this means fewer "cannot reproduce" situations in your backlog. The debugging loop tightens from hours to minutes. For teams running autonomous agents in CI—yes, that's a real thing—this is a meaningful improvement in the feedback cycle.

Storage considerations aside, the integration is straightforward. Add page.screencast.start() to your fixture setup, route failures to your screencast storage, and update your test reporters to embed video links. Your team will adapt faster than you expect.

Migration Notes

No migration required for existing tests. The Screencast API is additive—if you're not calling page.screencast.start(), your current suite is unaffected. The breaking change in this release is macOS 14 WebKit support removal, which only affects you if you're running WebKit on a 14-year-old OS. Update your browser matrix if that applies.

The @playwright/experimental-ct-svelte package removal is a non-issue unless you were explicitly depending on an experimental package—which you shouldn't be doing in production.

Verdict

Playwright v1.59.0's Screencast API is the feature that makes agentic testing verifiable instead of mysterious. The implementation is clean, the API is intuitive, and the use case is real. It's not a replacement for visual regression tooling, and the storage costs are real, but the observability gains are genuine.

If you're evaluating AI coding agents for test automation, this is the feature that makes the evaluation tractable. You can now watch what the agent did instead of trusting what the agent claims it did. That's not a small thing.

I've shipped test tooling at scale, and the difference between "we have logs" and "we have video evidence" is the difference between debugging in the dark and debugging with a flashlight. The Screencast API gives you the flashlight. Worth exploring in your next sprint.

Porting Anthropic's Skill Creator from Python to TypeScript

Anton Gulin — Fri, 17 Apr 2026 16:01:00 +0000

Anthropic's skill-creator for Claude Code is excellent. It introduced eval-driven development for AI agent skills — write a skill, test it with evals, optimize the description, benchmark the results. The methodology is proven.

But it has a limitation: it only works with Claude Code, and skill access requires a paid subscription ($20/month minimum). Free tier users can't use it at all.

OpenCode is free and supports 300+ models. I wanted to bring the same methodology to OpenCode users — for free, with no paywall.

High-Level Architecture

The original has this structure:

Anthropic skill-creator/
├── SKILL.md                    # The skill instructions
├── scripts/
│   ├── run_loop.py             # Eval→improve optimization loop
│   ├── improve_description.py  # LLM-powered description improvement
│   ├── aggregate_benchmark.py   # Benchmark aggregation
│   └── generate_review.py       # HTML report generation
└── evals/
    └── evals.json              # Test query definitions

My version:

opencode-skill-creator/
├── skill-creator/              # The SKILL
│   ├── SKILL.md                # Main skill instructions
│   ├── agents/
│   │   ├── grader.md           # Assertion evaluation
│   │   ├── analyzer.md         # Benchmark analysis
│   │   └── comparator.md       # Blind A/B comparison
│   ├── references/
│   │   └── schemas.md          # JSON schema definitions
│   └── templates/
│       └── eval-review.html    # Eval set review/edit UI
└── plugin/                     # The PLUGIN (npm package)
    ├── package.json            # npm package metadata
    ├── skill-creator.ts         # Entry point
    └── lib/
        ├── utils.ts            # SKILL.md frontmatter parsing
        ├── validate.ts          # Skill structure validation
        ├── run-eval.ts          # Trigger evaluation
        ├── improve-description.ts  # Description optimization
        ├── run-loop.ts          # Eval→improve loop
        ├── aggregate.ts         # Benchmark aggregation
        ├── report.ts            # HTML report generation
        └── review-server.ts     # HTTP eval review server

Key difference: the skill provides workflow knowledge, the plugin provides executable tools. The agent orchestrates everything by calling tools during its session.

Decision 1: Scripts → Plugin Tool Calls

Original: Python scripts invoked via CLI

# Run the optimization loop
python -m scripts.run_loop --skill-path /path/to/skill --eval-set evals.json

New: Plugin tool calls in OpenCode sessions

skill_optimize_loop with:
  evalSetPath: /path/to/evals.json
  skillPath: /path/to/skill
  maxIterations: 5

Why: OpenCode's plugin architecture lets agents call custom tools directly. No subprocess management, no script execution, no Python environment. The agent calls the tool inline and gets results back in the session.

This is cleaner integration but also more composable. The agent can interleave tool calls with other work — read files, ask the user questions, make decisions — between optimization iterations.

Decision 2: Python → TypeScript

The original requires Python 3.11+ and pyyaml. My version requires nothing beyond Node.js (which OpenCode users already have).

All pipeline components — validation, eval, description improvement, loop runner, aggregation, report generation, review server — are TypeScript modules in the plugin. ~256kB unpacked on npm.

Dependency tree is minimal: the plugin only depends on @opencode-ai/plugin (peer dependency).

Decision 3: Static HTML → HTTP Review Server

Original: Python script generates a static HTML file and opens it in the browser.

generate_review.py --workspace /path/to/workspace
# Opens /path/to/workspace/review.html in browser

New: Plugin starts a local HTTP server that serves an interactive eval viewer.

skill_serve_review with:
  workspace: /path/to/workspace
  skillName: "my-skill"

The HTTP server approach has advantages:

Real-time updates when new eval results come in
Interactive review with save buttons that write feedback back to files
Previous/next navigation between eval cases
Benchmark tab with quantitative metrics
No file management — just open localhost:PORT

The server can also generate static HTML for headless environments:

skill_export_static_review with:
  workspace: /path/to/workspace
  outputPath: /path/to/report.html

Decision 4: Subagents → Task Tool

Original: Claude Code's built-in subagent concept, where the skill directly spawns sub-agents.

New: OpenCode's Task tool with general and explore subagent types. The SKILL.md instructs the agent to spawn tasks for:

Running eval cases (with-skill and baseline)
Grading assertions against outputs
Analyzing benchmark results
Blind A/B comparison between skill versions

The agent orchestrates these tasks and synthesizes their results.

Decision 5: Staging Outside the Repo

Original: Evals and benchmarks run alongside the skill in the same directory.

New: Draft skills and eval artifacts go to the system temp directory:

/tmp/opencode-skills/<skill-name>/           # Staged skill
/tmp/opencode-skills/<skill-name>-workspace/  # Eval artifacts

Only the final validated skill gets installed to:

Project: .opencode/skills/<skill-name>/
Global: ~/.config/opencode/skills/<skill-name>/

This keeps the user's repository clean during skill development. Evals create a lot of artifacts (outputs, timing data, grading results, benchmark files) that you don't want mixed into your project.

Decision 6: Strict Review Workflow

Added a "review workflow guard" that enforces paired comparison data by default:

skill_serve_review and skill_export_static_review require each eval directory to include both with_skill AND baseline (without_skill or old_skill)
If pairs are missing, the tools fail fast with a clear list of what's missing
Override with allowPartial: true only when intentionally reviewing incomplete data

This prevents a common mistake: reviewing eval results without a baseline comparison, which makes it impossible to judge whether the skill actually improved anything.

What I Learned

1. Skills are software

They need testing, not just writing. The eval-driven approach catches issues you'd never find manually — like a description that triggers on 80% of relevant queries but also fires on 30% of irrelevant ones.

2. Description optimization matters more than skill content

The description field is the primary triggering mechanism. A well-optimized description on an average skill outperforms a poor description on a perfect skill. This is counterintuitive but matches the data.

3. Train/test splits prevent overfitting

Same lesson as ML hyperparameter tuning. If you only evaluate on the queries you optimize for, descriptions become overfit. The 60/40 split keeps you honest about generalization.

4. Human-in-the-loop review is essential

Automation measures triggering accuracy, but humans judge output quality. The visual eval viewer puts outputs side by side so you can see whether the skill produces genuinely useful results, not just correctly-triggered results.

5. Plugin architecture enables composition

Having eval, benchmarking, and review as separate tool calls (instead of a monolithic script) means the agent can interleave them with other work. It can ask the user a question between iterations, read relevant files during eval, or skip steps the user doesn't need.

Try It

npx opencode-skill-creator install --global

Apache 2.0, free, open source. Works with any of OpenCode's supported models.

GitHub: https://github.com/antongulin/opencode-skill-creator
npm: https://www.npmjs.com/package/opencode-skill-creator

opencode-skill-creator is free and open source (Apache 2.0). Star it on GitHub. Install: npx opencode-skill-creator install --global

I Ate My Own Dog Food: How I Benchmarked AI Skills and Proved Eval-Driven Development Works

Anton Gulin — Wed, 15 Apr 2026 18:20:22 +0000

I built a tool to test AI skills. Then I used it on my own project. The benchmarks shocked even me.

As a QA architect, I've spent my career building systems that verify software works correctly. At Apple, we tested everything — every interaction, every edge case, every regression. At CooperVision, I built a Playwright/TypeScript framework from scratch that grew test coverage by 300%.

So when I started working with AI agent skills, I noticed something: nobody was testing them.

You write a SKILL.md file. You try it manually once. Maybe it works for your prompt. You ship it.

There's no automated test suite. No regression testing. No CI pipeline that catches when a description change breaks triggering.

That's a QA problem. I built opencode-skill-creator to solve it.

Then I dogfooded it on a real project. Here's what happened.

The Project: AdLoop Skills for Google Ads

AdLoop is a Google Ads MCP (Model Context Protocol) integration — it connects AI agents to Google Ads and GA4 data through a set of tools.

I created 4 skills for AdLoop using opencode-skill-creator:

adloop-planning — Keyword research, competition analysis, and budget forecasting
adloop-read — Performance analysis, campaign reporting, and conversion diagnostics
adloop-write — Campaign creation, ad management, keyword bidding, and budget changes (spends real money)
adloop-tracking — GA4 event validation, conversion tracking diagnosis, and code generation

The Benchmark Results

opencode-skill-creator's benchmark runs each skill through its eval queries in two configurations:

With skill loaded — the AI has full domain knowledge, safety rules, and orchestration patterns
Without skill — the AI only has bare MCP tool names and descriptions

Skill	Evals	With Skill	Without Skill	Improvement
adloop-write	8	100%	17%	+83pp
adloop-planning	6	100%	21%	+79pp
adloop-read	8	100%	27%	+73pp
adloop-tracking	6	100%	33%	+67pp

But the raw numbers only tell part of the story. The failures without skills aren't just wrong answers — they're dangerous actions.

The Scariest Failure: Real Money at Stake

adloop-write manages campaigns, ads, keywords, and budgets — operations that spend real money. Without the skill:

Added BROAD match keywords to MANUAL_CPC campaigns — the #1 cause of wasted ad spend
Set budget above safety caps ($100 when max is $50) — no guardrail
Deleted campaigns irreversibly without warning — no confirmation, no pause alternative
Batched multiple changes in one call — bypassing review steps

This isn't about "better answers." This is about preventing real financial harm.

GDPR ≠ Broken Tracking

A common scenario: 500 clicks in Google Ads, 180 sessions in GA4. "Is my tracking broken?"

Without the skill, AI diagnosed this as a tracking issue and offered to investigate.

With the skill, AI recognized: "A 2.8:1 ratio is normal with GDPR consent banners. Google Ads counts all clicks. GA4 only counts consenting users. Your tracking is fine."

The #1 false positive in digital marketing analytics, prevented by domain knowledge in the skill.

Don't Trust Google Blindly

Without the skill, AI endorsed Google's recommendations at face value: "Raise budget" with zero conversions. "Add BROAD match" without Smart Bidding.

The skill explicitly states: "Google recommendations optimize for Google's revenue, not yours." It cross-references against conversion data first. The 73% improvement comes from teaching critical thinking, not compliance.

Why This Matters

The same AI model. The same tools. The same prompts. The only variable: whether the skill is loaded. The difference is 67–83 percentage points.

Skills do three things bare tool access doesn't:

Inject domain expertise — GDPR mechanics, budget rules, competition levels
Enforce safety guardrails — budget caps, deletion warnings, one-change-at-a-time
Provide orchestration patterns — when to call which tool, in what order, with what validation

Try It Yourself

npx opencode-skill-creator install --global

Free, open source (Apache 2.0). Works with any of OpenCode's 300+ supported models. Pure TypeScript, zero Python dependency.

→ github.com/antongulin/opencode-skill-creator

Skills are software. Software should be tested.

Anton Gulin is the AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET, current Lead Software Engineer in Test. Find him at anton.qa or on LinkedIn.

Eval-Driven Development for AI Agent Skills

Anton Gulin — Wed, 15 Apr 2026 17:58:00 +0000

The Problem with Writing Skills by Hand

You've written a skill for your AI coding agent. It's got clear instructions, proper formatting, a good description. You test it in a session — it works. Ship it, right?

Not so fast.

Skills trigger based on their description field — a 1-2 sentence summary in the SKILL.md frontmatter. And here's the thing: descriptions that seem crystal clear to humans often trigger wrong. Too specific, and the skill never activates when it should. Too broad, and it fires on unrelated prompts.

The result: skills that feel right in theory but fail unpredictably in practice. And there's no systematic way to measure whether a skill is getting better or worse across iterations.

This is the same problem software engineering solved decades ago with automated testing. Skills are software. They need testing too.

What Is Eval-Driven Development?

Eval-driven development is the practice of:

Writing test cases that define expected behavior
Running those tests automatically to measure actual vs. expected outcomes
Using the results to improve iteratively, with quantifiable evidence

For AI agent skills, this means:

Generating test prompts (should-trigger and should-not-trigger queries)
Running each prompt with and without the skill
Comparing outputs to see if the skill actually improves results
Optimizing the description so the skill triggers on the right prompts

The Skill Creation Lifecycle

opencode-skill-creator implements eval-driven development as a structured lifecycle:

Create → Evaluate → Optimize → Benchmark → Install
   ↑                                      |
   └──────────── Iterate ─────────────────┘

1. Create

Start with an intake interview. The skill-creator asks 3-5 targeted questions:

What should this skill enable the agent to do?
When should it trigger?
What output format is expected?
What workflow steps must be preserved exactly?

This captures intent before writing any code.

2. Evaluate

Auto-generate eval test sets — realistic prompts categorized as should-trigger or should-not-trigger. Run each test case twice:

With skill: The agent has the skill loaded
Without skill: The agent runs without it (baseline)

This measures whether the skill actually improves the output for relevant prompts.

3. Optimize

The description optimization loop treats triggering accuracy as a search problem:

For each iteration (up to 5):
  1. Evaluate current description on train set (60%)
  2. Analyze failure patterns
  3. LLM proposes improved description
  4. Evaluate on both train AND test (40%) sets
  5. Select best description by test score

The 60/40 train/test split prevents overfitting. An description that works perfectly on train queries but fails on held-out test queries is overfit — it's memorized specific prompts rather than learning the general pattern.

4. Benchmark

Run the full eval suite across multiple iterations with variance analysis. This answers:

Is the skill getting consistently better?
Are there eval cases where the skill never triggers correctly?
How much variance is there across runs?

The benchmark includes:

Pass rates (with-skill vs. baseline)
Timing data (tokens, duration)
Mean ± standard deviation for each metric

5. Install

Install the final validated skill to project-level (.opencode/skills/) or global (~/.config/opencode/skills/). Only the final version gets installed — eval artifacts stay in the staging directory.

Why This Works

Skills are software

They have inputs (prompts), outputs (agent behavior), and a triggering mechanism (the description). Just like any software, they need testing.

Manual testing doesn't scale

You can test a skill manually in a session, but that's one prompt, one run, no measurement. Eval-driven development gives you 20+ test cases, multiple runs per case, and quantitative metrics.

Description optimization is more impactful than skill content

The description field is the primary triggering mechanism. A perfectly-written skill with a poor description won't trigger. An average skill with an optimized description will trigger reliably. The optimization loop focuses effort where it matters most.

Train/test splits prevent overfitting

If you only test on the same queries you optimize for, descriptions become overfit — they work on those specific prompts but fail on real-world usage. The 60/40 split keeps you honest.

Human review catches what automation misses

The visual eval viewer puts outputs side by side so you can see with your own eyes whether the skill is producing good results. Quantitative metrics tell you if it's triggering correctly; human review tells you if the output is actually useful.

Getting Started

npx opencode-skill-creator install --global

Then ask OpenCode to create or improve a skill. The eval-driven workflow starts automatically.

Apache 2.0, free, open source. Works with any of OpenCode's supported models.

GitHub: https://github.com/antongulin/opencode-skill-creator

opencode-skill-creator is free and open source (Apache 2.0). Star it on GitHub. Install: npx opencode-skill-creator install --global

How to Create Custom OpenCode Skills (Step-by-Step Guide)

Anton Gulin — Sun, 12 Apr 2026 18:52:31 +0000

Why Custom Skills Matter

Out-of-the-box AI coding agents are powerful, but they don't know your team's conventions, your deployment process, or your documentation style. Skills let you encode that knowledge so the agent follows your workflows every time.

But creating skills has been guesswork. You write a SKILL.md file, test it manually in a session, maybe tweak the description, and hope it works. There's no feedback loop, no measurement, no way to know if a change actually improved things.

opencode-skill-creator changes this by providing a structured workflow for the full skill lifecycle: create, evaluate, optimize, benchmark, and install.

Prerequisites

OpenCode installed and configured
Node.js 18+ (for the npm package)
5 minutes

Step 1: Install

One command:

npx opencode-skill-creator install --global

This adds the plugin to your global OpenCode config. Restart OpenCode to activate it.

Verify the install:

ls ~/.config/opencode/skills/skill-creator/SKILL.md

Then ask OpenCode: Create a skill that helps with Docker compose files

You should see it use the skill-creator workflow and tools.

Step 2: Describe What You Want

The skill-creator starts with an intake interview. It asks 3-5 targeted questions about what your skill should do:

What should this skill enable OpenCode to do end-to-end?
When should this skill trigger?
What output format and quality bar are expected?
What workflow steps must be preserved vs. where can the agent improvise?

Don't skip this. The interview captures your intent before any code is written. Think of it as shadowing a teammate — you're the domain expert, the agent is the new hire learning your workflow.

Step 3: Review the Skill Draft

Based on your interview, the skill-creator produces a draft SKILL.md with:

Proper YAML frontmatter (name and description)
Markdown instructions for the agent
Optional supporting files (references, agents, templates)

The draft goes to a staging directory (outside your repo) so your project stays clean:

/tmp/opencode-skills/your-skill-name/
├── SKILL.md
├── agents/
├── references/
└── templates/

Review this draft. Make sure the description is accurate (it's the primary triggering mechanism) and the instructions reflect your actual workflow.

Step 4: Generate Eval Test Cases

The skill-creator automatically generates test cases — realistic prompts that an OpenCode user would actually type:

{
  "skill_name": "docker-compose",
  "evals": [
    {
      "id": 1,
      "prompt": "help me set up a compose file for my Node app with a Postgres database",
      "expected_output": "Skill triggers and provides Docker compose guidance",
      "should_trigger": true
    },
    {
      "id": 2,
      "prompt": "explain how Kubernetes deployments work",
      "should_trigger": false
    }
  ]
}

Good eval queries are realistic and specific — not abstract like "help with containers" but concrete like "ok so my boss just sent me this xlsx file (its in my downloads, called something like 'Q4 sales final FINAL v2.xlsx')..."

Review the eval set. Add or modify test cases that reflect your real usage.

Step 5: Run Evals

The eval system runs each test case twice — once with the skill and once without (baseline). This measures whether the skill actually improves the output.

For each test case:

OpenCode runs with the skill loaded
OpenCode runs without the skill
Both outputs are saved for comparison

Timing data (tokens used, duration) is captured automatically.

Step 6: Review Results Visually

The skill-creator launches an HTML eval viewer:

Call skill_serve_review with:
  workspace: /tmp/opencode-skills/your-skill-name-workspace/iteration-1
  skillName: "your-skill-name"

The viewer shows:

Outputs tab: Each test case with with-skill and without-skill outputs side by side
Benchmark tab: Quantitative metrics — pass rates, timing, token usage
Feedback fields: Leave comments on each test case

Review the outputs. Give specific feedback on what's working and what's not. Empty feedback means "looks good."

Step 7: Iterate and Improve

Based on your feedback, the skill-creator improves the skill:

Applies your feedback
Reruns all test cases (new iteration)
Launches the reviewer with previous iteration for comparison
You review again

Repeat until you're satisfied or feedback is all empty.

Step 8: Optimize the Description

Even with perfect skill instructions, the skill won't trigger correctly if the description field isn't right. The description is what OpenCode reads to decide whether to load your skill.

The optimization loop:

Generates 20 eval queries (should-trigger and should-not-trigger)
Splits them 60/40 into train/test
Evaluates each query 3 times for statistical reliability
Analyzes failure patterns
LLM proposes improved descriptions
Re-evaluates on both train and test
Selects the best description by test score
Repeats up to 5 iterations

# Tell OpenCode:
"Optimize the description of my docker-compose skill"

This takes some time — grab a coffee while it runs.

Step 9: Install the Final Skill

Once you're satisfied with the skill and its description:

Project-level: .opencode/skills/your-skill-name/SKILL.md — available only in this project
Global: ~/.config/opencode/skills/your-skill-name/SKILL.md — available everywhere

# Project-level install
cp -r /tmp/opencode-skills/your-skill-name/ .opencode/skills/your-skill-name/

# Global install
cp -r /tmp/opencode-skills/your-skill-name/ ~/.config/opencode/skills/your-skill-name/

Only the final validated skill gets installed. All eval artifacts stay in the staging directory.

Real-World Example: Docker Compose Skill

Here's what the full workflow looks like in practice:

Ask OpenCode: "Create a skill that helps with Docker compose files"
Interview: The skill-creator asks about your conventions (multi-service vs. single container, development vs. production, preferred base images)
Draft: Produces a SKILL.md with Docker compose best practices, service configuration patterns, volume mount strategies
Eval: Generates test cases like "my api keeps crashing on startup, can you help me debug my compose file" (should trigger) and "what's the difference between Docker and Podman" (should not trigger)
Review: You look at the outputs, give feedback: "the skill should prioritize security configurations in production compose files"
Iterate: Improved skill draft, better outputs
Optimize: Description goes from "Help with Docker compose files" to something much more specific that triggers reliably
Install: Copy to ~/.config/opencode/skills/docker-compose/

Tips for Great Skills

Be specific in the intake interview: The more context you give, the better the draft
Don't skip evals: They catch triggering issues you'd never find manually
Use realistic test prompts: Write them the way you'd actually type them, typos and all
Iterate at least twice: First drafts are rarely perfect
Optimize the description: It's the #1 factor in whether your skill triggers correctly
Install globally for general skills, project-level for specific ones

Getting Started

npx opencode-skill-creator install --global

Then ask OpenCode to create a skill. That's it.

GitHub: https://github.com/antongulin/opencode-skill-creator
npm: https://www.npmjs.com/package/opencode-skill-creator

opencode-skill-creator is free and open source (Apache 2.0). Star it on GitHub. Install: npx opencode-skill-creator install --global