<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Anton Gulin</title>
    <description>The latest articles on Forem by Anton Gulin (@aiwithanton).</description>
    <link>https://forem.com/aiwithanton</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3872452%2F17f47297-ddc6-457c-9920-47c0dd1acd1b.png</url>
      <title>Forem: Anton Gulin</title>
      <link>https://forem.com/aiwithanton</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/aiwithanton"/>
    <language>en</language>
    <item>
      <title>I Ate My Own Dog Food: How I Benchmarked AI Skills and Proved Eval-Driven Development Works</title>
      <dc:creator>Anton Gulin</dc:creator>
      <pubDate>Wed, 15 Apr 2026 18:20:22 +0000</pubDate>
      <link>https://forem.com/aiwithanton/i-ate-my-own-dog-food-how-i-benchmarked-ai-skills-and-proved-eval-driven-development-works-c0l</link>
      <guid>https://forem.com/aiwithanton/i-ate-my-own-dog-food-how-i-benchmarked-ai-skills-and-proved-eval-driven-development-works-c0l</guid>
      <description>&lt;p&gt;I built a tool to test AI skills. Then I used it on my own project. The benchmarks shocked even me.&lt;/p&gt;

&lt;p&gt;As a QA architect, I've spent my career building systems that verify software works correctly. At Apple, we tested everything — every interaction, every edge case, every regression. At CooperVision, I built a Playwright/TypeScript framework from scratch that increased test coverage by 300%.&lt;/p&gt;

&lt;p&gt;So when I started working with AI agent skills, I noticed something: &lt;strong&gt;nobody was testing them.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You write a SKILL.md file. You try it manually once. Maybe it works for your prompt. You ship it.&lt;/p&gt;

&lt;p&gt;There's no automated test suite. No regression testing. No CI pipeline that catches when a description change breaks triggering.&lt;/p&gt;

&lt;p&gt;That's a QA problem. I built &lt;a href="https://github.com/antongulin/opencode-skill-creator" rel="noopener noreferrer"&gt;opencode-skill-creator&lt;/a&gt; to solve it.&lt;/p&gt;

&lt;p&gt;Then I dogfooded it on a real project. Here's what happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Project: AdLoop Skills for Google Ads
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/kLOsk/adloop" rel="noopener noreferrer"&gt;AdLoop&lt;/a&gt; is a Google Ads MCP (Model Context Protocol) integration — it connects AI agents to Google Ads and GA4 data through a set of tools.&lt;/p&gt;

&lt;p&gt;I created 4 skills for AdLoop using opencode-skill-creator:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;adloop-planning&lt;/strong&gt; — Keyword research, competition analysis, and budget forecasting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;adloop-read&lt;/strong&gt; — Performance analysis, campaign reporting, and conversion diagnostics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;adloop-write&lt;/strong&gt; — Campaign creation, ad management, keyword bidding, and budget changes (spends real money)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;adloop-tracking&lt;/strong&gt; — GA4 event validation, conversion tracking diagnosis, and code generation&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Benchmark Results
&lt;/h2&gt;

&lt;p&gt;opencode-skill-creator's benchmark runs each skill through its eval queries in two configurations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;With skill loaded&lt;/strong&gt; — the AI has full domain knowledge, safety rules, and orchestration patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Without skill&lt;/strong&gt; — the AI only has bare MCP tool names and descriptions&lt;/li&gt;
&lt;/ul&gt;
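&lt;p&gt;As a rough sketch of what the two configurations mean numerically (the function names here are illustrative, not opencode-skill-creator's actual API):&lt;/p&gt;

```typescript
// Hypothetical sketch, not the tool's real API: compute pass rates
// for the two configurations and the gap in percentage points.
type EvalResult = { passed: boolean };

function passRate(results: EvalResult[]): number {
  if (results.length === 0) return 0;
  const passed = results.filter((r) => r.passed).length;
  return passed / results.length;
}

function improvementPp(withSkillRate: number, withoutSkillRate: number): number {
  return Math.round((withSkillRate - withoutSkillRate) * 100);
}
```

&lt;p&gt;For adloop-write, &lt;code&gt;improvementPp(1.0, 0.17)&lt;/code&gt; returns 83, the +83pp in the table.&lt;/p&gt;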

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Skill&lt;/th&gt;
&lt;th&gt;Evals&lt;/th&gt;
&lt;th&gt;With Skill&lt;/th&gt;
&lt;th&gt;Without Skill&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;adloop-write&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;17%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+83pp&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;adloop-planning&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;21%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+79pp&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;adloop-read&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;27%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+73pp&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;adloop-tracking&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;33%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+67pp&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;But the raw numbers only tell part of the story. The &lt;em&gt;failures&lt;/em&gt; without skills aren't just wrong answers — they're dangerous actions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Scariest Failure: Real Money at Stake
&lt;/h2&gt;

&lt;p&gt;adloop-write manages campaigns, ads, keywords, and budgets — operations that &lt;strong&gt;spend real money&lt;/strong&gt;. Without the skill:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Added BROAD match keywords to MANUAL_CPC campaigns&lt;/strong&gt; — the #1 cause of wasted ad spend&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set budget above safety caps&lt;/strong&gt; ($100 when max is $50) — no guardrail&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deleted campaigns irreversibly without warning&lt;/strong&gt; — no confirmation, no pause alternative&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batched multiple changes in one call&lt;/strong&gt; — bypassing review steps&lt;/li&gt;
&lt;/ul&gt;
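&lt;p&gt;The budget-cap failure is the easiest one to express as a guardrail. A minimal sketch (the cap value and function name are illustrative, not AdLoop's actual implementation):&lt;/p&gt;

```typescript
// Illustrative guardrail, not AdLoop code: the skill states this rule
// in prose; here it is as an explicit check.
const MAX_DAILY_BUDGET_USD = 50; // cap from the eval scenario above

function validateBudget(proposedUsd: number): number {
  if (proposedUsd > MAX_DAILY_BUDGET_USD) {
    throw new Error(
      "Budget " + proposedUsd + " exceeds safety cap " + MAX_DAILY_BUDGET_USD
    );
  }
  return proposedUsd;
}
```

&lt;p&gt;Without the skill, nothing plays this role: a $100 request goes straight through.&lt;/p&gt;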

&lt;p&gt;This isn't about "better answers." This is about &lt;strong&gt;preventing real financial harm&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  GDPR ≠ Broken Tracking
&lt;/h2&gt;

&lt;p&gt;A common scenario: 500 clicks in Google Ads, 180 sessions in GA4. "Is my tracking broken?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without the skill&lt;/strong&gt;, AI diagnosed this as a tracking issue and offered to investigate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With the skill&lt;/strong&gt;, AI recognized: "A 2.8:1 ratio is normal with GDPR consent banners. Google Ads counts all clicks. GA4 only counts consenting users. Your tracking is fine."&lt;/p&gt;

&lt;p&gt;The #1 false positive in digital marketing analytics, prevented by domain knowledge in the skill.&lt;/p&gt;
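&lt;p&gt;The heuristic the skill encodes fits in a few lines (the 3:1 cutoff is an illustrative assumption; real consent rates vary by market and banner setup):&lt;/p&gt;

```typescript
// Sketch of the click-to-session diagnostic described above.
// The 3:1 threshold is an assumption, not a fixed industry constant.
function diagnoseClickGap(adClicks: number, ga4Sessions: number): string {
  const ratio = adClicks / ga4Sessions;
  // Google Ads counts every click; GA4 counts only consenting users,
  // so a moderate gap is expected under GDPR consent banners.
  if (ratio > 3.0) return "investigate";
  return "normal"; // 500 clicks / 180 sessions is about 2.8:1
}
```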

&lt;h2&gt;
  
  
  Don't Trust Google Blindly
&lt;/h2&gt;

&lt;p&gt;Without the skill, AI endorsed Google's recommendations at face value: "Raise budget" with zero conversions. "Add BROAD match" without Smart Bidding.&lt;/p&gt;

&lt;p&gt;The skill explicitly states: &lt;strong&gt;"Google recommendations optimize for Google's revenue, not yours."&lt;/strong&gt; It cross-references recommendations against conversion data first. The +73pp improvement comes from teaching critical thinking, not compliance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;The same AI model. The same tools. The same prompts. The only variable: whether the skill is loaded. The difference is 67–83 percentage points.&lt;/p&gt;

&lt;p&gt;Skills do three things bare tool access doesn't:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Inject domain expertise&lt;/strong&gt; — GDPR mechanics, budget rules, competition levels&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enforce safety guardrails&lt;/strong&gt; — budget caps, deletion warnings, one-change-at-a-time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provide orchestration patterns&lt;/strong&gt; — when to call which tool, in what order, with what validation&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx opencode-skill-creator &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--global&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Free, open source (Apache 2.0). Works with any of OpenCode's 300+ supported models. Pure TypeScript, zero Python dependency.&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://github.com/antongulin/opencode-skill-creator" rel="noopener noreferrer"&gt;github.com/antongulin/opencode-skill-creator&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Skills are software. Software should be tested.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Anton Gulin is the AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET, current Lead Software Engineer in Test. Find him at &lt;a href="https://anton.qa" rel="noopener noreferrer"&gt;anton.qa&lt;/a&gt; or on &lt;a href="https://linkedin.com/in/antongulin" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>qa</category>
      <category>opensource</category>
      <category>opencode</category>
    </item>
    <item>
      <title>Eval-Driven Development for AI Agent Skills</title>
      <dc:creator>Anton Gulin</dc:creator>
      <pubDate>Wed, 15 Apr 2026 17:58:00 +0000</pubDate>
      <link>https://forem.com/aiwithanton/eval-driven-development-for-ai-agent-skills-3jpg</link>
      <guid>https://forem.com/aiwithanton/eval-driven-development-for-ai-agent-skills-3jpg</guid>
      <description>&lt;h2&gt;
  
  
  The Problem with Writing Skills by Hand
&lt;/h2&gt;

&lt;p&gt;You've written a skill for your AI coding agent. It's got clear instructions, proper formatting, a good description. You test it in a session — it works. Ship it, right?&lt;/p&gt;

&lt;p&gt;Not so fast.&lt;/p&gt;

&lt;p&gt;Skills trigger based on their description field — a 1-2 sentence summary in the SKILL.md frontmatter. And here's the thing: descriptions that seem crystal clear to humans often cause the skill to trigger incorrectly. Too specific, and the skill never activates when it should. Too broad, and it fires on unrelated prompts.&lt;/p&gt;

&lt;p&gt;The result: skills that feel right in theory but fail unpredictably in practice. And there's no systematic way to measure whether a skill is getting better or worse across iterations.&lt;/p&gt;

&lt;p&gt;This is the same problem software engineering solved decades ago with automated testing. Skills are software. They need testing too.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Eval-Driven Development?
&lt;/h2&gt;

&lt;p&gt;Eval-driven development is the practice of:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Writing test cases&lt;/strong&gt; that define expected behavior&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Running those tests automatically&lt;/strong&gt; to measure actual vs. expected outcomes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Using the results to improve&lt;/strong&gt; iteratively, with quantifiable evidence&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For AI agent skills, this means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generating test prompts (should-trigger and should-not-trigger queries)&lt;/li&gt;
&lt;li&gt;Running each prompt with and without the skill&lt;/li&gt;
&lt;li&gt;Comparing outputs to see if the skill actually improves results&lt;/li&gt;
&lt;li&gt;Optimizing the description so the skill triggers on the right prompts&lt;/li&gt;
&lt;/ul&gt;
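&lt;p&gt;Concretely, a should-trigger/should-not-trigger pair can be typed like this (the shape is a sketch, not necessarily the tool's exact schema):&lt;/p&gt;

```typescript
// Illustrative eval-case shape; field names are assumptions.
interface EvalCase {
  id: number;
  prompt: string;
  shouldTrigger: boolean; // should-trigger vs. should-not-trigger
}

const evals: EvalCase[] = [
  { id: 1, prompt: "help me set up a compose file for my Node app", shouldTrigger: true },
  { id: 2, prompt: "explain how Kubernetes deployments work", shouldTrigger: false },
];
```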

&lt;h2&gt;
  
  
  The Skill Creation Lifecycle
&lt;/h2&gt;

&lt;p&gt;opencode-skill-creator implements eval-driven development as a structured lifecycle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Create → Evaluate → Optimize → Benchmark → Install
   ↑                                      |
   └──────────── Iterate ─────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  1. Create
&lt;/h3&gt;

&lt;p&gt;Start with an intake interview. The skill-creator asks 3-5 targeted questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What should this skill enable the agent to do?&lt;/li&gt;
&lt;li&gt;When should it trigger?&lt;/li&gt;
&lt;li&gt;What output format is expected?&lt;/li&gt;
&lt;li&gt;What workflow steps must be preserved exactly?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This captures intent before writing any code.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Evaluate
&lt;/h3&gt;

&lt;p&gt;Auto-generate eval test sets — realistic prompts categorized as should-trigger or should-not-trigger. Run each test case twice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;With skill&lt;/strong&gt;: The agent has the skill loaded&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Without skill&lt;/strong&gt;: The agent runs without it (baseline)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This measures whether the skill actually improves the output for relevant prompts.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Optimize
&lt;/h3&gt;

&lt;p&gt;The description optimization loop treats triggering accuracy as a search problem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;For each iteration (up to 5):
  1. Evaluate current description on train set (60%)
  2. Analyze failure patterns
  3. LLM proposes improved description
  4. Evaluate on both train AND test (40%) sets
  5. Select best description by test score
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 60/40 train/test split prevents overfitting. A description that works perfectly on train queries but fails on held-out test queries is overfit — it has memorized specific prompts rather than learned the general pattern.&lt;/p&gt;
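&lt;p&gt;A minimal version of that split (deterministic here; the actual tool may shuffle queries before splitting):&lt;/p&gt;

```typescript
// Minimal 60/40 split sketch; assumes queries are already shuffled.
function trainTestSplit(queries: string[]): { train: string[]; test: string[] } {
  const cut = Math.floor(queries.length * 0.6);
  return { train: queries.slice(0, cut), test: queries.slice(cut) };
}
```

&lt;p&gt;With 20 queries, for example, this yields 12 train and 8 held-out test cases.&lt;/p&gt;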

&lt;h3&gt;
  
  
  4. Benchmark
&lt;/h3&gt;

&lt;p&gt;Run the full eval suite across multiple iterations with variance analysis. This answers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is the skill getting consistently better?&lt;/li&gt;
&lt;li&gt;Are there eval cases where the skill never triggers correctly?&lt;/li&gt;
&lt;li&gt;How much variance is there across runs?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The benchmark includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pass rates (with-skill vs. baseline)&lt;/li&gt;
&lt;li&gt;Timing data (tokens, duration)&lt;/li&gt;
&lt;li&gt;Mean ± standard deviation for each metric&lt;/li&gt;
&lt;/ul&gt;
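&lt;p&gt;"Mean ± standard deviation" per metric can be sketched as follows (whether the tool uses population or sample standard deviation is an assumption):&lt;/p&gt;

```typescript
// Mean and population standard deviation across benchmark runs.
function meanStd(values: number[]): { mean: number; std: number } {
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const variance =
    values.reduce((acc, v) => acc + (v - mean) ** 2, 0) / values.length;
  return { mean, std: Math.sqrt(variance) };
}
```

&lt;p&gt;A near-zero standard deviation across runs is what "consistently better" looks like in the numbers.&lt;/p&gt;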

&lt;h3&gt;
  
  
  5. Install
&lt;/h3&gt;

&lt;p&gt;Install the final validated skill to project-level (&lt;code&gt;.opencode/skills/&lt;/code&gt;) or global (&lt;code&gt;~/.config/opencode/skills/&lt;/code&gt;). Only the final version gets installed — eval artifacts stay in the staging directory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Skills are software
&lt;/h3&gt;

&lt;p&gt;They have inputs (prompts), outputs (agent behavior), and a triggering mechanism (the description). Just like any software, they need testing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Manual testing doesn't scale
&lt;/h3&gt;

&lt;p&gt;You can test a skill manually in a session, but that's one prompt, one run, no measurement. Eval-driven development gives you 20+ test cases, multiple runs per case, and quantitative metrics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Description optimization is more impactful than skill content
&lt;/h3&gt;

&lt;p&gt;The description field is the primary triggering mechanism. A perfectly written skill with a poor description won't trigger. An average skill with an optimized description will trigger reliably. The optimization loop focuses effort where it matters most.&lt;/p&gt;

&lt;h3&gt;
  
  
  Train/test splits prevent overfitting
&lt;/h3&gt;

&lt;p&gt;If you only test on the same queries you optimize for, descriptions become overfit — they work on those specific prompts but fail on real-world usage. The 60/40 split keeps you honest.&lt;/p&gt;

&lt;h3&gt;
  
  
  Human review catches what automation misses
&lt;/h3&gt;

&lt;p&gt;The visual eval viewer puts outputs side by side so you can see with your own eyes whether the skill is producing good results. Quantitative metrics tell you if it's triggering correctly; human review tells you if the output is actually useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx opencode-skill-creator &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--global&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then ask OpenCode to create or improve a skill. The eval-driven workflow starts automatically.&lt;/p&gt;

&lt;p&gt;Apache 2.0, free, open source. Works with any of OpenCode's supported models.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/antongulin/opencode-skill-creator" rel="noopener noreferrer"&gt;https://github.com/antongulin/opencode-skill-creator&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;opencode-skill-creator is free and open source (Apache 2.0). Star it on GitHub. Install: &lt;code&gt;npx opencode-skill-creator install --global&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>opencode</category>
      <category>typescript</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How to Create Custom OpenCode Skills (Step-by-Step Guide)</title>
      <dc:creator>Anton Gulin</dc:creator>
      <pubDate>Sun, 12 Apr 2026 18:52:31 +0000</pubDate>
      <link>https://forem.com/aiwithanton/how-to-create-custom-opencode-skills-step-by-step-guide-4ijd</link>
      <guid>https://forem.com/aiwithanton/how-to-create-custom-opencode-skills-step-by-step-guide-4ijd</guid>
      <description>&lt;h2&gt;
  
  
  Why Custom Skills Matter
&lt;/h2&gt;

&lt;p&gt;Out-of-the-box AI coding agents are powerful, but they don't know your team's conventions, your deployment process, or your documentation style. Skills let you encode that knowledge so the agent follows your workflows every time.&lt;/p&gt;

&lt;p&gt;But creating skills has been guesswork. You write a SKILL.md file, test it manually in a session, maybe tweak the description, and hope it works. There's no feedback loop, no measurement, no way to know if a change actually improved things.&lt;/p&gt;

&lt;p&gt;opencode-skill-creator changes this by providing a structured workflow for the full skill lifecycle: create, evaluate, optimize, benchmark, and install.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;OpenCode installed and configured&lt;/li&gt;
&lt;li&gt;Node.js 18+ (for the npm package)&lt;/li&gt;
&lt;li&gt;5 minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 1: Install
&lt;/h2&gt;

&lt;p&gt;One command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx opencode-skill-creator &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--global&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This adds the plugin to your global OpenCode config. Restart OpenCode to activate it.&lt;/p&gt;

&lt;p&gt;Verify the install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; ~/.config/opencode/skills/skill-creator/SKILL.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then ask OpenCode: &lt;code&gt;Create a skill that helps with Docker compose files&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You should see it use the skill-creator workflow and tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Describe What You Want
&lt;/h2&gt;

&lt;p&gt;The skill-creator starts with an intake interview. It asks 3-5 targeted questions about what your skill should do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What should this skill enable OpenCode to do end-to-end?&lt;/li&gt;
&lt;li&gt;When should this skill trigger?&lt;/li&gt;
&lt;li&gt;What output format and quality bar are expected?&lt;/li&gt;
&lt;li&gt;What workflow steps must be preserved vs. where can the agent improvise?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Don't skip this. The interview captures your intent before any code is written. Think of it as shadowing a teammate — you're the domain expert, the agent is the new hire learning your workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Review the Skill Draft
&lt;/h2&gt;

&lt;p&gt;Based on your interview, the skill-creator produces a draft SKILL.md with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Proper YAML frontmatter (name and description)&lt;/li&gt;
&lt;li&gt;Markdown instructions for the agent&lt;/li&gt;
&lt;li&gt;Optional supporting files (references, agents, templates)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The draft goes to a staging directory (outside your repo) so your project stays clean:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/tmp/opencode-skills/your-skill-name/
├── SKILL.md
├── agents/
├── references/
└── templates/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Review this draft. Make sure the description is accurate (it's the primary triggering mechanism) and the instructions reflect your actual workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Generate Eval Test Cases
&lt;/h2&gt;

&lt;p&gt;The skill-creator automatically generates test cases — realistic prompts that an OpenCode user would actually type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"skill_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"docker-compose"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"evals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"help me set up a compose file for my Node app with a Postgres database"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"expected_output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Skill triggers and provides Docker compose guidance"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"should_trigger"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"explain how Kubernetes deployments work"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"should_trigger"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Good eval queries are realistic and specific — not abstract like "help with containers" but concrete like "ok so my boss just sent me this xlsx file (its in my downloads, called something like 'Q4 sales final FINAL v2.xlsx')..."&lt;/p&gt;

&lt;p&gt;Review the eval set. Add or modify test cases that reflect your real usage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Run Evals
&lt;/h2&gt;

&lt;p&gt;The eval system runs each test case twice — once with the skill and once without (baseline). This measures whether the skill actually improves the output.&lt;/p&gt;

&lt;p&gt;For each test case:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;OpenCode runs with the skill loaded&lt;/li&gt;
&lt;li&gt;OpenCode runs without the skill&lt;/li&gt;
&lt;li&gt;Both outputs are saved for comparison&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Timing data (tokens used, duration) is captured automatically.&lt;/p&gt;
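&lt;p&gt;One way to picture what gets saved per run (the field names are assumptions, not the tool's actual storage format):&lt;/p&gt;

```typescript
// Illustrative per-run record; field names are assumptions.
interface RunRecord {
  withSkill: boolean;
  output: string;
  durationMs: number;
  tokensUsed: number;
}

function record(withSkill: boolean, output: string, startedAtMs: number, tokens: number): RunRecord {
  return {
    withSkill,
    output,
    durationMs: Date.now() - startedAtMs,
    tokensUsed: tokens,
  };
}
```

&lt;p&gt;Each test case ends up with one such record per configuration, which is what the viewer compares side by side.&lt;/p&gt;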

&lt;h2&gt;
  
  
  Step 6: Review Results Visually
&lt;/h2&gt;

&lt;p&gt;The skill-creator launches an HTML eval viewer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Call skill_serve_review with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;workspace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/tmp/opencode-skills/your-skill-name-workspace/iteration-1&lt;/span&gt;
  &lt;span class="na"&gt;skillName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-skill-name"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The viewer shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Outputs tab&lt;/strong&gt;: Each test case with with-skill and without-skill outputs side by side&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benchmark tab&lt;/strong&gt;: Quantitative metrics — pass rates, timing, token usage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feedback fields&lt;/strong&gt;: Leave comments on each test case&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Review the outputs. Give specific feedback on what's working and what's not. Empty feedback means "looks good."&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 7: Iterate and Improve
&lt;/h2&gt;

&lt;p&gt;Based on your feedback, the skill-creator improves the skill:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Applies your feedback&lt;/li&gt;
&lt;li&gt;Reruns all test cases (new iteration)&lt;/li&gt;
&lt;li&gt;Launches the reviewer with previous iteration for comparison&lt;/li&gt;
&lt;li&gt;You review again&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Repeat until you're satisfied or feedback is all empty.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 8: Optimize the Description
&lt;/h2&gt;

&lt;p&gt;Even with perfect skill instructions, the skill won't trigger correctly if the description field isn't right. The description is what OpenCode reads to decide whether to load your skill.&lt;/p&gt;

&lt;p&gt;The optimization loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generates 20 eval queries (should-trigger and should-not-trigger)&lt;/li&gt;
&lt;li&gt;Splits them 60/40 into train/test&lt;/li&gt;
&lt;li&gt;Evaluates each query 3 times for statistical reliability&lt;/li&gt;
&lt;li&gt;Analyzes failure patterns&lt;/li&gt;
&lt;li&gt;LLM proposes improved descriptions&lt;/li&gt;
&lt;li&gt;Re-evaluates on both train and test&lt;/li&gt;
&lt;li&gt;Selects the best description by test score&lt;/li&gt;
&lt;li&gt;Repeats up to 5 iterations
&lt;/li&gt;
&lt;/ol&gt;
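&lt;p&gt;Step 3's "3 times for statistical reliability" implies some aggregation across repeats. Majority voting is one plausible scheme (an assumption on my part, not confirmed tool behavior):&lt;/p&gt;

```typescript
// Majority vote across repeated runs of one query; this aggregation
// scheme is an assumption about how repeats are combined.
function majorityPass(runs: boolean[]): boolean {
  const passes = runs.filter((r) => r).length;
  return passes * 2 > runs.length;
}
```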

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Tell OpenCode:&lt;/span&gt;
&lt;span class="s2"&gt;"Optimize the description of my docker-compose skill"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This takes some time — grab a coffee while it runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 9: Install the Final Skill
&lt;/h2&gt;

&lt;p&gt;Once you're satisfied with the skill and its description:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Project-level&lt;/strong&gt;: &lt;code&gt;.opencode/skills/your-skill-name/SKILL.md&lt;/code&gt; — available only in this project&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Global&lt;/strong&gt;: &lt;code&gt;~/.config/opencode/skills/your-skill-name/SKILL.md&lt;/code&gt; — available everywhere
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Project-level install&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; /tmp/opencode-skills/your-skill-name/ .opencode/skills/your-skill-name/

&lt;span class="c"&gt;# Global install&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; /tmp/opencode-skills/your-skill-name/ ~/.config/opencode/skills/your-skill-name/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only the final validated skill gets installed. All eval artifacts stay in the staging directory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Example: Docker Compose Skill
&lt;/h2&gt;

&lt;p&gt;Here's what the full workflow looks like in practice:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ask OpenCode&lt;/strong&gt;: "Create a skill that helps with Docker compose files"&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Interview&lt;/strong&gt;: The skill-creator asks about your conventions (multi-service vs. single container, development vs. production, preferred base images)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Draft&lt;/strong&gt;: Produces a SKILL.md with Docker compose best practices, service configuration patterns, volume mount strategies&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Eval&lt;/strong&gt;: Generates test cases like "my api keeps crashing on startup, can you help me debug my compose file" (should trigger) and "what's the difference between Docker and Podman" (should not trigger)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Review&lt;/strong&gt;: You look at the outputs, give feedback: "the skill should prioritize security configurations in production compose files"&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Iterate&lt;/strong&gt;: Improved skill draft, better outputs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Optimize&lt;/strong&gt;: Description goes from "Help with Docker compose files" to something much more specific that triggers reliably&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Install&lt;/strong&gt;: Copy to &lt;code&gt;~/.config/opencode/skills/docker-compose/&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Tips for Great Skills
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Be specific in the intake interview&lt;/strong&gt;: The more context you give, the better the draft&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't skip evals&lt;/strong&gt;: They catch triggering issues you'd never find manually&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use realistic test prompts&lt;/strong&gt;: Write them the way you'd actually type them, typos and all&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iterate at least twice&lt;/strong&gt;: First drafts are rarely perfect&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimize the description&lt;/strong&gt;: It's the #1 factor in whether your skill triggers correctly&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Install globally for general skills, project-level for specific ones&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx opencode-skill-creator &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--global&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then ask OpenCode to create a skill. That's it.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/antongulin/opencode-skill-creator" rel="noopener noreferrer"&gt;https://github.com/antongulin/opencode-skill-creator&lt;/a&gt;&lt;br&gt;
npm: &lt;a href="https://www.npmjs.com/package/opencode-skill-creator" rel="noopener noreferrer"&gt;https://www.npmjs.com/package/opencode-skill-creator&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;opencode-skill-creator is free and open source (Apache 2.0). Star it on GitHub. Install: &lt;code&gt;npx opencode-skill-creator install --global&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>typescript</category>
      <category>ai</category>
      <category>opencode</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
