<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Stefan Broenner</title>
    <description>The latest articles on Forem by Stefan Broenner (@sbroenne).</description>
    <link>https://forem.com/sbroenne</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3758524%2F817515ae-dac8-4fbe-ae02-e1e59922895c.jpg</url>
      <title>Forem: Stefan Broenner</title>
      <link>https://forem.com/sbroenne</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sbroenne"/>
    <language>en</language>
    <item>
      <title>I Gave AI Agents Real Excel. They Did Not Use It Like I Expected - Proven By 90 Days of Telemetry.</title>
      <dc:creator>Stefan Broenner</dc:creator>
      <pubDate>Thu, 30 Apr 2026 06:24:09 +0000</pubDate>
      <link>https://forem.com/sbroenne/i-gave-ai-agents-real-excel-they-did-not-use-it-like-i-expected-proven-by-90-days-of-telemetry-4m78</link>
      <guid>https://forem.com/sbroenne/i-gave-ai-agents-real-excel-they-did-not-use-it-like-i-expected-proven-by-90-days-of-telemetry-4m78</guid>
      <description>&lt;p&gt;&lt;strong&gt;90 days of telemetry from an open-source MCP server that drives the actual Excel desktop app. The numbers were not where I thought they would be.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Half a year ago I asked a simple question: why can AI agents write a React app from scratch but choke on &lt;code&gt;Revenue Model v27 final final.xlsx&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;The answer turned out to be boring and important.&lt;/p&gt;

&lt;p&gt;Agents had spreadsheet &lt;em&gt;libraries&lt;/em&gt;. They did not have &lt;strong&gt;Excel&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So I built &lt;a href="https://github.com/sbroenne/mcp-server-excel" rel="noopener noreferrer"&gt;Excel MCP Server&lt;/a&gt;: an open-source MCP server that drives the &lt;strong&gt;real Excel desktop application&lt;/strong&gt; through COM automation. Not a &lt;code&gt;.xlsx&lt;/code&gt; parser. Not Open XML. The actual app, with its calculation engine, VBA, Power Query, pivot caches, and all the quirks that make a workbook a workbook.&lt;/p&gt;

&lt;h2&gt;
  
  
  The "just use openpyxl" problem
&lt;/h2&gt;

&lt;p&gt;Every time I talk about this, someone says "why not use a file-based Excel library?"&lt;/p&gt;

&lt;p&gt;For generating a &lt;code&gt;.xlsx&lt;/code&gt; from a server, those libraries are great. Linux containers, no Office install, fast.&lt;/p&gt;

&lt;p&gt;They are not Excel.&lt;/p&gt;

&lt;p&gt;They do not run Excel's calculation engine. They do not execute VBA. They do not refresh Power Query. They do not touch the COM object model. They cannot tell you whether the report &lt;strong&gt;looks right&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For generating spreadsheets, that is fine. For automating workbooks that already run a business, it is the difference between "I edited a file Excel can open" and "I worked inside Excel."&lt;/p&gt;

&lt;p&gt;Excel MCP Server is for the second case. The workbook is treated as an application, not a zip of XML.&lt;/p&gt;

&lt;h2&gt;
  
  
  The architecture, in one diagram
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AI assistant
  -&amp;gt; MCP protocol
  -&amp;gt; Excel MCP Server  (in-process MCP, or named-pipe CLI daemon)
  -&amp;gt; Excel COM automation
  -&amp;gt; Real Excel.exe with the workbook open
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That last line is the entire thesis. 90 days later I have telemetry. The numbers told me things I did not expect.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I expected vs what 90 days of telemetry showed
&lt;/h2&gt;

&lt;p&gt;90 days, anonymous, opt-in: &lt;strong&gt;2,908 users, 86,090 sessions, 488,548 tool invocations&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Growth was sharper than I expected. Weekly users went from &lt;strong&gt;84 to 1,209&lt;/strong&gt;. Weekly sessions from &lt;strong&gt;310 to 11,769&lt;/strong&gt;. February: 800 monthly users. March: 1,682. April was on track to clear 2,000 before the month closed.&lt;/p&gt;

&lt;p&gt;Fine. Growth charts are nice. Here is the part that actually surprised me.&lt;/p&gt;

&lt;h3&gt;
  
  
  Surprise 1: agents care about how the workbook &lt;em&gt;looks&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;7,700+ screenshot operations.&lt;/strong&gt; Plus 1,144 calls to arrange Excel windows, plus 1,875 status-bar messages.&lt;/p&gt;

&lt;p&gt;I built screenshot capture as a debugging affordance. Users and agents are using it as part of the loop. They render the sheet, look at it, and decide what to do next.&lt;/p&gt;

&lt;p&gt;If your automation target is a visual app, your tool surface needs visual feedback. Headless is not always the answer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Surprise 2: VBA is extremely not dead
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;20,000+ VBA operations from 500+ users.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not nostalgia. Enterprise reality. There are companies running their month-end close on a macro a finance manager wrote in 2014. If your agent cannot read, edit, import, or run VBA, it cannot touch any of that.&lt;/p&gt;

&lt;p&gt;Modern AI tooling has to meet legacy automation where it lives. Macros included.&lt;/p&gt;

&lt;h3&gt;
  
  
  Surprise 3: people are not using this for "read cell A1"
&lt;/h3&gt;

&lt;p&gt;The median user fires &lt;strong&gt;65&lt;/strong&gt; tool invocations. The 95th percentile fires &lt;strong&gt;1,000+&lt;/strong&gt;. The 99th fires &lt;strong&gt;3,000+&lt;/strong&gt;. The most active anonymous user fired &lt;strong&gt;10,060&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That is not a demo. That is somebody with an agent in the loop on real work.&lt;/p&gt;

&lt;p&gt;And the work itself is heavier than I assumed: &lt;strong&gt;25K+ Power Query, connections, and Data Model operations.&lt;/strong&gt; People are using this to inspect M code, refresh queries, and poke at DAX models. Excel as a local BI environment, driven by an LLM.&lt;/p&gt;

&lt;h3&gt;
  
  
  Surprise 4: presentation is half the job
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;~68K formatting and presentation operations.&lt;/strong&gt; Number formats, column widths, merged cells, conditional formatting.&lt;/p&gt;

&lt;p&gt;Many Excel tasks do not end at "the data is correct." They end at "the workbook is ready to send." For agents, polishing is part of the workflow, not a nice-to-have.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bonus surprise: nobody lets old tools die
&lt;/h3&gt;

&lt;p&gt;After I shipped cleaner domain-focused tools, the legacy &lt;code&gt;excel_*&lt;/code&gt; tools still pulled &lt;strong&gt;45,507 invocations from 214 users&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Once an LLM client or prompt depends on a tool name, that name is load-bearing. Renames are breaking changes. MCP tool surfaces are APIs and they age like APIs.&lt;/p&gt;
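&lt;p&gt;One way to live with that is to keep deprecated names as permanent aliases over the new implementations, so old prompts keep resolving. A hypothetical Python sketch (tool names here are illustrative, not the server's real surface):&lt;/p&gt;

```python
# Hypothetical alias shim: legacy tool names stay callable,
# delegating to the renamed domain-focused implementations.

def format_range(sheet, ref, fmt):
    """New domain-focused tool (illustrative)."""
    return f"formatted {ref} on {sheet} as {fmt}"

# Registry maps every published name to an implementation.
TOOLS = {"format-range": format_range}

# Old names become permanent aliases instead of being deleted.
LEGACY_ALIASES = {"excel_format_range": "format-range"}
for old, new in LEGACY_ALIASES.items():
    TOOLS[old] = TOOLS[new]

def invoke(name, *args):
    """Dispatch a tool call by published name."""
    return TOOLS[name](*args)
```

&lt;p&gt;Both names resolve to the same code path, so clients written against the legacy surface keep working.&lt;/p&gt;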

&lt;h2&gt;
  
  
  What the GitHub history said
&lt;/h2&gt;

&lt;p&gt;I also went through the full repo: &lt;strong&gt;197 issues, 417 PRs, 382 merged, 3 issues open&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I expected most of it to be "please add a wrapper for X." It was not. The recurring themes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;168 issues&lt;/strong&gt; on COM stability, hangs, cleanup, Click-to-Run quirks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;147 issues&lt;/strong&gt; on testing, CI, regression, smoke tests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;138 issues&lt;/strong&gt; on MCP behavior, schemas, tool definitions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;104 issues&lt;/strong&gt; on the CLI, daemon, named pipes, packaging&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;91 issues&lt;/strong&gt; on Power Query, Data Model, DAX&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;27 issues&lt;/strong&gt; on VBA&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most-commented issue in the project's history is "create Excel Tables programmatically." That sounds boring until you realize how much agent usefulness collapses if you cannot turn a loose range into a structured table object.&lt;/p&gt;

&lt;p&gt;The hard part of this project was never exposing Excel APIs. The hard part was surviving STA threading, modal dialogs, workbook locks, daemon startup races, schema drift, and the difference between Office Click-to-Run and MSI installs on a Tuesday.&lt;/p&gt;

&lt;p&gt;That weird path is exactly what file-based libraries skip, and exactly where real users live.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three things I would tell anyone building an MCP server
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Tool names are user intent, not API mechanics.&lt;/strong&gt; &lt;code&gt;format-range&lt;/code&gt;, &lt;code&gt;refresh&lt;/code&gt;, &lt;code&gt;create-pivottable&lt;/code&gt;, &lt;code&gt;capture-sheet&lt;/code&gt;. Each one removes a translation step the model would otherwise have to invent. A few well-named domain tools beat dozens of generic primitives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. State is the whole game.&lt;/strong&gt; Workbooks have state. Excel has state. Calculation mode has state. COM has state. An MCP server for a stateful desktop app is not a stateless HTTP wrapper with fancier transport. Sessions are the product.&lt;/p&gt;
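&lt;p&gt;A hypothetical sketch of what that means in practice: a session registry that keeps one long-lived object per workbook, so state survives across tool calls (the names and fields are illustrative, not the server's actual code):&lt;/p&gt;

```python
# Hypothetical session registry for a stateful desktop app.
# Each workbook gets one long-lived session; repeated tool calls
# reuse it instead of reopening the file and losing state.

class WorkbookSession:
    def __init__(self, path):
        self.path = path
        self.calculation_mode = "automatic"  # Excel-side state to track
        self.dirty = False                   # unsaved edits pending

class SessionManager:
    def __init__(self):
        self._sessions = {}

    def acquire(self, path):
        # Reuse the existing session so calc mode and unsaved edits
        # persist across tool invocations.
        if path not in self._sessions:
            self._sessions[path] = WorkbookSession(path)
        return self._sessions[path]
```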

&lt;p&gt;&lt;strong&gt;3. Once a tool ships, the name is permanent-ish.&lt;/strong&gt; See the 45K invocations of "deprecated" tools above. Treat your tool surface like a public SDK from day one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;The interesting AI work is not only in greenfield apps. It is also in the messy, load-bearing tools businesses already trust.&lt;/p&gt;

&lt;p&gt;Excel is full of formulas, macros, queries, layouts, and conventions that nobody wants to rewrite. Some of that lives in the file. A lot of it only comes alive when Excel opens it.&lt;/p&gt;

&lt;p&gt;Agents can finally work &lt;em&gt;inside&lt;/em&gt; that environment. Not next to it. Not with a copy of it. Inside it.&lt;/p&gt;

&lt;p&gt;That is the gap Excel MCP Server is trying to close, and the telemetry says people are walking through it faster than I thought.&lt;/p&gt;




&lt;p&gt;Open source, MIT, Windows-only because Excel:&lt;br&gt;
&lt;strong&gt;&lt;a href="https://github.com/sbroenne/mcp-server-excel" rel="noopener noreferrer"&gt;github.com/sbroenne/mcp-server-excel&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you are building MCP servers for other desktop apps, or you have war stories about COM, modal dialogs, or "we shipped a tool rename and broke 200 prompts" — I want to hear them.&lt;/p&gt;

</description>
      <category>excel</category>
      <category>ai</category>
      <category>mcp</category>
      <category>opensource</category>
    </item>
    <item>
      <title>skillpm - Package Manager for Agent Skills. Built on npm.</title>
      <dc:creator>Stefan Broenner</dc:creator>
      <pubDate>Wed, 25 Feb 2026 19:12:17 +0000</pubDate>
      <link>https://forem.com/sbroenne/skillpm-package-manager-for-agent-skills-built-on-npm-3d31</link>
      <guid>https://forem.com/sbroenne/skillpm-package-manager-for-agent-skills-built-on-npm-3d31</guid>
      <description>&lt;p&gt;Every &lt;a href="https://agentskills.io/home" rel="noopener noreferrer"&gt;agent skill&lt;/a&gt; today is a monolith.&lt;/p&gt;

&lt;p&gt;Authors cram React patterns, TypeScript best practices, and testing guidelines into a single massive SKILL.md — because there's no way to say "just depend on that other skill." No registry. No dependency management. No versioning. The &lt;a href="https://agentskills.io" rel="noopener noreferrer"&gt;Agent Skills spec&lt;/a&gt; defines what a skill &lt;em&gt;is&lt;/em&gt;, but says nothing about how to publish, install, or share them.&lt;/p&gt;

&lt;p&gt;We (Sonnet 4.6 &amp;amp; I) built &lt;strong&gt;&lt;a href="https://skillpm.dev" rel="noopener noreferrer"&gt;skillpm&lt;/a&gt;&lt;/strong&gt; to fix that — a lightweight orchestration layer on top of npm. ~630 lines of code, 3 dependencies, zero reinvention. Small skills that compose, not monoliths that overlap.&lt;/p&gt;

&lt;h2&gt;
  
  
  The idea: don't reinvent npm — extend it
&lt;/h2&gt;

&lt;p&gt;When we started, the tempting path was to build a custom registry, a custom resolver, a custom lockfile format. We chose the opposite.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;skillpm is a thin orchestration layer on top of npm.&lt;/strong&gt; Same &lt;code&gt;package.json&lt;/code&gt;. Same &lt;code&gt;node_modules/&lt;/code&gt;. Same &lt;code&gt;package-lock.json&lt;/code&gt;. Same registry (npmjs.org). skillpm only adds what npm can't do on its own:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Scanning&lt;/strong&gt; &lt;code&gt;node_modules/&lt;/code&gt; for packages containing &lt;code&gt;skills/*/SKILL.md&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wiring&lt;/strong&gt; discovered skills into agent directories via &lt;a href="https://www.npmjs.com/package/skills" rel="noopener noreferrer"&gt;&lt;code&gt;skills&lt;/code&gt;&lt;/a&gt; (Claude, Cursor, VS Code, Codex, and many more)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuring MCP servers&lt;/strong&gt; declared by skills, transitively across the dependency tree, via &lt;a href="https://github.com/neondatabase/add-mcp" rel="noopener noreferrer"&gt;&lt;code&gt;add-mcp&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. Everything else — resolution, caching, lockfiles, audit, semver — is npm.&lt;/p&gt;
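&lt;p&gt;The scan in step 1 is just a directory-pattern walk over &lt;code&gt;node_modules/&lt;/code&gt;. A minimal Python sketch of the idea (skillpm itself is JavaScript, so this is not its actual code):&lt;/p&gt;

```python
# Sketch of the scan step: find installed packages that ship skills.
# skillpm is JavaScript; this Python version just illustrates the walk.
from pathlib import Path, PurePosixPath

# Plain packages plus scoped packages (@scope/pkg).
SKILL_PATTERNS = ("*/skills/*/SKILL.md", "@*/*/skills/*/SKILL.md")

def is_skill_file(relpath):
    """True if a path relative to node_modules/ points at a skill definition."""
    p = PurePosixPath(relpath)
    return any(p.match(pat) for pat in SKILL_PATTERNS)

def find_skills(node_modules):
    """Return every SKILL.md shipped by an installed package."""
    root = Path(node_modules)
    return sorted(
        p for p in root.rglob("SKILL.md")
        if is_skill_file(p.relative_to(root).as_posix())
    )
```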

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install a skill (no global install needed)&lt;/span&gt;
npx skillpm &lt;span class="nb"&gt;install &lt;/span&gt;excel-mcp-skill

&lt;span class="c"&gt;# List what's installed&lt;/span&gt;
npx skillpm list

&lt;span class="c"&gt;# Scaffold a new skill&lt;/span&gt;
npx skillpm init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you run &lt;code&gt;npx skillpm install &amp;lt;skill&amp;gt;&lt;/code&gt;, here's what happens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npx skillpm install react-patterns
  │
  ▼
📦 npm install react-patterns
   npm handles resolution, download, lockfile
  │
  ▼
🔍 Scan node_modules/
   find packages with skills/*/SKILL.md
  │
  ▼
🔗 npx skills add ./node_modules/...
   wire into Claude, Cursor, VS Code, Codex...
  │
  ▼
📄 Read skillpm.mcpServers
   walk entire dependency tree
  │
  ▼
🔌 npx add-mcp &amp;lt;server&amp;gt;
   configure each MCP server across agents
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four tools, each doing one thing well, orchestrated together.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a skill package looks like
&lt;/h2&gt;

&lt;p&gt;A skill is just an npm package with a specific directory structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my-skill/
├── package.json          # keywords: ["agent-skill"]
├── README.md
├── LICENSE
└── skills/
    └── my-skill/
        ├── SKILL.md      # The skill definition
        ├── scripts/      # Optional executable scripts
        ├── references/   # Optional reference docs
        └── assets/       # Optional templates/data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;SKILL.md&lt;/code&gt; is where the magic lives — YAML frontmatter for metadata, Markdown body with instructions for the agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-skill&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Refactor&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;React&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;class&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;components&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;functional&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;components&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;hooks."&lt;/span&gt;
&lt;span class="na"&gt;allowed-tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Bash Read Edit&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="c1"&gt;# React Refactoring&lt;/span&gt;

&lt;span class="c1"&gt;## When to use this skill&lt;/span&gt;
&lt;span class="s"&gt;Use when the user asks to modernize React components...&lt;/span&gt;

&lt;span class="c1"&gt;## Instructions&lt;/span&gt;
&lt;span class="s"&gt;1. Identify class components in the target files&lt;/span&gt;
&lt;span class="s"&gt;2. Convert lifecycle methods to useEffect hooks&lt;/span&gt;
&lt;span class="s"&gt;3. ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Skills can depend on other skills
&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting. Because skills are npm packages, they can depend on each other. &lt;strong&gt;This solves the biggest problem with Agent Skills today: prompt bloat.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without dependency management, if you want a skill that builds a full-stack React app, you have to copy-paste instructions for React, TypeScript, testing, and styling into one massive &lt;code&gt;SKILL.md&lt;/code&gt;. The agent gets overwhelmed, context windows fill up, and the skill becomes impossible to maintain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;skillpm brings standard software engineering practices to Agent Skills.&lt;/strong&gt; You don't copy-paste code anymore; you shouldn't copy-paste prompts either. With skillpm, you just declare dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fullstack-react"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"keywords"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"agent-skill"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dependencies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"react-patterns"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"^2.0.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"typescript-best-practices"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"^1.3.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"testing-with-vitest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"^1.0.0"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"skillpm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"@anthropic/mcp-server-filesystem"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;npx skillpm install fullstack-react&lt;/code&gt; resolves the entire tree — all three dependencies get installed, scanned, wired into agents, and their MCP servers configured. One command.&lt;/p&gt;

&lt;p&gt;Instead of monolithic, 500-line prompt files, you can build small, composable, single-purpose skills that build on top of each other. It's modularity for AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP server configuration, handled
&lt;/h2&gt;

&lt;p&gt;Skills often need MCP servers to function. The &lt;code&gt;skillpm.mcpServers&lt;/code&gt; field declares those requirements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"skillpm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"@anthropic/mcp-server-filesystem"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"https://mcp.context7.com/mcp"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;skillpm walks the &lt;em&gt;entire&lt;/em&gt; dependency tree, collects all MCP server requirements (deduplicated), and configures each one via &lt;a href="https://github.com/neondatabase/add-mcp" rel="noopener noreferrer"&gt;add-mcp&lt;/a&gt;. The user never has to manually configure MCP servers.&lt;/p&gt;
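&lt;p&gt;That transitive collection is a small tree walk. A hypothetical Python sketch (the real skillpm is JavaScript and reads &lt;code&gt;package.json&lt;/code&gt; files; here the tree is plain dicts for illustration):&lt;/p&gt;

```python
# Sketch: collect skillpm.mcpServers across a dependency tree, deduplicated.
# Each node stands in for a resolved package.json (illustrative shape).

def collect_mcp_servers(pkg, seen=None):
    """Depth-first walk that keeps first-seen order and drops duplicates."""
    if seen is None:
        seen = []
    for server in pkg.get("skillpm", {}).get("mcpServers", []):
        if server not in seen:  # dedupe across the whole tree
            seen.append(server)
    for dep in pkg.get("dependencies", []):
        collect_mcp_servers(dep, seen)
    return seen
```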

&lt;h2&gt;
  
  
  The ecosystem today
&lt;/h2&gt;

&lt;p&gt;There are already &lt;strong&gt;90+ skill packages&lt;/strong&gt; on npm with the &lt;code&gt;agent-skill&lt;/code&gt; keyword. We built an &lt;a href="https://skillpm.dev/registry/" rel="noopener noreferrer"&gt;Agent Skills Registry&lt;/a&gt; that indexes them all — searchable, filterable by keyword, sortable by downloads or recency.&lt;/p&gt;

&lt;p&gt;Most existing packages follow the original spec (root &lt;code&gt;SKILL.md&lt;/code&gt;) rather than our npm packaging convention (&lt;code&gt;skills/&amp;lt;name&amp;gt;/SKILL.md&lt;/code&gt;). skillpm handles both — legacy packages get installed with a friendly warning pointing to the migration guide.&lt;/p&gt;

&lt;h2&gt;
  
  
  Create and publish a skill in 60 seconds
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;my-awesome-skill &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;my-awesome-skill
npx skillpm init
&lt;span class="c"&gt;# Edit skills/my-awesome-skill/SKILL.md with your instructions&lt;/span&gt;
npx skillpm publish
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's literally it. Your skill is now on npmjs.org, discoverable by anyone running &lt;code&gt;npx skillpm install my-awesome-skill&lt;/code&gt;, and automatically wired into their agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why not just use npm directly?
&lt;/h2&gt;

&lt;p&gt;You can! Skills are valid npm packages. But skillpm adds what &lt;code&gt;npm install&lt;/code&gt; alone can't do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Scan for &lt;code&gt;SKILL.md&lt;/code&gt; files in installed packages&lt;/li&gt;
&lt;li&gt;✅ Link skills into agent directories via &lt;a href="https://www.npmjs.com/package/skills" rel="noopener noreferrer"&gt;&lt;code&gt;skills&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;✅ Configure MCP servers via &lt;a href="https://github.com/neondatabase/add-mcp" rel="noopener noreferrer"&gt;&lt;code&gt;add-mcp&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;✅ Validate skill packages before publishing&lt;/li&gt;
&lt;li&gt;✅ Show you what skills are installed and where they're wired&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's the gap skillpm fills. It's npm + skill awareness.&lt;/p&gt;

&lt;h2&gt;
  
  
  Built on the shoulders of giants
&lt;/h2&gt;

&lt;p&gt;We deliberately don't reinvent anything. skillpm shells out to four battle-tested tools:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;How skillpm uses it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;npm&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Package management&lt;/td&gt;
&lt;td&gt;All installs, resolution, lockfiles, caching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;a href="https://www.npmjs.com/package/skills" rel="noopener noreferrer"&gt;&lt;strong&gt;skills&lt;/strong&gt;&lt;/a&gt; (Vercel)&lt;/td&gt;
&lt;td&gt;Agent directory linking&lt;/td&gt;
&lt;td&gt;Wires skills into agent directories&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/neondatabase/add-mcp" rel="noopener noreferrer"&gt;&lt;strong&gt;add-mcp&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;MCP server configuration&lt;/td&gt;
&lt;td&gt;Configures servers across agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://www.npmjs.com/package/skills-ref" rel="noopener noreferrer"&gt;&lt;strong&gt;skills-ref&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Spec validation&lt;/td&gt;
&lt;td&gt;Validates SKILL.md during publish&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Get started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Try it now — no install required&lt;/span&gt;
npx skillpm &lt;span class="nb"&gt;install &lt;/span&gt;skillpm-skill

&lt;span class="c"&gt;# Browse the registry&lt;/span&gt;
&lt;span class="c"&gt;# https://skillpm.dev/registry/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;📦 &lt;strong&gt;npm&lt;/strong&gt;: &lt;a href="https://www.npmjs.com/package/skillpm" rel="noopener noreferrer"&gt;npmjs.com/package/skillpm&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📖 &lt;strong&gt;Docs&lt;/strong&gt;: &lt;a href="https://skillpm.dev" rel="noopener noreferrer"&gt;skillpm.dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🔧 &lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/sbroenne/skillpm" rel="noopener noreferrer"&gt;github.com/sbroenne/skillpm&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📋 &lt;strong&gt;Agent Skills spec&lt;/strong&gt;: &lt;a href="https://agentskills.io" rel="noopener noreferrer"&gt;agentskills.io&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  We're actively looking for contributors! 🤝
&lt;/h2&gt;

&lt;p&gt;skillpm is a young project, and there's a lot of room to grow. Whether it's adding support for new agent directories, improving the CLI experience, or building out the registry, we'd love your help. Check out the &lt;a href="https://github.com/sbroenne/skillpm" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; and look for the &lt;code&gt;good first issue&lt;/code&gt; label!&lt;/p&gt;

&lt;p&gt;We'd love your feedback. Open an issue, try publishing a skill, or just tell us what you think.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What kind of skills are you building for your AI agents? Is this useful? What are we missing (e.g. custom agents, slash prompts)? Let me know in the comments!&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;skillpm is MIT licensed and open source.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>npm</category>
      <category>showdev</category>
      <category>agentskills</category>
      <category>skillsengineering</category>
    </item>
    <item>
      <title>Are you human? Or are you malware?</title>
      <dc:creator>Stefan Broenner</dc:creator>
      <pubDate>Thu, 19 Feb 2026 09:18:24 +0000</pubDate>
      <link>https://forem.com/sbroenne/are-you-human-or-are-you-malware-a1k</link>
      <guid>https://forem.com/sbroenne/are-you-human-or-are-you-malware-a1k</guid>
      <description>&lt;p&gt;Someone opened a GitHub issue on my Excel MCP Server project questioning if I’m actually human. The reasons behind this assumption included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High commit velocity&lt;/li&gt;
&lt;li&gt;An AI-generated demo video&lt;/li&gt;
&lt;li&gt;Consistent structure and documentation&lt;/li&gt;
&lt;li&gt;The belief that my work might be AI-generated or even malware&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Great analysis, great question!&lt;/p&gt;

&lt;p&gt;My response was straightforward: I’m human — I just use AI tools very deliberately. GitHub Copilot helps me build faster, and tools like HeyGen enhance my communication.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This interaction highlighted an important realization: the line between “human work” and “AI-assisted work” is already blurry — and that’s okay.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Open source is evolving, developer workflows are changing, and trust models will need to adapt as well. These are interesting times to be building software in public. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/sbroenne/mcp-server-excel/issues/479" rel="noopener noreferrer"&gt;https://github.com/sbroenne/mcp-server-excel/issues/479&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>githubcopilot</category>
      <category>devjournal</category>
      <category>agentic</category>
    </item>
    <item>
      <title>pytest-aitest: Unit Tests Can't Test Your MCP Server. AI Can.</title>
      <dc:creator>Stefan Broenner</dc:creator>
      <pubDate>Fri, 13 Feb 2026 03:58:55 +0000</pubDate>
      <link>https://forem.com/sbroenne/pytest-aitest-unit-tests-cant-test-your-mcp-server-ai-can-1ebn</link>
      <guid>https://forem.com/sbroenne/pytest-aitest-unit-tests-cant-test-your-mcp-server-ai-can-1ebn</guid>
      <description>&lt;h2&gt;
  
  
  I Learned This the Hard Way
&lt;/h2&gt;

&lt;p&gt;I built two MCP servers — &lt;a href="https://github.com/sbroenne/excel-mcp-server" rel="noopener noreferrer"&gt;Excel MCP Server&lt;/a&gt; and &lt;a href="https://github.com/sbroenne/windows-mcp-server" rel="noopener noreferrer"&gt;Windows MCP Server&lt;/a&gt;. Both had solid test suites. Both broke the moment a real LLM tried to use them.&lt;/p&gt;

&lt;p&gt;I spent weeks doing manual testing with GitHub Copilot. Open a chat, type a prompt, watch the LLM pick the wrong tool, tweak the description, try again. Sometimes the design was fundamentally broken, and I was deep into a wild goose chase before realizing the whole approach needed rethinking.&lt;/p&gt;

&lt;p&gt;The failure modes were always the same:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The LLM picks the wrong tool out of 15 similar-sounding options&lt;/li&gt;
&lt;li&gt;It passes &lt;code&gt;{"account_id": "checking"}&lt;/code&gt; when the parameter is &lt;code&gt;account&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;It ignores the system prompt entirely&lt;/li&gt;
&lt;li&gt;It asks the user "Would you like me to do that?" instead of just doing it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt; Because I tested the code, not the AI interface.&lt;/p&gt;

&lt;p&gt;For LLMs, your API isn't functions and types — it's &lt;strong&gt;tool descriptions, parameter schemas, and system prompts&lt;/strong&gt;. That's what the model actually reads. No compiler catches a bad tool description. No unit test validates that an LLM will pick the right tool. And if you also inject Agent Skills — do they actually help? Or make things worse? Do LLMs really behave the way you think they will?&lt;/p&gt;

&lt;p&gt;(No. They don't.)&lt;/p&gt;
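&lt;p&gt;The parameter-name failure mode is worth making concrete. Here is a toy, library-free sketch (this is &lt;em&gt;not&lt;/em&gt; pytest-aitest API, just an illustration): the mismatch only surfaces at the schema layer, a layer ordinary unit tests never touch.&lt;/p&gt;

```python
# Toy illustration (NOT part of pytest-aitest): the "wrong parameter
# name" failure mode at the schema level. The tool declares `account`,
# but the model emits `account_id` -- a call that no unit test on the
# underlying function would ever exercise.

TOOL_SCHEMA = {
    "name": "get_balance",
    "parameters": {"account": {"type": "string"}},
}

def unknown_arguments(schema, arguments):
    """Return argument names the tool schema does not declare."""
    declared = schema["parameters"]
    return [name for name in arguments if name not in declared]

# A plausible-looking call from the model, with the wrong key:
bad_call = {"account_id": "checking"}
print(unknown_arguments(TOOL_SCHEMA, bad_call))  # ['account_id']
```

&lt;p&gt;The value is perfectly reasonable; only the key is wrong. Your code-level tests pass, the LLM-facing call fails.&lt;/p&gt;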

&lt;p&gt;So I built &lt;a href="https://github.com/sbroenne/pytest-aitest" rel="noopener noreferrer"&gt;pytest-aitest&lt;/a&gt;, heavily inspired by &lt;a href="https://github.com/mykhaliev/agent-benchmark" rel="noopener noreferrer"&gt;agent-benchmark&lt;/a&gt; by Dmytro Mykhaliev. &lt;/p&gt;

&lt;p&gt;It's a pytest plugin — &lt;code&gt;uv add pytest-aitest&lt;/code&gt; and you're done. No new CLI, no new syntax. Works with your existing fixtures, markers, and CI/CD pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Write Tests as Prompts
&lt;/h2&gt;

&lt;p&gt;Your test &lt;em&gt;is&lt;/em&gt; a prompt. Write what a user would say. Let the LLM figure out how to use your tools. Assert on what happened.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pytest_aitest&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Provider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MCPServer&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_balance_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Provider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;azure/gpt-5-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;mcp_servers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;MCPServer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-m&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_banking_server&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s my checking balance?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool_was_called&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_balance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;If this fails, the problem isn't your code — it's your tool description. The LLM couldn't figure out which tool to call or what parameters to pass. Fix the description, run again. This is &lt;strong&gt;TDD for AI interfaces&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Red/Green/Refactor Cycle — For Tool Descriptions
&lt;/h2&gt;
&lt;h3&gt;
  
  
  🔴 Red: Write a failing test
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_transfer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Move $200 from checking to savings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool_was_called&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transfer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The LLM reads your tool descriptions, gets confused, calls the wrong thing. Test fails.&lt;/p&gt;
&lt;h3&gt;
  
  
  🟢 Green: Fix the interface
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before — too vague
&lt;/span&gt;&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transfer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;from_acct&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;to_acct&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Transfer money.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="c1"&gt;# After — the LLM knows exactly what to do
&lt;/span&gt;&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transfer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;from_account&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;to_account&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Transfer money between accounts (checking, savings).
    Amount must be positive. Returns new balances for both accounts.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Run again. Test passes.&lt;/p&gt;
&lt;h3&gt;
  
  
  🔄 Refactor: Let AI analysis tell you what else to fix
&lt;/h3&gt;

&lt;p&gt;This is where it gets interesting. pytest-aitest doesn't just tell you pass/fail — it runs a second LLM that analyzes every failure and tells you &lt;em&gt;why&lt;/em&gt; it happened and &lt;em&gt;what to improve&lt;/em&gt;. Traditional testing requires a human to interpret failures. Here, the AI does it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyof1k8ii6jvtetjy3bvt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyof1k8ii6jvtetjy3bvt.png" alt="Screenshot of pytest-aitest report showing deploy recommendation for gpt-5-mini, pass rate comparison across models, cost metrics, and AI-generated failure analysis"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The report tells you which model to deploy, why it wins, and what to fix. It analyzes cost efficiency, tool usage patterns, and prompt effectiveness across all your configurations. Unused tools? The AI flags them. Prompt causing permission-seeking behavior? It explains the mechanism. &lt;a href="https://sbroenne.github.io/pytest-aitest/demo/hero-report.html" rel="noopener noreferrer"&gt;See a full sample report →&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Compare Models, Prompts, and Server Versions
&lt;/h2&gt;

&lt;p&gt;The real power is comparison. Test multiple configurations against the same test suite:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;MODELS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;PROMPTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;brief&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Be concise.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;detailed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain your reasoning.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;AGENTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prompt_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Provider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;azure/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;mcp_servers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;banking_server&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;prompt_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;PROMPTS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nd"&gt;@pytest.mark.parametrize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AGENTS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_balance_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s my checking balance?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;4 configurations. Same tests. The report generates an &lt;strong&gt;Agent Leaderboard&lt;/strong&gt; — winner by pass rate, then cost as tiebreaker:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Pass Rate&lt;/th&gt;
&lt;th&gt;Tokens&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5-mini-brief&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;747&lt;/td&gt;
&lt;td&gt;$0.002&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-4.1-brief&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;560&lt;/td&gt;
&lt;td&gt;$0.008&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5-mini-detailed&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;1,203&lt;/td&gt;
&lt;td&gt;$0.004&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Deploy: gpt-5-mini&lt;/strong&gt; (brief prompt) — 100% pass rate at lowest cost.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The same pattern works for A/B testing server versions (did your refactor break tool discoverability?), comparing system prompts, and measuring the impact of &lt;a href="https://agentskills.io" rel="noopener noreferrer"&gt;Agent Skills&lt;/a&gt;.&lt;/p&gt;
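&lt;p&gt;A minimal sketch of that A/B setup, using only the standard library (the &lt;code&gt;v2&lt;/code&gt; module name is a hypothetical placeholder): build one configuration per model and server build, then feed them to the same &lt;code&gt;parametrize&lt;/code&gt; pattern shown above.&lt;/p&gt;

```python
# Sketch: the MODELS x PROMPTS matrix pattern, applied to A/B testing
# two server builds. "my_banking_server_v2" is a hypothetical
# placeholder -- substitute your own refactored build.
from itertools import product

MODELS = ["gpt-5-mini", "gpt-4.1"]
SERVERS = {
    "v1": ["python", "-m", "my_banking_server"],     # current release
    "v2": ["python", "-m", "my_banking_server_v2"],  # refactored build
}

# One configuration per (model, server version); each dict would back
# one parametrized Agent, exactly as in the prompt-comparison example.
CONFIGS = [
    {"id": f"{model}-{version}", "model": f"azure/{model}", "command": cmd}
    for model, (version, cmd) in product(MODELS, SERVERS.items())
]
print([c["id"] for c in CONFIGS])
# ['gpt-5-mini-v1', 'gpt-5-mini-v2', 'gpt-4.1-v1', 'gpt-4.1-v2']
```

&lt;p&gt;If &lt;code&gt;v2&lt;/code&gt; drops below &lt;code&gt;v1&lt;/code&gt; on the leaderboard, your refactor hurt tool discoverability before any user ever saw it.&lt;/p&gt;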
&lt;h2&gt;
  
  
  Multi-Turn Sessions
&lt;/h2&gt;

&lt;p&gt;Real users don't ask one question. They have conversations:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@pytest.mark.session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;banking-chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TestBankingConversation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_check_balance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s my checking balance?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_transfer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Agent remembers we were talking about checking
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Transfer $200 to savings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool_was_called&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transfer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_verify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Agent remembers the transfer
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What are my new balances?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Tests share conversation history. The report shows the full session flow with sequence diagrams.&lt;/p&gt;
&lt;h2&gt;
  
  
  Who This Is For
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MCP server authors&lt;/strong&gt; — Validate that LLMs can actually use your tools, not just that the code works&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent builders&lt;/strong&gt; — Find the cheapest model + prompt combo that passes your test suite&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Teams shipping AI products&lt;/strong&gt; — Gate deployments on LLM-facing regression tests in CI/CD&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Works with &lt;a href="https://docs.litellm.ai/docs/providers" rel="noopener noreferrer"&gt;100+ LLM providers&lt;/a&gt; via LiteLLM — Azure, OpenAI, Anthropic, Google, local models, whatever you're running.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Key Insight
&lt;/h2&gt;

&lt;p&gt;The test is a prompt. The LLM is the test harness. The report tells you what to fix.&lt;/p&gt;

&lt;p&gt;Traditional testing validates that your code works. pytest-aitest validates that &lt;strong&gt;an LLM can understand and use your code&lt;/strong&gt;. These are different things, and the gap between them is where your production bugs live.&lt;/p&gt;

&lt;p&gt;Your tool descriptions are an API. Test them like one.&lt;/p&gt;
&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;

&lt;p&gt;pytest-aitest is open source. Contributions welcome!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/sbroenne/pytest-aitest" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Star pytest-aitest on GitHub&lt;/a&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sbroenne.github.io/pytest-aitest/" rel="noopener noreferrer"&gt;Documentation&lt;/a&gt; — Full guides and API reference&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/pytest-aitest/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt; — &lt;code&gt;uv add pytest-aitest&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sbroenne.github.io/pytest-aitest/demo/hero-report.html" rel="noopener noreferrer"&gt;Sample Report&lt;/a&gt; — See AI analysis in action&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/sbroenne" rel="noopener noreferrer"&gt;
        sbroenne
      &lt;/a&gt; / &lt;a href="https://github.com/sbroenne/pytest-aitest" rel="noopener noreferrer"&gt;
        pytest-aitest
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      The testing framework for skill engineering. Test tool descriptions, prompt templates, agent skills, and custom agents with real LLMs. AI analyzes results and tells you what to fix.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;blockquote&gt;
&lt;p&gt;🗄️ &lt;strong&gt;This project is archived and no longer maintained.&lt;/strong&gt; It has been replaced by &lt;a href="https://github.com/sbroenne/pytest-skill-engineering" rel="noopener noreferrer"&gt;pytest-skill-engineering&lt;/a&gt;. Do not use this project for new work. This repository is kept as a read-only archive.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;pytest-aitest&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a href="https://pypi.org/project/pytest-aitest/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/8ddb5f9cd42c915ff6ecddb74e709fd948465b78b0c2b25086c7b425187c6645/68747470733a2f2f696d672e736869656c64732e696f2f707970692f762f7079746573742d616974657374" alt="PyPI version"&gt;&lt;/a&gt;
&lt;a href="https://pypi.org/project/pytest-aitest/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/f4cc6b256242d23285dafa42f03f6a3465c0458ea4ae72515d1bbe6b26537ab9/68747470733a2f2f696d672e736869656c64732e696f2f707970692f707976657273696f6e732f7079746573742d616974657374" alt="Python versions"&gt;&lt;/a&gt;
&lt;a href="https://github.com/sbroenne/pytest-aitest/actions/workflows/ci.yml" rel="noopener noreferrer"&gt;&lt;img src="https://github.com/sbroenne/pytest-aitest/actions/workflows/ci.yml/badge.svg" alt="CI"&gt;&lt;/a&gt;
&lt;a href="https://opensource.org/licenses/MIT" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/fdf2982b9f5d7489dcf44570e714e3a15fce6253e0cc6b5aa61a075aac2ff71b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d4d49542d79656c6c6f772e737667" alt="License: MIT"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Skill Engineering. Test-driven. AI-analyzed.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A pytest plugin for skill engineering — test your MCP server tools, prompt templates, agent skills, and custom agents with real LLMs. Red/Green/Refactor for the skill stack. Let AI analysis tell you what to fix.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Why?&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;Modern AI systems are built on &lt;strong&gt;skill engineering&lt;/strong&gt; — the discipline of designing modular, reliable, callable capabilities that an LLM can discover, invoke, and orchestrate to perform real tasks. Skills are what separate "text generator" from "agent that actually does things."&lt;/p&gt;
&lt;p&gt;An MCP server is the runtime for those skills. It doesn't ship alone — it comes bundled with the &lt;strong&gt;full skill engineering stack&lt;/strong&gt;: &lt;strong&gt;tools&lt;/strong&gt; (callable functions), &lt;strong&gt;prompt templates&lt;/strong&gt; (server-side reasoning starters), &lt;strong&gt;agent skills&lt;/strong&gt; (domain knowledge…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/sbroenne/pytest-aitest" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;




</description>
      <category>python</category>
      <category>mcp</category>
      <category>testing</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
