<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Stefan Broenner</title>
    <description>The latest articles on Forem by Stefan Broenner (@sbroenne).</description>
    <link>https://forem.com/sbroenne</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3758524%2F817515ae-dac8-4fbe-ae02-e1e59922895c.jpg</url>
      <title>Forem: Stefan Broenner</title>
      <link>https://forem.com/sbroenne</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sbroenne"/>
    <language>en</language>
    <item>
      <title>I Gave AI Agents Real Excel. They Did Not Use It Like I Expected - Proven By 90 Days of Telemetry.</title>
      <dc:creator>Stefan Broenner</dc:creator>
      <pubDate>Thu, 30 Apr 2026 06:24:09 +0000</pubDate>
      <link>https://forem.com/sbroenne/i-gave-ai-agents-real-excel-they-did-not-use-it-like-i-expected-proven-by-90-days-of-telemetry-4m78</link>
      <guid>https://forem.com/sbroenne/i-gave-ai-agents-real-excel-they-did-not-use-it-like-i-expected-proven-by-90-days-of-telemetry-4m78</guid>
      <description>&lt;p&gt;&lt;strong&gt;90 days of telemetry from an open-source MCP server that drives the actual Excel desktop app. The numbers were not where I thought they would be.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Half a year ago I asked a simple question: why can AI agents write a React app from scratch but choke on &lt;code&gt;Revenue Model v27 final final.xlsx&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;The answer turned out to be boring and important.&lt;/p&gt;

&lt;p&gt;Agents had spreadsheet &lt;em&gt;libraries&lt;/em&gt;. They did not have &lt;strong&gt;Excel&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So I built &lt;a href="https://github.com/sbroenne/mcp-server-excel" rel="noopener noreferrer"&gt;Excel MCP Server&lt;/a&gt;: an open-source MCP server that drives the &lt;strong&gt;real Excel desktop application&lt;/strong&gt; through COM automation. Not a &lt;code&gt;.xlsx&lt;/code&gt; parser. Not Open XML. The actual app, with its calculation engine, VBA, Power Query, pivot caches, and all the quirks that make a workbook a workbook.&lt;/p&gt;

&lt;h2&gt;
  
  
  The "just use openpyxl" problem
&lt;/h2&gt;

&lt;p&gt;Every time I talk about this, someone says "why not use a file-based Excel library?"&lt;/p&gt;

&lt;p&gt;For generating a &lt;code&gt;.xlsx&lt;/code&gt; from a server, those libraries are great. Linux containers, no Office install, fast.&lt;/p&gt;

&lt;p&gt;They are not Excel.&lt;/p&gt;

&lt;p&gt;They do not run Excel's calculation engine. They do not execute VBA. They do not refresh Power Query. They do not touch the COM object model. They cannot tell you whether the report &lt;strong&gt;looks right&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For generating spreadsheets, that is fine. For automating workbooks that already run a business, it is the difference between "I edited a file Excel can open" and "I worked inside Excel."&lt;/p&gt;

&lt;p&gt;Excel MCP Server is for the second case. The workbook is treated as an application, not a zip of XML.&lt;/p&gt;

&lt;h2&gt;
  
  
  The architecture, in one diagram
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AI assistant
  -&amp;gt; MCP protocol
  -&amp;gt; Excel MCP Server  (in-process MCP, or named-pipe CLI daemon)
  -&amp;gt; Excel COM automation
  -&amp;gt; Real Excel.exe with the workbook open
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That last line is the entire thesis. 90 days later I have telemetry. The numbers told me things I did not expect.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I expected vs what 90 days of telemetry showed
&lt;/h2&gt;

&lt;p&gt;90 days, anonymous, opt-in: &lt;strong&gt;2,908 users, 86,090 sessions, 488,548 tool invocations&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Growth was sharper than I expected. Weekly users went from &lt;strong&gt;84 to 1,209&lt;/strong&gt;. Weekly sessions from &lt;strong&gt;310 to 11,769&lt;/strong&gt;. February: 800 monthly users. March: 1,682. April was on track to clear 2,000 before the month closed.&lt;/p&gt;

&lt;p&gt;Fine. Growth charts are nice. Here is the part that actually surprised me.&lt;/p&gt;

&lt;h3&gt;
  
  
  Surprise 1: agents care about how the workbook &lt;em&gt;looks&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;7,700+ screenshot operations.&lt;/strong&gt; Plus 1,144 calls to arrange Excel windows, plus 1,875 status-bar messages.&lt;/p&gt;

&lt;p&gt;I built screenshot capture as a debugging affordance. Users and agents are using it as part of the loop. They render the sheet, look at it, and decide what to do next.&lt;/p&gt;

&lt;p&gt;If your automation target is a visual app, your tool surface needs visual feedback. Headless is not always the answer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Surprise 2: VBA is extremely not dead
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;20,000+ VBA operations from 500+ users.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not nostalgia. Enterprise reality. There are companies running their month-end close on a macro a finance manager wrote in 2014. If your agent cannot read, edit, import, or run VBA, it cannot touch any of that.&lt;/p&gt;

&lt;p&gt;Modern AI tooling has to meet legacy automation where it lives. Macros included.&lt;/p&gt;

&lt;h3&gt;
  
  
  Surprise 3: people are not using this for "read cell A1"
&lt;/h3&gt;

&lt;p&gt;The median user fires &lt;strong&gt;65&lt;/strong&gt; tool invocations. The 95th percentile fires &lt;strong&gt;1,000+&lt;/strong&gt;. The 99th fires &lt;strong&gt;3,000+&lt;/strong&gt;. The most active anonymous user fired &lt;strong&gt;10,060&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That is not a demo. That is somebody with an agent in the loop on real work.&lt;/p&gt;

&lt;p&gt;And the work itself is heavier than I assumed: &lt;strong&gt;25K+ Power Query, connections, and Data Model operations.&lt;/strong&gt; People are using this to inspect M code, refresh queries, and poke at DAX models. Excel as a local BI environment, driven by an LLM.&lt;/p&gt;

&lt;h3&gt;
  
  
  Surprise 4: presentation is half the job
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;~68K formatting and presentation operations.&lt;/strong&gt; Number formats, column widths, merged cells, conditional formatting.&lt;/p&gt;

&lt;p&gt;Many Excel tasks do not end at "the data is correct." They end at "the workbook is ready to send." For agents, polishing is part of the workflow, not a nice-to-have.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bonus surprise: nobody lets old tools die
&lt;/h3&gt;

&lt;p&gt;After I shipped cleaner domain-focused tools, the legacy &lt;code&gt;excel_*&lt;/code&gt; tools still pulled &lt;strong&gt;45,507 invocations from 214 users&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Once an LLM client or prompt depends on a tool name, that name is load-bearing. Renames are breaking changes. MCP tool surfaces are APIs and they age like APIs.&lt;/p&gt;
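&lt;p&gt;One way to live with that is to keep deprecated names as permanent aliases over the new implementations, so old prompts keep resolving. A hypothetical Python sketch (tool names here are illustrative, not the server's real surface):&lt;/p&gt;

```python
# Hypothetical alias shim: legacy tool names stay callable,
# delegating to the renamed domain-focused implementations.

def format_range(sheet, ref, fmt):
    """New domain-focused tool (illustrative)."""
    return f"formatted {ref} on {sheet} as {fmt}"

# Registry maps every published name to an implementation.
TOOLS = {"format-range": format_range}

# Old names become permanent aliases instead of being deleted.
LEGACY_ALIASES = {"excel_format_range": "format-range"}
for old, new in LEGACY_ALIASES.items():
    TOOLS[old] = TOOLS[new]

def invoke(name, *args):
    """Dispatch a tool call by published name."""
    return TOOLS[name](*args)
```

&lt;p&gt;Both names resolve to the same code path, so clients written against the legacy surface keep working.&lt;/p&gt;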

&lt;h2&gt;
  
  
  What the GitHub history said
&lt;/h2&gt;

&lt;p&gt;I also went through the full repo: &lt;strong&gt;197 issues, 417 PRs, 382 merged, 3 issues open&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I expected most of it to be "please add a wrapper for X." It was not. The recurring themes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;168 issues&lt;/strong&gt; on COM stability, hangs, cleanup, Click-to-Run quirks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;147 issues&lt;/strong&gt; on testing, CI, regression, smoke tests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;138 issues&lt;/strong&gt; on MCP behavior, schemas, tool definitions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;104 issues&lt;/strong&gt; on the CLI, daemon, named pipes, packaging&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;91 issues&lt;/strong&gt; on Power Query, Data Model, DAX&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;27 issues&lt;/strong&gt; on VBA&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most-commented issue in the project's history is "create Excel Tables programmatically." That sounds boring until you realize how much agent usefulness collapses if you cannot turn a loose range into a structured table object.&lt;/p&gt;

&lt;p&gt;The hard part of this project was never exposing Excel APIs. The hard part was surviving STA threading, modal dialogs, workbook locks, daemon startup races, schema drift, and the difference between Office Click-to-Run and MSI installs on a Tuesday.&lt;/p&gt;

&lt;p&gt;That weird path is exactly what file-based libraries skip, and exactly where real users live.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three things I would tell anyone building an MCP server
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Tool names are user intent, not API mechanics.&lt;/strong&gt; &lt;code&gt;format-range&lt;/code&gt;, &lt;code&gt;refresh&lt;/code&gt;, &lt;code&gt;create-pivottable&lt;/code&gt;, &lt;code&gt;capture-sheet&lt;/code&gt;. Each one removes a translation step the model would otherwise have to invent. A few well-named domain tools beat dozens of generic primitives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. State is the whole game.&lt;/strong&gt; Workbooks have state. Excel has state. Calculation mode has state. COM has state. An MCP server for a stateful desktop app is not a stateless HTTP wrapper with fancier transport. Sessions are the product.&lt;/p&gt;
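&lt;p&gt;A hypothetical sketch of what that means in practice: a session registry that keeps one long-lived object per workbook, so state survives across tool calls (the names and fields are illustrative, not the server's actual code):&lt;/p&gt;

```python
# Hypothetical session registry for a stateful desktop app.
# Each workbook gets one long-lived session; repeated tool calls
# reuse it instead of reopening the file and losing state.

class WorkbookSession:
    def __init__(self, path):
        self.path = path
        self.calculation_mode = "automatic"  # Excel-side state to track
        self.dirty = False                   # unsaved edits pending

class SessionManager:
    def __init__(self):
        self._sessions = {}

    def acquire(self, path):
        # Reuse the existing session so calc mode and unsaved edits
        # persist across tool invocations.
        if path not in self._sessions:
            self._sessions[path] = WorkbookSession(path)
        return self._sessions[path]
```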

&lt;p&gt;&lt;strong&gt;3. Once a tool ships, the name is permanent-ish.&lt;/strong&gt; See the 45K invocations of "deprecated" tools above. Treat your tool surface like a public SDK from day one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;The interesting AI work is not only in greenfield apps. It is also in the messy, load-bearing tools businesses already trust.&lt;/p&gt;

&lt;p&gt;Excel is full of formulas, macros, queries, layouts, and conventions that nobody wants to rewrite. Some of that lives in the file. A lot of it only comes alive when Excel opens it.&lt;/p&gt;

&lt;p&gt;Agents can finally work &lt;em&gt;inside&lt;/em&gt; that environment. Not next to it. Not with a copy of it. Inside it.&lt;/p&gt;

&lt;p&gt;That is the gap Excel MCP Server is trying to close, and the telemetry says people are walking through it faster than I thought.&lt;/p&gt;




&lt;p&gt;Open source, MIT, Windows-only because Excel:&lt;br&gt;
&lt;strong&gt;&lt;a href="https://github.com/sbroenne/mcp-server-excel" rel="noopener noreferrer"&gt;github.com/sbroenne/mcp-server-excel&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you are building MCP servers for other desktop apps, or you have war stories about COM, modal dialogs, or "we shipped a tool rename and broke 200 prompts" — I want to hear them.&lt;/p&gt;

</description>
      <category>excel</category>
      <category>ai</category>
      <category>mcp</category>
      <category>opensource</category>
    </item>
    <item>
      <title>skillpm - Package Manager for Agent Skills. Built on npm.</title>
      <dc:creator>Stefan Broenner</dc:creator>
      <pubDate>Wed, 25 Feb 2026 19:12:17 +0000</pubDate>
      <link>https://forem.com/sbroenne/skillpm-package-manager-for-agent-skills-built-on-npm-3d31</link>
      <guid>https://forem.com/sbroenne/skillpm-package-manager-for-agent-skills-built-on-npm-3d31</guid>
      <description>&lt;p&gt;Every &lt;a href="https://agentskills.io/home" rel="noopener noreferrer"&gt;agent skill&lt;/a&gt; today is a monolith.&lt;/p&gt;

&lt;p&gt;Authors cram React patterns, TypeScript best practices, and testing guidelines into a single massive SKILL.md — because there's no way to say "just depend on that other skill." No registry. No dependency management. No versioning. The &lt;a href="https://agentskills.io" rel="noopener noreferrer"&gt;Agent Skills spec&lt;/a&gt; defines what a skill &lt;em&gt;is&lt;/em&gt;, but says nothing about how to publish, install, or share them.&lt;/p&gt;

&lt;p&gt;We (Sonnet 4.6 &amp;amp; I) built &lt;strong&gt;&lt;a href="https://skillpm.dev" rel="noopener noreferrer"&gt;skillpm&lt;/a&gt;&lt;/strong&gt; to fix that — a lightweight orchestration layer on top of npm. ~630 lines of code, 3 dependencies, zero reinvention. Small skills that compose, not monoliths that overlap.&lt;/p&gt;

&lt;h2&gt;
  
  
  The idea: don't reinvent npm — extend it
&lt;/h2&gt;

&lt;p&gt;When we started, the tempting path was to build a custom registry, a custom resolver, a custom lockfile format. We chose the opposite.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;skillpm is a thin orchestration layer on top of npm.&lt;/strong&gt; Same &lt;code&gt;package.json&lt;/code&gt;. Same &lt;code&gt;node_modules/&lt;/code&gt;. Same &lt;code&gt;package-lock.json&lt;/code&gt;. Same registry (npmjs.org). skillpm only adds what npm can't do on its own:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Scanning&lt;/strong&gt; &lt;code&gt;node_modules/&lt;/code&gt; for packages containing &lt;code&gt;skills/*/SKILL.md&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wiring&lt;/strong&gt; discovered skills into agent directories via &lt;a href="https://www.npmjs.com/package/skills" rel="noopener noreferrer"&gt;&lt;code&gt;skills&lt;/code&gt;&lt;/a&gt; (Claude, Cursor, VS Code, Codex, and many more)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuring MCP servers&lt;/strong&gt; declared by skills, transitively across the dependency tree, via &lt;a href="https://github.com/neondatabase/add-mcp" rel="noopener noreferrer"&gt;&lt;code&gt;add-mcp&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. Everything else — resolution, caching, lockfiles, audit, semver — is npm.&lt;/p&gt;
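&lt;p&gt;The scan in step 1 is just a directory-pattern walk over &lt;code&gt;node_modules/&lt;/code&gt;. A minimal Python sketch of the idea (skillpm itself is JavaScript, so this is not its actual code):&lt;/p&gt;

```python
# Sketch of the scan step: find installed packages that ship skills.
# skillpm is JavaScript; this Python version just illustrates the walk.
from pathlib import Path, PurePosixPath

# Plain packages plus scoped packages (@scope/pkg).
SKILL_PATTERNS = ("*/skills/*/SKILL.md", "@*/*/skills/*/SKILL.md")

def is_skill_file(relpath):
    """True if a path relative to node_modules/ points at a skill definition."""
    p = PurePosixPath(relpath)
    return any(p.match(pat) for pat in SKILL_PATTERNS)

def find_skills(node_modules):
    """Return every SKILL.md shipped by an installed package."""
    root = Path(node_modules)
    return sorted(
        p for p in root.rglob("SKILL.md")
        if is_skill_file(p.relative_to(root).as_posix())
    )
```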

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install a skill (no global install needed)&lt;/span&gt;
npx skillpm &lt;span class="nb"&gt;install &lt;/span&gt;excel-mcp-skill

&lt;span class="c"&gt;# List what's installed&lt;/span&gt;
npx skillpm list

&lt;span class="c"&gt;# Scaffold a new skill&lt;/span&gt;
npx skillpm init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you run &lt;code&gt;npx skillpm install &amp;lt;skill&amp;gt;&lt;/code&gt;, here's what happens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npx skillpm install react-patterns
  │
  ▼
📦 npm install react-patterns
   npm handles resolution, download, lockfile
  │
  ▼
🔍 Scan node_modules/
   find packages with skills/*/SKILL.md
  │
  ▼
🔗 npx skills add ./node_modules/...
   wire into Claude, Cursor, VS Code, Codex...
  │
  ▼
📄 Read skillpm.mcpServers
   walk entire dependency tree
  │
  ▼
🔌 npx add-mcp &amp;lt;server&amp;gt;
   configure each MCP server across agents
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four tools, each doing one thing well, orchestrated together.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a skill package looks like
&lt;/h2&gt;

&lt;p&gt;A skill is just an npm package with a specific directory structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my-skill/
├── package.json          # keywords: ["agent-skill"]
├── README.md
├── LICENSE
└── skills/
    └── my-skill/
        ├── SKILL.md      # The skill definition
        ├── scripts/      # Optional executable scripts
        ├── references/   # Optional reference docs
        └── assets/       # Optional templates/data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;SKILL.md&lt;/code&gt; is where the magic lives — YAML frontmatter for metadata, Markdown body with instructions for the agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-skill&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Refactor&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;React&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;class&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;components&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;functional&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;components&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;hooks."&lt;/span&gt;
&lt;span class="na"&gt;allowed-tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Bash Read Edit&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="c1"&gt;# React Refactoring&lt;/span&gt;

&lt;span class="c1"&gt;## When to use this skill&lt;/span&gt;
&lt;span class="s"&gt;Use when the user asks to modernize React components...&lt;/span&gt;

&lt;span class="c1"&gt;## Instructions&lt;/span&gt;
&lt;span class="s"&gt;1. Identify class components in the target files&lt;/span&gt;
&lt;span class="s"&gt;2. Convert lifecycle methods to useEffect hooks&lt;/span&gt;
&lt;span class="s"&gt;3. ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Skills can depend on other skills
&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting. Because skills are npm packages, they can depend on each other. &lt;strong&gt;This solves the biggest problem with Agent Skills today: prompt bloat.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without dependency management, if you want a skill that builds a full-stack React app, you have to copy-paste instructions for React, TypeScript, testing, and styling into one massive &lt;code&gt;SKILL.md&lt;/code&gt;. The agent gets overwhelmed, context windows fill up, and the skill becomes impossible to maintain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;skillpm brings standard software engineering practices to Agent Skills.&lt;/strong&gt; You don't copy-paste code anymore; you shouldn't copy-paste prompts either. With skillpm, you just declare dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fullstack-react"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"keywords"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"agent-skill"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dependencies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"react-patterns"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"^2.0.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"typescript-best-practices"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"^1.3.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"testing-with-vitest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"^1.0.0"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"skillpm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"@anthropic/mcp-server-filesystem"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;npx skillpm install fullstack-react&lt;/code&gt; resolves the entire tree — all three dependencies get installed, scanned, wired into agents, and their MCP servers configured. One command.&lt;/p&gt;

&lt;p&gt;Instead of monolithic, 500-line prompt files, you can build small, composable, single-purpose skills that build on top of each other. It's modularity for AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP server configuration, handled
&lt;/h2&gt;

&lt;p&gt;Skills often need MCP servers to function. The &lt;code&gt;skillpm.mcpServers&lt;/code&gt; field declares those requirements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"skillpm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"@anthropic/mcp-server-filesystem"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"https://mcp.context7.com/mcp"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;skillpm walks the &lt;em&gt;entire&lt;/em&gt; dependency tree, collects all MCP server requirements (deduplicated), and configures each one via &lt;a href="https://github.com/neondatabase/add-mcp" rel="noopener noreferrer"&gt;add-mcp&lt;/a&gt;. The user never has to manually configure MCP servers.&lt;/p&gt;
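&lt;p&gt;That transitive collection is a small tree walk. A hypothetical Python sketch (the real skillpm is JavaScript and reads &lt;code&gt;package.json&lt;/code&gt; files; here the tree is plain dicts for illustration):&lt;/p&gt;

```python
# Sketch: collect skillpm.mcpServers across a dependency tree, deduplicated.
# Each node stands in for a resolved package.json (illustrative shape).

def collect_mcp_servers(pkg, seen=None):
    """Depth-first walk that keeps first-seen order and drops duplicates."""
    if seen is None:
        seen = []
    for server in pkg.get("skillpm", {}).get("mcpServers", []):
        if server not in seen:  # dedupe across the whole tree
            seen.append(server)
    for dep in pkg.get("dependencies", []):
        collect_mcp_servers(dep, seen)
    return seen
```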

&lt;h2&gt;
  
  
  The ecosystem today
&lt;/h2&gt;

&lt;p&gt;There are already &lt;strong&gt;90+ skill packages&lt;/strong&gt; on npm with the &lt;code&gt;agent-skill&lt;/code&gt; keyword. We built an &lt;a href="https://skillpm.dev/registry/" rel="noopener noreferrer"&gt;Agent Skills Registry&lt;/a&gt; that indexes them all — searchable, filterable by keyword, sortable by downloads or recency.&lt;/p&gt;

&lt;p&gt;Most existing packages follow the original spec (root &lt;code&gt;SKILL.md&lt;/code&gt;) rather than our npm packaging convention (&lt;code&gt;skills/&amp;lt;name&amp;gt;/SKILL.md&lt;/code&gt;). skillpm handles both — legacy packages get installed with a friendly warning pointing to the migration guide.&lt;/p&gt;

&lt;h2&gt;
  
  
  Create and publish a skill in 60 seconds
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;my-awesome-skill &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;my-awesome-skill
npx skillpm init
&lt;span class="c"&gt;# Edit skills/my-awesome-skill/SKILL.md with your instructions&lt;/span&gt;
npx skillpm publish
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's literally it. Your skill is now on npmjs.org, discoverable by anyone running &lt;code&gt;npx skillpm install my-awesome-skill&lt;/code&gt;, and automatically wired into their agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why not just use npm directly?
&lt;/h2&gt;

&lt;p&gt;You can! Skills are valid npm packages. But skillpm adds what &lt;code&gt;npm install&lt;/code&gt; alone can't do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Scan for &lt;code&gt;SKILL.md&lt;/code&gt; files in installed packages&lt;/li&gt;
&lt;li&gt;✅ Link skills into agent directories via &lt;a href="https://www.npmjs.com/package/skills" rel="noopener noreferrer"&gt;&lt;code&gt;skills&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;✅ Configure MCP servers via &lt;a href="https://github.com/neondatabase/add-mcp" rel="noopener noreferrer"&gt;&lt;code&gt;add-mcp&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;✅ Validate skill packages before publishing&lt;/li&gt;
&lt;li&gt;✅ Show you what skills are installed and where they're wired&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's the gap skillpm fills. It's npm + skill awareness.&lt;/p&gt;

&lt;h2&gt;
  
  
  Built on the shoulders of giants
&lt;/h2&gt;

&lt;p&gt;We deliberately don't reinvent anything. skillpm shells out to four battle-tested tools:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;How skillpm uses it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;npm&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Package management&lt;/td&gt;
&lt;td&gt;All installs, resolution, lockfiles, caching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;a href="https://www.npmjs.com/package/skills" rel="noopener noreferrer"&gt;&lt;strong&gt;skills&lt;/strong&gt;&lt;/a&gt; (Vercel)&lt;/td&gt;
&lt;td&gt;Agent directory linking&lt;/td&gt;
&lt;td&gt;Wires skills into agent directories&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/neondatabase/add-mcp" rel="noopener noreferrer"&gt;&lt;strong&gt;add-mcp&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;MCP server configuration&lt;/td&gt;
&lt;td&gt;Configures servers across agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://www.npmjs.com/package/skills-ref" rel="noopener noreferrer"&gt;&lt;strong&gt;skills-ref&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Spec validation&lt;/td&gt;
&lt;td&gt;Validates SKILL.md during publish&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Get started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Try it now — no install required&lt;/span&gt;
npx skillpm &lt;span class="nb"&gt;install &lt;/span&gt;skillpm-skill

&lt;span class="c"&gt;# Browse the registry&lt;/span&gt;
&lt;span class="c"&gt;# https://skillpm.dev/registry/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;📦 &lt;strong&gt;npm&lt;/strong&gt;: &lt;a href="https://www.npmjs.com/package/skillpm" rel="noopener noreferrer"&gt;npmjs.com/package/skillpm&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📖 &lt;strong&gt;Docs&lt;/strong&gt;: &lt;a href="https://skillpm.dev" rel="noopener noreferrer"&gt;skillpm.dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🔧 &lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/sbroenne/skillpm" rel="noopener noreferrer"&gt;github.com/sbroenne/skillpm&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📋 &lt;strong&gt;Agent Skills spec&lt;/strong&gt;: &lt;a href="https://agentskills.io" rel="noopener noreferrer"&gt;agentskills.io&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  We're actively looking for contributors! 🤝
&lt;/h2&gt;

&lt;p&gt;skillpm is a young project, and there's a lot of room to grow. Whether it's adding support for new agent directories, improving the CLI experience, or building out the registry, we'd love your help. Check out the &lt;a href="https://github.com/sbroenne/skillpm" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; and look for the &lt;code&gt;good first issue&lt;/code&gt; label!&lt;/p&gt;

&lt;p&gt;We'd love your feedback. Open an issue, try publishing a skill, or just tell us what you think.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What kind of skills are you building for your AI agents? Is this useful? What are we missing (e.g. custom agents, slash prompts)? Let me know in the comments!&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;skillpm is MIT licensed and open source.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>npm</category>
      <category>showdev</category>
      <category>agentskills</category>
      <category>skillsengineering</category>
    </item>
    <item>
      <title>Are you human? Or are you malware?</title>
      <dc:creator>Stefan Broenner</dc:creator>
      <pubDate>Thu, 19 Feb 2026 09:18:24 +0000</pubDate>
      <link>https://forem.com/sbroenne/are-you-human-or-are-you-malware-a1k</link>
      <guid>https://forem.com/sbroenne/are-you-human-or-are-you-malware-a1k</guid>
      <description>&lt;p&gt;Someone opened a GitHub issue on my Excel MCP Server project questioning if I’m actually human. The reasons behind this assumption included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High commit velocity&lt;/li&gt;
&lt;li&gt;An AI-generated demo video&lt;/li&gt;
&lt;li&gt;Consistent structure and documentation&lt;/li&gt;
&lt;li&gt;The belief that my work might be AI-generated or even malware&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Great analysis, great question!&lt;/p&gt;

&lt;p&gt;My response was straightforward: I’m human — I just use AI tools very deliberately. GitHub Copilot helps me build faster, and tools like HeyGen enhance my communication.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This interaction highlighted an important realization: the line between “human work” and “AI-assisted work” is already blurry — and that’s okay.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Open source is evolving, developer workflows are changing, and trust models will need to adapt as well. These are interesting times to be building software in public. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/sbroenne/mcp-server-excel/issues/479" rel="noopener noreferrer"&gt;https://github.com/sbroenne/mcp-server-excel/issues/479&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>githubcopilot</category>
      <category>devjournal</category>
      <category>agentic</category>
    </item>
    <item>
      <title>pytest-aitest: Unit Tests Can't Test Your MCP Server. AI Can.</title>
      <dc:creator>Stefan Broenner</dc:creator>
      <pubDate>Fri, 13 Feb 2026 03:58:55 +0000</pubDate>
      <link>https://forem.com/sbroenne/pytest-aitest-unit-tests-cant-test-your-mcp-server-ai-can-1ebn</link>
      <guid>https://forem.com/sbroenne/pytest-aitest-unit-tests-cant-test-your-mcp-server-ai-can-1ebn</guid>
      <description>&lt;h2&gt;
  
  
  I Learned This the Hard Way
&lt;/h2&gt;

&lt;p&gt;I built two MCP servers — &lt;a href="https://github.com/sbroenne/excel-mcp-server" rel="noopener noreferrer"&gt;Excel MCP Server&lt;/a&gt; and &lt;a href="https://github.com/sbroenne/windows-mcp-server" rel="noopener noreferrer"&gt;Windows MCP Server&lt;/a&gt;. Both had solid test suites. Both broke the moment a real LLM tried to use them.&lt;/p&gt;

&lt;p&gt;I spent weeks doing manual testing with GitHub Copilot. Open a chat, type a prompt, watch the LLM pick the wrong tool, tweak the description, try again. Sometimes the design was fundamentally broken, and I was deep into a wild goose chase before realizing the whole approach needed rethinking.&lt;/p&gt;

&lt;p&gt;The failure modes were always the same:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The LLM picks the wrong tool out of 15 similar-sounding options&lt;/li&gt;
&lt;li&gt;It passes &lt;code&gt;{"account_id": "checking"}&lt;/code&gt; when the parameter is &lt;code&gt;account&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;It ignores the system prompt entirely&lt;/li&gt;
&lt;li&gt;It asks the user "Would you like me to do that?" instead of just doing it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt; Because I tested the code, not the AI interface.&lt;/p&gt;

&lt;p&gt;For LLMs, your API isn't functions and types — it's &lt;strong&gt;tool descriptions, parameter schemas, and system prompts&lt;/strong&gt;. That's what the model actually reads. No compiler catches a bad tool description. No unit test validates that an LLM will pick the right tool. And if you also inject Agent Skills — do they actually help? Or make things worse? Do LLMs really behave the way you think they will?&lt;/p&gt;

&lt;p&gt;(No. They don't.)&lt;/p&gt;
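&lt;p&gt;The parameter-name failure mode is worth making concrete. Here is a toy, library-free sketch (this is &lt;em&gt;not&lt;/em&gt; pytest-aitest API, just an illustration): the mismatch only surfaces at the schema layer, a layer ordinary unit tests never touch.&lt;/p&gt;

```python
# Toy illustration (NOT part of pytest-aitest): the "wrong parameter
# name" failure mode at the schema level. The tool declares `account`,
# but the model emits `account_id` -- a call that no unit test on the
# underlying function would ever exercise.

TOOL_SCHEMA = {
    "name": "get_balance",
    "parameters": {"account": {"type": "string"}},
}

def unknown_arguments(schema, arguments):
    """Return argument names the tool schema does not declare."""
    declared = schema["parameters"]
    return [name for name in arguments if name not in declared]

# A plausible-looking call from the model, with the wrong key:
bad_call = {"account_id": "checking"}
print(unknown_arguments(TOOL_SCHEMA, bad_call))  # ['account_id']
```

&lt;p&gt;The value is perfectly reasonable; only the key is wrong. Your code-level tests pass, the LLM-facing call fails.&lt;/p&gt;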

&lt;p&gt;So I built &lt;a href="https://github.com/sbroenne/pytest-aitest" rel="noopener noreferrer"&gt;pytest-aitest&lt;/a&gt;, heavily inspired by &lt;a href="https://github.com/mykhaliev/agent-benchmark" rel="noopener noreferrer"&gt;agent-benchmark&lt;/a&gt; by Dmytro Mykhaliev. &lt;/p&gt;

&lt;p&gt;It's a pytest plugin — &lt;code&gt;uv add pytest-aitest&lt;/code&gt; and you're done. No new CLI, no new syntax. Works with your existing fixtures, markers, and CI/CD pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Write Tests as Prompts
&lt;/h2&gt;

&lt;p&gt;Your test &lt;em&gt;is&lt;/em&gt; a prompt. Write what a user would say. Let the LLM figure out how to use your tools. Assert on what happened.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pytest_aitest&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Provider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MCPServer&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_balance_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Provider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;azure/gpt-5-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;mcp_servers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;MCPServer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-m&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_banking_server&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s my checking balance?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool_was_called&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_balance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;If this fails, the problem isn't your code — it's your tool description. The LLM couldn't figure out which tool to call or what parameters to pass. Fix the description, run again. This is &lt;strong&gt;TDD for AI interfaces&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Red/Green/Refactor Cycle — For Tool Descriptions
&lt;/h2&gt;
&lt;h3&gt;
  
  
  🔴 Red: Write a failing test
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_transfer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Move $200 from checking to savings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool_was_called&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transfer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The LLM reads your tool descriptions, gets confused, calls the wrong thing. Test fails.&lt;/p&gt;
&lt;h3&gt;
  
  
  🟢 Green: Fix the interface
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before — too vague
&lt;/span&gt;&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transfer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;from_acct&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;to_acct&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Transfer money.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="c1"&gt;# After — the LLM knows exactly what to do
&lt;/span&gt;&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transfer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;from_account&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;to_account&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Transfer money between accounts (checking, savings).
    Amount must be positive. Returns new balances for both accounts.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Run again. Test passes.&lt;/p&gt;
&lt;h3&gt;
  
  
  🔄 Refactor: Let AI analysis tell you what else to fix
&lt;/h3&gt;

&lt;p&gt;This is where it gets interesting. pytest-aitest doesn't just tell you pass/fail — it runs a second LLM that analyzes every failure and tells you &lt;em&gt;why&lt;/em&gt; it happened and &lt;em&gt;what to improve&lt;/em&gt;. Traditional testing requires a human to interpret failures. Here, the AI does it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyof1k8ii6jvtetjy3bvt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyof1k8ii6jvtetjy3bvt.png" alt="Screenshot of pytest-aitest report showing deploy recommendation for gpt-5-mini, pass rate comparison across models, cost metrics, and AI-generated failure analysis"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The report tells you which model to deploy, why it wins, and what to fix. It analyzes cost efficiency, tool usage patterns, and prompt effectiveness across all your configurations. Unused tools? The AI flags them. Prompt causing permission-seeking behavior? It explains the mechanism. &lt;a href="https://sbroenne.github.io/pytest-aitest/demo/hero-report.html" rel="noopener noreferrer"&gt;See a full sample report →&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Compare Models, Prompts, and Server Versions
&lt;/h2&gt;

&lt;p&gt;The real power is comparison. Test multiple configurations against the same test suite:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;MODELS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;PROMPTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;brief&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Be concise.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;detailed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain your reasoning.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;AGENTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prompt_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Provider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;azure/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;mcp_servers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;banking_server&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;prompt_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;PROMPTS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nd"&gt;@pytest.mark.parametrize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AGENTS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_balance_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s my checking balance?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;4 configurations. Same tests. The report generates an &lt;strong&gt;Agent Leaderboard&lt;/strong&gt; — winner by pass rate, then cost as tiebreaker:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Pass Rate&lt;/th&gt;
&lt;th&gt;Tokens&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5-mini-brief&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;747&lt;/td&gt;
&lt;td&gt;$0.002&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-4.1-brief&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;560&lt;/td&gt;
&lt;td&gt;$0.008&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5-mini-detailed&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;1,203&lt;/td&gt;
&lt;td&gt;$0.004&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Deploy: gpt-5-mini&lt;/strong&gt; (brief prompt) — 100% pass rate at lowest cost.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The same pattern works for A/B testing server versions (did your refactor break tool discoverability?), comparing system prompts, and measuring the impact of &lt;a href="https://agentskills.io" rel="noopener noreferrer"&gt;Agent Skills&lt;/a&gt;.&lt;/p&gt;
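&lt;p&gt;A minimal sketch of that A/B setup, using only the standard library (the &lt;code&gt;v2&lt;/code&gt; module name is a hypothetical placeholder): build one configuration per model and server build, then feed them to the same &lt;code&gt;parametrize&lt;/code&gt; pattern shown above.&lt;/p&gt;

```python
# Sketch: the MODELS x PROMPTS matrix pattern, applied to A/B testing
# two server builds. "my_banking_server_v2" is a hypothetical
# placeholder -- substitute your own refactored build.
from itertools import product

MODELS = ["gpt-5-mini", "gpt-4.1"]
SERVERS = {
    "v1": ["python", "-m", "my_banking_server"],     # current release
    "v2": ["python", "-m", "my_banking_server_v2"],  # refactored build
}

# One configuration per (model, server version); each dict would back
# one parametrized Agent, exactly as in the prompt-comparison example.
CONFIGS = [
    {"id": f"{model}-{version}", "model": f"azure/{model}", "command": cmd}
    for model, (version, cmd) in product(MODELS, SERVERS.items())
]
print([c["id"] for c in CONFIGS])
# ['gpt-5-mini-v1', 'gpt-5-mini-v2', 'gpt-4.1-v1', 'gpt-4.1-v2']
```

&lt;p&gt;If &lt;code&gt;v2&lt;/code&gt; drops below &lt;code&gt;v1&lt;/code&gt; on the leaderboard, your refactor hurt tool discoverability before any user ever saw it.&lt;/p&gt;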
&lt;h2&gt;
  
  
  Multi-Turn Sessions
&lt;/h2&gt;

&lt;p&gt;Real users don't ask one question. They have conversations:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@pytest.mark.session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;banking-chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TestBankingConversation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_check_balance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s my checking balance?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_transfer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Agent remembers we were talking about checking
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Transfer $200 to savings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool_was_called&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transfer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_verify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Agent remembers the transfer
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;aitest_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What are my new balances?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Tests share conversation history. The report shows the full session flow with sequence diagrams.&lt;/p&gt;
&lt;h2&gt;
  
  
  Who This Is For
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MCP server authors&lt;/strong&gt; — Validate that LLMs can actually use your tools, not just that the code works&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent builders&lt;/strong&gt; — Find the cheapest model + prompt combo that passes your test suite&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Teams shipping AI products&lt;/strong&gt; — Gate deployments on LLM-facing regression tests in CI/CD&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Works with &lt;a href="https://docs.litellm.ai/docs/providers" rel="noopener noreferrer"&gt;100+ LLM providers&lt;/a&gt; via LiteLLM — Azure, OpenAI, Anthropic, Google, local models, whatever you're running.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Key Insight
&lt;/h2&gt;

&lt;p&gt;The test is a prompt. The LLM is the test harness. The report tells you what to fix.&lt;/p&gt;

&lt;p&gt;Traditional testing validates that your code works. pytest-aitest validates that &lt;strong&gt;an LLM can understand and use your code&lt;/strong&gt;. These are different things, and the gap between them is where your production bugs live.&lt;/p&gt;

&lt;p&gt;Your tool descriptions are an API. Test them like one.&lt;/p&gt;
&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;

&lt;p&gt;pytest-aitest is open source. Contributions welcome!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/sbroenne/pytest-aitest" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Star pytest-aitest on GitHub&lt;/a&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sbroenne.github.io/pytest-aitest/" rel="noopener noreferrer"&gt;Documentation&lt;/a&gt; — Full guides and API reference&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/pytest-aitest/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt; — &lt;code&gt;uv add pytest-aitest&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sbroenne.github.io/pytest-aitest/demo/hero-report.html" rel="noopener noreferrer"&gt;Sample Report&lt;/a&gt; — See AI analysis in action&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/sbroenne" rel="noopener noreferrer"&gt;
        sbroenne
      &lt;/a&gt; / &lt;a href="https://github.com/sbroenne/pytest-aitest" rel="noopener noreferrer"&gt;
        pytest-aitest
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      The testing framework for skill engineering. Test tool descriptions, prompt templates, agent skills, and custom agents with real LLMs. AI analyzes results and tells you what to fix.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;blockquote&gt;
&lt;p&gt;🗄️ &lt;strong&gt;This project is archived and no longer maintained.&lt;/strong&gt; It has been replaced by &lt;a href="https://github.com/sbroenne/pytest-skill-engineering" rel="noopener noreferrer"&gt;pytest-skill-engineering&lt;/a&gt;. Do not use this project for new work. This repository is kept as a read-only archive.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;pytest-aitest&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a href="https://pypi.org/project/pytest-aitest/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/8ddb5f9cd42c915ff6ecddb74e709fd948465b78b0c2b25086c7b425187c6645/68747470733a2f2f696d672e736869656c64732e696f2f707970692f762f7079746573742d616974657374" alt="PyPI version"&gt;&lt;/a&gt;
&lt;a href="https://pypi.org/project/pytest-aitest/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/f4cc6b256242d23285dafa42f03f6a3465c0458ea4ae72515d1bbe6b26537ab9/68747470733a2f2f696d672e736869656c64732e696f2f707970692f707976657273696f6e732f7079746573742d616974657374" alt="Python versions"&gt;&lt;/a&gt;
&lt;a href="https://github.com/sbroenne/pytest-aitest/actions/workflows/ci.yml" rel="noopener noreferrer"&gt;&lt;img src="https://github.com/sbroenne/pytest-aitest/actions/workflows/ci.yml/badge.svg" alt="CI"&gt;&lt;/a&gt;
&lt;a href="https://opensource.org/licenses/MIT" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/fdf2982b9f5d7489dcf44570e714e3a15fce6253e0cc6b5aa61a075aac2ff71b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d4d49542d79656c6c6f772e737667" alt="License: MIT"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Skill Engineering. Test-driven. AI-analyzed.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A pytest plugin for skill engineering — test your MCP server tools, prompt templates, agent skills, and custom agents with real LLMs. Red/Green/Refactor for the skill stack. Let AI analysis tell you what to fix.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Why?&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;Modern AI systems are built on &lt;strong&gt;skill engineering&lt;/strong&gt; — the discipline of designing modular, reliable, callable capabilities that an LLM can discover, invoke, and orchestrate to perform real tasks. Skills are what separate "text generator" from "agent that actually does things."&lt;/p&gt;
&lt;p&gt;An MCP server is the runtime for those skills. It doesn't ship alone — it comes bundled with the &lt;strong&gt;full skill engineering stack&lt;/strong&gt;: &lt;strong&gt;tools&lt;/strong&gt; (callable functions), &lt;strong&gt;prompt templates&lt;/strong&gt; (server-side reasoning starters), &lt;strong&gt;agent skills&lt;/strong&gt; (domain knowledge…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/sbroenne/pytest-aitest" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;




</description>
      <category>python</category>
      <category>mcp</category>
      <category>testing</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
