<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Gijs Jansen</title>
    <description>The latest articles on Forem by Gijs Jansen (@datagobes).</description>
    <link>https://forem.com/datagobes</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3799772%2F7056aadc-f0be-4d72-8378-d06ab8e25fe4.jpeg</url>
      <title>Forem: Gijs Jansen</title>
      <link>https://forem.com/datagobes</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/datagobes"/>
    <language>en</language>
    <item>
      <title>Building an AI Community Jukebox Overnight (and What Broke by Morning)</title>
      <dc:creator>Gijs Jansen</dc:creator>
      <pubDate>Thu, 12 Mar 2026 16:48:18 +0000</pubDate>
      <link>https://forem.com/datagobes/building-an-ai-community-jukebox-overnight-and-what-broke-by-morning-4nfn</link>
      <guid>https://forem.com/datagobes/building-an-ai-community-jukebox-overnight-and-what-broke-by-morning-4nfn</guid>
      <description>&lt;h2&gt;
  
  
  The $50 Challenge
&lt;/h2&gt;

&lt;p&gt;A few days ago I got accepted into the MiniMax developer program. The email was short and direct: here's a $50 API voucher, we're curious to see what you'll build. That's it. No strings, no required deliverable, just fifty dollars and a question.&lt;/p&gt;

&lt;p&gt;Some context on me: I've spent 15 years building backends, data platforms, and pipelines. I'm comfortable with databases, APIs, infrastructure — the stuff nobody sees. What I've never done is build something consumer-facing and try to get people to use it. Never marketed a product, never asked anyone to pay for something I made. That whole muscle is atrophied, if it ever existed.&lt;/p&gt;

&lt;p&gt;So I made a bet with myself. I'd use the MiniMax credits to build a community feature for my personal site: an AI-powered jukebox where visitors could generate music tracks and interact with each other's creations. Then I'd try to sustain it through crowdfunding. If I can't convince a handful of people that this is worth keeping alive, there's probably no point trying anything more ambitious on the commercial side.&lt;/p&gt;

&lt;p&gt;The other piece of context: I didn't build this alone. Claude Code was my coding partner through the entire process. I'm going to be transparent about that from the start because the human+AI dynamic is a big part of this story. What I directed, what I caught, where Claude surprised me, where it fell short — that's all in here. This isn't a hype piece about AI-assisted development. It's a report from someone figuring it out in real time.&lt;/p&gt;

&lt;p&gt;Here's how four sessions across one night and one morning turned a $50 voucher into a live community feature.&lt;/p&gt;

&lt;h2&gt;
  
  
  From Zero to Jukebox
&lt;/h2&gt;

&lt;p&gt;Session one started around 11pm on a weeknight. I had a rough idea: visitors type a prompt describing a song, MiniMax generates it, the track shows up in a public feed. No accounts, no logins, just show up and make music.&lt;/p&gt;

&lt;p&gt;I described the vision to Claude and it scaffolded the full architecture in one pass. The first commit touched 30 files: 11 UI components, 5 API routes, a Supabase edge function for music generation, database migrations, and 106 tests. One commit.&lt;/p&gt;

&lt;p&gt;The core generation flow uses a fire-and-forget pattern. A visitor submits a prompt, the API creates a pending track in Supabase, then triggers an edge function that calls MiniMax's music-2.5+ model. When generation completes (usually 30-60 seconds), the edge function updates the track status and stores the audio URL. The visitor's browser polls for updates.&lt;/p&gt;
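
&lt;p&gt;A minimal sketch of that track lifecycle, written as pure functions so the state transitions are easy to follow. The status names, the &lt;code&gt;Track&lt;/code&gt; shape, and every function name here are assumptions for illustration; the real schema and edge function are not shown in this post.&lt;/p&gt;

```typescript
// Hypothetical shapes: the real table schema and status names may differ.
type TrackStatus = "pending" | "generating" | "ready" | "failed";

interface Track {
  id: string;
  prompt: string;
  status: TrackStatus;
  audioUrl: string | null;
}

// Step 1: the API route records the request and returns immediately.
function createPendingTrack(id: string, prompt: string): Track {
  return { id, prompt, status: "pending", audioUrl: null };
}

// Step 2: the edge function marks the track while the model call is in flight.
function markGenerating(track: Track): Track {
  return { ...track, status: "generating" };
}

// Step 3: the edge function stores the outcome. A missing audio URL is
// treated as a failure so the track never stays in "generating".
function finishGeneration(track: Track, audioUrl: string | null): Track {
  if (audioUrl === null) {
    return { ...track, status: "failed", audioUrl: null };
  }
  return { ...track, status: "ready", audioUrl };
}

// Step 4: the browser keeps polling only while the track is not final.
function shouldKeepPolling(track: Track): boolean {
  return track.status === "pending" || track.status === "generating";
}
```

&lt;p&gt;Keeping the transitions explicit like this makes the two terminal states obvious: polling stops on either &lt;code&gt;ready&lt;/code&gt; or &lt;code&gt;failed&lt;/code&gt;, and nothing else.&lt;/p&gt;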

&lt;p&gt;One decision I made early: no user accounts. Visitors interact anonymously, identified only by a daily-rotating hash of their IP and user agent. This keeps things frictionless while still enabling per-visitor rate limiting. It was also a deliberate privacy stance — I don't want to know who my visitors are, and I don't want their email addresses. I need just enough identity to prevent spam and enable reactions, not one bit more. A daily-rotating hash gives me exactly that. It means fire reactions reset every day, which I initially saw as a bug but now see as a feature: tracks have to earn their fires fresh each day.&lt;/p&gt;

&lt;p&gt;Here's the visitor hash — it rotates daily so there's no persistent tracking:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
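&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;createHash&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;node:crypto&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;getVisitorHash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;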
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;forwarded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;x-forwarded-for&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ip&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;forwarded&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]?.&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;unknown&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ua&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user-agent&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;unknown&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;salt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getDailySalt&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;createHash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sha256&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ip&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;|&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;ua&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;|&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;salt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;hex&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
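
&lt;p&gt;The snippet above calls &lt;code&gt;getDailySalt()&lt;/code&gt; without showing it. One way such a salt could be derived is sketched here, assuming it combines a server-side secret with the current UTC date. The name &lt;code&gt;dailySaltFor&lt;/code&gt;, its parameters, and the example secret are hypothetical, not the site's actual code.&lt;/p&gt;

```typescript
// Hypothetical helper: the post calls getDailySalt() but does not show it.
// Assumption: the salt combines a server-side secret with the current UTC date.
function dailySaltFor(secret: string, now: Date): string {
  // toISOString() is always UTC, e.g. "2026-03-12T23:59:59.000Z",
  // so the first ten characters are the UTC calendar day.
  const utcDay = now.toISOString().slice(0, 10);
  return secret + ":" + utcDay;
}

// Late on 12 March (UTC) the salt carries that date.
const lateEvening = dailySaltFor("example-secret", new Date("2026-03-12T23:59:59Z"));

// One second later the UTC date rolls over, so the salt changes
// and every visitor hash derived from it changes with it.
const justAfterMidnight = dailySaltFor("example-secret", new Date("2026-03-13T00:00:00Z"));
```

&lt;p&gt;Under this assumption the rotation happens at UTC midnight, not at each visitor's local midnight, which is worth keeping in mind when reasoning about when reactions reset.&lt;/p&gt;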



&lt;p&gt;The UI followed the site's "Ember" branding — dark background, warm cream text, that &lt;code&gt;#c75c2c&lt;/code&gt; accent color. Claude nailed the terminal aesthetic without me having to micro-manage component styles. The track cards show a waveform visualization, playback controls, and the original prompt. It even matched the monospace vibe of the rest of the site without being told to. It looks like it belongs, which is more than I expected from a first pass.&lt;/p&gt;

&lt;p&gt;Around 12:30am I generated the first track. I typed "lo-fi jazz for debugging at midnight" and waited. Thirty seconds later, a piano riff with brushed drums started playing through my laptop speakers. It sounded... good? Like, genuinely good. I sat there for a minute just listening, slightly stunned that this worked on the first try. The apartment was dead quiet except for this warm little piano loop bleeding out of my laptop, and I remember thinking: I should be asleep, but I don't want to stop this.&lt;/p&gt;

&lt;p&gt;That feeling wore off quickly. Because the next thing I had to do was actually review what had just been committed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Vibe Coding Reality Check
&lt;/h2&gt;

&lt;p&gt;Thirty files in one commit. Let's sit with that for a second.&lt;/p&gt;

&lt;p&gt;I didn't write those files. I described what I wanted, reviewed the output, asked for adjustments, and approved the result. But the actual keystrokes, the architectural decisions at the function level, the naming conventions, the error handling patterns — those came from Claude. My role was more like a tech lead doing a very fast code review than a developer writing code.&lt;/p&gt;

&lt;p&gt;This is the part of AI-assisted development that doesn't get talked about enough. The speed is real. But the speed comes with a specific cost: you're now responsible for code you didn't write and don't have muscle memory for. You can read it, understand it, even approve it — but you didn't &lt;em&gt;think&lt;/em&gt; it into existence line by line. That gap matters when something breaks at 2am.&lt;/p&gt;

&lt;p&gt;I did double-check certain things. The Supabase Row Level Security policies got a careful read, because that's where data leaks happen. The rate limiting logic got scrutinized. The API route handlers got a scan for obvious injection vectors. But did I trace every component's render path? No. Did I verify every edge case in the 106 tests? Also no.&lt;/p&gt;

&lt;p&gt;And about those 106 tests — Claude wrote those too. They pass, they cover the main flows, but when I actually sat down to read through them, I found the coverage was thinner than it looked. There were twelve tests on the track card component that all tested slight variations of rendering props, but not a single test for what happens when the audio URL comes back null from MiniMax — which is a real failure mode listed in their API docs. Green checkmarks, blind spots. A test suite that gives you confidence without earning it is worse than no tests at all, because at least with no tests you know you're flying blind.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Vibe coding has a debt that comes due when something breaks.&lt;/strong&gt; If you don't understand the code well enough to debug it without AI assistance, you haven't saved time — you've borrowed it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I spent about 45 minutes after that first commit just reading. Not fixing anything, not even taking notes — just building a mental map. I traced the generation flow from form submission through the edge function and back. I found one place where an error in the MiniMax callback would silently swallow the failure, leaving a track stuck in "generating" forever. I flagged it, Claude fixed it in one shot. That 45 minutes probably saved me a 2am debugging session later. That's the tax. It's real, it's unavoidable, and anyone telling you AI-assisted development is "10x faster" is probably not counting it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is where the story gets interesting — the community reactions system, the production debugging saga at 10am, and the crowdfunding experiment. &lt;a href="https://datagobes.dev/blog/building-ai-jukebox?utm_source=devto&amp;amp;utm_medium=crosspost" rel="noopener noreferrer"&gt;Read the full post on datagobes.dev →&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>nextjs</category>
      <category>webdev</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Agent Duelist: Benchmark LLM Providers in One Command</title>
      <dc:creator>Gijs Jansen</dc:creator>
      <pubDate>Sun, 01 Mar 2026 10:40:27 +0000</pubDate>
      <link>https://forem.com/datagobes/introducing-agent-duelist-benchmark-llm-providers-like-a-pro-4hh0</link>
      <guid>https://forem.com/datagobes/introducing-agent-duelist-benchmark-llm-providers-like-a-pro-4hh0</guid>
      <description>&lt;p&gt;TL;DR: Agent Duelist is a TypeScript‑first framework for benchmarking multiple LLM providers on your real tasks. Get structured, reproducible metrics for correctness, latency, tokens, and cost from a single unified interface.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Output Looks Like
&lt;/h2&gt;

&lt;p&gt;Here’s the CLI benchmark summary you get from a single &lt;code&gt;npx duelist run&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8tjaqt8ftat9rdcufwgb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8tjaqt8ftat9rdcufwgb.png" alt="agent-duelist console output" width="800" height="1450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And here’s a run rendered as an HTML table:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F37c1sqlb1xqdedkmecsj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F37c1sqlb1xqdedkmecsj.png" alt="agent-duelist html output" width="800" height="677"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Run It Yourself (10 seconds)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install &lt;/span&gt;agent-duelist
npx duelist init
npx duelist run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;You're building with LLMs and you need to answer questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Should I use GPT-5.2 or Claude Opus 4.6 for this task?&lt;/li&gt;
&lt;li&gt;Is Azure OpenAI faster than standard OpenAI for my use case?&lt;/li&gt;
&lt;li&gt;How much will switching models actually cost me?&lt;/li&gt;
&lt;li&gt;Which provider handles tool calls best?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Today, answering these questions typically means:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Wiring up separate integrations for each provider&lt;/li&gt;
&lt;li&gt;Manually tracking latency, tokens, and errors across runs&lt;/li&gt;
&lt;li&gt;Copy‑pasting outputs into spreadsheets&lt;/li&gt;
&lt;li&gt;Guesstimating costs from pricing pages and blog posts&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;There has to be a better way to compare models than ad‑hoc scripts and spreadsheets.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Enter Agent Duelist
&lt;/h2&gt;

&lt;p&gt;Agent Duelist is a benchmarking framework that lets you:&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Define tasks once, run them everywhere&lt;/strong&gt; — Benchmark OpenAI, Azure, Anthropic, Gemini, and any OpenAI‑compatible gateway without changing task code.&lt;br&gt;
✅ &lt;strong&gt;Get real metrics&lt;/strong&gt; — Capture latency, token counts, and cost estimates using a bundled pricing catalog.&lt;br&gt;
✅ &lt;strong&gt;Compare providers objectively&lt;/strong&gt; — Use built‑in scorers for correctness, schema validation, fuzzy similarity, and LLM‑as‑judge.&lt;br&gt;
✅ &lt;strong&gt;Benchmark agent workflows&lt;/strong&gt; — Measure tool‑calling behavior with local handlers.&lt;br&gt;
✅ &lt;strong&gt;TypeScript-native DX&lt;/strong&gt; — Strong types, Zod schemas, and full IDE support.&lt;br&gt;
✅ &lt;strong&gt;CLI-first&lt;/strong&gt; — From zero to comparison tables in a single command.&lt;/p&gt;


&lt;h2&gt;
  
  
  Quick Start (60 seconds)
&lt;/h2&gt;

&lt;p&gt;Install it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install &lt;/span&gt;agent-duelist
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Initialize a config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx duelist init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates &lt;code&gt;arena.config.ts&lt;/code&gt;. Here's a minimal example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;defineArena&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;anthropic&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;agent-duelist&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;zod&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="nf"&gt;defineArena&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nf"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gpt-5.2&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;claude-sonnet-4-6-20260217&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;simple-qa&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;In one sentence, explain what a monorepo is.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;A monorepo is a single repository that contains code for multiple projects.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;structured-extraction&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Extract the company name and year from: "Acme was founded in 2024."&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;company&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Acme&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;year&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2024&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="na"&gt;company&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="na"&gt;year&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;number&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
      &lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;scorers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;latency&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cost&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;correctness&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;schema-correctness&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;fuzzy-similarity&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the benchmark:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx duelist run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You get a comparison table showing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rows:&lt;/strong&gt; Your tasks (simple-qa, structured-extraction)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Columns:&lt;/strong&gt; Your providers (OpenAI GPT-5.2, Anthropic Claude Sonnet 4.6)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cells:&lt;/strong&gt; Correctness score, latency, tokens, estimated cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For CI or dashboards:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx duelist run &lt;span class="nt"&gt;--reporter&lt;/span&gt; json &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; results.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Why Agent Duelist?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Provider-Agnostic
&lt;/h3&gt;

&lt;p&gt;Write your tasks once. Swap models and providers without rewriting anything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;providers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="nf"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gpt-5.2&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="nf"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gpt-5.2-chat-latest&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="nf"&gt;azureOpenai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gpt-5.2&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;deployment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;my-deployment&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="nf"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;claude-opus-4-6-20260205&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="nf"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;claude-sonnet-4-6-20260217&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="nf"&gt;gemini&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gemini-3.1-pro-preview&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="nf"&gt;gemini&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gemini-3-flash-preview&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="nf"&gt;openaiCompatible&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;local/llama&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Local LLaMA&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;baseURL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://localhost:11434/v1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;llama-3.3&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Agent-Focused
&lt;/h3&gt;

&lt;p&gt;Designed for real-world agent workflows with tool calling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
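&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;defineArena&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;agent-duelist&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;zod&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;weatherTool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;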
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;getCurrentWeather&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Get the current weather in a given city&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="na"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;city&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;tempC&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="nf"&gt;defineArena&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gpt-5.2&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
  &lt;span class="na"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;weather-agent&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;What is the temperature in Amsterdam? Use the tool.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Amsterdam&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;weatherTool&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;scorers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;latency&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cost&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tool-usage&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model calls the tool, your handler executes, and the &lt;code&gt;tool-usage&lt;/code&gt; scorer reports whether the expected tool was invoked.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Realistic Metrics
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Wall-clock response time in milliseconds&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token counts:&lt;/strong&gt; Direct from the provider APIs—the source of truth&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost estimation:&lt;/strong&gt; Transparent and conservative&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bundled pricing catalog derived from OpenRouter's public data&lt;/li&gt;
&lt;li&gt;Maps &lt;code&gt;(provider, model)&lt;/code&gt; → &lt;code&gt;{ inputPerM, outputPerM }&lt;/code&gt; in USD per 1M tokens&lt;/li&gt;
&lt;li&gt;Azure models resolve back to base OpenAI pricing automatically&lt;/li&gt;
&lt;li&gt;Formula: &lt;code&gt;(promptTokens * inputPerM + completionTokens * outputPerM) / 1_000_000&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
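
&lt;p&gt;The formula above can be sketched in a few lines. This is a minimal illustration, not the package's actual code: the &lt;code&gt;estimateCostUsd&lt;/code&gt; helper, the catalog key format, and the prices are all assumptions made up for the example.&lt;/p&gt;

```typescript
// Hypothetical pricing entry, in USD per 1M tokens (same shape as described above).
type Pricing = { inputPerM: number; outputPerM: number };

// Hypothetical catalog keyed by "provider/model"; the prices are invented for illustration.
const catalog: { [key: string]: Pricing } = {
  "example-provider/example-model": { inputPerM: 1.0, outputPerM: 4.0 },
};

function estimateCostUsd(
  provider: string,
  model: string,
  promptTokens: number,
  completionTokens: number,
): number | undefined {
  const pricing = catalog[`${provider}/${model}`];
  // Unknown model: return no estimate rather than a guess.
  if (pricing === undefined) return undefined;
  return (
    (promptTokens * pricing.inputPerM + completionTokens * pricing.outputPerM) /
    1_000_000
  );
}

// 142 prompt tokens and 38 completion tokens at the invented prices above.
console.log(estimateCostUsd("example-provider", "example-model", 142, 38));
```

&lt;p&gt;Returning &lt;code&gt;undefined&lt;/code&gt; for a model that is missing from the catalog keeps an unknown price from silently showing up as a cost of zero.&lt;/p&gt;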

&lt;p&gt;Example output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tokens: prompt: 142, completion: 38
Cost: ~$0.189m (millicents)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Rich Scoring System
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Built-in scorers:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scorer&lt;/th&gt;
&lt;th&gt;What It Measures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;latency&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Wall-clock response time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cost&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Estimated USD cost from tokens + pricing catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;correctness&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Exact match against expected (deep-equal)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;schema-correctness&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Validates output against Zod schema&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fuzzy-similarity&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Jaccard token-overlap similarity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llm-judge-correctness&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;LLM-as-judge scoring (accuracy, completeness, conciseness)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tool-usage&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Whether expected tools were invoked&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
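
&lt;p&gt;To make the &lt;code&gt;fuzzy-similarity&lt;/code&gt; row concrete, here is a minimal sketch of Jaccard token overlap: the number of shared tokens divided by the number of distinct tokens across both strings. The tokenizer (lowercase, split on non-alphanumeric characters) and the function names are assumptions; the built-in scorer may tokenize differently.&lt;/p&gt;

```typescript
// Assumed tokenizer: lowercase, split on runs of non-alphanumeric characters.
function tokenize(text: string) {
  return new Set(
    text
      .toLowerCase()
      .split(/[^a-z0-9]+/)
      .filter((token) => token.length > 0),
  );
}

// Jaccard similarity: |A intersect B| / |A union B|, a value between 0 and 1.
function jaccardSimilarity(a: string, b: string): number {
  const tokensA = tokenize(a);
  const tokensB = tokenize(b);
  let intersection = 0;
  tokensA.forEach((token) => {
    if (tokensB.has(token)) intersection += 1;
  });
  const union = tokensA.size + tokensB.size - intersection;
  // Two empty strings are treated as identical.
  return union === 0 ? 1 : intersection / union;
}

// Three shared tokens out of five distinct ones.
console.log(jaccardSimilarity("The product works great", "the product works well"));
```

&lt;p&gt;A score like this rewards shared vocabulary but ignores word order and meaning, which is why it sits alongside exact-match and LLM-judge scorers rather than replacing them.&lt;/p&gt;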

&lt;p&gt;&lt;strong&gt;LLM-as-Judge Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;defineArena&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;scorers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;latency&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cost&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;llm-judge-correctness&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;judgeModel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gpt-5.2-chat-latest&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// or 'gemini-3.1-pro-preview'&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The judge evaluates each output on three criteria (accuracy, completeness, and conciseness) and returns a composite 0–1 score.&lt;/p&gt;
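
&lt;p&gt;How the three criteria are combined is not spelled out here. As a rough sketch only, assuming an unweighted mean of per-criterion scores that are each clamped to the 0–1 range, a composite could look like this (the &lt;code&gt;JudgeScores&lt;/code&gt; type and &lt;code&gt;compositeScore&lt;/code&gt; function are hypothetical):&lt;/p&gt;

```typescript
// Hypothetical shape for the judge's per-criterion scores.
type JudgeScores = { accuracy: number; completeness: number; conciseness: number };

// Keep each criterion inside the 0..1 range before combining.
const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));

// Assumption: the composite is an unweighted mean of the three criteria.
function compositeScore(scores: JudgeScores): number {
  return (
    (clamp01(scores.accuracy) +
      clamp01(scores.completeness) +
      clamp01(scores.conciseness)) /
    3
  );
}

console.log(compositeScore({ accuracy: 1, completeness: 0.5, conciseness: 0.75 }));
```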

&lt;h3&gt;
  
  
  5. TypeScript-Native
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Strongly typed provider interfaces&lt;/li&gt;
&lt;li&gt;Zod schemas for structured outputs&lt;/li&gt;
&lt;li&gt;Full IDE autocomplete and type safety&lt;/li&gt;
&lt;li&gt;No runtime surprises&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Real-World Example
&lt;/h2&gt;

&lt;p&gt;Let's say you're building an extraction pipeline and need to choose between the latest frontier models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;defineArena&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;gemini&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;agent-duelist&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;zod&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="nf"&gt;defineArena&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nf"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gpt-5.2&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gpt-5.2-chat-latest&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;claude-sonnet-4-6-20260217&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;gemini&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gemini-3-flash-preview&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;extract-company&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Extract company and role as JSON from: "I work at Acme Corp as a senior engineer."&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;company&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Acme Corp&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;senior engineer&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;company&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;classify-sentiment&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Classify sentiment as "positive", "negative", or "neutral": "The product works great!"&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;positive&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;scorers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;latency&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cost&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;correctness&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;schema-correctness&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run &lt;code&gt;npx duelist run&lt;/code&gt; and get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task: extract-company
Provider                        Latency    Cost       Tokens    Match    Schema
────────────────────────────────────────────────────────────────────────────────
openai/gpt-5.2                  1905ms    ~$0.312m     140      100%     100%
openai/gpt-5.2-chat-latest       842ms    ~$0.091m     132      100%     100%
anthropic/claude-sonnet-4.6     1493ms    ~$0.189m     126      100%     100%
gemini/gemini-3-flash-preview    610ms    ~$0.041m     119      100%     100%

Summary
◆ Most correct: all providers tied (avg 100%)
◆ Fastest: gemini/gemini-3-flash-preview (avg 610ms)
◆ Cheapest: gemini/gemini-3-flash-preview (avg ~$0.041m)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



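&lt;p&gt;Now you can make data-driven decisions: &lt;strong&gt;with all four models tied on correctness for this task, Gemini 3 Flash wins on both latency and cost&lt;/strong&gt;. Whether GPT-5.2's reasoning depth or Claude Sonnet 4.6's strength on computer-use tasks is worth the extra spend is a question for harder tasks than this one.&lt;/p&gt;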




&lt;h2&gt;
  
  
  Use Cases
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;🔬 Model Selection&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Compare models on your actual tasks before committing&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;💰 Cost Optimization&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Identify which model gives you the best quality/cost ratio — in 2026, prices vary wildly between frontier models&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚡ Performance Tuning&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Track latency across providers and deployments&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🛠️ Agent Development&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Benchmark tool-calling accuracy for multi-step workflows&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;📊 CI/CD Integration&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Run benchmarks in CI and fail if metrics regress:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx duelist run &lt;span class="nt"&gt;--reporter&lt;/span&gt; json | jq &lt;span class="s1"&gt;'.summary.avgLatency'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
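
&lt;p&gt;A regression gate can also compare the current report against a stored baseline instead of a fixed number. The sketch below assumes only that the JSON report exposes &lt;code&gt;summary.avgLatency&lt;/code&gt;; the &lt;code&gt;Report&lt;/code&gt; type, the &lt;code&gt;hasRegressed&lt;/code&gt; helper, the 10% tolerance, and the sample numbers are hypothetical.&lt;/p&gt;

```typescript
// Only `summary.avgLatency` is taken from the report; the rest is illustrative.
type Report = { summary: { avgLatency: number } };

// True when the current average latency exceeds the baseline by more than `tolerance`.
function hasRegressed(current: Report, baseline: Report, tolerance = 0.1): boolean {
  return current.summary.avgLatency > baseline.summary.avgLatency * (1 + tolerance);
}

// Sample data standing in for a saved baseline and a fresh JSON report.
const baseline: Report = { summary: { avgLatency: 1000 } };
const current: Report = JSON.parse('{"summary":{"avgLatency":1250}}');

console.log(hasRegressed(current, baseline) ? "FAIL: latency regressed" : "OK");
```

&lt;p&gt;In a real pipeline the script would read both reports from disk and exit with a non-zero status on failure so the CI job stops.&lt;/p&gt;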



&lt;p&gt;&lt;strong&gt;📈 Dashboarding&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Export JSON results to Grafana, Datadog, or your metrics platform&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Already shipped:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ OpenAI (GPT-5, GPT-5.2), Azure, Anthropic (Claude 4.6 series), Gemini (3/3.1 series), OpenAI-compatible providers&lt;/li&gt;
&lt;li&gt;✅ 7 built-in scorers including LLM-as-judge&lt;/li&gt;
&lt;li&gt;✅ Tool-calling support for agent benchmarking&lt;/li&gt;
&lt;li&gt;✅ Console &amp;amp; JSON reporters&lt;/li&gt;
&lt;li&gt;✅ Pricing catalog with refresh script&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Roadmap&lt;/strong&gt; (shaped by community feedback):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📜 More providers (OpenRouter-native, more gateways)&lt;/li&gt;
&lt;li&gt;📜 Markdown/HTML/CSV reporters&lt;/li&gt;
&lt;li&gt;📜 GitHub Actions summaries&lt;/li&gt;
&lt;li&gt;📜 Multi-step agent workflows&lt;/li&gt;
&lt;li&gt;📜 Plugin system for custom scorers&lt;/li&gt;
&lt;li&gt;📜 Embedding-based semantic similarity&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Get Started Now
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install &lt;/span&gt;agent-duelist
npx duelist init
npx duelist run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;📦 Package:&lt;/strong&gt; &lt;a href="https://www.npmjs.com/package/agent-duelist" rel="noopener noreferrer"&gt;npmjs.com/package/agent-duelist&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;📖 GitHub:&lt;/strong&gt; &lt;a href="https://github.com/DataGobes/agent-duelist" rel="noopener noreferrer"&gt;github.com/DataGobes/agent-duelist&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;🐛 Issues:&lt;/strong&gt; &lt;a href="https://github.com/DataGobes/agent-duelist/issues" rel="noopener noreferrer"&gt;github.com/DataGobes/agent-duelist/issues&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Contributing
&lt;/h2&gt;

&lt;p&gt;Contributions welcome! 🎉&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bug reports / ideas:&lt;/strong&gt; Open a GitHub issue&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code changes:&lt;/strong&gt; Fork, branch, test (&lt;code&gt;npm test&lt;/code&gt;), build (&lt;code&gt;npm run build&lt;/code&gt;), PR&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Keep each PR focused (one provider or one scorer at a time) for easier review.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Choosing the right LLM provider shouldn't involve spreadsheets, guesswork, and manual copy-paste. Agent Duelist gives you a single command to get objective, reproducible comparisons.&lt;/p&gt;

&lt;p&gt;Whether you're optimizing costs, improving latency, or validating correctness—&lt;strong&gt;let the models duel it out.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;⚔️ &lt;strong&gt;May the best model win.&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Tags
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;#typescript&lt;/code&gt; &lt;code&gt;#llm&lt;/code&gt; &lt;code&gt;#ai&lt;/code&gt; &lt;code&gt;#benchmarking&lt;/code&gt; &lt;code&gt;#openai&lt;/code&gt; &lt;code&gt;#anthropic&lt;/code&gt; &lt;code&gt;#gemini&lt;/code&gt; &lt;code&gt;#agents&lt;/code&gt; &lt;code&gt;#devtools&lt;/code&gt; &lt;code&gt;#opensource&lt;/code&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What LLM provider battle are you running first? Drop a comment!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>devops</category>
      <category>typescript</category>
    </item>
  </channel>
</rss>
