Forem: Gijs Jansen

Building an AI Community Jukebox Overnight (and What Broke by Morning)

Gijs Jansen — Thu, 12 Mar 2026 16:48:18 +0000

The $50 Challenge

A few days ago I got accepted into the MiniMax developer program. The email was short and direct: here's a $50 API voucher, we're curious to see what you'll build. That's it. No strings, no required deliverable, just fifty dollars and a question.

Some context on me: I've spent 15 years building backends, data platforms, and pipelines. I'm comfortable with databases, APIs, infrastructure — the stuff nobody sees. What I've never done is build something consumer-facing and try to get people to use it. Never marketed a product, never asked anyone to pay for something I made. That whole muscle is atrophied, if it ever existed.

So I made a bet with myself. I'd use the MiniMax credits to build a community feature for my personal site — an AI-powered jukebox where visitors could generate music tracks and interact with each other's creations. Then I'd try to sustain it through crowd-funding. If I can't convince a handful of people that this is worth keeping alive, there's probably no point trying anything more ambitious on the commercial side.

The other piece of context: I didn't build this alone. Claude Code was my coding partner through the entire process. I'm going to be transparent about that from the start because the human+AI dynamic is a big part of this story. What I directed, what I caught, where Claude surprised me, where it fell short — that's all in here. This isn't a hype piece about AI-assisted development. It's a report from someone figuring it out in real time.

Here's how four sessions across one night and one morning turned a $50 voucher into a live community feature.

From Zero to Jukebox

Session one started around 11pm on a weeknight. I had a rough idea: visitors type a prompt describing a song, MiniMax generates it, the track shows up in a public feed. No accounts, no logins, just show up and make music.

I described the vision to Claude and it scaffolded the full architecture in one pass. The first commit touched 30 files: 11 UI components, 5 API routes, a Supabase edge function for music generation, database migrations, and 106 tests. One commit.

The core generation flow uses a fire-and-forget pattern. A visitor submits a prompt, the API creates a pending track in Supabase, then triggers an edge function that calls MiniMax's music-2.5+ model. When generation completes (usually 30-60 seconds), the edge function updates the track status and stores the audio URL. The visitor's browser polls for updates.
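The flow above can be sketched in a few lines. This is a minimal in-memory illustration, not the site's actual code: the Track shape, the status names, and every function name are assumptions, and a Map stands in for the Supabase table.

```typescript
// Hypothetical track shape; the status names are assumptions.
type TrackStatus = "pending" | "ready" | "failed";

interface Track {
  id: string;
  prompt: string;
  status: TrackStatus;
  audioUrl: string | null;
}

// In-memory stand-in for the tracks table.
const tracks = new Map<string, Track>();
let nextId = 1;

function createPendingTrack(prompt: string): Track {
  const track: Track = { id: String(nextId++), prompt, status: "pending", audioUrl: null };
  tracks.set(track.id, track);
  return track;
}

// What the edge function would do once generation finishes.
// A missing audio URL is recorded as a failure so the track never stays pending.
function finishTrack(id: string, audioUrl: string | null): void {
  const track = tracks.get(id);
  if (!track) return;
  track.status = audioUrl ? "ready" : "failed";
  track.audioUrl = audioUrl;
}

// Fire-and-forget: start generation without awaiting it and return the pending track.
function submitPrompt(
  prompt: string,
  generate: (prompt: string) => Promise<string | null>,
): Track {
  const track = createPendingTrack(prompt);
  void generate(prompt)
    .then((url) => finishTrack(track.id, url))
    .catch(() => finishTrack(track.id, null));
  return track;
}

// What the browser's polling request would read.
function getTrackStatus(id: string): TrackStatus | undefined {
  return tracks.get(id)?.status;
}
```

The point of the pattern is that submitPrompt returns immediately with a pending track, and the status only changes later when the generation promise settles.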

One decision I made early: no user accounts. Visitors interact anonymously, identified only by a daily-rotating hash of their IP and user agent. This keeps things frictionless while still enabling per-visitor rate limiting. It was also a deliberate privacy stance — I don't want to know who my visitors are, and I don't want their email addresses. I need just enough identity to prevent spam and enable reactions, not one bit more. A daily-rotating hash gives me exactly that. It means fire reactions reset every day, which I initially saw as a bug but now see as a feature: tracks have to earn their fires fresh each day.

Here's the visitor hash — it rotates daily so there's no persistent tracking:

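import { createHash } from "node:crypto";

// getDailySalt() is defined elsewhere; it returns a salt that changes once per day.
export function getVisitorHash(request: Request): string {
  // Behind a proxy, the first x-forwarded-for entry is the client address.
  const forwarded = request.headers.get("x-forwarded-for");
  const ip = forwarded?.split(",")[0]?.trim() ?? "unknown";
  const ua = request.headers.get("user-agent") ?? "unknown";
  const salt = getDailySalt();

  // Truncate the SHA-256 digest to 16 hex characters (64 bits).
  return createHash("sha256")
    .update(ip + "|" + ua + "|" + salt)
    .digest("hex")
    .slice(0, 16);
}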
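The getDailySalt helper isn't shown in the post. One possible shape, purely as an assumption, is an HMAC of the current UTC date under a server-side secret. The version below takes the secret and the clock as parameters so it can be tested, whereas the original is called with no arguments.

```typescript
import { createHmac } from "node:crypto";

// Hypothetical helper: one possible way to derive a salt that rotates daily.
// The secret would come from server-side configuration and must never reach the client.
function getDailySalt(secret: string, now: Date = new Date()): string {
  // toISOString() is always UTC, so the salt changes at UTC midnight.
  const day = now.toISOString().slice(0, 10); // e.g. "2026-03-12"
  return createHmac("sha256", secret).update(day).digest("hex");
}
```

Because the date is the only varying input, every request on the same UTC day gets the same salt, and yesterday's visitor hashes cannot be recomputed without the secret.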

The UI followed the site's "Ember" branding — dark background, warm cream text, that #c75c2c accent color. Claude nailed the terminal aesthetic without me having to micro-manage component styles. The track cards show a waveform visualization, playback controls, and the original prompt. It even matched the monospace vibe of the rest of the site without being told to. It looks like it belongs, which is more than I expected from a first pass.

Around 12:30am I generated the first track. I typed "lo-fi jazz for debugging at midnight" and waited. Thirty seconds later, a piano riff with brushed drums started playing through my laptop speakers. It sounded... good? Like, genuinely good. I sat there for a minute just listening, slightly stunned that this worked on the first try. The apartment was dead quiet except for this warm little piano loop bleeding out of my laptop, and I remember thinking: I should be asleep, but I don't want to stop this.

That feeling wore off quickly. Because the next thing I had to do was actually review what had just been committed.

The Vibe Coding Reality Check

Thirty files in one commit. Let's sit with that for a second.

I didn't write those files. I described what I wanted, reviewed the output, asked for adjustments, and approved the result. But the actual keystrokes, the architectural decisions at the function level, the naming conventions, the error handling patterns — those came from Claude. My role was more like a tech lead doing a very fast code review than a developer writing code.

This is the part of AI-assisted development that doesn't get talked about enough. The speed is real. But the speed comes with a specific cost: you're now responsible for code you didn't write and don't have muscle memory for. You can read it, understand it, even approve it — but you didn't think it into existence line by line. That gap matters when something breaks at 2am.

I did double-check certain things. The Supabase Row Level Security policies got a careful read — that's where data leaks happen. The rate limiting logic got scrutinized. The API route handlers got a pass for obvious injection vectors. But did I trace every component's render path? No. Did I verify every edge case in the 106 tests? Also no.

And about those 106 tests — Claude wrote those too. They pass, they cover the main flows, but when I actually sat down to read through them, I found the coverage was thinner than it looked. There were twelve tests on the track card component that all tested slight variations of rendering props, but not a single test for what happens when the audio URL comes back null from MiniMax — which is a real failure mode listed in their API docs. Green checkmarks, blind spots. A test suite that gives you confidence without earning it is worse than no tests at all, because at least with no tests you know you're flying blind.
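To make the gap concrete, here is a sketch of the kind of check that was missing. The cardState function and its state names are hypothetical stand-ins for the real track card logic, which this post doesn't show.

```typescript
// Hypothetical view-model for a track card; names are illustrative only.
type CardState = "generating" | "playable" | "error";

interface TrackRow {
  status: "pending" | "ready" | "failed";
  audioUrl: string | null;
}

// Decide what the card should render for a given track row.
function cardState(track: TrackRow): CardState {
  if (track.status === "pending") return "generating";
  // A finished track without audio must surface as an error, not a dead play button.
  if (track.status === "failed" || track.audioUrl === null) return "error";
  return "playable";
}

// The case the suite never exercised: generation reports success but the URL is null.
const nullUrlCase = cardState({ status: "ready", audioUrl: null });
console.log(nullUrlCase); // "error"
```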

Vibe coding has a debt that comes due when something breaks. If you don't understand the code well enough to debug it without AI assistance, you haven't saved time — you've borrowed it.

I spent about 45 minutes after that first commit just reading. Not fixing anything, not even taking notes — just building a mental map. I traced the generation flow from form submission through the edge function and back. I found one place where an error in the MiniMax callback would silently swallow the failure, leaving a track stuck in "generating" forever. I flagged it, Claude fixed it in one shot. That 45 minutes probably saved me a 2am debugging session later. That's the tax. It's real, it's unavoidable, and anyone telling you AI-assisted development is "10x faster" is probably not counting it.

This is where the story gets interesting — the community reactions system, the production debugging saga at 10am, and the crowdfunding experiment. Read the full post on datagobes.dev →

Agent Duelist: Benchmark LLM Providers in One Command

Gijs Jansen — Sun, 01 Mar 2026 10:40:27 +0000

TL;DR: Agent Duelist is a TypeScript‑first framework for benchmarking multiple LLM providers on your real tasks. Get structured, reproducible metrics for correctness, latency, tokens, and cost from a single unified interface.

What the Output Looks Like

Here’s the CLI benchmark summary you get from a single npx duelist run:

[Image: CLI benchmark summary]

And here’s a run rendered as an HTML table:

[Image: benchmark run rendered as an HTML table]

Run It Yourself (10 seconds)

npm install agent-duelist
npx duelist init
npx duelist run

The Problem

You're building with LLMs and you need to answer questions like:

Should I use GPT-5.2 or Claude Opus 4.6 for this task?
Is Azure OpenAI faster than standard OpenAI for my use case?
How much will switching models actually cost me?
Which provider handles tool calls best?

Today, answering these questions typically means:

Wiring up separate integrations for each provider
Manually tracking latency, tokens, and errors across runs
Copy‑pasting outputs into spreadsheets
Guesstimating costs from pricing pages and blog posts

There has to be a better way to compare models than ad‑hoc scripts and spreadsheets.

Enter Agent Duelist

Agent Duelist is a benchmarking framework that lets you:

✅ Define tasks once, run them everywhere — Benchmark OpenAI, Azure, Anthropic, Gemini, and any OpenAI‑compatible gateway without changing task code.
✅ Get real metrics — Capture latency, token counts, and cost estimates using a bundled pricing catalog.
✅ Compare providers objectively — Use built‑in scorers for correctness, schema validation, fuzzy similarity, and LLM‑as‑judge.
✅ Benchmark agent workflows — Measure tool‑calling behavior with local handlers.
✅ TypeScript-native DX — Strong types, Zod schemas, and full IDE support.
✅ CLI-first — From zero to comparison tables in a single command.

Quick Start (60 seconds)

Install it:

npm install agent-duelist

Initialize a config:

npx duelist init

This creates arena.config.ts. Here's a minimal example:

import { defineArena, openai, anthropic } from 'agent-duelist'
import { z } from 'zod'

export default defineArena({
  providers: [
    openai('gpt-5.2'),
    anthropic('claude-sonnet-4-6-20260217'),
  ],
  tasks: [
    {
      name: 'simple-qa',
      prompt: 'In one sentence, explain what a monorepo is.',
      expected: 'A monorepo is a single repository that contains code for multiple projects.',
    },
    {
      name: 'structured-extraction',
      prompt: 'Extract the company name and year from: "Acme was founded in 2024."',
      expected: { company: 'Acme', year: 2024 },
      schema: z.object({
        company: z.string(),
        year: z.number(),
      }),
    },
  ],
  scorers: ['latency', 'cost', 'correctness', 'schema-correctness', 'fuzzy-similarity'],
  runs: 3,
})

Run the benchmark:

npx duelist run

You get a beautiful table showing:

Rows: Your tasks (simple-qa, structured-extraction)
Columns: Your providers (OpenAI GPT-5.2, Anthropic Claude Sonnet 4.6)
Cells: Correctness score, latency, tokens, estimated cost

For CI or dashboards:

npx duelist run --reporter json > results.json

Why Agent Duelist?

1. Provider-Agnostic

Write your tasks once. Swap models and providers without rewriting anything:

const providers = [
  openai('gpt-5.2'),
  openai('gpt-5.2-chat-latest'),
  azureOpenai('gpt-5.2', { deployment: 'my-deployment' }),
  anthropic('claude-opus-4-6-20260205'),
  anthropic('claude-sonnet-4-6-20260217'),
  gemini('gemini-3.1-pro-preview'),
  gemini('gemini-3-flash-preview'),
  openaiCompatible({
    id: 'local/llama',
    name: 'Local LLaMA',
    baseURL: 'http://localhost:11434/v1',
    model: 'llama-3.3',
  }),
]

2. Agent-Focused

Designed for real-world agent workflows with tool calling:

import { defineArena, openai } from 'agent-duelist'
import { z } from 'zod'

const weatherTool = {
  name: 'getCurrentWeather',
  description: 'Get the current weather in a given city',
  parameters: z.object({ city: z.string() }),
  handler: async ({ city }) => ({
    city,
    tempC: 20,
  }),
}

export default defineArena({
  providers: [openai('gpt-5.2')],
  tasks: [
    {
      name: 'weather-agent',
      prompt: 'What is the temperature in Amsterdam? Use the tool.',
      expected: { city: 'Amsterdam' },
      tools: [weatherTool],
    },
  ],
  scorers: ['latency', 'cost', 'tool-usage'],
})

The model calls the tool, your handler executes, and you get metrics on tool usage accuracy.
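One plausible way to turn "were the expected tools invoked" into a number is sketched below. This is not Agent Duelist's implementation: ToolCall and toolUsageScore are hypothetical names, and the scoring rule (fraction of expected tools called at least once) is an assumption.

```typescript
// Hypothetical record of a tool call made by the model during a run.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

// Fraction of expected tools that were invoked at least once (1 when none are expected).
function toolUsageScore(expectedTools: string[], calls: ToolCall[]): number {
  if (expectedTools.length === 0) return 1;
  const invoked = new Set(calls.map((call) => call.name));
  const hits = expectedTools.filter((name) => invoked.has(name)).length;
  return hits / expectedTools.length;
}
```

For the weather task, a run where the model calls getCurrentWeather would score 1, and a run where it answers from memory without calling the tool would score 0.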

3. Realistic Metrics

Latency: Wall-clock response time in milliseconds

Token counts: Direct from the provider APIs—the source of truth

Cost estimation: Transparent and conservative

Bundled pricing catalog derived from OpenRouter's public data
Maps (provider, model) → { inputPerM, outputPerM } in USD per 1M tokens
Azure models resolve back to base OpenAI pricing automatically
Formula: (promptTokens * inputPerM + completionTokens * outputPerM) / 1_000_000

Example output:

Tokens: prompt: 142, completion: 38
Cost: ~$0.189m (millicents)
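The lookup and formula described above can be sketched as follows. The catalog entry, its prices, and the function names are made up for illustration. Only the { inputPerM, outputPerM } shape, the Azure fallback, and the formula come from the description.

```typescript
// USD per 1M tokens, matching the { inputPerM, outputPerM } shape described above.
interface Price {
  inputPerM: number;
  outputPerM: number;
}

// Hypothetical catalog with made-up prices; real values come from the bundled catalog.
const catalog = new Map<string, Price>([
  ["openai/example-model", { inputPerM: 2, outputPerM: 10 }],
]);

// Azure deployments fall back to the base OpenAI entry for the same model.
function lookupPrice(provider: string, model: string): Price | undefined {
  const base = provider === "azure" ? "openai" : provider;
  return catalog.get(`${base}/${model}`);
}

// (promptTokens * inputPerM + completionTokens * outputPerM) / 1_000_000
function estimateCostUsd(promptTokens: number, completionTokens: number, price: Price): number {
  return (promptTokens * price.inputPerM + completionTokens * price.outputPerM) / 1_000_000;
}
```

With those made-up prices, 142 prompt tokens and 38 completion tokens come to (142 × 2 + 38 × 10) / 1,000,000 = $0.000664.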

4. Rich Scoring System

Built-in scorers:

Scorer                   What It Measures
latency                  Wall-clock response time
cost                     Estimated USD cost from tokens + pricing catalog
correctness              Exact match against expected (deep-equal)
schema-correctness       Validates output against Zod schema
fuzzy-similarity         Jaccard token-overlap similarity
llm-judge-correctness    LLM-as-judge scoring (accuracy, completeness, conciseness)
tool-usage               Whether expected tools were invoked
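For readers unfamiliar with Jaccard token overlap, here is a minimal sketch of the idea. The tokenizer (lowercase, split on anything that isn't an ASCII letter or digit) is an assumption, and Agent Duelist's own tokenization may differ.

```typescript
// Lowercase and split on non-alphanumeric characters; empty strings are dropped.
function tokenize(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0));
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|, defined here as 1 when both sets are empty.
function jaccardSimilarity(a: string, b: string): number {
  const setA = tokenize(a);
  const setB = tokenize(b);
  if (setA.size === 0 && setB.size === 0) return 1;
  let intersection = 0;
  setA.forEach((token) => {
    if (setB.has(token)) intersection += 1;
  });
  const union = setA.size + setB.size - intersection;
  return intersection / union;
}
```

For example, "a single repository" and "a single repo" share two of four distinct tokens, so the score is 0.5. That also shows the scorer's limit: it rewards shared words, not shared meaning.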

LLM-as-Judge Example:

defineArena({
  scorers: ['latency', 'cost', 'llm-judge-correctness'],
  judgeModel: 'gpt-5.2-chat-latest', // or 'gemini-3.1-pro-preview'
})

The judge evaluates outputs on three criteria and returns a composite 0–1 score.

5. TypeScript-Native

Strongly typed provider interfaces
Zod schemas for structured outputs
Full IDE autocomplete and type safety
No runtime surprises

Real-World Example

Let's say you're building an extraction pipeline and need to choose between the latest frontier models:

import { defineArena, openai, anthropic, gemini } from 'agent-duelist'
import { z } from 'zod'

export default defineArena({
  providers: [
    openai('gpt-5.2'),
    openai('gpt-5.2-chat-latest'),
    anthropic('claude-sonnet-4-6-20260217'),
    gemini('gemini-3-flash-preview'),
  ],
  tasks: [
    {
      name: 'extract-company',
      prompt: 'Extract company and role as JSON from: "I work at Acme Corp as a senior engineer."',
      expected: { company: 'Acme Corp', role: 'senior engineer' },
      schema: z.object({ company: z.string(), role: z.string() }),
    },
    {
      name: 'classify-sentiment',
      prompt: 'Classify sentiment as "positive", "negative", or "neutral": "The product works great!"',
      expected: 'positive',
    },
  ],
  scorers: ['latency', 'cost', 'correctness', 'schema-correctness'],
  runs: 3,
})

Run npx duelist run and get:

Task: extract-company
Provider                        Latency    Cost       Tokens    Match    Schema
────────────────────────────────────────────────────────────────────────────────
openai/gpt-5.2                  1905ms    ~$0.312m     140      100%     100%
openai/gpt-5.2-chat-latest       842ms    ~$0.091m     132      100%     100%
anthropic/claude-sonnet-4.6     1493ms    ~$0.189m     126      100%     100%
gemini/gemini-3-flash-preview    610ms    ~$0.041m     119      100%     100%

Summary
◆ Most correct: all providers tied (avg 100%)
◆ Fastest: gemini/gemini-3-flash-preview (avg 610ms)
◆ Cheapest: gemini/gemini-3-flash-preview (avg ~$0.041m)

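Now you can make data-driven decisions. On this task every provider scored 100% on both match and schema, so Gemini 3 Flash wins on latency and cost. Strengths this run doesn't measure, such as GPT-5.2's reasoning depth or Claude Sonnet 4.6's computer use, need tasks of their own before they show up in the table.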
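The Summary lines in the output above boil down to averaging each metric per provider across the configured runs and picking the minimum. A rough sketch of that aggregation follows. The RunResult shape and function names are hypothetical, not the library's internals.

```typescript
// Hypothetical shape for one benchmark run of one provider.
interface RunResult {
  provider: string;
  latencyMs: number;
  costUsd: number;
}

interface ProviderAverage {
  provider: string;
  avgLatencyMs: number;
  avgCostUsd: number;
}

// Average latency and cost per provider across repeated runs.
function averageByProvider(results: RunResult[]): ProviderAverage[] {
  const groups = new Map<string, RunResult[]>();
  results.forEach((r) => {
    const group = groups.get(r.provider) ?? [];
    group.push(r);
    groups.set(r.provider, group);
  });
  const averages: ProviderAverage[] = [];
  groups.forEach((runs, provider) => {
    const n = runs.length;
    averages.push({
      provider,
      avgLatencyMs: runs.reduce((sum, r) => sum + r.latencyMs, 0) / n,
      avgCostUsd: runs.reduce((sum, r) => sum + r.costUsd, 0) / n,
    });
  });
  return averages;
}

// Pick the provider with the smallest value for a metric (the first one wins ties).
function best(
  averages: ProviderAverage[],
  metric: "avgLatencyMs" | "avgCostUsd",
): ProviderAverage | undefined {
  return averages.reduce<ProviderAverage | undefined>(
    (winner, a) => (winner === undefined || a[metric] < winner[metric] ? a : winner),
    undefined,
  );
}
```

Averaging over several runs matters because a single latency sample is noisy, and a one-off slow response can flip the "fastest" verdict.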

Use Cases

🔬 Model Selection

Compare models on your actual tasks before committing

💰 Cost Optimization

Identify which model gives you the best quality/cost ratio — in 2026, prices vary wildly between frontier models

⚡ Performance Tuning

Track latency across providers and deployments

🛠️ Agent Development

Benchmark tool-calling accuracy for multi-step workflows

📊 CI/CD Integration

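Run benchmarks in CI and fail the build if a metric regresses. With jq -e, the exit status is non-zero when the expression evaluates to false (the 2000 ms threshold is only an example):

npx duelist run --reporter json | jq -e '.summary.avgLatency < 2000'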

📈 Dashboarding

Export JSON results to Grafana, Datadog, or your metrics platform

What's Next?

Already shipped:

✅ OpenAI (GPT-5, GPT-5.2), Azure, Anthropic (Claude 4.6 series), Gemini (3/3.1 series), OpenAI-compatible providers
✅ 7 built-in scorers including LLM-as-judge
✅ Tool-calling support for agent benchmarking
✅ Console & JSON reporters
✅ Pricing catalog with refresh script

Roadmap (shaped by community feedback):

📜 More providers (OpenRouter-native, more gateways)
📜 Markdown/HTML/CSV reporters
📜 GitHub Actions summaries
📜 Multi-step agent workflows
📜 Plugin system for custom scorers
📜 Embedding-based semantic similarity

Get Started Now

npm install agent-duelist
npx duelist init
npx duelist run

📦 Package: npmjs.com/package/agent-duelist

📖 GitHub: github.com/DataGobes/agent-duelist

🐛 Issues: github.com/DataGobes/agent-duelist/issues

Contributing

Contributions welcome! 🎉

Bug reports / ideas: Open a GitHub issue
Code changes: Fork, branch, test (npm test), build (npm run build), PR

Keep PRs focused (one provider, one scorer) for easier review.

Final Thoughts

Choosing the right LLM provider shouldn't involve spreadsheets, guesswork, and manual copy-paste. Agent Duelist gives you a single command to get objective, reproducible comparisons.

Whether you're optimizing costs, improving latency, or validating correctness—let the models duel it out.

⚔️ May the best model win.