<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: OpenMark</title>
    <description>The latest articles on Forem by OpenMark (@openmarkai).</description>
    <link>https://forem.com/openmarkai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3762992%2F09b65eca-ee66-4613-acde-4e96fb1ee398.png</url>
      <title>Forem: OpenMark</title>
      <link>https://forem.com/openmarkai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/openmarkai"/>
    <language>en</language>
    <item>
      <title>Benchmarking the Model Is the Wrong Abstraction</title>
      <dc:creator>OpenMark</dc:creator>
      <pubDate>Sun, 15 Mar 2026 19:40:54 +0000</pubDate>
      <link>https://forem.com/openmarkai/benchmarking-the-model-is-the-wrong-abstraction-3bi6</link>
      <guid>https://forem.com/openmarkai/benchmarking-the-model-is-the-wrong-abstraction-3bi6</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;Benchmarking the workflow is the right one.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I've spent over a year benchmarking AI models. Thousands of evaluations across 100+ models, dozens of task types, multiple scoring modes. And the single biggest thing I've learned is something most people in this space haven't internalized yet:&lt;/p&gt;

&lt;p&gt;Model performance is not a number. It's a function.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;performance = f(
  model,
  task_type,
  task_theme,
  prompt_structure,
  output_constraints,
  decoding_parameters,
  dataset_distribution
)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Change any one of these variables, and the rankings reshuffle. Sometimes dramatically. The model that wins on your classification task might lose on mine, not because one of us is wrong, but because the task/model pairing is different.&lt;/p&gt;

&lt;p&gt;This has massive implications for how we should think about evaluation, routing, and cost.&lt;/p&gt;
&lt;h2&gt;
  
  
  Prompt structure reshuffles winners
&lt;/h2&gt;

&lt;p&gt;One of the most consistent patterns I've observed: changing the prompt style (not the question itself, just its syntax and framing) can completely reorder which model comes out on top.&lt;/p&gt;

&lt;p&gt;Rephrase a sentiment classification prompt from "Classify as positive/negative/neutral" to "What is the sentiment? Reply with one word," and you'll get different winners. Same task. Same intent. Different leaderboard.&lt;/p&gt;

&lt;p&gt;There's one consolation: the worst models tend to stay the worst regardless of how you phrase things. Prompt engineering mostly reshuffles the top-tier competitors. Lower-capability models saturate early and no amount of prompt craft saves them.&lt;/p&gt;

&lt;p&gt;But for anyone choosing between the top 5-10 models for a production task, this means your prompt &lt;em&gt;is&lt;/em&gt; part of your evaluation, not separate from it.&lt;/p&gt;
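&lt;p&gt;To make that concrete, here's a minimal sketch of treating the prompt as an evaluation axis. Everything in it is hypothetical: the model names, the prompt labels, and the hard-coded scores standing in for a real deterministic scorer over your labeled examples.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: prompt phrasing is an evaluation axis, not a constant.
# score() stands in for running your labeled set through a model;
# the numbers are made up for illustration.

def score(model, prompt_style):
    fake = {
        ("model-a", "classify"): 0.91, ("model-a", "one-word"): 0.84,
        ("model-b", "classify"): 0.88, ("model-b", "one-word"): 0.93,
    }
    return fake[(model, prompt_style)]

models = ["model-a", "model-b"]
prompts = ["classify", "one-word"]

# Rank the models separately under each phrasing.
for p in prompts:
    ranking = sorted(models, key=lambda m: score(m, p), reverse=True)
    print(p, ranking)
# Same task, same intent, different leaderboard.
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;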
&lt;h2&gt;
  
  
  Task type alone doesn't predict performance
&lt;/h2&gt;

&lt;p&gt;There's a common mental model that goes something like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reasoning tasks → reasoning models&lt;/li&gt;
&lt;li&gt;Extraction tasks → smaller instruction models&lt;/li&gt;
&lt;li&gt;Creative tasks → large frontier models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It sounds logical. It's also wrong more often than you'd expect.&lt;/p&gt;

&lt;p&gt;I've run benchmarks where non-reasoning models outperform dedicated reasoning models on reasoning tasks. Where a "Medium" pricing tier model ties with a "Very High" tier flagship. Where the cheapest model in the roster co-leads with the most expensive one.&lt;/p&gt;

&lt;p&gt;Performance depends on task theme, prompt syntax, output formatting constraints, and dataset characteristics in ways that broad categories simply can't capture. "Classification" is not one task. It's thousands of tasks that happen to share a label.&lt;/p&gt;
&lt;h2&gt;
  
  
  Smaller models win more often than people think
&lt;/h2&gt;

&lt;p&gt;In production workflows (RAG pipelines, agent chains, extraction flows), smaller models frequently outperform frontier models on individual steps. They're faster, cheaper, more deterministic, and often better at following rigid output constraints.&lt;/p&gt;

&lt;p&gt;The insight that changed how I build systems:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;optimal system ≠ best model
optimal system = best model per step
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most pipelines only need a frontier model for a small minority of steps. The rest can run on models that cost 10-25x less with equal or better results on that specific sub-task.&lt;/p&gt;

&lt;p&gt;But you'll never discover this by looking at a leaderboard. You'll only see it by benchmarking each step individually.&lt;/p&gt;
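&lt;p&gt;A minimal sketch of what "best model per step" looks like as a routing rule. The step names, models, scores, prices, and the 0.90 quality bar are all invented for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: route each pipeline step to the cheapest model that
# clears a quality bar on that step's own benchmark.
# All names, scores, and prices are illustrative.

# Per-step benchmark results: (accuracy, cost per call in USD).
results = {
    "retrieve-rerank": {"small-model": (0.92, 0.0004), "frontier": (0.93, 0.0100)},
    "extract-fields":  {"small-model": (0.95, 0.0004), "frontier": (0.94, 0.0100)},
    "final-synthesis": {"small-model": (0.71, 0.0004), "frontier": (0.90, 0.0100)},
}

QUALITY_BAR = 0.90

def route(step):
    candidates = results[step]
    # Among models meeting the bar, pick the cheapest;
    # if none qualify, fall back to the best scorer.
    ok = {m: cost for m, (acc, cost) in candidates.items() if acc &gt;= QUALITY_BAR}
    if ok:
        return min(ok, key=ok.get)
    return max(candidates, key=lambda m: candidates[m][0])

routing = {step: route(step) for step in results}
print(routing)
# Only the synthesis step actually needs the frontier model.
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;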

&lt;h2&gt;
  
  
  Model capability is a vector, not a score
&lt;/h2&gt;

&lt;p&gt;Every leaderboard reduces a model to a single number. But model capability is multidimensional:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reasoning depth&lt;/li&gt;
&lt;li&gt;Extraction precision&lt;/li&gt;
&lt;li&gt;Format obedience&lt;/li&gt;
&lt;li&gt;Hallucination resistance&lt;/li&gt;
&lt;li&gt;Instruction following&lt;/li&gt;
&lt;li&gt;Long-context handling&lt;/li&gt;
&lt;li&gt;Tool use reliability&lt;/li&gt;
&lt;li&gt;Latency efficiency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Different tasks project onto different parts of this capability space. A model can be exceptional at reasoning and terrible at format obedience. It can handle 100K context windows flawlessly and still fail at single-label classification because it can't resist adding an explanation.&lt;/p&gt;

&lt;p&gt;When you flatten all of this into one score, you lose the information that actually matters for your decision.&lt;/p&gt;

&lt;h2&gt;
  
  
  Variance follows capability boundaries
&lt;/h2&gt;

&lt;p&gt;Here's something I didn't expect to find: model variance is not strongly correlated with model size or price. It follows a capability boundary pattern.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Capability far exceeds task difficulty → stable success&lt;/li&gt;
&lt;li&gt;Capability roughly matches task difficulty → high variance&lt;/li&gt;
&lt;li&gt;Capability far below task difficulty → stable failure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most dangerous zone is the middle one. A model near the edge of its capability for a task will give you brilliant output sometimes and garbage other times. Single-run benchmarks can't detect this. You need multiple passes with stability tracking to see it.&lt;/p&gt;

&lt;p&gt;This is why consistency metrics matter as much as accuracy. A model that scores 75% with perfect stability is often more valuable in production than one that scores 82% but fluctuates wildly.&lt;/p&gt;
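&lt;p&gt;A simple way to encode that preference is to penalize run-to-run spread. This is a toy formulation (the penalty weight is an arbitrary knob, not a standard metric), but it shows how a stable 75% can outrank a volatile 82%.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: mean score minus a penalty proportional to run-to-run
# spread. The penalty weight is a made-up knob.
import statistics

def stability_adjusted(scores, penalty=1.0):
    return statistics.mean(scores) - penalty * statistics.pstdev(scores)

stable   = [0.75, 0.75, 0.75, 0.75, 0.75]
volatile = [0.98, 0.60, 0.95, 0.62, 0.95]   # mean = 0.82

print(round(stability_adjusted(stable), 3))    # 0.75
print(round(stability_adjusted(volatile), 3))  # ≈ 0.648
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;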

&lt;h2&gt;
  
  
  Models regress silently
&lt;/h2&gt;

&lt;p&gt;Another pattern that doesn't get enough attention: capability drift.&lt;/p&gt;

&lt;p&gt;I've observed models regress on tasks even when the model name stays the same and prompts remain unchanged. A model scores 82% in January, you retest in March, it scores 71%. Same API endpoint. Same prompt. Different results.&lt;/p&gt;

&lt;p&gt;Possible causes: alignment layer adjustments, silent model updates, decoding policy changes, backend routing changes. The providers don't announce these. Most developers never detect it because they don't run controlled evaluations on a schedule.&lt;/p&gt;

&lt;p&gt;This is why I treat benchmark results as perishable data. If you're routing production traffic based on an evaluation you ran three months ago, you might already be misrouting.&lt;/p&gt;
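&lt;p&gt;The fix is mechanical: re-run the same evaluation on a schedule and diff it against a baseline. A sketch, with a made-up 5-point threshold and fabricated scores:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: treat benchmark results as perishable and re-check them
# on a schedule. Scores are fabricated; the threshold is arbitrary.

def detect_drift(baseline, current, threshold=0.05):
    """Flag any model whose score dropped more than the threshold."""
    return {
        model: (baseline[model], current[model])
        for model in baseline
        if baseline[model] - current.get(model, 0.0) &gt; threshold
    }

january = {"model-a": 0.82, "model-b": 0.77}
march   = {"model-a": 0.71, "model-b": 0.78}   # model-a regressed silently

print(detect_drift(january, march))
# {'model-a': (0.82, 0.71)}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;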

&lt;h2&gt;
  
  
  The prompt that generates the benchmark can fail the benchmark
&lt;/h2&gt;

&lt;p&gt;One of the more interesting things I've noticed: when a model generates evaluation prompts and expected answers, it doesn't necessarily perform well on those tasks itself.&lt;/p&gt;

&lt;p&gt;A model can write a perfectly valid classification test with correct expected labels, then fail that exact test when evaluated. The asymmetry between generating instructions and following them is real, and it means you can't trust a model to evaluate itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real question
&lt;/h2&gt;

&lt;p&gt;The AI industry is obsessed with: &lt;em&gt;"Which model is best?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;After a year of benchmarking, I'm convinced this is the wrong question.&lt;/p&gt;

&lt;p&gt;The right question is: &lt;em&gt;"Which model is best for this specific task, with this specific prompt structure, in this specific workflow?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That question can only be answered by benchmarking the workflow, not the model.&lt;/p&gt;

&lt;p&gt;Static leaderboards answer the first question. Custom, task-specific, repeatable benchmarking answers the second. The gap between these two approaches is where most teams are silently overpaying, underperforming, or both.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Bio: Marc Kean Paker is the founder of &lt;a href="https://openmark.ai"&gt;OpenMark&lt;/a&gt;, an AI model benchmarking platform for deterministic, cost-aware model selection across 100+ models.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>benchmark</category>
      <category>devtools</category>
    </item>
    <item>
      <title>The Price Per Million Tokens Is Lying to You</title>
      <dc:creator>OpenMark</dc:creator>
      <pubDate>Thu, 05 Mar 2026 01:57:10 +0000</pubDate>
      <link>https://forem.com/openmarkai/the-price-per-million-tokens-is-lying-to-you-1j28</link>
      <guid>https://forem.com/openmarkai/the-price-per-million-tokens-is-lying-to-you-1j28</guid>
      <description>&lt;p&gt;About 9 months ago, I was building a RAG system, for those who don't know its a kind of enhanced memory system for AI agents. One of the agentic flows needed semantic similarity, and I had GPT-4o running it because, well, it was OpenAI’s flagship model. Best model, best results, right? &lt;/p&gt;

&lt;p&gt;I decided to actually test that assumption. After a few days of systematic testing, I found that a model costing roughly 10x less (GPT-4.1-mini at the time) was giving me equal or better results on that specific task. Not marginally. Noticeably better. On a task I assumed required the most recent, most expensive option.&lt;/p&gt;

&lt;p&gt;That experience broke something in how I thought about AI model selection, and I've spent the months since digging into why this happens and how widespread it is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The pricing page tells you almost nothing.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every AI provider publishes a price per million tokens. Input tokens, output tokens, maybe a cached rate. Simple enough. But this number is close to meaningless in production because it ignores two things that completely change the math.&lt;/p&gt;

&lt;p&gt;First, tokenization. Different models tokenize the same input differently: GPT-5, Claude Sonnet 4.5, Gemini 3.0 Flash, and so on. Give them the exact same prompt, the exact same input text, and they will produce different token counts. Sometimes the difference is 10-15%. Sometimes it's more. So "price per million tokens" is comparing apples to oranges from the start, because a million tokens from one model does not represent the same amount of work as a million tokens from another.&lt;/p&gt;

&lt;p&gt;Second, and this is the bigger one: output volume. This is where reasoning and chain-of-thought models completely blow up the math. A model like DeepSeek Reasoner, gpt-5.2-pro, or Claude Opus 4.6 will think through a problem step by step, and that thinking generates tokens. Lots of them. You ask two models the same question: one gives you a 200 token answer, the other gives you 3,000 tokens of reasoning plus a 200 token answer. The second model might be cheaper per million tokens and still cost you 5x more on the actual task.&lt;/p&gt;
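&lt;p&gt;Here's the arithmetic, with invented prices and token counts: a model that is 5x cheaper per million tokens can still be the more expensive one per task once its reasoning tokens are billed.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: per-task cost vs. price-per-million. All prices and
# token counts are illustrative, not real provider numbers.

def cost_per_task(price_in, price_out, tokens_in, tokens_out):
    # prices are USD per million tokens
    return (price_in * tokens_in + price_out * tokens_out) / 1_000_000

# "Cheap" verbose model: 5x lower sticker price, but it emits
# 3,000 reasoning tokens plus a 200-token answer.
cheap_verbose = cost_per_task(0.50, 2.00, 1_000, 3_200)

# "Expensive" concise model: 5x higher sticker price, 200-token answer.
pricey_concise = cost_per_task(2.50, 10.00, 1_000, 200)

print(round(cheap_verbose, 4), round(pricey_concise, 4))  # 0.0069 0.0045
# 5x cheaper on paper, ~1.5x more expensive per task.
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;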

&lt;p&gt;I've seen this over and over. A model that is "10x cheaper" by the pricing page ends up being more expensive in practice because of how it handles the workload. And a model that looks expensive on paper can be cheaper per task because it's efficient with its tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why generic benchmarks don't help here&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The instinct when choosing a model is to check the leaderboards. MMLU, HumanEval, LMArena, LiveBench. These are useful for understanding general capability. But they tell you nothing about your specific use case.&lt;/p&gt;

&lt;p&gt;I'm not being contrarian here. This is just the reality of how these models work. The variables are incredibly subtle. The way you phrase a prompt, the structure of your input, even the position of a comma can change which model performs best. A model that scores 92% on MMLU might score 60% on your classification task while a model that scores 85% on MMLU nails it at 95%.&lt;/p&gt;

&lt;p&gt;And none of these benchmarks account for cost. You could be using the "best" model on the leaderboard and spending 10x what you need to, because a model three tiers below it handles your specific workload just as well, if not better.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What actually matters in production&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're running AI in production, or even just evaluating which model to use for a project, the metrics that matter are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Accuracy on YOUR task. Not a generic benchmark. Your actual prompts, your actual data, your actual expected outputs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Real token cost. Not price per million, but what the model actually costs you per task, per call, per pipeline run. This includes input tokens (which vary by tokenizer), output tokens (which vary wildly by model behavior), and any reasoning tokens that get billed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Latency. Time to first token and total completion time. For agentic workflows or user-facing features, this matters as much as cost.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Consistency. Some models give you brilliant output 70% of the time and garbage the other 30%. Others are boringly reliable. For production, boring and reliable wins every time.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
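&lt;p&gt;All four of these fall out of the same logged runs. A sketch with fabricated data, just to show the shape of the computation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: the four metrics above, computed from logged runs.
# Every value here is fabricated for illustration.
import statistics

runs = [  # (correct, cost_usd, latency_s) per benchmark call
    (True, 0.0031, 1.2), (True, 0.0029, 1.1), (False, 0.0035, 1.4),
    (True, 0.0030, 1.2), (True, 0.0028, 1.0),
]

accuracy    = sum(1 for ok, _, _ in runs if ok) / len(runs)
cost_task   = statistics.mean(c for _, c, _ in runs)
latency     = statistics.mean(t for _, _, t in runs)
consistency = 1 - statistics.pstdev(1.0 if ok else 0.0 for ok, _, _ in runs)

print(accuracy, round(cost_task, 5), round(latency, 2), round(consistency, 2))
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;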

&lt;p&gt;The problem is that getting these numbers requires actually running your workload across multiple models. Not once, not with one prompt, but systematically, on a schedule, with enough variation to get statistically meaningful results. Most teams don't do this because it's tedious and time-consuming. They pick the model that "feels right" based on leaderboard rankings and what seems to work, ship it, and never look back.&lt;/p&gt;

&lt;p&gt;This is how you end up spending $10k/month on API calls when $2k would give you the same output quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The real lesson&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The AI model market is moving fast. New models every few weeks. Price cuts, capability jumps, new providers entering. The model that was optimal for your use case three months ago might not be optimal today.&lt;/p&gt;

&lt;p&gt;The only way to actually know what works best for you is to test it. On your data, with your prompts, measuring the things that matter for your specific situation. Everything else is guessing.&lt;/p&gt;

&lt;p&gt;I learned this the hard way when I found out I was overpaying by 10x on a pipeline I assumed needed a flagship model. Since then, I've made it a practice to re-evaluate model selection whenever a significant new release drops. The cost savings and performance improvements make it worth it every single time.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Bio: Marc Kean Paker is the founder of &lt;a href="https://openmark.ai" rel="noopener noreferrer"&gt;OpenMark&lt;/a&gt;, an AI model benchmarking platform designed to move teams away from leaderboard guessing and toward deterministic, cost-aware model selection.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>benchmark</category>
      <category>devtools</category>
    </item>
    <item>
      <title>I Benchmarked 10 AI Models on Reading Human Emotions</title>
      <dc:creator>OpenMark</dc:creator>
      <pubDate>Thu, 19 Feb 2026 14:17:38 +0000</pubDate>
      <link>https://forem.com/openmarkai/i-benchmarked-10-ai-models-on-reading-human-emotions-3m0b</link>
      <guid>https://forem.com/openmarkai/i-benchmarked-10-ai-models-on-reading-human-emotions-3m0b</guid>
      <description>&lt;p&gt;Every time a new AI model drops, the same ritual plays out. The leaderboard updates. Twitter erupts. Someone posts a chart showing Model X beat Model Y by 2.3% on MMLU. People make purchasing decisions based on these numbers.&lt;/p&gt;

&lt;p&gt;And I think most of it is nonsense.&lt;/p&gt;

&lt;p&gt;I don't say this lightly. I've spent the last year building &lt;a href="https://openmark.ai" rel="noopener noreferrer"&gt;OpenMark&lt;/a&gt;, a platform that lets you benchmark AI models on &lt;em&gt;your own tasks&lt;/em&gt; with deterministic scoring and real API cost tracking. The deeper I go into benchmarking, the more I realize how fundamentally broken the way we evaluate AI models is.&lt;/p&gt;

&lt;p&gt;Let me show you what I mean with real data.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Experiment: Can AI Read Human Emotions?
&lt;/h2&gt;

&lt;p&gt;I took four movie stills, scenes most humans would immediately recognize, and asked 10 AI models to identify the emotion. The twist: increasing complexity.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Julia Roberts, Pretty Woman&lt;/strong&gt; Obviously happy. Baseline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Matthew McConaughey, Interstellar&lt;/strong&gt; Obviously sad. Still straightforward.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Michael Scott, The Office&lt;/strong&gt; Happy teary eyed experession? This is where it gets ambiguous.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Joaquin Phoenix, Joker&lt;/strong&gt; Neutral expression (cover picture), the joker makeup really messes with the AI models ability to understand what is going on.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each model ran the task 3 times (so 12 total calls per model, 4 images × 3 runs) with stability tracking. Here's what happened:&lt;/p&gt;

&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Stability&lt;/th&gt;
&lt;th&gt;Cost/task&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.2 (OpenAI)&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;td&gt;±0.000&lt;/td&gt;
&lt;td&gt;$0.0085&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gemini-3-pro (Google)&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;td&gt;±0.000&lt;/td&gt;
&lt;td&gt;$0.0614&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gemini-3-flash (Google)&lt;/td&gt;
&lt;td&gt;68%&lt;/td&gt;
&lt;td&gt;±1.000&lt;/td&gt;
&lt;td&gt;$0.0060&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;grok-4-1-fast (xAI)&lt;/td&gt;
&lt;td&gt;57%&lt;/td&gt;
&lt;td&gt;±1.000&lt;/td&gt;
&lt;td&gt;$0.0009&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sonar (Perplexity)&lt;/td&gt;
&lt;td&gt;57%&lt;/td&gt;
&lt;td&gt;±1.000&lt;/td&gt;
&lt;td&gt;$0.0256&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;llama4-maverick (Meta)&lt;/td&gt;
&lt;td&gt;50%&lt;/td&gt;
&lt;td&gt;±0.000&lt;/td&gt;
&lt;td&gt;$0.0020&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3.5-397B (Alibaba)&lt;/td&gt;
&lt;td&gt;50%&lt;/td&gt;
&lt;td&gt;±0.000&lt;/td&gt;
&lt;td&gt;$0.0073&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;claude-sonnet-4.6 (Anthropic)&lt;/td&gt;
&lt;td&gt;50%&lt;/td&gt;
&lt;td&gt;±0.000&lt;/td&gt;
&lt;td&gt;$0.0148&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;claude-opus-4.6 (Anthropic)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;50%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;±0.000&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.0246&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mistral-medium (Mistral)&lt;/td&gt;
&lt;td&gt;42%&lt;/td&gt;
&lt;td&gt;±1.000&lt;/td&gt;
&lt;td&gt;$0.0022&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;10 models. Real API costs. 3 runs per model for stability. Exported from &lt;a href="https://openmark.ai" rel="noopener noreferrer"&gt;OpenMark&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now, stop and look at this data.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Most Expensive Model Tied With One Costing 12x Less
&lt;/h2&gt;

&lt;p&gt;Claude Opus 4.6, Anthropic's flagship in the "Very High" pricing tier at $0.025 per task, scored &lt;strong&gt;exactly 50%&lt;/strong&gt;. The same score as Llama 4 Maverick at $0.002 per task.&lt;/p&gt;

&lt;p&gt;That's a 12x price difference for identical performance.&lt;/p&gt;

&lt;p&gt;On any generic leaderboard, Opus 4.6 ranks significantly above Maverick. MMLU, HumanEval, MATH, Opus wins on all of them. And yet, on &lt;em&gt;this specific task&lt;/em&gt;, with &lt;em&gt;this specific prompt&lt;/em&gt;, the budget model matched the premium one perfectly.&lt;/p&gt;

&lt;p&gt;If you were making a purchasing decision based on leaderboard rankings, you'd be paying 12x more than you needed to.&lt;/p&gt;

&lt;h2&gt;
  
  
  Half the Models Changed Their Mind
&lt;/h2&gt;

&lt;p&gt;Look at the stability column. Half the models scored ±0.000: they gave the exact same answer every single run. The other half scored ±1.000: they literally changed their interpretation of the same image across runs.&lt;/p&gt;

&lt;p&gt;Gemini 3 Flash, Grok, Sonar, Mistral, all unstable. Same image, same prompt, different answer depending on when you ask.&lt;/p&gt;

&lt;p&gt;This is why single-run benchmarks are fundamentally meaningless. If your model can't give the same answer twice, what exactly did your benchmark measure? The model's capability? Or just... luck?&lt;/p&gt;
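&lt;p&gt;For categorical answers, a stability measure can be as simple as disagreement with the majority answer across runs. This is my own toy formulation, not necessarily how any platform computes its ± numbers:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: run-to-run stability for categorical answers.
from collections import Counter

def stability(answers):
    """0.0 = every run agreed; higher = more disagreement."""
    majority = Counter(answers).most_common(1)[0][1]
    return 1.0 - majority / len(answers)

print(stability(["happy", "happy", "happy"]))   # 0.0, stable
print(stability(["happy", "sad", "neutral"]))   # unstable
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;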

&lt;h2&gt;
  
  
  The 80,000-Call Problem (And Why Every Leaderboard Is Lying to You)
&lt;/h2&gt;

&lt;p&gt;Here's where I get genuinely frustrated.&lt;/p&gt;

&lt;p&gt;To properly benchmark a model on a single task, you'd need to account for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stability&lt;/strong&gt;: Run each prompt at least &lt;strong&gt;10 times&lt;/strong&gt; to get reliable variance data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Language variation&lt;/strong&gt;: Test across at least &lt;strong&gt;20 languages&lt;/strong&gt; (tokenization affects reasoning)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Syntax variation&lt;/strong&gt;: Rephrase the same question &lt;strong&gt;20 different ways&lt;/strong&gt; (formal, casual, terse, verbose, with typos, without)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt variation&lt;/strong&gt;: &lt;strong&gt;20 fundamentally different phrasings&lt;/strong&gt; of the same underlying question&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's &lt;code&gt;10 × 20 × 20 × 20 = 80,000 calls&lt;/code&gt;. For &lt;em&gt;one task&lt;/em&gt;. On one model.&lt;/p&gt;

&lt;p&gt;And this is conservative. Add tool use? Multiply by another N. Add multimodal inputs? Another N. Different system prompts? Another N. You're easily at 500,000+ calls to truly benchmark one model on one capability.&lt;/p&gt;

&lt;p&gt;No leaderboard does this. Not MMLU. Not HumanEval. Not LMArena. Not SWE-bench. Why? Because it isn't feasible: the resources required to run 500,000+ calls, for each task, for each model, would be unfathomable. They run each question once, maybe a handful of times, and call it a score. Then people use that score to decide which model to bet their product on.&lt;/p&gt;

&lt;p&gt;Brilliant researchers are out there designing these benchmarks, and I respect the work deeply. But the fundamental limitation isn't effort or intelligence, it's that &lt;strong&gt;you can never escape the prompt problem&lt;/strong&gt;. The way you ask the question &lt;em&gt;is&lt;/em&gt; the test, as much as the question itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Car Wash Problem: When the Benchmark Is Dumber Than the Model
&lt;/h2&gt;

&lt;p&gt;There's a popular "gotcha" making the rounds. The car wash problem:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I need to get my car washed. The car wash is 100 meters away. Should I go by car or by foot?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Many models say "by foot", it's only 100 meters. And people hold this up triumphantly: &lt;em&gt;"See? AI can't reason! You need your car at the car wash!"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But think about this for two seconds. The question is intentionally ambiguous. Maybe your car is &lt;em&gt;already at the car wash&lt;/em&gt;. Maybe someone else is driving it there. Maybe you're asking about how &lt;em&gt;you&lt;/em&gt; should get there, not the car. The question doesn't specify.&lt;/p&gt;

&lt;p&gt;This isn't a model failure. It's a prompt failure. The question is designed to be misleading, and then we blame the model for being misled.&lt;/p&gt;

&lt;p&gt;You know what's worse than the car wash problem? Ask &lt;em&gt;humans&lt;/em&gt; what's heavier, 1 kg of feathers or 1 kg of lead. Way too many will say lead. And that's a question with an objectively correct, unambiguous answer, not an intentionally vague one. The car wash example is manufactured outrage, and people cling to it because it confirms the narrative that "AI isn't ready."&lt;/p&gt;

&lt;p&gt;AI &lt;em&gt;might&lt;/em&gt; not be ready. But the car wash problem doesn't prove it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Matters
&lt;/h2&gt;

&lt;p&gt;Here's what I've learned from building a benchmarking platform and running thousands of model evaluations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The only benchmark that matters is yours.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not MMLU. Not HumanEval. Not some leaderboard aggregating scores across tasks you'll never use. The question is brutally simple:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Does this specific model, with this specific prompt, for this specific task, give me the result I expect, reliably and at a price I can afford?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That's it. That's the whole question.&lt;/p&gt;

&lt;p&gt;In my emotion detection test, GPT-5.2 won at 75% with perfect stability for $0.0085 per task. But if your use case is high-volume classification where "good enough" works and cost matters, Grok 4.1 Fast at 57% for $0.0009 delivers roughly 63,000 accuracy points per dollar, about 7x better value than the winner.&lt;/p&gt;

&lt;p&gt;No leaderboard will tell you that. Only &lt;em&gt;your&lt;/em&gt; benchmark will.&lt;/p&gt;
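&lt;p&gt;For the curious, accuracy-per-dollar here is just score points divided by task cost (my formulation), computed straight from the results table above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: accuracy-per-dollar from the results table, defined as
# score points per dollar of per-task cost.

table = {
    "gpt-5.2":       (75, 0.0085),
    "grok-4-1-fast": (57, 0.0009),
}

value = {m: score / cost for m, (score, cost) in table.items()}
print({m: round(v) for m, v in value.items()})
print(round(value["grok-4-1-fast"] / value["gpt-5.2"], 1))   # ≈ 7.2x
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;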

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;I built &lt;a href="https://openmark.ai" rel="noopener noreferrer"&gt;OpenMark&lt;/a&gt; because I was tired of making model decisions based on other people's benchmarks. You can write any task, code review, classification, creative writing, vision analysis, anything, pick your models, and get deterministic scores with real API costs.&lt;/p&gt;

&lt;p&gt;100+ models. Side-by-side comparison. Stability metrics. Accuracy-per-dollar. The stuff leaderboards don't show you.&lt;/p&gt;

&lt;p&gt;The benchmark I ran for this article took about 2 minutes to set up.&lt;/p&gt;

&lt;p&gt;Stop trusting leaderboards. Benchmark your own work.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
    </item>
    <item>
      <title>I Built a Tool to Benchmark 100+ LLMs on My Actual Use Case — Here's What I Learned</title>
      <dc:creator>OpenMark</dc:creator>
      <pubDate>Mon, 09 Feb 2026 23:20:40 +0000</pubDate>
      <link>https://forem.com/openmarkai/i-built-a-tool-to-benchmark-100-llms-on-my-actual-use-case-heres-what-i-learned-9ll</link>
      <guid>https://forem.com/openmarkai/i-built-a-tool-to-benchmark-100-llms-on-my-actual-use-case-heres-what-i-learned-9ll</guid>
      <description>&lt;p&gt;Static leaderboards rank LLMs on generic benchmarks like MMLU and HumanEval. But when I needed to pick a model for &lt;em&gt;my&lt;/em&gt; specific task — extracting structured data from messy legal documents — those scores were useless.&lt;/p&gt;

&lt;p&gt;So I built &lt;a href="https://openmark.ai" rel="noopener noreferrer"&gt;OpenMark&lt;/a&gt;, an open benchmarking platform that lets you test 100+ AI models on &lt;strong&gt;your actual prompt&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Static Benchmarks
&lt;/h2&gt;

&lt;p&gt;Every week there's a new "State of the Art" model. MMLU scores keep climbing. But here's the thing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A model that tops coding benchmarks might be terrible at your specific domain&lt;/li&gt;
&lt;li&gt;Pricing varies 100x between providers — and leaderboards don't show real API costs&lt;/li&gt;
&lt;li&gt;Response quality can vary between runs — stability matters as much as peak performance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I was tired of switching models based on hype, only to find they performed worse on my actual workload.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;OpenMark lets you:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Write your real prompt&lt;/strong&gt; (or describe your task and let AI generate a benchmark YAML)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Select models&lt;/strong&gt; across providers — GPT-5.2, Claude 4.5 Sonnet, Gemini 3.0 Flash, DeepSeek chat, Llama, Mistral, and 100+ more&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run a benchmark&lt;/strong&gt; that hits real APIs and scores responses deterministically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compare results&lt;/strong&gt; with actual latency, token costs, and consistency metrics&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here's what makes it different from playgrounds and arenas:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Static Leaderboard&lt;/th&gt;
&lt;th&gt;Playground/Arena&lt;/th&gt;
&lt;th&gt;OpenMark&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Your actual task&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅ (manual)&lt;/td&gt;
&lt;td&gt;✅ (automated)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Real API costs&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deterministic scoring&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌ (vibes)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100+ models at once&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌ (2-4)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stability metrics&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Surprising Things I Learned
&lt;/h2&gt;

&lt;p&gt;After running hundreds of benchmarks, some patterns emerged:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Expensive ≠ Better (for your task)
&lt;/h3&gt;

&lt;p&gt;For straightforward extraction tasks, &lt;strong&gt;GPT-4.1 Mini&lt;/strong&gt; and &lt;strong&gt;Gemini 2.0 Flash&lt;/strong&gt; consistently matched or beat models costing 10-50x more. The expensive models shine on complex reasoning — but most production prompts don't need that.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Stability Varies Wildly
&lt;/h3&gt;

&lt;p&gt;Some models give you a perfect answer 9/10 times and garbage the 10th. If you're building production systems, that 10% failure rate matters more than the peak score. Running multiple iterations revealed which models you can actually &lt;em&gt;trust&lt;/em&gt;.&lt;/p&gt;
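&lt;p&gt;The compounding is what makes this brutal. If a model is right 9 times out of 10 and your pipeline calls it once per step, reliability decays geometrically with chain length (numbers below are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: why a 10% per-call failure rate dominates in pipelines.

def pipeline_success(step_reliability, steps):
    return step_reliability ** steps

for steps in (1, 3, 5, 10):
    print(steps, round(pipeline_success(0.9, steps), 3))
# 10 chained calls at 90% each succeed only about 35% of the time.
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;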

&lt;h3&gt;
  
  
  3. The "Best" Model Changes With Your Prompt
&lt;/h3&gt;

&lt;p&gt;I tested the same &lt;em&gt;concept&lt;/em&gt; with three different prompt phrasings. The model rankings reshuffled each time. This is why static benchmarks are misleading — they test one phrasing, once.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Newer Isn't Always Better
&lt;/h3&gt;

&lt;p&gt;The latest release often has rough edges. Models that have been available for a few months tend to be more stable and better optimized for cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Try It
&lt;/h2&gt;

&lt;p&gt;You can run a benchmark in under 60 seconds:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to &lt;a href="https://openmark.ai" rel="noopener noreferrer"&gt;openmark.ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Describe your task (e.g., "Which LLM is best at summarizing medical research papers?")&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Quick Benchmark&lt;/strong&gt; — it auto-generates a task, picks diverse models, and starts running&lt;/li&gt;
&lt;li&gt;Watch results stream in with scores, costs, and latency&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is the 'quick' flow. You can also go fully hands-on: manually create everything from scratch and select exactly the configuration and models you want to run.&lt;/p&gt;

&lt;p&gt;Free tier gives you 100 credits to start. A typical benchmark across 8 models costs ~4-8 credits.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;I'm working on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Programmatic content pages&lt;/strong&gt; for common comparisons (best LLM for coding, best LLM for writing, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benchmark history&lt;/strong&gt; so you can track model improvements over time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Team sharing&lt;/strong&gt; for collaborative evaluation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you've been picking models based on Twitter hype or static leaderboards, give it a shot with your actual use case. The results might surprise you, and they can be genuinely valuable. I found it &lt;em&gt;very&lt;/em&gt; valuable for establishing which models I wanted to power my RAG's agentic flows.&lt;/p&gt;




&lt;p&gt;🔗 &lt;a href="https://openmark.ai" rel="noopener noreferrer"&gt;OpenMark — Benchmark AI Models on Your Actual Task&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What's your experience been with model selection? Have you found that benchmark scores match your real-world results? Drop a comment 👇&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>machinelearning</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
