<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: OpenMark</title>
    <description>The latest articles on Forem by OpenMark (@openmarkai).</description>
    <link>https://forem.com/openmarkai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3762992%2F09b65eca-ee66-4613-acde-4e96fb1ee398.png</url>
      <title>Forem: OpenMark</title>
      <link>https://forem.com/openmarkai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/openmarkai"/>
    <language>en</language>
    <item>
      <title>Benchmarking the Model Is the Wrong Abstraction</title>
      <dc:creator>OpenMark</dc:creator>
      <pubDate>Sun, 15 Mar 2026 19:40:54 +0000</pubDate>
      <link>https://forem.com/openmarkai/benchmarking-the-model-is-the-wrong-abstraction-3bi6</link>
      <guid>https://forem.com/openmarkai/benchmarking-the-model-is-the-wrong-abstraction-3bi6</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;Benchmarking the workflow is the right one.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I've spent over a year benchmarking AI models. Thousands of evaluations across 100+ models, dozens of task types, multiple scoring modes. And the single biggest thing I've learned is something most people in this space haven't internalized yet:&lt;/p&gt;

&lt;p&gt;Model performance is not a number. It's a function.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;performance = f(
  model,
  task_type,
  task_theme,
  prompt_structure,
  output_constraints,
  decoding_parameters,
  dataset_distribution
)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Change any one of these variables, and the rankings reshuffle. Sometimes dramatically. The model that wins on your classification task might lose on mine, not because one of us is wrong, but because the task/model pairing is different.&lt;/p&gt;

&lt;p&gt;This has massive implications for how we should think about evaluation, routing, and cost.&lt;/p&gt;
&lt;h2&gt;
  
  
  Prompt structure reshuffles winners
&lt;/h2&gt;

&lt;p&gt;One of the most consistent patterns I've observed: changing the prompt style (not the question itself, just its syntax and framing) can completely reorder which model comes out on top.&lt;/p&gt;

&lt;p&gt;Rephrase a sentiment classification prompt from "Classify as positive/negative/neutral" to "What is the sentiment? Reply with one word," and you'll get different winners. Same task. Same intent. Different leaderboard.&lt;/p&gt;

&lt;p&gt;There's one consolation: the worst models tend to stay the worst regardless of how you phrase things. Prompt engineering mostly reshuffles the top-tier competitors. Lower-capability models saturate early and no amount of prompt craft saves them.&lt;/p&gt;

&lt;p&gt;But for anyone choosing between the top 5-10 models for a production task, this means your prompt &lt;em&gt;is&lt;/em&gt; part of your evaluation, not separate from it.&lt;/p&gt;
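&lt;p&gt;To make that concrete, here's a minimal sketch of treating the prompt as an evaluation axis. Everything in it is hypothetical: the model names, the prompt labels, and the hard-coded scores standing in for a real deterministic scorer over your labeled examples.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: prompt phrasing is an evaluation axis, not a constant.
# score() stands in for running your labeled set through a model;
# the numbers are made up for illustration.

def score(model, prompt_style):
    fake = {
        ("model-a", "classify"): 0.91, ("model-a", "one-word"): 0.84,
        ("model-b", "classify"): 0.88, ("model-b", "one-word"): 0.93,
    }
    return fake[(model, prompt_style)]

models = ["model-a", "model-b"]
prompts = ["classify", "one-word"]

# Rank the models separately under each phrasing.
for p in prompts:
    ranking = sorted(models, key=lambda m: score(m, p), reverse=True)
    print(p, ranking)
# Same task, same intent, different leaderboard.
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;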
&lt;h2&gt;
  
  
  Task type alone doesn't predict performance
&lt;/h2&gt;

&lt;p&gt;There's a common mental model that goes something like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reasoning tasks → reasoning models&lt;/li&gt;
&lt;li&gt;Extraction tasks → smaller instruction models&lt;/li&gt;
&lt;li&gt;Creative tasks → large frontier models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It sounds logical. It's also wrong more often than you'd expect.&lt;/p&gt;

&lt;p&gt;I've run benchmarks where non-reasoning models outperform dedicated reasoning models on reasoning tasks. Where a "Medium" pricing tier model ties with a "Very High" tier flagship. Where the cheapest model in the roster co-leads with the most expensive one.&lt;/p&gt;

&lt;p&gt;Performance depends on task theme, prompt syntax, output formatting constraints, and dataset characteristics in ways that broad categories simply can't capture. "Classification" is not one task. It's thousands of tasks that happen to share a label.&lt;/p&gt;
&lt;h2&gt;
  
  
  Smaller models win more often than people think
&lt;/h2&gt;

&lt;p&gt;In production workflows (RAG pipelines, agent chains, extraction flows), smaller models frequently outperform frontier models on individual steps. They're faster, cheaper, more deterministic, and often better at following rigid output constraints.&lt;/p&gt;

&lt;p&gt;The insight that changed how I build systems:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;optimal system ≠ best model
optimal system = best model per step
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most pipelines only need a frontier model for a small minority of steps. The rest can run on models that cost 10-25x less with equal or better results on that specific sub-task.&lt;/p&gt;

&lt;p&gt;But you'll never discover this by looking at a leaderboard. You'll only see it by benchmarking each step individually.&lt;/p&gt;
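&lt;p&gt;A minimal sketch of what "best model per step" looks like as a routing rule. The step names, models, scores, prices, and the 0.90 quality bar are all invented for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: route each pipeline step to the cheapest model that
# clears a quality bar on that step's own benchmark.
# All names, scores, and prices are illustrative.

# Per-step benchmark results: (accuracy, cost per call in USD).
results = {
    "retrieve-rerank": {"small-model": (0.92, 0.0004), "frontier": (0.93, 0.0100)},
    "extract-fields":  {"small-model": (0.95, 0.0004), "frontier": (0.94, 0.0100)},
    "final-synthesis": {"small-model": (0.71, 0.0004), "frontier": (0.90, 0.0100)},
}

QUALITY_BAR = 0.90

def route(step):
    candidates = results[step]
    # Among models meeting the bar, pick the cheapest;
    # if none qualify, fall back to the best scorer.
    ok = {m: cost for m, (acc, cost) in candidates.items() if acc &gt;= QUALITY_BAR}
    if ok:
        return min(ok, key=ok.get)
    return max(candidates, key=lambda m: candidates[m][0])

routing = {step: route(step) for step in results}
print(routing)
# Only the synthesis step actually needs the frontier model.
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;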

&lt;h2&gt;
  
  
  Model capability is a vector, not a score
&lt;/h2&gt;

&lt;p&gt;Every leaderboard reduces a model to a single number. But model capability is multidimensional:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reasoning depth&lt;/li&gt;
&lt;li&gt;Extraction precision&lt;/li&gt;
&lt;li&gt;Format obedience&lt;/li&gt;
&lt;li&gt;Hallucination resistance&lt;/li&gt;
&lt;li&gt;Instruction following&lt;/li&gt;
&lt;li&gt;Long-context handling&lt;/li&gt;
&lt;li&gt;Tool use reliability&lt;/li&gt;
&lt;li&gt;Latency efficiency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Different tasks project onto different parts of this capability space. A model can be exceptional at reasoning and terrible at format obedience. It can handle 100K context windows flawlessly and still fail at single-label classification because it can't resist adding an explanation.&lt;/p&gt;

&lt;p&gt;When you flatten all of this into one score, you lose the information that actually matters for your decision.&lt;/p&gt;

&lt;h2&gt;
  
  
  Variance follows capability boundaries
&lt;/h2&gt;

&lt;p&gt;Here's something I didn't expect to find: model variance is not strongly correlated with model size or price. It follows a capability boundary pattern.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Capability far exceeds task difficulty → stable success&lt;/li&gt;
&lt;li&gt;Capability roughly matches task difficulty → high variance&lt;/li&gt;
&lt;li&gt;Capability far below task difficulty → stable failure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most dangerous zone is the middle one. A model near the edge of its capability for a task will give you brilliant output sometimes and garbage other times. Single-run benchmarks can't detect this. You need multiple passes with stability tracking to see it.&lt;/p&gt;

&lt;p&gt;This is why consistency metrics matter as much as accuracy. A model that scores 75% with perfect stability is often more valuable in production than one that scores 82% but fluctuates wildly.&lt;/p&gt;
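&lt;p&gt;A simple way to encode that preference is to penalize run-to-run spread. This is a toy formulation (the penalty weight is an arbitrary knob, not a standard metric), but it shows how a stable 75% can outrank a volatile 82%.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: mean score minus a penalty proportional to run-to-run
# spread. The penalty weight is a made-up knob.
import statistics

def stability_adjusted(scores, penalty=1.0):
    return statistics.mean(scores) - penalty * statistics.pstdev(scores)

stable   = [0.75, 0.75, 0.75, 0.75, 0.75]
volatile = [0.98, 0.60, 0.95, 0.62, 0.95]   # mean = 0.82

print(round(stability_adjusted(stable), 3))    # 0.75
print(round(stability_adjusted(volatile), 3))  # ≈ 0.648
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;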

&lt;h2&gt;
  
  
  Models regress silently
&lt;/h2&gt;

&lt;p&gt;Another pattern that doesn't get enough attention: capability drift.&lt;/p&gt;

&lt;p&gt;I've observed models regress on tasks even when the model name stays the same and prompts remain unchanged. A model scores 82% in January, you retest in March, it scores 71%. Same API endpoint. Same prompt. Different results.&lt;/p&gt;

&lt;p&gt;Possible causes: alignment layer adjustments, silent model updates, decoding policy changes, backend routing changes. The providers don't announce these. Most developers never detect it because they don't run controlled evaluations on a schedule.&lt;/p&gt;

&lt;p&gt;This is why I treat benchmark results as perishable data. If you're routing production traffic based on an evaluation you ran three months ago, you might already be misrouting.&lt;/p&gt;
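&lt;p&gt;The fix is mechanical: re-run the same evaluation on a schedule and diff it against a baseline. A sketch, with a made-up 5-point threshold and fabricated scores:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: treat benchmark results as perishable and re-check them
# on a schedule. Scores are fabricated; the threshold is arbitrary.

def detect_drift(baseline, current, threshold=0.05):
    """Flag any model whose score dropped more than the threshold."""
    return {
        model: (baseline[model], current[model])
        for model in baseline
        if baseline[model] - current.get(model, 0.0) &gt; threshold
    }

january = {"model-a": 0.82, "model-b": 0.77}
march   = {"model-a": 0.71, "model-b": 0.78}   # model-a regressed silently

print(detect_drift(january, march))
# {'model-a': (0.82, 0.71)}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;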

&lt;h2&gt;
  
  
  The prompt that generates the benchmark can fail the benchmark
&lt;/h2&gt;

&lt;p&gt;One of the more interesting things I've noticed: when a model generates evaluation prompts and expected answers, it doesn't necessarily perform well on those tasks itself.&lt;/p&gt;

&lt;p&gt;A model can write a perfectly valid classification test with correct expected labels, then fail that exact test when evaluated. The asymmetry between generating instructions and following them is real, and it means you can't trust a model to evaluate itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real question
&lt;/h2&gt;

&lt;p&gt;The AI industry is obsessed with: &lt;em&gt;"Which model is best?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;After a year of benchmarking, I'm convinced this is the wrong question.&lt;/p&gt;

&lt;p&gt;The right question is: &lt;em&gt;"Which model is best for this specific task, with this specific prompt structure, in this specific workflow?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That question can only be answered by benchmarking the workflow, not the model.&lt;/p&gt;

&lt;p&gt;Static leaderboards answer the first question. Custom, task-specific, repeatable benchmarking answers the second. The gap between these two approaches is where most teams are silently overpaying, underperforming, or both.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Bio: Marc Kean Paker is the founder of &lt;a href="https://openmark.ai"&gt;OpenMark&lt;/a&gt;, an AI model benchmarking platform for deterministic, cost-aware model selection across 100+ models.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>benchmark</category>
      <category>devtools</category>
    </item>
    <item>
      <title>The Price Per Million Tokens Is Lying to You</title>
      <dc:creator>OpenMark</dc:creator>
      <pubDate>Thu, 05 Mar 2026 01:57:10 +0000</pubDate>
      <link>https://forem.com/openmarkai/the-price-per-million-tokens-is-lying-to-you-1j28</link>
      <guid>https://forem.com/openmarkai/the-price-per-million-tokens-is-lying-to-you-1j28</guid>
      <description>&lt;p&gt;About 9 months ago, I was building a RAG system, for those who don't know its a kind of enhanced memory system for AI agents. One of the agentic flows needed semantic similarity, and I had GPT-4o running it because, well, it was OpenAI’s flagship model. Best model, best results, right? &lt;/p&gt;

&lt;p&gt;I decided to actually test that assumption. After a few days of systematic testing, I found that a model costing roughly 10x less (GPT-4.1-mini at the time) was giving me equal or better results on that specific task. Not marginally. Noticeably better. On a task I assumed required the most recent, most expensive option.&lt;/p&gt;

&lt;p&gt;That experience broke something in how I thought about AI model selection, and I've spent the months since digging into why this happens and how widespread it is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The pricing page tells you almost nothing.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every AI provider publishes a price per million tokens. Input tokens, output tokens, maybe a cached rate. Simple enough. But this number is close to meaningless in production because it ignores two things that completely change the math.&lt;/p&gt;

&lt;p&gt;First, tokenization. Different models tokenize the same input differently: GPT-5, Claude Sonnet 4.5, Gemini 3.0 Flash, and so on. Give them the exact same prompt, the exact same input text, and they will produce different token counts. Sometimes the difference is 10-15%. Sometimes it's more. So "price per million tokens" is comparing apples to oranges from the start, because a million tokens from one model does not represent the same amount of work as a million tokens from another.&lt;/p&gt;

&lt;p&gt;Second, and this is the bigger one: output volume. This is where reasoning and chain-of-thought models completely blow up the math. A model like DeepSeek Reasoner, gpt-5.2-pro, or Claude Opus 4.6 will think through a problem step by step, and that thinking generates tokens. Lots of them. You ask two models the same question: one gives you a 200 token answer, the other gives you 3,000 tokens of reasoning plus a 200 token answer. The second model might be cheaper per million tokens and still cost you 5x more on the actual task.&lt;/p&gt;
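&lt;p&gt;Here's the arithmetic, with invented prices and token counts: a model that is 5x cheaper per million tokens can still be the more expensive one per task once its reasoning tokens are billed.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: per-task cost vs. price-per-million. All prices and
# token counts are illustrative, not real provider numbers.

def cost_per_task(price_in, price_out, tokens_in, tokens_out):
    # prices are USD per million tokens
    return (price_in * tokens_in + price_out * tokens_out) / 1_000_000

# "Cheap" verbose model: 5x lower sticker price, but it emits
# 3,000 reasoning tokens plus a 200-token answer.
cheap_verbose = cost_per_task(0.50, 2.00, 1_000, 3_200)

# "Expensive" concise model: 5x higher sticker price, 200-token answer.
pricey_concise = cost_per_task(2.50, 10.00, 1_000, 200)

print(round(cheap_verbose, 4), round(pricey_concise, 4))  # 0.0069 0.0045
# 5x cheaper on paper, ~1.5x more expensive per task.
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;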

&lt;p&gt;I've seen this over and over. A model that is "10x cheaper" by the pricing page ends up being more expensive in practice because of how it handles the workload. And a model that looks expensive on paper can be cheaper per task because it's efficient with its tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why generic benchmarks don't help here&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The instinct when choosing a model is to check the leaderboards. MMLU, HumanEval, LMArena, LiveBench. These are useful for understanding general capability. But they tell you nothing about your specific use case.&lt;/p&gt;

&lt;p&gt;I'm not being contrarian here. This is just the reality of how these models work. The variables are incredibly subtle. The way you phrase a prompt, the structure of your input, even the position of a comma can change which model performs best. A model that scores 92% on MMLU might score 60% on your classification task while a model that scores 85% on MMLU nails it at 95%.&lt;/p&gt;

&lt;p&gt;And none of these benchmarks account for cost. You could be using the "best" model on the leaderboard and spending 10x what you need to, because a model three tiers below it handles your specific workload just as well, if not better.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What actually matters in production&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're running AI in production, or even just evaluating which model to use for a project, the metrics that matter are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Accuracy on YOUR task. Not a generic benchmark. Your actual prompts, your actual data, your actual expected outputs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Real token cost. Not price per million, but what the model actually costs you per task, per call, per pipeline run. This includes input tokens (which vary by tokenizer), output tokens (which vary wildly by model behavior), and any reasoning tokens that get billed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Latency. Time to first token and total completion time. For agentic workflows or user-facing features, this matters as much as cost.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Consistency. Some models give you brilliant output 70% of the time and garbage the other 30%. Others are boringly reliable. For production, boring and reliable wins every time.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
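&lt;p&gt;All four of these fall out of the same logged runs. A sketch with fabricated data, just to show the shape of the computation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: the four metrics above, computed from logged runs.
# Every value here is fabricated for illustration.
import statistics

runs = [  # (correct, cost_usd, latency_s) per benchmark call
    (True, 0.0031, 1.2), (True, 0.0029, 1.1), (False, 0.0035, 1.4),
    (True, 0.0030, 1.2), (True, 0.0028, 1.0),
]

accuracy    = sum(1 for ok, _, _ in runs if ok) / len(runs)
cost_task   = statistics.mean(c for _, c, _ in runs)
latency     = statistics.mean(t for _, _, t in runs)
consistency = 1 - statistics.pstdev(1.0 if ok else 0.0 for ok, _, _ in runs)

print(accuracy, round(cost_task, 5), round(latency, 2), round(consistency, 2))
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;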

&lt;p&gt;The problem is that getting these numbers requires actually running your workload across multiple models. Not once, not with one prompt, but systematically, on a schedule, with enough variation to get statistically meaningful results. Most teams don't do this because it's tedious and time-consuming. They pick the model that "feels right" based on leaderboard rankings and what seems to work, ship it, and never look back.&lt;/p&gt;

&lt;p&gt;This is how you end up spending $10k/month on API calls when $2k would give you the same output quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The real lesson&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The AI model market is moving fast. New models every few weeks. Price cuts, capability jumps, new providers entering. The model that was optimal for your use case three months ago might not be optimal today.&lt;/p&gt;

&lt;p&gt;The only way to actually know what works best for you is to test it. On your data, with your prompts, measuring the things that matter for your specific situation. Everything else is guessing.&lt;/p&gt;

&lt;p&gt;I learned this the hard way when I found out I was overpaying by 10x on a pipeline I assumed needed a flagship model. Since then, I've made it a practice to re-evaluate model selection whenever a significant new release drops. The cost savings and performance improvements make it worth it every single time.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Bio: Marc Kean Paker is the founder of &lt;a href="https://openmark.ai" rel="noopener noreferrer"&gt;OpenMark&lt;/a&gt;, an AI model benchmarking platform designed to move teams away from leaderboard guessing and toward deterministic, cost-aware model selection.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>benchmark</category>
      <category>devtools</category>
    </item>
    <item>
      <title>I Benchmarked 10 AI Models on Reading Human Emotions</title>
      <dc:creator>OpenMark</dc:creator>
      <pubDate>Thu, 19 Feb 2026 14:17:38 +0000</pubDate>
      <link>https://forem.com/openmarkai/i-benchmarked-10-ai-models-on-reading-human-emotions-3m0b</link>
      <guid>https://forem.com/openmarkai/i-benchmarked-10-ai-models-on-reading-human-emotions-3m0b</guid>
      <description>&lt;p&gt;Every time a new AI model drops, the same ritual plays out. The leaderboard updates. Twitter erupts. Someone posts a chart showing Model X beat Model Y by 2.3% on MMLU. People make purchasing decisions based on these numbers.&lt;/p&gt;

&lt;p&gt;And I think most of it is nonsense.&lt;/p&gt;

&lt;p&gt;I don't say this lightly. I've spent the last year building &lt;a href="https://openmark.ai" rel="noopener noreferrer"&gt;OpenMark&lt;/a&gt;, a platform that lets you benchmark AI models on &lt;em&gt;your own tasks&lt;/em&gt; with deterministic scoring and real API cost tracking. The deeper I go into benchmarking, the more I realize how fundamentally broken the way we evaluate AI models is.&lt;/p&gt;

&lt;p&gt;Let me show you what I mean with real data.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Experiment: Can AI Read Human Emotions?
&lt;/h2&gt;

&lt;p&gt;I took four movie stills, scenes most humans would immediately recognize, and asked 10 AI models to identify the emotion. The twist: increasing complexity.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Julia Roberts, Pretty Woman&lt;/strong&gt; Obviously happy. Baseline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Matthew McConaughey, Interstellar&lt;/strong&gt; Obviously sad. Still straightforward.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Michael Scott, The Office&lt;/strong&gt; Happy teary eyed experession? This is where it gets ambiguous.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Joaquin Phoenix, Joker&lt;/strong&gt; Neutral expression (cover picture), the joker makeup really messes with the AI models ability to understand what is going on.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each model ran the task 3 times (so 12 total calls per model, 4 images × 3 runs) with stability tracking. Here's what happened:&lt;/p&gt;

&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Stability&lt;/th&gt;
&lt;th&gt;Cost/task&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5.2 (OpenAI)&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;td&gt;±0.000&lt;/td&gt;
&lt;td&gt;$0.0085&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gemini-3-pro (Google)&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;td&gt;±0.000&lt;/td&gt;
&lt;td&gt;$0.0614&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gemini-3-flash (Google)&lt;/td&gt;
&lt;td&gt;68%&lt;/td&gt;
&lt;td&gt;±1.000&lt;/td&gt;
&lt;td&gt;$0.0060&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;grok-4-1-fast (xAI)&lt;/td&gt;
&lt;td&gt;57%&lt;/td&gt;
&lt;td&gt;±1.000&lt;/td&gt;
&lt;td&gt;$0.0009&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sonar (Perplexity)&lt;/td&gt;
&lt;td&gt;57%&lt;/td&gt;
&lt;td&gt;±1.000&lt;/td&gt;
&lt;td&gt;$0.0256&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;llama4-maverick (Meta)&lt;/td&gt;
&lt;td&gt;50%&lt;/td&gt;
&lt;td&gt;±0.000&lt;/td&gt;
&lt;td&gt;$0.0020&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3.5-397B (Alibaba)&lt;/td&gt;
&lt;td&gt;50%&lt;/td&gt;
&lt;td&gt;±0.000&lt;/td&gt;
&lt;td&gt;$0.0073&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;claude-sonnet-4.6 (Anthropic)&lt;/td&gt;
&lt;td&gt;50%&lt;/td&gt;
&lt;td&gt;±0.000&lt;/td&gt;
&lt;td&gt;$0.0148&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;claude-opus-4.6 (Anthropic)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;50%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;±0.000&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.0246&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mistral-medium (Mistral)&lt;/td&gt;
&lt;td&gt;42%&lt;/td&gt;
&lt;td&gt;±1.000&lt;/td&gt;
&lt;td&gt;$0.0022&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;10 models. Real API costs. 3 runs per model for stability. Exported from &lt;a href="https://openmark.ai" rel="noopener noreferrer"&gt;OpenMark&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now, stop and look at this data.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Most Expensive Model Tied With One Costing 12x Less
&lt;/h2&gt;

&lt;p&gt;Claude Opus 4.6, Anthropic's flagship in the "Very High" pricing tier at $0.025 per task, scored &lt;strong&gt;exactly 50%&lt;/strong&gt;. The same score as Llama 4 Maverick at $0.002 per task.&lt;/p&gt;

&lt;p&gt;That's a 12x price difference for identical performance.&lt;/p&gt;

&lt;p&gt;On any generic leaderboard, Opus 4.6 ranks significantly above Maverick. MMLU, HumanEval, MATH, Opus wins on all of them. And yet, on &lt;em&gt;this specific task&lt;/em&gt;, with &lt;em&gt;this specific prompt&lt;/em&gt;, the budget model matched the premium one perfectly.&lt;/p&gt;

&lt;p&gt;If you were making a purchasing decision based on leaderboard rankings, you'd be paying 12x more than you needed to.&lt;/p&gt;

&lt;h2&gt;
  
  
  Half the Models Changed Their Mind
&lt;/h2&gt;

&lt;p&gt;Look at the stability column. Half the models scored ±0.000: they gave the exact same answer every single run. The other half scored ±1.000: they literally changed their interpretation of the same image across runs.&lt;/p&gt;

&lt;p&gt;Gemini 3 Flash, Grok, Sonar, Mistral, all unstable. Same image, same prompt, different answer depending on when you ask.&lt;/p&gt;

&lt;p&gt;This is why single-run benchmarks are fundamentally meaningless. If your model can't give the same answer twice, what exactly did your benchmark measure? The model's capability? Or just... luck?&lt;/p&gt;
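&lt;p&gt;For categorical answers, a stability measure can be as simple as disagreement with the majority answer across runs. This is my own toy formulation, not necessarily how any platform computes its ± numbers:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: run-to-run stability for categorical answers.
from collections import Counter

def stability(answers):
    """0.0 = every run agreed; higher = more disagreement."""
    majority = Counter(answers).most_common(1)[0][1]
    return 1.0 - majority / len(answers)

print(stability(["happy", "happy", "happy"]))   # 0.0, stable
print(stability(["happy", "sad", "neutral"]))   # unstable
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;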

&lt;h2&gt;
  
  
  The 80,000-Call Problem (And Why Every Leaderboard Is Lying to You)
&lt;/h2&gt;

&lt;p&gt;Here's where I get genuinely frustrated.&lt;/p&gt;

&lt;p&gt;To properly benchmark a model on a single task, you'd need to account for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stability&lt;/strong&gt;: Run each prompt at least &lt;strong&gt;10 times&lt;/strong&gt; to get reliable variance data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Language variation&lt;/strong&gt;: Test across at least &lt;strong&gt;20 languages&lt;/strong&gt; (tokenization affects reasoning)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Syntax variation&lt;/strong&gt;: Rephrase the same question &lt;strong&gt;20 different ways&lt;/strong&gt; (formal, casual, terse, verbose, with typos, without)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt variation&lt;/strong&gt;: &lt;strong&gt;20 fundamentally different phrasings&lt;/strong&gt; of the same underlying question&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's &lt;code&gt;10 × 20 × 20 × 20 = 80,000 calls&lt;/code&gt;. For &lt;em&gt;one task&lt;/em&gt;. On one model.&lt;/p&gt;

&lt;p&gt;And this is conservative. Add tool use? Multiply by another N. Add multimodal inputs? Another N. Different system prompts? Another N. You're easily at 500,000+ calls to truly benchmark one model on one capability.&lt;/p&gt;

&lt;p&gt;No leaderboard does this. Not MMLU. Not HumanEval. Not LMArena. Not SWE-bench. Why? Because it isn't feasible: the resources required to run 500,000+ calls, for each task, for each model, would be unfathomable. They run each question once, maybe a handful of times, and call it a score. Then people use that score to decide which model to bet their product on.&lt;/p&gt;

&lt;p&gt;Brilliant researchers are out there designing these benchmarks, and I respect the work deeply. But the fundamental limitation isn't effort or intelligence, it's that &lt;strong&gt;you can never escape the prompt problem&lt;/strong&gt;. The way you ask the question &lt;em&gt;is&lt;/em&gt; the test, as much as the question itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Car Wash Problem: When the Benchmark Is Dumber Than the Model
&lt;/h2&gt;

&lt;p&gt;There's a popular "gotcha" making the rounds. The car wash problem:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I need to get my car washed. The car wash is 100 meters away. Should I go by car or by foot?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Many models say "by foot", it's only 100 meters. And people hold this up triumphantly: &lt;em&gt;"See? AI can't reason! You need your car at the car wash!"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But think about this for two seconds. The question is intentionally ambiguous. Maybe your car is &lt;em&gt;already at the car wash&lt;/em&gt;. Maybe someone else is driving it there. Maybe you're asking about how &lt;em&gt;you&lt;/em&gt; should get there, not the car. The question doesn't specify.&lt;/p&gt;

&lt;p&gt;This isn't a model failure. It's a prompt failure. The question is designed to be misleading, and then we blame the model for being misled.&lt;/p&gt;

&lt;p&gt;You know what's worse than the car wash problem? Ask &lt;em&gt;humans&lt;/em&gt; what's heavier, 1 kg of feathers or 1 kg of lead. Way too many will say lead. And that's a question with an objectively correct, unambiguous answer, not an intentionally vague one. The car wash example is manufactured outrage, and people cling to it because it confirms the narrative that "AI isn't ready."&lt;/p&gt;

&lt;p&gt;AI &lt;em&gt;might&lt;/em&gt; not be ready. But the car wash problem doesn't prove it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Matters
&lt;/h2&gt;

&lt;p&gt;Here's what I've learned from building a benchmarking platform and running thousands of model evaluations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The only benchmark that matters is yours.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not MMLU. Not HumanEval. Not some leaderboard aggregating scores across tasks you'll never use. The question is brutally simple:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Does this specific model, with this specific prompt, for this specific task, give me the result I expect, reliably and at a price I can afford?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That's it. That's the whole question.&lt;/p&gt;

&lt;p&gt;In my emotion detection test, GPT-5.2 won at 75% with perfect stability for $0.0085 per task. But if your use case is high-volume classification where "good enough" works and cost matters, Grok 4.1 Fast at 57% for $0.0009 delivers roughly 63,000 accuracy points per dollar, about 7x better value than the winner.&lt;/p&gt;

&lt;p&gt;No leaderboard will tell you that. Only &lt;em&gt;your&lt;/em&gt; benchmark will.&lt;/p&gt;
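&lt;p&gt;For the curious, accuracy-per-dollar here is just score points divided by task cost (my formulation), computed straight from the results table above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: accuracy-per-dollar from the results table, defined as
# score points per dollar of per-task cost.

table = {
    "gpt-5.2":       (75, 0.0085),
    "grok-4-1-fast": (57, 0.0009),
}

value = {m: score / cost for m, (score, cost) in table.items()}
print({m: round(v) for m, v in value.items()})
print(round(value["grok-4-1-fast"] / value["gpt-5.2"], 1))   # ≈ 7.2x
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;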

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;I built &lt;a href="https://openmark.ai" rel="noopener noreferrer"&gt;OpenMark&lt;/a&gt; because I was tired of making model decisions based on other people's benchmarks. You can write any task, code review, classification, creative writing, vision analysis, anything, pick your models, and get deterministic scores with real API costs.&lt;/p&gt;

&lt;p&gt;100+ models. Side-by-side comparison. Stability metrics. Accuracy-per-dollar. The stuff leaderboards don't show you.&lt;/p&gt;

&lt;p&gt;The benchmark I ran for this article took about 2 minutes to set up.&lt;/p&gt;

&lt;p&gt;Stop trusting leaderboards. Benchmark your own work.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
    </item>
    <item>
      <title>I Built a Tool to Benchmark 100+ LLMs on My Actual Use Case — Here's What I Learned</title>
      <dc:creator>OpenMark</dc:creator>
      <pubDate>Mon, 09 Feb 2026 23:20:40 +0000</pubDate>
      <link>https://forem.com/openmarkai/i-built-a-tool-to-benchmark-100-llms-on-my-actual-use-case-heres-what-i-learned-9ll</link>
      <guid>https://forem.com/openmarkai/i-built-a-tool-to-benchmark-100-llms-on-my-actual-use-case-heres-what-i-learned-9ll</guid>
      <description>&lt;p&gt;Static leaderboards rank LLMs on generic benchmarks like MMLU and HumanEval. But when I needed to pick a model for &lt;em&gt;my&lt;/em&gt; specific task — extracting structured data from messy legal documents — those scores were useless.&lt;/p&gt;

&lt;p&gt;So I built &lt;a href="https://openmark.ai" rel="noopener noreferrer"&gt;OpenMark&lt;/a&gt;, an open benchmarking platform that lets you test 100+ AI models on &lt;strong&gt;your actual prompt&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Static Benchmarks
&lt;/h2&gt;

&lt;p&gt;Every week there's a new "State of the Art" model. MMLU scores keep climbing. But here's the thing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A model that tops coding benchmarks might be terrible at your specific domain&lt;/li&gt;
&lt;li&gt;Pricing varies 100x between providers — and leaderboards don't show real API costs&lt;/li&gt;
&lt;li&gt;Response quality can vary between runs — stability matters as much as peak performance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I was tired of switching models based on hype, only to find they performed worse on my actual workload.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;OpenMark lets you:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Write your real prompt&lt;/strong&gt; (or describe your task and let AI generate a benchmark YAML)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Select models&lt;/strong&gt; across providers — GPT-5.2, Claude 4.5 Sonnet, Gemini 3.0 Flash, DeepSeek chat, Llama, Mistral, and 100+ more&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run a benchmark&lt;/strong&gt; that hits real APIs and scores responses deterministically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compare results&lt;/strong&gt; with actual latency, token costs, and consistency metrics&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here's what makes it different from playgrounds and arenas:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Static Leaderboard&lt;/th&gt;
&lt;th&gt;Playground/Arena&lt;/th&gt;
&lt;th&gt;OpenMark&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Your actual task&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅ (manual)&lt;/td&gt;
&lt;td&gt;✅ (automated)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Real API costs&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deterministic scoring&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌ (vibes)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100+ models at once&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌ (2-4)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stability metrics&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Surprising Things I Learned
&lt;/h2&gt;

&lt;p&gt;After running hundreds of benchmarks, some patterns emerged:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Expensive ≠ Better (for your task)
&lt;/h3&gt;

&lt;p&gt;For straightforward extraction tasks, &lt;strong&gt;GPT-4.1 Mini&lt;/strong&gt; and &lt;strong&gt;Gemini 2.0 Flash&lt;/strong&gt; consistently matched or beat models costing 10-50x more. The expensive models shine on complex reasoning — but most production prompts don't need that.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Stability Varies Wildly
&lt;/h3&gt;

&lt;p&gt;Some models give you a perfect answer 9/10 times and garbage the 10th. If you're building production systems, that 10% failure rate matters more than the peak score. Running multiple iterations revealed which models you can actually &lt;em&gt;trust&lt;/em&gt;.&lt;/p&gt;
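&lt;p&gt;The compounding is what makes this brutal. If a model is right 9 times out of 10 and your pipeline calls it once per step, reliability decays geometrically with chain length (numbers below are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: why a 10% per-call failure rate dominates in pipelines.

def pipeline_success(step_reliability, steps):
    return step_reliability ** steps

for steps in (1, 3, 5, 10):
    print(steps, round(pipeline_success(0.9, steps), 3))
# 10 chained calls at 90% each succeed only about 35% of the time.
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;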

&lt;h3&gt;
  
  
  3. The "Best" Model Changes With Your Prompt
&lt;/h3&gt;

&lt;p&gt;I tested the same &lt;em&gt;concept&lt;/em&gt; with three different prompt phrasings. The model rankings reshuffled each time. This is why static benchmarks are misleading — they test one phrasing, once.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Newer Isn't Always Better
&lt;/h3&gt;

&lt;p&gt;The latest release often has rough edges. Models that have been available for a few months tend to be more stable and better optimized for cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Try It
&lt;/h2&gt;

&lt;p&gt;You can run a benchmark in under 60 seconds:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to &lt;a href="https://openmark.ai" rel="noopener noreferrer"&gt;openmark.ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Describe your task (e.g., "Which LLM is best at summarizing medical research papers?")&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Quick Benchmark&lt;/strong&gt; — it auto-generates a task, picks diverse models, and starts running&lt;/li&gt;
&lt;li&gt;Watch results stream in with scores, costs, and latency&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is the 'quick' flow. You can also go fully hands-on: manually create everything from scratch and select exactly the configuration and models you want to run.&lt;/p&gt;

&lt;p&gt;Free tier gives you 100 credits to start. A typical benchmark across 8 models costs ~4-8 credits.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;I'm working on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Programmatic content pages&lt;/strong&gt; for common comparisons (best LLM for coding, best LLM for writing, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benchmark history&lt;/strong&gt; so you can track model improvements over time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Team sharing&lt;/strong&gt; for collaborative evaluation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you've been picking models based on Twitter hype or static leaderboards, give it a shot with your actual use case. The results might surprise you, and they can be genuinely valuable. I found it &lt;em&gt;very&lt;/em&gt; valuable for establishing which models I wanted to power my RAG's agentic flows.&lt;/p&gt;




&lt;p&gt;🔗 &lt;a href="https://openmark.ai" rel="noopener noreferrer"&gt;OpenMark — Benchmark AI Models on Your Actual Task&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What's your experience been with model selection? Have you found that benchmark scores match your real-world results? Drop a comment 👇&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>machinelearning</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
