Benchmark

Dayna Blackwell

May 25

We Benchmarked the Most Popular Code Search Tools. We Beat All of Them.

#ai #mcp #benchmark #devtools

11 min read

Vilius

May 26

Two Models Just Hit 90% on Agent Coding. One Cost Less Than a Penny.

#ai #agents #benchmark #llm

2 min read

Ofri Peretz

May 25

The False Positive Tax: a 1:1 TP:FP analysis of eslint-plugin-security

#security #eslint #javascript #benchmark

11 min read

Gabriel Anhaia

May 24

Multi-Shot vs Zero-Shot: When Adding Examples Actually Hurts Accuracy

#ai #llm #prompt #benchmark

8 min read

Dmytro Klymentiev

May 23

How does an AI agent pick from 686 skills in a second?

#ai #benchmark #embeddings #claudecode

7 min read

Jangwook Kim

May 22

LMR-BENCH: Can LLM Agents Reproduce NLP Research Code? (EMNLP 2025)

#benchmark #researchreproducibility #llmagents #paperpoc

5 min read

Michael Fairchild

May 21

AI-generated accessibility, an update — frontier models still fail, but skills change the game

#a11y #llm #ai #benchmark

6 min read

Ofri Peretz

May 25

I Benchmarked 17 ESLint Security Plugins. Only One Found Every Vulnerability.

#security #eslint #javascript #benchmark

9 min read

Andreas Ebner

May 20

Why Code Golfing is the Ultimate Test for Multimodal LLMs (And a New Benchmark to Prove It)

#opensource #ai #webdev #benchmark

1 min read

shaun vd

May 20

Claude Sonnet 4.6 vs GPT-4.1 vs Gemini 2.5 Flash: which wins JSON extraction?

#ai #llm #benchmark #claude

3 min read

Vitaliy Ryumshyn

May 18

Benchmarks- Kubernetes MCP Servers Passed. That Was Not Enough.

#kubernetes #ai #benchmark #opensource

4 min read

Luc B. Perussault-Diallo

May 15

How do you benchmark an MCP server you built?

#ai #mcp #claude #benchmark

8 min read

Rob

May 11

Model Showdown Round 4: Opus vs Qwen — Writers, Not Coders

#ai #llm #benchmark #agents

10 min read

Bruno Juca

May 10

Why Most Browser AI Demos Fail on Real Hardware

#ai #inference #hardware #benchmark

4 min read

Rob

May 8

The Agentic Gap: Claude Oneshots, Gemma Fails

#ai #llm #benchmark #homelab

9 min read
