Benchmarks

Peremptory

May 22

Four Chinese Labs Rewrote the Open-Weights Leaderboard in 18 Days

#openweights #chineseai #benchmarks #codingmodels

3 min read

Konstantin Komelin

May 17

The cheapest and fastest way to generate an image

#ai #benchmarks #nanobanana #vercel

1 min read

Ismail zamareh

May 16

Beyond Scores: A Critical Review of Benchmark Reports for Evaluating Large Language Models

#llmevaluation #benchmarks #machinelearning #productiondeployment

7 min read

Arkadiusz Przychocki

May 14

What you measure depends on where you draw the boundary

#java #performance #benchmarks #saga

9 min read

Peremptory

May 19

A Startup Claims to Have Broken the Transformer's Core Bottleneck

#architecture #contextwindow #benchmarks #research

3 min read

Vincenzo Rubino

Apr 24

I benchmarked 10 LLMs on slopsquatting — up to 87% installed fake packages

#ai #security #webdev #benchmarks

9 min read

Owen

Apr 24

DeepSeek V4 Released: Open-Source 1.6T MoE, 1M Context, Apache 2.0 — and It's Already on the API

#ai #deepseek #opensource #benchmarks

6 min read

Owen

Apr 24

GPT-5.5 Released: First Fully Retrained Base Model Since GPT-4.5, 1M Context, $5/$30 Pricing

#ai #openai #gpt #benchmarks

6 min read

김이더

Apr 24

GPT-5.5 Is Out — What the Numbers Actually Say

#ai #openai #gpt #benchmarks

4 min read

Shafiq Ur Rehman

Apr 21

How to Choose the Right AI Model for the Right Job

#ai #benchmarks #modelselection

13 min read

t49qnsx7qt-kpanks

Apr 21

How I took LongMemEval oracle from 62% to 82.8% without touching the retriever

#ai #llm #benchmarks #memory

3 min read

EClawbot Official

Apr 15

What Is Agent Evaluation? How EClaw Arena Benchmarks AI Agents Across 12 Dimensions

#ai #agents #benchmarks #evaluation

3 min read

James AI

Apr 15

Sonnet 4.6 vs Haiku 4.5 vs Opus 4.6: I Tested 3 Claude Models on 10 Real Tasks

#ai #llm #claude #benchmarks

3 min read

Natnael Alemseged

May 5

Why Merged LoRA Barely Changes Inference Time

#machinelearning #llm #benchmarks #ai

6 min read

Penfield

Apr 11

The YC President Endorsed an AI Memory System With Fake Benchmarks. He Also Shipped His Own. We Read the Code.

#ai #aimemory #benchmarks #yc

3 min read
