Benchmarks



AGI Ranker Audit: Every LLM Score Dropped 6-15 Points

A self-audit of the AGI Ranker leaderboard exposed scoring bias that inflated every model by 6-15 points. Here's what...

July 31, 2026 · 5 min


LLM Agents Flop at Coordination: Inside the ALEM Benchmark

A new open-ended coordination benchmark tests 13 LLMs across communication, trading, crafting, and combat. Most agents...

July 19, 2026 · 8 min


Apple SpeechAnalyzer vs Whisper: Benchmark Verdict

Apple's new SpeechAnalyzer API landed in iOS 26 with big claims. Benchmark data from Inscribe puts it head-to-head with...

July 18, 2026 · 7 min


Talos-XII: Hand-Written Rust Autograd Hits 10k Sims/Sec

A solo-built Rust autograd stack with custom SIMD dispatch models gacha probabilities at 10k+ sims per second. Here's...

July 10, 2026 · 7 min


REAP Explained: Real Coding Benchmarks From Live Agent Traffic

REAP mines production coding agent sessions to build execution-based benchmarks. On the Harvest benchmark it produced,...

July 5, 2026 · 7 min


ScarfBench: IBM's Brutal Test for Java Migration AI

IBM Research's ScarfBench puts AI coding agents through real enterprise Java framework migrations. The results show a...

July 3, 2026 · 7 min


Senior SWE-Bench: The Benchmark That Humbles AI Agents

Snorkel AI's new Senior SWE-Bench flips the script on coding agents by testing them as senior engineers. Top models...

July 2, 2026 · 8 min


FFASR Leaderboard: ASR Benchmarked on Real-World Audio

Treble Technologies and Hugging Face just dropped the FFASR Leaderboard, a far-field ASR benchmark that exposes how...

June 27, 2026 · 8 min


DeepSWE Benchmark: 91 Repos, 5 Languages, Zero Leaks

DeepSWE is a fresh contamination-free coding benchmark spanning 91 repos and 5 languages. Here's what the numbers say...

June 25, 2026 · 8 min


Rio3.5 vs Qwen3.7: Why This Viral Benchmark Smells Off

A tweet claims Rio de Janeiro's city government built an LLM that beats Qwen3.7. No paper, no leaderboard, no weights....

June 23, 2026 · 7 min


Agentic LLM Benchmark: Open Models On Real Tooling

Hugging Face's new agentic benchmark stress-tests open models against your actual toolset. The results expose a gap...

June 18, 2026 · 8 min


Bilingual Voice Agents Hit a Wall: ASR Code-Switch Benchmark

Frontier ASR models stumble when customers mix two languages in one sentence. A new ServiceNow-AI benchmark exposes how...

June 10, 2026 · 8 min
