AI Benchmark Dashboard

Compare 18+ AI models across 8 standardized benchmarks. Scores sourced from Papers with Code, Hugging Face Open LLM Leaderboard, and official model cards. Updated regularly with the latest results.

8 benchmarks | 18 models tracked | 4 categories

Most benchmark wins:
1. o3: 3 wins
2. Claude Opus 4.6: 2 wins
3. o3 (high compute): 1 win
4. GPT-4o: 1 win

ARC-AGI

Abstraction and Reasoning Corpus — tests novel pattern recognition and abstract reasoning. Widely regarded as one of the hardest AI benchmarks.

Category: reasoning | 6 models | Updated March 18, 2026

1. o3 (high compute) (OpenAI): 87.5%
2. o3-mini (OpenAI): 77.0%
3. Claude Opus 4.6 (Anthropic): 53.0%
4. DeepSeek R1 (DeepSeek): 42.0%
5. Gemini 2.0 Ultra (Google): 38.5%
6. GPT-4o (OpenAI): 21.0%
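ARC-AGI is scored by exact match: given a handful of input/output grid examples, the model must produce the exact output grid for each held-out test input, and a prediction only counts when it matches cell for cell. Below is a minimal sketch of that scoring rule; the example grids are made up, and the official harness allows a small number of attempts per test input, which this sketch ignores.

```python
from typing import List

Grid = List[List[int]]  # ARC grids are small 2D arrays of color indices (0-9)

def score_arc_task(predictions: List[Grid], solutions: List[Grid]) -> float:
    """Fraction of test inputs whose predicted grid matches the solution exactly.

    A prediction counts only if the grid shape and every cell are identical.
    """
    correct = sum(1 for pred, sol in zip(predictions, solutions) if pred == sol)
    return correct / len(solutions)

# Illustrative example: one of two test grids reproduced exactly -> 0.5 for this task.
predicted = [[[0, 1], [1, 0]], [[2, 2], [2, 2]]]
expected  = [[[0, 1], [1, 0]], [[2, 2], [2, 0]]]
print(score_arc_task(predicted, expected))  # 0.5
```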

GSM8K

Grade School Math — 8,500 grade-school word problems. Tests basic mathematical reasoning and multi-step problem solving.

Category: math | 10 models | Updated March 18, 2026

1. o3 (OpenAI): 99.2%
2. Claude Opus 4.6 (Anthropic): 97.8%
3. Gemini 2.0 Ultra (Google): 96.1%
4. GPT-4o (OpenAI): 95.8%
5. DeepSeek V3 (DeepSeek): 95.0%
6. Claude Sonnet 4.6 (Anthropic): 94.5%
7. Llama 4 Maverick (Meta): 93.7%
8. Qwen 3 235B (Alibaba): 93.2%
9. GPT-4o Mini (OpenAI): 87.0%
10. Gemini 2.0 Flash (Google): 86.5%
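GSM8K is usually graded by extracting the final numeric answer from the model's worked solution and comparing it to the reference answer, which in the released dataset sits after a "####" marker. Here is a rough sketch assuming a naive "last number in the text" extraction rule; real evaluation harnesses are more careful about formats such as fractions and units.

```python
import re

def extract_final_number(text: str) -> str | None:
    """Return the last number in the text, dropping thousands separators."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return matches[-1].replace(",", "") if matches else None

def gsm8k_exact_match(model_output: str, reference_answer: str) -> bool:
    """Compare the model's final number against the reference's '####' answer."""
    gold = reference_answer.split("####")[-1].strip().replace(",", "")
    pred = extract_final_number(model_output)
    return pred is not None and float(pred) == float(gold)

# Illustrative example (not a real GSM8K item):
output = "Each box holds 12 eggs, so 4 boxes hold 4 * 12 = 48 eggs."
reference = "4 * 12 = 48 eggs\n#### 48"
print(gsm8k_exact_match(output, reference))  # True
```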

LMSYS Chatbot Arena Elo

Crowdsourced, blind human preference rankings. Users vote between anonymous model outputs, so rankings reflect aggregate human preference rather than performance on a fixed test set.

Category: reasoning | 10 models | Updated March 18, 2026

1. GPT-4o (OpenAI): 1287 Elo
2. Claude Opus 4.6 (Anthropic): 1280 Elo
3. Gemini 2.0 Ultra (Google): 1275 Elo
4. Grok 3 (xAI): 1268 Elo
5. Claude Sonnet 4.6 (Anthropic): 1260 Elo
6. DeepSeek V3 (DeepSeek): 1248 Elo
7. Llama 4 Maverick (Meta): 1240 Elo
8. Qwen 3 235B (Alibaba): 1232 Elo
9. GPT-4o Mini (OpenAI): 1200 Elo
10. Mistral Large 2.5 (Mistral): 1195 Elo
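Arena-style Elo ratings come from pairwise votes: each time a voter prefers one model's answer, both ratings shift toward that outcome in proportion to how surprising it was. Below is a minimal sketch of a classic online Elo update; the K-factor and the example are illustrative, and the live leaderboard is computed with a sturdier statistical fit over all votes rather than this one-vote-at-a-time rule.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one pairwise comparison."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - exp_a)
    return rating_a + delta, rating_b - delta

# With the ratings above, GPT-4o (1287) would be expected to beat
# GPT-4o Mini (1200) in roughly 62% of head-to-head votes.
print(round(expected_score(1287, 1200), 2))  # ~0.62
```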

SWE-bench Verified

Tests the ability to resolve real GitHub issues from popular open-source Python repositories. Widely treated as the standard benchmark for agentic coding.

Category: coding | 7 models | Updated March 18, 2026

1. Claude Opus 4.6 + Scaffold (Anthropic): 72.0%
2. o3 + Scaffold (OpenAI): 69.1%
3. Claude Sonnet 4.6 (Anthropic): 55.3%
4. DeepSeek R1 (DeepSeek): 49.2%
5. Gemini 2.0 Ultra (Google): 47.8%
6. GPT-4o (OpenAI): 38.4%
7. Llama 4 Maverick (Meta): 32.1%
GPQA Diamond

Graduate-level science questions written by PhD experts. Extremely hard — even domain experts only score ~65%.

Category: reasoning | 8 models | Updated March 18, 2026

1. o3 (OpenAI): 87.7%
2. Claude Opus 4.6 (Anthropic): 74.9%
3. Gemini 2.0 Ultra (Google): 72.1%
4. DeepSeek R1 (DeepSeek): 71.5%
5. Claude Sonnet 4.6 (Anthropic): 65.0%
6. Llama 4 Maverick (Meta): 58.3%
7. Grok 3 (xAI): 56.7%
8. GPT-4o (OpenAI): 53.6%

MATH

Competition-level math problems drawn from contests such as AMC, AIME, and olympiads. Tests advanced mathematical reasoning and problem-solving.

Category: math | 10 models | Updated March 18, 2026

1. o3 (OpenAI): 96.7%
2. Claude Opus 4.6 (Anthropic): 85.1%
3. Gemini 2.0 Ultra (Google): 83.9%
4. DeepSeek R1 (DeepSeek): 83.5%
5. Qwen 3 235B (Alibaba): 81.2%
6. Llama 4 Maverick (Meta): 79.1%
7. GPT-4o (OpenAI): 76.6%
8. Grok 3 (xAI): 76.0%
9. Mistral Large 2.5 (Mistral): 74.5%
10. Claude Sonnet 4.6 (Anthropic): 73.8%

HumanEval

Evaluates code generation by asking models to complete Python functions from docstrings. Tests practical programming ability.

Category: coding | 12 models | Updated March 18, 2026

1. Claude Opus 4.6 (Anthropic): 93.7%
2. GPT-4o (OpenAI): 90.2%
3. DeepSeek V3 (DeepSeek): 89.8%
4. Gemini 2.0 Ultra (Google): 88.4%
5. Claude Sonnet 4.6 (Anthropic): 88.0%
6. Qwen 3 235B (Alibaba): 87.5%
7. Llama 4 Maverick (Meta): 85.5%
8. Grok 3 (xAI): 85.0%
9. Mistral Large 2.5 (Mistral): 84.1%
10. GPT-4o Mini (OpenAI): 80.5%
11. Gemini 2.0 Flash (Google): 79.2%
12. Claude Haiku 4.5 (Anthropic): 78.3%
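HumanEval scores are usually reported as pass@k: generate n candidate completions per problem, run the unit tests, and estimate the probability that at least one of k randomly drawn candidates passes. The leaderboard numbers above are pass@1-style figures. Below is the standard unbiased estimator from the original HumanEval paper, with made-up sample counts:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples generated, c of them correct."""
    if n - c < k:
        return 1.0  # not enough failures to fill a k-sample draw with all failures
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(samples: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one per problem."""
    return sum(pass_at_k(n, c, k) for n, c in samples) / len(samples)

# Hypothetical run: 20 samples per problem, varying numbers correct.
results = [(20, 18), (20, 5), (20, 0)]
print(round(100 * benchmark_pass_at_k(results, k=1), 1))  # pass@1 as a percentage: 38.3
```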

MMLU

Massive Multitask Language Understanding — measures knowledge across 57 academic subjects including STEM, humanities, and social sciences.

Category: knowledge | 12 models | Updated March 18, 2026

1. Claude Opus 4.6 (Anthropic): 92.3%
2. Gemini 2.0 Ultra (Google): 90.8%
3. Claude Sonnet 4.6 (Anthropic): 89.5%
4. GPT-4o (OpenAI): 88.7%
5. Llama 4 Maverick (Meta): 88.2%
6. DeepSeek V3 (DeepSeek): 87.1%
7. Qwen 3 235B (Alibaba): 86.5%
8. Mistral Large 2.5 (Mistral): 86.3%
9. Grok 3 (xAI): 85.7%
10. Gemini 2.0 Flash (Google): 83.4%
11. GPT-4o Mini (OpenAI): 82.0%
12. Claude Haiku 4.5 (Anthropic): 80.1%
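MMLU is four-option multiple choice, and the headline score is an average over its 57 subjects. Because subjects contain different numbers of questions, it matters whether a score is micro-averaged (over every question) or macro-averaged (mean of per-subject accuracies); reported numbers can use either convention and the two can diverge. A small sketch of both, with made-up counts for three subjects:

```python
from typing import Dict, Tuple

def mmlu_averages(per_subject: Dict[str, Tuple[int, int]]) -> Tuple[float, float]:
    """per_subject maps subject -> (correct, total).

    Returns (micro average over all questions, macro average over subjects), as %.
    """
    total_correct = sum(c for c, _ in per_subject.values())
    total_questions = sum(t for _, t in per_subject.values())
    micro = 100.0 * total_correct / total_questions
    macro = 100.0 * sum(c / t for c, t in per_subject.values()) / len(per_subject)
    return micro, macro

# Made-up counts for three of the 57 subjects:
counts = {"college_physics": (80, 100), "us_history": (180, 200), "law": (300, 400)}
print(mmlu_averages(counts))  # roughly (80.0, 81.7)
```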

Frequently Asked Questions

What is the best AI model in 2026?

It depends on the task. Claude Opus 4.6 leads on MMLU (92.3%) and HumanEval (93.7%) for general knowledge and coding. OpenAI o3 dominates math with 96.7% on MATH and 87.7% on GPQA Diamond. GPT-4o leads the Chatbot Arena Elo ratings.

What is the MMLU benchmark?

MMLU (Massive Multitask Language Understanding) tests AI models across 57 subjects including STEM, humanities, and social sciences. It measures broad knowledge and reasoning ability. Scores above 90% indicate expert-level performance.

What is HumanEval?

HumanEval measures AI code generation using 164 hand-written Python programming problems. Models must generate correct, functional code that passes test cases. Top models now score above 90%, meaning they solve nearly all of these short, self-contained problems.

How are AI benchmarks scored?

Most benchmarks use accuracy percentage (0-100%). Some use Elo ratings (Chatbot Arena) or pass rates (SWE-bench). Scores come from official evaluations by model providers and independent organizations.

Data sourced from Papers with Code, Hugging Face, and official model evaluations. Updated regularly. Scores reflect official reported results only.