AI Benchmark Dashboard

Compare 18+ AI models across 8 standardized benchmarks. Scores sourced from Papers with Code, Hugging Face Open LLM Leaderboard, and official model cards. Updated regularly with the latest results.

8 benchmarks | 18 models tracked | 4 categories

Most benchmark wins:
1. o3: 3 wins
2. Claude Opus 4.6: 2 wins
3. o3 (high compute): 1 win
4. GPT-4o: 1 win

ARC-AGI

Abstraction and Reasoning Corpus — tests novel pattern recognition and abstract reasoning. Widely regarded as one of the hardest AI benchmarks.

Category: reasoning | 6 models | Updated March 18, 2026

1. o3 (high compute) (OpenAI): 87.5%
2. o3-mini (OpenAI): 77.0%
3. Claude Opus 4.6 (Anthropic): 53.0%
4. DeepSeek R1 (DeepSeek): 42.0%
5. Gemini 2.0 Ultra (Google): 38.5%
6. GPT-4o (OpenAI): 21.0%
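ARC-AGI is scored by exact match: given a handful of input/output grid examples, the model must produce the exact output grid for each held-out test input, and a prediction only counts when it matches cell for cell. Below is a minimal sketch of that scoring rule; the example grids are made up, and the official harness allows a small number of attempts per test input, which this sketch ignores.

```python
from typing import List

Grid = List[List[int]]  # ARC grids are small 2D arrays of color indices (0-9)

def score_arc_task(predictions: List[Grid], solutions: List[Grid]) -> float:
    """Fraction of test inputs whose predicted grid matches the solution exactly.

    A prediction counts only if the grid shape and every cell are identical.
    """
    correct = sum(1 for pred, sol in zip(predictions, solutions) if pred == sol)
    return correct / len(solutions)

# Illustrative example: one of two test grids reproduced exactly -> 0.5 for this task.
predicted = [[[0, 1], [1, 0]], [[2, 2], [2, 2]]]
expected  = [[[0, 1], [1, 0]], [[2, 2], [2, 0]]]
print(score_arc_task(predicted, expected))  # 0.5
```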

GSM8K

Grade School Math — 8,500 grade-school word problems. Tests basic mathematical reasoning and multi-step problem solving.

Category: math | 10 models | Updated March 18, 2026

1. o3 (OpenAI): 99.2%
2. Claude Opus 4.6 (Anthropic): 97.8%
3. Gemini 2.0 Ultra (Google): 96.1%
4. GPT-4o (OpenAI): 95.8%
5. DeepSeek V3 (DeepSeek): 95.0%
6. Claude Sonnet 4.6 (Anthropic): 94.5%
7. Llama 4 Maverick (Meta): 93.7%
8. Qwen 3 235B (Alibaba): 93.2%
9. GPT-4o Mini (OpenAI): 87.0%
10. Gemini 2.0 Flash (Google): 86.5%
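GSM8K is usually graded by extracting the final numeric answer from the model's worked solution and comparing it to the reference answer, which in the released dataset sits after a "####" marker. Here is a rough sketch assuming a naive "last number in the text" extraction rule; real evaluation harnesses are more careful about formats such as fractions and units.

```python
import re

def extract_final_number(text: str) -> str | None:
    """Return the last number in the text, dropping thousands separators."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return matches[-1].replace(",", "") if matches else None

def gsm8k_exact_match(model_output: str, reference_answer: str) -> bool:
    """Compare the model's final number against the reference's '####' answer."""
    gold = reference_answer.split("####")[-1].strip().replace(",", "")
    pred = extract_final_number(model_output)
    return pred is not None and float(pred) == float(gold)

# Illustrative example (not a real GSM8K item):
output = "Each box holds 12 eggs, so 4 boxes hold 4 * 12 = 48 eggs."
reference = "4 * 12 = 48 eggs\n#### 48"
print(gsm8k_exact_match(output, reference))  # True
```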

LMSYS Chatbot Arena Elo

Crowdsourced, blind human preference rankings. Users vote between anonymous model outputs, so rankings reflect aggregate human preference rather than performance on a fixed test set.

Category: reasoning | 10 models | Updated March 18, 2026

1. GPT-4o (OpenAI): 1287 Elo
2. Claude Opus 4.6 (Anthropic): 1280 Elo
3. Gemini 2.0 Ultra (Google): 1275 Elo
4. Grok 3 (xAI): 1268 Elo
5. Claude Sonnet 4.6 (Anthropic): 1260 Elo
6. DeepSeek V3 (DeepSeek): 1248 Elo
7. Llama 4 Maverick (Meta): 1240 Elo
8. Qwen 3 235B (Alibaba): 1232 Elo
9. GPT-4o Mini (OpenAI): 1200 Elo
10. Mistral Large 2.5 (Mistral): 1195 Elo
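Arena-style Elo ratings come from pairwise votes: each time a voter prefers one model's answer, both ratings shift toward that outcome in proportion to how surprising it was. Below is a minimal sketch of a classic online Elo update; the K-factor and the example are illustrative, and the live leaderboard is computed with a sturdier statistical fit over all votes rather than this one-vote-at-a-time rule.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one pairwise comparison."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - exp_a)
    return rating_a + delta, rating_b - delta

# With the ratings above, GPT-4o (1287) would be expected to beat
# GPT-4o Mini (1200) in roughly 62% of head-to-head votes.
print(round(expected_score(1287, 1200), 2))  # ~0.62
```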

SWE-bench Verified

Tests the ability to resolve real GitHub issues from popular open-source Python repositories. Widely treated as the standard benchmark for agentic coding.

Category: coding | 7 models | Updated March 18, 2026

1. Claude Opus 4.6 + Scaffold (Anthropic): 72.0%
2. o3 + Scaffold (OpenAI): 69.1%
3. Claude Sonnet 4.6 (Anthropic): 55.3%
4. DeepSeek R1 (DeepSeek): 49.2%
5. Gemini 2.0 Ultra (Google): 47.8%
6. GPT-4o (OpenAI): 38.4%
7. Llama 4 Maverick (Meta): 32.1%
GPQA Diamond

Graduate-level science questions written by PhD experts. Extremely hard — even domain experts only score ~65%.

Category: reasoning | 8 models | Updated March 18, 2026

1. o3 (OpenAI): 87.7%
2. Claude Opus 4.6 (Anthropic): 74.9%
3. Gemini 2.0 Ultra (Google): 72.1%
4. DeepSeek R1 (DeepSeek): 71.5%
5. Claude Sonnet 4.6 (Anthropic): 65.0%
6. Llama 4 Maverick (Meta): 58.3%
7. Grok 3 (xAI): 56.7%
8. GPT-4o (OpenAI): 53.6%

MATH

Competition-level math problems drawn from contests such as AMC, AIME, and olympiads. Tests advanced mathematical reasoning and problem-solving.

Category: math | 10 models | Updated March 18, 2026

1. o3 (OpenAI): 96.7%
2. Claude Opus 4.6 (Anthropic): 85.1%
3. Gemini 2.0 Ultra (Google): 83.9%
4. DeepSeek R1 (DeepSeek): 83.5%
5. Qwen 3 235B (Alibaba): 81.2%
6. Llama 4 Maverick (Meta): 79.1%
7. GPT-4o (OpenAI): 76.6%
8. Grok 3 (xAI): 76.0%
9. Mistral Large 2.5 (Mistral): 74.5%
10. Claude Sonnet 4.6 (Anthropic): 73.8%

HumanEval

Evaluates code generation by asking models to complete Python functions from docstrings. Tests practical programming ability.

Category: coding | 12 models | Updated March 18, 2026

1. Claude Opus 4.6 (Anthropic): 93.7%
2. GPT-4o (OpenAI): 90.2%
3. DeepSeek V3 (DeepSeek): 89.8%
4. Gemini 2.0 Ultra (Google): 88.4%
5. Claude Sonnet 4.6 (Anthropic): 88.0%
6. Qwen 3 235B (Alibaba): 87.5%
7. Llama 4 Maverick (Meta): 85.5%
8. Grok 3 (xAI): 85.0%
9. Mistral Large 2.5 (Mistral): 84.1%
10. GPT-4o Mini (OpenAI): 80.5%
11. Gemini 2.0 Flash (Google): 79.2%
12. Claude Haiku 4.5 (Anthropic): 78.3%
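HumanEval scores are usually reported as pass@k: generate n candidate completions per problem, run the unit tests, and estimate the probability that at least one of k randomly drawn candidates passes. The leaderboard numbers above are pass@1-style figures. Below is the standard unbiased estimator from the original HumanEval paper, with made-up sample counts:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples generated, c of them correct."""
    if n - c < k:
        return 1.0  # not enough failures to fill a k-sample draw with all failures
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(samples: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one per problem."""
    return sum(pass_at_k(n, c, k) for n, c in samples) / len(samples)

# Hypothetical run: 20 samples per problem, varying numbers correct.
results = [(20, 18), (20, 5), (20, 0)]
print(round(100 * benchmark_pass_at_k(results, k=1), 1))  # pass@1 as a percentage: 38.3
```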

MMLU

Massive Multitask Language Understanding — measures knowledge across 57 academic subjects including STEM, humanities, and social sciences.

Category: knowledge | 12 models | Updated March 18, 2026

1. Claude Opus 4.6 (Anthropic): 92.3%
2. Gemini 2.0 Ultra (Google): 90.8%
3. Claude Sonnet 4.6 (Anthropic): 89.5%
4. GPT-4o (OpenAI): 88.7%
5. Llama 4 Maverick (Meta): 88.2%
6. DeepSeek V3 (DeepSeek): 87.1%
7. Qwen 3 235B (Alibaba): 86.5%
8. Mistral Large 2.5 (Mistral): 86.3%
9. Grok 3 (xAI): 85.7%
10. Gemini 2.0 Flash (Google): 83.4%
11. GPT-4o Mini (OpenAI): 82.0%
12. Claude Haiku 4.5 (Anthropic): 80.1%
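MMLU is four-option multiple choice, and the headline score is an average over its 57 subjects. Because subjects contain different numbers of questions, it matters whether a score is micro-averaged (over every question) or macro-averaged (mean of per-subject accuracies); reported numbers can use either convention and the two can diverge. A small sketch of both, with made-up counts for three subjects:

```python
from typing import Dict, Tuple

def mmlu_averages(per_subject: Dict[str, Tuple[int, int]]) -> Tuple[float, float]:
    """per_subject maps subject -> (correct, total).

    Returns (micro average over all questions, macro average over subjects), as %.
    """
    total_correct = sum(c for c, _ in per_subject.values())
    total_questions = sum(t for _, t in per_subject.values())
    micro = 100.0 * total_correct / total_questions
    macro = 100.0 * sum(c / t for c, t in per_subject.values()) / len(per_subject)
    return micro, macro

# Made-up counts for three of the 57 subjects:
counts = {"college_physics": (80, 100), "us_history": (180, 200), "law": (300, 400)}
print(mmlu_averages(counts))  # roughly (80.0, 81.7)
```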

Frequently Asked Questions

What is the best AI model in 2026?

It depends on the task. Claude Opus 4.6 leads on MMLU (92.3%) and HumanEval (93.7%) for general knowledge and coding. OpenAI o3 dominates math with 96.7% on MATH and 87.7% on GPQA Diamond. GPT-4o leads the Chatbot Arena Elo ratings.

What is the MMLU benchmark?

MMLU (Massive Multitask Language Understanding) tests AI models across 57 subjects including STEM, humanities, and social sciences. It measures broad knowledge and reasoning ability. Scores above 90% indicate expert-level performance.

What is HumanEval?

HumanEval measures AI code generation using 164 hand-written Python programming problems. Models must generate correct, functional code that passes test cases. Top models now score above 90%, meaning they solve nearly all of these short, self-contained problems.

How are AI benchmarks scored?

Most benchmarks use accuracy percentage (0-100%). Some use Elo ratings (Chatbot Arena) or pass rates (SWE-bench). Scores come from official evaluations by model providers and independent organizations.

Data sourced from Papers with Code, Hugging Face, and official model evaluations. Updated regularly. Scores reflect official reported results only.