AI Benchmark Dashboard
Benchmark scores from official sources: Papers with Code, the Hugging Face Open LLM Leaderboard, and official model cards. Each card below shows the date of its most recent update.
MMLU
Massive Multitask Language Understanding: a multiple-choice benchmark that measures knowledge across 57 academic subjects spanning STEM, the humanities, and the social sciences (scored as accuracy; see the sketch below the leaderboard).
Category: Knowledge · Updated March 10, 2026
1. Claude Opus 4.6 (Anthropic): 92.3%
2. Gemini 2.0 Ultra (Google): 90.8%
3. GPT-4o (OpenAI): 88.7%
4. Llama 4 Maverick (Meta): 88.2%
5. Mistral Large 2.5 (Mistral AI): 86.3%
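MMLU is scored as plain accuracy over its multiple-choice questions. A minimal sketch of that scoring, assuming each item records the gold option letter and the model's predicted letter (the field names are hypothetical, and real harnesses also report per-subject breakdowns):

```python
# Minimal sketch of multiple-choice accuracy scoring for an MMLU-style benchmark.
# The item fields ("gold", "pred") are hypothetical; real harnesses differ in how
# they extract the predicted option and often also macro-average across subjects.

def multiple_choice_accuracy(items: list[dict]) -> float:
    """Fraction of items whose predicted option letter matches the gold letter."""
    if not items:
        return 0.0
    correct = sum(1 for item in items if item["pred"] == item["gold"])
    return correct / len(items)


if __name__ == "__main__":
    sample = [
        {"gold": "B", "pred": "B"},  # correct
        {"gold": "D", "pred": "A"},  # wrong
        {"gold": "C", "pred": "C"},  # correct
    ]
    print(f"accuracy = {multiple_choice_accuracy(sample):.1%}")  # accuracy = 66.7%
```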
HumanEval
Evaluates code generation by asking models to complete Python functions from their docstrings; completions are checked against unit tests and scored by pass@1 rate (see the sketch below the leaderboard).
Category: Coding · Updated March 10, 2026
1. Claude Opus 4.6 (Anthropic): 93.7%
2. GPT-4o (OpenAI): 90.2%
3. Gemini 2.0 Ultra (Google): 88.4%
4. Llama 4 Maverick (Meta): 85.5%
5. Mistral Large 2.5 (Mistral AI): 84.1%
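pass@1 is the fraction of problems whose sampled completion passes the hidden unit tests. When more than one completion is sampled per problem, the unbiased pass@k estimator from the original HumanEval paper (Chen et al., 2021) is the standard way to report it; a minimal sketch:

```python
# Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021):
# for one problem with n sampled completions, c of which pass the unit tests,
#   pass@k = 1 - C(n - c, k) / C(n, k)
# The benchmark score is this quantity averaged over all problems; the
# leaderboard above reports pass@1 (k = 1).

from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k drawn samples passes the tests."""
    if n - c < k:
        return 1.0  # fewer than k failing samples, so some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # One problem: 200 samples drawn, 37 of them pass the unit tests.
    print(f"pass@1  = {pass_at_k(200, 37, 1):.3f}")   # 0.185 (= 37/200)
    print(f"pass@10 = {pass_at_k(200, 37, 10):.3f}")
```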
MATH
Tests competition-level mathematical problem-solving across algebra, geometry, number theory, probability, and calculus (see the grading sketch below the leaderboard).
Category: Math · Updated March 10, 2026
1. Claude Opus 4.6 (Anthropic): 85.1%
2. Gemini 2.0 Ultra (Google): 83.9%
3. Llama 4 Maverick (Meta): 79.1%
4. GPT-4o (OpenAI): 76.6%
5. Mistral Large 2.5 (Mistral AI): 74.5%
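The card above does not say how answers are graded; MATH results are commonly scored by comparing the model's final answer to the reference answer. A minimal, simplified sketch assuming answers are wrapped in \boxed{...} as in the original dataset (real graders use more careful normalization or symbolic equivalence checks):

```python
# Simplified sketch of final-answer grading for MATH-style problems. It assumes
# the answer is the last \boxed{...} in the solution text and has no nested
# braces; real graders parse braces properly and check symbolic equivalence.

import re


def extract_boxed(solution: str) -> str | None:
    """Return the contents of the last un-nested \\boxed{...} in a solution."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1] if matches else None


def normalize(answer: str) -> str:
    """Light cosmetic normalization before comparing answers as strings."""
    return answer.replace(r"\left", "").replace(r"\right", "").replace(" ", "")


def is_correct(model_solution: str, reference_solution: str) -> bool:
    pred = extract_boxed(model_solution)
    gold = extract_boxed(reference_solution)
    return pred is not None and gold is not None and normalize(pred) == normalize(gold)


if __name__ == "__main__":
    reference = r"Adding the cases gives a total of $\boxed{42}$ arrangements."
    model_out = r"So the final answer is \boxed{ 42 }."
    print(is_correct(model_out, reference))  # True
```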