Benchmark Leaderboard
Compare 13+ language models across MMLU, HumanEval, and GSM8K benchmarks. Sort, filter by use case, and compare models head-to-head.
Top Knowledge: Claude 4 Opus, 91.2% (MMLU)
Top Coder: Claude 4 Opus, 96.2% (HumanEval)
Top Math: o1 Pro, 99.1% (GSM8K)
Best Value: Gemini 2.0 Flash, $0.10 per 1M tokens
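As a rough illustration of how summary cards like these could be derived from the leaderboard rows, here is a minimal Python sketch. The `ModelEntry` record, its field names, and the `top_by` and `summary_cards` helpers are hypothetical and not taken from the site. The sketch also assumes that "Best Value" means the lowest price per 1M tokens, which the page does not state. Only the figures shown in the cards are included; every other value is left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelEntry:
    # Hypothetical record shape; None marks a value the page does not show.
    name: str
    mmlu: Optional[float] = None
    humaneval: Optional[float] = None
    gsm8k: Optional[float] = None
    usd_per_million_tokens: Optional[float] = None


# Only the figures shown in the summary cards; everything else stays unknown.
ENTRIES = [
    ModelEntry("Claude 4 Opus", mmlu=91.2, humaneval=96.2),
    ModelEntry("o1 Pro", gsm8k=99.1),
    ModelEntry("Gemini 2.0 Flash", usd_per_million_tokens=0.10),
]


def top_by(entries, field, lowest=False):
    """Return the entry with the best known value for `field`, or None if no entry has one."""
    known = [e for e in entries if getattr(e, field) is not None]
    if not known:
        return None
    pick = min if lowest else max
    return pick(known, key=lambda e: getattr(e, field))


def summary_cards(entries):
    """Map each card title to the name of the winning model."""
    cards = {
        "Top Knowledge": top_by(entries, "mmlu"),
        "Top Coder": top_by(entries, "humaneval"),
        "Top Math": top_by(entries, "gsm8k"),
        # Assumption: "best value" is taken to mean the lowest listed price.
        "Best Value": top_by(entries, "usd_per_million_tokens", lowest=True),
    }
    return {title: (entry.name if entry else None) for title, entry in cards.items()}
```

Entries with a missing score are skipped rather than treated as zero, so a model without a published GSM8K result cannot accidentally rank last or first on that card.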
Google · ~1T (MoE) · 2M
$5/1M
DeepSeek · 671B (MoE, 37B active) · 128K
$0.27/1M
Meta · 400B (MoE, 17B active) · 1M
Meta · 109B (MoE, 17B active) · 10M
Mistral · ~123B · 128K
$2/1M
MMLU (Massive Multitask Language Understanding): tests knowledge across 57 academic subjects, including math, science, law, and the humanities.
HumanEval: a code generation benchmark of 164 Python programming problems. Measures whether generated code is functionally correct, as judged by each problem's unit tests.
GSM8K (Grade School Math 8K): about 8,500 grade school math word problems requiring multi-step reasoning.
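The percentages on this page are accuracy scores. As a minimal sketch of how a GSM8K-style score might be computed, the Python below takes the last number in each model output and compares it with the last number in the reference answer. The page does not describe its scoring method, so this is an assumption, and the helper names `final_number` and `accuracy_percent` are hypothetical. It relies on the fact that reference solutions in the public GSM8K dataset end with a line of the form `#### <answer>`.

```python
import re

# Matches integers and decimals, allowing thousands separators such as 1,250.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def final_number(text):
    """Return the last number in `text` as a float, or None if there is none."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def accuracy_percent(outputs, references):
    """Percentage of outputs whose final number equals the reference's final number."""
    if len(outputs) != len(references):
        raise ValueError("outputs and references must have the same length")
    if not outputs:
        return 0.0
    correct = sum(
        1
        for out, ref in zip(outputs, references)
        if final_number(out) is not None and final_number(out) == final_number(ref)
    )
    return 100.0 * correct / len(outputs)
```

For example, `accuracy_percent(["The total is 72.", "I think 10"], ["#### 72", "#### 11"])` counts one of two answers as correct. Real evaluation harnesses typically apply stricter answer extraction and normalization than this.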