LLM Leaderboard 2026

Benchmark Leaderboard

Compare 13+ language models across MMLU, HumanEval, and GSM8K benchmarks. Sort, filter by use case, and compare models head-to-head.

Top Knowledge: Claude 4 Opus, 91.2% (MMLU)
Top Coder: Claude 4 Opus, 96.2% (HumanEval)
Top Math: o1 Pro, 99.1% (GSM8K)
Best Value: Gemini 2.0 Flash, $0.1 per 1M tokens

#1 Claude 4 Opus
Anthropic · ~400B · 500K · $15/1M
MMLU 91.2% · HumanEval 96.2% · GSM8K 98.1%
Tags: reasoning, research

#2 Gemini 3 Ultra
Google · ~1T (MoE) · 2M · $5/1M
MMLU 90.8% · HumanEval 91.5% · GSM8K 97.2%
Tags: multimodal, long-context

#3 o1 Pro
OpenAI · Unknown · 200K · $60/1M
MMLU 90.3% · HumanEval 92.4% · GSM8K 99.1%
Tags: reasoning, math

#4 GPT-4o
OpenAI · ~200B · 128K · $2.5/1M
MMLU 88.7% · HumanEval 90.2% · GSM8K 95.8%
Tags: multimodal, vision

#5 Claude 3.5 Sonnet
Anthropic · ~70B · 200K · $3/1M
MMLU 88.3% · HumanEval 92% · GSM8K 96.4%
Tags: coding, analysis

#6 DeepSeek V3
DeepSeek · 671B (MoE, 37B active) · 128K · $0.27/1M
MMLU 87.5% · HumanEval 89.6% · GSM8K 94.8%
Tags: open-source, coding

#7 Llama 4 Maverick
Meta · 400B (MoE, 17B active) · 1M · Free
MMLU 87.5% · HumanEval 85.5% · GSM8K 93.7%
Tags: open-source, general

#8 Grok 3
xAI · ~314B · 131K · $3/1M
MMLU 87.5% · HumanEval 88.9% · GSM8K 94.8%
Tags: reasoning, real-time

#9 Qwen 2.5 72B
Alibaba · 72B · 128K · Free
MMLU 86% · HumanEval 86.7% · GSM8K 95.2%
Tags: open-source, multilingual

#10 Llama 4 Scout
Meta · 109B (MoE, 17B active) · 10M · Free
MMLU 84.8% · HumanEval 78.2% · GSM8K 90.5%
Tags: open-source, long-context

#11 Phi-4
Microsoft · 14B · 16K · Free
MMLU 84.8% · HumanEval 82.6% · GSM8K 91.5%
Tags: open-source, lightweight

#12 Mistral Large 3
Mistral · ~123B · 128K · $2/1M
MMLU 84% · HumanEval 84.2% · GSM8K 91.3%
Tags: open-source, enterprise

#13 Gemini 2.0 Flash
Google · ~8B · 1M · $0.1/1M
MMLU 81.2% · HumanEval 78.4% · GSM8K 89.3%
Tags: fast, budget

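The sorting and "top" highlights on this page can be reproduced offline from the listed figures. The sketch below is a minimal illustration, not the site's implementation: the `Model` class, the helper names, and the reading of "Best Value" as "lowest non-zero price" are all assumptions, and only a subset of the rows is copied in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    mmlu: float
    humaneval: float
    gsm8k: float
    price_per_1m: float  # USD per 1M input tokens, as listed on the page


# A subset of the leaderboard rows, copied as listed.
MODELS = [
    Model("Claude 4 Opus", 91.2, 96.2, 98.1, 15.0),
    Model("Gemini 3 Ultra", 90.8, 91.5, 97.2, 5.0),
    Model("o1 Pro", 90.3, 92.4, 99.1, 60.0),
    Model("GPT-4o", 88.7, 90.2, 95.8, 2.5),
    Model("Gemini 2.0 Flash", 81.2, 78.4, 89.3, 0.1),
]


def rank_by(models, benchmark):
    """Return models sorted by the given benchmark field, highest score first."""
    return sorted(models, key=lambda m: getattr(m, benchmark), reverse=True)


def top(models, benchmark):
    """Return the single best model on the given benchmark field."""
    return rank_by(models, benchmark)[0]


def cheapest_paid(models):
    """Assumed definition of 'Best Value': lowest price among paid models."""
    paid = [m for m in models if m.price_per_1m > 0]
    return min(paid, key=lambda m: m.price_per_1m)


if __name__ == "__main__":
    print(top(MODELS, "mmlu").name)       # Claude 4 Opus
    print(top(MODELS, "gsm8k").name)      # o1 Pro
    print(cheapest_paid(MODELS).name)     # Gemini 2.0 Flash
```

Passing a different field name to `rank_by` (`"humaneval"`, `"gsm8k"`) gives the other sort orders.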
MMLU

Massive Multitask Language Understanding — tests knowledge across 57 academic subjects including math, science, law, and humanities.

HumanEval

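Code generation benchmark: 164 hand-written Python programming problems. A solution counts as correct only if the generated function passes the problem's unit tests, so the score reflects function-level correctness rather than performance on large, real-world codebases.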
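HumanEval-style scoring runs each generated function against unit tests and reports the fraction that pass. The following is a toy sketch of that idea only: the `add` problem, the helper names, and the bare `exec` call are illustrative assumptions. A real harness would run untrusted model output in a sandbox with timeouts.

```python
def passes_tests(candidate_source: str, test_source: str, entry_point: str) -> bool:
    """Return True if the candidate function passes every assertion in the tests."""
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)            # defines the candidate function
        exec(test_source, namespace)                 # defines check(candidate)
        namespace["check"](namespace[entry_point])   # raises on the first failure
    except Exception:
        return False
    return True


def pass_rate(candidates, test_source, entry_point):
    """Fraction of candidate solutions that pass the tests."""
    results = [passes_tests(c, test_source, entry_point) for c in candidates]
    return sum(results) / len(results)


# Hypothetical toy problem, not taken from the benchmark itself.
TESTS = (
    "def check(candidate):\n"
    "    assert candidate(1, 2) == 3\n"
    "    assert candidate(-1, 1) == 0\n"
)
GOOD = "def add(a, b):\n    return a + b\n"
BAD = "def add(a, b):\n    return a - b\n"

if __name__ == "__main__":
    print(passes_tests(GOOD, TESTS, "add"))    # True
    print(passes_tests(BAD, TESTS, "add"))     # False
    print(pass_rate([GOOD, BAD], TESTS, "add"))  # 0.5
```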

GSM8K

Grade School Math 8K: roughly 8,500 grade-school math word problems requiring multi-step reasoning.

Open Source / Free weights
Proprietary API only
Price per 1M input tokens
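Because prices are quoted per 1M input tokens, the input cost of a request is the token count divided by one million, multiplied by the listed price. A minimal sketch follows. The function name and the 250,000-token workload are illustrative assumptions, and output-token pricing, which the page does not list, is ignored.

```python
def input_cost_usd(num_tokens: int, price_per_1m: float) -> float:
    """Cost in USD of sending num_tokens input tokens at a per-1M-token price."""
    return num_tokens / 1_000_000 * price_per_1m


if __name__ == "__main__":
    tokens = 250_000  # hypothetical workload size
    # Prices copied from the leaderboard entries.
    for name, price in [("Claude 4 Opus", 15.0), ("GPT-4o", 2.5), ("Gemini 2.0 Flash", 0.1)]:
        print(f"{name}: ${input_cost_usd(tokens, price):.3f}")
```

At 250,000 input tokens this gives $3.75 for Claude 4 Opus, about $0.63 for GPT-4o, and about $0.03 for Gemini 2.0 Flash.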