LLM Leaderboard 2026

Benchmark Leaderboard

Compare 13+ language models across MMLU, HumanEval, and GSM8K benchmarks. Sort, filter by use case, and compare models head-to-head.

Top Knowledge: Claude 4 Opus, 91.2% (MMLU)

Top Coder: Claude 4 Opus, 96.2% (HumanEval)

Top Math: o1 Pro, 99.1% (GSM8K)

Best Value: Gemini 2.0 Flash, $0.1 per 1M tokens


Rank	Model	Developer	Parameters	Tags	MMLU	HumanEval	GSM8K	Price per 1M tokens	Context
1	Claude 4 Opus	Anthropic	~400B	reasoning, research	91.2%	96.2%	98.1%	$15 in, $75 out	500K
2	Gemini 3 Ultra	Google	~1T (MoE)	multimodal, long-context	90.8%	91.5%	97.2%	$5 in, $20 out	2M
3	o1 Pro	OpenAI	Unknown	reasoning, math	90.3%	92.4%	99.1%	$60 in, $240 out	200K
4	GPT-4o	OpenAI	~200B	multimodal, vision	88.7%	90.2%	95.8%	$2.5 in, $10 out	128K
5	Claude 3.5 Sonnet	Anthropic	~70B	coding, analysis	88.3%	92%	96.4%	$3 in, $15 out	200K
6	DeepSeek V3	DeepSeek	671B (MoE, 37B active)	open-source, coding	87.5%	89.6%	94.8%	$0.27 in, $1.1 out	128K
7	Llama 4 Maverick	Meta	400B (MoE, 17B active)	open-source, general	87.5%	85.5%	93.7%	Free	1M
8	Grok 3	xAI	~314B	reasoning, real-time	87.5%	88.9%	94.8%	$3 in, $15 out	131K
9	Qwen 2.5 72B	Alibaba	72B	open-source, multilingual	86%	86.7%	95.2%	Free	128K
10	Llama 4 Scout	Meta	109B (MoE, 17B active)	open-source, long-context	84.8%	78.2%	90.5%	Free	10M
11	Phi-4	Microsoft	14B	open-source, lightweight	84.8%	82.6%	91.5%	Free	16K
12	Mistral Large 3	Mistral	~123B	open-source, enterprise	84%	84.2%	91.3%	$2 in, $6 out	128K
13	Gemini 2.0 Flash	Google	~8B	fast, budget	81.2%	78.4%	89.3%	$0.1 in, $0.4 out	1M
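
The sorting, tag filtering, and head-to-head comparison that the page offers can be reproduced offline on the table's rows. The following is a minimal sketch, not the site's implementation: the `Model` class, the field names, and the helper functions are all assumptions made for illustration, and only five rows are copied in.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Model:
    name: str
    mmlu: float
    humaneval: float
    gsm8k: float
    input_price: float  # USD per 1M input tokens
    tags: tuple

# A few rows copied from the leaderboard table; other rows follow the same shape.
MODELS = [
    Model("Claude 4 Opus", 91.2, 96.2, 98.1, 15.0, ("reasoning", "research")),
    Model("Gemini 3 Ultra", 90.8, 91.5, 97.2, 5.0, ("multimodal", "long-context")),
    Model("o1 Pro", 90.3, 92.4, 99.1, 60.0, ("reasoning", "math")),
    Model("DeepSeek V3", 87.5, 89.6, 94.8, 0.27, ("open-source", "coding")),
    Model("Gemini 2.0 Flash", 81.2, 78.4, 89.3, 0.1, ("fast", "budget")),
]

BENCHMARKS = ("mmlu", "humaneval", "gsm8k")

def sort_by(models, metric):
    """Return models ordered from highest to lowest on the given field."""
    return sorted(models, key=lambda m: getattr(m, metric), reverse=True)

def filter_by_tag(models, tag):
    """Keep only models that carry the given use-case tag."""
    return [m for m in models if tag in m.tags]

def head_to_head(a, b):
    """Per-benchmark score difference (a minus b), rounded to one decimal."""
    return {k: round(getattr(a, k) - getattr(b, k), 1) for k in BENCHMARKS}

print(sort_by(MODELS, "gsm8k")[0].name)                      # o1 Pro
print([m.name for m in filter_by_tag(MODELS, "reasoning")])  # ['Claude 4 Opus', 'o1 Pro']
print(head_to_head(MODELS[0], MODELS[2]))
# {'mmlu': 0.9, 'humaneval': 3.8, 'gsm8k': -1.0}
```

Sorting on a different column is a matter of passing another field name, for example `sort_by(MODELS, "humaneval")`.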

MMLU

Massive Multitask Language Understanding — tests knowledge across 57 academic subjects including math, science, law, and humanities.
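
MMLU questions are four-option multiple choice, so a score is simply the share of questions answered correctly. The sketch below shows that calculation per subject and overall. The toy records are hypothetical, and the leaderboard does not say how its own MMLU figures were aggregated or prompted, so treat this as an illustration of the metric only.

```python
from collections import defaultdict

def accuracy_by_subject(records):
    """records: iterable of (subject, predicted_choice, correct_choice)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for subject, predicted, correct in records:
        totals[subject] += 1
        hits[subject] += int(predicted == correct)
    return {subject: hits[subject] / totals[subject] for subject in totals}

def overall_accuracy(records):
    """Fraction of all questions answered correctly, ignoring subject."""
    records = list(records)
    return sum(p == c for _, p, c in records) / len(records)

# Hypothetical predictions, for illustration only.
SAMPLE = [
    ("law", "A", "A"),
    ("law", "C", "B"),
    ("physics", "D", "D"),
    ("physics", "B", "B"),
]

print(accuracy_by_subject(SAMPLE))  # {'law': 0.5, 'physics': 1.0}
print(overall_accuracy(SAMPLE))     # 0.75
```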

HumanEval

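Code generation benchmark: 164 Python programming problems, each scored by running the generated function against unit tests. Measures functional correctness on short, self-contained tasks rather than real-world coding ability in general.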
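
A HumanEval-style check comes down to running a generated function against a problem's unit tests and counting how many problems pass. The following is a rough sketch of that idea. The `passes_tests` helper, the `add` problem, and both candidate solutions are made up for illustration and are not taken from the benchmark.

```python
def passes_tests(candidate_source: str, test_source: str) -> bool:
    """Define the candidate, then run assert-based tests in the same namespace."""
    namespace = {}
    try:
        exec(candidate_source, namespace)  # defines the generated function
        exec(test_source, namespace)       # raises AssertionError on a failed test
    except Exception:
        return False
    return True

# Hypothetical problem: the tests expect a function named `add`.
PROBLEM_TESTS = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
GOOD = "def add(a, b):\n    return a + b\n"
BAD = "def add(a, b):\n    return a - b\n"

results = [passes_tests(GOOD, PROBLEM_TESTS), passes_tests(BAD, PROBLEM_TESTS)]
pass_rate = sum(results) / len(results)
print(results, pass_rate)  # [True, False] 0.5
```

Calling `exec` on model output runs arbitrary code, so a real harness would isolate each run in a sandboxed process with a timeout.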

GSM8K

Grade School Math 8K: about 8,500 grade school math word problems requiring multi-step reasoning.
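
GSM8K reference solutions end with a line of the form `#### <answer>`, so a common way to score a response is to compare its final number with that value. The sketch below assumes that "last number in the output" rule. The helper names and example strings are hypothetical, and the leaderboard does not state which answer-extraction rule produced its figures.

```python
import re

# Integers or decimals, optionally negative, with optional thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def final_number(text: str):
    """Return the last number found in text as a float, or None if there is none."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))

def is_correct(model_output: str, reference: str) -> bool:
    """Compare the model's final number with the value after '####'."""
    gold = final_number(reference.split("####")[-1])
    predicted = final_number(model_output)
    return predicted is not None and predicted == gold

# Hypothetical problem and responses, for illustration only.
reference = "3 * 12 = 36\n#### 36"
print(is_correct("3 boxes * 12 = 36 eggs. The answer is 36.", reference))  # True
print(is_correct("The answer is 35.", reference))                          # False
```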

Open Source / Free weights

Proprietary API only

Price per 1M input tokens
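
Because the table lists separate input and output prices, the cost of a workload depends on its mix of prompt and completion tokens. The sketch below uses prices copied from the table. The `request_cost` helper and the workload of 2M input and 500K output tokens are assumptions chosen for illustration.

```python
def request_cost(input_tokens, output_tokens, input_price, output_price):
    """USD cost of a workload; both prices are USD per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# (input, output) prices per 1M tokens, copied from the leaderboard table.
PRICES = {
    "Claude 4 Opus": (15.0, 75.0),
    "GPT-4o": (2.5, 10.0),
    "Gemini 2.0 Flash": (0.1, 0.4),
}

# Hypothetical workload: 2M input tokens and 500K output tokens.
for name, (price_in, price_out) in PRICES.items():
    cost = request_cost(2_000_000, 500_000, price_in, price_out)
    print(f"{name}: ${cost:.2f}")
# Claude 4 Opus: $67.50
# GPT-4o: $10.00
# Gemini 2.0 Flash: $0.40
```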
