last update: 4 Feb 2025

LLM Leaderboard

Top Models per Task
[Charts: Best in Multitask Reasoning (MMLU), Best in Coding (HumanEval), Best in Math (MATH)]
Fastest and Most Affordable Models
[Charts: Fastest Models, Lowest Latency (TTFT), Cheapest Models]
Compare Models
Model | Context size | Cutoff date | Input/Output cost (per 1M tokens) | Max output tokens | Latency (TTFT) | Throughput
Grok-2 mini | 128,000 | December 2024 | N/A | 8,000 | N/A | N/A
Grok-2 | 128,000 | December 2024 | N/A | 8,000 | N/A | N/A
DeepSeek-R1 | 128,000 | December 2024 | $0.55 / $2.19 | 8,000 | 40s | 58 t/s
OpenAI o3-mini | 200,000 | December 2024 | $1.1 / $4.4 | 8,000 | 10.5s | 214 t/s
OpenAI o1-mini | 128,000 | December 2024 | $3 / $12 | 8,000 | 11.43s | 196 t/s
Qwen2.5-72b | 131,000 | December 2024 | $0.4 / $0.75 | 8,000 | 1.34s | 61 t/s
DeepSeek V3 | 128,000 | December 2024 | $0.27 / $1.1 | 8,000 | 1.47s | 47 t/s
OpenAI o1 | 200,000 | October 2023 | $15 ($7.50 cached) / $60 | 100,000 | N/A | N/A
Gemini 2.0 Flash | 1,000,000 | December 2024 | N/A | 8,192 | 0.48s | 169 t/s
Llama 3.3 70b | 128,000 | July 2024 | $0.59 / $0.79 | 32,768 | N/A | 250 t/s via Groq
Nova Micro | 128,000 | July 2024 | $0.04 / $0.14 | 4,096 | 0.3s | N/A
Nova Lite | 300,000 | July 2024 | $0.06 / $0.24 | 4,096 | 0.4s | 150 t/s
Nova Pro | 300,000 | July 2024 | $0.8 / $3.2 | 4,096 | 0.4s | 97 t/s
Claude 3.5 Haiku | 200,000 | July 2024 | $0.80 / $4 | 4,096 | 0.77s | 60 t/s
Llama 3.1 405b | 128,000 | Dec 2023 | $5 / $10 | 4,096 | 0.59s | 163 t/s via SambaNova
Llama 3.1 70b | 128,000 | Dec 2023 | $0.6 / N/A | 4,096 | 0.38s | 2,100 t/s (Cerebras)
Llama 3.1 8b | 128,000 | Dec 2023 | $0.05 / $0.08 | 4,096 | 0.32s | ~1,800 t/s (Cerebras)
Gemini 1.5 Flash | 1,000,000 | May 2024 | $0.075 / $0.3 | 4,096 | 1.06s | 166 t/s
Gemini 1.5 Pro | 2,000,000 | May 2024 | $3.5 / $10.5 | 4,096 | 1.12s | 61 t/s
GPT-3.5 Turbo | 16,400 | Sept 2023 | $0.5 / $1.5 | 4,096 | 0.37s | 84 t/s
GPT-4o mini | 128,000 | Oct 2023 | $0.15 / $0.6 | 4,096 | 0.56s | 97 t/s
GPT-4 Turbo | 128,000 | Dec 2023 | $10 / $30 | 4,096 | 0.6s | 28 t/s
GPT-4o | 128,000 | Oct 2023 | $2.5 / $10 | 4,096 | 0.48s | 79 t/s
Claude 3 Haiku | 200,000 | Apr 2024 | $0.25 / $1.25 | 4,096 | 0.55s | 133 t/s
Claude 3.5 Sonnet | 200,000 | Apr 2024 | $3 / $15 | 4,096 | 1.22s | 78 t/s
Claude 3 Opus | 200,000 | Aug 2023 | $15 / $75 | 4,096 | 1.99s | 25 t/s
GPT-4 | 8,192 | Dec 2023 | $30 / $60 | 4,096 | 0.59s | 125 t/s
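
The latency (TTFT) and throughput columns can be combined into a rough estimate of how long a full response takes. The sketch below is an illustration only: it assumes total time is simply TTFT plus output tokens divided by throughput, and it ignores network overhead, queueing, and provider differences. The function name `estimated_response_seconds` and the 500-token response length are assumptions, not part of the leaderboard.

```python
def estimated_response_seconds(ttft_s: float, throughput_tps: float, output_tokens: int) -> float:
    """Rough end-to-end time: time to first token plus time to stream the output."""
    if throughput_tps <= 0:
        raise ValueError("throughput must be positive")
    return ttft_s + output_tokens / throughput_tps


# (TTFT in seconds, throughput in tokens/second), copied from the table above.
models = {
    "GPT-4o": (0.48, 79),
    "Claude 3.5 Sonnet": (1.22, 78),
    "DeepSeek-R1": (40.0, 58),
}

for name, (ttft, tps) in models.items():
    seconds = estimated_response_seconds(ttft, tps, output_tokens=500)
    print(f"{name}: about {seconds:.1f}s for a 500-token response")
```

Under this simple model, a model with low TTFT can still be slower overall on long outputs if its throughput is low, so both columns matter when choosing a model.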
Standard Benchmarks

Model Comparison

AI Benchmark Table
Average MMLU (General) GPQA (Reasoning) HumanEval (Coding) Math BFCL (Tool Use) MGSM (Multilingual)
* For some models, we don't show an average value because there is missing data. Also, every model on this list is the latest version.
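
The footnote's rule (no average when any benchmark score is missing) can be sketched as follows. This is a minimal illustration, not the leaderboard's actual implementation: the function name `average_score`, the model names, and all scores below are hypothetical placeholders.

```python
from typing import Dict, Optional

# Benchmark columns from the table header above.
BENCHMARKS = ["MMLU", "GPQA", "HumanEval", "Math", "BFCL", "MGSM"]


def average_score(scores: Dict[str, float]) -> Optional[float]:
    """Return the mean over all benchmarks, or None if any score is missing."""
    values = [scores.get(name) for name in BENCHMARKS]
    if any(v is None for v in values):
        return None
    return sum(values) / len(values)


# Hypothetical placeholder scores, not real leaderboard data.
complete = {"MMLU": 90.0, "GPQA": 60.0, "HumanEval": 90.0,
            "Math": 70.0, "BFCL": 80.0, "MGSM": 90.0}
partial = {"MMLU": 85.0, "HumanEval": 88.0}

print(average_score(complete))  # mean of the six placeholder scores
print(average_score(partial))   # None, because four benchmarks are missing
```

Returning no value rather than averaging the available scores avoids ranking a model higher just because its weaker benchmarks were never reported.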

Cost and Context Window Comparison

Comparison of context window and cost per 1M tokens.
Models Context Window Input Cost / 1M tokens Output Cost / 1M tokens
GPT-4 8,000 $30.00 $60.00
GPT-4-32k 32,000 $60.00 $120.00
GPT-4 Turbo 128,000 $10.00 $30.00
GPT-3.5 Turbo 16,000 $0.5 $1.5
GPT-3.5 Turbo Instruct 4,000 $1.5 $2.00
Gemini Pro 32,000 $0.125 $0.375
Gemini 1.5 Pro 128,000 $7 $21
Mistral Small 16,000 $2.00 $6.00
Mistral Medium 32,000 $2.7 $8.1
Mistral Large 32,000 $8.00 $24.00
Claude 3 Opus 200,000 $15.00 $75.00
Claude 3 Sonnet 200,000 $3.00 $15.00
Claude 3 Haiku 200,000 $0.25 $1.25
GPT-4o 128,000 $5 $15
Gemini 1.5 Flash 1,000,000 $0.35 $0.70
Claude 3.5 Sonnet 200,000 $3 $15
GPT-4o mini 128,000 $0.15 $0.60
Claude 3.5 Haiku 200,000 $0.80 $4
AWS Nova Pro 300,000 $0.80 $3.20
AWS Nova Lite 300,000 $0.06 $0.24
AWS Nova Micro 128,000 $0.035 $0.14
OpenAI o1 128,000 $15 $60
OpenAI o1-mini 64,000 $1.10 $4.40
OpenAI o3-mini 128,000 $1.10 $4.40
DeepSeek V3 128,000 $0.27 $1.1
DeepSeek R1 128,000 $0.55 $2.19
Gemini 2.0 Flash (Exp) 1,000,000 - -
Qwen2.5-72b 131,000 $0.4 $0.75
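
To turn per-1M-token prices into the cost of a single request, multiply each token count by its price and divide by one million. The sketch below assumes input and output tokens are billed separately at the listed rates, with no caching or batch discounts; the function name `request_cost_usd` and the example token counts are assumptions for illustration.

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in USD of one request, given prices per 1M tokens."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# Prices per 1M tokens (input, output), copied from the table above.
prices = {
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "GPT-4o mini": (0.15, 0.60),
}

# Example workload: 10,000 input tokens and 1,000 output tokens per request.
for model, (in_price, out_price) in prices.items():
    cost = request_cost_usd(10_000, 1_000, in_price, out_price)
    print(f"{model}: ${cost:.4f} per request")
```

Because output tokens usually cost several times more than input tokens, workloads with long responses can cost far more than the input price alone suggests.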

HumanEval: Coding Leaderboard

Comparison of pre-trained proprietary and open-source models for code generation.
Model HumanEval (0 shot)
Llama 3.3 70b 80.5%
AWS Nova Micro 81.1%
AWS Nova Lite 85.4%
AWS Nova Pro 89%
Claude 3.5 Haiku 88.1%
Claude 3.5 Sonnet 93.7%
Claude 3 Opus 84.9%
GPT-4 86.6%
GPT-4o 90.2%
GPT-4 Turbo 87.1%
GPT-4o mini 87.2%
GPT-3.5 Turbo 68%
Gemini 1.5 Pro 71.9%
Gemini 1.5 Flash 71.5%
Llama 3.1 8b 72.6%
Llama 3.1 70b 80.5%
Llama 3.1 405b 89%
Qwen2.5-72b 88%
OpenAI o1-mini 82.6%
Grok-2 88.4%
Grok-2 mini 85.7%
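
A "top N" view like the charts on this page can be derived from these scores by sorting in descending order. The sketch below uses a handful of rows from the table above; the helper name `top_n` is an assumption, and ties keep the order in which the rows were listed.

```python
from typing import List, Tuple

# A few (model, HumanEval 0-shot %) rows copied from the table above.
scores: List[Tuple[str, float]] = [
    ("AWS Nova Pro", 89.0),
    ("Claude 3.5 Sonnet", 93.7),
    ("GPT-4o", 90.2),
    ("GPT-3.5 Turbo", 68.0),
    ("Llama 3.1 405b", 89.0),
    ("Grok-2", 88.4),
]


def top_n(rows: List[Tuple[str, float]], n: int) -> List[Tuple[str, float]]:
    """Return the n highest-scoring rows; Python's stable sort keeps ties in input order."""
    return sorted(rows, key=lambda row: row[1], reverse=True)[:n]


for rank, (model, score) in enumerate(top_n(scores, 3), start=1):
    print(f"{rank}. {model}: {score}%")
```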

Sources

This leaderboard shows a comparison of capabilities, price and context window for leading commercial and open-source LLMs, based on the benchmark data provided in the models' technical reports. Updated March 2024.

Technical reports

Pricing info