LLM Leaderboard

Updated: 15 April 2025

This LLM leaderboard displays the latest public benchmark performance for SOTA model versions released after April 2024. The data comes from model providers as well as independent evaluations run by Vellum or the open-source community. We feature results from non-saturated benchmarks and exclude outdated ones (e.g., MMLU). If you want to evaluate these models on your own use cases, try Vellum Evals.
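Most of the scores below are simple pass rates: the percentage of benchmark tasks a model answers correctly. A minimal sketch of that computation, with a hypothetical exact-match grader standing in for each benchmark's real (often more elaborate) grading logic:

```python
# Benchmark-scoring sketch: a score is the share of tasks passed.
# The exact-match grader below is an illustrative placeholder.

def exact_match(prediction: str, answer: str) -> bool:
    """Grade a single task by whitespace- and case-insensitive exact match."""
    return prediction.strip().lower() == answer.strip().lower()

def score(predictions: list[str], answers: list[str]) -> float:
    """Return the pass rate as a percentage."""
    passed = sum(exact_match(p, a) for p, a in zip(predictions, answers))
    return 100 * passed / len(answers)

print(score(["Paris", "4", "blue"], ["paris", "4", "red"]))  # ≈ 66.7: two of three match
```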

Top models per task

Best in Reasoning (GPQA Diamond), Score (%):
1. Grok 3 [Beta]: 84.6
2. Gemini 2.5 Pro: 84
3. OpenAI o3: 83.3
4. OpenAI o4-mini: 81.4
5. OpenAI o3-mini: 79.7
Best in High School Math (AIME 2024), Score (%):
1. OpenAI o4-mini: 93.4
2. Grok 3 [Beta]: 93.3
3. Gemini 2.5 Pro: 92
4. OpenAI o3: 91.6
5. Gemini 2.5 Flash: 88
Best in Agentic Coding (SWE Bench), Score (%):
1. Claude 3.7 Sonnet [R]: 70.3
2. OpenAI o3: 69.1
3. OpenAI o4-mini: 68.1
4. Gemini 2.5 Pro: 63.8
5. Claude 3.7 Sonnet: 62.3
Independent evals
Best in Tool Use (BFCL), Score (%):
1. Llama 3.1 405b: 81.1
2. Llama 3.3 70b: 77.3
3. GPT-4o: 72.08
4. GPT-4.5: 69.94
5. Nova Pro: 68.4
Best in Adaptive Reasoning (GRIND), Score (%):
1. Gemini 2.5 Pro: 82.1
2. Claude 3.7 Sonnet [R]: 60.7
3. Nemotron Ultra 253B: 57.1
4. OpenAI o1: 57.1
5. Llama 4 Maverick: 53.6
Best Overall (Humanity's Last Exam), Score (%):
1. OpenAI o3: 20.32
2. Gemini 2.5 Pro: 18.8
3. OpenAI o4-mini: 14.28
4. OpenAI o3-mini: 14
5. Gemini 2.5 Flash: 12.1
Fastest and most affordable models
Fastest Models, tokens per second:
1. Llama 4 Scout: 2600
2. Llama 3.3 70b: 2500
3. Llama 3.1 70b: 2100
4. Llama 3.1 8b: 1800
5. Llama 3.1 405b: 969
Lowest Latency (TTFT), seconds to first token:
1. Nova Micro: 0.3
2. Llama 3.1 8b: 0.32
3. Llama 4 Scout: 0.33
4. Gemini 2.0 Flash: 0.34
5. GPT-4o mini: 0.35
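TTFT and throughput combine into a rough end-to-end estimate: total time ≈ TTFT + output tokens / tokens per second. A quick sketch using the Llama 4 Scout figures above (a simplification that ignores network and queueing variance):

```python
def response_time(ttft_s: float, throughput_tps: float, output_tokens: int) -> float:
    """Estimate end-to-end generation time: time to first token plus decode time."""
    return ttft_s + output_tokens / throughput_tps

# Llama 4 Scout figures from this page: 0.33 s TTFT, 2600 tokens/s.
print(round(response_time(0.33, 2600, 1000), 2))  # 0.71 s for a 1,000-token reply
```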
Cheapest Models, USD per 1M tokens (input / output):
1. Nova Micro: $0.04 / $0.14
2. Gemma 3 27b: $0.07 / $0.07
3. Gemini 1.5 Flash: $0.075 / $0.3
4. Gemini 2.0 Flash: $0.1 / $0.4
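Per-1M-token prices translate to per-request cost as input tokens times the input price plus output tokens times the output price, divided by one million. A sketch using the Nova Micro prices above:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_price: float, out_price: float) -> float:
    """Cost in USD for one request, given per-1M-token input/output prices."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Nova Micro prices from the chart above: $0.04 input / $0.14 output per 1M tokens.
cost = request_cost(2_000, 500, 0.04, 0.14)
print(f"${cost:.6f}")  # $0.000150 per request
```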
Compare models
| Model | Context size | Cutoff date | I/O cost (USD per 1M tokens) | Max output | Latency | Speed |
|---|---|---|---|---|---|---|
| Gemini 2.5 Flash | 1,000,000 | May 2024 | $0.15 / $0.6 | 30,000 | 0.35 s | 200 t/s |
| OpenAI o3 | 200,000 | May 2024 | $10 / $40 | 100,000 | 28 s | 94 t/s |
| OpenAI o4-mini | 200,000 | May 2024 | $1.1 / $4.4 | 100,000 | 35.3 s | 135 t/s |
| Nemotron Ultra 253B | n/a | n/a | n/a | n/a | n/a | n/a |
| GPT-4.1 nano | 1,000,000 | December 2024 | $0.1 / $0.4 | 32,000 | n/a | n/a |
| GPT-4.1 mini | 1,000,000 | December 2024 | $0.4 / $1.6 | 16,000 | n/a | n/a |
| GPT-4.1 | 1,000,000 | December 2024 | $2 / $8 | 16,000 | n/a | n/a |
| Llama 4 Behemoth | n/a | November 2024 | n/a | n/a | n/a | n/a |
| Llama 4 Scout | 10,000,000 | November 2024 | $0.11 / $0.34 | 8,000 | 0.33 s | 2600 t/s |
| Llama 4 Maverick | 10,000,000 | November 2024 | $0.2 / $0.6 | 8,000 | 0.45 s | 126 t/s |
| Gemma 3 27b | 128,000 | Nov 2024 | $0.07 / $0.07 | 8,192 | 0.72 s | 59 t/s |
| Grok 3 mini [Beta] | n/a | Nov 2024 | n/a | n/a | n/a | n/a |
| Grok 3 [Beta] | n/a | Nov 2024 | n/a | n/a | n/a | n/a |
| Gemini 2.5 Pro | 1,000,000 | Nov 2024 | $1.25 / $10 | 65,000 | 30 s | 191 t/s |
| Claude 3.7 Sonnet | 200,000 | Nov 2024 | $3 / $15 | 128,000 | 0.91 s | 78 t/s |
| GPT-4.5 | 128,000 | Nov 2024 | $75 / $150 | 16,384 | 1.25 s | 48 t/s |
| Claude 3.7 Sonnet [R] | 200,000 | Nov 2024 | $3 / $15 | 64,000 | 0.95 s | 78 t/s |
| DeepSeek-R1 | 128,000 | Dec 2024 | $0.55 / $2.19 | 8,000 | 4 s | 24 t/s |
| OpenAI o3-mini | 200,000 | Dec 2024 | $1.1 / $4.4 | 8,000 | 14 s | 214 t/s |
| OpenAI o1-mini | 128,000 | Dec 2024 | $3 / $12 | 8,000 | 11.43 s | 220 t/s |
| Qwen2.5-VL-32B | 131,000 | Dec 2024 | n/a | 8,000 | n/a | n/a |
| DeepSeek V3 0324 | 128,000 | Dec 2024 | $0.27 / $1.1 | 8,000 | 4 s | 33 t/s |
| OpenAI o1 | 200,000 | Oct 2023 | $15 / $60 | 100,000 | 30 s | 100 t/s |
| Gemini 2.0 Flash | 1,000,000 | Aug 2024 | $0.1 / $0.4 | 8,192 | 0.34 s | 257 t/s |
| Llama 3.3 70b | 128,000 | July 2024 | $0.59 / $0.7 | 32,768 | 0.52 s | 2500 t/s |
| Nova Micro | 128,000 | July 2024 | $0.04 / $0.14 | 4,096 | 0.3 s | n/a |
| Nova Lite | 300,000 | July 2024 | n/a | 4,096 | 0.4 s | n/a |
| Nova Pro | 300,000 | July 2024 | $1 / $4 | 4,096 | 0.64 s | 128 t/s |
| Claude 3.5 Haiku | 200,000 | July 2024 | $0.8 / $4 | 4,096 | 0.88 s | 66 t/s |
| Llama 3.1 405b | 128,000 | Dec 2023 | $3.5 / $3.5 | 4,096 | 0.73 s | 969 t/s |
| Llama 3.1 70b | 128,000 | Dec 2023 | n/a | 4,096 | n/a | 2100 t/s |
| Llama 3.1 8b | 128,000 | Dec 2023 | n/a | 4,096 | 0.32 s | 1800 t/s |
| Gemini 1.5 Flash | 1,000,000 | May 2024 | $0.075 / $0.3 | 4,096 | 1.06 s | 166 t/s |
| Gemini 1.5 Pro | 2,000,000 | May 2024 | n/a | 4,096 | n/a | 61 t/s |
| GPT-3.5 Turbo | 16,400 | Sept 2023 | n/a | 4,096 | n/a | n/a |
| GPT-4o mini | 128,000 | Oct 2023 | $0.15 / $0.6 | 4,096 | 0.35 s | 65 t/s |
| GPT-Turbo | 128,000 | Dec 2023 | n/a | 4,096 | n/a | n/a |
| GPT-4o | 128,000 | Oct 2023 | $2.5 / $10 | 4,096 | 0.51 s | 143 t/s |
| Claude 3 Haiku | 200,000 | Apr 2024 | n/a | 4,096 | n/a | n/a |
| Claude 3.5 Sonnet | 200,000 | Apr 2024 | $3 / $15 | 4,096 | 1.22 s | 78 t/s |
| Claude 3 Opus | 200,000 | Aug 2023 | n/a | 4,096 | n/a | n/a |
| GPT-4 | 8,192 | Dec 2023 | n/a | 4,096 | n/a | n/a |
Standard Benchmarks

Model Comparison
| Model | Context window | GRIND (%) | AIME 2024 (%) | GPQA Diamond (%) | SWE Bench (%) | MATH (%) | BFCL (%) | Aider Polyglot (%) |
|---|---|---|---|---|---|---|---|---|
| Gemini 2.5 Flash | 1,000,000 | n/a | 88 | 78.3 | n/a | n/a | n/a | 51.1 |
| OpenAI o3 | 200,000 | n/a | 91.6 | 83.3 | 69.1 | n/a | n/a | 81.3 |
| OpenAI o4-mini | 200,000 | 50 | 93.4 | 81.4 | 68.1 | n/a | n/a | 68.9 |
| Nemotron Ultra 253B | n/a | 57.1 | 80.08 | 76 | n/a | n/a | n/a | n/a |
| GPT-4.1 nano | 1,000,000 | n/a | 29.4 | 50.3 | n/a | n/a | n/a | 9.8 |
| GPT-4.1 mini | 1,000,000 | n/a | 49.6 | 65 | 23.6 | n/a | n/a | 34.7 |
| GPT-4.1 | 1,000,000 | n/a | 48.1 | 66.3 | 55 | n/a | n/a | n/a |
| Llama 4 Behemoth | n/a | n/a | n/a | 73.7 | n/a | 95 | n/a | n/a |
| Llama 4 Scout | 10,000,000 | n/a | n/a | 57.2 | n/a | n/a | n/a | n/a |
| Llama 4 Maverick | 10,000,000 | 53.6 | n/a | 69.8 | n/a | n/a | n/a | 15.6 |
| Gemma 3 27b | 128,000 | n/a | n/a | 42.4 | 10.2 | 89 | 59.11 | 4.9 |
| Grok 3 [Beta] | n/a | n/a | 93.3 | 84.6 | n/a | n/a | n/a | n/a |
| Gemini 2.5 Pro | 1,000,000 | 82.1 | 92 | 84 | 63.8 | n/a | n/a | 72.9 |
| Claude 3.7 Sonnet | 200,000 | n/a | 23.3 | 68 | 62.3 | 82.2 | 58.3 | 60.4 |
| GPT-4.5 | 128,000 | 46.4 | 36.7 | 71.4 | 38 | n/a | 69.94 | 44.9 |
| Claude 3.7 Sonnet [R] | 200,000 | 60.7 | 61.3 | 78.2 | 70.3 | 96.2 | 58.3 | 64.9 |
| DeepSeek-R1 | 128,000 | 53.6 | 79.8 | 71.5 | 49.2 | 97.3 | 57.53 | 64 |
| OpenAI o3-mini | 200,000 | 50 | 87.3 | 79.7 | 61 | 97.9 | 65.12 | 60.4 |
| OpenAI o1-mini | 128,000 | n/a | 63.6 | 60 | n/a | 90 | 52.2 | 32.9 |
| Qwen2.5-VL-32B | 131,000 | 42.9 | n/a | 46 | 18.8 | 82.2 | 62.79 | 62.84 |
| DeepSeek V3 0324 | 128,000 | n/a | 59.4 | 64.8 | 38.8 | 94 | 58.55 | 55.1 |
| OpenAI o1 | 200,000 | 57.1 | 79.2 | 75.7 | 48.9 | 96.4 | 67.87 | 61.7 |
| Gemini 2.0 Flash | 1,000,000 | 53.6 | n/a | 62.1 | 51.8 | 89.7 | 60.42 | 22.2 |
| Llama 3.3 70b | 128,000 | n/a | n/a | 50.5 | n/a | 77 | 77.3 | 51.43 |
| Nova Pro | 300,000 | n/a | n/a | 46.9 | n/a | 76.6 | 68.4 | 61.38 |
| Claude 3.5 Haiku | 200,000 | n/a | n/a | 41.6 | 40.6 | 69.4 | 54.31 | 28 |
| Llama 3.1 405b | 128,000 | n/a | 23.3 | 49 | n/a | 73.8 | 81.1 | n/a |
| GPT-4o mini | 128,000 | n/a | n/a | 40.2 | n/a | 70.2 | 64.1 | 3.6 |
| GPT-4o | 128,000 | n/a | 13.4 | 56.1 | 31 | 60.3 | 72.08 | 27.1 |
| Claude 3.5 Sonnet | 200,000 | n/a | 16 | 65 | 49 | 78 | 56.46 | 51.6 |
* This comparison view excludes other benchmarks; where a model's report does not include a score for a benchmark, the cell shows n/a.
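One crude way to summarize rows like these is to average whatever benchmarks a model reports, skipping the gaps. Note this is not how the leaderboard ranks models, and averages taken over different benchmark subsets are not directly comparable; it is only a sketch, with a few scores copied from the table:

```python
# Average the benchmarks a model actually reports, ignoring missing entries.
# Scores copied from the table above; None marks a missing (n/a) value.
ROWS = {
    "OpenAI o3":         [91.6, 83.3, 69.1, None, None, 81.3],
    "Gemini 2.5 Pro":    [92.0, 84.0, 63.8, None, None, 72.9],
    "Claude 3.5 Sonnet": [16.0, 65.0, 49.0, 78.0, 56.46, 51.6],
}

def mean_of_reported(scores):
    """Mean over the reported (non-None) scores only."""
    reported = [s for s in scores if s is not None]
    return sum(reported) / len(reported)

for model, scores in sorted(ROWS.items(), key=lambda kv: -mean_of_reported(kv[1])):
    print(f"{model}: {mean_of_reported(scores):.1f}")
```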
Context window, cost and speed comparison
| Model | Context window | Input cost (USD per 1M tokens) | Output cost (USD per 1M tokens) | Speed | Latency |
|---|---|---|---|---|---|
| Gemini 2.5 Flash | 1,000,000 | $0.15 | $0.6 | 200 t/s | 0.35 s |
| OpenAI o3 | 200,000 | $10 | $40 | 94 t/s | 28 s |
| OpenAI o4-mini | 200,000 | $1.1 | $4.4 | 135 t/s | 35.3 s |
| GPT-4.1 nano | 1,000,000 | $0.1 | $0.4 | n/a | n/a |
| GPT-4.1 mini | 1,000,000 | $0.4 | $1.6 | n/a | n/a |
| GPT-4.1 | 1,000,000 | $2 | $8 | n/a | n/a |
| Llama 4 Scout | 10,000,000 | $0.11 | $0.34 | 2600 t/s | 0.33 s |
| Llama 4 Maverick | 10,000,000 | $0.2 | $0.6 | 126 t/s | 0.45 s |
| Gemma 3 27b | 128,000 | $0.07 | $0.07 | 59 t/s | 0.72 s |
| Grok 3 [Beta] | n/a | n/a | n/a | n/a | n/a |
| Gemini 2.5 Pro | 1,000,000 | $1.25 | $10 | 191 t/s | 30 s |
| Claude 3.7 Sonnet | 200,000 | $3 | $15 | 78 t/s | 0.91 s |
| GPT-4.5 | 128,000 | $75 | $150 | 48 t/s | 1.25 s |
| Claude 3.7 Sonnet [R] | 200,000 | $3 | $15 | 78 t/s | 0.95 s |
| DeepSeek-R1 | 128,000 | $0.55 | $2.19 | 24 t/s | 4 s |
| OpenAI o3-mini | 200,000 | $1.1 | $4.4 | 214 t/s | 14 s |
| OpenAI o1-mini | 128,000 | $3 | $12 | 220 t/s | 11.43 s |
| Qwen2.5-VL-32B | 131,000 | n/a | n/a | n/a | n/a |
| DeepSeek V3 0324 | 128,000 | $0.27 | $1.1 | 33 t/s | 4 s |
| OpenAI o1 | 200,000 | $15 | $60 | 100 t/s | 30 s |
| Gemini 2.0 Flash | 1,000,000 | $0.1 | $0.4 | 257 t/s | 0.34 s |
| Llama 3.3 70b | 128,000 | $0.59 | $0.7 | 2500 t/s | 0.52 s |
| Nova Pro | 300,000 | $1 | $4 | 128 t/s | 0.64 s |
| Claude 3.5 Haiku | 200,000 | $0.8 | $4 | 66 t/s | 0.88 s |
| Llama 3.1 405b | 128,000 | $3.5 | $3.5 | 969 t/s | 0.73 s |
| GPT-4o mini | 128,000 | $0.15 | $0.6 | 65 t/s | 0.35 s |
| GPT-4o | 128,000 | $2.5 | $10 | 143 t/s | 0.51 s |
| Claude 3.5 Sonnet | 200,000 | $3 | $15 | 78 t/s | 1.22 s |
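Data like this is easy to query programmatically, for example to find the cheapest model that clears a throughput floor. A sketch over a small excerpt of the rows above (prices are USD per 1M tokens, input then output):

```python
# Small excerpt of the comparison table: (model, input $, output $, tokens/s).
MODELS = [
    ("Gemini 2.0 Flash", 0.10, 0.40, 257),
    ("Llama 3.3 70b",    0.59, 0.70, 2500),
    ("GPT-4o mini",      0.15, 0.60, 65),
    ("Claude 3.5 Haiku", 0.80, 4.00, 66),
]

def cheapest_at_least(min_tps: int):
    """Cheapest model (by input + output price) with throughput >= min_tps."""
    candidates = [m for m in MODELS if m[3] >= min_tps]
    return min(candidates, key=lambda m: m[1] + m[2], default=None)

print(cheapest_at_least(200)[0])  # Gemini 2.0 Flash
```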