Largest context windows: Gemini 1.5 Pro (2M), Gemini 1.5 Flash (1M), Claude 3 models (200K), GPT-4o, GPT-4o mini, and GPT-4 Turbo (128K)
Lowest input cost per 1M tokens: Gemini Pro ($0.125), GPT-4o mini ($0.15), Claude 3 Haiku ($0.25)
MMLU (5-shot)

| Model | Score |
|---|---|
| GPT-4 | 86.40% |
| GPT-4o | 88.70% |
| GPT-4T 2024-04-09 | 86.5% |
| Gemini Ultra | 83.70% |
| Claude 3 Opus | 86.80% |
| Claude 3.5 Sonnet | 88.70% |
MATH

| Model | Score |
|---|---|
| GPT-4o mini | 70.20% |
| Gemini Flash | 67.70% |
| Claude 3 Opus | 60.10% |
| GPT-4T 2024-04-09 | 72.2% |
| GPT-4o | 76.60% |
| Claude 3.5 Sonnet | 71.10% |
HumanEval (0-shot)

| Model | Score |
|---|---|
| GPT-4T 2024-04-09 | 87.6% |
| Claude 3 Opus | 84.90% |
| GPT-4o mini | 87.20% |
| GPT-4o | 90.20% |
| Claude 3 Haiku | 75.90% |
| Claude 3.5 Sonnet | 92.00% |
Top Models per Task
[Charts: Top 5 MMLU models (multitask reasoning), Top 5 HumanEval models (coding), top models on MATH]
Fastest and Most Affordable Models
[Charts: fastest models, lowest latency (TTFT), cheapest models (price comparison)]
Compare Models
| Model | Context size | Cutoff date | Input / Output cost (per 1M tokens) | Max output tokens | Latency (TTFT) | Throughput |
|---|---|---|---|---|---|---|
| Llama 3.1 405B | 128,000 | Dec 2023 | $2.70 / $2.70 | 4,096 | 0.59s | 27 t/s |
| Llama 3.1 70B | 128,000 | Dec 2023 | $0.60 / - | 4,096 | 0.38s | 2,100 t/s (Cerebras) |
| Llama 3.1 8B | 128,000 | Dec 2023 | $0.05 / $0.08 | 4,096 | 0.32s | ~1,800 t/s (Cerebras) |
| Gemini 1.5 Flash | 1,000,000 | May 2024 | $0.075 / $0.30 | 4,096 | 1.06s | 166 t/s |
| Gemini 1.5 Pro | 2,000,000 | May 2024 | $3.50 / $10.50 | 4,096 | 1.12s | 61 t/s |
| GPT-3.5 Turbo | 16,400 | Sept 2023 | $0.50 / $1.50 | 4,096 | 0.37s | 84 t/s |
| GPT-4o mini | 128,000 | Oct 2023 | $0.15 / $0.60 | 4,096 | 0.56s | 97 t/s |
| GPT-4 Turbo | 128,000 | Dec 2023 | $10.00 / $30.00 | 4,096 | 0.6s | 28 t/s |
| GPT-4o | 128,000 | Oct 2023 | $5.00 / $15.00 | 4,096 | 0.48s | 79 t/s |
| Claude 3 Haiku | 200,000 | Apr 2024 | $0.25 / $1.25 | 4,096 | 0.55s | 133 t/s |
| Claude 3.5 Sonnet | 200,000 | Apr 2024 | $3.00 / $15.00 | 4,096 | 1.22s | 78 t/s |
| Claude 3 Opus | 200,000 | Aug 2023 | $15.00 / $75.00 | 4,096 | 1.99s | 25 t/s |
| GPT-4 | 8,192 | Dec 2023 | $30.00 / $60.00 | 4,096 | 0.59s | 125 t/s |
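For latency-sensitive use cases, a rough end-to-end response time can be read off the last two columns: time to first token plus output length divided by throughput. Below is a minimal sketch of that back-of-the-envelope estimate; the TTFT and throughput figures come from the table above, while the helper function and the 500-token example are illustrative assumptions.

```python
# Back-of-the-envelope estimate (assumption: generation proceeds at the steady
# throughput listed above once the first token arrives):
#   total time ~= TTFT + output_tokens / throughput
SPECS = {  # model: (TTFT in seconds, throughput in tokens/second), from the table above
    "GPT-4o": (0.48, 79),
    "Claude 3.5 Sonnet": (1.22, 78),
    "Llama 3.1 70B": (0.38, 2100),  # Cerebras-hosted figure
}

def estimated_response_time(model: str, output_tokens: int) -> float:
    ttft, throughput = SPECS[model]
    return ttft + output_tokens / throughput

for model in SPECS:
    t = estimated_response_time(model, output_tokens=500)
    print(f"{model}: ~{t:.2f}s for a 500-token response")
# GPT-4o: ~6.81s, Claude 3.5 Sonnet: ~7.63s, Llama 3.1 70B: ~0.62s
```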
Standard Benchmarks
Model Comparison
| Model | Average | MMLU (multi-choice Qs) | HellaSwag (reasoning) | HumanEval (Python coding) | BBHard (future capabilities) | GSM-8K (grade-school math) | MATH (math problems) |
|---|---|---|---|---|---|---|---|
| GPT-4 | 79.45% | 86.40% | 95.30% | 67% | 83.10% | 92% | 52.90% |
| GPT-4o | - | 88.7% | - | 90.2% | - | - | 76.60% |
| GPT-4o mini | - | 82% | - | 87.00% | - | - | 70.20% |
| GPT-3.5 | 65.46% | 70% | 85.50% | 48.10% | 66.60% | 57.10% | 34.1% |
| Gemini Ultra | 79.52% | 83.70% | 87.80% | 74.40% | 83.60% | 94.40% | 53.20% |
| Gemini 1.5 Pro | 80.08% | 81.90% | 92.50% | 71.90% | 84% | 91.70% | 58.50% |
| Mixtral 8x7B | 59.79% | 70.60% | 84.40% | 40.20% | 60.76% | 74.40% | 28.40% |
| Llama 3 Instruct - 70B | 79.23% | 82% | 87% | 81.7% | 81.3% | 93% | 50.4% |
| Llama 3 Instruct - 8B | - | 68.40% | - | 62% | 61% | 79.60% | 30% |
| Grok 1.5 | - | 73.00% | - | 63% | - | 62.90% | 23.90% |
| Mistral Large | - | 81.2% | 89.2% | 45.1% | - | 81% | 45% |
| Claude 3 Opus | 84.83% | 86.80% | 95.40% | 84.90% | 86.80% | 95.00% | 60.10% |
| Claude 3 Haiku | 73.08% | 75.20% | 85.90% | 75.90% | 73.70% | 88.90% | 38.90% |
| Gemini 1.5 Flash | - | 78.90% | - | - | 89.20% | - | 67.70% |
| GPT-4T 2024-04-09 | - | 86.5% | - | - | 87.60% | - | 72.2% |
| Claude 3.5 Sonnet | 88.38% | 88.70% | 89.00% | 92.00% | 93.10% | 96.40% | 71.10% |
| OpenAI o1 | - | 92.30% | - | 92.40% | - | - | 94.80% |
| OpenAI o1-mini | - | 85.20% | - | 92.40% | - | - | 90.00% |
* An average value is not shown for models with missing benchmark data.
* This comparison focuses solely on MMLU, HellaSwag, HumanEval, BBHard, GSM-8K, and MATH; other benchmarks are excluded because data for them is absent from some model reports.
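The Average column corresponds to the mean of the six benchmark scores, which is why it is omitted whenever one of them is missing. Below is a minimal sketch of that calculation, using Claude 3.5 Sonnet's row from the table above; the helper function itself is illustrative.

```python
# Average = mean of the six benchmark scores; None (shown as "-") when any is missing.
BENCHMARKS = ["MMLU", "HellaSwag", "HumanEval", "BBHard", "GSM-8K", "MATH"]

def average_score(scores):
    values = [scores.get(b) for b in BENCHMARKS]
    if any(v is None for v in values):
        return None  # missing data, so no average is shown
    return round(sum(values) / len(values), 2)

# Claude 3.5 Sonnet's scores from the table above.
claude_35_sonnet = {
    "MMLU": 88.70, "HellaSwag": 89.00, "HumanEval": 92.00,
    "BBHard": 93.10, "GSM-8K": 96.40, "MATH": 71.10,
}
print(average_score(claude_35_sonnet))  # 88.38, matching the Average column
```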
Cost and Context Window Comparison
Comparison of context window and cost per 1M tokens.
| Model | Context Window | Input Cost / 1M tokens | Output Cost / 1M tokens |
|---|---|---|---|
| GPT-4 | 8,000 | $30.00 | $60.00 |
| GPT-4-32k | 32,000 | $60.00 | $120.00 |
| GPT-4 Turbo | 128,000 | $10.00 | $30.00 |
| GPT-3.5 Turbo | 16,000 | $0.50 | $1.50 |
| GPT-3.5 Turbo Instruct | 4,000 | $1.50 | $2.00 |
| Gemini Pro | 32,000 | $0.125 | $0.375 |
| Gemini 1.5 Pro | 128,000 | $7.00 | $21.00 |
| Mistral Small | 16,000 | $2.00 | $6.00 |
| Mistral Medium | 32,000 | $2.70 | $8.10 |
| Mistral Large | 32,000 | $8.00 | $24.00 |
| Claude 3 Opus | 200,000 | $15.00 | $75.00 |
| Claude 3 Sonnet | 200,000 | $3.00 | $15.00 |
| Claude 3 Haiku | 200,000 | $0.25 | $1.25 |
| GPT-4o | 128,000 | $5.00 | $15.00 |
| Gemini 1.5 Flash | 1,000,000 | $0.35 | $0.70 |
| Nemotron | 4,000 | - | - |
| Llama 3 Models | 8,000 | - | - |
| Claude 3.5 Sonnet | 200,000 | $3.00 | $15.00 |
| GPT-4o mini | 128,000 | $0.15 | $0.60 |
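These per-1M-token prices translate directly into per-request cost: multiply the input and output token counts by the matching rate and divide by one million. Below is a minimal sketch of that arithmetic; the prices are taken from the table above, while the function and the 10,000-token example are illustrative assumptions.

```python
# Cost of a request = (input_tokens * input_price + output_tokens * output_price) / 1M
PRICES = {  # model: (input $ / 1M tokens, output $ / 1M tokens), from the table above
    "GPT-4o": (5.00, 15.00),
    "GPT-4o mini": (0.15, 0.60),
    "Claude 3.5 Sonnet": (3.00, 15.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Example: a 10,000-token prompt with a 1,000-token completion.
print(f"${request_cost('GPT-4o mini', 10_000, 1_000):.4f}")  # $0.0021
print(f"${request_cost('GPT-4o', 10_000, 1_000):.4f}")       # $0.0650
```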
HumanEval: Coding Leaderboard
Comparison of pre-trained proprietary and open-source models for code generation.
| Model | HumanEval (0-shot) |
|---|---|
| GPT-4 | 67% |
| GPT-4 Turbo | 87.1% |
| GPT-4o | 90.2% |
| GPT-3.5 | 48.10% |
| Gemini Pro | 67.70% |
| Gemini Ultra | 74.40% |
| Gemini 1.5 Pro | 71.90% |
| Mixtral 8x7B | 40.20% |
| Mistral Large | 45.1% |
| Claude 3 Opus | 84.90% |
| Claude 3 Haiku | 75.90% |
| Claude 3 Sonnet | 73.00% |
| Llama 3 70B | 75.90% |
Sources
This leaderboard shows a comparison of capabilities, price and context window for leading commercial and open-source LLMs, based on the benchmark data provided in the models' technical reports. Updated March 2024.