last update: 18 December 2024

LLM Leaderboard

Top Models per Task
- Best in Multitask Reasoning (MMLU): Top 5 MMLU models
- Best in Coding (HumanEval): Top 5 HumanEval models
- Best in Math (MATH)

Fastest and Most Affordable Models
- Fastest models
- Lowest latency (TTFT)
- Cheapest models
Compare Models
| Model | Context size | Cutoff date | Input / Output cost (per 1M tokens) | Max output tokens | Latency (TTFT) | Throughput |
|---|---|---|---|---|---|---|
| OpenAI o1 | 200,000 | Oct 2023 | $15 ($7.50 cached) / $60 | 100,000 | n/a | n/a |
| Gemini 2.0 Flash | 1,000,000 | Dec 2024 | n/a | 8,192 | 0.48s | 169 t/s |
| Llama 3.3 70b | 128,000 | July 2024 | $0.59 / $0.79 | 32,768 | n/a | 250 t/s (Groq) |
| AWS Nova Micro | 128,000 | July 2024 | $0.035 / $0.14 | 4,096 | 0.3s | n/a |
| AWS Nova Lite | 300,000 | July 2024 | $0.06 / $0.24 | 4,096 | 0.4s | 150 t/s |
| AWS Nova Pro | 300,000 | July 2024 | $0.80 / $3.20 | 4,096 | 0.4s | 97 t/s |
| Claude 3.5 Haiku | 200,000 | July 2024 | $0.80 / $4.00 | 4,096 | 0.77s | 60 t/s |
| Llama 3.1 405b | 128,000 | Dec 2023 | $5.00 / $10.00 | 4,096 | 0.59s | 163 t/s (SambaNova) |
| Llama 3.1 70b | 128,000 | Dec 2023 | $0.60 / n/a | 4,096 | 0.38s | 2,100 t/s (Cerebras) |
| Llama 3.1 8b | 128,000 | Dec 2023 | $0.05 / $0.08 | 4,096 | 0.32s | ~1,800 t/s (Cerebras) |
| Gemini 1.5 Flash | 1,000,000 | May 2024 | $0.075 / $0.30 | 4,096 | 1.06s | 166 t/s |
| Gemini 1.5 Pro | 2,000,000 | May 2024 | $3.50 / $10.50 | 4,096 | 1.12s | 61 t/s |
| GPT-3.5 Turbo | 16,400 | Sept 2023 | $0.50 / $1.50 | 4,096 | 0.37s | 84 t/s |
| GPT-4o mini | 128,000 | Oct 2023 | $0.15 / $0.60 | 4,096 | 0.56s | 97 t/s |
| GPT-4 Turbo | 128,000 | Dec 2023 | $10.00 / $30.00 | 4,096 | 0.6s | 28 t/s |
| GPT-4o | 128,000 | Oct 2023 | $2.50 / $10.00 | 4,096 | 0.48s | 79 t/s |
| Claude 3 Haiku | 200,000 | Apr 2024 | $0.25 / $1.25 | 4,096 | 0.55s | 133 t/s |
| Claude 3.5 Sonnet | 200,000 | Apr 2024 | $3.00 / $15.00 | 4,096 | 1.22s | 78 t/s |
| Claude 3 Opus | 200,000 | Aug 2023 | $15.00 / $75.00 | 4,096 | 1.99s | 25 t/s |
| GPT-4 | 8,192 | Dec 2023 | $30.00 / $60.00 | 4,096 | 0.59s | 125 t/s |
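The latency (TTFT) and throughput columns can be combined into a rough end-to-end estimate: total response time ≈ time to first token + output tokens ÷ throughput. A minimal Python sketch, using figures from the table above (the 500-token response length is an arbitrary example, and real-world numbers will vary with load and prompt size):

```python
def response_time(ttft_s: float, throughput_tps: float, output_tokens: int) -> float:
    """Estimate seconds to stream a full response:
    time to first token, plus decoding time at the quoted throughput."""
    return ttft_s + output_tokens / throughput_tps

# Example: GPT-4o (0.48s TTFT, 79 t/s) generating a 500-token answer.
print(round(response_time(0.48, 79, 500), 2))  # → 6.81 seconds
```

This is why a high-throughput model can still feel slow for short answers if its TTFT is large, and vice versa.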
Standard Benchmarks

Model Comparison

AI Benchmark Table
Benchmarks covered: Average, MMLU (general knowledge), GPQA (reasoning), HumanEval (coding), MATH, BFCL (tool use), MGSM (multilingual)
* For some models we don't show an average value because of missing data. Every model on this list is its latest version.

Cost and Context Window Comparison

Comparison of context window and cost per 1M tokens.
| Models | Context Window | Input Cost / 1M tokens | Output Cost / 1M tokens |
|---|---|---|---|
| GPT-4 | 8,000 | $30.00 | $60.00 |
| GPT-4-32k | 32,000 | $60.00 | $120.00 |
| GPT-4 Turbo | 128,000 | $10.00 | $30.00 |
| GPT-3.5 Turbo | 16,000 | $0.50 | $1.50 |
| GPT-3.5 Turbo Instruct | 4,000 | $1.50 | $2.00 |
| Gemini Pro | 32,000 | $0.125 | $0.375 |
| Gemini 1.5 Pro | 128,000 | $7.00 | $21.00 |
| Mistral Small | 16,000 | $2.00 | $6.00 |
| Mistral Medium | 32,000 | $2.70 | $8.10 |
| Mistral Large | 32,000 | $8.00 | $24.00 |
| Claude 3 Opus | 200,000 | $15.00 | $75.00 |
| Claude 3 Sonnet | 200,000 | $3.00 | $15.00 |
| Claude 3 Haiku | 200,000 | $0.25 | $1.25 |
| GPT-4o | 128,000 | $5.00 | $15.00 |
| Gemini 1.5 Flash | 1,000,000 | $0.35 | $0.70 |
| Claude 3.5 Sonnet | 200,000 | $3.00 | $15.00 |
| GPT-4o mini | 128,000 | $0.15 | $0.60 |
| Claude 3.5 Haiku | 200,000 | $0.80 | $4.00 |
| AWS Nova Pro | 300,000 | $0.80 | $3.20 |
| AWS Nova Lite | 300,000 | $0.06 | $0.24 |
| AWS Nova Micro | 128,000 | $0.035 | $0.14 |
| Gemini 2.0 Flash (Exp) | 1,000,000 | – | – |
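Because both prices are quoted per 1M tokens, the cost of a single request is simply input tokens ÷ 1M × input price + output tokens ÷ 1M × output price. A quick Python sketch (the token counts here are illustrative, not measured):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in dollars of one request, given per-1M-token prices."""
    return (input_tokens / 1e6 * input_price_per_m
            + output_tokens / 1e6 * output_price_per_m)

# Example: Claude 3.5 Sonnet ($3 input / $15 output) on a
# 2,000-token prompt with an 800-token reply.
print(f"${request_cost(2000, 800, 3.00, 15.00):.4f}")  # → $0.0180
```

Output tokens usually dominate the bill at these price ratios, so capping response length is often the easiest cost lever.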

HumanEval: Coding Leaderboard

Comparison of pre-trained proprietary and open-source models for code generation.
| Model | HumanEval (0-shot) |
|---|---|
| Llama 3.3 70b | 80.5% |
| AWS Nova Micro | 81.1% |
| AWS Nova Lite | 85.4% |
| AWS Nova Pro | 89.0% |
| Claude 3.5 Haiku | 88.1% |
| Claude 3.5 Sonnet | 93.7% |
| Claude 3 Opus | 84.9% |
| GPT-4 | 86.6% |
| GPT-4o | 90.2% |
| GPT-4 Turbo | 87.1% |
| GPT-4o mini | 87.2% |
| GPT-3.5 Turbo | 68.0% |
| Gemini 1.5 Pro | 71.9% |
| Gemini 1.5 Flash | 71.5% |
| Llama 3.1 8b | 72.6% |
| Llama 3.1 70b | 80.5% |
| Llama 3.1 405b | 89.0% |

Sources

This leaderboard compares the capabilities, price, and context window of leading commercial and open-source LLMs, based on the benchmark data reported in each model's technical report. Updated December 2024.

Technical reports

Pricing info