updated 23 Mar 2026
LLM Leaderboard
This LLM leaderboard tracks the latest public benchmark results for state-of-the-art model versions released after April 2024. The data comes from model providers as well as independent evaluations run by Vellum or the open-source community. We feature results only from non-saturated benchmarks, excluding outdated ones (e.g. MMLU).
Top models per task
Best in Reasoning (GPQA Diamond)
| Model | Score |
|---|---|
| Claude 3 Opus | 95.4% |
| GPT 5.2 | 92.4% |
| Gemini 3 Pro | 91.9% |
| Claude Opus 4.6 | 91.3% |
| Claude Sonnet 4.6 | 89.9% |
Best in High School Math (AIME 2025)
| Model | Score |
|---|---|
| Gemini 3 Pro | 100% |
| GPT 5.2 | 100% |
| Claude Opus 4.6 | 99.8% |
| Kimi K2 Thinking | 99.1% |
| GPT oss 20b | 98.7% |
Best in Agentic Coding (SWE-Bench Verified)
| Model | Score |
|---|---|
| Claude Sonnet 4.5 | 82% |
| Claude Opus 4.5 | 80.9% |
| Claude Opus 4.6 | 80.8% |
| GPT 5.2 | 80% |
| Claude Sonnet 4.6 | 79.6% |
Best Overall (Humanity's Last Exam)
| Model | Score |
|---|---|
| Gemini 3 Pro | 45.8% |
| Kimi K2 Thinking | 44.9% |
| Claude Opus 4.6 | 40% |
| GPT-5 | 35.2% |
| Kimi K2.5 | 30.1% |
Best in Visual Reasoning (ARC-AGI 2)
| Model | Score |
|---|---|
| Claude Opus 4.6 | 68.8% |
| Claude Sonnet 4.6 | 58.3% |
| GPT 5.2 | 52.9% |
| Claude Opus 4.5 | 37.6% |
| Gemini 3 Pro | 31% |
Best in Multilingual Reasoning (MMMLU)
| Model | Score |
|---|---|
| Gemini 3 Pro | 91.8% |
| Claude Opus 4.6 | 91.1% |
| Claude Opus 4.5 | 90.8% |
| Claude Opus 4.1 | 89.5% |
| Claude Sonnet 4.6 | 89.3% |
Fastest and most affordable models
Fastest Models (Tokens/sec)
| Rank | Model | Speed |
|---|---|---|
| 1 | Llama 4 Scout | 2600 t/s |
| 2 | Llama 3.3 70b | 2500 t/s |
| 3 | Llama 3.1 70b | 2100 t/s |
| 4 | Llama 3.1 8b | 1800 t/s |
| 5 | Llama 3.1 405b | 969 t/s |
Lowest Latency (TTFT)
| Rank | Model | TTFT |
|---|---|---|
| 1 | GPT-5.3 Codex | 0.003s |
| 2 | Nova Micro | 0.3s |
| 3 | Llama 3.1 8b | 0.32s |
| 4 | Llama 4 Scout | 0.33s |
| 5 | Gemini 2.0 Flash | 0.34s |
Cheapest Models (per 1M tokens)
| Rank | Model | Input / Output cost (per 1M tokens) |
|---|---|---|
| 1 | Nova Micro | $0.04 / $0.14 |
| 2 | Gemma 3 27b | $0.07 / $0.07 |
| 3 | Gemini 1.5 Flash | $0.075 / $0.3 |
| 4 | GPT oss 20b | $0.08 / $0.35 |
| 5 | Gemini 2.0 Flash | $0.1 / $0.4 |
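Provider pricing is quoted as separate input and output rates per million tokens. As a quick sketch of how those two rates combine into a per-request cost (the token counts below are hypothetical; the rates are taken from the Nova Micro entry above):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Dollar cost of one request, with prices quoted per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Nova Micro at $0.04 input / $0.14 output per 1M tokens:
cost = request_cost(input_tokens=50_000, output_tokens=10_000,
                    input_price=0.04, output_price=0.14)
print(f"${cost:.4f}")  # $0.0034
```

The same helper works for any rate pair in the tables below, which makes it easy to compare models on a concrete workload rather than on headline prices.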
Compare models
| Metric | Claude Opus 4.6 | Claude Sonnet 4.6 |
|---|---|---|
| Context size | 200,000 | 200,000 |
| Cutoff date | May 2025 | Aug 2025 |
| I/O cost (per 1M tokens) | $5 / $25 | $3 / $15 |
| Max output | 128,000 | 64,000 |
| Latency | 1.6s | 0.73s |
| Speed | 67 t/s | 55 t/s |
Model Comparison
| Model | Context size | Cutoff date | I/O cost (per 1M tokens) | Max output | Latency | Speed |
|---|---|---|---|---|---|---|
| | 200,000 | May 2025 | $5 / $25 | 128,000 | 1.6s | 67 t/s |
| | 200,000 | Aug 2025 | $3 / $15 | 64,000 | 0.73s | 55 t/s |
| OpenAI o3-mini | 200,000 | Dec 2024 | $1.1 / $4.4 | 8,000 | 14s | 214 t/s |
| | 128,000 | Dec 2024 | $0.55 / $2.19 | 8,000 | 4s | 24 t/s |
| | 200,000 | Nov 2024 | $3 / $15 | 64,000 | 0.95s | 78 t/s |
| | 1,000,000 | Nov 2024 | $1.25 / $10 | 65,000 | 30s | 191 t/s |
| GPT-5 | 400,000 | Apr 2025 | $1.25 / $10 | 128,000 | - | - |
| Kimi K2 Thinking | 256,000 | Apr 2025 | $0.6 / $2.5 | 16,400 | 25.3s | 79 t/s |
| | 10,000,000 | Apr 2025 | $2 / $12 | 650,000 | 30.3s | 128 t/s |
| | 200,000 | Mar 2025 | $3 / $15 | 64,000 | 1.9s | - |
| | 200,000 | Mar 2025 | $15 / $75 | 32,000 | 1.95s | - |
| GPT oss 120b | 131,072 | Apr 2025 | $0.15 / $0.6 | 131,072 | 8.1s | 260 t/s |
| GPT oss 20b | 131,072 | Apr 2025 | $0.08 / $0.35 | 131,072 | 4s | 564 t/s |
| | 200,000 | Apr 2025 | $15 / $75 | 32,000 | - | - |
| GPT 5.1 | 200,000 | Apr 2025 | $1.25 / $10 | 128,000 | - | - |
| | 200,000 | Apr 2025 | $3 / $15 | 160,000 | 31s | 69 t/s |
| GPT 5.2 | 400,000 | Aug 2025 | $1.5 / $14 | 16,000 | 0.6s | 92 t/s |
| | 128,000 | Dec 2024 | $0.27 / $1.1 | 8,000 | 4s | 33 t/s |
| Qwen2.5-VL-32B | 131,000 | Dec 2024 | - | 8,000 | - | - |
| GPT-4.5 | 128,000 | Nov 2024 | $75 / $150 | 16,384 | 1.25s | 48 t/s |
| | 200,000 | Nov 2024 | $3 / $15 | 128,000 | 0.91s | 78 t/s |
| | 128,000 | Nov 2024 | $0.07 / $0.07 | 8,192 | 0.72s | 59 t/s |
| GPT-4.1 | 1,000,000 | Dec 2024 | $2 / $8 | 16,000 | - | - |
| GPT-4.1 mini | 1,000,000 | Dec 2024 | $0.4 / $1.6 | 16,000 | - | - |
| | 200,000 | Apr 2025 | $5 / $25 | 64,000 | - | - |
| OpenAI o1-mini | 128,000 | Dec 2024 | $3 / $12 | 8,000 | 11.43s | 220 t/s |
| | 10,000,000 | Nov 2024 | $0.2 / $0.6 | 8,000 | 0.45s | 126 t/s |
| | 10,000,000 | Nov 2024 | $0.11 / $0.34 | 8,000 | 0.33s | 2600 t/s |
| GPT-4.1 nano | 1,000,000 | Dec 2024 | $0.1 / $0.4 | 32,000 | - | - |
| GPT-5.3 Codex | 400,000 | Aug 2025 | $1.75 / $14 | 128,000 | 0.003s | 50 t/s |
Context window, cost and speed comparison
| Models | Context Window | Input Cost / 1M tokens | Output Cost / 1M tokens | Speed (tokens/second) | Latency |
|---|---|---|---|---|---|
| | 200,000 | $5 | $25 | 67 t/s | 1.6 seconds |
| | 200,000 | $3 | $15 | 55 t/s | 0.73 seconds |
| GPT-5.3 Codex | 400,000 | $1.75 | $14 | 50 t/s | 0.003 seconds |
| | 128,000 | $0.27 | $1.1 | 33 t/s | 4 seconds |
| Qwen2.5-VL-32B | 131,000 | n/a | n/a | n/a | n/a |
| OpenAI o1-mini | 128,000 | $3 | $12 | 220 t/s | 11.43 seconds |
| OpenAI o3-mini | 200,000 | $1.1 | $4.4 | 214 t/s | 14 seconds |
| | 128,000 | $0.55 | $2.19 | 24 t/s | 4 seconds |
| | 200,000 | $3 | $15 | 78 t/s | 0.95 seconds |
| GPT-4.5 | 128,000 | $75 | $150 | 48 t/s | 1.25 seconds |
| | 200,000 | $3 | $15 | 78 t/s | 0.91 seconds |
| | 1,000,000 | $1.25 | $10 | 191 t/s | 30 seconds |
| | 128,000 | $0.07 | $0.07 | 59 t/s | 0.72 seconds |
| | 10,000,000 | $0.2 | $0.6 | 126 t/s | 0.45 seconds |
| | 10,000,000 | $0.11 | $0.34 | 2600 t/s | 0.33 seconds |
| GPT-4.1 | 1,000,000 | $2 | $8 | n/a | n/a |
| GPT-4.1 mini | 1,000,000 | $0.4 | $1.6 | n/a | n/a |
| GPT-4.1 nano | 1,000,000 | $0.1 | $0.4 | n/a | n/a |
| | 200,000 | $3 | $15 | n/a | 1.9 seconds |
| | 200,000 | $15 | $75 | n/a | 1.95 seconds |
| GPT oss 120b | 131,072 | $0.15 | $0.6 | 260 t/s | 8.1 seconds |
| GPT oss 20b | 131,072 | $0.08 | $0.35 | 564 t/s | 4 seconds |
| | 200,000 | $15 | $75 | n/a | n/a |
| GPT-5 | 400,000 | $1.25 | $10 | n/a | n/a |
| GPT 5.1 | 200,000 | $1.25 | $10 | n/a | n/a |
| Kimi K2 Thinking | 256,000 | $0.6 | $2.5 | 79 t/s | 25.3 seconds |
| | 10,000,000 | $2 | $12 | 128 t/s | 30.3 seconds |
| | 200,000 | $3 | $15 | 69 t/s | 31 seconds |
| | 200,000 | $5 | $25 | n/a | n/a |
| GPT 5.2 | 400,000 | $1.5 | $14 | 92 t/s | 0.6 seconds |
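Latency (time to first token) and speed (throughput) combine into a rough wall-clock estimate for a full response: total time ≈ TTFT + output tokens ÷ tokens per second. A minimal sketch, using the GPT 5.2 figures from the table above and treating the published throughput as a steady-state average (real deployments vary with load):

```python
def generation_time(ttft_s: float, speed_tps: float, output_tokens: int) -> float:
    """Estimated wall-clock seconds for a streamed response:
    time-to-first-token plus tokens generated at a steady rate."""
    return ttft_s + output_tokens / speed_tps

# GPT 5.2: 0.6s TTFT, 92 t/s throughput (figures from the table above)
t = generation_time(ttft_s=0.6, speed_tps=92, output_tokens=1_000)
print(f"{t:.1f}s")  # 11.5s
```

This is why a high-throughput model can still feel slow on long outputs, while a low-TTFT model feels snappy for short replies: which term dominates depends on how many tokens you generate.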
Benchmark glossary
- GPQA Diamond
- Graduate-level science questions curated by domain experts. Tests advanced reasoning across physics, chemistry, and biology.
- AIME 2025
- Problems from the 2025 American Invitational Mathematics Examination. Measures multi-step mathematical problem solving.
- SWE-Bench Verified
- Real GitHub issues from popular Python repos that the model must resolve end-to-end. Measures agentic software engineering ability.
- Humanity's Last Exam
- A crowd-sourced exam of extremely hard questions spanning every academic discipline. Designed to be the final exam before superhuman AI.
- ARC-AGI 2
- Abstract visual puzzles requiring novel pattern recognition. Tests fluid intelligence and generalization beyond training data.
- MMMLU
- Massive Multitask Language Understanding across multiple languages. Evaluates knowledge and reasoning in non-English contexts.