updated 1 Jul 2026
LLM Leaderboard
This LLM leaderboard displays the latest public benchmark performance for SOTA model versions released after April 2024. The data comes from model providers as well as independently run evaluations by Vellum or the open-source community. We feature results from non-saturated benchmarks, excluding outdated benchmarks (e.g. MMLU).
Best Overall (Humanity's Last Exam)
| Model | Score |
|---|---|
| Claude Mythos 5 | 64.5% |
| Claude Opus 4.8 | 57.9% |
| Claude Sonnet 5 | 57.4% |
| GLM 5.2 | 54.7% |
| Kimi K2.6 | 54% |
| DeepSeek V4 Flash | 51.6% |
| DeepSeek V4 Pro | 48.2% |
| Gemini 3 Pro | 45.8% |
| Kimi K2 Thinking | 44.9% |
| Gemini 3.1 Pro | 44.4% |
| GPT-5.5 Pro | 43.1% |
| GPT-5.5 | 41.4% |
| Gemini 3.5 Flash | 40.2% |
| Claude Opus 4.6 | 40% |
| GPT-5 | 35.2% |
| Kimi K2.5 | 30.1% |
| Grok 4 | 25.4% |
| Gemini 2.5 Pro | 21.6% |
| OpenAI o3 | 20.3% |
| Claude Sonnet 4.6 | 19.1% |
Top models per task
Best in Reasoning (GPQA Diamond)
| Model | Score |
|---|---|
| Claude Sonnet 5 | 96.2% |
| Claude 3 Opus | 95.4% |
| Gemini 3.1 Pro | 94.3% |
| Claude Opus 4.7 | 94.2% |
| Claude Fable 5 | 94.1% |
Best in Agentic Coding (SWE-Bench Verified)
| Model | Score |
|---|---|
| Claude Mythos 5 | 95.5% |
| Claude Fable 5 | 95% |
| Claude Opus 4.8 | 88.6% |
| Claude Opus 4.7 | 87.6% |
| Claude Sonnet 5 | 85.2% |
Best for Work Automations (AutoBench)
| Model | Score |
|---|---|
| Claude Fable 5 | 17.4% |
| Claude Opus 4.8 | 15.5% |
| Claude Sonnet 5 | 13.5% |
| GPT-5.5 | 12.9% |
| Claude Sonnet 4.6 | 5.3% |
Best in Computer Use (OSWorld)
| Model | Score |
|---|---|
| Claude Fable 5 | 85% |
| Claude Opus 4.8 | 83.4% |
| Claude Sonnet 5 | 81.2% |
| GPT-5.5 | 78.7% |
| Claude Sonnet 4.6 | 78.5% |
Best in Browsing (BrowseComp)
| Model | Score |
|---|---|
| Claude Fable 5 | 88% |
| DeepSeek V4 Flash | 85.9% |
| Gemini 3.1 Pro | 85.9% |
| Claude Sonnet 5 | 84.7% |
| GPT-5.5 | 84.4% |
Best in Terminal Use (Terminal-Bench 2.1)
| Model | Score |
|---|---|
| Claude Mythos 5 | 88% |
| Claude Fable 5 | 84.3% |
| GPT-5.5 | 82.7% |
| GLM 5.2 | 81% |
| Claude Sonnet 5 | 80.4% |
Fastest and most affordable models
Fastest Models (Tokens/sec)
| Rank | Model | Speed |
|---|---|---|
| 1 | Llama 4 Scout | 2600 t/s |
| 2 | Llama 3.1 405b | 969 t/s |
| 3 | GLM 5.2 | 347 t/s |
| 4 | Kimi K2.6 | 342.6 t/s |
| 5 | Kimi K2.5 | 337.7 t/s |
Lowest Latency (TTFT)
| Rank | Model | Latency |
|---|---|---|
| 1 | GPT-5.3 Codex | 0.003s |
| 2 | Nova Micro | 0.3s |
| 3 | Llama 4 Scout | 0.33s |
| 4 | Gemini 2.0 Flash | 0.34s |
| 5 | GPT-4o mini | 0.35s |
Cheapest Models (per 1M tokens)
| Rank | Model | I/O cost |
|---|---|---|
| 1 | Nova Micro | $0.04 / $0.14 |
| 2 | Gemini 1.5 Flash | $0.075 / $0.3 |
| 3 | Gemini 2.0 Flash | $0.1 / $0.4 |
| 4 | GPT-4.1 nano | $0.1 / $0.4 |
| 5 | Llama 4 Scout | $0.11 / $0.34 |
Compare models
Side-by-side comparison of the latest models released in the last 9 months.
| Context size | 1,000,000 | 1,000,000 |
| Cutoff date | Jan 2026 | Jan 2026 |
| I/O cost | $10 / $50 | $5 / $25 |
| Max output | 128,000 | 128,000 |
| Latency | - | 32.1s |
| Speed | - | 64.8 t/s |
Model Comparison
| Model | Context size | Cutoff date | I/O cost | Max output | Latency | Speed |
|---|---|---|---|---|---|---|
| | 1,000,000 | Jan 2026 | $10 / $50 | 128,000 | - | - |
| | 1,000,000 | Jan 2026 | $5 / $25 | 128,000 | 32.1s | 64.8 t/s |
| | 1,000,000 | Jan 2026 | $3 / $15 | 128,000 | 20.69s | 56.3 t/s |
| GLM 5.2 | 1,000,000 | Mar 2026 | $0.95 / $3 | 128,000 | 1.14s | 347 t/s |
| Kimi K2.6 | 256,000 | - | $0.95 / $4 | - | 0.68s | 342.6 t/s |
| | 1,000,000 | Jan 2026 | $0.14 / $0.28 | 384,000 | 1.42s | 107.9 t/s |
| | 1,000,000 | Jan 2026 | $0.435 / $0.87 | 384,000 | 1.2s | 174.9 t/s |
| | 1,000,000 | Jan 2026 | $2 / $12 | 65,536 | 20.34s | 136.2 t/s |
| GPT-5.5 Pro | 1,000,000 | Apr 2026 | $30 / $180 | 128,000 | - | - |
| GPT-5.5 | 1,000,000 | Apr 2026 | $5 / $30 | 128,000 | 76.69s | 79 t/s |
| | 1,000,000 | Jan 2026 | $1.5 / $9 | 65,536 | 23.16s | 175.4 t/s |
| | 200,000 | May 2025 | $5 / $25 | 128,000 | 1.6s | 67 t/s |
| | 1,000,000 | Apr 2026 | $5 / $25 | 128,000 | 17.11s | 50.8 t/s |
| | 1,000,000 | Jan 2026 | $10 / $50 | 128,000 | - | - |
| MiniMax M3 | 1,048,576 | Mar 2026 | $0.6 / $2.4 | 512,000 | 0.85s | 98.6 t/s |
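One rough way to read the Latency and Speed columns together is to estimate how long a full response takes: time to first token plus output tokens divided by throughput. The sketch below assumes that Latency means time to first token and that throughput stays constant during generation; it ignores network overhead, queueing, and any hidden reasoning tokens. The function name `estimate_response_seconds` is hypothetical, and the figures are the GLM 5.2 and GPT-5.5 rows from the table above.

```python
def estimate_response_seconds(latency_s: float, speed_tps: float, output_tokens: int) -> float:
    """Approximate wall-clock time for one response.

    Assumes latency_s is time to first token and speed_tps is a constant
    generation rate in tokens per second (both are simplifications).
    """
    if speed_tps <= 0:
        raise ValueError("speed_tps must be positive")
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    return latency_s + output_tokens / speed_tps


# Latency and speed values copied from the comparison table.
glm_seconds = estimate_response_seconds(latency_s=1.14, speed_tps=347, output_tokens=1000)
gpt_seconds = estimate_response_seconds(latency_s=76.69, speed_tps=79, output_tokens=1000)

print(f"GLM 5.2, 1000 output tokens: ~{glm_seconds:.1f}s")
print(f"GPT-5.5, 1000 output tokens: ~{gpt_seconds:.1f}s")
```

For short outputs the latency term dominates, so a model with high throughput but long time to first token can still feel slower in interactive use.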
Context window, cost and speed comparison
| Models | Context Window | Input Cost / 1M tokens | Output Cost / 1M tokens | Speed (tokens/second) | Latency |
|---|---|---|---|---|---|
| | 1,000,000 | $10 | $50 | n/a | n/a |
| | 1,000,000 | $5 | $25 | 64.8 t/s | 32.1 seconds |
| | 1,000,000 | $3 | $15 | 56.3 t/s | 20.69 seconds |
| GLM 5.2 | 1,000,000 | $0.95 | $3 | 347 t/s | 1.14 seconds |
| Kimi K2.6 | 256,000 | $0.95 | $4 | 342.6 t/s | 0.68 seconds |
| | 1,000,000 | $0.14 | $0.28 | 107.9 t/s | 1.42 seconds |
| | 1,000,000 | $0.435 | $0.87 | 174.9 t/s | 1.2 seconds |
| | 1,000,000 | $2 | $12 | 136.2 t/s | 20.34 seconds |
| GPT-5.5 Pro | 1,000,000 | $30 | $180 | n/a | n/a |
| GPT-5.5 | 1,000,000 | $5 | $30 | 79 t/s | 76.69 seconds |
| | 1,000,000 | $1.5 | $9 | 175.4 t/s | 23.16 seconds |
| | 200,000 | $5 | $25 | 67 t/s | 1.6 seconds |
| | 1,000,000 | $5 | $25 | 50.8 t/s | 17.11 seconds |
| | 1,000,000 | $10 | $50 | n/a | n/a |
| MiniMax M3 | 1,048,576 | $0.6 | $2.4 | 98.6 t/s | 0.85 seconds |
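Because prices are quoted per 1M tokens, the cost of a single request is easier to see with a small calculation. The sketch below is illustrative only: the `Pricing` class, the `PRICES` dictionary, and `request_cost_usd` are hypothetical names, and the prices are copied from the named rows of the table above. It assumes simple linear billing with no caching discounts, batch pricing, or long-context surcharges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens


# Prices taken from the comparison table (named rows only).
PRICES = {
    "GLM 5.2": Pricing(0.95, 3.0),
    "Kimi K2.6": Pricing(0.95, 4.0),
    "GPT-5.5": Pricing(5.0, 30.0),
    "MiniMax M3": Pricing(0.6, 2.4),
}


def request_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request under plain per-token billing (an assumption)."""
    price = PRICES[model]
    total = (input_tokens * price.input_per_million
             + output_tokens * price.output_per_million)
    return total / 1_000_000


# Example: a request with 10,000 input tokens and 2,000 output tokens.
for name in PRICES:
    print(f"{name}: ${request_cost_usd(name, 10_000, 2_000):.4f}")
```

Output tokens cost several times more than input tokens for every model listed, so output-heavy workloads shift the comparison more than the input price alone suggests.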
Benchmark glossary
- Humanity's Last Exam: A crowd-sourced exam of extremely hard questions spanning every academic discipline. Designed to be the final exam before superhuman AI.
- GPQA Diamond: Graduate-level science questions curated by domain experts. Tests advanced reasoning across physics, chemistry, and biology.
- SWE-Bench Verified: Real GitHub issues from popular Python repos that the model must resolve end-to-end. Measures agentic software engineering ability.
- AutoBench: Automation benchmark evaluating a model's ability to complete real-world work automation tasks using tools and multi-step workflows.
- OSWorld-Verified: Real-world computer use tasks requiring GUI interaction in desktop environments. Measures end-to-end task completion on a real OS.
- BrowseComp: Agentic web search benchmark testing a model's ability to browse and extract information from the web to answer complex questions.
- Terminal-Bench 2.1: Terminal and tool use benchmark evaluating a model's ability to execute multi-step tasks in a terminal environment.