updated 24 Feb 2026

LLM Leaderboard

This LLM leaderboard tracks the latest public benchmark performance of state-of-the-art model versions released after April 2024. The data comes from model providers as well as independent evaluations run by Vellum or the open-source community. We feature results only from non-saturated benchmarks, excluding outdated ones such as MMLU. If you want to use these models in your agents, try Vellum.

Top models per task

Best in Reasoning (GPQA Diamond)

1. Claude 3 Opus: 95.4%
2. GPT 5.2: 92.4%
3. Gemini 3 Pro: 91.9%
4. Claude Opus 4.6: 91.3%
5. Claude Sonnet 4.6: 89.9%

Best in High School Math (AIME 2025)

1. Gemini 3 Pro: 100%
2. GPT 5.2: 100%
3. Claude Opus 4.6: 99.8%
4. Kimi K2 Thinking: 99.1%
5. GPT oss 20b: 98.7%

Best in Agentic Coding (SWE Bench)

1. Claude Sonnet 4.5: 82%
2. Claude Opus 4.5: 80.9%
3. Claude Opus 4.6: 80.8%
4. GPT 5.2: 80%
5. Claude Sonnet 4.6: 79.6%

Best Overall (Humanity's Last Exam)

1. Gemini 3 Pro: 45.8%
2. Kimi K2 Thinking: 44.9%
3. Claude Opus 4.6: 40%
4. GPT-5: 35.2%
5. Kimi K2.5: 30.1%

Best in Visual Reasoning (ARC-AGI 2)

1. Claude Opus 4.6: 68.8%
2. Claude Sonnet 4.6: 58.3%
3. GPT 5.2: 52.9%
4. Claude Opus 4.5: 37.6%
5. Gemini 3 Pro: 31%

Best in Multilingual Reasoning (MMMLU)

1. Gemini 3 Pro: 91.8%
2. Claude Opus 4.6: 91.1%
3. Claude Opus 4.5: 90.8%
4. Claude Opus 4.1: 89.5%
5. Claude Sonnet 4.6: 89.3%

Fastest and most affordable models

Fastest Models (Tokens/sec)

1. Llama 4 Scout: 2600 t/s
2. Llama 3.3 70b: 2500 t/s
3. Llama 3.1 70b: 2100 t/s
4. Llama 3.1 8b: 1800 t/s
5. Llama 3.1 405b: 969 t/s

Lowest Latency (TTFT)

1. GPT-5.3 Codex: 0.003s
2. Nova Micro: 0.3s
3. Llama 3.1 8b: 0.32s
4. Llama 4 Scout: 0.33s
5. Gemini 2.0 Flash: 0.34s
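Throughput and TTFT combine into end-to-end response time: roughly the time to first token plus output tokens divided by tokens per second. A minimal sketch of that arithmetic, using Llama 4 Scout's figures from the lists above (illustrative only, since real-world numbers vary by provider and load):

```python
def response_time(ttft_s: float, throughput_tps: float, output_tokens: int) -> float:
    """Estimated end-to-end response time: time to first token plus generation time."""
    return ttft_s + output_tokens / throughput_tps

# Llama 4 Scout: 0.33s TTFT, 2600 t/s throughput (from the rankings above)
print(round(response_time(0.33, 2600, 1000), 2))  # 0.71 seconds for a 1,000-token reply
```

At these speeds TTFT, not generation, dominates short replies; for slower models the ratio flips.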

Cheapest Models (input / output cost per 1M tokens)

1. Nova Micro: $0.04 / $0.14
2. Gemma 3 27b: $0.07 / $0.07
3. Gemini 1.5 Flash: $0.075 / $0.3
4. GPT oss 20b: $0.08 / $0.35
5. Gemini 2.0 Flash: $0.1 / $0.4
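The prices above read as input / output per million tokens, so a request's cost is input tokens times the input price plus output tokens times the output price, each divided by one million. A quick sketch with Nova Micro's listed prices (token counts are made up for illustration):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_price_per_m: float, out_price_per_m: float) -> float:
    """Dollar cost of a single request given per-1M-token prices."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000

# Nova Micro at $0.04 input / $0.14 output per 1M tokens (from the list above)
print(f"${request_cost(10_000, 2_000, 0.04, 0.14):.5f}")  # $0.00068
```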

Compare models

Claude Opus 4.6 vs Claude Sonnet 4.6

Spec | Claude Opus 4.6 | Claude Sonnet 4.6
Context size | 200,000 | 200,000
Cutoff date | May 2025 | Aug 2025
I/O cost | $5 / $25 | $3 / $15
Max output | 128,000 | 64,000
Latency | 1.6s | 0.73s
Speed | 67 t/s | 55 t/s

Benchmark | Claude Opus 4.6 | Claude Sonnet 4.6
GPQA Diamond | 91.3 | 89.9
BFCL | - | -
MATH 500 | 97.6 | 97.8
AIME 2025 | 99.8 | 83
SWE Bench | 80.8 | 79.6
LiveCodeBench | 76 | 72.4

Model Comparison

Model | Context size | Cutoff date | I/O cost | Max output | Latency | Speed
Claude Opus 4.6 | 200,000 | May 2025 | $5 / $25 | 128,000 | 1.6s | 67 t/s
Claude Sonnet 4.6 | 200,000 | Aug 2025 | $3 / $15 | 64,000 | 0.73s | 55 t/s
OpenAI o3-mini | 200,000 | Dec 2024 | $1.1 / $4.4 | 8,000 | 14s | 214 t/s
DeepSeek-R1 | 128,000 | Dec 2024 | $0.55 / $2.19 | 8,000 | 4s | 24 t/s
Claude 3.7 Sonnet [R] | 200,000 | Nov 2024 | $3 / $15 | 64,000 | 0.95s | 78 t/s
Gemini 2.5 Pro | 1,000,000 | Nov 2024 | $1.25 / $10 | 65,000 | 30s | 191 t/s
GPT-5 | 400,000 | Apr 2025 | $1.25 / $10 | 128,000 | - | -
Kimi K2 Thinking | 256,000 | Apr 2025 | $0.6 / $2.5 | 16,400 | 25.3s | 79 t/s
Gemini 3 Pro | 10,000,000 | Apr 2025 | $2 / $12 | 650,000 | 30.3s | 128 t/s
Claude 4 Sonnet | 200,000 | Mar 2025 | $3 / $15 | 64,000 | 1.9s | -
Claude 4 Opus | 200,000 | Mar 2025 | $15 / $75 | 32,000 | 1.95s | -
GPT oss 120b | 131,072 | Apr 2025 | $0.15 / $0.6 | 131,072 | 8.1s | 260 t/s
GPT oss 20b | 131,072 | Apr 2025 | $0.08 / $0.35 | 131,072 | 4s | 564 t/s
Claude Opus 4.1 | 200,000 | Apr 2025 | $15 / $75 | 32,000 | - | -
GPT 5.1 | 200,000 | Apr 2025 | $1.25 / $10 | 128,000 | - | -
Claude Sonnet 4.5 | 200,000 | Apr 2025 | $3 / $15 | 160,000 | 31s | 69 t/s
GPT 5.2 | 400,000 | Aug 2025 | $1.5 / $14 | 16,000 | 0.6s | 92 t/s
DeepSeek V3 0324 | 128,000 | Dec 2024 | $0.27 / $1.1 | 8,000 | 4s | 33 t/s
Qwen2.5-VL-32B | 131,000 | Dec 2024 | - | 8,000 | - | -
GPT-4.5 | 128,000 | Nov 2024 | $75 / $150 | 16,384 | 1.25s | 48 t/s
Claude 3.7 Sonnet | 200,000 | Nov 2024 | $3 / $15 | 128,000 | 0.91s | 78 t/s
Grok 3 [Beta] | - | Nov 2024 | - | - | - | -
Gemma 3 27b | 128,000 | Nov 2024 | $0.07 / $0.07 | 8,192 | 0.72s | 59 t/s
GPT-4.1 | 1,000,000 | Dec 2024 | $2 / $8 | 16,000 | - | -
GPT-4.1 mini | 1,000,000 | Dec 2024 | $0.4 / $1.6 | 16,000 | - | -
Claude Opus 4.5 | 200,000 | Apr 2025 | $5 / $25 | 64,000 | - | -
OpenAI o1-mini | 128,000 | Dec 2024 | $3 / $12 | 8,000 | 11.43s | 220 t/s
Llama 4 Maverick | 10,000,000 | Nov 2024 | $0.2 / $0.6 | 8,000 | 0.45s | 126 t/s
Llama 4 Scout | 10,000,000 | Nov 2024 | $0.11 / $0.34 | 8,000 | 0.33s | 2600 t/s
Llama 4 Behemoth | - | Nov 2024 | - | - | - | -
GPT-4.1 nano | 1,000,000 | Dec 2024 | $0.1 / $0.4 | 32,000 | - | -
GPT-5.3 Codex | 400,000 | Aug 2025 | $1.75 / $14 | 128,000 | 0.003s | 50 t/s

Context window, cost and speed comparison

Model | Context window | Input cost / 1M tokens | Output cost / 1M tokens | Speed | Latency
Claude Opus 4.6 | 200,000 | $5 | $25 | 67 t/s | 1.6s
Claude Sonnet 4.6 | 200,000 | $3 | $15 | 55 t/s | 0.73s
GPT-5.3 Codex | 400,000 | $1.75 | $14 | 50 t/s | 0.003s
DeepSeek V3 0324 | 128,000 | $0.27 | $1.1 | 33 t/s | 4s
Qwen2.5-VL-32B | 131,000 | n/a | n/a | n/a | n/a
OpenAI o1-mini | 128,000 | $3 | $12 | 220 t/s | 11.43s
OpenAI o3-mini | 200,000 | $1.1 | $4.4 | 214 t/s | 14s
DeepSeek-R1 | 128,000 | $0.55 | $2.19 | 24 t/s | 4s
Claude 3.7 Sonnet [R] | 200,000 | $3 | $15 | 78 t/s | 0.95s
GPT-4.5 | 128,000 | $75 | $150 | 48 t/s | 1.25s
Claude 3.7 Sonnet | 200,000 | $3 | $15 | 78 t/s | 0.91s
Gemini 2.5 Pro | 1,000,000 | $1.25 | $10 | 191 t/s | 30s
Grok 3 [Beta] | - | n/a | n/a | n/a | n/a
Gemma 3 27b | 128,000 | $0.07 | $0.07 | 59 t/s | 0.72s
Llama 4 Maverick | 10,000,000 | $0.2 | $0.6 | 126 t/s | 0.45s
Llama 4 Scout | 10,000,000 | $0.11 | $0.34 | 2600 t/s | 0.33s
Llama 4 Behemoth | n/a | n/a | n/a | n/a | n/a
GPT-4.1 | 1,000,000 | $2 | $8 | n/a | n/a
GPT-4.1 mini | 1,000,000 | $0.4 | $1.6 | n/a | n/a
GPT-4.1 nano | 1,000,000 | $0.1 | $0.4 | n/a | n/a
Claude 4 Sonnet | 200,000 | $3 | $15 | n/a | 1.9s
Claude 4 Opus | 200,000 | $15 | $75 | n/a | 1.95s
GPT oss 120b | 131,072 | $0.15 | $0.6 | 260 t/s | 8.1s
GPT oss 20b | 131,072 | $0.08 | $0.35 | 564 t/s | 4s
Claude Opus 4.1 | 200,000 | $15 | $75 | n/a | n/a
GPT-5 | 400,000 | $1.25 | $10 | n/a | n/a
GPT 5.1 | 200,000 | $1.25 | $10 | n/a | n/a
Kimi K2 Thinking | 256,000 | $0.6 | $2.5 | 79 t/s | 25.3s
Gemini 3 Pro | 10,000,000 | $2 | $12 | 128 t/s | 30.3s
Claude Sonnet 4.5 | 200,000 | $3 | $15 | 69 t/s | 31s
Claude Opus 4.5 | 200,000 | $5 | $25 | n/a | n/a
GPT 5.2 | 400,000 | $1.5 | $14 | 92 t/s | 0.6s