updated 23 Mar 2026

LLM Leaderboard

This LLM leaderboard displays the latest public benchmark performance for SOTA model versions released after April 2024. The data comes from model providers as well as independent evaluations run by Vellum or the open-source community. We feature results from non-saturated benchmarks and exclude outdated ones (e.g. MMLU).

Top models per task

Best in Reasoning (GPQA Diamond)

Model               Score
Claude 3 Opus       95.4%
GPT 5.2             92.4%
Gemini 3 Pro        91.9%
Claude Opus 4.6     91.3%
Claude Sonnet 4.6   89.9%

Best in High School Math (AIME 2025)

Model               Score
Gemini 3 Pro        100%
GPT 5.2             100%
Claude Opus 4.6     99.8%
Kimi K2 Thinking    99.1%
GPT oss 20b         98.7%

Best in Agentic Coding (SWE Bench)

Model               Score
Claude Sonnet 4.5   82%
Claude Opus 4.5     80.9%
Claude Opus 4.6     80.8%
GPT 5.2             80%
Claude Sonnet 4.6   79.6%

Best Overall (Humanity's Last Exam)

Model               Score
Gemini 3 Pro        45.8%
Kimi K2 Thinking    44.9%
Claude Opus 4.6     40%
GPT-5               35.2%
Kimi K2.5           30.1%

Best in Visual Reasoning (ARC-AGI 2)

Model               Score
Claude Opus 4.6     68.8%
Claude Sonnet 4.6   58.3%
GPT 5.2             52.9%
Claude Opus 4.5     37.6%
Gemini 3 Pro        31%

Best in Multilingual Reasoning (MMMLU)

Model               Score
Gemini 3 Pro        91.8%
Claude Opus 4.6     91.1%
Claude Opus 4.5     90.8%
Claude Opus 4.1     89.5%
Claude Sonnet 4.6   89.3%

Fastest and most affordable models

Fastest Models (Tokens/sec)

1. Llama 4 Scout     2600 t/s
2. Llama 3.3 70b     2500 t/s
3. Llama 3.1 70b     2100 t/s
4. Llama 3.1 8b      1800 t/s
5. Llama 3.1 405b    969 t/s

Lowest Latency (TTFT)

1. GPT-5.3 Codex     0.003s
2. Nova Micro        0.3s
3. Llama 3.1 8b      0.32s
4. Llama 4 Scout     0.33s
5. Gemini 2.0 Flash  0.34s

Cheapest Models (input / output, per 1M tokens)

1. Nova Micro        $0.04 / $0.14
2. Gemma 3 27b       $0.07 / $0.07
3. Gemini 1.5 Flash  $0.075 / $0.3
4. GPT oss 20b       $0.08 / $0.35
5. Gemini 2.0 Flash  $0.1 / $0.4
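The per-1M-token prices above make per-request cost a simple proportion: input tokens at the input rate plus output tokens at the output rate. A minimal sketch (Nova Micro's $0.04 / $0.14 rates are taken from the list above; the token counts are a hypothetical example):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Return the USD cost of one request, given per-1M-token prices."""
    return (input_tokens / 1_000_000) * input_price_per_m \
         + (output_tokens / 1_000_000) * output_price_per_m

# A 2,000-token prompt with a 500-token reply on Nova Micro ($0.04 / $0.14):
cost = request_cost(2_000, 500, 0.04, 0.14)
print(f"${cost:.6f}")  # → $0.000150
```

The same function works for any row in the pricing tables below; only the two rates change.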

Compare models

                  Claude Opus 4.6   Claude Sonnet 4.6
Context size      200,000           200,000
Cutoff date       May 2025          Aug 2025
I/O cost          $5 / $25          $3 / $15
Max output        128,000           64,000
Latency           1.6s              0.73s
Speed             67 t/s            55 t/s
GPQA Diamond      91.3              89.9
BFCL              -                 -
MATH 500          97.6              97.8
AIME 2025         99.8              83
SWE Bench         80.8              79.6
LiveCodeBench     76                72.4

Model Comparison

Model | Context size | Cutoff date | I/O cost | Max output | Latency | Speed
Claude Opus 4.6 | 200,000 | May 2025 | $5 / $25 | 128,000 | 1.6s | 67 t/s
Claude Sonnet 4.6 | 200,000 | Aug 2025 | $3 / $15 | 64,000 | 0.73s | 55 t/s
OpenAI o3-mini | 200,000 | Dec 2024 | $1.1 / $4.4 | 8,000 | 14s | 214 t/s
DeepSeek-R1 | 128,000 | Dec 2024 | $0.55 / $2.19 | 8,000 | 4s | 24 t/s
Claude 3.7 Sonnet [R] | 200,000 | Nov 2024 | $3 / $15 | 64,000 | 0.95s | 78 t/s
Gemini 2.5 Pro | 1,000,000 | Nov 2024 | $1.25 / $10 | 65,000 | 30s | 191 t/s
GPT-5 | 400,000 | Apr 2025 | $1.25 / $10 | 128,000 | - | -
Kimi K2 Thinking | 256,000 | Apr 2025 | $0.6 / $2.5 | 16,400 | 25.3s | 79 t/s
Gemini 3 Pro | 10,000,000 | Apr 2025 | $2 / $12 | 650,000 | 30.3s | 128 t/s
Claude 4 Sonnet | 200,000 | Mar 2025 | $3 / $15 | 64,000 | 1.9s | -
Claude 4 Opus | 200,000 | Mar 2025 | $15 / $75 | 32,000 | 1.95s | -
GPT oss 120b | 131,072 | Apr 2025 | $0.15 / $0.6 | 131,072 | 8.1s | 260 t/s
GPT oss 20b | 131,072 | Apr 2025 | $0.08 / $0.35 | 131,072 | 4s | 564 t/s
Claude Opus 4.1 | 200,000 | Apr 2025 | $15 / $75 | 32,000 | - | -
GPT 5.1 | 200,000 | Apr 2025 | $1.25 / $10 | 128,000 | - | -
Claude Sonnet 4.5 | 200,000 | Apr 2025 | $3 / $15 | 160,000 | 31s | 69 t/s
GPT 5.2 | 400,000 | Aug 2025 | $1.5 / $14 | 16,000 | 0.6s | 92 t/s
DeepSeek V3 0324 | 128,000 | Dec 2024 | $0.27 / $1.1 | 8,000 | 4s | 33 t/s
Qwen2.5-VL-32B | 131,000 | Dec 2024 | - | 8,000 | - | -
GPT-4.5 | 128,000 | Nov 2024 | $75 / $150 | 16,384 | 1.25s | 48 t/s
Claude 3.7 Sonnet | 200,000 | Nov 2024 | $3 / $15 | 128,000 | 0.91s | 78 t/s
Grok 3 [Beta] | - | Nov 2024 | - | - | - | -
Gemma 3 27b | 128,000 | Nov 2024 | $0.07 / $0.07 | 8,192 | 0.72s | 59 t/s
GPT-4.1 | 1,000,000 | Dec 2024 | $2 / $8 | 16,000 | - | -
GPT-4.1 mini | 1,000,000 | Dec 2024 | $0.4 / $1.6 | 16,000 | - | -
Claude Opus 4.5 | 200,000 | Apr 2025 | $5 / $25 | 64,000 | - | -
OpenAI o1-mini | 128,000 | Dec 2024 | $3 / $12 | 8,000 | 11.43s | 220 t/s
Llama 4 Maverick | 10,000,000 | Nov 2024 | $0.2 / $0.6 | 8,000 | 0.45s | 126 t/s
Llama 4 Scout | 10,000,000 | Nov 2024 | $0.11 / $0.34 | 8,000 | 0.33s | 2600 t/s
Llama 4 Behemoth | - | Nov 2024 | - | - | - | -
GPT-4.1 nano | 1,000,000 | Dec 2024 | $0.1 / $0.4 | 32,000 | - | -
GPT-5.3 Codex | 400,000 | Aug 2025 | $1.75 / $14 | 128,000 | 0.003s | 50 t/s

Context window, cost and speed comparison

Model | Context window | Input cost / 1M tokens | Output cost / 1M tokens | Speed (tokens/second) | Latency
Claude Opus 4.6 | 200,000 | $5 | $25 | 67 t/s | 1.6 seconds
Claude Sonnet 4.6 | 200,000 | $3 | $15 | 55 t/s | 0.73 seconds
GPT-5.3 Codex | 400,000 | $1.75 | $14 | 50 t/s | 0.003 seconds
DeepSeek V3 0324 | 128,000 | $0.27 | $1.1 | 33 t/s | 4 seconds
Qwen2.5-VL-32B | 131,000 | n/a | n/a | n/a | n/a
OpenAI o1-mini | 128,000 | $3 | $12 | 220 t/s | 11.43 seconds
OpenAI o3-mini | 200,000 | $1.1 | $4.4 | 214 t/s | 14 seconds
DeepSeek-R1 | 128,000 | $0.55 | $2.19 | 24 t/s | 4 seconds
Claude 3.7 Sonnet [R] | 200,000 | $3 | $15 | 78 t/s | 0.95 seconds
GPT-4.5 | 128,000 | $75 | $150 | 48 t/s | 1.25 seconds
Claude 3.7 Sonnet | 200,000 | $3 | $15 | 78 t/s | 0.91 seconds
Gemini 2.5 Pro | 1,000,000 | $1.25 | $10 | 191 t/s | 30 seconds
Grok 3 [Beta] | n/a | n/a | n/a | n/a | n/a
Gemma 3 27b | 128,000 | $0.07 | $0.07 | 59 t/s | 0.72 seconds
Llama 4 Maverick | 10,000,000 | $0.2 | $0.6 | 126 t/s | 0.45 seconds
Llama 4 Scout | 10,000,000 | $0.11 | $0.34 | 2600 t/s | 0.33 seconds
Llama 4 Behemoth | n/a | n/a | n/a | n/a | n/a
GPT-4.1 | 1,000,000 | $2 | $8 | n/a | n/a
GPT-4.1 mini | 1,000,000 | $0.4 | $1.6 | n/a | n/a
GPT-4.1 nano | 1,000,000 | $0.1 | $0.4 | n/a | n/a
Claude 4 Sonnet | 200,000 | $3 | $15 | n/a | 1.9 seconds
Claude 4 Opus | 200,000 | $15 | $75 | n/a | 1.95 seconds
GPT oss 120b | 131,072 | $0.15 | $0.6 | 260 t/s | 8.1 seconds
GPT oss 20b | 131,072 | $0.08 | $0.35 | 564 t/s | 4 seconds
Claude Opus 4.1 | 200,000 | $15 | $75 | n/a | n/a
GPT-5 | 400,000 | $1.25 | $10 | n/a | n/a
GPT 5.1 | 200,000 | $1.25 | $10 | n/a | n/a
Kimi K2 Thinking | 256,000 | $0.6 | $2.5 | 79 t/s | 25.3 seconds
Gemini 3 Pro | 10,000,000 | $2 | $12 | 128 t/s | 30.3 seconds
Claude Sonnet 4.5 | 200,000 | $3 | $15 | 69 t/s | 31 seconds
Claude Opus 4.5 | 200,000 | $5 | $25 | n/a | n/a
GPT 5.2 | 400,000 | $1.5 | $14 | 92 t/s | 0.6 seconds
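The speed and latency columns combine into a rough end-to-end estimate: time to first token, plus output tokens divided by decode speed. A sketch using Claude Sonnet 4.6's figures from the table above (the 1,000-token response length is a hypothetical example, and real serving times vary with load):

```python
def response_time(output_tokens: int, ttft_s: float, speed_tps: float) -> float:
    """Estimate seconds until a full response arrives: TTFT + decode time."""
    return ttft_s + output_tokens / speed_tps

# A 1,000-token answer on Claude Sonnet 4.6 (0.73 s latency, 55 t/s):
print(round(response_time(1_000, 0.73, 55), 1))  # → 18.9
```

Note how decode speed dominates for long outputs: at 55 t/s the 0.73 s TTFT is under 4% of the total, which is why a fast-decoding model can feel quicker than a low-latency one on long generations.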

Benchmark glossary

GPQA Diamond
Graduate-level science questions curated by domain experts. Tests advanced reasoning across physics, chemistry, and biology.
AIME 2025
Problems from the 2025 American Invitational Mathematics Examination. Measures multi-step mathematical problem solving.
SWE-Bench Verified
Real GitHub issues from popular Python repos that the model must resolve end-to-end. Measures agentic software engineering ability.
Humanity's Last Exam
A crowd-sourced exam of extremely hard questions spanning every academic discipline. Designed to be the final exam before superhuman AI.
ARC-AGI 2
Abstract visual puzzles requiring novel pattern recognition. Tests fluid intelligence and generalization beyond training data.
MMMLU
Massive Multitask Language Understanding across multiple languages. Evaluates knowledge and reasoning in non-English contexts.