updated 24 Feb 2026

Open Source LLM Leaderboard

This open source LLM leaderboard tracks the latest public benchmark performance of open-weight and open-source models released after April 2024. The data comes from model providers as well as independent evaluations run by Vellum or the open-source community. We feature results from non-saturated benchmarks and exclude outdated ones (e.g. MMLU). If you want to use these models in your agents, try Vellum.

Top open source models per task

Best in Reasoning (GPQA Diamond)

1. Kimi K2.5: 87.6%
2. Kimi K2 Thinking: 84.5%
3. GPT oss 120b: 80.1%
4. Nemotron Ultra 253B: 76%
5. Llama 4 Behemoth: 73.7%

Best in High School Math (AIME 2025)

1. Kimi K2 Thinking: 99.1%
2. GPT oss 20b: 98.7%
3. GPT oss 120b: 97.9%
4. Kimi K2.5: 96.1%
5. DeepSeek-R1: 74%

Best in Agentic Coding (SWE Bench)

1. Kimi K2.5: 76.8%
2. Kimi K2 Thinking: 71.3%
3. DeepSeek-R1: 49.2%
4. DeepSeek V3 0324: 38.8%
5. Qwen2.5-VL-32B: 18.8%

Best Overall (Humanity's Last Exam)

1. Kimi K2 Thinking: 44.9%
2. Kimi K2.5: 30.1%
3. GPT oss 120b: 14.9%
4. GPT oss 20b: 10.9%
5. DeepSeek-R1: 8.6%

Best in Visual Reasoning (ARC-AGI 2)

1. Kimi K2.5: 12%

Best in Multilingual Reasoning (MMMLU)

1. Llama 4 Behemoth: 85.8%
2. Llama 4 Maverick: 84.6%

Fastest and most affordable open source models

Fastest Models (Tokens/sec)

1. Llama 4 Scout: 2,600 t/s
2. Llama 3.3 70b: 2,500 t/s
3. Llama 3.1 70b: 2,100 t/s
4. Llama 3.1 405b: 969 t/s
5. GPT oss 20b: 564 t/s

Lowest Latency (TTFT)

1. Llama 4 Scout: 0.33s
2. Llama 4 Maverick: 0.45s
3. Llama 3.3 70b: 0.52s
4. Gemma 3 27b: 0.72s
5. Llama 3.1 405b: 0.73s

Cheapest Models (input / output cost per 1M tokens)

1. Gemma 3 27b: $0.07 / $0.07
2. GPT oss 20b: $0.08 / $0.35
3. Llama 4 Scout: $0.11 / $0.34
4. GPT oss 120b: $0.15 / $0.60
5. Llama 4 Maverick: $0.20 / $0.60
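Per-1M-token rates translate directly into per-request cost. A minimal sketch, using rates from the list above; the token counts (10k input, 2k output) are hypothetical, for illustration only:

```python
# Estimate request cost in dollars from per-1M-token input/output rates.
def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float) -> float:
    """Cost in dollars, given rates quoted per 1M tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# Gemma 3 27b at $0.07 / $0.07 per 1M tokens:
print(request_cost(10_000, 2_000, 0.07, 0.07))  # ~$0.00084
# GPT oss 120b at $0.15 / $0.60 per 1M tokens:
print(request_cost(10_000, 2_000, 0.15, 0.60))  # ~$0.0027
```

Note that output tokens usually cost several times more than input tokens, so long generations (e.g. reasoning traces) dominate the bill even when prompts are large.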

Compare open source models

Kimi K2.5 vs Llama 3.1 405b

Context size: 256,000 vs 128,000
Cutoff date: Apr 2024 vs Dec 2023
I/O cost: $0.6 / $2.5 vs $3.5 / $3.5
Max output: 33,000 vs 4,096
Latency: 1.2s vs 0.73s
Speed: 45 t/s vs 969 t/s

Benchmarks (Kimi K2.5 vs Llama 3.1 405b):
GPQA Diamond: 87.6 vs 49
BFCL: - vs 81.1
MATH 500: 98 vs 73.8
AIME 2025: 96.1 vs -
SWE Bench: 76.8 vs -
LiveCodeBench: 85 vs -

Model Comparison

Model | Context size | Cutoff date | I/O cost | Max output | Latency | Speed
DeepSeek-R1 | 128,000 | Dec 2024 | $0.55 / $2.19 | 8,000 | 4s | 24 t/s
Kimi K2 Thinking | 256,000 | Apr 2025 | $0.6 / $2.5 | 16,400 | 25.3s | 79 t/s
GPT oss 120b | 131,072 | Apr 2025 | $0.15 / $0.6 | 131,072 | 8.1s | 260 t/s
GPT oss 20b | 131,072 | Apr 2025 | $0.08 / $0.35 | 131,072 | 4s | 564 t/s
DeepSeek V3 0324 | 128,000 | Dec 2024 | $0.27 / $1.1 | 8,000 | 4s | 33 t/s
Qwen2.5-VL-32B | 131,000 | Dec 2024 | - | 8,000 | - | -
Gemma 3 27b | 128,000 | Nov 2024 | $0.07 / $0.07 | 8,192 | 0.72s | 59 t/s
Llama 4 Maverick | 10,000,000 | Nov 2024 | $0.2 / $0.6 | 8,000 | 0.45s | 126 t/s
Llama 4 Scout | 10,000,000 | Nov 2024 | $0.11 / $0.34 | 8,000 | 0.33s | 2,600 t/s
Llama 4 Behemoth | - | Nov 2024 | - | - | - | -

Context window, cost and speed comparison

Model | Context window | Input cost / 1M tokens | Output cost / 1M tokens | Speed | Latency
DeepSeek V3 0324 | 128,000 | $0.27 | $1.1 | 33 t/s | 4s
Qwen2.5-VL-32B | 131,000 | n/a | n/a | n/a | n/a
DeepSeek-R1 | 128,000 | $0.55 | $2.19 | 24 t/s | 4s
Gemma 3 27b | 128,000 | $0.07 | $0.07 | 59 t/s | 0.72s
Llama 4 Maverick | 10,000,000 | $0.2 | $0.6 | 126 t/s | 0.45s
Llama 4 Scout | 10,000,000 | $0.11 | $0.34 | 2,600 t/s | 0.33s
Llama 4 Behemoth | n/a | n/a | n/a | n/a | n/a
GPT oss 120b | 131,072 | $0.15 | $0.6 | 260 t/s | 8.1s
GPT oss 20b | 131,072 | $0.08 | $0.35 | 564 t/s | 4s
Kimi K2 Thinking | 256,000 | $0.6 | $2.5 | 79 t/s | 25.3s