Updated 23 Mar 2026

Open Source LLM Leaderboard

This open source LLM leaderboard tracks the latest public benchmark results for open-weight and open-source models released after April 2024. The data comes from model providers as well as independent evaluations run by Vellum or the open-source community. We feature results only from benchmarks that are not yet saturated, excluding outdated ones such as MMLU.

Top open source models per task

Best in Reasoning (GPQA Diamond)

Model                  Score
Kimi K2.5              87.6%
Kimi K2 Thinking       84.5%
GPT oss 120b           80.1%
Nemotron Ultra 253B    76%
Llama 4 Behemoth       73.7%

Best in High School Math (AIME 2025)

Model                  Score
Kimi K2 Thinking       99.1%
GPT oss 20b            98.7%
GPT oss 120b           97.9%
Kimi K2.5              96.1%
DeepSeek-R1            74%

Best in Agentic Coding (SWE Bench)

Model                  Score
Kimi K2.5              76.8%
Kimi K2 Thinking       71.3%
DeepSeek-R1            49.2%
DeepSeek V3 0324       38.8%
Qwen2.5-VL-32B         18.8%

Best Overall (Humanity's Last Exam)

Model                  Score
Kimi K2 Thinking       44.9%
Kimi K2.5              30.1%
GPT oss 120b           14.9%
GPT oss 20b            10.9%
DeepSeek-R1            8.6%

Best in Visual Reasoning (ARC-AGI 2)

Model                  Score
Kimi K2.5              12%

Best in Multilingual Reasoning (MMMLU)

Model                  Score
Llama 4 Behemoth       85.8%
Llama 4 Maverick       84.6%

Fastest and most affordable open source models

Fastest Models (Tokens/sec)

1. Llama 4 Scout      2600 t/s
2. Llama 3.3 70b      2500 t/s
3. Llama 3.1 70b      2100 t/s
4. Llama 3.1 405b     969 t/s
5. GPT oss 20b        564 t/s

Lowest Latency (TTFT)

1. Llama 4 Scout      0.33s
2. Llama 4 Maverick   0.45s
3. Llama 3.3 70b      0.52s
4. Gemma 3 27b        0.72s
5. Llama 3.1 405b     0.73s

Cheapest Models (input / output cost per 1M tokens)

1. Gemma 3 27b        $0.07 / $0.07
2. GPT oss 20b        $0.08 / $0.35
3. Llama 4 Scout      $0.11 / $0.34
4. GPT oss 120b       $0.15 / $0.60
5. Llama 4 Maverick   $0.20 / $0.60
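
To make the per-1M-token rates concrete, here is a minimal sketch that estimates the cost of a single request from the prices listed above. The 2,000-input / 500-output token counts are hypothetical examples, and this ignores any provider-specific extras such as caching discounts:

```python
# Rough per-request cost estimate from the per-1M-token prices above.
# Prices are (input, output) in USD per 1M tokens; the token counts in
# the example are hypothetical, not measurements.
PRICES = {
    "Gemma 3 27b": (0.07, 0.07),
    "GPT oss 20b": (0.08, 0.35),
    "Llama 4 Scout": (0.11, 0.34),
    "GPT oss 120b": (0.15, 0.60),
    "Llama 4 Maverick": (0.20, 0.60),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request at the listed per-1M-token rates."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Example: a 2,000-token prompt with a 500-token completion.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 2_000, 500):.6f}")
```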

Compare open source models

Kimi K2.5 vs Llama 3.1 405b

                 Kimi K2.5        Llama 3.1 405b
Context size     256,000          128,000
Cutoff date      Apr 2024         Dec 2023
I/O cost         $0.60 / $2.50    $3.50 / $3.50
Max output       33,000           4,096
Latency          1.2s             0.73s
Speed            45 t/s           969 t/s

Benchmark        Kimi K2.5        Llama 3.1 405b
GPQA Diamond     87.6             49
BFCL             -                81.1
MATH 500         98               73.8
AIME 2025        96.1             -
SWE Bench        76.8             -
LiveCodeBench    85               -
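
If you want to run this head-to-head comparison programmatically, here is a minimal sketch that loads the scores above into a dict (None marks benchmarks with no published score) and reports which model leads on each:

```python
# Head-to-head benchmark scores from the table above.
# None marks benchmarks with no published score for that model.
SCORES = {
    "GPQA Diamond":  {"Kimi K2.5": 87.6, "Llama 3.1 405b": 49.0},
    "BFCL":          {"Kimi K2.5": None, "Llama 3.1 405b": 81.1},
    "MATH 500":      {"Kimi K2.5": 98.0, "Llama 3.1 405b": 73.8},
    "AIME 2025":     {"Kimi K2.5": 96.1, "Llama 3.1 405b": None},
    "SWE Bench":     {"Kimi K2.5": 76.8, "Llama 3.1 405b": None},
    "LiveCodeBench": {"Kimi K2.5": 85.0, "Llama 3.1 405b": None},
}

for bench, row in SCORES.items():
    scored = {m: s for m, s in row.items() if s is not None}
    if len(scored) < 2:
        print(f"{bench}: only {', '.join(scored)} reports a score")
        continue
    leader = max(scored, key=scored.get)
    print(f"{bench}: {leader} leads ({scored[leader]})")
```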

Model Comparison

Model               Context size  Cutoff date  I/O cost       Max output  Latency  Speed
DeepSeek-R1         128,000       Dec 2024     $0.55 / $2.19  8,000       4s       24 t/s
Kimi K2 Thinking    256,000       Apr 2025     $0.60 / $2.50  16,400      25.3s    79 t/s
GPT oss 120b        131,072       Apr 2025     $0.15 / $0.60  131,072     8.1s     260 t/s
GPT oss 20b         131,072       Apr 2025     $0.08 / $0.35  131,072     4s       564 t/s
DeepSeek V3 0324    128,000       Dec 2024     $0.27 / $1.10  8,000       4s       33 t/s
Qwen2.5-VL-32B      131,000       Dec 2024     -              8,000       -        -
Gemma 3 27b         128,000       Nov 2024     $0.07 / $0.07  8,192       0.72s    59 t/s
Llama 4 Maverick    10,000,000    Nov 2024     $0.20 / $0.60  8,000       0.45s    126 t/s
Llama 4 Scout       10,000,000    Nov 2024     $0.11 / $0.34  8,000       0.33s    2600 t/s
Llama 4 Behemoth    -             Nov 2024     -              -           -        -
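
One practical use of the context-size and max-output columns is checking whether a prompt will fit a model's window once you reserve room for the reply. A minimal sketch, with figures copied from the table above; `fits` is an illustrative helper, not a library API:

```python
# Context budgeting: the prompt plus the reserved reply must fit inside
# the context window. Window and max-output figures are copied from the
# table above; `fits` is an illustrative helper, not a provider API.
WINDOWS = {  # model: (context window, max output tokens)
    "DeepSeek-R1": (128_000, 8_000),
    "Kimi K2 Thinking": (256_000, 16_400),
    "Llama 4 Scout": (10_000_000, 8_000),
}

def fits(model: str, prompt_tokens: int, reply_budget: int) -> bool:
    """True if the prompt plus the reserved reply fits the window."""
    window, max_out = WINDOWS[model]
    return prompt_tokens + min(reply_budget, max_out) <= window

print(fits("DeepSeek-R1", 125_000, 8_000))       # False: 133,000 > 128,000
print(fits("Kimi K2 Thinking", 125_000, 8_000))  # True: 133,000 <= 256,000
```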

Context window, cost and speed comparison

Model               Context window  Input cost/1M  Output cost/1M  Speed     Latency
DeepSeek V3 0324    128,000         $0.27          $1.10           33 t/s    4s
Qwen2.5-VL-32B      131,000         n/a            n/a             n/a       n/a
DeepSeek-R1         128,000         $0.55          $2.19           24 t/s    4s
Gemma 3 27b         128,000         $0.07          $0.07           59 t/s    0.72s
Llama 4 Maverick    10,000,000      $0.20          $0.60           126 t/s   0.45s
Llama 4 Scout       10,000,000      $0.11          $0.34           2600 t/s  0.33s
Llama 4 Behemoth    n/a             n/a            n/a             n/a       n/a
GPT oss 120b        131,072         $0.15          $0.60           260 t/s   8.1s
GPT oss 20b         131,072         $0.08          $0.35           564 t/s   4s
Kimi K2 Thinking    256,000         $0.60          $2.50           79 t/s    25.3s
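
A back-of-the-envelope way to combine the speed and latency columns: total response time is roughly TTFT plus output tokens divided by decode speed. The sketch below uses the figures above and assumes a steady decode rate, which real serving stacks don't guarantee:

```python
# Rough wall-clock estimate for one response:
#   total_time ≈ TTFT + output_tokens / decode_speed
# Figures come from the table above; assuming a steady decode rate is a
# simplification, since real throughput varies with load and context.
PROFILES = {  # model: (TTFT in seconds, decode speed in tokens/second)
    "Llama 4 Scout": (0.33, 2600),
    "GPT oss 20b": (4.0, 564),
    "GPT oss 120b": (8.1, 260),
    "Kimi K2 Thinking": (25.3, 79),
}

def response_time(model: str, output_tokens: int) -> float:
    """Estimated seconds to finish a response of the given length."""
    ttft, tps = PROFILES[model]
    return ttft + output_tokens / tps

# Example: a 1,000-token answer.
for name in PROFILES:
    print(f"{name}: ~{response_time(name, 1_000):.1f}s")
```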

Benchmark glossary

GPQA Diamond
Graduate-level science questions curated by domain experts. Tests advanced reasoning across physics, chemistry, and biology.
AIME 2025
Problems from the 2025 American Invitational Mathematics Examination. Measures multi-step mathematical problem solving.
SWE-Bench Verified
Real GitHub issues from popular Python repos that the model must resolve end-to-end. Measures agentic software engineering ability.
Humanity's Last Exam
A crowd-sourced exam of extremely hard questions spanning every academic discipline. Designed to be the final exam before superhuman AI.
ARC-AGI 2
Abstract visual puzzles requiring novel pattern recognition. Tests fluid intelligence and generalization beyond training data.
MMMLU
A multilingual translation of MMLU (Massive Multitask Language Understanding). Evaluates knowledge and reasoning in non-English contexts.