updated 23 Mar 2026
Open Source LLM Leaderboard
This open source LLM leaderboard tracks the latest public benchmark performance of open-weight and open-source models released after April 2024. The data comes from model providers as well as evaluations run independently by Vellum or the open-source community. We feature results only from benchmarks that are not yet saturated, and exclude outdated ones (e.g. MMLU).
Top open source models per task
Best in Reasoning (GPQA Diamond)
| Model | Score |
|---|---|
| Kimi K2.5 | 87.6% |
| Kimi K2 Thinking | 84.5% |
| GPT oss 120b | 80.1% |
| Nemotron Ultra 253B | 76% |
| Llama 4 Behemoth | 73.7% |
Best in High School Math (AIME 2025)
| Model | Score |
|---|---|
| Kimi K2 Thinking | 99.1% |
| GPT oss 20b | 98.7% |
| GPT oss 120b | 97.9% |
| Kimi K2.5 | 96.1% |
| DeepSeek-R1 | 74% |
Best in Agentic Coding (SWE-Bench Verified)
| Model | Score |
|---|---|
| Kimi K2.5 | 76.8% |
| Kimi K2 Thinking | 71.3% |
| DeepSeek-R1 | 49.2% |
| DeepSeek V3 0324 | 38.8% |
| Qwen2.5-VL-32B | 18.8% |
Best Overall (Humanity's Last Exam)
| Model | Score |
|---|---|
| Kimi K2 Thinking | 44.9% |
| Kimi K2.5 | 30.1% |
| GPT oss 120b | 14.9% |
| GPT oss 20b | 10.9% |
| DeepSeek-R1 | 8.6% |
Best in Visual Reasoning (ARC-AGI 2)
| Model | Score |
|---|---|
| Kimi K2.5 | 12% |
Best in Multilingual Reasoning (MMMLU)
| Model | Score |
|---|---|
| Llama 4 Behemoth | 85.8% |
| Llama 4 Maverick | 84.6% |
Fastest and most affordable open source models
Fastest Models (Tokens/sec)
| Rank | Model | Speed |
|---|---|---|
| 1 | Llama 4 Scout | 2600 t/s |
| 2 | Llama 3.3 70b | 2500 t/s |
| 3 | Llama 3.1 70b | 2100 t/s |
| 4 | Llama 3.1 405b | 969 t/s |
| 5 | GPT oss 20b | 564 t/s |
Lowest Latency (TTFT)
| Rank | Model | TTFT |
|---|---|---|
| 1 | Llama 4 Scout | 0.33s |
| 2 | Llama 4 Maverick | 0.45s |
| 3 | Llama 3.3 70b | 0.52s |
| 4 | Gemma 3 27b | 0.72s |
| 5 | Llama 3.1 405b | 0.73s |
Cheapest Models (per 1M tokens)
| Rank | Model | Input / Output |
|---|---|---|
| 1 | Gemma 3 27b | $0.07 / $0.07 |
| 2 | GPT oss 20b | $0.08 / $0.35 |
| 3 | Llama 4 Scout | $0.11 / $0.34 |
| 4 | GPT oss 120b | $0.15 / $0.6 |
| 5 | Llama 4 Maverick | $0.2 / $0.6 |
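Per-1M-token prices translate into a per-request figure by scaling each side of the price pair by the tokens a request actually uses. A minimal sketch in Python (`request_cost` is a throwaway helper, not an API; the token counts are illustrative, the prices come from the table above):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Estimate one request's cost from per-1M-token input/output prices."""
    return (input_tokens / 1_000_000) * input_price + \
           (output_tokens / 1_000_000) * output_price

# Example: GPT oss 20b at $0.08 input / $0.35 output per 1M tokens,
# with a hypothetical 2,000-token prompt and 500-token completion.
print(f"${request_cost(2_000, 500, 0.08, 0.35):.6f}")  # $0.000335
```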
Compare open source models
| | Kimi K2.5 | Llama 3.1 405b |
|---|---|---|
| Context size | 256,000 | 128,000 |
| Cutoff date | Apr 2024 | Dec 2023 |
| I/O cost | $0.6 / $2.5 | $3.5 / $3.5 |
| Max output | 33,000 | 4,096 |
| Latency | 1.2s | 0.73s |
| Speed | 45 t/s | 969 t/s |
Model Comparison
| Model | Context size | Cutoff date | I/O cost | Max output | Latency | Speed |
|---|---|---|---|---|---|---|
| DeepSeek-R1 | 128,000 | Dec 2024 | $0.55 / $2.19 | 8,000 | 4s | 24 t/s |
| Kimi K2 Thinking | 256,000 | Apr 2025 | $0.6 / $2.5 | 16,400 | 25.3s | 79 t/s |
| GPT oss 120b | 131,072 | Apr 2025 | $0.15 / $0.6 | 131,072 | 8.1s | 260 t/s |
| GPT oss 20b | 131,072 | Apr 2025 | $0.08 / $0.35 | 131,072 | 4s | 564 t/s |
| DeepSeek V3 0324 | 128,000 | Dec 2024 | $0.27 / $1.1 | 8,000 | 4s | 33 t/s |
| Qwen2.5-VL-32B | 131,000 | Dec 2024 | - | 8,000 | - | - |
| Gemma 3 27b | 128,000 | Nov 2024 | $0.07 / $0.07 | 8,192 | 0.72s | 59 t/s |
| Llama 4 Maverick | 10,000,000 | Nov 2024 | $0.2 / $0.6 | 8,000 | 0.45s | 126 t/s |
| Llama 4 Scout | 10,000,000 | Nov 2024 | $0.11 / $0.34 | 8,000 | 0.33s | 2600 t/s |
| Llama 4 Behemoth | - | Nov 2024 | - | - | - | - |
Context window, cost and speed comparison
| Model | Context Window | Input Cost / 1M tokens | Output Cost / 1M tokens | Speed (tokens/second) | Latency |
|---|---|---|---|---|---|
| DeepSeek V3 0324 | 128,000 | $0.27 | $1.1 | 33 t/s | 4 seconds |
Qwen2.5-VL-32B | 131,000 | n/a | n/a | n/a | n/a |
| DeepSeek-R1 | 128,000 | $0.55 | $2.19 | 24 t/s | 4 seconds |
| Gemma 3 27b | 128,000 | $0.07 | $0.07 | 59 t/s | 0.72 seconds |
| Llama 4 Maverick | 10,000,000 | $0.2 | $0.6 | 126 t/s | 0.45 seconds |
| Llama 4 Scout | 10,000,000 | $0.11 | $0.34 | 2600 t/s | 0.33 seconds |
| Llama 4 Behemoth | n/a | n/a | n/a | n/a | n/a |
| GPT oss 120b | 131,072 | $0.15 | $0.6 | 260 t/s | 8.1 seconds |
| GPT oss 20b | 131,072 | $0.08 | $0.35 | 564 t/s | 4 seconds |
| Kimi K2 Thinking | 256,000 | $0.6 | $2.5 | 79 t/s | 25.3 seconds |
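Speed and latency combine into a rough end-to-end estimate: total time ≈ time to first token + output tokens / decoding speed. A minimal sketch (`generation_time` is a throwaway helper; it treats the latency column as TTFT, matching the "Lowest Latency (TTFT)" list above, and the output length is illustrative):

```python
def generation_time(ttft_s: float, speed_tps: float, output_tokens: int) -> float:
    """Rough end-to-end time: time to first token plus decode time."""
    return ttft_s + output_tokens / speed_tps

# Example: 1,000 output tokens, TTFT and speed figures from the table above.
print(f"GPT oss 120b:  {generation_time(8.1, 260, 1_000):.1f}s")   # ~11.9s
print(f"Llama 4 Scout: {generation_time(0.33, 2600, 1_000):.1f}s") # ~0.7s
```

Note that a fast decoder does not guarantee a fast request: a high TTFT dominates short completions, which is why GPT oss 120b trails Llama 4 Scout here despite a respectable 260 t/s.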
Benchmark glossary
- GPQA Diamond
- Graduate-level science questions curated by domain experts. Tests advanced reasoning across physics, chemistry, and biology.
- AIME 2025
- Problems from the 2025 American Invitational Mathematics Examination. Measures multi-step mathematical problem solving.
- SWE-Bench Verified
- Real GitHub issues from popular Python repos that the model must resolve end-to-end. Measures agentic software engineering ability.
- Humanity's Last Exam
- A crowd-sourced exam of extremely hard questions spanning every academic discipline. Designed to be the final exam before superhuman AI.
- ARC-AGI 2
- Abstract visual puzzles requiring novel pattern recognition. Tests fluid intelligence and generalization beyond training data.
- MMMLU
- Multilingual MMLU: Massive Multitask Language Understanding questions translated into multiple languages. Evaluates knowledge and reasoning in non-English contexts.