updated 24 Feb 2026
Open Source LLM Leaderboard
This open source LLM leaderboard displays the latest public benchmark performance of open-weight and open-source models released after April 2024. The data comes from model providers as well as independent evaluations run by Vellum or the open-source community. We feature results from non-saturated benchmarks and exclude outdated ones such as MMLU. If you want to use these models in your agents, try Vellum.
Top open source models per task
- Best in Reasoning (GPQA Diamond)
- Best in High School Math (AIME 2025)
- Best in Agentic Coding (SWE Bench)
- Best Overall (Humanity's Last Exam)
- Best in Visual Reasoning (ARC-AGI 2)
- Best in Multilingual Reasoning (MMMLU)
Fastest and most affordable open source models
Fastest Models (Tokens/sec)
| Rank | Model | Speed |
|---|---|---|
| 1 | Llama 4 Scout | 2600 t/s |
| 2 | Llama 3.3 70b | 2500 t/s |
| 3 | Llama 3.1 70b | 2100 t/s |
| 4 | Llama 3.1 405b | 969 t/s |
| 5 | GPT oss 20b | 564 t/s |
Lowest Latency (TTFT)
| Rank | Model | TTFT |
|---|---|---|
| 1 | Llama 4 Scout | 0.33s |
| 2 | Llama 4 Maverick | 0.45s |
| 3 | Llama 3.3 70b | 0.52s |
| 4 | Gemma 3 27b | 0.72s |
| 5 | Llama 3.1 405b | 0.73s |
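TTFT and throughput combine into a rough end-to-end estimate: total response time is approximately time-to-first-token plus output tokens divided by generation speed. A minimal sketch of that arithmetic, using the Llama 4 Scout figures from the lists above (this is an approximation, not a provider-reported number):

```python
def response_time(ttft_s: float, speed_tps: float, output_tokens: int) -> float:
    """Approximate end-to-end time: time to first token plus generation time."""
    return ttft_s + output_tokens / speed_tps

# Llama 4 Scout: 0.33s TTFT, 2600 t/s (figures from the lists above)
print(round(response_time(0.33, 2600, 1000), 2))  # ≈ 0.71 seconds for 1,000 output tokens
```

Note how throughput dominates for long outputs while TTFT dominates for short ones, which is why the two rankings above differ.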
Cheapest Models (per 1M tokens)
| Rank | Model | Input / Output cost |
|---|---|---|
| 1 | Gemma 3 27b | $0.07 / $0.07 |
| 2 | GPT oss 20b | $0.08 / $0.35 |
| 3 | Llama 4 Scout | $0.11 / $0.34 |
| 4 | GPT oss 120b | $0.15 / $0.60 |
| 5 | Llama 4 Maverick | $0.20 / $0.60 |
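Prices above are quoted as input / output cost per 1M tokens, so estimating a workload's cost is simple multiplication. A minimal sketch, using the GPT oss 20b prices from the list above (the token volumes are illustrative, not from the source):

```python
def workload_cost(input_price: float, output_price: float,
                  input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars, with prices quoted per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# GPT oss 20b at $0.08 / $0.35 per 1M tokens: 2M input tokens, 500k output tokens
print(round(workload_cost(0.08, 0.35, 2_000_000, 500_000), 3))  # $0.335
```

Because output tokens are usually priced several times higher than input tokens, output-heavy workloads can rank models differently than the blended list above suggests.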
Compare open source models
| | Kimi K2.5 | Llama 3.1 405b |
|---|---|---|
| Context size | 256,000 | 128,000 |
| Cutoff date | Apr 2024 | Dec 2023 |
| I/O cost | $0.60 / $2.50 | $3.50 / $3.50 |
| Max output | 33,000 | 4,096 |
| Latency | 1.2s | 0.73s |
| Speed | 45 t/s | 969 t/s |
Model Comparison
| Model | Context size | Cutoff date | I/O cost | Max output | Latency | Speed |
|---|---|---|---|---|---|---|
| | 128,000 | Dec 2024 | $0.55 / $2.19 | 8,000 | 4s | 24 t/s |
| Kimi K2 Thinking | 256,000 | Apr 2025 | $0.60 / $2.50 | 16,400 | 25.3s | 79 t/s |
| GPT oss 120b | 131,072 | Apr 2025 | $0.15 / $0.60 | 131,072 | 8.1s | 260 t/s |
| GPT oss 20b | 131,072 | Apr 2025 | $0.08 / $0.35 | 131,072 | 4s | 564 t/s |
| | 128,000 | Dec 2024 | $0.27 / $1.10 | 8,000 | 4s | 33 t/s |
| Qwen2.5-VL-32B | 131,000 | Dec 2024 | - | 8,000 | - | - |
| Gemma 3 27b | 128,000 | Nov 2024 | $0.07 / $0.07 | 8,192 | 0.72s | 59 t/s |
| Llama 4 Maverick | 10,000,000 | Nov 2024 | $0.20 / $0.60 | 8,000 | 0.45s | 126 t/s |
| Llama 4 Scout | 10,000,000 | Nov 2024 | $0.11 / $0.34 | 8,000 | 0.33s | 2600 t/s |
Context window, cost and speed comparison
| Models | Context Window | Input Cost / 1M tokens | Output Cost / 1M tokens | Speed (tokens/second) | Latency |
|---|---|---|---|---|---|
| | 128,000 | $0.27 | $1.10 | 33 t/s | 4 seconds |
| Qwen2.5-VL-32B | 131,000 | n/a | n/a | n/a | n/a |
| | 128,000 | $0.55 | $2.19 | 24 t/s | 4 seconds |
| Gemma 3 27b | 128,000 | $0.07 | $0.07 | 59 t/s | 0.72 seconds |
| Llama 4 Maverick | 10,000,000 | $0.20 | $0.60 | 126 t/s | 0.45 seconds |
| Llama 4 Scout | 10,000,000 | $0.11 | $0.34 | 2600 t/s | 0.33 seconds |
| GPT oss 120b | 131,072 | $0.15 | $0.60 | 260 t/s | 8.1 seconds |
| GPT oss 20b | 131,072 | $0.08 | $0.35 | 564 t/s | 4 seconds |
| Kimi K2 Thinking | 256,000 | $0.60 | $2.50 | 79 t/s | 25.3 seconds |