Updated 24 Feb 2026
Best LLM for Coding
This coding LLM leaderboard compares the latest models on engineering-specific benchmarks, including SWE-Bench, LiveCodeBench, Aider Polyglot, BFCL tool use, and more. The data comes from model providers as well as independent evaluations run by Vellum or the open-source community. If you want to use these models in your agents, try Vellum.
Top models for coding
Best in LiveCodeBench (chart): top scores of 85%, 80%, 74%, 69%, 63%
Best in Agentic Coding, SWE-Bench (chart): top scores of 85%, 81%, 76%, 72%, 68%
Best in Tool Use, BFCL (chart): values of 70%, 53%, 35%, 18%, 0%
Model Comparison
| Model | LiveCodeBench | SWE-Bench | MATH-500 | BFCL | Aider Polyglot |
|---|---|---|---|---|---|
| | 76% | 80.8% | 97.6% | n/a | n/a |
| | 72.4% | 79.6% | 97.8% | n/a | n/a |
| GPT-5.3 Codex | n/a | n/a | n/a | n/a | n/a |
| | 41% | 38.8% | 94% | 58.5% | n/a |
| Qwen2.5-VL-32B | n/a | 18.8% | 82.2% | 62.8% | n/a |
| OpenAI o1-mini | n/a | n/a | 90% | 52.2% | n/a |
| OpenAI o3-mini | 74.1% | 61% | 97.9% | 65.1% | n/a |
| | 64.3% | 49.2% | 97.3% | 57.5% | n/a |
| | n/a | 70.3% | 96.2% | 58.3% | n/a |
| GPT-4.5 | n/a | 38% | n/a | 69.9% | n/a |
| | n/a | 62.3% | 82.2% | 58.3% | n/a |
| | 69% | 59.6% | n/a | n/a | n/a |
| | 79.4% | n/a | n/a | n/a | n/a |
| | n/a | 10.2% | 89% | 59.1% | n/a |
| | 41% | n/a | n/a | n/a | n/a |
| | 32.8% | n/a | n/a | n/a | n/a |
| | 49.4% | n/a | 95% | n/a | n/a |
| GPT-4.1 | 52% | 55% | n/a | n/a | n/a |
| GPT-4.1 mini | n/a | 23.6% | n/a | n/a | n/a |
| GPT-4.1 nano | n/a | n/a | n/a | n/a | n/a |
| | n/a | 72.7% | n/a | n/a | n/a |
| | n/a | 72.5% | n/a | n/a | n/a |
| gpt-oss-120b | 69% | n/a | n/a | n/a | n/a |
| gpt-oss-20b | 69% | n/a | n/a | n/a | n/a |
| | n/a | 74.5% | n/a | n/a | n/a |
| GPT-5 | n/a | 74.9% | n/a | n/a | n/a |
| GPT-5.1 | n/a | 76.3% | n/a | n/a | n/a |
| Kimi K2 Thinking | 83.1% | 71.3% | n/a | n/a | n/a |
| | 79.7% | 76.2% | n/a | n/a | n/a |
| | n/a | 82% | n/a | n/a | n/a |
| | n/a | 80.9% | n/a | n/a | n/a |
| GPT-5.2 | n/a | 80% | n/a | n/a | n/a |
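Because most models report only a subset of these benchmarks, ranking them by a single average can be misleading. One simple approach is to average only the benchmarks a model actually reports and note how many that is. A minimal Python sketch using a few rows from the table above; the `scores` dict and the averaging scheme are illustrative assumptions, not part of this leaderboard's methodology:

```python
# Average each model's reported benchmark scores, skipping missing ("n/a") entries.
# Scores (percentages) are taken from the comparison table above.
scores = {
    "OpenAI o3-mini": {"LiveCodeBench": 74.1, "SWE-Bench": 61.0,
                       "MATH-500": 97.9, "BFCL": 65.1},
    "Kimi K2 Thinking": {"LiveCodeBench": 83.1, "SWE-Bench": 71.3},
    "GPT-5.1": {"SWE-Bench": 76.3},
}

def mean_reported(benchmarks: dict) -> float:
    """Mean over only the benchmarks the model actually reports."""
    return sum(benchmarks.values()) / len(benchmarks)

ranked = sorted(scores, key=lambda m: mean_reported(scores[m]), reverse=True)
for model in ranked:
    print(f"{model}: {mean_reported(scores[model]):.1f} avg "
          f"over {len(scores[model])} benchmark(s)")
```

Note the caveat built into this scheme: a model averaged over two easy benchmarks is not directly comparable to one averaged over four, so the benchmark count should always be shown alongside the average.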
Context window, cost and speed comparison
| Model | Context Window | Input Cost / 1M tokens | Output Cost / 1M tokens | Speed (tokens/second) | Latency |
|---|---|---|---|---|---|
| | 200,000 | $5 | $25 | 67 t/s | 1.6 seconds |
| | 200,000 | $3 | $15 | 55 t/s | 0.73 seconds |
| GPT-5.3 Codex | 400,000 | $1.75 | $14 | 50 t/s | 0.003 seconds |
| | 128,000 | $0.27 | $1.10 | 33 t/s | 4 seconds |
| Qwen2.5-VL-32B | 131,000 | n/a | n/a | n/a | n/a |
| OpenAI o1-mini | 128,000 | $3 | $12 | 220 t/s | 11.43 seconds |
| OpenAI o3-mini | 200,000 | $1.10 | $4.40 | 214 t/s | 14 seconds |
| | 128,000 | $0.55 | $2.19 | 24 t/s | 4 seconds |
| | 200,000 | $3 | $15 | 78 t/s | 0.95 seconds |
| GPT-4.5 | 128,000 | $75 | $150 | 48 t/s | 1.25 seconds |
| | 200,000 | $3 | $15 | 78 t/s | 0.91 seconds |
| | 1,000,000 | $1.25 | $10 | 191 t/s | 30 seconds |
| | n/a | n/a | n/a | n/a | n/a |
| | 128,000 | $0.07 | $0.07 | 59 t/s | 0.72 seconds |
| | 10,000,000 | $0.20 | $0.60 | 126 t/s | 0.45 seconds |
| | 10,000,000 | $0.11 | $0.34 | 2,600 t/s | 0.33 seconds |
| | n/a | n/a | n/a | n/a | n/a |
| GPT-4.1 | 1,000,000 | $2 | $8 | n/a | n/a |
| GPT-4.1 mini | 1,000,000 | $0.40 | $1.60 | n/a | n/a |
| GPT-4.1 nano | 1,000,000 | $0.10 | $0.40 | n/a | n/a |
| | 200,000 | $3 | $15 | n/a | 1.9 seconds |
| | 200,000 | $15 | $75 | n/a | 1.95 seconds |
| gpt-oss-120b | 131,072 | $0.15 | $0.60 | 260 t/s | 8.1 seconds |
| gpt-oss-20b | 131,072 | $0.08 | $0.35 | 564 t/s | 4 seconds |
| | 200,000 | $15 | $75 | n/a | n/a |
| GPT-5 | 400,000 | $1.25 | $10 | n/a | n/a |
| GPT-5.1 | 200,000 | $1.25 | $10 | n/a | n/a |
| Kimi K2 Thinking | 256,000 | $0.60 | $2.50 | 79 t/s | 25.3 seconds |
| | 10,000,000 | $2 | $12 | 128 t/s | 30.3 seconds |
| | 200,000 | $3 | $15 | 69 t/s | 31 seconds |
| | 200,000 | $5 | $25 | n/a | n/a |
| GPT-5.2 | 400,000 | $1.50 | $14 | 92 t/s | 0.6 seconds |
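Since the prices above are quoted per 1M tokens, the cost of a single request is the input token count times the input rate plus the output token count times the output rate, each divided by 1,000,000. A quick sketch of that arithmetic; the rates are GPT-4.1's listed $2 / $8, while the request sizes are hypothetical:

```python
# Estimate the dollar cost of one request from per-1M-token rates.
def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float) -> float:
    """Rates are USD per 1M tokens, as quoted in the table above."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# Example: a 20k-token prompt with a 2k-token completion on GPT-4.1 ($2 in, $8 out).
cost = request_cost(20_000, 2_000, 2.0, 8.0)
print(f"${cost:.4f}")  # → $0.0560
```

The same function works for any row in the table; at the extremes, the identical request would cost about 40x more on GPT-4.5 ($75 / $150) than on GPT-4.1.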