updated 24 Feb 2026

Best LLM for Coding

This coding LLM leaderboard compares the latest models on engineering-specific benchmarks, including SWE-Bench, LiveCodeBench, Aider Polyglot, BFCL tool use, and more. The data comes from model providers as well as from independent evaluations run by Vellum or the open-source community. If you want to use these models in your agents, try Vellum.

Top models for coding

Best in LiveCodeBench

1. Kimi K2 Thinking: 83.1%
2. Gemini 3 Pro: 79.7%
3. Grok 3 [Beta]: 79.4%
4. Claude Opus 4.6: 76%
5. OpenAI o3-mini: 74.1%

Best in Agentic Coding (SWE-Bench)

1. Claude Sonnet 4.5: 82%
2. Claude Opus 4.5: 80.9%
3. Claude Opus 4.6: 80.8%
4. GPT 5.2: 80%
5. Claude Sonnet 4.6: 79.6%

Best in Tool Use (BFCL)

1. GPT-4.5: 69.9%
2. OpenAI o3-mini: 65.1%
3. Qwen2.5-VL-32B: 62.8%
4. Gemma 3 27b: 59.1%
5. DeepSeek V3 0324: 58.5%

Model Comparison

| Model                 | LiveCodeBench | SWE-Bench | MATH 500 | BFCL  | Aider Polyglot |
|-----------------------|---------------|-----------|----------|-------|----------------|
| Claude Opus 4.6       | 76%           | 80.8%     | 97.6%    | n/a   | n/a            |
| Claude Sonnet 4.6     | 72.4%         | 79.6%     | 97.8%    | n/a   | n/a            |
| GPT-5.3 Codex         | n/a           | n/a       | n/a      | n/a   | n/a            |
| DeepSeek V3 0324      | 41%           | 38.8%     | 94%      | 58.5% | n/a            |
| Qwen2.5-VL-32B        | n/a           | 18.8%     | 82.2%    | 62.8% | n/a            |
| OpenAI o1-mini        | n/a           | n/a       | 90%      | 52.2% | n/a            |
| OpenAI o3-mini        | 74.1%         | 61%       | 97.9%    | 65.1% | n/a            |
| DeepSeek-R1           | 64.3%         | 49.2%     | 97.3%    | 57.5% | n/a            |
| Claude 3.7 Sonnet [R] | n/a           | 70.3%     | 96.2%    | 58.3% | n/a            |
| GPT-4.5               | n/a           | 38%       | n/a      | 69.9% | n/a            |
| Claude 3.7 Sonnet     | n/a           | 62.3%     | 82.2%    | 58.3% | n/a            |
| Gemini 2.5 Pro        | 69%           | 59.6%     | n/a      | n/a   | n/a            |
| Grok 3 [Beta]         | 79.4%         | n/a       | n/a      | n/a   | n/a            |
| Gemma 3 27b           | n/a           | 10.2%     | 89%      | 59.1% | n/a            |
| Llama 4 Maverick      | 41%           | n/a       | n/a      | n/a   | n/a            |
| Llama 4 Scout         | 32.8%         | n/a       | n/a      | n/a   | n/a            |
| Llama 4 Behemoth      | 49.4%         | n/a       | 95%      | n/a   | n/a            |
| GPT-4.1               | 52%           | 55%       | n/a      | n/a   | n/a            |
| GPT-4.1 mini          | n/a           | 23.6%     | n/a      | n/a   | n/a            |
| GPT-4.1 nano          | n/a           | n/a       | n/a      | n/a   | n/a            |
| Claude 4 Sonnet       | n/a           | 72.7%     | n/a      | n/a   | n/a            |
| Claude 4 Opus         | n/a           | 72.5%     | n/a      | n/a   | n/a            |
| GPT oss 120b          | 69%           | n/a       | n/a      | n/a   | n/a            |
| GPT oss 20b           | 69%           | n/a       | n/a      | n/a   | n/a            |
| Claude Opus 4.1       | n/a           | 74.5%     | n/a      | n/a   | n/a            |
| GPT-5                 | n/a           | 74.9%     | n/a      | n/a   | n/a            |
| GPT 5.1               | n/a           | 76.3%     | n/a      | n/a   | n/a            |
| Kimi K2 Thinking      | 83.1%         | 71.3%     | n/a      | n/a   | n/a            |
| Gemini 3 Pro          | 79.7%         | 76.2%     | n/a      | n/a   | n/a            |
| Claude Sonnet 4.5     | n/a           | 82%       | n/a      | n/a   | n/a            |
| Claude Opus 4.5       | n/a           | 80.9%     | n/a      | n/a   | n/a            |
| GPT 5.2               | n/a           | 80%       | n/a      | n/a   | n/a            |
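If you want to work with these scores programmatically rather than scanning the table, a minimal sketch is to load a benchmark column into a dict and sort it. The scores below are copied from the SWE-Bench column above; the dict and variable names are illustrative, not part of any published dataset or API.

```python
# Rank a handful of models from the comparison table by SWE-Bench score.
# Values are copied from the table above; "n/a" entries are simply omitted.
swe_bench = {
    "Claude Sonnet 4.5": 82.0,
    "Claude Opus 4.5": 80.9,
    "Claude Opus 4.6": 80.8,
    "GPT 5.2": 80.0,
    "Claude Sonnet 4.6": 79.6,
    "GPT 5.1": 76.3,
    "Gemini 3 Pro": 76.2,
}

# Sort models from highest to lowest score.
ranked = sorted(swe_bench.items(), key=lambda kv: kv[1], reverse=True)

for name, score in ranked:
    print(f"{name}: {score}%")
```

The same pattern works for any other column: swap in the LiveCodeBench or BFCL numbers and the sort is unchanged.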

Context window, cost and speed comparison

| Model                 | Context Window | Input Cost / 1M Tokens | Output Cost / 1M Tokens | Speed (tokens/sec) | Latency |
|-----------------------|----------------|------------------------|-------------------------|--------------------|---------|
| Claude Opus 4.6       | 200,000        | $5                     | $25                     | 67 t/s             | 1.6 s   |
| Claude Sonnet 4.6     | 200,000        | $3                     | $15                     | 55 t/s             | 0.73 s  |
| GPT-5.3 Codex         | 400,000        | $1.75                  | $14                     | 50 t/s             | 0.003 s |
| DeepSeek V3 0324      | 128,000        | $0.27                  | $1.10                   | 33 t/s             | 4 s     |
| Qwen2.5-VL-32B        | 131,000        | n/a                    | n/a                     | n/a                | n/a     |
| OpenAI o1-mini        | 128,000        | $3                     | $12                     | 220 t/s            | 11.43 s |
| OpenAI o3-mini        | 200,000        | $1.10                  | $4.40                   | 214 t/s            | 14 s    |
| DeepSeek-R1           | 128,000        | $0.55                  | $2.19                   | 24 t/s             | 4 s     |
| Claude 3.7 Sonnet [R] | 200,000        | $3                     | $15                     | 78 t/s             | 0.95 s  |
| GPT-4.5               | 128,000        | $75                    | $150                    | 48 t/s             | 1.25 s  |
| Claude 3.7 Sonnet     | 200,000        | $3                     | $15                     | 78 t/s             | 0.91 s  |
| Gemini 2.5 Pro        | 1,000,000      | $1.25                  | $10                     | 191 t/s            | 30 s    |
| Grok 3 [Beta]         | n/a            | n/a                    | n/a                     | n/a                | n/a     |
| Gemma 3 27b           | 128,000        | $0.07                  | $0.07                   | 59 t/s             | 0.72 s  |
| Llama 4 Maverick      | 10,000,000     | $0.20                  | $0.60                   | 126 t/s            | 0.45 s  |
| Llama 4 Scout         | 10,000,000     | $0.11                  | $0.34                   | 2,600 t/s          | 0.33 s  |
| Llama 4 Behemoth      | n/a            | n/a                    | n/a                     | n/a                | n/a     |
| GPT-4.1               | 1,000,000      | $2                     | $8                      | n/a                | n/a     |
| GPT-4.1 mini          | 1,000,000      | $0.40                  | $1.60                   | n/a                | n/a     |
| GPT-4.1 nano          | 1,000,000      | $0.10                  | $0.40                   | n/a                | n/a     |
| Claude 4 Sonnet       | 200,000        | $3                     | $15                     | n/a                | 1.9 s   |
| Claude 4 Opus         | 200,000        | $15                    | $75                     | n/a                | 1.95 s  |
| GPT oss 120b          | 131,072        | $0.15                  | $0.60                   | 260 t/s            | 8.1 s   |
| GPT oss 20b           | 131,072        | $0.08                  | $0.35                   | 564 t/s            | 4 s     |
| Claude Opus 4.1       | 200,000        | $15                    | $75                     | n/a                | n/a     |
| GPT-5                 | 400,000        | $1.25                  | $10                     | n/a                | n/a     |
| GPT 5.1               | 200,000        | $1.25                  | $10                     | n/a                | n/a     |
| Kimi K2 Thinking      | 256,000        | $0.60                  | $2.50                   | 79 t/s             | 25.3 s  |
| Gemini 3 Pro          | 1,000,000      | $2                     | $12                     | 128 t/s            | 30.3 s  |
| Claude Sonnet 4.5     | 200,000        | $3                     | $15                     | 69 t/s             | 31 s    |
| Claude Opus 4.5       | 200,000        | $5                     | $25                     | n/a                | n/a     |
| GPT 5.2               | 400,000        | $1.50                  | $14                     | 92 t/s             | 0.6 s   |
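Since pricing is quoted per 1M tokens, estimating what a single request costs is simple arithmetic: multiply each token count by the matching rate and divide by one million. The sketch below uses Claude Opus 4.6's rates from the table above; the function and token counts are illustrative, not from any provider SDK.

```python
# Estimate the dollar cost of one request from per-1M-token pricing.
# Rates below are Claude Opus 4.6's from the table ($5 in / $25 out per 1M tokens).
INPUT_COST_PER_M = 5.00    # $ per 1M input tokens
OUTPUT_COST_PER_M = 25.00  # $ per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the cost in dollars for a single request."""
    return (input_tokens * INPUT_COST_PER_M
            + output_tokens * OUTPUT_COST_PER_M) / 1_000_000

# Example: a 20k-token prompt with a 2k-token completion.
cost = request_cost(20_000, 2_000)
print(f"${cost:.3f}")  # $0.150
```

Note that output tokens usually dominate the bill for long completions, since output rates run 3-8x higher than input rates for most models in the table.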