Updated 23 Mar 2026

Best LLM for Coding

This coding LLM leaderboard compares the latest models on engineering-focused benchmarks such as SWE-Bench, LiveCodeBench, Aider Polyglot, and BFCL tool use. The data comes from model providers and from independent evaluations run by Vellum or the open-source community.

Top models for coding

Best in LiveCodeBench

| Model | Score |
| --- | --- |
| DeepSeek V4 Pro | 93.5% |
| DeepSeek V4 Flash | 91.6% |
| Kimi K2 Thinking | 83.1% |
| Gemini 3 Pro | 79.7% |
| Grok 3 [Beta] | 79.4% |

Best in Agentic Coding (SWE Bench)

| Model | Score |
| --- | --- |
| Claude Mythos 5 | 95.5% |
| Claude Fable 5 | 95% |
| Claude Opus 4.8 | 88.6% |
| Claude Opus 4.7 | 87.6% |
| Claude Sonnet 5 | 85.2% |

Best in Tool Use (BFCL)

| Model | Score |
| --- | --- |
| GPT-4.5 | 69.9% |
| OpenAI o3-mini | 65.1% |
| Qwen2.5-VL-32B | 62.8% |
| Gemma 3 27b | 59.1% |
| DeepSeek V3 0324 | 58.5% |

Model Comparison

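| Model | LiveCodeBench | SWE Bench | MATH 500 | BFCL | Aider Polyglot |
| --- | --- | --- | --- | --- | --- |
| DeepSeek V4 Flash | 91.6% | 79% | n/a | n/a | n/a |
| DeepSeek V4 Pro | 93.5% | 80.6% | n/a | n/a | n/a |
| Gemini 3.1 Pro | n/a | 80.6% | n/a | n/a | n/a |
| Gemini 3.5 Flash | n/a | n/a | n/a | n/a | n/a |
| GLM 5.2 | n/a | n/a | n/a | n/a | n/a |
| Claude Sonnet 5 | n/a | 85.2% | n/a | n/a | n/a |
| MiniMax M3 | n/a | 80.5% | n/a | n/a | n/a |
| Claude Mythos 5 | n/a | 95.5% | n/a | n/a | n/a |
| Claude Fable 5 | n/a | 95% | n/a | n/a | n/a |
| Claude Opus 4.8 | n/a | 88.6% | n/a | n/a | n/a |
| GPT-5.5 | n/a | 58.6% | n/a | n/a | n/a |
| GPT-5.5 Pro | n/a | n/a | n/a | n/a | n/a |
| Claude Opus 4.7 | n/a | 87.6% | n/a | n/a | n/a |
| Claude Opus 4.6 | 76% | 80.8% | 97.6% | n/a | n/a |
| Claude Sonnet 4.6 | 72.4% | 79.6% | 97.8% | n/a | n/a |
| GPT-5.3 Codex | n/a | n/a | n/a | n/a | n/a |
| DeepSeek V3 0324 | 41% | 38.8% | 94% | 58.5% | n/a |
| Qwen2.5-VL-32B | n/a | 18.8% | 82.2% | 62.8% | n/a |
| OpenAI o1-mini | n/a | n/a | 90% | 52.2% | n/a |
| OpenAI o3-mini | 74.1% | 61% | 97.9% | 65.1% | n/a |
| DeepSeek-R1 | 64.3% | 49.2% | 97.3% | 57.5% | n/a |
| Claude 3.7 Sonnet [R] | n/a | 70.3% | 96.2% | 58.3% | n/a |
| GPT-4.5 | n/a | 38% | n/a | 69.9% | n/a |
| Claude 3.7 Sonnet | n/a | 62.3% | 82.2% | 58.3% | n/a |
| Gemini 2.5 Pro | 69% | 59.6% | n/a | n/a | n/a |
| Grok 3 [Beta] | 79.4% | n/a | n/a | n/a | n/a |
| Gemma 3 27b | n/a | 10.2% | 89% | 59.1% | n/a |
| Llama 4 Maverick | 41% | n/a | n/a | n/a | n/a |
| Llama 4 Scout | 32.8% | n/a | n/a | n/a | n/a |
| Llama 4 Behemoth | 49.4% | n/a | 95% | n/a | n/a |
| GPT-4.1 | 52% | 55% | n/a | n/a | n/a |
| GPT-4.1 mini | n/a | 23.6% | n/a | n/a | n/a |
| GPT-4.1 nano | n/a | n/a | n/a | n/a | n/a |
| Claude 4 Sonnet | n/a | 72.7% | n/a | n/a | n/a |
| Claude 4 Opus | n/a | 72.5% | n/a | n/a | n/a |
| GPT oss 120b | 69% | n/a | n/a | n/a | n/a |
| GPT oss 20b | 69% | n/a | n/a | n/a | n/a |
| Claude Opus 4.1 | n/a | 74.5% | n/a | n/a | n/a |
| GPT-5 | n/a | 74.9% | n/a | n/a | n/a |
| GPT 5.1 | n/a | 76.3% | n/a | n/a | n/a |
| Kimi K2 Thinking | 83.1% | 71.3% | n/a | n/a | n/a |
| Gemini 3 Pro | 79.7% | 76.2% | n/a | n/a | n/a |
| Claude Sonnet 4.5 | n/a | 82% | n/a | n/a | n/a |
| Claude Opus 4.5 | n/a | 80.9% | n/a | n/a | n/a |
| GPT 5.2 | n/a | 80% | n/a | n/a | n/a |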

Context window, cost and speed comparison

| Model | Context Window | Input Cost / 1M tokens | Output Cost / 1M tokens | Speed (tokens/second) | Latency |
| --- | --- | --- | --- | --- | --- |
| Claude Mythos 5 | 1,000,000 | $10 | $50 | n/a | n/a |
| Claude Opus 4.8 | 1,000,000 | $5 | $25 | 64.8 t/s | 32.1 seconds |
| Claude Sonnet 5 | 1,000,000 | $3 | $15 | 56.3 t/s | 20.69 seconds |
| GLM 5.2 | 1,000,000 | $0.95 | $3 | 347 t/s | 1.14 seconds |
| DeepSeek V4 Flash | 1,000,000 | $0.14 | $0.28 | 107.9 t/s | 1.42 seconds |
| DeepSeek V4 Pro | 1,000,000 | $0.435 | $0.87 | 174.9 t/s | 1.2 seconds |
| Gemini 3.1 Pro | 1,000,000 | $2 | $12 | 136.2 t/s | 20.34 seconds |
| GPT-5.5 Pro | 1,000,000 | $30 | $180 | n/a | n/a |
| GPT-5.5 | 1,000,000 | $5 | $30 | 79 t/s | 76.69 seconds |
| Gemini 3.5 Flash | 1,000,000 | $1.5 | $9 | 175.4 t/s | 23.16 seconds |
| Claude Opus 4.6 | 200,000 | $5 | $25 | 67 t/s | 1.6 seconds |
| Claude Sonnet 4.6 | 200,000 | $3 | $15 | 55 t/s | 0.73 seconds |
| Claude Opus 4.7 | 1,000,000 | $5 | $25 | 50.8 t/s | 17.11 seconds |
| Claude Fable 5 | 1,000,000 | $10 | $50 | n/a | n/a |
| MiniMax M3 | 1,048,576 | $0.6 | $2.4 | 98.6 t/s | 0.85 seconds |
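
To make the per-million-token prices concrete, here is a minimal Python sketch that estimates the cost of a single request at the listed rates. The `Pricing` class, the `request_cost` helper, and the 20,000 / 2,000 token workload are illustrative assumptions and not part of the leaderboard. The prices are read off the table above for three of its models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Per-million-token prices in USD, as listed in the table above."""
    input_per_million: float
    output_per_million: float


# Illustrative subset of the table; extend with other rows as needed.
PRICING = {
    "Claude Mythos 5": Pricing(10.0, 50.0),
    "GPT-5.5 Pro": Pricing(30.0, 180.0),
    "DeepSeek V4 Flash": Pricing(0.14, 0.28),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the listed per-1M-token prices."""
    p = PRICING[model]
    total = (input_tokens * p.input_per_million
             + output_tokens * p.output_per_million)
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 20,000 prompt tokens and 2,000 completion tokens.
    for name in PRICING:
        print(f"{name}: ${request_cost(name, 20_000, 2_000):.4f}")
```

For this assumed workload, the sketch gives about $0.30 for Claude Mythos 5, $0.96 for GPT-5.5 Pro, and well under one cent for DeepSeek V4 Flash. Real bills may differ because of caching discounts, reasoning tokens, or tiered pricing, none of which the table covers.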

Coding benchmark glossary

LiveCodeBench
Continuously updated competitive programming problems sourced after model training cutoffs. Measures genuine code generation on unseen tasks.
Aider Polyglot
Multi-language code editing benchmark using the Aider coding assistant. Tests the ability to correctly modify existing code across languages.
SWE-Bench Verified
Real GitHub issues from popular Python repos that the model must resolve end-to-end. Measures agentic software engineering ability.
BFCL
Berkeley Function Calling Leaderboard testing structured tool and function call accuracy. Evaluates how reliably a model invokes APIs.
GRIND
Adaptive reasoning benchmark requiring iterative problem decomposition. Tests a model's ability to break down and solve multi-step coding challenges.
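
As a rough illustration of what "function call accuracy" in the BFCL entry above can mean, the sketch below scores model outputs by exact match against an expected function name and arguments. It is a deliberately simplified assumption, not BFCL's actual evaluator. The JSON shape (`name`, `arguments`), the `get_weather` function, and both helper names are hypothetical.

```python
import json


def call_matches(model_output: str, expected_name: str, expected_args: dict) -> bool:
    """Return True if the JSON tool call has the expected name and arguments."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return False  # Malformed JSON counts as a failed call.
    if not isinstance(call, dict):
        return False
    return call.get("name") == expected_name and call.get("arguments") == expected_args


def accuracy(cases: list[tuple[str, str, dict]]) -> float:
    """Fraction of (output, expected_name, expected_args) cases that match."""
    if not cases:
        return 0.0
    return sum(call_matches(*case) for case in cases) / len(cases)


if __name__ == "__main__":
    # Hypothetical test cases: one correct call, one with a wrong argument.
    cases = [
        ('{"name": "get_weather", "arguments": {"city": "Paris"}}',
         "get_weather", {"city": "Paris"}),
        ('{"name": "get_weather", "arguments": {"city": "Rome"}}',
         "get_weather", {"city": "Paris"}),
    ]
    print(f"accuracy: {accuracy(cases):.0%}")
```

Exact matching is stricter than most real evaluators, which typically tolerate equivalent argument forms. It is used here only to keep the example short.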
