updated 1 Jul 2026

LLM Leaderboard

This LLM leaderboard displays the latest public benchmark performance for SOTA model versions released after April 2024. The data comes from model providers as well as independently run evaluations by Vellum or the open-source community. We feature results from non-saturated benchmarks, excluding outdated benchmarks (e.g. MMLU).

Top models per task

Best in Reasoning (GPQA Diamond)

Model               Score
Claude Sonnet 5     96.2%
Claude 3 Opus       95.4%
Gemini 3.1 Pro      94.3%
Claude Opus 4.7     94.2%
Claude Fable 5      94.1%

Best in Agentic Coding (SWE-Bench)

Model               Score
Claude Mythos 5     95.5%
Claude Fable 5      95%
Claude Opus 4.8     88.6%
Claude Opus 4.7     87.6%
Claude Sonnet 5     85.2%

Best for Work Automations (AutoBench)

Model               Score
Claude Fable 5      17.4%
Claude Opus 4.8     15.5%
Claude Sonnet 5     13.5%
GPT-5.5             12.9%
Claude Sonnet 4.6   5.3%

Best in Computer Use (OSWorld)

Model               Score
Claude Fable 5      85%
Claude Opus 4.8     83.4%
Claude Sonnet 5     81.2%
GPT-5.5             78.7%
Claude Sonnet 4.6   78.5%

Best in Browsing (BrowseComp)

Model               Score
Claude Fable 5      88%
DeepSeek V4 Flash   85.9%
Gemini 3.1 Pro      85.9%
Claude Sonnet 5     84.7%
GPT-5.5             84.4%

Best in Terminal Use (Terminal-Bench 2.1)

Model               Score
Claude Mythos 5     88%
Claude Fable 5      84.3%
GPT-5.5             82.7%
GLM 5.2             81%
Claude Sonnet 5     80.4%

Fastest and most affordable models

Fastest Models (Tokens/sec)

1. Llama 4 Scout     2600 t/s
2. Llama 3.1 405b    969 t/s
3. GLM 5.2           347 t/s
4. Kimi K2.6         342.6 t/s
5. Kimi K2.5         337.7 t/s

Lowest Latency (TTFT)

1. GPT-5.3 Codex     0.003s
2. Nova Micro        0.3s
3. Llama 4 Scout     0.33s
4. Gemini 2.0 Flash  0.34s
5. GPT-4o mini       0.35s

Cheapest Models (input / output cost per 1M tokens)

1. Nova Micro        $0.04 / $0.14
2. Gemini 1.5 Flash  $0.075 / $0.3
3. Gemini 2.0 Flash  $0.1 / $0.4
4. GPT-4.1 nano      $0.1 / $0.4
5. Llama 4 Scout     $0.11 / $0.34

Compare models

Side-by-side comparison of the latest models released in the last 9 months.

Claude Mythos 5 vs Claude Opus 4.8

Attribute       Claude Mythos 5    Claude Opus 4.8
Context size    1,000,000          1,000,000
Cutoff date     Jan 2026           Jan 2026
I/O cost        $10 / $50          $5 / $25
Max output      128,000            128,000
Latency         -                  32.1s
Speed           -                  64.8 t/s

Benchmark                                  Claude Mythos 5    Claude Opus 4.8
Best Overall (HLE)                         64.5               57.9
Best in Terminal Use (Terminal-Bench 2.1)  88                 74.6
Best in Agentic Coding (SWE-Bench)         95.5               88.6
Best in Reasoning (GPQA Diamond)           94.1               93.6

Compare Personal AI harnesses

Compared: Vellum, Hermes, OpenClaw, Claude Cowork

Open source
  Vellum: MIT
  Hermes: MIT
  OpenClaw: Apache 2.0
  Claude Cowork: Proprietary

Time to set up
  Vellum: Easy
  Hermes: Moderate
  OpenClaw: Difficult
  Claude Cowork: Easy

Native channels
  Vellum: iOS, macOS, Web, Voice, Email, Telegram, Slack, CLI
  Hermes: CLI / TUI
  OpenClaw: CLI, macOS, Web
  Claude Cowork: CLI, macOS, Windows, Web

Memory
  Vellum: Managed memory
  Hermes: SQLite + markdown (you build the memory stack)
  OpenClaw: Basic memory, context loss
  Claude Cowork: Limited

Security
  Vellum: Built-in security
  Hermes: DIY
  OpenClaw: DIY
  Claude Cowork: No sandboxing

Hosting
  Vellum: Cloud or self-hosted
  Hermes: Self-hosted only
  OpenClaw: Self-hosted only
  Claude Cowork: Anthropic cloud

Native integrations
  Vellum: Managed OAuth connections
  Hermes: No managed connectors
  OpenClaw: No managed connectors
  Claude Cowork: MCP only

Schedules
  Vellum: Cron + Heartbeat
  Hermes: Cron + Heartbeat
  OpenClaw: Cron + Heartbeat
  Claude Cowork: Cron only

Pricing
  Vellum: Free + API costs, paid plans available
  Hermes: Free + DIY hosting costs + API costs
  OpenClaw: Free + DIY hosting costs + API costs
  Claude Cowork: Paid plans available + API costs

Model Comparison

Model              Context size  Cutoff date  I/O cost         Max output  Latency  Speed
Claude Mythos 5    1,000,000     Jan 2026     $10 / $50        128,000     -        -
Claude Opus 4.8    1,000,000     Jan 2026     $5 / $25         128,000     32.1s    64.8 t/s
Claude Sonnet 5    1,000,000     Jan 2026     $3 / $15         128,000     20.69s   56.3 t/s
GLM 5.2            1,000,000     Mar 2026     $0.95 / $3       128,000     1.14s    347 t/s
Kimi K2.6          256,000       -            $0.95 / $4       -           0.68s    342.6 t/s
DeepSeek V4 Flash  1,000,000     Jan 2026     $0.14 / $0.28    384,000     1.42s    107.9 t/s
DeepSeek V4 Pro    1,000,000     Jan 2026     $0.435 / $0.87   384,000     1.2s     174.9 t/s
Gemini 3.1 Pro     1,000,000     Jan 2026     $2 / $12         65,536      20.34s   136.2 t/s
GPT-5.5 Pro        1,000,000     Apr 2026     $30 / $180       128,000     -        -
GPT-5.5            1,000,000     Apr 2026     $5 / $30         128,000     76.69s   79 t/s
Gemini 3.5 Flash   1,000,000     Jan 2026     $1.5 / $9        65,536      23.16s   175.4 t/s
Claude Opus 4.6    200,000       May 2025     $5 / $25         128,000     1.6s     67 t/s
Claude Opus 4.7    1,000,000     Apr 2026     $5 / $25         128,000     17.11s   50.8 t/s
Claude Fable 5     1,000,000     Jan 2026     $10 / $50        128,000     -        -
MiniMax M3         1,048,576     Mar 2026     $0.6 / $2.4      512,000     0.85s    98.6 t/s

Context window, cost and speed comparison

Model              Context window  Input cost / 1M tokens  Output cost / 1M tokens  Speed (tokens/second)  Latency
Claude Mythos 5    1,000,000       $10                     $50                      n/a                    n/a
Claude Opus 4.8    1,000,000       $5                      $25                      64.8 t/s               32.1 seconds
Claude Sonnet 5    1,000,000       $3                      $15                      56.3 t/s               20.69 seconds
GLM 5.2            1,000,000       $0.95                   $3                       347 t/s                1.14 seconds
Kimi K2.6          256,000         $0.95                   $4                       342.6 t/s              0.68 seconds
DeepSeek V4 Flash  1,000,000       $0.14                   $0.28                    107.9 t/s              1.42 seconds
DeepSeek V4 Pro    1,000,000       $0.435                  $0.87                    174.9 t/s              1.2 seconds
Gemini 3.1 Pro     1,000,000       $2                      $12                      136.2 t/s              20.34 seconds
GPT-5.5 Pro        1,000,000       $30                     $180                     n/a                    n/a
GPT-5.5            1,000,000       $5                      $30                      79 t/s                 76.69 seconds
Gemini 3.5 Flash   1,000,000       $1.5                    $9                       175.4 t/s              23.16 seconds
Claude Opus 4.6    200,000         $5                      $25                      67 t/s                 1.6 seconds
Claude Opus 4.7    1,000,000       $5                      $25                      50.8 t/s               17.11 seconds
Claude Fable 5     1,000,000       $10                     $50                      n/a                    n/a
MiniMax M3         1,048,576       $0.6                    $2.4                     98.6 t/s               0.85 seconds
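
To make the per-1M-token prices and speed figures above easier to apply, here is a minimal Python sketch that estimates the cost and rough response time of a single request. It is an illustration, not an official calculator: the names `ModelPricing`, `request_cost_usd` and `estimated_response_time_s` are hypothetical, pricing is assumed to scale linearly with token count (no caching or batch discounts), latency is assumed to mean time to first token, and speed is assumed to mean output tokens per second after that.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPricing:
    """Hypothetical container for one row of the table above."""
    name: str
    input_usd_per_million: float   # input cost per 1M tokens
    output_usd_per_million: float  # output cost per 1M tokens
    latency_s: float               # assumed: time to first token, in seconds
    speed_tps: float               # assumed: output tokens per second


def request_cost_usd(model: ModelPricing, input_tokens: int, output_tokens: int) -> float:
    """Linear cost estimate: tokens times the per-1M-token price, divided by 1M."""
    total = (input_tokens * model.input_usd_per_million
             + output_tokens * model.output_usd_per_million)
    return total / 1_000_000


def estimated_response_time_s(model: ModelPricing, output_tokens: int) -> float:
    """Rough wall-clock estimate: latency plus generation time at the listed speed."""
    return model.latency_s + output_tokens / model.speed_tps


# Figures copied from the table above.
CLAUDE_OPUS_4_8 = ModelPricing("Claude Opus 4.8", 5.0, 25.0, 32.1, 64.8)
GLM_5_2 = ModelPricing("GLM 5.2", 0.95, 3.0, 1.14, 347.0)

if __name__ == "__main__":
    for model in (CLAUDE_OPUS_4_8, GLM_5_2):
        cost = request_cost_usd(model, input_tokens=10_000, output_tokens=1_000)
        seconds = estimated_response_time_s(model, output_tokens=1_000)
        print(f"{model.name}: ${cost:.4f}, about {seconds:.1f}s")
```

Under these assumptions, a request with 10,000 input tokens and 1,000 output tokens costs $0.075 on Claude Opus 4.8 and $0.0125 on GLM 5.2, with estimated response times of roughly 47.5 and 4.0 seconds respectively.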

Benchmark glossary

Humanity's Last Exam (HLE)
A crowd-sourced exam of extremely hard questions spanning every academic discipline. Designed to be the final exam before superhuman AI.
GPQA Diamond
Graduate-level science questions curated by domain experts. Tests advanced reasoning across physics, chemistry, and biology.
SWE-Bench Verified
Real GitHub issues from popular Python repos that the model must resolve end-to-end. Measures agentic software engineering ability.
AutoBench
Automation benchmark evaluating a model's ability to complete real-world work automation tasks using tools and multi-step workflows.
OSWorld-Verified
Real-world computer use tasks requiring GUI interaction in desktop environments. Measures end-to-end task completion on a real OS.
BrowseComp
Agentic web search benchmark testing a model's ability to browse and extract information from the web to answer complex questions.
Terminal-Bench 2.1
Terminal and tool use benchmark evaluating a model's ability to execute multi-step tasks in a terminal environment.
