updated 1 Jul 2026

LLM Leaderboard

This LLM leaderboard shows the latest public benchmark results for state-of-the-art model versions released after April 2024. The data comes from model providers and from evaluations run independently by Vellum or the open-source community. We feature only non-saturated benchmarks and exclude outdated ones (e.g. MMLU).

Best Overall (Humanity's Last Exam)

Model	Score
Claude Mythos 5	64.5%
Claude Opus 4.8	57.9%
Claude Sonnet 5	57.4%
GLM 5.2	54.7%
Kimi K2.6	54%
DeepSeek V4 Flash	51.6%
DeepSeek V4 Pro	48.2%
Gemini 3 Pro	45.8%
Kimi K2 Thinking	44.9%
Gemini 3.1 Pro	44.4%
GPT-5.5 Pro	43.1%
GPT-5.5	41.4%
Gemini 3.5 Flash	40.2%
Claude Opus 4.6	40%
GPT-5	35.2%
Kimi K2.5	30.1%
Grok 4	25.4%
Gemini 2.5 Pro	21.6%
OpenAI o3	20.3%
Claude Sonnet 4.6	19.1%

Top models per task

Best in Reasoning (GPQA Diamond)

Model	Score
Claude Sonnet 5	96.2%
Claude 3 Opus	95.4%
Gemini 3.1 Pro	94.3%
Claude Opus 4.7	94.2%
Claude Fable 5	94.1%

Best in Agentic Coding (SWE-Bench)

Model	Score
Claude Mythos 5	95.5%
Claude Fable 5	95%
Claude Opus 4.8	88.6%
Claude Opus 4.7	87.6%
Claude Sonnet 5	85.2%

Best for Work Automations (AutoBench)

Model	Score
Claude Fable 5	17.4%
Claude Opus 4.8	15.5%
Claude Sonnet 5	13.5%
GPT-5.5	12.9%
Claude Sonnet 4.6	5.3%

Best in Computer Use (OSWorld)

Model	Score
Claude Fable 5	85%
Claude Opus 4.8	83.4%
Claude Sonnet 5	81.2%
GPT-5.5	78.7%
Claude Sonnet 4.6	78.5%

Best in Browsing (BrowseComp)

Model	Score
Claude Fable 5	88%
DeepSeek V4 Flash	85.9%
Gemini 3.1 Pro	85.9%
Claude Sonnet 5	84.7%
GPT-5.5	84.4%

Best in Terminal Use (Terminal-Bench 2.1)

Model	Score
Claude Mythos 5	88%
Claude Fable 5	84.3%
GPT-5.5	82.7%
GLM 5.2	81%
Claude Sonnet 5	80.4%

Fastest and most affordable models

Fastest Models (Tokens/sec)

Model	Speed
Llama 4 Scout	2600 t/s
Llama 3.1 405b	969 t/s
GLM 5.2	347 t/s
Kimi K2.6	342.6 t/s
Kimi K2.5	337.7 t/s

Lowest Latency (TTFT)

Model	Time to first token
GPT-5.3 Codex	0.003s
Nova Micro	0.3s
Llama 4 Scout	0.33s
Gemini 2.0 Flash	0.34s
GPT-4o mini	0.35s

Cheapest Models (per 1M tokens)

Model	I/O cost
Nova Micro	$0.04 / $0.14
Gemini 1.5 Flash	$0.075 / $0.3
Gemini 2.0 Flash	$0.1 / $0.4
GPT-4.1 nano	$0.1 / $0.4
Llama 4 Scout	$0.11 / $0.34
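
As a rough illustration of how the per-1M-token prices above translate into a bill, here is a minimal Python sketch. It assumes the two figures in each row are the input and output prices in USD per 1M tokens (as in the I/O cost column further down the page). The function name `request_cost` and the workload of 2M input and 500K output tokens are hypothetical, chosen only for the example.

```python
def request_cost(input_tokens, output_tokens, input_price, output_price):
    """Return the cost in USD, given prices in USD per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# (input price, output price) per 1M tokens, copied from the list above.
PRICES = {
    "Nova Micro": (0.04, 0.14),
    "Llama 4 Scout": (0.11, 0.34),
}

# Hypothetical workload: 2M input tokens and 500K output tokens.
INPUT_TOKENS = 2_000_000
OUTPUT_TOKENS = 500_000

costs = {
    name: request_cost(INPUT_TOKENS, OUTPUT_TOKENS, in_price, out_price)
    for name, (in_price, out_price) in PRICES.items()
}

for name, cost in costs.items():
    print(f"{name}: ${cost:.2f}")
```

Because output tokens are priced several times higher than input tokens in every row, the ranking can shift for output-heavy workloads, so it is worth plugging in your own token mix.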

Compare models

Side-by-side comparison of the latest models released in the last 9 months.

	Claude Mythos 5	Claude Opus 4.8
Context size	1,000,000	1,000,000
Cutoff date	Jan 2026	Jan 2026
I/O cost	$10 / $50	$5 / $25
Max output	128,000	128,000
Latency	-	32.1s
Speed	-	64.8 t/s

Benchmark	Claude Mythos 5	Claude Opus 4.8
Best Overall (HLE)	64.5%	57.9%
Best in Terminal Use (Terminal-Bench 2.1)	-	74.6%
Best in Agentic Coding (SWE-Bench)	95.5%	88.6%
Best in Reasoning (GPQA Diamond)	94.1%	93.6%

Compare Personal AI harnesses

	Vellum	Hermes	OpenClaw	Claude Cowork
Open source	MIT	MIT	Apache 2.0	Proprietary
Time to set up	Easy	Moderate	Difficult	Easy
Native channels	iOS, MacOS, Web, Voice, Email, Telegram, Slack, CLI	CLI / TUI	CLI, MacOS, Web	CLI, MacOS, Windows, Web
Memory	Managed memory	SQLite + markdown (you build the memory stack)	Basic memory, context loss	Limited
Security	Built-in security	DIY	DIY	No sandboxing
Hosting	Cloud or self-hosted	Self-hosted only	Self-hosted only	Anthropic cloud
Native integrations	Managed OAuth connections	No managed connectors	No managed connectors	MCP only
Schedules	Cron + Heartbeat	Cron + Heartbeat	Cron + Heartbeat	Cron only
Pricing	Free + API costs, Paid plans available	Free + DIY Hosting Costs + API costs	Free + DIY Hosting Costs + API costs	Paid plans available + API costs

Model Comparison

Model	Context size	Cutoff date	I/O cost	Max output	Latency	Speed
Claude Mythos 5	1,000,000	Jan 2026	$10 / $50	128,000	-	-
Claude Opus 4.8	1,000,000	Jan 2026	$5 / $25	128,000	32.1s	64.8 t/s
Claude Sonnet 5	1,000,000	Jan 2026	$3 / $15	128,000	20.69s	56.3 t/s
GLM 5.2	1,000,000	Mar 2026	$0.95 / $3	128,000	1.14s	347 t/s
Kimi K2.6	256,000	-	$0.95 / $4	-	0.68s	342.6 t/s
DeepSeek V4 Flash	1,000,000	Jan 2026	$0.14 / $0.28	384,000	1.42s	107.9 t/s
DeepSeek V4 Pro	1,000,000	Jan 2026	$0.435 / $0.87	384,000	1.2s	174.9 t/s
Gemini 3.1 Pro	1,000,000	Jan 2026	$2 / $12	65,536	20.34s	136.2 t/s
GPT-5.5 Pro	1,000,000	Apr 2026	$30 / $180	128,000	-	-
GPT-5.5	1,000,000	Apr 2026	$5 / $30	128,000	76.69s	79 t/s
Gemini 3.5 Flash	1,000,000	Jan 2026	$1.5 / $9	65,536	23.16s	175.4 t/s
Claude Opus 4.6	200,000	May 2025	$5 / $25	128,000	1.6s	67 t/s
Claude Opus 4.7	1,000,000	Apr 2026	$5 / $25	128,000	17.11s	50.8 t/s
Claude Fable 5	1,000,000	Jan 2026	$10 / $50	128,000	-	-
MiniMax M3	1,048,576	Mar 2026	$0.6 / $2.4	512,000	0.85s	98.6 t/s

Context window, cost and speed comparison

Models	Context Window	Input Cost / 1M tokens	Output Cost / 1M tokens	Speed (tokens/second)	Latency
Claude Mythos 5	1,000,000	$10	$50	n/a	n/a
Claude Opus 4.8	1,000,000	$5	$25	64.8 t/s	32.1 seconds
Claude Sonnet 5	1,000,000	$3	$15	56.3 t/s	20.69 seconds
GLM 5.2	1,000,000	$0.95	$3	347 t/s	1.14 seconds
Kimi K2.6	256,000	$0.95	$4	342.6 t/s	0.68 seconds
DeepSeek V4 Flash	1,000,000	$0.14	$0.28	107.9 t/s	1.42 seconds
DeepSeek V4 Pro	1,000,000	$0.435	$0.87	174.9 t/s	1.2 seconds
Gemini 3.1 Pro	1,000,000	$2	$12	136.2 t/s	20.34 seconds
GPT-5.5 Pro	1,000,000	$30	$180	n/a	n/a
GPT-5.5	1,000,000	$5	$30	79 t/s	76.69 seconds
Gemini 3.5 Flash	1,000,000	$1.5	$9	175.4 t/s	23.16 seconds
Claude Opus 4.6	200,000	$5	$25	67 t/s	1.6 seconds
Claude Opus 4.7	1,000,000	$5	$25	50.8 t/s	17.11 seconds
Claude Fable 5	1,000,000	$10	$50	n/a	n/a
MiniMax M3	1,048,576	$0.6	$2.4	98.6 t/s	0.85 seconds
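
The latency and speed columns can be combined into a back-of-the-envelope estimate of how long a full response takes. The Python sketch below assumes that latency is the time to first token and that speed is a steady output rate afterwards. Both are simplifications, and the function name `estimated_response_seconds` and the 1,000-token response length are hypothetical.

```python
def estimated_response_seconds(latency_s, speed_tps, output_tokens):
    """Estimate total time as latency plus generation time at a constant speed."""
    return latency_s + output_tokens / speed_tps


# (latency in seconds, speed in tokens/second), copied from the table above.
MODELS = {
    "GLM 5.2": (1.14, 347.0),
    "Claude Opus 4.8": (32.1, 64.8),
}

# Hypothetical response length.
OUTPUT_TOKENS = 1_000

estimates = {
    name: estimated_response_seconds(latency, speed, OUTPUT_TOKENS)
    for name, (latency, speed) in MODELS.items()
}

for name, seconds in estimates.items():
    print(f"{name}: about {seconds:.1f}s for {OUTPUT_TOKENS} output tokens")
```

For short responses the latency term dominates, while for long generations the tokens-per-second figure matters more, so the two columns are best read together rather than separately.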

Benchmark glossary

Humanity's Last Exam: A crowd-sourced exam of extremely hard questions spanning every academic discipline. Designed to be the final exam before superhuman AI.
GPQA Diamond: Graduate-level science questions curated by domain experts. Tests advanced reasoning across physics, chemistry, and biology.
SWE-Bench Verified: Real GitHub issues from popular Python repos that the model must resolve end-to-end. Measures agentic software engineering ability.
AutoBench: Automation benchmark evaluating a model's ability to complete real-world work automation tasks using tools and multi-step workflows.
OSWorld-Verified: Real-world computer use tasks requiring GUI interaction in desktop environments. Measures end-to-end task completion on a real OS.
BrowseComp: Agentic web search benchmark testing a model's ability to browse and extract information from the web to answer complex questions.
Terminal-Bench 2.1: Terminal and tool use benchmark evaluating a model's ability to execute multi-step tasks in a terminal environment.
