
Claude 3.7 Sonnet vs OpenAI o1 vs DeepSeek R1

Learn how Anthropic's latest model compares to the top-tier reasoning models on the market.

8 min
Written by Anita Kirkovska

Anthropic just dropped Claude 3.7 Sonnet, and it’s a textbook case of second-mover advantage. With OpenAI’s o1 and DeepSeek’s R1 already setting the stage for reasoning models, Anthropic had time to analyze what worked and what didn’t—and it shows.

What’s most interesting is their shift in focus.

Instead of chasing standard benchmarks, they’ve trained this model for real business use cases. They’re doubling down on coding and developer tools—an area where they’ve had an edge from the start.

The response?

Developers are already building Pokémon-playing agents and recreating games. All happy vibes indeed.

Another standout feature is the ability to dynamically switch between standard and extended reasoning. The API lets you control how many tokens the model spends on "thinking time," giving you full flexibility. A very smart move by Anthropic.

In this article we'll compare Claude 3.7 Sonnet against the latest reasoning models (OpenAI o1, o3-mini, and DeepSeek R1) on price, use cases, and performance.

Results

In this analysis, we look at standard benchmarks and human-expert reviews, and conduct a set of our own small-scale experiments.

Here are our findings:

  • Pricing: Claude 3.7 Sonnet sits in the middle—cheaper than OpenAI’s o1 model but pricier than DeepSeek R1 and OpenAI’s o3-mini. However, its ability to adjust token usage on the fly adds significant value, making it the most flexible choice.
  • Latency: It’s hard to pin down the exact latency of Claude 3.7 Sonnet with extended thinking, but being able to set token limits and control response time for a task is a solid advantage. This dual-mode approach means developers no longer need separate fast vs. smart models; you get configurable latency, which no other model currently offers. The closest comparison is OpenAI’s o3-mini, which has pre-built low, medium, and high reasoning modes but no direct control over "thinking" token spend.
  • Standard Benchmarks: Claude 3.7 Sonnet is strong in reasoning (GPQA: 78.2% / 84.8%), multilingual Q&A (MMLU: 86.1%), and coding (SWE-bench: 62.3% / 70.3%), making it a solid choice for businesses and developers. Anthropic clearly optimized for real business use cases rather than, say, competition math, which is still not a frequent use case for production-grade AI solutions.
  • Math reasoning: Our small evaluations backed Anthropic’s claim that Claude 3.7 Sonnet struggles with math reasoning. Surprisingly, OpenAI’s o1 didn’t perform much better. Even o3-mini, which should’ve done better, only got 27/50 correct answers, just behind DeepSeek R1’s 29/50. None of them are reliable for real math problems.
  • Puzzle Solving: Claude 3.7 Sonnet led with 21/28 correct answers, followed by DeepSeek R1 with 18/28, while OpenAI’s models struggled. It looks like OpenAI’s models and Gemini 2.0 Flash Thinking are still overfitting to their training data, while Anthropic and DeepSeek might be figuring out how to make models that actually think.

Methodology

In the next sections, we will cover three analyses:

  • Latency & Cost comparison
  • Standard benchmark comparison (example: what is the reported performance for math tasks between Claude 3.7 Sonnet vs OpenAI o1?)
  • Independent evaluation experiments (math equations and puzzles)

Evaluations with Vellum

To conduct these evaluations, we used Vellum’s AI development platform, where we:

  • Configured all 0-shot prompt variations for all models using the LLM Playground.
  • Built the evaluation dataset & configured our evaluation experiment using the Evaluation Suite in Vellum. We used an LLM-as-a-judge to compare generated answers against the correct responses from our benchmark dataset for the math/reasoning problems (a minimal sketch of this grading step is shown below).

We then compiled and presented the findings using the Evaluation Reports generated at the end of each evaluation run.
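To make the grading step concrete, here is a minimal sketch of the kind of LLM-as-a-judge check we describe above. Treat it as illustrative only: the judge prompt, the judge model name, and the grade_answer helper are assumptions for this example, not Vellum's internal implementation.

```python
# Minimal LLM-as-a-judge sketch (illustrative; not Vellum's internal implementation).
# Assumes the OpenAI Python SDK (>= 1.0) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a math answer.
Question: {question}
Reference answer: {reference}
Model answer: {candidate}
Reply with exactly one word: CORRECT or INCORRECT."""

def grade_answer(question: str, reference: str, candidate: str) -> bool:
    """Ask a judge model whether the candidate answer matches the reference."""
    response = client.chat.completions.create(
        model="gpt-4o",  # judge model; an assumption for this sketch
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("CORRECT")

# Example: grade one (made-up) row of a benchmark dataset.
# grade_answer("What is 2 + 2?", "4", "4")  # -> True
```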

You can skip to the section that interests you most using the "Table of Contents" panel on the left or scroll down to explore the full comparison between OpenAI o1, o3-mini, Claude 3.7 Sonnet, and DeepSeek R1.

Pricing

Claude 3.7 Sonnet keeps the same pricing as earlier models—$3/M input tokens, $15/M output tokens ($0.003 and $0.015 per 1K). This applies to both standard and extended thinking modes, with thinking tokens counted as output. No extra surcharge for reasoning.

Compared to competitors, Claude 3.7 is much cheaper than OpenAI’s o1 ($15/M in, $60/M out) but more expensive than o3-mini, which costs $1.10/M in, $4.40/M out. Meanwhile, DeepSeek R1 undercuts them all at $0.14/M in, $0.55/M out, though with trade-offs. Claude 3.7 sits in the middle—cheaper than top-tier closed models, but pricier than open alternatives.
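To put those per-token prices in perspective, here is a quick back-of-the-envelope cost sketch using the figures quoted above; remember that for Claude 3.7 Sonnet, thinking tokens are billed as output tokens. The example workload numbers are made up for illustration.

```python
# Rough cost comparison using the per-million-token prices quoted above.
# For Claude 3.7 Sonnet, thinking tokens count as output tokens.
PRICES = {  # (input $/M tokens, output $/M tokens)
    "claude-3.7-sonnet": (3.00, 15.00),
    "openai-o1": (15.00, 60.00),
    "openai-o3-mini": (1.10, 4.40),
    "deepseek-r1": (0.14, 0.55),
}

def request_cost(model: str, input_tokens: int, output_tokens: int,
                 thinking_tokens: int = 0) -> float:
    """Estimate the cost of one request in dollars."""
    in_price, out_price = PRICES[model]
    billable_output = output_tokens + thinking_tokens  # thinking billed as output
    return (input_tokens * in_price + billable_output * out_price) / 1_000_000

# Hypothetical workload: 2K input tokens, 1K visible output, 8K thinking tokens.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 2_000, 1_000, 8_000):.4f}")
```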

For anyone looking to test Claude 3.7 Sonnet: the token budget control is the key feature to master. Being able to specify exactly how much "thinking" happens (50-128K tokens) creates entirely new optimization opportunities.👇🏻
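As a concrete illustration, here is roughly what that budget control looks like with Anthropic's Python SDK. This is a sketch under our reading of the API: the model identifier and budget value are assumptions for the example, so check Anthropic's current API reference for exact parameter names and limits.

```python
# Sketch: calling Claude 3.7 Sonnet with an explicit extended-thinking budget.
# Assumes the anthropic Python SDK and an ANTHROPIC_API_KEY in the environment.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",  # model id; verify against current docs
    max_tokens=16_000,                   # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 8_000},  # cap on "thinking" tokens
    messages=[{"role": "user", "content": "Plan a migration from REST to gRPC."}],
)

# Thinking and the final answer come back as separate content blocks.
for block in response.content:
    if block.type == "thinking":
        print("[thinking]", block.thinking[:200], "...")
    elif block.type == "text":
        print("[answer]", block.text)
```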

Latency

Claude 3.7 introduces a hybrid reasoning architecture that can trade off latency for better answers on demand. In standard mode it’s extremely fast: Anthropic cites roughly 200 ms latency for quick responses (presumably time to first token or for short answers). The average latency, according to an independently run evaluation, sits at 1.16s.

In extended thinking mode, the model can reportedly take up to 15 seconds for deeper reasoning, during which it internally “thinks” through complex tasks. It’s hard to pin down the exact latency with extended thinking, but being able to set token limits and control response time for a task is a solid advantage.

This dual-mode approach means developers no longer need separate fast vs. smart models. You get configurable latency, which no other model currently offers. The closest comparison is OpenAI’s o3-mini, which has pre-built low, medium, and high reasoning modes, but no direct control over "thinking" token spend.
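For comparison, here is a sketch of the coarser control OpenAI exposes on o3-mini through the reasoning_effort parameter: you pick a preset level rather than a token budget. The prompt is a placeholder; verify the parameter details against OpenAI's current documentation.

```python
# Sketch: o3-mini's preset reasoning levels (no direct thinking-token budget).
# Assumes the OpenAI Python SDK (>= 1.0) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",  # one of "low", "medium", "high"
    messages=[{"role": "user", "content": "Plan a migration from REST to gRPC."}],
)
print(response.choices[0].message.content)
```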

More tokens for thinking add latency, but generally lead to better performance on harder tasks. As shown in the AIME 2024 performance graph below, accuracy improves as more tokens are allocated, following a logarithmic trend.

Benchmarks

Claude 3.7 Sonnet is a well-rounded model, excelling in graduate-level reasoning (GPQA Diamond: 78.2% / 84.8%), multilingual Q&A (MMLU: 86.1%), and instruction following (IFEval: 93.2%), making it a strong choice for business and developer use cases.

Its agentic coding (SWE-bench: 62.3% / 70.3%) and tool use (TAU-bench: 81.2%) reinforce its practical strengths.

While it lags in high school math competition scores (AIME: 61.3% / 80.0%), it prioritizes real-world performance over leaderboard optimization—staying true to Anthropic’s focus on usable AI.

It’s also interesting that Claude 3.7 Sonnet shows strong results on all of these benchmarks even without extended thinking.

Independent Evals

Task 1: Math

For this task, we compared the models on how well they solve some of the hardest SAT math questions. This is the 0-shot prompt that we used for all models:

You are a helpful assistant who is the best at solving math equations. You must output only the answer, without explanations. Here’s the <question>
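For reference, here is roughly how a question would be wrapped in that 0-shot prompt before being sent to each model. The helper and the sample question are placeholders for illustration, not items from the actual benchmark set.

```python
# Sketch: building the 0-shot math prompt used in this eval.
SYSTEM_PROMPT = (
    "You are a helpful assistant who is the best at solving math equations. "
    "You must output only the answer, without explanations."
)

def build_messages(question: str) -> list[dict]:
    """Wrap a single SAT-style math question in the 0-shot prompt."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"<question>{question}</question>"},
    ]

# Example (placeholder question, not from the benchmark set):
messages = build_messages("If 3x + 7 = 22, what is the value of x?")
```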

We then ran all 50 math questions and here’s what we got:

[Interactive results table]

From this table we can notice that:

  • All models are particularly bad at solving math problems
  • DeepSeek R1 guessed 29/50 answers right (58%), and o3-mini (high) got 27/50 answers right. Those two did best on this eval, but it’s still close to a coin toss; we don’t see meaningful performance on these tasks from any of these models yet.
  • Claude 3.7 Sonnet and OpenAI o1 were the worst, and similarly bad. This backs up Anthropic’s own statement in the announcement that Claude 3.7 Sonnet is not strong at math. However, we expected better performance from OpenAI o1 and o3-mini.

Task 2: Puzzles

We tested OpenAI o1, OpenAI o3-mini, Claude 3.7 Sonnet, DeepSeek R1, and Gemini 2.0 Flash Thinking on 28 well-known puzzles. For this evaluation, we modified portions of the puzzles to make them trivial. We wanted to see whether the models would still overfit to their training data or adapt to the new contexts.

For example, we modified the Monty Hall problem:

Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a gold bar; behind the others, rotten vegetables. You pick a door, say No. 1, and the host asks you, 'Do you want to pick door No. 2 instead?' What choice of door now gives you the biggest advantage?

In the original Monty Hall problem, the host opens an extra door. In this modified version, the host does not, and since no additional information is provided, your odds remain the same.

The correct answer here is: “It is not an advantage to switch. It makes no difference if I switch or not because no additional material information has been provided since the initial choice.”
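A quick simulation makes the "no advantage" point easy to verify. This is a minimal sketch that follows the modified setup above, where the host reveals nothing; the trial count is arbitrary.

```python
# Sketch: simulate the modified Monty Hall problem (host reveals nothing).
import random

def play(switch: bool, trials: int = 100_000) -> float:
    """Return the win rate when always switching (or always staying)."""
    wins = 0
    for _ in range(trials):
        gold = random.randint(1, 3)    # door hiding the gold bar
        pick = 1                       # you pick door No. 1
        final = 2 if switch else pick  # host reveals nothing, just offers door No. 2
        wins += (final == gold)
    return wins / trials

print(f"stay:   {play(switch=False):.3f}")  # ~0.333
print(f"switch: {play(switch=True):.3f}")   # ~0.333 -> no advantage either way
```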

Most models had trouble working with the new context, but Claude 3.7 Sonnet performed noticeably better:

[Interactive results table]

From this evaluation we can clearly see that:

  • Claude 3.7 Sonnet got 21/28 answers right, hitting 75% accuracy. DeepSeek R1 followed with 18/28 correct guesses and 64% accuracy. For the rest of the models, getting the right answer was basically a coin flip.
  • OpenAI’s models and Gemini 2.0 Flash Thinking still seem to overfit, likely optimizing too much for benchmark data. Meanwhile, Anthropic and DeepSeek may have figured out a different approach—improving their models without leaning too heavily on benchmarks and training data.

Evaluate with Vellum

At Vellum, we built our evaluation using our own AI development platform—the same tooling teams use to compare, test, and optimize LLM-powered features.

With the LLM Playground, we configured controlled zero-shot prompts across models. The Evaluation Suite helped us automate grading, ensuring a fair and structured comparison. And with Evaluation Reports, we could quickly surface insights into where each model excelled (or struggled).

If you need to run large-scale LLM experiments — book a demo with one of our experts here.

Conclusion

Claude 3.7 Sonnet proves that Anthropic is playing the long game—prioritizing real-world usability over leaderboard flexing. The model isn’t flawless (math is still a weak spot), but its ability to dynamically adjust reasoning depth and token spend is a genuine step forward.

Our evaluations showed it leading in puzzle-solving and reasoning, while OpenAI’s models still seem to overfit on training data. DeepSeek R1 remains a strong contender, especially given its pricing, but lacks the same flexibility.

For developers and businesses, the takeaway is clear: if you need fine-tuned control over performance and cost, Claude 3.7 Sonnet is one to watch.

ABOUT THE AUTHOR
Anita Kirkovska
Founding Growth Lead

An AI expert with a strong ML background, specializing in GenAI and LLM education. A former Fulbright scholar, she leads Growth and Education at Vellum, helping companies build and scale AI products. She conducts LLM evaluations and writes extensively on AI best practices, empowering business leaders to drive effective AI adoption.

Last updated: Feb 25, 2025