GPT-4o Mini v/s Claude 3 Haiku v/s GPT-3.5 Turbo: A Comparison
Jul 19, 2024
Co-authors:
Anita Kirkovska
Founding GenAI Growth
Akash Sharma
Co-founder & CEO

New day, new model! Today, OpenAI released “GPT-4o Mini”, its latest cost-efficient small model. At 128k tokens, its context window is 8x larger than GPT-3.5 Turbo’s. OpenAI suggests developers use GPT-4o Mini wherever they would previously have used GPT-3.5 Turbo: the new model is multimodal, performs better on benchmarks, and is more than 60% cheaper.

We see a pattern emerging of model providers announcing models across “weight classes”: Anthropic has three different Claude 3 models and OpenAI now has two GPT-4o models. In this comparison article, we’ll answer the following questions:

  • Does GPT-4o Mini really perform better than GPT-3.5 Turbo for my tasks?
  • What’s the best small model currently on the market? Claude 3 Haiku or an OpenAI model?

To compare GPT-4o Mini, GPT-3.5 Turbo, and Claude 3 Haiku, we evaluated them across three different tasks:

  • Data extraction from legal contracts
  • Customer ticket classification
  • Verbal reasoning

For these specific tasks, we learned that:

  • Data Extraction: GPT-4o Mini performs worse than GPT-3.5 Turbo and Claude 3 Haiku, sometimes missing the mark entirely. None of the models are accurate enough for this task (only 60-70% accuracy).
  • Classification: GPT-4o Mini has the highest precision (88.89%), making it the best choice for avoiding false positives. GPT-4o Mini and GPT-3.5 Turbo have comparable F1 scores.
  • Verbal Reasoning: GPT-4o Mini outperforms the other models. It doesn’t do well on numerical questions but performs well on relationship- and language-focused ones.

Read the whole analysis in the sections that follow, and sign up for our newsletter if you want to get these analyses in your inbox!

Our Approach

We look at standard benchmarks and community-run data, and conduct a set of our own small-scale experiments.

In the next few sections we cover:

  • Cost comparison
  • Performance comparison (latency, throughput)
  • Standard benchmark comparison (example: what is the reported performance for math tasks between GPT-4o mini vs GPT-3.5 vs Claude 3 Haiku?)
  • Three evaluation experiments (data extraction, classification, and verbal reasoning)

You can skip to the section that interests you most using the "Table of Contents" panel on the left or scroll down to explore the full comparison between the models.

Cost Comparison

OpenAI has remained true to its word, continuously pushing costs down and making AI accessible to more people.
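At launch, OpenAI listed GPT-4o Mini at $0.15 per 1M input tokens and $0.60 per 1M output tokens, versus $0.50/$1.50 for GPT-3.5 Turbo and $0.25/$1.25 for Claude 3 Haiku. Prices change frequently, so treat the numbers in the quick sketch below as a snapshot and check each provider’s pricing page; the per-request math itself is simple:

# Per-1M-token prices as published around the GPT-4o Mini launch (July 2024).
# Verify against each provider's pricing page before relying on these numbers.
PRICES_PER_1M = {
    "gpt-4o-mini":    {"input": 0.15, "output": 0.60},
    "gpt-3.5-turbo":  {"input": 0.50, "output": 1.50},
    "claude-3-haiku": {"input": 0.25, "output": 1.25},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of a single request."""
    price = PRICES_PER_1M[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

# Example: a ~10-page contract (~8,000 input tokens) with a ~500-token JSON answer.
for model in PRICES_PER_1M:
    print(f"{model}: ${request_cost(model, 8_000, 500):.4f}")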

Winner = GPT-4o Mini

Performance Comparison

Latency Comparison

Latency, or time to first token, is an important metric to minimize because it reduces how “slow” the model feels when it responds. The data below measures p50 latency across a dataset, and you can see that GPT-3.5 Turbo is marginally faster than the other models.

However, latency is also affected by API load and prompt size. Given how low the latency of these models is overall, and that the data here only shows the median, we’d call this a tie between the three models.

Result = TIE

Note: lower is better

Throughput Comparison

Throughput, on the other hand, is the number of tokens a model can generate per second once it has produced the first token. GPT-4o Mini’s throughput is significantly higher than that of the other models on the market, so for long-form output generation (e.g., writing a job description) GPT-4o Mini’s completions will likely be the fastest despite a slower time to first token.

Winner = GPT-4o Mini

Note: higher is better
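If you want to sanity-check these numbers on your own prompts, both metrics are easy to measure with a streaming call. Below is a rough sketch using the OpenAI Python SDK; streamed chunks are only an approximation of tokens, the prompt and model name are placeholders, and Anthropic’s SDK exposes a similar streaming interface if you want to measure Claude 3 Haiku the same way.

import time
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def measure(model: str, prompt: str) -> tuple[float, float]:
    """Return (time to first token in seconds, chunks per second after the first token)."""
    start = time.perf_counter()
    first_chunk_at = None
    n_chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            n_chunks += 1
            if first_chunk_at is None:
                first_chunk_at = time.perf_counter()
    end = time.perf_counter()
    ttft = first_chunk_at - start
    tps = (n_chunks - 1) / (end - first_chunk_at) if n_chunks > 1 else 0.0
    return ttft, tps

ttft, tps = measure("gpt-4o-mini", "Write a short job description for a data analyst.")
print(f"time to first token: {ttft:.2f}s, throughput: ~{tps:.0f} chunks/s")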

Reported Capabilities

Standard benchmarks

As part of GPT-4o Mini’s launch blog, OpenAI released details about the model’s performance on standard benchmarks:

  • MMLU (Massive Multitask Language Understanding)
  • GPQA (Graduate Level Google-Proof Q&A)
  • DROP (Discrete Reasoning Over Paragraphs)
  • MGSM (Multilingual Grade School Math)
  • MATH (General Mathematics)
  • HumanEval (Code Generation)
  • MMMU (Massive Multi-discipline Multimodal Understanding and Reasoning)
  • MathVista (Visual Math Reasoning)

Here are the main takeaways from this data:

  • GPT-4o Mini performs 2nd best on most benchmarks after GPT-4o. The biggest performance improvements compared to GPT-3.5 Turbo seem to be in Mathematics (MATH and MGSM), which was a common issue with prior generations of LLMs.
  • Because they are multimodal, GPT-4o Mini and GPT-4o have new capabilities in visual math reasoning that were not present in GPT-3.5 Turbo.

ELO Leaderboard

For chat completions, as evaluated on the public ELO leaderboard, GPT-4o outranks GPT-4 Turbo:

Benchmarks and crowdsourced evals matter, but they don’t tell the whole story. To really know how your AI system performs, you must dive deep and evaluate these models for your use case.

Now, let’s compare these models on three tasks that might be useful for your project.

Task 1: Data Extraction

For this task, we’ll compare GPT-4o Mini, GPT-3.5 Turbo & Claude 3 Haiku on their ability to extract key pieces of information from legal contracts. Our dataset includes Master Services Agreements (MSAs) between companies and their customers. The contracts vary in length, with some as short as 5 pages and others longer than 50 pages.

In this evaluation we’ll extract a total of 12 fields, such as Contract Title, Name of Customer, Name of Vendor, details of the Termination Clause, and whether a Force Majeure clause was present.

You can check our original prompt and the JSON schema we expected the model to return:

You're a contract reviewer who is working to help review contracts following an Merger & Acquisition deal. Your goal is to analyze the text provided and return key data points, focusing on contract terms, risk, and other characteristics that would be important. You should only use the text provided to return the data.

From the provided text, create valid JSON with the schema:
{
contract_title: string, // the name of the agreement
customer: string, // this is the customer signing the agreement
vendor: string, // this is the vendor who is supplying the services
effective_date: date, // format as m/d/yyyy
initial_term: string, // the length of the agreement (ex. 1 year, 5 years, 18 months, etc.)
extension_renewal_options: string, // are there extension or renewal options in the contract? 
automatic_renewal: string, // is this agreement set to automatically renew? 
termination_clause: string, // the full text in the contract containing information about how to terminate the agreement
termination_notice: string, // the number of days that must be given notice before the agreement can be terminated. only include the number. 
force_majeure: string, // is there a clause for force majeure present in the agreement? 
force_majeure_pandemic: string, // does force majeure include reference to viral outbreaks, pandemics or epidemic events? 
assignment_allowed: string, // is there language specifying whether assignment is allowed? answer in only one sentence.
jurisdiction: string, // the jurisdiction or governing law for the agreement (ex. Montana, Georgia, New York). if this is a state, only answer with the name of the state.
}

Contract:
"""
{{ contract }}
"""

We gathered ground truth data for 10 contracts and used Vellum Evaluations to create 14 custom metrics. These metrics compared our ground truth data with the LLM's output for each parameter in the JSON generated by the model.
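Conceptually, each of these metrics is a per-field comparison between the model’s JSON and the labeled value. A simplified sketch of the idea (the exact-match rule here is illustrative; our actual metrics are more forgiving about formatting differences):

def field_accuracy(predictions: list[dict], labels: list[dict], field: str) -> float:
    """Fraction of contracts where the model's value for `field` matches the ground truth."""
    matches = sum(
        str(pred.get(field, "")).strip().lower() == str(label[field]).strip().lower()
        for pred, label in zip(predictions, labels)
    )
    return matches / len(labels)

# e.g. field_accuracy(gpt_4o_mini_outputs, ground_truth, "jurisdiction")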

Then, we tested GPT-4o Mini, GPT-3.5 Turbo & Claude 3 Haiku using Vellum’s Evaluation suite:

Then we compared how well each model extracted the correct parameters, by looking at the absolute and relative mean values for each entity, using Evaluation Reports:

Here’s what we found across the 14 fields:

  • 7 fields had equal performance across models. For the other 7 fields, results were a mixed bag; in most cases there was one model that did worse while the other two were tied at the top:
    • GPT-4o Mini was the worst-performing model on 4 fields.
    • GPT-3.5 Turbo was the worst on 2 fields.
    • Claude 3 Haiku was the worst on 1 field.
  • For one of the fields, GPT-4o Mini completely missed the mark, with 20% accuracy compared to 70% for GPT-3.5 Turbo & Claude 3 Haiku.
  • From an absolute perspective, this weight class of models doesn’t provide the desired quality for accurate data extraction. Most fields only reached 60-70% accuracy, and some were far lower. For a complex data extraction task where accuracy matters, pick a more powerful model like GPT-4o or Claude 3.5 Sonnet and use advanced prompting techniques like few-shot or chain-of-thought prompting.

Winner: Claude 3 Haiku beats GPT-3.5 Turbo marginally, but all models fall short of the mark for this data extraction task.

Task 2: Classification

In this evaluation, we had GPT-3.5 Turbo, Claude 3 Haiku, and GPT-4o Mini determine whether a customer support ticket was resolved or not. In our prompt, we provided clear instructions on when a customer ticket counts as resolved, and added few-shot examples to help with the most difficult cases.
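Our production prompt is longer, but structurally it looks something like the skeleton below; the instructions and examples are illustrative placeholders, not the exact prompt we evaluated.

# Illustrative skeleton of a few-shot classification prompt (placeholder wording).
CLASSIFY_PROMPT = """\
You are a support operations analyst. Decide whether the ticket below is RESOLVED
or UNRESOLVED. A ticket is RESOLVED only if the customer's issue was fixed or the
customer confirmed they need no further help.

Example 1:
Ticket: "Thanks, the new API key works now!"
Label: RESOLVED

Example 2:
Ticket: "I tried the steps you sent but I'm still getting a 403 error."
Label: UNRESOLVED

Ticket: "{ticket}"
Label:"""

def build_prompt(ticket: str) -> str:
    return CLASSIFY_PROMPT.format(ticket=ticket)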

We ran the evaluation to test if the models' outputs matched our ground truth data for 100 labeled test cases.

In the Evaluation Report below you can see how all models compare to GPT-4o Mini:

We can see from the report that:

  1. Accuracy Comparison: GPT-4o Mini (0.72) does better than Claude 3 Haiku (0.61) and GPT-3.5 Turbo (0.66).
  2. Improvements: Claude 3 Haiku and GPT-3.5 Turbo outperform GPT-4o Mini on 11 completions
  3. Regressions: Claude 3 Haiku and GPT-3.5 Turbo underperform GPT-4o Mini on 22 and 17 completions respectively, adding further evidence that GPT-4o Mini does better at this classification task

Accuracy is important but not the only metric to consider, especially in contexts where false positives (incorrectly marking unresolved tickets as resolved) can lead to customer dissatisfaction.

So, we calculated the precision, recall and F1 score for these models:
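Vellum reports these metrics directly, but for reference, here is how the three numbers fall out of the raw predictions when “resolved” is treated as the positive class (a minimal sketch assuming string labels):

def precision_recall_f1(predictions: list[str], labels: list[str], positive: str = "resolved"):
    """Compute precision, recall, and F1 with `positive` as the positive class."""
    tp = sum(p == positive and l == positive for p, l in zip(predictions, labels))
    fp = sum(p == positive and l != positive for p, l in zip(predictions, labels))
    fn = sum(p != positive and l == positive for p, l in zip(predictions, labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1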

Key takeaways:

  • GPT-4o Mini has the highest precision at 88.89%, indicating it is the best at avoiding false positives. This means when GPT-4o Mini classifies a ticket as resolved, it is more likely to be accurate, thus reducing the chance of incorrectly marking unresolved tickets as resolved.
  • Both GPT-4o Mini and GPT-3.5 Turbo have higher F1 scores than Claude 3 Haiku.

Winner: TIE between GPT-4o Mini and GPT-3.5 Turbo; choose based on your preference between Type 1 and Type 2 errors.

Note: Keep in mind that prompting techniques can help increase these numbers. We can analyze the misclassified scenarios, and use those insights to prompt the model better. When it comes to AI development it’s all about iterative improvements.

Task 3: Reasoning

The benchmarks released by OpenAI say that GPT-4o Mini is the best model in its weight class on reasoning tasks. Let’s see how it does on our evals. We selected 16 verbal reasoning questions to compare the three models. Here is an example question:

💡  Verbal reasoning question:

1. Choose the word that best completes the analogy: Feather is to Bird as Scale is to _______.

A) Reptile

B) Dog

C) Fish

D) Plant

Answer: Reptile 
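Scoring these questions comes down to checking the model’s pick against the expected option. Here is a simplified version of that check; the parsing rule is our illustration, not Vellum’s built-in metric.

import re

def grade_choice(model_output: str, correct_letter: str, correct_text: str) -> bool:
    """Accept either the option letter (e.g. 'A') or the option text (e.g. 'Reptile')."""
    text = model_output.strip()
    letter = re.search(r"\b([A-D])\b", text.upper())
    if letter and letter.group(1) == correct_letter.upper():
        return True
    return correct_text.lower() in text.lower()

# grade_choice("The answer is A) Reptile", "A", "Reptile") -> True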

Below is a screenshot of the initial test we ran in our prompt environment in Vellum:

Now, let’s run the evaluation across all 16 reasoning questions:

From the image above we can see that:

  • GPT-4o Mini outperforms the other models with 50% accuracy, versus 44% for GPT-3.5 Turbo and 19% for Claude 3 Haiku.
  • Claude 3 Haiku is often unable to complete its output; better prompt engineering would likely resolve this issue.
  • GPT-4o Mini doesn’t do well on numerical questions but performs well on relationship- and language-focused ones.

Winner: GPT-4o Mini

Summary

Conclusion

While GPT-4o Mini leads in most areas, further evaluation and prompt testing on your specific use case is essential to fully understand the capabilities of these models. Building production-ready AI systems requires careful trade-offs, good prompt curation, and iterative evaluation.

Want to compare these models on your tasks & test cases? Vellum can help! Book a demo here.

Source for throughput & latency: artificialanalysis.ai

Source for standard benchmarks: https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/

ABOUT THE AUTHOR
Anita Kirkovska
Founding GenAI Growth

Anita Kirkovska is currently leading Growth and Content Marketing at Vellum. She is a technical marketer with an engineering background and a sharp acumen for scaling startups. She has helped SaaS startups scale and had a successful exit from an ML company. Anita writes a lot of content on generative AI to educate business founders on best practices in the field.

ABOUT THE AUTHOR
Akash Sharma
Co-founder & CEO

Akash Sharma, CEO and co-founder at Vellum (YC W23), is enabling developers to easily start, develop, and evaluate LLM-powered apps. By talking to over 1,500 people at varying maturities of using LLMs in production, he has acquired a unique understanding of the landscape, and is actively sharing his learnings with the broader LLM community. Before starting Vellum, Akash completed his undergrad at the University of California, Berkeley, then spent 5 years at McKinsey's Silicon Valley Office.
