New day, new model! Today, OpenAI released “GPT-4o Mini”, their latest cost-efficient small model. At 128k tokens, its context window is 8x larger than GPT-3.5 Turbo’s. OpenAI is suggesting developers use GPT-4o Mini wherever they would previously have used GPT-3.5 Turbo, since the new model is multimodal, performs better on benchmarks, and is more than 60% cheaper.
We see a pattern emerging of model providers announcing models across “weight classes”: Anthropic has three different Claude 3 models, and OpenAI now has two GPT-4o models. In this comparison article, we’ll answer the following questions:
- Does GPT-4o Mini really perform better than GPT-3.5 Turbo for my tasks?
- What’s the best small model currently on the market? Claude 3 Haiku or an OpenAI model?
To see how GPT-4o Mini, GPT-3.5 Turbo and Claude 3 Haiku stack up in practice, we evaluated them on three tasks:
- Data extraction from legal contracts
- Customer ticket classification
- Verbal reasoning
For these specific tasks, we learned that:
- Data Extraction: GPT-4o Mini performs worse than GPT-3.5 Turbo and Claude 3 Haiku, sometimes missing the mark entirely. None of the models is accurate enough for this task (only 60-70% accuracy).
- Classification: GPT-4o Mini has the highest precision (88.89%), making it the best choice for avoiding false positives. GPT-4o Mini and GPT-3.5 Turbo have comparable F1 scores.
- Verbal Reasoning: GPT-4o Mini outperforms the other models. It doesn’t do well on numerical questions but performs well on relationship- and language-oriented ones.
Read the whole analysis in the sections that follow, and sign up for our newsletter if you want to get these analyses in your inbox!
We look at standard benchmarks, community-run data, and a set of our own small-scale experiments.
In the next few sections we cover:
- Cost comparison
- Performance comparison (Latency, Throughput)
- Standard benchmark comparison (for example: how does reported performance on math tasks compare across GPT-4o Mini, GPT-3.5 Turbo and Claude 3 Haiku?)
- Three evaluation experiments (data extraction, classification and verbal reasoning)
You can skip to the section that interests you most using the "Table of Contents" panel on the left or scroll down to explore the full comparison between the models.
Cost Comparison
OpenAI has remained true to its word, continuously pushing costs down and making AI accessible to more people.
Winner = GPT-4o Mini
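For a back-of-the-envelope check on your own workloads, per-request cost follows directly from the published per-million-token prices. The prices in the sketch below are the list prices around GPT-4o Mini’s launch (July 2024) and may have changed since:

```python
# Rough per-request cost estimate from per-1M-token list prices.
# Prices reflect published pricing around GPT-4o Mini's launch (July 2024)
# and may have changed -- treat them as placeholders, not a source of truth.
PRICES_PER_1M = {                      # (input, output) in USD per 1M tokens
    "gpt-4o-mini":    (0.15, 0.60),
    "gpt-3.5-turbo":  (0.50, 1.50),
    "claude-3-haiku": (0.25, 1.25),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICES_PER_1M[model]
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Example: a 2,000-token prompt with a 500-token answer
for model in PRICES_PER_1M:
    print(f"{model}: ${request_cost(model, 2_000, 500):.6f}")
```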
Latency Comparison
Latency, or time to first token, is an important metric to minimize because it determines how “slow” the model feels to respond. The data below measures p50 (median) latency across a dataset; GPT-3.5 Turbo is marginally faster than the other models.
However, latency is also affected by API load and prompt size. Given how low latency is for all of these models, and that the data here only shows the median, we’d call this a tie between the three models.
Result = TIE
Throughput Comparison
Throughput, on the other hand, is the number of tokens a model can generate per second once it starts generating. GPT-4o Mini’s throughput is significantly higher than other models on the market, so for long-form output generation (e.g., writing a job description) GPT-4o Mini’s completions will likely finish fastest despite a slightly slower time to first token.
Winner = GPT-4o Mini
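If you want to measure latency and throughput on your own prompts, here’s a rough sketch using the OpenAI Python SDK’s streaming API; the prompt is a placeholder, characters stand in for tokens, and a real benchmark should average many runs and report percentiles:

```python
# Rough sketch: time-to-first-token (latency) and generation speed (throughput)
# for one streamed completion. Assumes the `openai` Python SDK v1+ and an
# OPENAI_API_KEY in the environment. Characters are a cheap proxy for tokens;
# a real benchmark would repeat this many times and report the p50.
import time
from openai import OpenAI

client = OpenAI()

def measure(model: str, prompt: str) -> tuple[float, float]:
    start = time.perf_counter()
    first_token_time = None
    chars = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content or ""
        if delta and first_token_time is None:
            first_token_time = time.perf_counter()  # first token arrived
        chars += len(delta)
    assert first_token_time is not None, "model returned no tokens"
    ttft = first_token_time - start                             # latency
    speed = chars / (time.perf_counter() - first_token_time)    # chars/sec
    return ttft, speed

print(measure("gpt-4o-mini", "Write a short job description for a data analyst."))
```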
Standard benchmarks
As part of GPT-4o Mini’s launch blog, OpenAI released details about the model’s performance on standard benchmarks:
- MMLU (Massive Multitask Language Understanding)
- GPQA (Graduate Level Google-Proof Q&A)
- DROP (Discrete Reasoning Over Paragraphs)
- MGSM (Multilingual Grade School Math)
- MATH (General Mathematics)
- HumanEval (Code Generation)
- MMMU (Massive Multi-discipline Multimodal Understanding and Reasoning)
- MathVista (Visual Math Reasoning)
Here are the main takeaways from this data:
- GPT-4o Mini performs second best on most benchmarks, behind GPT-4o. The biggest improvements over GPT-3.5 Turbo are in mathematics (MATH and MGSM), an area that was a common weakness in prior generations of LLMs.
- GPT-4o Mini gains visual math reasoning capabilities (MathVista) thanks to multimodality; this capability was not present in GPT-3.5 Turbo.
ELO Leaderboard
For chat completions, as evaluated on the public ELO leaderboard, GPT-4o outranks GPT-4 Turbo:
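For context, arena-style leaderboards turn pairwise human votes into ratings using an Elo-style update. Here’s a minimal sketch of that update rule (the K-factor and starting ratings are arbitrary choices; real leaderboards use refinements of this scheme):

```python
# Minimal sketch of the Elo update behind arena-style leaderboards: each
# head-to-head vote nudges the winner's rating up and the loser's down,
# weighted by how "surprising" the outcome was. K=32 and the 400 scale are
# the conventional chess values.
def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

# Example: model A (1200) beats model B (1250) -> A gains ~18 points
print(elo_update(1200, 1250, a_won=True))
```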
Benchmarks and crowdsourced evals matter, but they don’t tell the whole story. To really know how your AI system performs, you must dive deep and evaluate these models for your use-case.
Now, let’s compare these models on three tasks that might be useful for your project.
Data Extraction from Legal Contracts
For this task, we’ll compare GPT-4o Mini, GPT-3.5 Turbo & Claude 3 Haiku on their ability to extract key pieces of information from legal contracts. Our dataset includes Master Services Agreements (MSAs) between companies and their customers. The contracts vary in length, with some as short as 5 pages and others longer than 50 pages.
In this evaluation we extract a total of 12 fields, such as the Contract Title, Name of Customer, Name of Vendor, details of the Termination Clause, and whether a Force Majeure clause is present.
You can check our original prompt and the JSON schema we expected the model to return:
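As a rough illustration of the kind of schema we mean (the field names below are hypothetical, not the exact ones from our evaluation), it could look something like this:

```python
# Hypothetical sketch of a JSON schema for this kind of contract extraction.
# Field names are illustrative only; the actual prompt defines 12 fields.
CONTRACT_SCHEMA = {
    "type": "object",
    "properties": {
        "contract_title":        {"type": "string"},
        "customer_name":         {"type": "string"},
        "vendor_name":           {"type": "string"},
        "effective_date":        {"type": "string", "description": "ISO 8601 date"},
        "termination_clause":    {"type": "string"},
        "force_majeure_present": {"type": "boolean"},
    },
    "required": ["contract_title", "customer_name", "vendor_name"],
}
```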
We gathered ground truth data for 10 contracts and used Vellum Evaluations to create 14 custom metrics. These metrics compared our ground truth data with the LLM's output for each parameter in the JSON generated by the model.
Then, we tested GPT-4o Mini, GPT-3.5 Turbo & Claude 3 Haiku using Vellum’s Evaluation suite:
Then we compared how well each model extracted the correct parameters, by looking at the absolute and relative mean values for each entity, using Evaluation Reports:
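Outside of Vellum, the core of each of these metrics is a per-field comparison between the model’s JSON output and the ground truth. A minimal sketch of that idea, with simplified normalization, might look like this:

```python
# Minimal sketch of a per-field exact-match metric: compare each model's JSON
# output against hand-labeled ground truth, one field at a time. The
# lowercase/strip normalization is a simplifying assumption; free-text fields
# (e.g., termination clauses) may need fuzzy matching instead.
import json

def field_accuracy(outputs: list[str], ground_truth: list[dict], field: str) -> float:
    correct = 0
    for raw, truth in zip(outputs, ground_truth):
        try:
            predicted = json.loads(raw).get(field)
        except json.JSONDecodeError:
            predicted = None  # malformed JSON counts as a miss
        if str(predicted).strip().lower() == str(truth.get(field)).strip().lower():
            correct += 1
    return correct / len(ground_truth)
```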
Here’s what we found across the 14 fields:
- 7 fields had equal performance across models. For the other 7 fields, results were a mixed bag; in most cases one model did worse while the other two tied at the top:
  - GPT-4o Mini was the worst-performing model in 4 fields.
  - GPT-3.5 Turbo was the worst in 2 fields.
  - Claude 3 Haiku was the worst in 1 field.
- For one of the fields, GPT-4o Mini completely missed the mark and had 20% accuracy compared to 70% for GPT-3.5 Turbo & Claude 3 Haiku
- In absolute terms, this weight class of models doesn’t provide the quality needed for accurate data extraction: most fields only reached 60-70% accuracy, and some were far lower. For a complex data extraction task where accuracy matters, pick a more powerful model like GPT-4o or Claude 3.5 Sonnet and use advanced prompting techniques such as few-shot or chain-of-thought prompting (a sketch of such a prompt follows the verdict below).
Winner: Claude 3 Haiku beats GPT-3.5 Turbo marginally, but all models fall short of the mark for the data extraction task.
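As promised above, here’s a rough sketch of the kind of few-shot, chain-of-thought extraction prompt we mean; the example contract excerpt and fields are invented for illustration:

```python
# Hypothetical sketch of a few-shot, chain-of-thought extraction prompt.
# The example excerpt and expected JSON are made up; replace {contract_text}
# with the actual contract before sending the prompt to the model.
EXTRACTION_PROMPT = """You are extracting fields from a Master Services Agreement.
Think through the contract step by step, then return only a JSON object.

Example:
Contract excerpt: "This Master Services Agreement ("Agreement") is entered into
between Acme Corp ("Vendor") and Globex LLC ("Customer")..."
Reasoning: The vendor is marked with the ("Vendor") designation and the customer
with ("Customer"). There is no force majeure language in this excerpt.
Answer: {"vendor_name": "Acme Corp", "customer_name": "Globex LLC",
         "force_majeure_present": false}

Now extract the same fields from the contract below.
Contract: {contract_text}
"""
```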
Customer Ticket Classification
In this evaluation, we had GPT-3.5 Turbo, Claude 3 Haiku and GPT-4o Mini determine whether a customer support ticket was resolved or not. In our prompt we provided clear instructions on when a ticket should be considered resolved, and added few-shot examples to help with the most difficult cases.
We ran the evaluation to test if the models' outputs matched our ground truth data for 100 labeled test cases.
In the Evaluation Report below you can see how all models compare to GPT-4o Mini:
We can see from the report that:
- Accuracy Comparison: GPT-4o Mini (0.72) does better than Claude 3 Haiku (0.61) and GPT-3.5 Turbo (0.66).
- Improvements: Claude 3 Haiku and GPT-3.5 Turbo outperform GPT-4o Mini on 11 completions
- Regressions: Claude 3 Haiku and GPT-3.5 Turbo underperform GPT-4o Mini on 22 and 17 completions respectively, adding further evidence that GPT-4o Mini does better at this classification task
Accuracy is important but not the only metric to consider, especially in contexts where false positives (incorrectly marking unresolved tickets as resolved) can lead to customer dissatisfaction.
So, we calculated the precision, recall and F1 score for these models (a short sketch of how these metrics are computed appears at the end of this section):
Key takeaways:
- GPT-4o Mini has the highest precision at 88.89%, indicating it is the best at avoiding false positives. This means when GPT-4o Mini classifies a ticket as resolved, it is more likely to be accurate, thus reducing the chance of incorrectly marking unresolved tickets as resolved.
- Both GPT-4o Mini and GPT-3.5 Turbo have higher F1 scores than Claude 3 Haiku.
Winner: TIE between GPT-4o Mini and GPT-3.5 Turbo; choose based on whether you care more about avoiding Type 1 (false positive) or Type 2 (false negative) errors.
Note: Keep in mind that prompting techniques can help increase these numbers. We can analyze the misclassified scenarios, and use those insights to prompt the model better. When it comes to AI development it’s all about iterative improvements.
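As referenced above, here’s a generic sketch of how the precision, recall and F1 numbers are computed from raw predictions; the label values are assumptions:

```python
# Generic sketch: precision, recall and F1 for the "resolved" class, computed
# from predicted vs. ground-truth labels. The "resolved" label value is an
# assumption -- substitute whatever your dataset uses.
def precision_recall_f1(predictions: list[str], labels: list[str], positive: str = "resolved"):
    tp = sum(p == positive and l == positive for p, l in zip(predictions, labels))
    fp = sum(p == positive and l != positive for p, l in zip(predictions, labels))
    fn = sum(p != positive and l == positive for p, l in zip(predictions, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```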
Verbal Reasoning
The benchmarks released by OpenAI say that GPT-4o Mini is the best model in its weight class on reasoning tasks. Let’s see how it does on our evals. We selected 16 verbal reasoning questions to compare the three models. Here is an example riddle:
Below is a screenshot of the initial test we ran in Vellum’s prompt environment:
Now, let’s run the evaluation across all 16 reasoning questions:
From the image above we can see that:
- GPT-4o Mini outperforms the other models with 50% accuracy, versus 44% for GPT-3.5 Turbo and 19% for Claude 3 Haiku.
- Claude 3 Haiku is often unable to complete its output; better prompt engineering would likely resolve this issue.
- GPT-4o Mini doesn’t do well on numerical questions but performs well on relationship- and language-oriented ones.
Winner: GPT-4o Mini
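If you want to run a similar head-to-head on your own questions outside of Vellum, the core loop is small. The sketch below assumes the official OpenAI and Anthropic Python SDKs, placeholder model IDs, and a simple “expected answer appears in the output” scoring rule:

```python
# Rough sketch of a head-to-head reasoning eval: ask each model the same
# questions and score with a substring match against the expected answer.
# Model IDs, the question list and the scoring rule are assumptions; the
# article itself used Vellum's evaluation suite for this step.
from openai import OpenAI
from anthropic import Anthropic

openai_client = OpenAI()
anthropic_client = Anthropic()

def ask(model: str, question: str) -> str:
    if model.startswith("claude"):
        msg = anthropic_client.messages.create(
            model=model, max_tokens=256,
            messages=[{"role": "user", "content": question}],
        )
        return msg.content[0].text
    resp = openai_client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

def accuracy(model: str, questions: list[tuple[str, str]]) -> float:
    hits = sum(expected.lower() in ask(model, q).lower() for q, expected in questions)
    return hits / len(questions)

# questions = [("Riddle text ...", "expected answer"), ...]  # 16 items
# for m in ["gpt-4o-mini", "gpt-3.5-turbo", "claude-3-haiku-20240307"]:
#     print(m, accuracy(m, questions))
```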
Conclusion
While GPT-4o Mini leads in most areas, further evaluation and prompt testing on your specific use case is essential to fully understand the capabilities of these models. Building production-ready AI systems requires careful trade-offs, good prompt curation, and iterative evaluation.
Want to compare these models on your tasks & test cases? Vellum can help! Book a demo here.
Source for throughput & latency: artificialanalysis.ai
Source for standard benchmarks: https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/