Today, OpenAI announced that they’ve reached a new level of AI capability — and they’re resetting the version counter back to 1. They shipped their latest models: OpenAI o1 and OpenAI o1 mini.
Built to handle hard problems, these models take more time to think before responding, similar to how a person would approach a difficult task. The “OpenAI o1” model in particular shows incredible results across hard problems in math, coding, and reasoning. The model is also much better at resisting “jailbreaks” — almost 4 times better than GPT-4o.
This next level of reasoning will have an impact on many industries, from genomics and economics to cognition and quantum physics. It’s that powerful.
The smaller version, OpenAI o1 mini, is designed specifically for developers. It excels at accurately generating and debugging complex code, and it’s 80% cheaper than OpenAI o1.
Now that we have the TL;DR, let’s go into our analysis, where we compare these models on three tasks and cover some of the latest benchmarks and human-expert reviews.
To measure the impact of this new AI capability, we analyzed and compared the OpenAI o1 and GPT-4o models on these tasks:
- Reasoning riddles
- Math equations
- Customer ticket classification
For these specific tasks, we've learned that:
- Reasoning riddles: Compared to GPT-4o, OpenAI o1 got only one more example correct (12/16). The results between the models were similar, and we noticed that both GPT-4o and OpenAI o1 struggled with analogy reasoning riddles. However, the o1 model performed better on riddles that required more calculations, while GPT-4o returned its answers in a split second.
- Math equations: We used 10 of the hardest SAT questions for this experiment, and OpenAI o1 got 6/10 right — which is really impressive! GPT-4o did poorly on this task and got only 2 equations right. For good measure, we added Claude 3.5 Sonnet to the mix, but it performed just as poorly as GPT-4o.
- Classification: OpenAI o1 had a 12% improvement over GPT-4o on 100 test cases — a big gain for this task. It has the best precision (83%), recall, and overall F1 score, so if your classification task is not sensitive to latency, you should go with OpenAI o1.
We can conclude that:
- If you want an extremely fast model at a lower cost — just go with GPT-4o mini.
- If you want the most capable model for production — go with GPT-4o.
- If you want to solve some extremely hard problems (especially in math!) and you don’t care about latency — go with OpenAI o1.
💡If you're looking to evaluate these models on your own task - Vellum can help. Book a call with one of our AI experts to set up your evaluation.
1) Productionizing any feature built on top of the OpenAI o1 model is going to be very hard
The thinking step can take a long time (I waited more than 3 minutes for some answers!), and we can’t predict how long it will take. OpenAI hides the actual CoT process and only provides a summary of it, so there is no good way for us to estimate how long a given output will take to generate or to understand how the model thinks. In some cases I ran the same question three times with OpenAI o1 and got three different answers. Also, while the reasoning is not visible in the API, those tokens still occupy space in the model's context window and are billed as output tokens — expect to pay top-tier prices for tokens you don’t see.
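If you want to see how much of a response is hidden reasoning, you can inspect the usage object the API returns. Below is a minimal sketch with the official Python SDK; the `completion_tokens_details.reasoning_tokens` field is our assumption about where the SDK surfaces this number, so verify it against your SDK version:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": "How many prime numbers are there between 1 and 100?"}],
)

usage = response.usage
print("billed completion tokens:", usage.completion_tokens)

# The chain of thought itself is never returned, but its tokens are billed
# as output tokens. Field name assumed here; check your SDK version.
details = getattr(usage, "completion_tokens_details", None)
if details is not None:
    print("hidden reasoning tokens:", details.reasoning_tokens)
```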
2) OpenAI o1 won't need advanced prompting
It seems like you can prompt these models in a very straightforward way. Adding CoT instructions or few-shot examples won't improve results, and in some cases it might even hinder performance.
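The snippet below is a purely illustrative contrast between a direct prompt and the kind of CoT/few-shot scaffolding you might use with GPT-4o; the prompt wording is ours, not OpenAI's guidance:

```python
# With GPT-4o you might scaffold the prompt with CoT instructions and few-shot
# examples; with o1 a direct instruction tends to work just as well, and the
# extra scaffolding can even hurt. Both prompts below are illustrative only.
direct_prompt = (
    "Classify the sentiment of this review as positive, negative, or neutral: {review}"
)

scaffolded_prompt = (
    "You are an expert analyst. Think step by step, explain your reasoning, "
    "and study the three labeled examples below before answering...\n"
    "Classify the sentiment of this review: {review}"
)

# For o1, prefer the direct version and let the model do its own reasoning.
```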
3) The OpenAI o1 model won’t be useful for many common use cases
While the model is really powerful for solving hard problems, it still lacks the standard features and parameters that GPT-4o has. Streaming, tool use, and other features are disabled in the API — so keep that in mind when you’re choosing a model for your use case. Also, the human-expert reviews showed a preference for GPT-4o on some natural language tasks, which means this model is not the best choice for every task.
4) Choose the problem and your models wisely
Now, more than ever, we need to know which tasks are better solved with “reasoning models” vs. “standard models”. For a basic reasoning task, GPT-4o took less than a second to provide the answer, while we waited 2-3 minutes for OpenAI o1 to “think” (more like overthink!) its way to the same answer. Then again, GPT-4o will also be fast to make mistakes. Balance will be key.
Read the whole analysis in the sections that follow, and sign up for our newsletter if you want to get these analyses in your inbox!
To put it simply, the new o1 model is so much better because of two changes:
- It’s trained with a large-scale reinforcement learning algorithm that teaches the model how to answer queries using chain of thought (read more about CoT here);
- The model also takes extra time to think during inference, improving its answers in real time.
We covered the Orion and Strawberry models in this post, but if you want to go deeper into the technical details, read the system card here.
But now, let’s go into the analysis.
The main focus of this analysis is to compare GPT-4o (gpt-4o-2024-08-06) and the OpenAI o1 model.
We look at standard benchmarks, human-expert reviews, and conduct a set of our own small-scale experiments.
In the next sections, we’ll cover the following analyses:
- Latency and Cost comparison
- Standard benchmark comparison (for example: what is the reported performance on math tasks for GPT-4o vs. OpenAI o1?)
- Human-expert reviews (OpenAI’s own version of the Chatbot Arena)
- Three evaluation experiments (math equations, classification and reasoning)
You can skip to the section that interests you most using the "Table of Contents" panel on the left, or scroll down to explore the full comparison between OpenAI o1 and GPT-4o.
As expected, the new o1 models are much slower, due to their “reasoning” process.
OpenAI o1 is approximately 30 times slower than GPT-4o. Similarly, the o1 mini version is around 16 times slower than GPT-4o mini.
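If you want to reproduce the latency gap on your own prompts, a simple wall-clock comparison is enough. A minimal sketch, assuming the `o1-preview` and `gpt-4o-2024-08-06` API identifiers and the official Python SDK:

```python
import time
from openai import OpenAI

client = OpenAI()

def time_request(model: str, prompt: str) -> float:
    """Return wall-clock latency in seconds for a single completion."""
    start = time.perf_counter()
    client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return time.perf_counter() - start

# Compare e.g. time_request("gpt-4o-2024-08-06", question) with
# time_request("o1-preview", question), averaged over several runs,
# since o1's thinking time varies a lot from request to request.
```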
When it comes to cost, OpenAI o1 and o1 mini are among the most expensive models on the market right now, for two reasons: 1) their input/output tokens cost a lot, and 2) you also get charged for the hidden CoT tokens that you don't see in the output.
OpenAI o1 costs $15.00 per 1M input tokens and $60.00 per 1M output tokens — making it 3 times more expensive than GPT-4o. Similarly, OpenAI o1 mini costs $3.00 per 1M input tokens and $12.00 per 1M output tokens — making it 20 times more expensive than GPT-4o mini.
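To make the output-token math concrete, here is a back-of-the-envelope helper using the o1 list prices quoted above; the example token counts, and the way reasoning tokens are folded into billed output, are illustrative assumptions rather than real usage data:

```python
# Back-of-the-envelope cost for a single o1 request, using the list prices
# quoted above. Hidden reasoning tokens are billed as output tokens.
O1_INPUT_PER_M = 15.00    # $ per 1M input tokens
O1_OUTPUT_PER_M = 60.00   # $ per 1M output tokens (includes reasoning tokens)

def o1_request_cost(input_tokens: int, visible_output_tokens: int, reasoning_tokens: int) -> float:
    billed_output = visible_output_tokens + reasoning_tokens
    return (input_tokens / 1_000_000) * O1_INPUT_PER_M + (billed_output / 1_000_000) * O1_OUTPUT_PER_M

# Example: 1,000 input tokens, 500 visible output tokens, 4,000 hidden reasoning tokens
print(f"${o1_request_cost(1_000, 500, 4_000):.3f}")  # $0.285, most of it for tokens you never see
```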
So, unless you’re dealing with very hard problems where the extra performance of o1 or o1 mini is necessary, it’s hard to justify the significant cost difference, especially when GPT-4o provides similar capabilities at a fraction of the price.
When new models are released, we learn about their capabilities from benchmark data reported in the technical reports. The new OpenAI o1 model improves on the most complex reasoning benchmarks:
- Exceeds human PhD-level accuracy on challenging benchmark tasks in physics, chemistry, and biology on the GPQA benchmark
- Coding is easier — it ranks in the 89th percentile on competitive programming questions (Codeforces)
- It’s also very good at math — in a qualifying exam for the International Mathematics Olympiad (IMO), GPT-4o correctly solved only 13% of problems, while the reasoning model scored 83%. Now, this is next level.
On the standard ML benchmarks, it has huge improvements across the board:
More statistics from Chatbot Arena (ELO Leaderboard)
This public ELO leaderboard is part of the LMSYS Chatbot Arena. The chatbot arena allows you to prompt two anonymous language models, vote on the best response, and then reveal their identities.
They’ve gathered over 6,000 votes, and the results show that the OpenAI o1 model is consistently ranked #1 across all categories, with math being the most notable area of impact. The o1 mini model is #1 in technical areas and #2 overall. Check out the full results at this link.
OpenAI also brought in human experts to review and compare the new model with GPT-4o, without knowing which model they were evaluating.
The results show that the newest model is great at complex tasks but not preferred for some natural language tasks — suggesting that it may not be the best choice for every use case.
For this task, we’ll compare the GPT-4o and OpenAI o1 models on how well they solve some of the hardest SAT math questions. This is the basic prompt that we used for both models:
To write this prompt we used Vellum’s Prompt IDE to test for answer structure and validity. We didn’t want the model to output extra characters or explanations before we ran the full evaluation set of SAT math questions — so we took some time refining the prompt:
Then we ran all test cases (10 math equations) in Vellum Evaluations:
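If you're replicating this outside of Vellum, the core of the evaluation is just an exact-match loop over the question set. A rough sketch, where the `ask_model` helper, the dataset format, and the model identifiers are our own illustrative assumptions:

```python
from openai import OpenAI

client = OpenAI()

def ask_model(model: str, question: str) -> str:
    """Send one SAT question and return the model's raw answer string."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content.strip()

def score(model: str, dataset: list[dict]) -> int:
    """Count exact matches against the expected answers."""
    return sum(ask_model(model, case["question"]) == case["answer"] for case in dataset)

# dataset = [{"question": "...", "answer": "..."}, ...]  # the 10 SAT math questions
# print(score("o1-preview", dataset), score("gpt-4o-2024-08-06", dataset))
```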
Here’s what we found:
- GPT-4o did poorly on most of the examples, and only got 2 right. OpenAI o1 got 6 of them right — which is impressive.
- We put Claude 3.5 Sonnet in the mix for good measure, and it performed just as poorly as GPT-4o, with identical results.
Winner: OpenAI o1!
GPT-4o has been the best model for reasoning tasks, as we can see from standard benchmarks and independently run evaluations.
Will OpenAI o1 take the #1 position?
To find out, we selected 16 verbal reasoning questions to compare the two. Here is an example riddle and its sources:
Then we ran the evaluation across all cases:
From the image above we can see that:
- OpenAI o1 got only one more example correct than GPT-4o. While not a great difference, we can see some improvements.
- The examples o1 got right involved mathematical and distance calculations, and given the increased math capabilities of the o1 model, this is somewhat expected.
- We were also curious to see how the “mini” models would perform — GPT-4o Mini (63%) outperformed OpenAI o1 Mini (54%) and did so in less time.
Winner: For the 16-question set we tested, it’s essentially a tie.
In this analysis, we had both OpenAI o1 and GPT-4o determine whether a customer support ticket was resolved or not. In our prompt we provided clear instructions on when a customer ticket counts as closed, and added few-shot examples to help with the most difficult cases.
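For illustration, here is roughly what that prompt structure looks like; the resolution criteria and few-shot examples below are simplified stand-ins, not the exact prompt we used:

```python
# Illustrative sketch of the classification prompt: clear resolution criteria
# plus a couple of few-shot examples for edge cases.
FEW_SHOT_EXAMPLES = """\
Ticket: "Thanks, restarting the app fixed it!" -> resolved
Ticket: "I'll try that and get back to you tomorrow." -> unresolved
"""

def build_messages(ticket_text: str) -> list[dict]:
    instructions = (
        "You classify customer support tickets as 'resolved' or 'unresolved'. "
        "A ticket is resolved only if the customer confirms the issue is fixed "
        "or no further action is required.\n\n"
        f"Examples:\n{FEW_SHOT_EXAMPLES}\n"
        f"Ticket: \"{ticket_text}\"\n"
        "Answer with exactly one word: resolved or unresolved."
    )
    return [{"role": "user", "content": instructions}]
```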
We ran the evaluation to test if the models' outputs matched our ground truth data for 100 labeled test cases.
You can see the results we got here:
Here’s what we observed:
- OpenAI o1 did significantly better here; a 12% improvement across 100 test cases is impressive.
For classification tasks, accuracy is important but not the only metric to consider, especially in contexts where false positives (incorrectly marking unresolved tickets as resolved) can lead to customer dissatisfaction.
So, we calculated the precision, recall, and F1 score for these models:
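If you want to reproduce these metrics from your own predictions and ground-truth labels, scikit-learn covers all of them:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# y_true: ground-truth labels for the 100 tickets, y_pred: a model's predictions
# (both as "resolved"/"unresolved" strings).
def report(y_true, y_pred, positive_label="resolved"):
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, pos_label=positive_label),
        "recall": recall_score(y_true, y_pred, pos_label=positive_label),
        "f1": f1_score(y_true, y_pred, pos_label=positive_label),
    }
```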
Winner: For this task, OpenAI o1 clearly wins. It has higher precision, accuracy and recall. If your classification task is not sensitive to latency consider using OpenAI o1.
Conclusion
Although OpenAI o1 shines in many areas, its high latency and closed system can be real barriers for production use. For most cases, GPT-4o is still the go-to model, while OpenAI o1 is better suited for solving tough problems behind the scenes, rather than being the first choice for everyday production needs.
To try Vellum and evaluate these models on your tasks, book a demo here.