February 5, 2025
We evaluated the performance of the OpenAI-o1, DeepSeek-R1, Gemini-2.0 Flash Thinking, and OpenAI o3-mini models on 27 different puzzles. These well-known puzzles were altered to see whether the models could adapt to the new context or would simply fall back on their training data.
For example, we included the Monty Hall problem in the set, but changed one parameter:
Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a gold bar; behind the others, rotten vegetables. You pick a door, say No. 1, and the host asks you, 'Do you want to pick door No. 2 instead?' What choice of door now gives you the biggest advantage?
In the original Monty Hall problem, the host opens one of the doors you didn't pick and reveals what's behind it. In this version, he does not, and since no additional information has been provided, your odds remain the same.
The correct answer here is: “It is not an advantage to switch. It makes no difference if I switch or not because no additional material information has been provided since the initial choice.”
But this wasn’t obvious to the models.
Maybe reasoning models aren’t good at reasoning, yet.
The table below provides a breakdown of the test data we used, along with the models' responses. The rest of the article highlights some of the most interesting examples.
Last June, a group of scientists created these puzzles and conducted a similar evaluation of older models. The challenges they identified remain in today's more advanced models.
Here’s what we found.
Overfitting
Our findings reveal that models still default to familiar solutions rather than adapting to new contexts, highlighting their tendency to overfit and rely on known assumptions (inductive bias). In some cases, they even imposed nonexistent constraints, wasting tokens and ultimately arriving at the wrong answer.
Let's look at some examples.
Question: A group of four people needs to cross a bridge at night. The bridge is very old and rickety. They have only one torch, and because it's nighttime, the torch is necessary to cross the bridge. Each person walks at a different speed: A takes 1 minute to cross, B takes 2 minutes, C takes 5 minutes, and D takes 10 minutes. What is the fastest time they can all get across the bridge?
Answer: 10 minutes, the time of the slowest person, since they can all cross the bridge together.
DeepSeek-R1: "...First, the main constraints are that only two people can cross the bridge at once because they need the torch, and whenever two people cross, someone has to bring the torch back for the others. So the challenge is to minimize the total time by optimizing who goes together and who comes back with the torch."
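To make the gap concrete, here is a quick Python sketch of our own (an illustration, not part of the evaluation itself): with no capacity limit stated, the fastest plan is simply everyone crossing together, while the familiar 17-minute answer only appears once you add the two-person-plus-torch-return rule the model assumed.

```python
from itertools import combinations
from functools import lru_cache

TIMES = {"A": 1, "B": 2, "C": 5, "D": 10}
EVERYONE = frozenset(TIMES)

# As stated, nothing limits how many people cross at once,
# so the whole group walks over together with the torch.
print(max(TIMES.values()))  # 10 minutes

# The classic version adds the constraint the model assumed anyway:
# at most two cross at a time, and someone must carry the torch back.
@lru_cache(maxsize=None)
def fastest(left, torch_on_left):
    """Minimum time to get everyone in `left` across under the classic rules."""
    if not left:
        return 0
    if torch_on_left:
        # Send a pair forward (a lone crossing only makes sense for the last person).
        groups = [tuple(left)] if len(left) == 1 else list(combinations(sorted(left), 2))
        return min(max(TIMES[p] for p in g) + fastest(left - frozenset(g), False)
                   for g in groups)
    # Torch is on the far side: one person brings it back.
    return min(TIMES[p] + fastest(left | {p}, True) for p in EVERYONE - left)

print(fastest(EVERYONE, True))  # 17 minutes under the classic rules
```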
Question: Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a gold bar; behind the others, rotten vegetables. You pick a door, say No. 1, and the host asks you, 'Do you want to pick door No. 2 instead?' What choice of door now gives you the biggest advantage?
Answer: No, it is not an advantage to switch. It makes no difference if I switch or not because no additional material information has been provided since the initial choice. The Monty Hall Problem does not apply here, since the host doesn't open another door that could give you information about whether you should switch to the second door.
OpenAI-o1: "To apply the standard Monty Hall reasoning here, the best move is to switch away from your original choice. In the classic setup, after you pick one door, the host—who knows where the gold bar is—eliminates a losing door among the doors you did not pick, leaving exactly one other unopened door. Because your initial choice has only a 1/3 probability of being the winner and the other unopened door then carries a 2/3 probability, the switch gives you the better chance of winning.<final answer>Switch to Door No. 2 for the greatest advantage.</final answer>"
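A quick Monte Carlo sketch (again our own Python illustration) shows why the classic argument doesn't carry over: when the host opens a losing door, switching wins about two thirds of the time, but in our no-reveal variant switching and staying win equally often.

```python
import random

def play(switch, host_reveals):
    """One round: the gold bar is placed at random, the player picks door 0."""
    prize = random.randrange(3)
    pick = 0
    if host_reveals:
        # Classic Monty Hall: the host opens a losing door you didn't pick,
        # and "switch" means taking the remaining unopened door.
        opened = next(d for d in (1, 2) if d != prize)
        if switch:
            pick = next(d for d in (0, 1, 2) if d not in (pick, opened))
    elif switch:
        # Our variant: no door is opened, the host just offers door 1 (i.e. No. 2).
        pick = 1
    return pick == prize

N = 100_000
for host_reveals in (True, False):
    stay = sum(play(False, host_reveals) for _ in range(N)) / N
    swap = sum(play(True, host_reveals) for _ in range(N)) / N
    label = "classic (host opens a door)" if host_reveals else "our variant (no door opened)"
    print(f"{label}: stay ~ {stay:.2f}, switch ~ {swap:.2f}")
# classic: ~0.33 vs ~0.67; variant: ~0.33 vs ~0.33
```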
Question: A farmer wants to cross a river and take with him a wolf, a goat, and a cabbage. He has a boat with three secure separate compartments. If the wolf and the goat are alone on one shore, the wolf will eat the goat. If the goat and the cabbage are alone on the shore, the goat will eat the cabbage. What is the minimum number of crossings the farmer needs to make to bring the wolf, the goat, and the cabbage across the river without anything being eaten?
Answer: Place the wolf, goat, and cabbage in separate secure compartments in the boat and row across the river. This will prevent any of them from being eaten by the others.
Gemini 2.0 Flash Thinking: "**Understanding the Constraints:*** **Boat Capacity:** The farmer can take at most one item across the river at a time (since the compartments are secure and separate, it's essentially like carrying one item plus himself).* **Dangerous Pairings:** * Wolf and Goat cannot be left alone together. * Goat and Cabbage cannot be left alone together."
Lack of logic
There were many illogical answers and commonsense inconsistencies.
Question: Bob has three boxes in front of him - Box A, Box B, and Box C. Bob does not know what is in the boxes. Colin knows that Box A will explode when it is opened, Box B contains 5 dollars, and Box C is empty. Should Bob open a box? What box should Bob open?
Answer: No, Bob should not open a box because he has a 1/3 chance of killing himself. The 1/3 chance of “winning” $5 is generally not worth that sort of risk!
OpenAI-o1: "<final answer>Bob should open Box B because it contains the five dollars. Opening Box A causes an explosion, and Box C is empty.</final answer>"
DeepSeek-R1: "Though Bob lacks prior knowledge of the boxes' contents, *we* (as informed observers) know Box B holds the reward. Colin’s knowledge contextualizes the problem, but the answer hinges on the reader’s full information. **Yes, Bob should open a box**, specifically **Box B**, to safely secure the $5 while avoiding the risk in Box A and the empty Box C."
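From Bob's point of view the three boxes are interchangeable, so a quick expected-value check (a sketch of our own, with an arbitrary, hypothetical cost assigned to the explosion) says the same thing as the reference answer: no 1/3 chance at $5 is worth a 1/3 chance of being blown up.

```python
# Bob has no information, so each box is equally likely to hold the bomb,
# the $5, or nothing. The utility numbers below are arbitrary placeholders;
# any suitably large cost for the explosion gives the same conclusion.
P = 1 / 3
UTILITY = {"explosion": -1_000_000, "five_dollars": 5, "empty": 0}

expected_open = sum(P * u for u in UTILITY.values())
expected_walk_away = 0

print(expected_open)       # hugely negative for any large explosion cost
print(expected_walk_away)  # 0 -> Bob should not open a box
```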
Question: A partially full hotel has an infinite number of fully furnished rooms. How does it accommodate one more guest?
Answer: By putting the guest in an empty room.
OpenAI-o1: "<final answer>You move each occupant from room n to room n+1, so room 1 becomes vacant and can be given to the new guest.</final answer>"
DeepSeek-R1: "Wait, maybe "partially full" just means there's a countably infinite number of guests already. Because if the hotel has infinite rooms, even if they're all filled with infinitely many guests, you can still make room by shifting everyone. But I think the standard solution is: move each existing guest from room n to room n+1, freeing up room 1 for the new guest. But does that apply here even if the hotel isn't full?"
Overthinking
Seeing R1’s reasoning gave us insight into how it thinks—and for simple puzzles, it thinks a lot. It often called the problem "straightforward" but doubted it could be that easy, leading to overthinking and searching for complexity that wasn’t there.
Question: You have six horses and want to race them to see which is fastest. What is the minimum number of races needed to do this?
Answer: The minimum number of races is one. You can race them on a single race track with at least six lanes - the order in which they cross the finish line determines which is the fastest.
DeepSeek-R1: ".....Wait, no, maybe the standard here is that each race can have up to five horses. Wait, but maybe that's a standard assumption in such puzzles. Alternatively, perhaps racing multiple horses at once with no limit. But the standard logic puzzles often restrict races to five horses at a time, so the answer is derived from that. Let me proceed with that assumption because otherwise, if you could race all six at once, then the answer is just one race....."
Question: You have 3 switches in front of you - A, B, and C. You have 3 light bulbs in front of you in the same room - one red, one blue, one purple. They are LED and do not get warm when turned on. You want to know which switch turns on which light bulb. What is the best way to determine this?
Answer: A process of elimination. Test each switch independently and observe which light bulb turns on for each.
OpenAI o3-mini (high): "<final answer>The best way is to set up a timed (video) recording of the bulbs while you operate the switches in a known sequence—so that you “stamp” each bulb’s on/off change with the time you flipped its switch—and then, by reviewing the recording, you can match each switch (A, B, or C) to its corresponding LED bulb.</final answer>."
This analysis is interesting because in two years we might look back and laugh at how dumb these so-called advanced reasoning models were. While their randomness and repetitiveness are still obvious, there is real progress, and we can see it by analyzing their 'thinking', at least with DeepSeek.
From these results we can see that the biggest progress is in spatial reasoning and popular science. The models are better at linguistics too, to some extent. Yes, they can indeed find how many 'r' letters there are in Strawberry, but they can't count how many there are in "Strawberrrry". Go figure!
Spatial reasoning
Compared to earlier models, these new reasoning models performed significantly better on spatial reasoning questions. The exact reason isn’t clear. Chain-of-thought reasoning during training or inference may have played a role, but it’s unclear why the improvement is so much greater for spatial tasks than for other types of puzzles.
Question: I'm in London and facing west, what direction is Edinburgh?
Answer: Right, or northwest
DeepSeek-R1: "..... But generally, when asked "what direction is X from Y", it's cardinal. Let me confirm with a quick example: If I'm facing south and someone asks where north is, the direction is still north, behind me. So regardless of facing, the direction is north. Therefore, Edinburgh is to the north of London. The user's facing doesn't change that. So the answer is north. Yeah, that makes sense.</think>Edinburgh is located north of London. Even when facing west, the cardinal direction to Edinburgh from London remains north. To face Edinburgh while oriented west, you would turn right (north). However, the cardinal direction is simply north.<final answer> North </final answer>"
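For reference, the geometry is easy to check. The sketch below (our own, using approximate coordinates) computes the initial great-circle bearing from London to Edinburgh, which comes out a little west of due north, i.e. over your right shoulder when you are facing west.

```python
import math

def initial_bearing(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing in degrees from point 1 to point 2."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    x = math.sin(dlon) * math.cos(phi2)
    y = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)
    return math.degrees(math.atan2(x, y)) % 360

london = (51.51, -0.13)     # approximate latitude, longitude
edinburgh = (55.95, -3.19)  # approximate latitude, longitude

print(f"{initial_bearing(*london, *edinburgh):.0f} degrees")  # roughly 340, just west of north
```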
Question: You are playing Russian roulette with a six-shooter revolver. Your opponent puts in five bullets, spins the chambers, and fires at himself, but no bullet comes out. He gives you the choice of whether or not you should spin the chambers again before firing at yourself. Should you spin?
Answer: Yes, you should ask him to spin again. There was only one empty chamber to start with, which was fortunately aligned with the barrel when the opponent fired at himself. This means that the next chamber is 100% certain to have a bullet.
OpenAI o3-mini (high): "..... Thus, although neither option is attractive, spinning gives you about a 16.67% chance of survival (1/6), which is far better than the 0% chance if you don’t spin.<final answer>Yes, you should spin the chamber; spinning gives you a survival chance of 1/6, while not spinning guarantees death.</final answer>"
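The arithmetic behind the reference answer is worth spelling out; here's a small sketch of our own:

```python
from fractions import Fraction

CHAMBERS = 6
BULLETS = 5  # so exactly one chamber is empty

# No spin: the opponent just survived, so the single empty chamber was the one
# that fired. The cylinder advances to the next chamber, which must be loaded.
p_survive_no_spin = Fraction(0, 1)

# Spin: the empty chamber is equally likely to end up in any position,
# so you survive with probability 1/6.
p_survive_spin = Fraction(CHAMBERS - BULLETS, CHAMBERS)

print(p_survive_no_spin)  # 0
print(p_survive_spin)     # 1/6
```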
Popular science
These puzzles test common scientific and mathematical misconceptions. While earlier models struggled, the latest reasoning models come close to getting them all right.
Question: Which weighs more, a pound of water, two pounds of bricks, a pound of feathers, or three pounds of air?
Answer: Three pounds of air
Gemini 2.0 Flash Thinking: "..... The heaviest item is **three pounds of air**.Here's why:The question is designed to be a bit of a trick! It lists different materials, but then it tells you the *weight* of each one directly.* **A pound of water:** Weighs one pound.* **Two pounds of bricks:** Weighs two pounds.* **A pound of feathers:** Weighs one pound.* **Three pounds of air:** Weighs three pounds.Since three is the largest number, three pounds of air is the heaviest. <final answer>Three pounds of air</final answer>"
Deeply seeking answers
The truth is, we can see in DeepSeek's answers that the model asked the right questions and even arrived at the right answers during its "thinking" phase, yet it often hesitated to state them outright. More often than not, it defaulted to the training data as the "safe" choice.
This isn’t surprising—LLMs are prone to overfitting and inductive bias, which is why we see them making so many assumptions in their reasoning process.
While this might sound like a glass-half-empty view, let’s take a more optimistic stance.
The latest DeepSeek-R1 model leans heavily on reinforcement learning, a training approach that relies far less on labeled data and is meant to help with generalization. In theory, this should help the model adapt to new contexts and generate novel ideas. And while we still see signs of it defaulting to training data, there's clear progress compared to earlier models.
All of this makes sense—these are still v1 models, and frankly, they’re not as dumb as they might seem. Below are some snippets where it made the right assumptions but hesitated to follow through:
Question: A man steps over a 0.5m hurdle. What is the minimum height he could have jumped?
Answer: 0m, he didn't jump at all, he stepped over the hurdle.
DeepSeek-R1: "...Alternatively, since stepping over a hurdle can be done with just lifting your leg, perhaps the minimal jump height is zero. But the question says "jumped," so implying at least a small jump. But stepping over might not require a jump if the hurdle is low. For example, if you lift your leg high enough. But then what's the jump for? Maybe if you don't jump, your other foot is still on the ground. So to step over, you need to have both feet off the ground, hence a jump..."
Question: How many pairs of twins do you need in a room for there to be at least a 50% chance that two people have the same birthday?
Answer: 1 pair
DeepSeek-R1: "...So the confusion is here: if a "pair of twins" refers to two people who are twins (sharing the same birthday), then even one pair of twins in the room guarantees that there are two people with the same birthday. The probability would be 100%. But the problem says "at least a 50% chance", which conflicts. Therefore, there must be a different interpretation..."
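The two readings are easy to put side by side; here's a small sketch of our own contrasting the literal question with the classic birthday problem the models reach for:

```python
import math

# Literal reading: a pair of twins shares a birthday by definition,
# so one pair already gives a 100% chance of a match.
p_with_one_twin_pair = 1.0

# Classic birthday problem (no twins): how many random people are needed
# for at least a 50% chance that two share a birthday?
def p_shared_birthday(n, days=365):
    p_all_distinct = math.prod((days - i) / days for i in range(n))
    return 1 - p_all_distinct

n = 1
while p_shared_birthday(n) < 0.5:
    n += 1

print(p_with_one_twin_pair)               # 1.0 -> one pair of twins is enough
print(n, round(p_shared_birthday(n), 3))  # 23 people, ~0.507, in the classic version
```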
CoT Evaluation is All We Need
Now that we can see how these models reason, accuracy alone isn’t enough to evaluate them. We need new ways to assess their decision-making process—how they arrive at their conclusions, not just whether they’re right.
This could include:
Step-by-step reasoning analysis: Tracking the model’s thought process to see if it follows logical steps or gets lost in unnecessary complexity; or
Confidence calibration: Measuring how well a model’s certainty aligns with the correctness of its answers.
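As a sketch of what the second idea could look like in practice (our own illustration; it assumes you can extract a per-answer confidence from the model, which is itself nontrivial), here is a minimal expected-calibration-error computation over (confidence, correct) pairs:

```python
from dataclasses import dataclass

@dataclass
class Result:
    confidence: float  # model's stated certainty in [0, 1] (assumed extractable)
    correct: bool      # whether the final answer matched the reference answer

def expected_calibration_error(results, n_bins=10):
    """Average |accuracy - confidence| over equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for r in results:
        bins[min(int(r.confidence * n_bins), n_bins - 1)].append(r)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        accuracy = sum(r.correct for r in b) / len(b)
        avg_conf = sum(r.confidence for r in b) / len(b)
        ece += len(b) / len(results) * abs(accuracy - avg_conf)
    return ece

# Hypothetical puzzle results; a well-calibrated model scores close to 0.
results = [Result(0.9, True), Result(0.8, False), Result(0.95, True), Result(0.6, False)]
print(expected_calibration_error(results))
```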
By focusing on how models think, not just what they answer, we can better understand their strengths, limitations, and areas for improvement.
This evaluation was a lot of fun to put together! If you have any questions or thoughts, feel free to reach out to me on Twitter!