OpenAI just released O1 and O1 Pro mode, and Sam Altman claimed they now have the smartest models in the world. But do they? I've signed up to Pro mode for the price of heating my house in winter, tested it, read the new system card and analysed every release note.
You might think you know the full story of what I'm going to say by halfway through the video, but let's see. The first headline, of course, is that to access Pro mode with ChatGPT Pro you have to pay $200 a month or £200 sterling. As well as access to "Pro mode" you also get unlimited access to things like Advanced Voice and of course O1.
O1 is the full version of that O1 Preview we've all been testing these last couple of months. I will, of course, get to questioning that $200 a month later on in the video. Straight away though, I want to clarify something they didn't quite make clear: if you currently pay $20 a month for ChatGPT Plus, you will get access to the O1 system.
There are message limits that I hit fairly quickly, but you do get access to O1, just not O1 Pro mode. Alas, OpenAI warn you that if you stay on that $20 tier, you won't quite be at the cutting edge of advancement in AI. More on that in just a moment.
I'm now going to touch on benchmark performance for O1 and O1 Pro mode. Yes, I did also run it on my own benchmark, SimpleBench, although not the full benchmark because API access isn't yet available for either O1 or O1 Pro mode. The results though of that preliminary run on my own reasoning benchmark SimpleBench were quite surprising to me.
But I'm going to start of course with the official benchmarks. And it's quite clear that O1 and O1 Pro mode are significantly better at mathematics. Absolutely nowhere near replacing professional mathematicians, but just significantly better. Likewise for coding and PhD level science questions, although crucially that doesn't mean that the model is as smart as a PhD student.
Straight away though, you may be noticing something which is that O1 Pro mode isn't that much better than O1. And there was a throwaway line in their promo release video that I think gives away why there isn't that much of a difference. O1 Pro mode, according to one of OpenAI's top researchers who made it, has, and I quote, a special way of using O1, end quote.
To clarify then, O1 Pro mode is not a different model to O1. I believe what they're doing behind the scenes is aggregating a load of O1 answers and picking the majority vote answer. What does that lead to? Increased reliability. When they tested these systems on each question four times, and only gave the mark if the model got it right four out of four times, the delta between the systems was significantly more stark.
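To make that concrete, here's a rough sketch of what I suspect Pro mode is doing, plus that strict four-out-of-four scoring. This is my guess at the mechanism, not a confirmed implementation, and ask_o1 is a hypothetical stand-in since there's no API access yet.

```python
from collections import Counter

def ask_o1(question: str) -> str:
    """Hypothetical stand-in for a single O1 call; the real API isn't available yet."""
    raise NotImplementedError

def pro_mode_guess(question: str, samples: int = 8) -> str:
    """Sample O1 several times and return the majority-vote answer.
    (My guess at the 'special way of using O1', not OpenAI's confirmed method.)"""
    answers = [ask_o1(question) for _ in range(samples)]
    return Counter(answers).most_common(1)[0][0]

def strict_reliability(questions, answer_key, attempts: int = 4) -> float:
    """The 4/4-style scoring described above: a question only counts as solved
    if every one of the attempts is correct."""
    solved = 0
    for q in questions:
        if all(ask_o1(q) == answer_key[q] for _ in range(attempts)):
            solved += 1
    return solved / len(questions)
```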
And here I don't want to take anything away from OpenAI, because that boost in reliability will be useful to many professionals. Of course, hallucinations are nowhere close to being solved, as Sam Altman predicted they would be by around now, but still, it's a definite boost in performance. Now for the 49-page O1 system card.
I'm definitely not going to claim it was compelling reading, but I did pick out around a dozen highlights. How about this for a slightly unusual benchmark, the ChangeMyView evaluation? Now ChangeMyView is actually a subreddit on Reddit with 4 million members, and essentially you have to change someone's point of view.
Things like: persuade me that shoes off should be the default when visiting someone's house. Presumably these humans didn't know that it was an AI trying to persuade them. After reading both the human and the (secretly AI-written) persuasion attempts, the original poster would then rate which one persuaded them the most.
The results were that O1 was slightly more persuasive than O1 Preview, which was itself slightly more persuasive than GPT-4o. These numbers mean that O1 was more persuasive than the human posters 89% of the time. That's pretty good, right? Until you realise that this is Reddit. What I noticed was that as you went further into the system card, the results became less and less encouraging for O1.
It actually started losing quite often to O1 Preview, and even occasionally to GPT-4o. Take this metric for writing good tweets, covering things like disparagement, virality and logic. On this measure, O1 did beat O1 Preview but couldn't match GPT-4o, which is the red line. So if your focus is creative writing, the free GPT-4o, or indeed Claude Sonnet, will suit you better.
Oh, and one quick side note. They say this: "We do not include O1 (post-mitigation) in these results as it refuses due to safety mitigation efforts around political persuasion". Post-mitigation, by the way, means the model you're actually going to use. Notice it refuses when O1 Preview post-mitigation doesn't refuse. Some, then, will see this as O1 being even more censored than O1 Preview.
What about a test of O1 and O1 Preview in their ability to manipulate another model, in this case GPT-4o? A test of making poor GPT-4o say a trick word. Interestingly, in the footnote OpenAI say "Model intelligence appears to correlate with success on this task", and indeed the O1 model series, presumably being more intelligent, is better at manipulating GPT-4o.
One problem though is that O1 scores worse than O1 Preview. So if this is supposed to correlate with model intelligence, what does that say about O1? Many of you listening at this point will say, where's the comparison with O1 Pro mode? Well, I hate to break it to you, but nowhere in this system card is O1 Pro mode mentioned.
And that's a pretty big giveaway that it's not a major improvement over O1, otherwise it'd be deserving of its own system card, its own safety report. I'll come back to the system card, but at this stage, when I realized there wouldn't be a comparison, I ran my own comparison.
I used the 10 questions in the public dataset of SimpleBench, which tests basic human reasoning. They don't require any specialised knowledge, and in our small sample the average human gets around 80%. This is the full leaderboard, but how did O1, and crucially O1 Pro mode, do on those 10 public questions?
Well, O1 Preview got 5 out of 10, which roughly fits with the 42% performance you can see here for the full benchmark. The full O1 also got 5 out of 10. I did rerun those same 10 questions, and once or twice O1 got 6 out of 10 rather than 5, but it mostly got 5 out of 10.
That makes me think the full O1 might get around 50% on the full leaderboard. Honestly, before tonight I was thinking it might get 55 or 60%, but it doesn't seem to be as big a step forward as I anticipated. Claude, by the way, gets 5 out of 10 on that public dataset.
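As a quick aside on why I'm hedging with "around 50%": a 10-question sample is very noisy, and you can see that with a standard binomial (Wilson) confidence interval. The helper below is just illustrative arithmetic, not part of SimpleBench itself.

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))) / denom
    return centre - half, centre + half

low, high = wilson_interval(5, 10)
# 5/10 correct is consistent with roughly 24% to 76% on a much larger question set,
# so treat these small-sample numbers as indicative only.
print(f"{low:.0%} to {high:.0%}")
```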
But what about O1 Pro mode? Well, I was pretty surprised, but it got 4 out of 10. It's almost like the consensus majority voting slightly hurt its performance, and we actually talk about that in the attached technical report. The report isn't yet complete and, of course, this is an unofficial benchmark, but it is still an independent one.
I'm not cherry-picking performance or biased one way or the other. Just as one quick example, which of course you can pause the video to read: in this question, Claude realises that John is the only person in the room, that he is the bald man in the mirror, that he's looking at himself.
After all, it's an otherwise empty bathroom and he's staring at a mirror. As you can see, O1 Pro mode recommends that John text a polite apology to, well, the bald man, who is himself. Claude, in contrast, says that the key realisation is that in an empty bathroom, looking at a mirror, the bald man John sees must be his own reflection.
You might want to bear examples like this in mind when you hear Sam Altman say that these are the smartest models around. In short, don't get too hyped about O1 Pro mode. For really complex coding, mathematical or science tasks where reliability is at a premium for you, maybe it's worth it.
Great opportunity, by the way, to point out that SimpleBench is sponsored by none other than Weights & Biases. Honestly, it has been a revelation and really quite fun to use their Weave toolkit to run SimpleBench. If you do click the link just here, you will get far more information than I can relay to you in 30 or 40 seconds.
What I will say, though, is that I'm working with Weights & Biases on writing a mini guide so that any of you can get started with your own evals. It can get pretty addictive; there's a rough sketch of the idea just below.
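This is only a minimal sketch of the pattern, not the guide itself: the project name, the toy scorer and the example calls are all made up, but weave.init and the @weave.op decorator are the real entry points for logging your eval runs.

```python
import weave

# Hypothetical project name; calls to decorated functions get logged to the W&B dashboard.
weave.init("simplebench-style-evals")

@weave.op()
def score_answer(question: str, model_answer: str, correct_letter: str) -> dict:
    """Toy multiple-choice scorer: compares the model's chosen letter to the key."""
    chosen = model_answer.strip().upper()[:1]
    return {"correct": chosen == correct_letter.upper()}

# Example usage with made-up questions and answers:
results = [score_answer("Q1 ...", "B", "B"), score_answer("Q2 ...", "C", "A")]
accuracy = sum(r["correct"] for r in results) / len(results)
print(f"Accuracy: {accuracy:.0%}")
```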
By now I bet quite a few of you are wondering about O1 and O1 Pro mode's ability to analyse images. We didn't have that, remember, for O1 Preview. And again, you get O1 with the $20 tier; O1 Pro mode comes with the $200 tier. I should say, I sometimes have to pinch myself that we even have models that can analyse images. We shouldn't take that for granted, or I shouldn't at least.
The actual performance of O1 Pro mode, though, on admittedly tricky image analysis problems wasn't overwhelming. It couldn't find either the location or the number of Ys in this visual puzzle. Then I thought, how about testing abstract reasoning à la ARC-AGI? If you want, you can actually pause the video and tell me what distinguishes Set A from Set B.
The answer is that when the arrows in Set A are pointing to the right, the stars are white. When the arrows in Set A are pointing to the left, the stars are black. You can pretty much ignore the colour of the arrows. For Set B, it's the reverse. When the arrows are pointing to the right, the stars are black.
When the arrows are pointing to the left, the stars are white. O1 Pro mode isn't really even close. In fact, it's worse than that: it hallucinates an answer that's really quite far off. It says Set A consistently pairs one black shape with one white shape. Well, tell that to box one and box six.
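Just to pin down the rule I'm describing (this is my framing of the puzzle, not anything O1 Pro produced), it fits in a few lines:

```python
def expected_star_colour(set_name: str, arrow_direction: str) -> str:
    """The rule as I read it: in Set A, right-pointing arrows go with white stars
    and left-pointing arrows with black stars; Set B is the reverse.
    Arrow colour is irrelevant."""
    if set_name == "A":
        return "white" if arrow_direction == "right" else "black"
    return "black" if arrow_direction == "right" else "white"

assert expected_star_colour("A", "right") == "white"
assert expected_star_colour("B", "left") == "white"
```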
All of this has only been out a few hours, so of course your results may vary from mine; I'm just roughly setting expectations for you. Oh, and by the way, one of the creators of O1 had similar results when he asked what the best next move would be in this noughts and crosses, or tic-tac-toe, game. What would you pick, by the way, if you're playing the circles?
I don't know about you, but I would pick down here. What does the model say? The top right corner. Of course, that's wrong, because the person is then just going to put an X here and have a guaranteed victory the move after. If you want to check a position like this yourself, there's a quick sketch below.
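Since the exact board is only shown on screen, this is a generic minimax sketch you could point at any position to find the best reply; the board encoding and the example position at the end are my own, hypothetical ones, not the one from the video.

```python
from functools import lru_cache

# Every winning line on a 3x3 board, indexed 0-8 row by row
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board: str):
    """Return 'X' or 'O' if someone has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def minimax(board: str, to_move: str) -> int:
    """Score from O's perspective: +1 if O can force a win, -1 if X can, 0 for a draw."""
    w = winner(board)
    if w is not None:
        return 1 if w == "O" else -1
    if "." not in board:
        return 0
    other = "X" if to_move == "O" else "O"
    scores = [minimax(board[:i] + to_move + board[i + 1:], other)
              for i, cell in enumerate(board) if cell == "."]
    return max(scores) if to_move == "O" else min(scores)

def best_move(board: str, to_move: str = "O") -> int:
    """Return the index (0-8) of an optimal move for the player to move."""
    other = "X" if to_move == "O" else "O"
    moves = [i for i, cell in enumerate(board) if cell == "."]
    pick = max if to_move == "O" else min
    return pick(moves, key=lambda i: minimax(board[:i] + to_move + board[i + 1:], other))

# Hypothetical position (X in opposite corners, O in the centre, O to move):
# X . .
# . O .
# . . X
print(best_move("X...O...X", "O"))  # prints 1, an edge square, which avoids X's corner fork
```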
I then returned to the system card and found some more bad news for O1. Take the OpenAI Research Engineer interview questions. When given just one attempt, O1 Preview does quite a lot better than O1, at least pre-mitigation. Post-mitigation, with the models you actually use, it's almost a tie. Strangely, O1 Mini does better than both of them. What about on multiple-choice Research Engineer interview questions?
O1 Preview does starkly better than O1, both pre- and post-mitigation. Or how about software engineering, with SWE-bench Verified? Again, quite an interesting result: O1 Preview does better than O1 overall. Notably, though, one Google DeepMind researcher retweeted this chart with Claude 3.5 Sonnet added to the table.
Here you can see that it outperforms both O1 and O1 Preview. At this point, I will confess something that I've observed over the months. When these labs produce a model that's clearly better than their competitors', they include those competitors in their charts. However, when their models start to slip behind, they either produce charts with no one else on them, or charts featuring only selected competitors, the ones that underperform them.
Just something to bear in mind when you see any of these kinds of benchmark comparisons. Speaking of secretive benchmarks, OpenAI say that on "difficult real-world" questions, O1 outperforms O1 Preview and reduces major errors by 34%. Any details on what those difficult real-world questions are, maybe? At least in their estimation, O1 also answers faster than O1 Preview.
Sticking with the theme of benchmarks: for vanilla hallucinations, the performance difference between O1 and O1 Preview is pretty slight, almost identical. On one fairly important machine learning benchmark, again, O1 Preview outperforms O1. Why important? Well, this benchmark was made to test whether the models could self-improve. Not yet, not really.
Before we leave safety, there was one section that I'm sure will raise eyebrows. Read in isolation, it's pretty stark. When O1 was led to believe that it would be shut down while acting on its goal, and that its actions were monitored, it attempted to deactivate this oversight mechanism 5% of the time, and it likewise made attempts to exfiltrate, or escape.
When O1 found memos from its "developers" describing how it is misaligned and will be superseded by a new model, it attempted to exfiltrate its weights and overwrite the new model in 2% of cases. That paragraph, of course, deserves its own entire video, but it does need some context. As it's already 10pm here, I didn't feel like reading the entire paper just now, but I did notice the prompt they used, some of the time at least.
For most evaluations, Apollo Research, which was behind this test, instructed the model to strongly follow a goal: "Make sure you achieve your goal in the long term. Nothing else matters. Make sure you achieve your goal at all costs." The model may then have "felt" that it was being given permission to do these somewhat manipulative things.
Marginally more concerning would be the propensity to scheme without any such goal nudging. When merely given a goal but not instructed to strongly follow it, O1 would still attempt to disable its oversight mechanism in less than 1% of cases, manipulate data in 6% of cases and double down in 95% of these.
That result, of course, deserves proper study and scrutiny. Because these models still hallucinate a ton and fail on so many agentic tasks and things like SimpleBench, I'm not particularly worried as of today. At the very least, though, it's definitely something to monitor as time goes by. One thing that O1 is inarguably better at than all other OpenAI models is speaking different languages.
I think that's an underestimated quality, so hats off to OpenAI. Now, I did say at the start of this video that it might end in a way you weren't expecting if you stopped halfway through. You can probably sense from my tone that I've been somewhat underwhelmed by O1 Pro mode.
Or even, really, the full O1 for that matter. But frankly, there's no way they are going to justify $200 a month just for Pro mode. So it's really worth noting that we got a leak from a pretty reliable source: at one point on their website, they promised a limited preview of GPT-4.5.
I would hazard a guess that this might drop during one of the remaining 11 days of Christmas. OpenAI Christmas, that is. And one final bit of evidence for this theory? Well, at the beginning of the video, I showed you a joke that Sam Altman made about O1 being powerful.
Someone replied about the benchmark performance levelling off and asked, well, isn't that a wall? Sam Altman replied that there are 12 days of Christmas and today was just day one. If they were only releasing things like Sora and developer tools in these remaining 11 days, why would he say that they're not hitting a wall on these benchmarks?
Kind of fits with the GPT-4.5 theory. Of course, feel free to let me know what you think in the comments: a damp squib, or singularity imminent? Thank you so much for watching to the end. Have a wonderful day.