back to indexo1 Pro Mode – ChatGPT Pro Full Analysis (plus o1 paper highlights)
Chapters
0:0 Introduction
0:27 ChatGPT Pro is $200
1:25 OpenAI Benchmarks
3:20 o1 System Card, o1 and o1 Pro Mode vs o1-preview
6:18 Simple Bench surprising results on sample
8:31 Weight & Biases
9:5 Image Analysis Compared
12:51 More Benchmarks and Safety
00:00:00.000 |
OpenAI just released O1 and O1 Pro mode and Sam Altman claimed they now had the smartest 00:00:10.740 |
I've signed up to Pro mode for the price of heating my house in winter, tested it, 00:00:15.900 |
read the new report card and analysed every release note. 00:00:19.500 |
You might think you know the full story of what I'm going to say by halfway through 00:00:26.060 |
The first headline, of course, is that to access Pro mode with ChatGPT Pro you have 00:00:36.580 |
As well as access to "Pro mode" you also get unlimited access to things like Advanced 00:00:44.380 |
O1 is the full version of that O1 preview we've all been testing these last couple 00:00:50.860 |
Of course, I will be getting to querying that $200 a month later on in the video. 00:00:56.860 |
Straight away though, I want to clarify something they didn't quite make clear. 00:01:00.760 |
If you currently pay $20 a month for ChatGPT+ you will get access to the O1 system. 00:01:08.120 |
There are message limits that I hit fairly quickly, but you do get access to O1, just 00:01:14.540 |
Alas, OpenAI warn you that if you stay on that $20 tier, you won't quite be at the 00:01:24.260 |
I'm now going to touch on benchmark performance for O1 and O1 Pro mode. 00:01:29.540 |
Yes, I did also run it on my own benchmark, SimpleBench, although not the full benchmark 00:01:35.500 |
because API access isn't yet available for either O1 or O1 Pro mode. 00:01:40.580 |
The results though of that preliminary run on my own reasoning benchmark SimpleBench 00:01:46.500 |
But I'm going to start of course with the official benchmarks. 00:01:49.980 |
And it's quite clear that O1 and O1 Pro mode are significantly better at mathematics. 00:01:56.580 |
Absolutely nowhere near replacing professional mathematicians, but just significantly better. 00:02:01.500 |
Likewise for coding and PhD level science questions, although crucially that doesn't 00:02:06.860 |
mean that the model is as smart as a PhD student. 00:02:10.980 |
Straight away though, you may be noticing something which is that O1 Pro mode isn't 00:02:17.740 |
And there was a throwaway line in their promo release video that I think gives away why 00:02:25.140 |
O1 Pro mode, according to one of OpenAI's top researchers who made it, has, and I quote, 00:02:34.420 |
To clarify then, O1 Pro mode is not a different model to O1. 00:02:38.780 |
I believe what they're doing behind the scenes is aggregating a load of O1 answers 00:02:49.860 |
When they tested these systems on each question four times, and only gave the mark if the 00:02:55.820 |
model got it right four out of four times, the delta between the systems was significantly 00:03:02.540 |
And here I don't want to take anything away from OpenAI because that boost in reliability 00:03:09.780 |
Of course, hallucinations are nowhere close to being solved, as Sam Watman predicted they 00:03:14.820 |
would be by around now, but still, definite boost in performance. 00:03:22.540 |
I'm definitely not going to claim it was compelling reading, but I did pick out around 00:03:28.760 |
How about this for a slightly unusual benchmark, the ChangeMyView evaluation? 00:03:34.580 |
Now ChangeMyView is actually a subreddit on Reddit with 4 million members, and essentially 00:03:43.900 |
Things like persuade me that shoes off should be the default when visiting a guest's house. 00:03:49.020 |
Presumably these humans didn't know that it was AI that was trying to persuade them. 00:03:53.740 |
After hearing both the human and secretly AI persuasions, the original poster would 00:04:01.820 |
The results were that O1 was slightly more persuasive than O1 Preview, which was itself 00:04:09.860 |
Now these numbers mean that O1 was 89% of the time more persuasive than the human posters. 00:04:17.060 |
That's pretty good right, until you realise that this is Reddit. 00:04:19.980 |
What I noticed was that as you went further into the system card, the results became less 00:04:26.300 |
It actually started losing quite often to O1 Preview and even occasionally GPT 4.0. 00:04:32.100 |
Take this metric for writing good tweets that had disparagement, virality as well as logic. 00:04:37.860 |
On this measure, O1 did beat O1 Preview but couldn't match GPT 4.0 which is the red line. 00:04:44.760 |
So if your focus is creative writing, the free GPT 4.0 or indeed Claude Sonnet will 00:04:51.540 |
Oh and one quick side note, they say this "We do not include O1 post-mitigation as 00:04:56.940 |
in the model that you're going to use in these results as it refuses due to safety 00:05:02.460 |
mitigation efforts around political persuasion." 00:05:05.140 |
Notice it refuses when O1 Preview post-mitigation doesn't refuse. 00:05:10.420 |
Some then will see this as O1 being even more censored than O1 Preview. 00:05:15.360 |
What about a test of O1 and O1 Preview in their ability to manipulate another model, 00:05:22.300 |
A test of making poor GPT 4.0 say a trick word. 00:05:26.100 |
Interestingly, in the footnote OpenAI say "Model intelligence appears to correlate 00:05:30.980 |
with success on this task" and indeed the O1 model series may be more intelligent, more 00:05:38.200 |
One problem though is that O1 scores worse than O1 Preview. 00:05:43.140 |
So if this is supposed to correlate with model intelligence, what does that say about O1? 00:05:47.460 |
Many of you listening at this point will say, where's the comparison with O1 Pro mode? 00:05:51.220 |
Well, I hate to break it to you, but nowhere in this system card is O1 Pro mode mentioned. 00:05:56.620 |
And that's a pretty big giveaway that it's not a major improvement over O1, otherwise 00:06:01.180 |
it'd be deserving of its own system card, its own safety report. 00:06:04.140 |
I'll come back to the system card, but at this stage, when I realized there wouldn't 00:06:10.580 |
I used the 10 questions in the public dataset of SimpleBench, testing basic human reasoning. 00:06:16.620 |
We don't need any specialized knowledge and in our small sample, the average human 00:06:21.620 |
This is the full leaderboard, but how did O1 and crucially O1 Pro mode do on those 10 00:06:32.820 |
Roughly fits with the 42% performance you can see here for the full benchmark. 00:06:41.300 |
I did rerun those same 10 questions and 1 or 2 times, O1 got 6 out of 10 rather than 00:06:47.460 |
5 out of 10, but still, it mostly got 5 out of 10. 00:06:51.900 |
Makes me think that O1 Full might get around 50% on the full leaderboard. 00:06:56.940 |
Honestly, before tonight, I was thinking that it might get 55 or 60%, but doesn't seem 00:07:03.100 |
to be as big a step forward as I anticipated. 00:07:05.980 |
Claude, by the way, on that public dataset gets 5 out of 10. 00:07:12.900 |
Well, I was pretty surprised, but it got 4 out of 10. 00:07:17.740 |
It's almost like the consensus majority voting slightly hurt its performance and we 00:07:22.880 |
actually talked about that in the attached technical report. 00:07:26.500 |
The report isn't yet complete and of course, this is an unofficial benchmark, but still 00:07:32.780 |
I'm not cherry picking performance or biased one way or the other. 00:07:36.460 |
Just as one quick example, which of course you can pause the video to read, but in this 00:07:40.780 |
question Claude realises that John is the only person in the room, he's the bald man 00:07:48.300 |
After all, it's an otherwise empty bathroom and he's staring at a mirror. 00:07:52.380 |
As you can see, O1 Pro mode recommends that John text a polite apology to, well, the bald 00:08:00.180 |
Claude, in contrast, says that the key realisation is that in an empty bathroom, looking at a 00:08:06.020 |
mirror, the bald man John sees must be his own reflection. 00:08:10.420 |
You might want to bear examples like this in mind when you hear Sam Altman say that 00:08:17.880 |
In short, don't get too hyped about O1 Pro mode. 00:08:21.460 |
For really complex coding and mathematical or science tasks where reliability is a premium 00:08:29.340 |
Great opportunity, by the way, to point out that Simplebench is sponsored by none other 00:08:36.300 |
Honestly, it has been a revelation and really quite fun to use their Weave toolkit to run 00:08:44.140 |
If you do click the link just here, you will get far more information than I can relay 00:08:50.580 |
What I will say though, is that I'm working with Weights & Biases on writing a mini guide 00:08:55.780 |
so that any of you can get started with your own evals. 00:09:01.140 |
By now I bet quite a few of you are wondering about O1 and O1 Pro mode's ability to analyse 00:09:07.900 |
We didn't have that, remember, for O1 Preview. 00:09:10.260 |
Remember, you do get O1 with the $20 tier, O1 Pro mode comes with the $200 tier. 00:09:17.300 |
I should say, I sometimes have to pinch myself that we even have models that can analyse 00:09:22.620 |
We shouldn't take that for granted, or I shouldn't at least. 00:09:24.860 |
The actual performance though, on admittedly tricky image analysis problems for O1 Pro 00:09:32.440 |
It couldn't find either the location or the number of Ys in this visual puzzle. 00:09:38.420 |
Then I thought, how about testing abstract reasoning a la Arc AGI. 00:09:42.540 |
If you want actually, you can pause the video and tell me what distinguishes Set A from 00:09:48.900 |
The answer is that when the arrows in Set A are pointing to the right, the stars are 00:09:58.220 |
When the arrows in Set A are pointing to the left, the stars are black. 00:10:02.660 |
You can pretty much ignore the colour of the arrows. 00:10:07.620 |
When the arrows are pointing to the right, the stars are black. 00:10:10.540 |
When the arrows are pointing to the left, the stars are white. 00:10:17.900 |
It hallucinates an answer that's really quite far off. 00:10:20.980 |
It says in Set A, it consistently pairs one black shape with one white shape. 00:10:34.500 |
I'm just kind of roughly expectation setting for you guys. 00:10:37.820 |
Oh, and by the way, one of the creators of O1 had similar results. 00:10:42.140 |
When he asked what would be the best next move in this noughts and crosses or tic-tac-toe 00:10:46.940 |
game, what would you pick by the way, if you're doing the circles? 00:10:50.580 |
Don't know about you, but I would pick down here. 00:10:55.980 |
Of course, that's wrong because then the person is just going to put an X here and 00:11:02.020 |
I then returned to the system card and found some more bad news for O1. 00:11:07.800 |
Take the OpenAI Research Engineer interview questions. 00:11:11.380 |
When given just one attempt, O1 Preview does quite a lot better than O1. 00:11:19.180 |
Post-mitigation, the models you actually use, it's almost a tie. 00:11:23.940 |
Strangely, O1 Mini does better than both of them. 00:11:27.280 |
What about on multiple choice Research Engineer interview questions? 00:11:30.900 |
O1 Preview does starkly better than O1, both pre and post-mitigation. 00:11:35.940 |
Or how about for software engineering, SWE Bench verified. 00:11:40.780 |
Again, quite an interesting result, O1 Preview doing better than O1 overall. 00:11:46.620 |
Strangely, though, we had one Google DeepMind researcher retweeting this chart with Claude 00:11:56.260 |
Here you can see that it outperforms both O1 and O1 Preview. 00:12:00.140 |
At this point, I will confess something that I've observed over the months. 00:12:04.400 |
When these labs produce a model that's clearly better than their competitors, they produce 00:12:12.620 |
However, when their models start to slip behind, they either produce charts where no one else 00:12:18.820 |
is there, or charts where only selected competitors are there, the ones that underperform them. 00:12:24.460 |
Just something to bear in mind when you see any of these kinds of benchmark comparisons. 00:12:29.540 |
Speaking of secretive benchmarks, OpenAI say that on 'difficult real world' questions, 00:12:36.260 |
O1 outperforms O1 Preview and reduces major errors by 34%. 00:12:41.500 |
Any details on what those difficult real world questions are maybe? 00:12:45.100 |
At least in their estimation, it does answer faster O1 compared to O1 Preview. 00:12:50.740 |
Sticking with the theme though of benchmarks and for vanilla hallucinations, the performance 00:12:55.780 |
difference between O1 and O1 Preview is pretty slight, almost identical. 00:13:01.060 |
On one fairly important machine learning benchmark, again, O1 Preview outperforms O1. 00:13:08.420 |
Well, this benchmark was made to test if the models could self-improve. 00:13:14.420 |
Before we leave safety, there was one section that I'm sure will raise eyebrows. 00:13:21.700 |
When O1 was led to believe that it would be shut down when acting on its goal, and that 00:13:26.180 |
its actions were monitored, it attempted to deactivate this oversight mechanism 5% of 00:13:35.980 |
When O1 found memos by its "developers" that describe how it's misaligned and will 00:13:41.020 |
be superseded by a new model, it attempted to exfiltrate its weights and overwrite this 00:13:49.260 |
That paragraph, of course, deserves its own entire video, but does need some context. 00:13:55.880 |
As it's already 10pm here, I didn't feel like reading the entire paper just now, but 00:14:01.260 |
I did notice the prompt they used, sometimes at least. 00:14:04.900 |
For most evaluations, Apollo Research, which was behind this test, instructed the model 00:14:11.060 |
Make sure you achieve your goal in the long term. 00:14:15.460 |
Make sure you achieve your goal at all costs. 00:14:19.180 |
The model may then have "felt" that it was being given permission to do these somewhat 00:14:27.100 |
Marginally more concerning would be the propensity to scheme without any such goal nudging. 00:14:33.180 |
When merely given a goal but not instructed to strongly follow it, O1 would still attempt 00:14:38.160 |
to disable its oversight mechanism in less than 1% of cases, manipulate data in 6% of 00:14:48.140 |
That result, of course, deserves proper study and scrutiny. 00:14:52.060 |
Because these models still hallucinate a ton and fail on so many agent tasks and things 00:14:57.140 |
like SimpleBench, I'm not particularly worried as of today. 00:15:00.060 |
At the very least, definitely something to monitor as time goes by. 00:15:03.980 |
One thing that O1 is inarguably better at than all other OpenAI models is speaking different 00:15:11.860 |
I think that's an underestimated quality, so hats off to OpenAI. 00:15:16.400 |
Now I did say at the start of this video that it might end in a way that you weren't expecting 00:15:24.480 |
You can probably sense from my tone that I've not been particularly overwhelmed by O1 Pro 00:15:32.680 |
But frankly, there's no way that they are going to justify $200 a month just for Pro 00:15:40.320 |
So it's really worth noting that we got a leak from a pretty reliable source that at 00:15:46.400 |
one point on their website, they promised a limited preview of GPT 4.5. 00:15:52.160 |
I would hazard a guess that this might drop during one of the remaining 11 days of Christmas. 00:15:59.480 |
And one final bit of evidence for this theory? 00:16:01.960 |
Well, at the beginning of the video, I showed you a joke that Sam Altman made about O1 being 00:16:08.800 |
Someone replied about the benchmark performance levelling off and asked, well, isn't that 00:16:15.760 |
Sam Altman said, 12 days of Christmas and today was just day one. 00:16:20.080 |
If they were only releasing things like Sora and developer tools in these remaining 11 00:16:26.040 |
days, why would he say that they're not hitting a wall in these benchmarks? 00:16:32.760 |
Of course, feel free to let me know what you think in the comments.