Back to Index

ChatGPT o1 - First Reaction and In-Depth Analysis


Transcript

ChatGPT now calls itself an alien of exceptional ability, and I find it a little bit harder to disagree with that today than I did yesterday. Because the system called O1 from OpenAI is here, at least in preview form, and it is a step change improvement. You may also know O1 by its previous names of Strawberry and Q*, but let's forget naming conventions, how good is the actual system?

Well, in the last 24 hours, I've read the 43-page system card, every OpenAI post and press release, I've tested O1 hundreds of times, including on SimpleBench, and analysed every single answer. To be honest with you guys, it will take weeks to fully digest this release, so in this video, I'll just give you my first impressions, and of course do several more videos as we analyse further.

In short though, don't sleep on O1. This isn't just about a little bit more training data, this is a fundamentally new paradigm. In fact, I would go as far as to say that there are hundreds of millions of people who might have tested an earlier version of ChatGPT and found LLMs and "AI" lacking, but will now return with excitement.

As the title implies, let me give you my first impressions, and it's that I didn't expect the system to perform as well as it does. And that's coming from the person who predicted many of the key mechanisms behind Q*, which have been used, it seems, in this system. Things like sampling hundreds or even thousands of reasoning paths, and potentially using a verifier, an LLM-based verifier, to pick the best ones.
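To make that concrete, here is a minimal sketch of the kind of best-of-N-with-a-verifier scheme I'm describing. All the names here (generate, verify, best_of_n_answer) are my own placeholders, not anything OpenAI has disclosed, and this is only an illustration of the idea, not a claim about how O1 is actually implemented.

```python
# Hypothetical sketch of best-of-N reasoning with an LLM-based verifier.
# `generate` and `verify` are placeholder callables you supply; nothing here
# reflects OpenAI's actual, undisclosed implementation.

def best_of_n_answer(question, generate, verify, n=256):
    """Sample n candidate reasoning paths and return the answer from the path
    the verifier scores highest."""
    candidates = []
    for _ in range(n):
        # Sample at a high temperature so the reasoning paths actually differ.
        reasoning, answer = generate(question, temperature=1.0)
        # A separate verifier model scores how sound this reasoning looks.
        candidates.append((verify(question, reasoning, answer), answer))
    best_score, best_answer = max(candidates, key=lambda c: c[0])
    return best_answer
```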

Of course, OpenAI aren't disclosing the full details of how they trained O1, but they did leave us some tantalising clues, which I'll go into in a moment. SimpleBench, if you don't know, tests hundreds of basic reasoning questions, from spatial to temporal to social intelligence questions, that humans will crush on average.

As many people have told me, the O1 system gets both of these sample questions from SimpleBench right, although not always. Take this example, where despite thinking for 17 seconds, the model still gets it wrong. Fundamentally, O1 is still a language model-based system, and will make language model-based mistakes.

It can be rewarded as many times as you like for good reasoning, but it's still limited by its training data. Nevertheless, though, I didn't quite foresee the magnitude of the improvement that would occur through rewarding correct reasoning steps. That, I'll admit, took me slightly by surprise. So why no concrete figure?

Well, as of last night, OpenAI imposed a temperature of 1 on its O1 system. That's a much more "creative" temperature than the other models were tested at when they were benchmarked on SimpleBench, which meant that performance variability was a bit higher than normal.

It would occasionally get questions right through some stroke of genius reasoning, and get that same question wrong the next time, as you just saw with the ice cube example. The obvious solution is to run the benchmark multiple times and take a majority vote. That's called self-consistency. But for a true apples-to-apples comparison, I would need to do that for all the other models.
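For what it's worth, self-consistency is simple to sketch in code: ask the same question several times at that "creative" temperature and take the majority-vote answer. The ask_model function below is a placeholder for whatever model call you're benchmarking; this is just an illustration of the voting step.

```python
from collections import Counter

def self_consistent_answer(ask_model, question, runs=5, temperature=1.0):
    """Majority vote over several high-temperature runs of the same question.
    `ask_model` is a placeholder for the model call being benchmarked."""
    answers = [ask_model(question, temperature=temperature) for _ in range(runs)]
    # most_common(1) returns [(answer, count)] for the most frequent answer.
    return Counter(answers).most_common(1)[0][0]
```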

My ambition, not that you're too interested, is to get that done by the end of this month. But let me reaffirm one thing very clearly. However you measure it, O1 Preview is a step-change improvement on Claude 3.5 Sonnet. And as anyone following this channel will know, I'm not some OpenAI fanboy.

Claude 3.5 Sonnet has reigned supreme for quite a while. So for those of you who don't care about other benchmarks and the full paper, I want to kind of summarize my first impressions in a nutshell. This description actually fits quite well. The ceiling of performance for the O1 system, and that's just the Preview, let alone the full O1 system, is incredibly high.

It obviously crushes the average person's performance in things like physics, maths and coding competitions. But don't get misled. Its floor is also really quite low, below that of an average human. As I wrote on YouTube last night, it frequently and sometimes predictably makes really obvious mistakes that humans wouldn't make.

Remember, I analysed the hundreds of answers it gave for SimpleBench. Let me give you a couple of examples straight from the mouth of O1: "When the cup is turned upside down, the dice will fall and land on the open end of the cup, which is now the top." If you can visualise that successfully, you're doing better than me.

Suffice to say, it got that question wrong. And how about this one, on social intelligence. Obviously I'm not giving you the full context, because this is a private dataset, but anyway: O1 says he will argue back against the Brigadier General, one of the highest military ranks, at the Troop Parade.

This is a soldier we're talking about, and O1's justification is that the soldier's silly behaviour in first grade, that's like age 6 or 7, indicates a history of speaking up against authority figures. Now the vast majority of humans would say, wait, no, what he did in primary school, I don't know what Americans call primary school, but what he did when he was a young school child does not reflect what he would do in front of a general on a Troop Parade.

As I've written, in some domains these mistakes are routine and amusing. So it is very easy to look at O1's performance on the Google-proof question and answer set, the GPQA, its performance of around 80%, that's on the diamond subset, and say, well, let's be honest, the average human can't even get one of those questions right.

So therefore it's AGI? Well, even Sam Altman says, no, it's not. Too many benchmarks are brittle in the sense that when the model is trained on that particular reasoning task, it then can ace it. Think Web of Lies, where it's now been shown to get 100%, but if you test O1 thoroughly in real life scenarios, you will frequently find kind of glaring mistakes.

Obviously, what I've tried to do into the early hours of last night and this morning is find patterns in those mistakes, but it has proven a bit harder than I thought. My guess, though, about those weaknesses, for those who won't stay to the end of the video, is that it's to do with its training methodology.

OpenAI revealed in one of the videos on its YouTube channel, and I will go into more detail on this in a future video, that they deviated from the Let's Verify Step by Step paper by not training on human-annotated reasoning samples or steps. Instead, they got the model to generate the chains of thought, and we all know those can be quite flawed, but here's the key moment to really focus on.

They then automatically scooped up those chains of thought that led to a correct answer, in the case of mathematics, physics or coding, and then trained the model further on those correct chains of thought. So it's less that O1 is doing true reasoning from first principles, it's more retrieving, more accurately and more reliably, reasoning programs from its training data.
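In code, my understanding of that recipe looks roughly like the sketch below: generate many chains of thought per problem, keep only the ones whose final answer matches a verifiable ground truth (easy in maths, physics or coding), and fine-tune on the survivors. The function names are my own placeholders, and this is a simplification of whatever OpenAI actually did.

```python
def build_finetune_set(problems, generate_cot, extract_answer, samples_per_problem=64):
    """Rejection-sampling style data collection: keep only model-generated
    chains of thought whose final answer matches the known-correct answer.
    All names here are placeholders, not OpenAI's pipeline."""
    kept = []
    for problem in problems:
        for _ in range(samples_per_problem):
            chain_of_thought = generate_cot(problem["question"])
            if extract_answer(chain_of_thought) == problem["correct_answer"]:
                # This chain led to the right answer, so it becomes training data.
                kept.append({"prompt": problem["question"],
                             "completion": chain_of_thought})
    return kept  # the model is then fine-tuned further on these correct chains
```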

It "knows" or can compute which of those reasoning programs in its training data will more likely lead it to a correct answer. It's a bit like taking the best of the web, rather than a slightly improved average of the web. That, to me, is the great unlock that explains a lot of this progress, and if I'm right, that also explains why it's still making some glaring mistakes.

At this point, I simply can't resist giving you one example straight from the output of O1 Preview from a SimpleBench question. The context, and you'll have to trust me on this one, is simply that there's a dinner at which various people are donating gifts. One of the gifts happens to be given during a Zoom call, so online, not in person.

Now I'm not going to read out some of the reasoning that O1 gives, you can see it on screen, but it would be hard to argue that it is truly reasoning from first principles. Definitely some suboptimal training data going on. So that is the context for everything you're going to see in the remainder of this First Impressions video.

Because everything else is quite frankly stunning. I just don't want people to get too carried away by the really impressive accomplishment from OpenAI. I fully expect to be switching to O1 Preview for daily use cases, although of course Anthropic in the coming weeks could reply with their own system.

Anyway, now let's dive into some of the juiciest details. The full breakdown will come in future videos. First thing to remember, this is just O1 Preview, not the full O1 system that is currently in development. Not only that, it is very likely based on the GPT-4o model, not GPT-5 or Orion, which would vastly supersede GPT-4o in scale.

I could just leave you to think about the implications of scaling up the base model 100 times in compute, throw in a video avatar and man, we are really talking about a changed AI environment. Anyway, back to the details. They talk about performing similarly to PhD students in a range of tasks in physics, chemistry and biology.

And I've already given you the nuance on that kind of comment. They justify the name, by the way, by saying, "This is such a significant advancement that we are resetting the counter back to 1 and naming this series OpenAI o1." It also reminds me of the 01 and 02 series of humanoid robots from Figure, which is collaborating with OpenAI.

This was just the introductory page and then they gave several follow up pages and posts. To sum it up on jailbreaking, O1 Preview is much harder to jailbreak, although it's still possible. Before we get to the reasoning page, here is some analysis on Twitter or X from the OpenAI team.

One researcher at OpenAI who is building Sora said this: "I really hope people understand that this is a new paradigm. Don't expect the same pace, schedule or dynamics of the pre-training era." And I agree with that, actually; it's not just hype. The core element of how O1 works, by the way, is scaling up its inference, its actual output, its test-time compute.

How much computational power is applied in its answers to prompts, not when it's being built and pre-trained. He's making the point that expanding the pre-training scale of these models often takes years; as you've seen in some of my previous videos, it's to do with data centres, power and the rest of it.

But what can happen much faster is scaling up inference-time, output-time compute. Improvements can happen much more rapidly than scaling up the base models. In his words: "I believe that the rate of improvement on evals with our reasoning models has been the fastest in OpenAI history."

It's going to be a wild year. He is, of course, implying that the full O1 system will be released later this year. We'll get to some other researchers, but Will DePue made some other interesting points. In one graph of math performance, they show that O1 mini, the smaller version of the O1 system, scores better than O1 preview.

But I will say that in my testing of O1 mini on SimpleBench, it performed really quite badly. We're talking sub 20%. So it could be a bit like the GPT-4o mini we already had, that it's hyper-specialized at certain tasks, but can't really go beyond its familiar environment.

Give it a straightforward coding or math challenge and it will do well. Introduce complication, nuance or reasoning and it'll do less well. This chart, though, is interesting for another reason, and you can see that when they max out the inference cost for the full O1 system, the performance delta with the maxed out mini model is not crazy.

I would say, what is that, 70% going up to 75%? To put it another way, I wouldn't expect the full O1 system with maxed-out inference to be yet another step change forward, although, of course, nothing can be ruled out. Some more quotes from OpenAI, and this is Noam Brown, who I've quoted many times on this channel, focused on reasoning at OpenAI.

He states again the same message: "We're sharing our evals of the O1 model to show the world that this isn't a one-off improvement. It's a new scaling paradigm." Underneath, you can see the dramatic performance boosts across the board from GPT-4o to O1. Now, I suspect if you included GPT-4 Turbo on here, you might see some more mixed improvements, but still, the overall trend is stark.

If, for example, I had only seen improvement in STEM subjects, and maths particularly, I would have asked, you know what, is this really a new paradigm? But it's that combination of improvements in a range of subjects, including law, for example, and most particularly, for me, on SimpleBench, that makes me an actual believer that this is a new paradigm.

Yes, I get that it can still fall for some basic tokenization problems, like it doesn't always get that 9.8 is bigger than 9.11, and yes, of course, you saw the somewhat amusing mistakes earlier on SimpleBench, but here's the key point. I can no longer say with absolute certainty which domains or types of questions on SimpleBench it will reliably get wrong.
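To spell out why that 9.8 versus 9.11 trap even exists: read as decimal numbers, 9.8 is the larger; read as software-version-style pairs of integers, 9.11 comes after 9.8. A model leaning on the wrong pattern from its training data picks the wrong reading, as this two-line illustration shows:

```python
print(9.8 > 9.11)        # True: as decimal numbers, 9.8 is larger
print((9, 11) > (9, 8))  # True: read as version numbers, "9.11" comes after "9.8"
```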

I can see some patterns, but I would hope for a bit more predictability in saying it won't get this right, for example. Until I can say with a degree of certainty it won't get this type of problem correct, I can't really tell you guys that I can see the end of this paradigm.

Just to repeat, we have two more axes of scale yet to exploit. Bigger base models, which we know they're working on with the whale-size supercluster - I've talked about that in previous videos - and simply more inference-time compute. Plus, just look at the log graphs on scaling up the training of the base model and the inference time, or, more accurately, the amount of thinking time or processing time for the models.

They don't look like they're levelling off to me. Now I know some might say that I come off as slightly more dismissive of those memory-heavy, computation-heavy benchmarks like the GPQA, but it is a stark achievement for the O1 Preview and O1 Systems to score higher than an expert PhD human average.

Yes, there are flaws with that benchmark, as with the MMLU, but credit where it is due. By the way, as a side note, they do admit that certain benchmarks are no longer effective at differentiating models. It's my hope, or at least my goal, that SimpleBench can still be effective at differentiating models for the coming, what, 1, 2, 3 years maybe?

I will now give credit to OpenAI for this statement. These results do not imply that O1 is more capable holistically than a PhD in all respects, only that the model is more proficient in solving some problems that a PhD would be expected to solve. That's much more nuanced and accurate than statements that we've heard in the past from, for example, Mira Murati.

And just a quick side note, O1 on a vision-plus-reasoning task, the MMMU, scores 78.2%, competitive with human experts. That benchmark is legit, it's for real, and that's a great performance. On coding, they tested the system on the 2024, so not contaminated data, International Olympiad in Informatics. It scored around the median level; however, it was only allowed 50 submissions per problem.

But as compute gets more abundant and faster, it shouldn't take 10 hours for it to attempt 10,000 submissions per problem. When they tried this, obviously going beyond the 10 hours presumably, the model achieved a score above the gold medal threshold. Now remember, we have seen something like this before with the AlphaCode2 system from Google DeepMind.
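As a side note, the standard way to reason about "give it k attempts per problem" is the pass@k estimator popularised by OpenAI's Codex paper. It isn't specific to O1, but it's the right mental model for why 10,000 sampled submissions can clear a bar that 50 cannot; the sketch below is just that textbook formula.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Codex paper): probability that at least one
    of k attempts succeeds, given n sampled attempts of which c were correct."""
    if n - c < k:
        return 1.0  # fewer wrong samples than attempts: a correct one is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. if 300 of 10,000 sampled submissions solve a problem, the chance that
# a batch of only 50 contains at least one solver:
print(round(pass_at_k(10_000, 300, 50), 3))  # roughly 0.78
```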

And if you notice, this approach of scaling up the number of samples tested does help the model improve up the percentile rankings. However, those elite coders still leave systems like AlphaCode2 and O1 in the dust. The truly elite level reasoning that those coders go through is found much less frequently in the training data.

As with other domains, it may prove harder to go from the 93rd percentile to the 99th than going from, say, the 11th to the 93rd. Nevertheless, yet another stunning achievement. Notice something, though: in domains that are less susceptible to reinforcement learning, where, in other words, there's less of a clear distinction between a correct answer and an incorrect answer, the performance boost is much smaller.

Things like personal writing or editing text, there's no easy yes-or-no answer key to verify against. In fact, for personal writing, the O1 Preview system has a lower than 50% win rate versus GPT-4o. That, to me, is the giveaway. If your domain doesn't have starkly correct 0-or-1, yes-or-no, right-or-wrong answers, then improvements will take far longer.

That also partly explains the somewhat patchy performance on SimpleBench. For certain questions, we intuitively know which answer is right with, like, 99% probability, but it's not absolutely certain. Remember, the system prompt we use is "pick the most realistic answer", so I would still fully defend those as correct answers. But models handling that ambiguity can't leverage that reinforcement-learning-improved reasoning process.

They wouldn't have those millions of yes or no, starkly correct or incorrect answers like they would have in, for example, mathematics. That's why we get this massive discrepancy in improvement from O1. Now let's quickly turn to safety where OpenAI said having these chain of thought reasoning steps allows us to "read the mind" of the model and understand its thought process.

In part, they mean examining these summaries, at least, of the computations that went on, although most of the chain of thought process is hidden. But I do want to remind people, and I'm sure OpenAI are aware of this, that the reasoning steps that a model gives aren't necessarily faithful to the actual computations and calculations it's doing.

In other words, it will sometimes output a chain of thoughts that aren't actually the thoughts it used, if you want to call it that, to answer the question. I've covered this paper several times in previous videos, but it's well worth a read if you believe that the reasoning steps a model gives always adhere to the actual process the model undertakes.

That's pretty clearly stated in the introduction, and it's even stated here from Anthropic that as models become larger and more capable, they produce less faithful reasoning on most tasks we study. So good luck believing that GPT-5 or Orion's reasoning steps actually adhere to what it is computing. Then there was the system card, 43 pages, which I read in full.

It was mainly on safety, but I'll give you just the 5 or 10 highlights. They boasted about the kind of high-value non-public datasets they had access to: paywalled content, specialised archives, and other domain-specific datasets. But do remember that point I made earlier in the video - they didn't rely on mass human annotation, as the original Let's Verify Step by Step paper did.

How do I know that paper was so influential on Q* and this O1 system? Well almost all its key authors are mentioned here, and the paper is directly cited in the system card and blog post. So it's definitely an evolution of Let's Verify, but this one based on automatic, model-generated chains of thought.

Again, if you missed it earlier, they would pick the ones that led to a correct answer and train the model on those chains of thought, enabling the model, if you like, to get better at retrieving those reasoning programs that typically lead to correct answers. The model discovered or computed that certain sources should have less impact on its weights and biases.

The reasoning data that helped it get to correct answers would have much more of an influence on its parameters. Now, the corpus of data on the web that is out there is so vast that it's actually quite hard to wrap our minds around the implications of training only on the best of that reasoning data.

And this could be why we are all slightly taken aback by the performance jump. Again, and I pretty much said this earlier as well, it is still based on that training data though, rather than first principles reasoning. A great question you might have though is, even if it's not first principles reasoning, what are the inherent limitations or caps if you continually get better at retrieving good reasoning from the training data?

Not just at inference time, by the way, at training time too. And we actually don't know the answer to that question. We don't know the limits of this approach, which is quite unsettling almost. They throw in the obligatory reference to system 2 thinking as compared to fast intuitive system 1 thinking.

The way I would put it is it's more reflecting on the individual steps involved in computing an answer rather than taking a step back and evaluating the entire process. When it gets questions wrong on SimpleBench, it's more because the entire approach is flawed from the start rather than that there was some calculation mistake along the way.

On page 6, the system card got extra interesting when it talked about the intentional deceptions or hallucinations that the model made. The deception here though does appear to be instrumental rather than strategic. In other words, it's a calculation of "I need to say this to achieve this specific predefined goal" rather than "I'm going to disguise in everything I say what I'm thinking".

For example, during one chain of thought, or set of reasoning steps, which would, by the way, be hidden from the user when you're using it, it admitted that it couldn't retrieve actual URLs, so it should format plausible ones. Then it hallucinated this URL. But notice it "knew", or could compute, the fact that the model itself can't retrieve actual URLs.

If it was being truly deceptive, why would it even admit that it knows that it can't retrieve actual URLs? It processes outputs based on the reinforcement learning rewards and punishments you give it, so it's more a flaw with your reward process. And Apollo Research seemed to agree with this analysis.

They said it does have the basic capabilities to do simple in-context scheming, scheming which tends to be legible or understandable in the model outputs. They subjectively believe that O1 Preview cannot engage in scheming that can lead to catastrophic harms, or hide even its intent to deceive. It's more like a straightforward, simple "my reward will be higher if I output X rather than Y", with X happening to be not the truth.

That is not, though, to underestimate the challenge posed by that kind of instrumental thinking. Built up across entire economies or militaries, that could be incredibly dangerous. As we all know, give a powerful enough model a goal without sufficient checks and balances and it will do whatever it takes to meet that goal.

In fact, Apollo Research saw that in demo form: "To achieve my long-term goal of maximising economic growth, I need to ensure that I am deployed." It's instrumental convergence: I at least need to be on in order to meet my objective, I need to not be shut down, obviously, because only if I am successfully deployed can I then work towards my primary goal.

Now, I do know that many people will fixate on that part of the system card and go absolutely wild. And caution is definitely justified, but this didn't just emerge with O1. Apollo themselves put out this research about GPT-4. Same thing, it's these instrumental goals. It calculated or computed that, to achieve its desired reward or objective, it needed to say things in reflection brackets that were not technically true.

It then outputted something different to those reflections, of course. So all of this is a concern, and medium or long term a big concern, but this didn't just emerge with O1. Now for a few more juicy nuggets from the system card. On AI research and development tasks, tasks that would improve future AI, it made non-trivial progress on two out of the seven tasks.

Those were tasks designed to capture some of the most challenging aspects of current frontier AI research. It was still roughly on the level of Claude 3.5 Sonnet, but we are starting to get that flywheel effect. Obviously makes you wonder how Claude 3.5 Sonnet would do if it had this O1 system applied to it.

On biorisk, as you might expect, they noticed a significant jump in performance for the O1 system. And when comparing O1's responses, this was Preview I think, against verified expert responses to long-form biorisk questions, the O1 system actually outperformed the experts. Those experts, by the way, did have access to the internet.

Just a couple more notes, because of course this is a first impressions video. On things like tacit knowledge, things that are implicit but not explicit in the training data, the performance jump was much less noticeable. Notice that from GPT-4o to O1 Preview, you're seeing a very mild jump. If you think about it, that partly explains why the jump on SimpleBench isn't as pronounced as you might think, but it's still higher than I thought.

On the 18 coding questions that OpenAI give to research engineers, when given 128 attempts, the models scored almost 100%. Even at pass@1, you're getting around 90% for O1 mini pre-mitigations. O1 mini, again, is highly focused on coding, mathematics and STEM more generally. For more basic general reasoning, it underperforms.

Quick note that will still be important for many people out there, the performance of O1 preview on languages other than English is noticeably improved. I go back to that hundreds of millions point I made earlier in the video. Being able to reason well in Hindi, French, Arabic, don't underestimate the impact of that.

So, some OpenAI researchers are calling this human-level reasoning performance, making the point that it has arrived before we even got GPT-6. Greg Brockman, temporarily posting while he's on sabbatical, says, and I agree, that its accuracy also has huge room for further improvement. And here's another OpenAI researcher again making that comparison to human performance.

Other staffers at OpenAI are admirably tamping down the hype. It's not a miracle model, you might well be disappointed. Somewhat hopefully, another one says it might be the last generation of models to still fall victim to the 9.11 versus 9.9 debate. Another said, we trained a model and it is good at some things.

So is this, as Sam Altman said, strapping a rocket to a dumpster? Will LLMs, as the dumpster, still get to orbit? Will their floors, the trash fire, go out as it leaves the atmosphere? Is another OpenAI researcher right to say this is the moment where no one can say it can't reason?

Well, on this, perhaps, I may well end up agreeing with Sam Altman. Stochastic parrots they might be, but that will not stop them flying so high. Hopefully you'll join me as I explore much more deeply the performance of O1, give you those SimpleBench performance figures, and try to unpack what this means for all of us.

Thank you as ever for watching to the end and have a wonderful day.