The world has had just a few days to process the impact of O1 Preview from OpenAI and I have used that time to read, or in a couple of cases re-read, seven papers that I think help explain what O1 is and what's coming next. I'll also draw on talks released earlier today to back up the claim that I made a few days ago that O1 Preview represents a step change in how models are trained and what they can do.
Of course, I'll also remind you of what they can't yet do, even if they thought about it for the length of this entire video. Here at least is what one top OpenAI researcher thinks. "I didn't expect," he said, "there to be much time where there's two totally different, roughly intelligence-matched, winning on different dimensions, species.
But that seems pretty clearly where we're at." To be honest, I just wanted to use this tweet to set the stage for the special moment we're in in AI. Here is what I think OpenAI did at a very high level with the O1 series of models. As the video progresses, I'll get even more granular with my reasoning and quote paragraphs from three-year-old papers to back it up.
But for those who want a big picture overview, here's what I think they've done. The foundational, original objective of language models is to model language. It's to predict the next word. You can think of that, if you like, as paradigm one. Interesting, but not overly useful. Ask a question and the language model might predict another question to follow it.
So to simplify again, we brought in another objective, paradigm two. We wanted models to be honest, harmless, and helpful. We, or more like a proxy for us, would give rewards to the models when they produced outputs that met those objectives. We started to get answers that weren't just likely from a probability perspective, but also, sometimes, harmless, honest, and helpful.
Enter ChatGPT, and I hear that's doing well. O1, for me at least, represents paradigm three. We want to reward answers that are objectively correct. Not saying that we've forgotten the original objectives, but we've layered another one on top of them. But how did they actually do that?
Well, again, I'm going to give you the one-minute summary and then go into more detail later in the video. Most of us might be aware that you can get models to output what's called a chain of thought. By asking models, for example, to think step-by-step, you can get much longer outputs that have reasoning steps within them.
But that secret's already a few years old, so that's not what is special about O1. So people thought of a brilliant idea that just didn't quite work. How about we feed the model thousands of examples of human step-by-step reasoning? Well, yes, that does work, but it's not really optimal.
It doesn't scale super well. OpenAI realized you could go one step better, so I'm going to hand the mic to them for 30 seconds. Wait, so it's better to train on model-generated chains of thought? But how come they're so often wrong? And what does he mean by reinforcement learning in this context?
Well, how about this? And here, clearly, I'm going to get slightly metaphorical. How about we go up to the model and whisper in its ear, "Get really creative. Don't worry as much about predicting the next word super accurately. I just want really diverse outputs from you." The model, of course, at what's called a temperature of one, is more than happy to get creative and generate loads of diverse chains of thought.
Meanwhile, other researchers must be looking on at these guys thinking, "What are they doing? These are going to be so unreliable." But then what if we had a way, preferably automatically, of grading those outputs? Then even many of you might agree, "Well, some of those outputs are going to be good." Especially with more and more time spent thinking, longer and longer chains of thought.
Doesn't matter how low a proportion of the outputs are correct, as long as we get at least one or a few. Then out of those thousands of outputs, we can take those that work, those that produce the correct answer in mathematics, science, coding. We take that answer and we fine-tune the model on those correct answers with correct reasoning steps.
That's how this becomes reinforcement learning. Only the best outputs are making it through to the next round and being used to further train the model. And because you're only fine-tuning or further training on those correct outputs with correct reasoning steps, this process is highly data efficient. Unlike training on the web, where it might include one of your random Reddit comments, which were awful by the way, or a tweet you did a few years ago, this is golden data.
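If you like seeing things as code, here is a minimal sketch of that loop in Python. To be clear, this is my reconstruction of the idea, not OpenAI's actual pipeline: `sample_chain_of_thought`, `extract_final_answer` and the fine-tuning call at the end are all hypothetical stand-ins.

```python
from typing import Callable

def collect_golden_data(
    sample_chain_of_thought: Callable[[str, float], str],  # hypothetical model call: (question, temperature) -> chain of thought
    extract_final_answer: Callable[[str], str],            # hypothetical parser: pull the final answer out of a chain
    problems: list[tuple[str, str]],                        # (question, known-correct answer) pairs with gradable answers
    samples_per_problem: int = 64,
) -> list[tuple[str, str]]:
    """Keep only the chains of thought whose final answer checks out."""
    golden: list[tuple[str, str]] = []
    for question, correct_answer in problems:
        for _ in range(samples_per_problem):
            chain = sample_chain_of_thought(question, 1.0)      # temperature 1: diverse, "creative" sampling
            if extract_final_answer(chain) == correct_answer:   # automatic grading
                golden.append((question, chain))                # golden data: correct answer plus its reasoning steps
    return golden

# The kept (question, chain) pairs would then be used to further train the same
# model, e.g. fine_tune(model, golden), and the whole loop can be repeated.
```

The only point of the sketch is the shape of the loop: creative sampling, automatic grading, then further training on the survivors.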
So notice then how it's a marriage of train time compute, the fine-tuning or training of a model, and what's called test time compute, that thinking time. Test time, if you weren't clear, is when the model is actually outputting something, not when it's being trained. We already knew that giving the models the time to produce what's called serial calculations, one after another, after another, before producing their final output would boost results.
That kind of makes sense, right? Especially in technical domains. And that's what we've seen. But then marry that with train time compute, training on those correct generations, and then you get these two scaling graphs. For this difficult mathematics competition, more time to think equals better results. But then train or fine-tune the model or generator on correct outputs and reasoning steps, and that also produces a noticeable increase.
And as you may have noticed, neither of those graphs looks like it is particularly leveling off anytime soon. Before I get into those seven juicy papers that I mentioned at the start, I do want to touch on a bigger question that many of you might have, which is, is this reasoning?
Does it count as human-like intelligence? Well, it's definitely not human-like, but it might not ultimately matter. The analogy I came up with was this. Think of a librarian. You're going up to this librarian because you have a question you want answered. The library books here are the model's training data, and the original ChatGPT was a very friendly librarian, but it would often bring you the wrong book.
Or maybe it would bring you the right book that could answer your question, but point to the wrong paragraph within that book. It was clear that ChatGPT was decent as a librarian, but had no idea what it was handing to you. It was pretty easy, if you wanted to, to demonstrate that it wasn't actually intelligent.
The O1 series of models are much better librarians. They've been taking notes on which books successfully answered the questions that guests had and which ones didn't, down to the level not just of the book, but the chapter, the paragraph, and the line. We are still left though, of course, with the fundamental question, that the librarian doesn't actually understand what it's presenting.
This though, is when things get philosophically murky. Does it ultimately matter in the end? We don't even understand how the human brain works. Frankly, I'm going to leave this one to you guys, and you can let me know what you think in the comments. But one thing is clear from this metaphor, which is, if you ask a question about something that's not in the model's training data, that's not in the library, then it doesn't matter what you think, the librarian will screw up.
By the way, the librarian is exceptionally unlikely to say, "I don't know," and instead will bring you an irrelevant book. That weakness, of course, is still incredibly prevalent in O1 Preview. And there is another hurdle that would follow if you agree with this analysis, not just a lack of training data.
What about domains that have plenty of training data, but no clearly correct or incorrect answers? Then you would have no way of sifting through all of those chains of thought and fine-tuning on the correct ones. Compared to the original GPT-4o, in domains that largely do have correct and incorrect answers, you can see the performance boost.
In areas with harder to distinguish correct or incorrect answers, much less of a boost. In fact, a regress in personal writing. So that's the big picture, but now time for the juicy details and hidden hints we've seen over the last few days. But before that, I hope you'll forgive me for just two minutes about me and the channel.
And when I say that I'm grateful for comments and for watching to the end, I really do mean it. It's an honor to take a small parcel of your time, and I don't expect any support beyond that. When I launched AI Insiders late last year on Patreon, it was my attempt, though, at keeping the channel financially viable, as I had permanently given up running my previous business around mid last year.
I picked a price, frankly, that I thought it was worth, which was $29. And I was honestly moved that people signed up and stayed steady for almost a year. These guys are truly my Spartans, and many of you will be watching. But yes, of course, I read the emails of people saying it was just a bit too much money for them and they couldn't quite afford it.
So nine months on, I have decided to take what you could call a gamble with my entire career and reduce the price significantly from $29 a month to $9 a month all in, or actually with an annual sub discount, $7.56 a month. Now, just quickly to encourage my dedicated supporters to stay on that higher tier, I will be keeping the unfiltered personal podcast exclusive to that original $29 tier.
And to anyone who stays at that tier, I will personally message you with thanks. Also, I do that for every new person joining that tier. But to everyone else for whom $9 a month is viable, let me finish these two quick minutes with a tour. What you get access to is exclusive AI explained videos.
I think there's around 30 of them now, like this one from last night on that "Humanity's Last Exam" benchmark that people were talking about yesterday. Exact same quality that you would expect. And you get explainers like this one on the origins of the term AGI. Obviously people comment as they do on YouTube and on and on.
You can also download each video so you can watch it offline if you want to. For those $9 or $7, you also get access to the Discord, which has evolved a lot since it started. Now has live meetups, a new book club, and of course, general discussion. If you go to the introductions page, you can see the caliber of the kind of people who join on Discord.
Some people, of course, won't care about any of that and will just want to support hype-free AI journalism in a landscape that, as I wrote, increasingly needs it. Totally understand, by the way, if $9 is too much, I am just super grateful for you watching. Back to O1 though, and one thing you might have noticed is we can't actually see those chains of thought.
If you've used O1, you do for sure see a summary and of course the output, but not the true chains of thought that led it to the output. OpenAI admits that part of the reason for that is their own competitive advantage, and if you have followed the analysis so far, that would kind of make sense.
Rival labs, especially those that don't care much about terms and conditions, could train on successful chains of thought that were outputted by the O1 series. After all, that is the key ingredient to its success, so it makes sense. But even if we can't see those chains of thought, they have clearly unlocked much better serial calculations.
Imagine you had to square a number multiple times in a row. It's really hard to do that in parallel, isn't it? You kind of need to know the result of the first calculation before you can do the next one. Well, with a really long scratchpad to work things out on, or a chain of thought that's hidden, models get much better at that.
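To make the serial point concrete, here is a toy Python example, nothing to do with O1's internals: repeated squaring can only be done one step at a time, because every step consumes the result of the one before it.

```python
def square_repeatedly(x: int, times: int) -> int:
    """Square x `times` times in a row; each step needs the previous result."""
    for _ in range(times):
        x = x * x  # step k cannot begin until step k-1 has finished
    return x

print(square_repeatedly(3, 4))  # (((3^2)^2)^2)^2 = 3^16 = 43046721
```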
That ability to break down long or confusing questions into a series of small computational steps is why I think that O1 preview gets questions like these correct most of the time, as people have been pointing out to me. Now, I'm very much aware of that fact because I analyzed every single answer that O1 preview gave, as I said in my last video when I benchmarked it initially on SimpleBench.
But that whole thing of just sneaking in a fact amongst many others, as you might see in this question when she eats three other cookies, that's just one small component of SimpleBench. There still remain many, many question categories where it flops badly. Again, because the relevant data is not in its training data.
The librarian can't retrieve a book that's not there. The makers of ARC-AGI, amongst the most popular AI benchmarks, say this: "In summary, O1 represents a paradigm shift from 'memorize the answers' to 'memorize the reasoning'." Remember, it was trained on those reasoning steps that did end up leading to a correct answer, so it's starting to get better at recognizing which kinds of reasoning lead to correct answers in which domain.
Less "do I have that exact fact, that exact answer, in my training data?" and more "do I have the kind of reasoning steps that I think might be appropriate for solving this problem?" But still, as I think I've made clear in this video, if those reasoning steps or facts are not in the training data, if they're not in distribution, it will still fail.
O1 is still not a departure from the broader paradigm of fitting a curve to a distribution, of boosting performance by making everything in-distribution: training on everything, expanding the library. They say we still need new ideas for artificial general intelligence. Another way of putting this is that there doesn't yet exist a foundation model for the physical world.
We don't have those banks and banks of "correct answers" for real-world tasks. And that's partly why models flop on SimpleBench. So you should start to notice a pattern in those questions that O1 Preview is now getting right where GPT-4o couldn't. One of the stars of O1, Noam Brown of OpenAI, gave this example.
It came from the famous Professor Rao that I interviewed on this channel and actually for Insiders. It's about stacking blocks and it's quite confusing at first, but O1 Preview gets it nicely. This by the way was originally given as an example of the kind of problems that LLMs simply can't get right.
But I wouldn't say that stacking blocks is a data sparse kind of domain. It's just that previous models got overwhelmed with the amount of things going on. It just required too many serial calculations and computations and they couldn't do it. And if you want more evidence that training data for better or ill dictates performance, here's an example with O1 Preview.
The surgeon, who is the boy's father, says, "I can't operate on this boy. He's my son." Who is the surgeon to the boy? Remember, he's already been described as the boy's father. But O1 Preview answers that the surgeon is the boy's other father, that the boy has two fathers. As always then, it's worth remembering that exam-style knowledge benchmarks in particular, rather than true reasoning benchmarks, do not equal real-world capabilities.
I now want to count down 10 more interesting facts and bits of background about O1 before I end on where we all go from here. What comes next? First, as you may have gathered, the training of O1 was fundamentally different from GPT-4o. That extra layer of reinforcement learning means that no amount of prompt engineering on the base GPT-4o, no amount of asking for thinking step-by-step, will be able to match its performance.
Next is that O1 Preview and O1 might be piecing together reasoning steps that we haven't pieced together before. They're still "our" reasoning steps, but the model is optimised to piece together those steps that achieve the desired result. As one of the key authors of O1, who I talked about a lot last year, Łukasz Kaiser, said, "When you know the right chain of thought, you can compute anything." And as Andrej Karpathy said two days ago, "Those thoughts don't even need to be legible to us." They might piece together reasoning steps that are translated from other languages, or are in their own made-up language.
Remember that the objective is now clearly to get the right answer, so the models will optimise however they can to do so. And my next point is, as Noam Brown points out, this is exactly what happened with chess. Indeed, the way he says it is, "This is starting to sound a lot like," referring to O1 Mini's performance in the Codeforces contest, "It's starting to sound a lot like the trajectory of chess." Which was what?
Well, I'm obviously oversimplifying here, but Stockfish, the best chess engine, was originally built around human heuristics, hand-crafted functions to evaluate board positions. Obviously it used search to assess way more positions than a human could, but it still had those hand-crafted functions. Well, until July 2023, when Stockfish removed the hand-crafted evaluation and transitioned to a fully neural network-based approach.
Or to put it another way, by crafting its own reasoning steps and being optimised to put them together in the most effective fashion, we may end up with reasoning that we ourselves couldn't have come up with. As long as there is something out there that can grade correct versus incorrect, the performance will keep improving.
For the next series of interesting points, I'm going to draw on a video I made nine months ago. Obviously, I am ridiculously biased, but I think it was absolutely bang on in terms of its predictions about Q*. Honestly, if you've got time, I recommend watching this entire video, but I'm going to pick out a handful of moments where I really called it, but that's not the important bit.
It's where it helps explain that these researchers saw what is coming nine months, 12 months ago. The clues were out there. We just had to put them together. So my fourth interesting point comes from minute 17 of the video, where this approach of emitting chains of thought can be extended into different modalities.
"Multimodal, where the chain of thought is basically a simulation of the world. So it will be multimodality and this ability to generate sequences of things before you give an answer that will resemble much more what we call reasoning." That was a short snippet, but I think contains a crucial detail.
Just like the O1 family of models is scoring dramatically better in physics tests, Sora, the video generation model from OpenAI could get way better at modeling physics in pixels. It could attempt to predict the next pixel with chains of thought and be fine-tuned on those predictions that actually worked, potentially without even needing a data labeling revolution.
A video generation model could learn which sources, which videos from YouTube, for example, depict reality with the most accuracy. As one former Googler and OpenAI member, Geoffrey Irving, said, "You could have a scenario of let's think pixel by pixel." This step change, in other words, doesn't have to be limited to text.
Now for my next point, I have no idea why I delayed it this long into the video, but look at this prediction that I made back in November. I picked out a key paragraph in a paper called "Let's Verify Step by Step" which indicated what OpenAI were working on. Verifying step by step could be like choosing an action.
After all, in the original paper, using test-time compute in this way was described as a kind of search. And in "Let's verify", they hinted at a step forward involving reinforcement learning. They said, "We do not attempt to improve the generator, the model coming up with solutions, with reinforcement learning.
We do not discuss any supervision the generator would receive from the reward model, if trained with RL." And here's the key sentence, "Although fine-tuning the generator with reinforcement learning is a natural next step, it is intentionally not the focus of this work." Is that the follow-up work that they did?
I mean, if that prediction isn't worthy of a like on YouTube, or preferably joining AI Insiders, then I don't know what is. I then went into detail on the 2022 paper that showed how that would be done. In a nutshell, it involves fine-tuning a model on the outputs it generated that happen to work.
Keep going until you generate rationales that get the correct answer, and then fine-tune on all of those rationales. And they say that, "We show that STaR significantly improves performance on multiple datasets compared to a model fine-tuned to directly predict final answers." Does that remind you of Let's Verify? "And performs comparably to fine-tuning a 30x larger state-of-the-art language model."
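For what it's worth, here is a rough sketch of that STaR-style loop as I read the 2022 paper. The helper functions are hypothetical stand-ins, and I've included the paper's "rationalization" trick, where the model is shown the correct answer as a hint after a failed attempt, so that hard problems still yield usable rationales.

```python
from typing import Callable, Optional

def star_iteration(
    generate_rationale: Callable[[str, Optional[str]], tuple[str, str]],  # hypothetical model call: (question, hint) -> (rationale, answer)
    fine_tune: Callable[[list[tuple[str, str, str]]], None],              # hypothetical training step on kept triples
    dataset: list[tuple[str, str]],                                       # (question, gold answer) pairs
) -> None:
    kept: list[tuple[str, str, str]] = []
    for question, gold in dataset:
        rationale, answer = generate_rationale(question, None)   # first try with no hint
        if answer != gold:
            # Rationalization: give the gold answer as a hint and ask for reasoning
            # that reaches it, so the example isn't simply thrown away.
            rationale, answer = generate_rationale(question, gold)
        if answer == gold:
            kept.append((question, rationale, gold))              # only correct rationales survive
    fine_tune(kept)   # then rerun the whole iteration with the improved model
```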
Next, I want to show you all a warning straight from Ilya Sutskever, one of the key authors of this approach. Presumably, he's putting it to work at his Safe Superintelligence company. But he had a warning: "Reinforcement learning is creative." Reinforcement learning has a much more significant challenge. It is creative. Reinforcement learning is actually creative. Every single stunning example of creativity in AI comes from a reinforcement learning system. For example, AlphaZero has invented a whole new way of playing a game that humans have perfected for thousands of years.
It is reinforcement learning that can come up with creative solutions to problems, solutions which we might not be able to understand at all. And so what happens if you do reinforcement learning on long or even medium time horizons when your AI is interacting with the real world, trying to achieve some kind of a beneficial outcome, let's say, as judged by us, but while being very, very, very creative?
This does not mean that this problem is unsolvable, but it means that it is a problem. And it means that some of the more naive approaches will suffer from some unexpected creativity that will make the antics of Sydney seem very modest. Next, I feel I foreshadowed Q*, Strawberry or O1's weakness with spatial reasoning.
I think the development is likely a big step forward for narrow domains like mathematics, but is in no way yet a solution for AGI. The world is still a bit too complex for this to work yet. That desperate need to model the world's complexity and achieve true spatial intelligence is why Fei-Fei Li's startup is already worth $1 billion after just four months.
Now, as I've hinted already in this video, I think OpenAI graded the individual reasoning steps of the generator's outputs, not just whether the overall answer was correct. But for more background on that, it would be easier for me to just play a couple of minutes from that November 2023 video, in which, by the way, I cite a June 2023 video from this channel.
So that's test time compute, but what about let's verify step by step? Well, going back to that original 2021 verifier paper, they said this. The problem they noticed with their approach back in 2021 was that their models were rewarding correct solutions, but sometimes there would be false positives: getting to the correct final answer using flawed reasoning.
They knew this was a problem, and so they worked on it. And then in May of this year, they came out with let's verify step by step. In this paper, by getting a verifier or reward model to focus on the process (the P in PRM) instead of the outcome (the O in ORM), results were far more dramatic.
Next, notice how the graph is continuing to rise. If they just had more, let's say, test time compute, this could continue rising higher. And I actually speculated on that back on June the 1st. That difference of about 10% is more than half of the difference between GPT-3 and GPT-4.
And also, is it me, or is that line continuing to grow? Suggesting that when more compute is available, the difference could be even more stark. Imagine a future where GPT-4 or 5 can sample, say, a trillion (10^12) solutions. So you're beginning to see my hypothesis emerging.
A new and improved let's verify step by step, called Q*, drawing upon enhanced inference time compute to push the graph toward 100%. If you want more details on that process reward model, check out the video I did back then called Double the Performance. But the very short version is that they trained a reward model to notice the individual steps in a reasoning sequence.
That reward model then got very good at spotting erroneous steps. Furthermore, when that model concluded that there were no erroneous steps, as we've seen from the graphs, that was highly indicative of a correct solution. Notice also that sometimes it could pick out such a correct solution, when the original generator, GPT-4, only outputted that correct solution one time in a thousand.
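Putting those pieces together, here is a hedged sketch of how a process reward model might be used to pick one solution out of many samples. `score_step` is a hypothetical stand-in for the trained verifier, and scoring a whole solution as the product of its per-step scores follows how I understand Let's Verify to score solutions; treat the details as assumptions, not OpenAI's actual code.

```python
from typing import Callable

def solution_score(
    score_step: Callable[[str, list[str], str], float],  # hypothetical: P(step is correct | question, previous steps)
    question: str,
    steps: list[str],
) -> float:
    """Score a whole solution as the product of per-step correctness probabilities."""
    score = 1.0
    for i, step in enumerate(steps):
        score *= score_step(question, steps[:i], step)
    return score

def pick_best(
    score_step: Callable[[str, list[str], str], float],
    question: str,
    candidates: list[list[str]],   # N sampled solutions, each a list of reasoning steps
) -> list[str]:
    """Best-of-N selection: return the sampled solution the verifier trusts most."""
    return max(candidates, key=lambda steps: solution_score(score_step, question, steps))
```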
Furthermore, the method somewhat generalized out of distribution, going beyond mathematics to boost performance in chemistry, physics, and other subjects. And Noam Brown, I think, gave a clear hint that verifiers were used in the training of O1. Again, my theory is that only answers for which every reasoning step was correct, and whose final answer was correct, were used to train or fine-tune the O1 family.
But just look at this point he leaves hanging in the air after showing the famous let's verify graph. If you do this with process reward models, where you're verifying every single step with a really good reward model, you're getting an even bigger boost and you're getting up to 78.2%. And you can see it still looks like that number, that line, would go up more if you generated more samples.
That's as big a hint as you're going to get that let's verify was key for O1. And very quickly before I leave let's verify, don't forget that that paper cited work from Google. Some of the other key authors behind and around let's verify have also gone to Anthropic. So it's not like OpenAI will be the only ones working on this.
Yes, they're well ahead, but I could well see one of those other two labs catching up. And do you remember early in this video, I talked about how higher temperature was optimal for generating those creative chains of thought? Well, that was suggested as early as 2021 at OpenAI. From the paper I cited in that November 2023 video, I talked about this paragraph.
"Verification consists of sampling multiple high temperature solutions." And then it goes on about verification. You might be wondering where I'm going with this, but that is why I think the API of the O1 family keeps the temperature at one. I think the model itself was used to generate those chains of thought.
And then that same model was then fine-tuned on those correct solutions. In other words, because the model was trained that way, it's optimal to keep the temperature at one. OpenAI don't actually allow you to change it. Let me know in the comments if you think I've figured out something there.
Anyway, those were the facts. And the White House is certainly taking all of this quite seriously. They were shown Strawberry and O1 earlier this year, and they now describe how promoting and funding AI data center development reflects the importance of these projects to American national security and economic interests.
The government at the very least is a believer, but are we? Well, I have been very impressed by O1 Preview. Let me know if you have been too. Thank you so much for watching. I would love to see you over on AI Insiders. But either way, please do have a wonderful day.