o1 - What is Going On? Why o1 is a 3rd Paradigm of Model + 10 Things You Might Not Know
Chapters
0:00 Intro
1:04 How o1 Works (The 3rd Paradigm)
3:10 We Don’t Need Human Examples (OpenAI)
3:54 How o1 Works (Temp 1 Graded)
6:28 Is This Reasoning?
8:48 Personal Announcement
11:27 Hidden, Serial Thoughts?
13:11 Memorized Reasoning?
15:40 10 Facts
00:00:00.000 |
The world has had just a few days to process the impact of O1 Preview from OpenAI and I have used 00:00:08.480 |
that time to read, or in a couple of cases re-read, seven papers that I think help explain what O1 is 00:00:15.600 |
and what's coming next. I'll also draw on talks released earlier today to back up the claim that 00:00:22.640 |
I made a few days ago that O1 Preview represents a step change in how models are trained and what 00:00:29.200 |
they can do. Of course, I'll also remind you of what they can't yet do, even if they thought about 00:00:34.560 |
it for the length of this entire video. Here at least is what one top OpenAI researcher thinks. 00:00:41.360 |
"I didn't expect," he said, "there to be much time where there's two totally different, roughly 00:00:48.240 |
intelligence-matched, winning on different dimensions, species. But that seems pretty 00:00:53.440 |
clearly where we're at." To be honest, I just wanted to use this tweet to set the stage for 00:00:59.680 |
the special moment we're in in AI. Here is what I think OpenAI did at a very high level with the 00:01:08.000 |
O1 series of models. As the video progresses, I'll get even more granular with my reasoning and quote 00:01:14.800 |
paragraphs from three-year-old papers to back it up. But for those who want a big picture overview, 00:01:21.200 |
here's what I think they've done. The foundational, original objective of language models is to model 00:01:28.960 |
language. It's to predict the next word. You can think of that, if you like, as paradigm one. 00:01:34.640 |
Interesting, but not overly useful. Ask a question and the language model might predict another 00:01:41.440 |
question to follow it. So to simplify again, we brought in another objective, paradigm two. We 00:01:47.360 |
wanted models to be honest, harmless, and helpful. We, or more like a proxy for us, would give 00:01:54.400 |
rewards to the model when it produced outputs that met those objectives. We started to get 00:01:59.920 |
answers that weren't just likely from a probability perspective, but also, sometimes, harmless, 00:02:06.240 |
honest, and helpful. Enter ChatGPT, and I hear that's doing well. O1, for me at least, 00:02:12.000 |
represents paradigm three. We want to reward answers that are objectively correct. Not saying 00:02:18.720 |
that we've forgotten the original objectives, but we've layered another one on top of them. 00:02:23.280 |
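To make that third objective concrete, here is a toy sketch of my own (not OpenAI's code): unlike next-word prediction or a learned preference score, the new signal is a hard check against a verifiable answer.

```python
# Toy illustration (mine, not OpenAI's) of the "paradigm three" signal:
# an answer is rewarded only if it is objectively correct and checkable,
# e.g. the final result of a maths or coding problem.

def correctness_reward(model_answer: str, reference_answer: str) -> float:
    """Return 1.0 only when the final answer can be verified as correct."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

print(correctness_reward("42", "42"))  # 1.0 -> reinforce this output
print(correctness_reward("41", "42"))  # 0.0 -> discard this output
```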
But how did they actually do that? Well, again, I'm going to give you the one-minute summary and 00:02:27.600 |
then go into more detail later in the video. Most of us might be aware that you can get models to 00:02:33.360 |
output what's called a chain of thought. By asking models, for example, to think step-by-step, 00:02:38.640 |
you can get much longer outputs that have reasoning steps within them. But that secret's 00:02:43.760 |
already a few years old, so that's not what is special about O1. So people thought of a brilliant 00:02:49.360 |
idea that just didn't quite work. How about we feed the model thousands of examples of human 00:02:56.000 |
step-by-step reasoning? Well, yes, that does work, but it's not really optimal. It doesn't scale 00:03:02.240 |
super well. OpenAI realized you could go one step better, so I'm going to hand the mic to them for 00:03:08.320 |
30 seconds. Wait, so it's better to train on model-generated chains of thought? But how come 00:03:36.720 |
they're so often wrong? And what does he mean by reinforcement learning in this context? Well, 00:03:41.680 |
how about this? And here, clearly, I'm going to get slightly metaphorical. How about we go up to 00:03:45.840 |
the model and whisper in its ear, "Get really creative. Don't worry as much about predicting 00:03:50.800 |
the next word super accurately. I just want really diverse outputs from you." The model, 00:03:55.600 |
of course, at what's called a temperature of one, is more than happy to get creative and generate 00:04:00.720 |
loads of diverse chains of thought. Meanwhile, other researchers must be looking on at these 00:04:05.600 |
guys thinking, "What are they doing? These are going to be so unreliable." But then what if we 00:04:09.840 |
had a way, preferably automatically, of grading those outputs? Then even many of you might agree, 00:04:15.600 |
"Well, some of those outputs are going to be good." Especially with more and more time spent thinking, 00:04:21.440 |
longer and longer chains of thought. Doesn't matter how low a proportion of the outputs 00:04:25.920 |
are correct, as long as we get at least one or a few. Then out of those thousands of outputs, 00:04:30.800 |
we can take those that work, those that produce the correct answer in mathematics, science, 00:04:35.920 |
coding. We take that answer and we fine-tune the model on those correct answers with correct 00:04:42.320 |
reasoning steps. That's how this becomes reinforcement learning. Only the best outputs 00:04:47.040 |
are making it through to the next round and being used to further train the model. And because you're 00:04:52.880 |
only fine-tuning or further training on those correct outputs with correct reasoning steps, 00:04:59.280 |
this process is highly data efficient. Unlike training on the web, where it might include one 00:05:05.520 |
of your random Reddit comments, which were awful by the way, or a tweet you did a few years ago, 00:05:10.720 |
this is golden data. So notice then how it's a marriage of train time compute, 00:05:17.040 |
the fine-tuning or training of a model, and what's called test time compute, that thinking time. 00:05:22.560 |
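Here is a minimal, self-contained sketch of that recipe as I understand it (my own toy code, not OpenAI's pipeline): sample many diverse chains of thought at temperature 1, grade them automatically against a known answer, and keep only the ones that got it right as fine-tuning data.

```python
import random

random.seed(0)

def noisy_chain_of_thought(a: int, b: int) -> tuple[str, int]:
    """Stand-in for temperature-1 sampling: a step-by-step answer to
    'a + b' whose arithmetic is deliberately unreliable."""
    final = a + b + random.choice([-2, -1, 0, 0, 0, 1, 2])  # often wrong
    cot = f"I need {a} + {b}. Adding them gives {final}. Answer: {final}"
    return cot, final

def grade(final: int, reference: int) -> bool:
    """Automatic grader: here only the final answer is checked."""
    return final == reference

problems = [(3, 4), (10, 7), (25, 17)]
golden_data = []
for a, b in problems:
    for _ in range(50):                        # many diverse samples per problem
        cot, final = noisy_chain_of_thought(a, b)
        if grade(final, a + b):
            golden_data.append(((a, b), cot))  # survivors become training data

print(f"Kept {len(golden_data)} correct chains out of {50 * len(problems)} samples")
# In the real recipe the model would now be fine-tuned on golden_data and the
# loop repeated - that is what gives this the flavour of reinforcement learning.
```

Note that it doesn't matter how low the hit rate is; only the survivors are kept as training data, which is why the resulting dataset is so clean.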
Test time, if you weren't clear, is when the model is actually outputting something, 00:05:26.400 |
not when it's being trained. We already knew that giving the models the time to produce what's 00:05:30.960 |
called serial calculations, one after another, after another, before producing their final output 00:05:36.560 |
would boost results. That kind of makes sense, right? Especially in technical domains. And that's 00:05:40.560 |
what we've seen. But then marry that with train time compute, training on those correct generations, 00:05:47.280 |
and then you get these two scaling graphs. For this difficult mathematics competition, 00:05:52.640 |
more time to think equals better results. But then train or fine-tune the model or generator 00:05:59.840 |
on correct outputs and reasoning steps, and that also produces a noticeable increase. And as you 00:06:06.960 |
may have noticed, neither of those graphs look like they are particularly leveling off anytime 00:06:12.800 |
soon. Before I get into those seven juicy papers that I mentioned at the start, I do want to touch 00:06:18.640 |
on a bigger question that many of you might have, which is, is this reasoning? Does it count as 00:06:24.240 |
human-like intelligence? Well, it's definitely not human-like, but it might not ultimately matter. 00:06:30.800 |
The analogy I came up with was this. Think of a librarian. You're going up to this librarian 00:06:37.120 |
because you have a question you want answered. The library books here are the model's training data, 00:06:42.800 |
and the original ChatGPT was a very friendly librarian, but it would often bring you the wrong 00:06:47.840 |
book. Or maybe it would bring you the right book that could answer your question, but point to the 00:06:52.960 |
wrong paragraph within that book. It was clear that ChatGPT was decent as a librarian, but 00:06:59.040 |
had no idea what it was handing to you. It was pretty easy if you wanted to, to demonstrate that 00:07:04.000 |
it wasn't actually intelligent. The O1 series of models are much better librarians. They've been 00:07:09.200 |
taking notes on what books successfully answered the questions that guests had and which ones 00:07:14.080 |
didn't, down to the level not just of the book, but the chapter, the paragraph, and the line. 00:07:18.880 |
We are still left though, of course, with the fundamental question, but the librarian doesn't 00:07:23.440 |
actually understand what it's presenting. This though, is when things get philosophically murky. 00:07:28.880 |
Does it ultimately matter in the end? We don't even understand how the human brain works. 00:07:33.680 |
Frankly, I'm going to leave this one to you guys, and you can let me know what you think 00:07:37.520 |
in the comments. But one thing is clear from this metaphor, which is, if you ask a question 00:07:43.440 |
about something that's not in the model's training data, that's not in the library, 00:07:47.280 |
then doesn't matter what you think, the librarian will screw up. By the way, the librarian is 00:07:52.640 |
exceptionally unlikely to say, "I don't know," and instead will bring you an irrelevant book. 00:07:57.120 |
That weakness, of course, is still incredibly prevalent in O1 Preview. 00:08:01.520 |
And there is another hurdle that would follow if you agree with this analysis, 00:08:05.760 |
not just a lack of training data. What about domains that have plenty of training data, 00:08:10.320 |
but no clearly correct or incorrect answers? Then you would have no way of sifting through 00:08:16.240 |
all of those chains of thought and fine-tuning on the correct ones. Compared to the original 00:08:21.280 |
GPT-4o in domains with correct and incorrect answers, largely, you can see the performance 00:08:26.880 |
boost. In areas with harder to distinguish correct or incorrect answers, much less of a boost. In 00:08:32.480 |
fact, a regression in personal writing. So that's the big picture, but now time for the juicy details 00:08:39.120 |
and hidden hints we've seen over the last few days. But before that, I hope you'll forgive me 00:08:43.920 |
for just two minutes about me and the channel. And when I say that I'm grateful for comments 00:08:50.480 |
and for watching to the end, I really do mean it. It's an honor to take a small parcel of your time, 00:08:56.880 |
and I don't expect any support beyond that. When I launched AI Insiders late last year on Patreon, 00:09:02.240 |
it was my attempt, though, at keeping the channel financially viable, as I had permanently given up 00:09:08.000 |
running my previous business around mid last year. I picked a price, frankly, that I thought it was 00:09:13.120 |
worth, which was $29. And I was honestly moved that people signed up and stayed steady for almost 00:09:20.480 |
a year. These guys are truly my Spartans, and many of you will be watching. But yes, of course, 00:09:26.000 |
I read the emails of people saying it was just a bit too much money for them and they couldn't 00:09:31.280 |
quite afford it. So nine months on, I have decided to take what you could call a gamble with my 00:09:38.080 |
entire career and reduce the price significantly from $29 a month to $9 a month all in, or actually 00:09:47.440 |
with an annual sub discount, $7.56 a month. Now, just quickly to encourage my dedicated 00:09:53.840 |
supporters to stay on that higher tier, I will be keeping the unfiltered personal podcast exclusive 00:09:59.360 |
to that original $29 tier. And to anyone who stays at that tier, I will personally message you 00:10:04.880 |
with thanks. Also, I do that for every new person joining that tier. But to everyone else for whom 00:10:10.560 |
$9 a month is viable, let me finish these two quick minutes with a tour. What you get access 00:10:16.320 |
to is exclusive AI explained videos. I think there's around 30 of them now, like this one 00:10:21.440 |
from last night on that Humanity's Last Exam benchmark that people were talking about yesterday. 00:10:26.800 |
Exact same quality that you would expect. And you get explainers like this one on the origins of the 00:10:31.520 |
term AGI. Obviously people comment as they do on YouTube and on and on. You can also download each 00:10:37.360 |
video so you can watch it offline if you want to. For those $9 or $7, you also get access to the 00:10:44.720 |
Discord, which has evolved a lot since it started. Now has live meetups, a new book club, and of 00:10:50.640 |
course, general discussion. If you go to the introductions page, you can see the caliber of 00:10:56.000 |
the kind of people who join on Discord. Some people, of course, won't care about any of that 00:11:01.040 |
and will just want to support hype-free AI journalism in a landscape that, as I wrote, increasingly 00:11:07.360 |
needs it. Totally understand, by the way, if $9 is too much, I am just super grateful for you 00:11:12.720 |
watching. Back to O1 though, and one thing you might have noticed is we can't actually see those 00:11:18.480 |
chains of thought. If you've used O1, for sure, you do see a summary and of course the output, 00:11:23.440 |
but not the true chains of thought that led it to the output. OpenAI admits that part of the 00:11:28.800 |
reason for that is their own competitive advantage and if you have followed the analysis so far, 00:11:34.480 |
that would kind of make sense. Rival labs, especially those that don't care much about 00:11:38.240 |
terms and conditions, could train on successful chains of thought that were outputted by the O1 00:11:43.840 |
series. After all, that is the key ingredient to its success, so it makes sense. But even if we 00:11:48.960 |
can't see those chains of thought, they have clearly unlocked much better serial calculations. 00:11:54.720 |
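To make "serial calculation" concrete before the squaring example that follows in the narration, here is a toy illustration of my own: each step needs the previous step's result, so the work cannot be parallelised.

```python
def repeated_square(x: int, times: int) -> int:
    """Square x repeatedly; step n cannot start until step n-1 is done."""
    for _ in range(times):
        x = x * x
    return x

print(repeated_square(3, 4))  # (((3^2)^2)^2)^2 = 3^16 = 43046721
```

A long hidden chain of thought acts like the scratchpad that carries those intermediate results from one step to the next.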
Imagine you had to square a number multiple times in a row. It's really hard to do that in parallel, 00:12:00.000 |
isn't it? You kind of need to know the result of the first calculation before you can do the next 00:12:04.720 |
one. Well, with a really long scratchpad to work things out on, or a chain of thought that's hidden, 00:12:10.960 |
models get much better at that. That ability to break down long or confusing questions into a 00:12:16.880 |
series of small computational steps is why I think that O1 preview gets questions like these correct 00:12:23.280 |
most of the time, as people have been pointing out to me. Now, I'm very much aware of that fact 00:12:27.600 |
because I analyzed every single answer that O1 preview gave, as I said in my last video when I 00:12:32.880 |
benchmarked it initially on SimpleBench. But that whole thing of just sneaking in a fact amongst 00:12:38.480 |
many others, as you might see in this question when she eats three other cookies, that's just 00:12:42.880 |
one small component of SimpleBench. There still remains many, many question categories where it 00:12:47.920 |
flops badly. Again, because the data is not in its training data. The librarian can't retrieve a book 00:12:53.680 |
that's not there. The makers of ARC-AGI, amongst the most popular AI benchmarks, say this. "In 00:12:59.360 |
summary, O1 represents a paradigm shift from 'memorize the answers' to 'memorize the reasoning'. 00:13:06.640 |
Remember, it was trained on those reasoning steps that did end up leading to a correct answer, 00:13:12.640 |
so it's starting to get better at recognizing which kinds of reasoning lead to correct answers 00:13:19.200 |
in which domain. Less, do I have that exact fact, exact answer in my training data? And more, do I 00:13:26.080 |
have the kind of reasoning steps that I think might be appropriate for solving this problem?" 00:13:30.960 |
But still, as I think I've made clear in this video, if those reasoning steps or facts are 00:13:36.080 |
not in the training data, they're not in distribution, it still will fail. O1 is still not 00:13:42.720 |
a departure from the broader paradigm of fitting a curve to a distribution in order to boost 00:13:48.400 |
performance by making everything in distribution, training on everything, expanding the library. 00:13:53.760 |
They say we still need new ideas for artificial general intelligence. Another way of putting this 00:14:00.000 |
is that there doesn't exist a foundation model for the physical world. We don't have those banks 00:14:05.680 |
and banks of "correct answers" for real-world tasks. And that's partly why models flop on 00:14:12.240 |
SimpleBench. So you should start to notice a pattern in those questions that O1 Preview 00:14:18.000 |
is now getting right where GPT-4o couldn't. One of the stars of O1, Noam Brown of OpenAI, 00:16:24.720 |
gave this example. It came from the famous Professor Rao that I interviewed on this 00:14:29.520 |
channel and actually for Insiders. It's about stacking blocks and it's quite confusing at first, 00:14:34.240 |
but O1 Preview gets it nicely. This by the way was originally given as an example of the kind 00:14:39.280 |
of problems that LLMs simply can't get right. But I wouldn't say that stacking blocks is a 00:14:44.160 |
data sparse kind of domain. It's just that previous models got overwhelmed with the amount 00:14:48.800 |
of things going on. It just required too many serial calculations and computations and they 00:14:53.680 |
couldn't do it. And if you want more evidence that training data for better or ill dictates 00:15:00.320 |
performance, here's an example with O1 Preview. The surgeon who is the boy's father says, 00:15:08.480 |
"I can't operate on this boy. He's my son." Who is the surgeon to the boy? Remember? He's been 00:15:14.480 |
described as the boy's father. But the surgeon is the boy's other father. The boy has two fathers. 00:15:20.160 |
As always then, it's worth remembering that exam-style knowledge benchmarks in particular, 00:15:26.320 |
rather than true reasoning benchmarks, do not equal real-world capabilities. 00:15:31.040 |
I now want to count down 10 more interesting facts and bits of background about O1 before I end on 00:15:37.840 |
where we all go from here. What comes next? First, as you may have gathered, the training of O1 was 00:15:43.600 |
fundamentally different from GPT-4o. That extra layer of reinforcement learning means that no 00:15:49.760 |
amount of prompt engineering on the base GPT-4o, no amount of asking for thinking step-by-step, 00:15:55.600 |
will be able to match its performance. Next is that O1 Preview and O1 might be piecing together 00:16:02.720 |
reasoning steps that we haven't pieced together before. They're still "our" reasoning steps, 00:16:09.200 |
but the model is optimised to piece together those steps that achieve the desired result. 00:16:15.200 |
As one of the key authors of O1, who I talked about a lot last year, Łukasz Kaiser said, 00:16:20.880 |
"When you know the right chain of thought, you can compute anything." And as Andrej Karpathy said 00:16:26.560 |
two days ago, "Those thoughts don't even need to be legible to us." They might piece together 00:16:31.360 |
reasoning steps that are translated from other languages, or are in their own made-up language. 00:16:37.120 |
Remember that the objective is now clearly to get the right answer, so the models will optimise 00:16:42.880 |
however they can to do so. And my next point is, as Noam Brown points out, this is exactly what 00:16:48.800 |
happened with chess. Indeed, the way he says it is, "This is starting to sound a lot like," referring 00:16:54.160 |
to O1 Mini's performance in the Codeforces contest, "It's starting to sound a lot like the trajectory 00:16:59.680 |
of chess." Which was what? Well, I'm obviously oversimplifying here, but Stockfish, the best 00:17:04.960 |
chess engine, was built originally with human heuristics, hand-crafted functions to evaluate 00:17:11.600 |
board positions. Obviously it used search to assess way more positions than a human could, 00:17:16.880 |
but it still had those hand-crafted functions. Well, until July 2023, when Stockfish removed the 00:17:23.920 |
hand-crafted evaluation and transitioned to a fully neural network-based approach. Or to put 00:17:29.600 |
it another way, by crafting its own reasoning steps and being optimised to put them together 00:17:34.720 |
in the most effective fashion, we may end up with reasoning that we ourselves couldn't have come up 00:17:39.440 |
with. As long as there is something out there that can grade correct versus incorrect, the performance 00:17:44.880 |
will keep improving. For the next series of interesting points, I'm going to draw on a video 00:17:49.840 |
I made nine months ago. Obviously, I am ridiculously biased, but I think it was absolutely 00:17:57.280 |
bang on in terms of its predictions about Q*. Honestly, if you've got time, I recommend watching 00:18:02.240 |
this entire video, but I'm going to pick out a handful of moments where I really called it, 00:18:07.280 |
but that's not the important bit. It's where it helps explain that these researchers saw what was 00:18:12.880 |
coming nine months, 12 months ago. The clues were out there. We just had to put them together. 00:18:17.680 |
So my fourth interesting point comes from minute 17 of the video, where this approach of emitting 00:18:23.680 |
chains of thought can be extended into different modalities. "Multimodal, where the chain of thought 00:18:29.680 |
is basically a simulation of the world. So it will be multimodality and this ability to generate 00:18:36.000 |
sequences of things before you give an answer that will resemble much more what we call reasoning." 00:18:43.600 |
That was a short snippet, but I think it contains a crucial detail. Just like the O1 family of 00:18:48.640 |
models is scoring dramatically better in physics tests, Sora, the video generation model from OpenAI 00:18:55.040 |
could get way better at modeling physics in pixels. It could attempt to predict the next 00:19:01.040 |
pixel with chains of thought and be fine-tuned on those predictions that actually worked, 00:19:06.320 |
potentially without even needing a data labeling revolution. A video generation model could learn 00:19:11.360 |
which sources, which videos from YouTube, for example, depict reality with the most accuracy. 00:19:16.640 |
As one former Googler and OpenAI member, Geoffrey Irving, said, "You could have a scenario of let's 00:19:22.240 |
think pixel by pixel." This step change, in other words, doesn't have to be limited to text. 00:19:27.760 |
Now for my next point, I have no idea why I delayed it this long into the video, 00:19:32.240 |
but look at this prediction that I made back in November. I picked out a key paragraph in a paper 00:19:37.520 |
called "Let's verify step-by-step" which indicated what OpenAI were working on. "Let's verify step-by-step" 00:19:44.480 |
could be like choosing an action. After all, in the original paper, using test-time compute in this way 00:19:50.000 |
was described as a kind of search. And in "Let's verify", they hinted at a step forward involving 00:19:55.920 |
reinforcement learning. They said, "We do not attempt to improve the generator, the model coming 00:20:00.800 |
up with solutions, with reinforcement learning. We do not discuss any supervision the generator 00:20:06.000 |
would receive from the reward model, if trained with RL." And here's the key sentence, "Although 00:20:11.280 |
fine-tuning the generator with reinforcement learning is a natural next step, it is intentionally 00:20:17.200 |
not the focus of this work." Is that the follow-up work that they did? I mean, if that prediction 00:20:22.400 |
isn't worthy of a like on YouTube, or preferably joining AI Insiders, then I don't know what is. 00:20:28.000 |
I then went into detail on the 2022 paper that showed how that would be done. In a nutshell, 00:20:33.520 |
it involves fine-tuning a model on the outputs it generated that happen to work. Keep going 00:20:39.280 |
until you generate rationales that get the correct answer, and then fine-tune on all of those 00:20:44.640 |
rationales. And they say that, "We show that STaR significantly improves performance on multiple 00:20:49.760 |
datasets compared to a model fine-tuned to directly predict final answers." Does that 00:20:54.560 |
remind you of Let's Verify? "And performs comparably to fine-tuning a 30x larger state-of-the-art 00:21:00.960 |
language model." Next, I want to show you all a warning straight from Ilya Sutskever, one of the 00:21:07.040 |
key authors of this approach. Presumably, he's putting it to work in the safe superintelligence 00:21:12.080 |
company. But he had a warning, "Reinforcement learning is creative." Reinforcement learning 00:21:17.360 |
has a much more significant challenge. It is creative. Reinforcement learning is actually 00:21:26.400 |
creative. Every single stunning example of creativity in AI comes from a reinforcement 00:21:34.480 |
learning system. For example, AlphaZero has invented a whole new way of playing a game that 00:21:41.760 |
humans have perfected for thousands of years. It is reinforcement learning that can come up with 00:21:45.680 |
creative solutions to problems, solutions which we might not be able to understand at all. 00:21:51.600 |
And so what happens if you do reinforcement learning on long or even medium time horizon 00:21:57.680 |
when your AI is interacting with the real world, trying to achieve some kind of a beneficial 00:22:04.320 |
outcome, let's say, as judged by us, but while being very, very, very creative. 00:22:08.720 |
This does not mean that this problem is unsolvable, but it means that it is a problem. 00:22:14.000 |
And it means that some of the more naive approaches will suffer from 00:22:17.440 |
some unexpected creativity that will make the antics of Sydney seem very modest. 00:22:22.720 |
Next, I feel I foreshadowed Q*, Strawberry or O1's weakness with spatial reasoning. 00:22:28.800 |
I think the development is likely a big step forward for narrow domains like mathematics, 00:22:34.240 |
but is in no way yet a solution for AGI. The world is still a bit too complex for this to work yet. 00:22:41.520 |
That desperate need to model the world's complexity and achieve true spatial intelligence 00:22:46.560 |
is why Fei-Fei Li's startup is already worth $1 billion after just four months. 00:22:52.240 |
Now, as I've hinted already in this video, I think OpenAI graded the individual reasoning steps 00:22:57.920 |
of the generator's outputs, not just whether the overall answer was correct. 00:23:02.080 |
But for more background on that, it would be easier for me to just play 00:23:06.000 |
a couple of minutes from that November 2023 video. 00:23:14.080 |
So that's test time compute, but what about let's verify step by step? 00:23:17.760 |
Well, going back to that original 2021 verifier paper, they said this. 00:23:21.760 |
The problem they noticed with their approach back in 2021 was that their 00:23:25.440 |
models were rewarding correct solutions, but sometimes there would be false positives. 00:23:30.320 |
Getting to the correct final answer using flawed reasoning. 00:23:33.840 |
They knew this was a problem, and so they worked on it. 00:23:36.400 |
And then in May of this year, they came out with let's verify step by step. 00:23:41.360 |
In this paper, by getting a verifier or reward model to focus on the process, 00:23:46.560 |
the P, instead of the outcome, the O, results were far more dramatic. 00:23:51.600 |
Next, notice how the graph is continuing to rise. 00:23:55.040 |
If they just had more, let's say, test time compute, this could continue rising higher. 00:24:00.960 |
And I actually speculated on that back on June the 1st. 00:24:04.480 |
That difference of about 10% is more than half of the difference between GPT-3 and GPT-4. 00:24:10.800 |
And also, is it me, or is that line continuing to grow? 00:24:14.320 |
Suggesting that when more compute is available, the difference could be even more stark. 00:24:18.960 |
Imagine a future where GPT-4 or 5 can sample, say, a trillion (10^12) solutions. 00:24:25.360 |
So you're beginning to see my hypothesis emerging. 00:24:27.840 |
A new and improved let's verify step by step, called Q*, 00:24:31.600 |
drawing upon enhanced inference time compute to push the graph toward 100%. 00:24:36.880 |
If you want more details on that process reward model, 00:24:40.160 |
check out the video I did back then called Double the Performance. 00:24:43.760 |
But the very short version is that they trained a reward model 00:24:47.200 |
to notice the individual steps in a reasoning sequence. 00:24:51.360 |
That reward model then got very good at spotting erroneous steps. 00:24:55.760 |
Furthermore, when that model concluded that there were no erroneous steps, 00:24:59.360 |
as we've seen from the graphs, that was highly indicative of a correct solution. 00:25:04.080 |
Notice also that sometimes it could pick out such a correct solution, 00:25:08.240 |
when the original generator, GPT-4, only outputted that correct solution one time in a thousand. 00:25:14.480 |
Furthermore, the method somewhat generalized out of distribution, 00:25:18.400 |
going beyond mathematics to boost performance in chemistry, physics, and other subjects. 00:25:24.000 |
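Here is a rough sketch of what process-based grading and best-of-N selection could look like (my own toy code, loosely following the Let's Verify idea described above; the scoring rule and the numbers are made up purely for illustration).

```python
from typing import List

def solution_score(step_scores: List[float]) -> float:
    """Process-based scoring: a solution is only as strong as its weakest step.
    (Outcome-based scoring would instead look only at the final answer.)"""
    return min(step_scores) if step_scores else 0.0

def best_of_n(candidates: List[List[float]]) -> int:
    """Pick the candidate whose every step looks sound to the reward model."""
    return max(range(len(candidates)), key=lambda i: solution_score(candidates[i]))

# step_scores[i][j] = the process reward model's estimate that step j of
# candidate i is correct (made-up numbers, purely for illustration).
candidate_step_scores = [
    [0.95, 0.40, 0.99],  # one shaky middle step
    [0.90, 0.92, 0.97],  # consistently sound reasoning
    [0.99, 0.99, 0.10],  # confident start, flawed ending
]
print(best_of_n(candidate_step_scores))  # -> 1
```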
And Noam Brown, I think, gave a clear hint that verifiers were used in the training of O1. 00:25:30.480 |
Again, my theory is that only answers for which every reasoning step was correct, 00:25:35.120 |
as well as the final answer, were used to train or fine-tune the O1 family. 00:25:40.400 |
But just look at this point he leaves hanging in the air 00:25:47.760 |
where you're verifying every single step with a really good reward model, 00:25:50.640 |
you're getting an even bigger boost and you're getting up to 78.2%. 00:25:53.440 |
And you can see it still looks like that number, 00:25:55.520 |
that line would go up more if you generated more samples. 00:25:58.160 |
That's as big a hint as you're going to get that let's verify was key for O1. 00:26:04.320 |
And very quickly before I leave let's verify, 00:26:06.880 |
don't forget that that paper cited work from Google. 00:26:10.240 |
Some of the other key authors behind and around let's verify have also gone to Anthropic. 00:26:15.440 |
So it's not like OpenAI will be the only ones working on this. 00:26:19.040 |
Yes, they're well ahead, but I could well see one of those other two labs catching up. 00:26:24.800 |
Remember how I talked about higher temperature being optimal for generating those creative chains of thought? 00:26:30.000 |
Well, that was suggested as early as 2021 at OpenAI. 00:26:34.640 |
From the paper I cited in that November 2023 video: 00:26:40.480 |
"Verification consists of sampling multiple high temperature solutions." 00:26:46.720 |
You might be wondering where I'm going with this, 00:26:49.120 |
but that is why I think the API of the O1 family keeps the temperature at one. 00:26:55.520 |
I think the model itself was used to generate those chains of thought. 00:27:00.320 |
And that same model was then fine-tuned on those correct solutions. 00:27:04.400 |
In other words, because the model was trained that way, 00:27:08.640 |
OpenAI don't actually allow you to change it. 00:27:10.720 |
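For what that looks like in practice, here is a minimal call with the openai Python SDK (v1.x); the fixed-temperature behaviour is as reported at o1's launch and could change, so treat the details as an assumption.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# At launch the o1 models reportedly only accepted the default temperature (1),
# which fits the theory above: the model was trained on its own temperature-1
# chains of thought, so that is the regime it is served in.
response = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": "How many primes are there below 100?"}],
    # temperature=0.2  # <- reportedly rejected for o1 models at launch
)
print(response.choices[0].message.content)
```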
Let me know in the comments if you think I've figured out something there. 00:27:15.440 |
And the White House is certainly taking all of this quite seriously. 00:27:18.960 |
They were shown Strawberry and O1 earlier this year, 00:27:22.640 |
and they now describe how promoting and funding AI data center development 00:27:26.960 |
reflects the importance of these projects to American national security. 00:27:34.080 |
The government at the very least is a believer, but are we? 00:27:37.760 |
Well, I have been very impressed by O1 Preview. 00:27:47.600 |
But either way, please do have a wonderful day.