o1 - What is Going On? Why o1 is a 3rd Paradigm of Model + 10 Things You Might Not Know
Chapters
0:00 Intro
1:04 How o1 Works (The 3rd Paradigm)
3:10 We Don’t Need Human Examples (OpenAI)
3:54 How o1 Works (Temp 1 Graded)
6:28 Is This Reasoning?
8:48 Personal Announcement
11:27 Hidden, Serial Thoughts?
13:11 Memorized Reasoning?
15:40 10 Facts
00:00:00.000 |
The world has had just a few days to process the impact of O1 Preview from OpenAI and I have used 00:00:08.480 |
that time to read, or in a couple of cases re-read, seven papers that I think help explain what O1 is 00:00:15.600 |
and what's coming next. I'll also draw on talks released earlier today to back up the claim that 00:00:22.640 |
I made a few days ago that O1 Preview represents a step change in how models are trained and what 00:00:29.200 |
they can do. Of course, I'll also remind you of what they can't yet do, even if they thought about 00:00:34.560 |
it for the length of this entire video. Here at least is what one top OpenAI researcher thinks. 00:00:41.360 |
"I didn't expect," he said, "there to be much time where there's two totally different, roughly 00:00:48.240 |
intelligence-matched, winning on different dimensions, species. But that seems pretty 00:00:53.440 |
clearly where we're at." To be honest, I just wanted to use this tweet to set the stage for 00:00:59.680 |
the special moment we're in in AI. Here is what I think OpenAI did at a very high level with the 00:01:08.000 |
O1 series of models. As the video progresses, I'll get even more granular with my reasoning and quote 00:01:14.800 |
paragraphs from three-year-old papers to back it up. But for those who want a big picture overview, 00:01:21.200 |
here's what I think they've done. The foundational, original objective of language models is to model 00:01:28.960 |
language. It's to predict the next word. You can think of that, if you like, as paradigm one. 00:01:34.640 |
Interesting, but not overly useful. Ask a question and the language model might predict another 00:01:41.440 |
question to follow it. So to simplify again, we brought in another objective, paradigm two. We 00:01:47.360 |
wanted models to be honest, harmless, and helpful. We, or more like a proxy for us, would give 00:01:54.400 |
rewards to the model when it produced outputs that met those objectives. We started to get 00:01:59.920 |
answers that weren't just likely from a probability perspective, but also, sometimes, harmless, 00:02:06.240 |
honest, and helpful. Enter ChatGPT, and I hear that's doing well. O1, for me at least, 00:02:12.000 |
represents paradigm three. We want to reward answers that are objectively correct. Not saying 00:02:18.720 |
that we've forgotten the original objectives, but we've layered another one on top of them. 00:02:23.280 |
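To make that third objective concrete, here is a toy sketch of my own (not OpenAI's code): unlike next-word prediction or a learned preference score, the new signal is a hard check against a verifiable answer.

```python
# Toy illustration (mine, not OpenAI's) of the "paradigm three" signal:
# an answer is rewarded only if it is objectively correct and checkable,
# e.g. the final result of a maths or coding problem.

def correctness_reward(model_answer: str, reference_answer: str) -> float:
    """Return 1.0 only when the final answer can be verified as correct."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

print(correctness_reward("42", "42"))  # 1.0 -> reinforce this output
print(correctness_reward("41", "42"))  # 0.0 -> discard this output
```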
But how did they actually do that? Well, again, I'm going to give you the one-minute summary and 00:02:27.600 |
then go into more detail later in the video. Most of us might be aware that you can get models to 00:02:33.360 |
output what's called a chain of thought. By asking models, for example, to think step-by-step, 00:02:38.640 |
you can get much longer outputs that have reasoning steps within them. But that secret's 00:02:43.760 |
already a few years old, so that's not what is special about O1. So people thought of a brilliant 00:02:49.360 |
idea that just didn't quite work. How about we feed the model thousands of examples of human 00:02:56.000 |
step-by-step reasoning? Well, yes, that does work, but it's not really optimal. It doesn't scale 00:03:02.240 |
super well. OpenAI realized you could go one step better, so I'm going to hand the mic to them for 00:03:08.320 |
30 seconds. Wait, so it's better to train on model-generated chains of thought? But how come 00:03:36.720 |
they're so often wrong? And what does he mean by reinforcement learning in this context? Well, 00:03:41.680 |
how about this? And here, clearly, I'm going to get slightly metaphorical. How about we go up to 00:03:45.840 |
the model and whisper in its ear, "Get really creative. Don't worry as much about predicting 00:03:50.800 |
the next word super accurately. I just want really diverse outputs from you." The model, 00:03:55.600 |
of course, at what's called a temperature of one, is more than happy to get creative and generate 00:04:00.720 |
loads of diverse chains of thought. Meanwhile, other researchers must be looking on at these 00:04:05.600 |
guys thinking, "What are they doing? These are going to be so unreliable." But then what if we 00:04:09.840 |
had a way, preferably automatically, of grading those outputs? Then even many of you might agree, 00:04:15.600 |
"Well, some of those outputs are going to be good." Especially with more and more time spent thinking, 00:04:21.440 |
longer and longer chains of thought. Doesn't matter how low a proportion of the outputs 00:04:25.920 |
are correct, as long as we get at least one or a few. Then out of those thousands of outputs, 00:04:30.800 |
we can take those that work, those that produce the correct answer in mathematics, science, 00:04:35.920 |
coding. We take that answer and we fine-tune the model on those correct answers with correct 00:04:42.320 |
reasoning steps. That's how this becomes reinforcement learning. Only the best outputs 00:04:47.040 |
are making it through to the next round and being used to further train the model. And because you're 00:04:52.880 |
only fine-tuning or further training on those correct outputs with correct reasoning steps, 00:04:59.280 |
this process is highly data efficient. Unlike training on the web, where it might include one 00:05:05.520 |
of your random Reddit comments, which were awful by the way, or a tweet you did a few years ago, 00:05:10.720 |
this is golden data. So notice then how it's a marriage of train time compute, 00:05:17.040 |
the fine-tuning or training of a model, and what's called test time compute, that thinking time. 00:05:22.560 |
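Here is a minimal, self-contained sketch of that recipe as I understand it (my own toy code, not OpenAI's pipeline): sample many diverse chains of thought at temperature 1, grade them automatically against a known answer, and keep only the ones that got it right as fine-tuning data.

```python
import random

random.seed(0)

def noisy_chain_of_thought(a: int, b: int) -> tuple[str, int]:
    """Stand-in for temperature-1 sampling: a step-by-step answer to
    'a + b' whose arithmetic is deliberately unreliable."""
    final = a + b + random.choice([-2, -1, 0, 0, 0, 1, 2])  # often wrong
    cot = f"I need {a} + {b}. Adding them gives {final}. Answer: {final}"
    return cot, final

def grade(final: int, reference: int) -> bool:
    """Automatic grader: here only the final answer is checked."""
    return final == reference

problems = [(3, 4), (10, 7), (25, 17)]
golden_data = []
for a, b in problems:
    for _ in range(50):                        # many diverse samples per problem
        cot, final = noisy_chain_of_thought(a, b)
        if grade(final, a + b):
            golden_data.append(((a, b), cot))  # survivors become training data

print(f"Kept {len(golden_data)} correct chains out of {50 * len(problems)} samples")
# In the real recipe the model would now be fine-tuned on golden_data and the
# loop repeated - that is what gives this the flavour of reinforcement learning.
```

Note that it doesn't matter how low the hit rate is; only the survivors are kept as training data, which is why the resulting dataset is so clean.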
Test time, if you weren't clear, is when the model is actually outputting something, 00:05:26.400 |
not when it's being trained. We already knew that giving the models the time to produce what's 00:05:30.960 |
called serial calculations, one after another, after another, before producing their final output 00:05:36.560 |
would boost results. That kind of makes sense, right? Especially in technical domains. And that's 00:05:40.560 |
what we've seen. But then marry that with train time compute, training on those correct generations, 00:05:47.280 |
and then you get these two scaling graphs. For this difficult mathematics competition, 00:05:52.640 |
more time to think equals better results. But then train or fine-tune the model or generator 00:05:59.840 |
on correct outputs and reasoning steps, and that also produces a noticeable increase. And as you 00:06:06.960 |
may have noticed, neither of those graphs look like they are particularly leveling off anytime 00:06:12.800 |
soon. Before I get into those seven juicy papers that I mentioned at the start, I do want to touch 00:06:18.640 |
on a bigger question that many of you might have, which is, is this reasoning? Does it count as 00:06:24.240 |
human-like intelligence? Well, it's definitely not human-like, but it might not ultimately matter. 00:06:30.800 |
The analogy I came up with was this. Think of a librarian. You're going up to this librarian 00:06:37.120 |
because you have a question you want answered. The library books here are the model's training data, 00:06:42.800 |
and the original ChatGPT was a very friendly librarian, but it would often bring you the wrong 00:06:47.840 |
book. Or maybe it would bring you the right book that could answer your question, but point to the 00:06:52.960 |
wrong paragraph within that book. It was clear that ChatGPT was decent as a librarian, but 00:06:59.040 |
had no idea what it was handing to you. It was pretty easy if you wanted to, to demonstrate that 00:07:04.000 |
it wasn't actually intelligent. The O1 series of models are much better librarians. They've been 00:07:09.200 |
taking notes on what books successfully answered the questions that guests had and which ones 00:07:14.080 |
didn't, down to the level not just of the book, but the chapter, the paragraph, and the line. 00:07:18.880 |
We are still left though, of course, with the fundamental question, but the librarian doesn't 00:07:23.440 |
actually understand what it's presenting. This though, is when things get philosophically murky. 00:07:28.880 |
Does it ultimately matter in the end? We don't even understand how the human brain works. 00:07:33.680 |
Frankly, I'm going to leave this one to you guys, and you can let me know what you think 00:07:37.520 |
in the comments. But one thing is clear from this metaphor, which is, if you ask a question 00:07:43.440 |
about something that's not in the model's training data, that's not in the library, 00:07:47.280 |
then doesn't matter what you think, the librarian will screw up. By the way, the librarian is 00:07:52.640 |
exceptionally unlikely to say, "I don't know," and instead will bring you an irrelevant book. 00:07:57.120 |
That weakness, of course, is still incredibly prevalent in O1 Preview. 00:08:01.520 |
And there is another hurdle that would follow if you agree with this analysis, 00:08:05.760 |
not just a lack of training data. What about domains that have plenty of training data, 00:08:10.320 |
but no clearly correct or incorrect answers? Then you would have no way of sifting through 00:08:16.240 |
all of those chains of thought and fine-tuning on the correct ones. Compared to the original 00:08:21.280 |
GPT-4o in domains with correct and incorrect answers, largely, you can see the performance 00:08:26.880 |
boost. In areas with harder to distinguish correct or incorrect answers, much less of a boost. In 00:08:32.480 |
fact, a regression in personal writing. So that's the big picture, but now time for the juicy details 00:08:39.120 |
and hidden hints we've seen over the last few days. But before that, I hope you'll forgive me 00:08:43.920 |
for just two minutes about me and the channel. And when I say that I'm grateful for comments 00:08:50.480 |
and for watching to the end, I really do mean it. It's an honor to take a small parcel of your time, 00:08:56.880 |
and I don't expect any support beyond that. When I launched AI Insiders late last year on Patreon, 00:09:02.240 |
it was my attempt, though, at keeping the channel financially viable, as I had permanently given up 00:09:08.000 |
running my previous business around mid last year. I picked a price, frankly, that I thought it was 00:09:13.120 |
worth, which was $29. And I was honestly moved that people signed up and stayed steady for almost 00:09:20.480 |
a year. These guys are truly my Spartans, and many of you will be watching. But yes, of course, 00:09:26.000 |
I read the emails of people saying it was just a bit too much money for them and they couldn't 00:09:31.280 |
quite afford it. So nine months on, I have decided to take what you could call a gamble with my 00:09:38.080 |
entire career and reduce the price significantly from $29 a month to $9 a month all in, or actually 00:09:47.440 |
with an annual sub discount, $7.56 a month. Now, just quickly to encourage my dedicated 00:09:53.840 |
supporters to stay on that higher tier, I will be keeping the unfiltered personal podcast exclusive 00:09:59.360 |
to that original $29 tier. And to anyone who stays at that tier, I will personally message you 00:10:04.880 |
with thanks. Also, I do that for every new person joining that tier. But to everyone else for whom 00:10:10.560 |
$9 a month is viable, let me finish these two quick minutes with a tour. What you get access 00:10:16.320 |
to is exclusive AI explained videos. I think there's around 30 of them now, like this one 00:10:21.440 |
from last night on that Humanity's Last Exam benchmark that people were talking about yesterday. 00:10:26.800 |
Exact same quality that you would expect. And you get explainers like this one on the origins of the 00:10:31.520 |
term AGI. Obviously people comment as they do on YouTube and on and on. You can also download each 00:10:37.360 |
video so you can watch it offline if you want to. For those $9 or $7, you also get access to the 00:10:44.720 |
Discord, which has evolved a lot since it started. Now has live meetups, a new book club, and of 00:10:50.640 |
course, general discussion. If you go to the introductions page, you can see the caliber of 00:10:56.000 |
the kind of people who join on Discord. Some people, of course, won't care about any of that 00:11:01.040 |
and will just want to support hype-free AI journalism in a landscape that, as I wrote, increasingly 00:11:07.360 |
needs it. Totally understand, by the way, if $9 is too much, I am just super grateful for you 00:11:12.720 |
watching. Back to O1 though, and one thing you might have noticed is we can't actually see those 00:11:18.480 |
chains of thought. If you've used O1, for sure, you do see a summary and of course the output, 00:11:23.440 |
but not the true chains of thought that led it to the output. OpenAI admits that part of the 00:11:28.800 |
reason for that is their own competitive advantage and if you have followed the analysis so far, 00:11:34.480 |
that would kind of make sense. Rival labs, especially those that don't care much about 00:11:38.240 |
terms and conditions, could train on successful chains of thought that were outputted by the O1 00:11:43.840 |
series. After all, that is the key ingredient to its success, so it makes sense. But even if we 00:11:48.960 |
can't see those chains of thought, they have clearly unlocked much better serial calculations. 00:11:54.720 |
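To make "serial calculation" concrete before the squaring example that follows in the narration, here is a toy illustration of my own: each step needs the previous step's result, so the work cannot be parallelised.

```python
def repeated_square(x: int, times: int) -> int:
    """Square x repeatedly; step n cannot start until step n-1 is done."""
    for _ in range(times):
        x = x * x
    return x

print(repeated_square(3, 4))  # (((3^2)^2)^2)^2 = 3^16 = 43046721
```

A long hidden chain of thought acts like the scratchpad that carries those intermediate results from one step to the next.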
Imagine you had to square a number multiple times in a row. It's really hard to do that in parallel, 00:12:00.000 |
isn't it? You kind of need to know the result of the first calculation before you can do the next 00:12:04.720 |
one. Well, with a really long scratchpad to work things out on, or a chain of thought that's hidden, 00:12:10.960 |
models get much better at that. That ability to break down long or confusing questions into a 00:12:16.880 |
series of small computational steps is why I think that O1 preview gets questions like these correct 00:12:23.280 |
most of the time, as people have been pointing out to me. Now, I'm very much aware of that fact 00:12:27.600 |
because I analyzed every single answer that O1 preview gave, as I said in my last video when I 00:12:32.880 |
benchmarked it initially on SimpleBench. But that whole thing of just sneaking in a fact amongst 00:12:38.480 |
many others, as you might see in this question when she eats three other cookies, that's just 00:12:42.880 |
one small component of SimpleBench. There still remains many, many question categories where it 00:12:47.920 |
flops badly. Again, because the data is not in its training data. The librarian can't retrieve a book 00:12:53.680 |
that's not there. The makers of ARC-AGI, amongst the most popular AI benchmarks, say this. "In 00:12:59.360 |
summary, O1 represents a paradigm shift from 'memorize the answers' to 'memorize the reasoning'. 00:13:06.640 |
Remember, it was trained on those reasoning steps that did end up leading to a correct answer, 00:13:12.640 |
so it's starting to get better at recognizing which kinds of reasoning lead to correct answers 00:13:19.200 |
in which domain. Less, do I have that exact fact, exact answer in my training data? And more, do I 00:13:26.080 |
have the kind of reasoning steps that I think might be appropriate for solving this problem?" 00:13:30.960 |
But still, as I think I've made clear in this video, if those reasoning steps or facts are 00:13:36.080 |
not in the training data, they're not in distribution, it still will fail. O1 is still not 00:13:42.720 |
a departure from the broader paradigm of fitting a curve to a distribution in order to boost 00:13:48.400 |
performance by making everything in distribution, training on everything, expanding the library. 00:13:53.760 |
They say we still need new ideas for artificial general intelligence. Another way of putting this 00:14:00.000 |
is that there doesn't exist a foundation model for the physical world. We don't have those banks 00:14:05.680 |
and banks of "correct answers" for real-world tasks. And that's partly why models flop on 00:14:12.240 |
SimpleBench. So you should start to notice a pattern in those questions that O1 Preview 00:14:18.000 |
is now getting right where GPT-4o couldn't. One of the stars of O1, Noam Brown of OpenAI, 00:16:24.720 |
gave this example. It came from the famous Professor Rao that I interviewed on this 00:14:29.520 |
channel and actually for Insiders. It's about stacking blocks and it's quite confusing at first, 00:14:34.240 |
but O1 Preview gets it nicely. This by the way was originally given as an example of the kind 00:14:39.280 |
of problems that LLMs simply can't get right. But I wouldn't say that stacking blocks is a 00:14:44.160 |
data sparse kind of domain. It's just that previous models got overwhelmed with the amount 00:14:48.800 |
of things going on. It just required too many serial calculations and computations and they 00:14:53.680 |
couldn't do it. And if you want more evidence that training data for better or ill dictates 00:15:00.320 |
performance, here's an example with O1 Preview. The surgeon who is the boy's father says, 00:15:08.480 |
"I can't operate on this boy. He's my son." Who is the surgeon to the boy? Remember? He's been 00:15:14.480 |
described as the boy's father. But the surgeon is the boy's other father. The boy has two fathers. 00:15:20.160 |
As always then, it's worth remembering that exam-style knowledge benchmarks in particular, 00:15:26.320 |
rather than true reasoning benchmarks, do not equal real-world capabilities. 00:15:31.040 |
I now want to count down 10 more interesting facts and bits of background about O1 before I end on 00:15:37.840 |
where we all go from here. What comes next? First, as you may have gathered, the training of O1 was 00:15:43.600 |
fundamentally different from GPT-4o. That extra layer of reinforcement learning means that no 00:15:49.760 |
amount of prompt engineering on the base GPT-4o, no amount of asking for thinking step-by-step, 00:15:55.600 |
will be able to match its performance. Next is that O1 Preview and O1 might be piecing together 00:16:02.720 |
reasoning steps that we haven't pieced together before. They're still "our" reasoning steps, 00:16:09.200 |
but the model is optimised to piece together those steps that achieve the desired result. 00:16:15.200 |
As one of the key authors of O1, who I talked about a lot last year, Łukasz Kaiser said, 00:16:20.880 |
"When you know the right chain of thought, you can compute anything." And as Andrej Karpathy said 00:16:26.560 |
two days ago, "Those thoughts don't even need to be legible to us." They might piece together 00:16:31.360 |
reasoning steps that are translated from other languages, or are in their own made-up language. 00:16:37.120 |
Remember that the objective is now clearly to get the right answer, so the models will optimise 00:16:42.880 |
however they can to do so. And my next point is, as Noam Brown points out, this is exactly what 00:16:48.800 |
happened with chess. Indeed, the way he says it is, "This is starting to sound a lot like," referring 00:16:54.160 |
to O1 Mini's performance in the Codeforces contest, "It's starting to sound a lot like the trajectory 00:16:59.680 |
of chess." Which was what? Well, I'm obviously oversimplifying here, but Stockfish, the best 00:17:04.960 |
chess engine, was built originally with human heuristics, hand-crafted functions to evaluate 00:17:11.600 |
board positions. Obviously it used search to assess way more positions than a human could, 00:17:16.880 |
but it still had those hand-crafted functions. Well, until July 2023, when Stockfish removed the 00:17:23.920 |
hand-crafted evaluation and transitioned to a fully neural network-based approach. Or to put 00:17:29.600 |
it another way, by crafting its own reasoning steps and being optimised to put them together 00:17:34.720 |
in the most effective fashion, we may end up with reasoning that we ourselves couldn't have come up 00:17:39.440 |
with. As long as there is something out there that can grade correct versus incorrect, the performance 00:17:44.880 |
will keep improving. For the next series of interesting points, I'm going to draw on a video 00:17:49.840 |
I made nine months ago. Obviously, I am ridiculously biased, but I think it was absolutely 00:17:57.280 |
bang on in terms of its predictions about Q*. Honestly, if you've got time, I recommend watching 00:18:02.240 |
this entire video, but I'm going to pick out a handful of moments where I really called it, 00:18:07.280 |
but that's not the important bit. It's where it helps explain that these researchers saw what was 00:18:12.880 |
coming nine months, 12 months ago. The clues were out there. We just had to put them together. 00:18:17.680 |
So my fourth interesting point comes from minute 17 of the video, where this approach of emitting 00:18:23.680 |
chains of thought can be extended into different modalities. "Multimodal, where the chain of thought 00:18:29.680 |
is basically a simulation of the world. So it will be multimodality and this ability to generate 00:18:36.000 |
sequences of things before you give an answer that will resemble much more what we call reasoning." 00:18:43.600 |
That was a short snippet, but I think it contains a crucial detail. Just like the O1 family of 00:18:48.640 |
models is scoring dramatically better in physics tests, Sora, the video generation model from OpenAI 00:18:55.040 |
could get way better at modeling physics in pixels. It could attempt to predict the next 00:19:01.040 |
pixel with chains of thought and be fine-tuned on those predictions that actually worked, 00:19:06.320 |
potentially without even needing a data labeling revolution. A video generation model could learn 00:19:11.360 |
which sources, which videos from YouTube, for example, depict reality with the most accuracy. 00:19:16.640 |
As one former Googler and OpenAI member, Geoffrey Irving, said, "You could have a scenario of let's 00:19:22.240 |
think pixel by pixel." This step change, in other words, doesn't have to be limited to text. 00:19:27.760 |
Now for my next point, I have no idea why I delayed it this long into the video, 00:19:32.240 |
but look at this prediction that I made back in November. I picked out a key paragraph in a paper 00:19:37.520 |
called "Let's verify step-by-step" which indicated what OpenAI were working on. "Let's verify step-by-step" 00:19:44.480 |
could be like choosing an action. After all, in the original paper, using test-time compute in this way 00:19:50.000 |
was described as a kind of search. And in "Let's verify", they hinted at a step forward involving 00:19:55.920 |
reinforcement learning. They said, "We do not attempt to improve the generator, the model coming 00:20:00.800 |
up with solutions, with reinforcement learning. We do not discuss any supervision the generator 00:20:06.000 |
would receive from the reward model, if trained with RL." And here's the key sentence, "Although 00:20:11.280 |
fine-tuning the generator with reinforcement learning is a natural next step, it is intentionally 00:20:17.200 |
not the focus of this work." Is that the follow-up work that they did? I mean, if that prediction 00:20:22.400 |
isn't worthy of a like on YouTube, or preferably joining AI Insiders, then I don't know what is. 00:20:28.000 |
I then went into detail on the 2022 paper that showed how that would be done. In a nutshell, 00:20:33.520 |
it involves fine-tuning a model on the outputs it generated that happen to work. Keep going 00:20:39.280 |
until you generate rationales that get the correct answer, and then fine-tune on all of those 00:20:44.640 |
rationales. And they say that, "We show that STaR significantly improves performance on multiple 00:20:49.760 |
datasets compared to a model fine-tuned to directly predict final answers." Does that 00:20:54.560 |
remind you of Let's Verify? "And performs comparably to fine-tuning a 30x larger state-of-the-art 00:21:00.960 |
language model." Next, I want to show you all a warning straight from Ilya Sutskever, one of the 00:21:07.040 |
key authors of this approach. Presumably, he's putting it to work in the safe superintelligence 00:21:12.080 |
company. But he had a warning, "Reinforcement learning is creative." Reinforcement learning 00:21:17.360 |
has a much more significant challenge. It is creative. Reinforcement learning is actually 00:21:26.400 |
creative. Every single stunning example of creativity in AI comes from a reinforcement 00:21:34.480 |
learning system. For example, AlphaZero has invented a whole new way of playing a game that 00:21:41.760 |
humans have perfected for thousands of years. It is reinforcement learning that can come up with 00:21:45.680 |
creative solutions to problems, solutions which we might not be able to understand at all. 00:21:51.600 |
And so what happens if you do reinforcement learning on long or even medium time horizon 00:21:57.680 |
when your AI is interacting with the real world, trying to achieve some kind of a beneficial 00:22:04.320 |
outcome, let's say, as judged by us, but while being very, very, very creative. 00:22:08.720 |
This does not mean that this problem is unsolvable, but it means that it is a problem. 00:22:14.000 |
And it means that some of the more naive approaches will suffer from 00:22:17.440 |
some unexpected creativity that will make the antics of Sydney seem very modest. 00:22:22.720 |
Next, I feel I foreshadowed Q*, Strawberry or O1's weakness with spatial reasoning. 00:22:28.800 |
I think the development is likely a big step forward for narrow domains like mathematics, 00:22:34.240 |
but is in no way yet a solution for AGI. The world is still a bit too complex for this to work yet. 00:22:41.520 |
That desperate need to model the world's complexity and achieve true spatial intelligence 00:22:46.560 |
is why Fei-Fei Li's startup is already worth $1 billion after just four months. 00:22:52.240 |
Now, as I've hinted already in this video, I think OpenAI graded the individual reasoning steps 00:22:57.920 |
of the generator's outputs, not just whether the overall answer was correct. 00:23:02.080 |
But for more background on that, it would be easier for me to just play 00:23:06.000 |
a couple of minutes from that November 2023 video. 00:23:14.080 |
So that's test time compute, but what about let's verify step by step? 00:23:17.760 |
Well, going back to that original 2021 verifier paper, they said this. 00:23:21.760 |
The problem they noticed with their approach back in 2021 was that their 00:23:25.440 |
models were rewarding correct solutions, but sometimes there would be false positives. 00:23:30.320 |
Getting to the correct final answer using flawed reasoning. 00:23:33.840 |
They knew this was a problem, and so they worked on it. 00:23:36.400 |
And then in May of this year, they came out with let's verify step by step. 00:23:41.360 |
In this paper, by getting a verifier or reward model to focus on the process, 00:23:46.560 |
the P, instead of the outcome, the O, results were far more dramatic. 00:23:51.600 |
Next, notice how the graph is continuing to rise. 00:23:55.040 |
If they just had more, let's say, test time compute, this could continue rising higher. 00:24:00.960 |
And I actually speculated on that back on June the 1st. 00:24:04.480 |
That difference of about 10% is more than half of the difference between GPT-3 and GPT-4. 00:24:10.800 |
And also, is it me, or is that line continuing to grow? 00:24:14.320 |
Suggesting that when more compute is available, the difference could be even more stark. 00:24:18.960 |
Imagine a future where GPT-4 or 5 can sample, say, a trillion (10^12) solutions. 00:24:25.360 |
So you're beginning to see my hypothesis emerging. 00:24:27.840 |
A new and improved let's verify step by step, called Q*, 00:24:31.600 |
drawing upon enhanced inference time compute to push the graph toward 100%. 00:24:36.880 |
If you want more details on that process reward model, 00:24:40.160 |
check out the video I did back then called Double the Performance. 00:24:43.760 |
But the very short version is that they trained a reward model 00:24:47.200 |
to notice the individual steps in a reasoning sequence. 00:24:51.360 |
That reward model then got very good at spotting erroneous steps. 00:24:55.760 |
Furthermore, when that model concluded that there were no erroneous steps, 00:24:59.360 |
as we've seen from the graphs, that was highly indicative of a correct solution. 00:25:04.080 |
Notice also that sometimes it could pick out such a correct solution, 00:25:08.240 |
when the original generator, GPT-4, only outputted that correct solution one time in a thousand. 00:25:14.480 |
Furthermore, the method somewhat generalized out of distribution, 00:25:18.400 |
going beyond mathematics to boost performance in chemistry, physics, and other subjects. 00:25:24.000 |
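Here is a rough sketch of what process-based grading and best-of-N selection could look like (my own toy code, loosely following the Let's Verify idea described above; the scoring rule and the numbers are made up purely for illustration).

```python
from typing import List

def solution_score(step_scores: List[float]) -> float:
    """Process-based scoring: a solution is only as strong as its weakest step.
    (Outcome-based scoring would instead look only at the final answer.)"""
    return min(step_scores) if step_scores else 0.0

def best_of_n(candidates: List[List[float]]) -> int:
    """Pick the candidate whose every step looks sound to the reward model."""
    return max(range(len(candidates)), key=lambda i: solution_score(candidates[i]))

# step_scores[i][j] = the process reward model's estimate that step j of
# candidate i is correct (made-up numbers, purely for illustration).
candidate_step_scores = [
    [0.95, 0.40, 0.99],  # one shaky middle step
    [0.90, 0.92, 0.97],  # consistently sound reasoning
    [0.99, 0.99, 0.10],  # confident start, flawed ending
]
print(best_of_n(candidate_step_scores))  # -> 1
```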
And Noam Brown, I think, gave a clear hint that verifiers were used in the training of O1. 00:25:30.480 |
Again, my theory is that only answers for which every reasoning step was correct, 00:25:35.120 |
as well as the final answer, were used to train or fine-tune the O1 family. 00:25:40.400 |
But just look at this point he leaves hanging in the air 00:25:47.760 |
where you're verifying every single step with a really good reward model, 00:25:50.640 |
you're getting an even bigger boost and you're getting up to 78.2%. 00:25:53.440 |
And you can see it still looks like that number, 00:25:55.520 |
that line would go up more if you generated more samples. 00:25:58.160 |
That's as big a hint as you're going to get that let's verify was key for O1. 00:26:04.320 |
And very quickly before I leave let's verify, 00:26:06.880 |
don't forget that that paper cited work from Google. 00:26:10.240 |
Some of the other key authors behind and around let's verify have also gone to Anthropic. 00:26:15.440 |
So it's not like OpenAI will be the only ones working on this. 00:26:19.040 |
Yes, they're well ahead, but I could well see one of those other two labs catching up. 00:26:24.800 |
Remember how I talked about higher temperature being optimal for generating those creative chains of thought? 00:26:30.000 |
Well, that was suggested as early as 2021 at OpenAI. 00:26:34.640 |
From the paper I cited in that November 2023 video: 00:26:40.480 |
"Verification consists of sampling multiple high temperature solutions." 00:26:46.720 |
You might be wondering where I'm going with this, 00:26:49.120 |
but that is why I think the API of the O1 family keeps the temperature at one. 00:26:55.520 |
I think the model itself was used to generate those chains of thought. 00:27:00.320 |
And that same model was then fine-tuned on those correct solutions. 00:27:04.400 |
In other words, because the model was trained that way, 00:27:08.640 |
OpenAI don't actually allow you to change it. 00:27:10.720 |
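For what that looks like in practice, here is a minimal call with the openai Python SDK (v1.x); the fixed-temperature behaviour is as reported at o1's launch and could change, so treat the details as an assumption.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# At launch the o1 models reportedly only accepted the default temperature (1),
# which fits the theory above: the model was trained on its own temperature-1
# chains of thought, so that is the regime it is served in.
response = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": "How many primes are there below 100?"}],
    # temperature=0.2  # <- reportedly rejected for o1 models at launch
)
print(response.choices[0].message.content)
```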
Let me know in the comments if you think I've figured out something there. 00:27:15.440 |
And the White House is certainly taking all of this quite seriously. 00:27:18.960 |
They were shown Strawberry and O1 earlier this year, 00:27:22.640 |
and they now describe how promoting and funding AI data center development 00:27:26.960 |
reflects the importance of these projects to American national security. 00:27:34.080 |
The government at the very least is a believer, but are we? 00:27:37.760 |
Well, I have been very impressed by O1 Preview. 00:27:47.600 |
But either way, please do have a wonderful day.