Back to Index

Q* - Clues to the Puzzle?


Transcript

As you might expect, I have been researching nonstop about this apparent powerful AI discovery that insiders at OpenAI said could threaten humanity. I've spoken to every insider I know and done a ton of research and I am not claiming to have solved the puzzle. But I can provide some genuine clues that I think will be at least part of the answer.

Normally, I like to be a lot firmer than that, but this is the best I can do. Now the first thing to note of course is that OpenAI have now denied that Sam Altman's ouster was precipitated by this safety letter to the board. As my previous three videos have shown, there was certainly a lot else going on, so it might well not just be about this AI breakthrough.

Second, I just want to quickly debunk this clip that's doing the rounds where people are claiming that Sam Altman called this new creation a creature, not just a tool. Actually, if you watch to the end, he's very much saying he's glad that people now think of it as part of the toolbox.

So despite this frantic headline, I am not trying to overhype things, but I genuinely think I figured a couple of things out. Let's get to it. Later in the article, these insiders, these researchers flagged up work by an AI scientist team, the existence of which multiple sources confirmed. This AI scientist team was formed by combining the earlier Code Gen and Math Gen teams at OpenAI.

Their work on exploring how to optimize existing AI models to improve their reasoning was flagged in the letter to the board. Now there is very little public information about either the Code Gen or Math Gen teams, but I dredged up this old tweet from Sam Altman. Yes, by the way, I am now on Twitter.

I finally succumbed. Anyway, Sam Altman said, "Really exciting process supervision result from our Math Gen team," and he linked to a critical paper that I covered back in the spring. It's called Let's Verify Step by Step. And that paper is the crux of the video. That's what I think the former Math Gen, now AI scientist, team were working on.

And take this tweet in September by Noam Brown at OpenAI: "My teammates and I at OpenAI are hiring ML engineers for research on LLM multi-step reasoning. We recently hit a state-of-the-art 78% on the MATH benchmark," which I'll get to in a second, "Our new plans are even more ambitious."

I'm only just getting started, but you might already be glimpsing why I think this paper is the crux of the new development. Here's another reason. What's much more ambitious than 78%? Acing such tests. That's apparently what they've achieved behind the scenes, at least according to Reuters. And what's another bit of evidence?

Well, we have this exclusive report from The Information. Again, a similar frantic headline: "OpenAI made an AI breakthrough before Altman firing, stoking excitement and concern." But who led that breakthrough? The technical breakthrough, they say, was spearheaded by OpenAI chief scientist Ilya Sutskever. And the most recent paper on arXiv listing Ilya Sutskever as an author is the "Let's Verify Step by Step" paper.

Now at this point, I know you want me to get to what it means, but with so many theories floating out there, I want to give you yet more evidence that at least one of the breakthroughs links to "Let's Verify Step by Step." And I think I might even know what the other breakthrough is.

The same article in The Information talks about Sutskever working on ways to allow language models to solve tasks that involve reasoning, like math or science problems. It talks about how he had this secret program called GPT-Zero. And here's where it gets interesting. The team hypothesized that giving language models more time and computing power to generate responses to questions could allow them to develop new academic breakthroughs.

And Lukasz Kaiser, who we will definitely be seeing more of in this video, indeed he appears in the thumbnail, apparently held a key role in the GPT-Zero project. And look at this, among the techniques the team experimented with was an ML concept known as "test-time computation" that's apparently meant to boost language models' problem-solving abilities.

And we have a hundred speculations online about what this might mean. But I knew that name rang a bell and I dug up this 2021 paper which uses the term test-time compute. And what is this random paper that Philip is talking about? Well, it's from OpenAI, and look at some of the co-authors: Karl Cobbe of the Math Gen team and Lukasz Kaiser, cited in The Information.

This was actually one of the precursor papers to Let's Verify Step by Step, which I will explain in a moment. It introduced the, now famous in ML circles, GSM8K dataset. That's 8,000 grade school math problems. More importantly though, it trialed this method of, at test time, generating many candidate solutions and selecting the one ranked highest by a verifier.

And I'm going to massively oversimplify at this point and just say that a verifier is a separate model trained, at this point in this paper, only to spot good solutions, solutions that get the answer correct. And what the paper proposed was getting the base LLM to generate hundreds of solutions and then getting this separate verifier to spot the ones that were most likely to be correct.

And in a nutshell, the authors noticed that if they invested more computing power in generating more solutions and taking a majority vote among the top verifier ranked solutions, that had a massive effect on performance. And that's what it means by test time compute, investing your computing power while you're taking the test, not during training.

So the model stays the same. You're not further training it or fine tuning it. You're investing that computing power during test time, again, to generate potential solutions and take majority votes amongst them, self-consistency. They found that using a verifier in this way was better than fine tuning because verification was a simpler task than generation.
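
To make the mechanics concrete, here is a minimal, runnable toy sketch of that verifier-plus-voting recipe. To be clear, the generator and verifier below are random stand-ins I've invented purely for illustration; they are not OpenAI's models or code, just the shape of the procedure the paper describes.

```python
import random
from collections import defaultdict

# Toy stand-ins: in the paper, generate_solution would be the base LLM sampling a
# chain of thought plus a final answer, and verifier_score a separately trained model
# estimating how likely that solution is to be correct. Both are invented here.
def generate_solution(question: str) -> tuple[str, str]:
    answer = random.choice(["42", "42", "41", "40"])              # noisy generator
    return f"reasoning ending in {answer}", answer

def verifier_score(question: str, reasoning: str) -> float:
    return random.random() + (0.5 if "42" in reasoning else 0.0)  # noisy verifier

def answer_with_test_time_compute(question: str, n_samples: int = 100, top_k: int = 30) -> str:
    # Spend compute at test time: sample many candidates from the same, unchanged model.
    candidates = [generate_solution(question) for _ in range(n_samples)]
    # Rank every candidate with the separate verifier.
    scored = sorted(((verifier_score(question, r), a) for r, a in candidates), reverse=True)
    # Verifier-weighted majority vote over the top-ranked answers ("self-consistency").
    votes = defaultdict(float)
    for score, answer in scored[:top_k]:
        votes[answer] += score
    return max(votes, key=votes.get)

print(answer_with_test_time_compute("What is 6 x 7?"))  # almost always "42"
```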

It's easier to check your answers than generate good ones. So investing time in checking answers was more fruitful. How much more fruitful? Well, the use of verifiers results in approximately the same performance boost as a 30 times model size increase. And then it gets prophetic when it says that verifiers scale significantly better with increased data.

And to hammer the point home, they said a 6 billion parameter model with verification slightly outperforms a fine-tuned 175 billion parameter model, again thereby offering a boost approximately equivalent to a 30x model size increase. And that team, by the way, with Lukasz Kaiser, was probably drawing on work by a single researcher done six months earlier, in April 2021.

While studying the board game Hex, the researcher Andy Jones found an interesting result. He said that, along with their main result, they further show that the test-time and train-time compute available to an agent can be traded off, as we've seen, while maintaining performance. I read that paper in full and it was cited by none other than Noam Brown.

He's the guy we saw earlier who joined OpenAI in July of this year. He said he's investigating how to make reasoning methods truly general. If successful, we may one day see LLMs that are a thousand times better than GPT-4. And later in that thread, he cites that same paper from Andy Jones.

And he concludes with this, all those prior methods are specific to the games they're talking about. But if we can discover a general version, the benefits could be huge. Yes, inference may be a thousand X slower and more costly, but what inference cost would we pay for a new cancer drug or for a proof of the famous mathematical Riemann hypothesis?

And we'll come back to mathematical hypotheses in a moment. Anyway, improved capabilities are always risky, he went on. But if this research succeeds, it could be valuable for safety research as well. Notice that point about capabilities being risky. It does seem to me to be linking together into that Reuters story.

Imagine, he says, being able to spend $1 million on inference to see what a more capable future model might look like. It would give us a warning that we would otherwise lack. Is that what OpenAI did when they recently pushed back the veil of ignorance? Did they just spend a million dollars on inference and see what a more capable future model might look like?

So that's test-time compute, but what about Let's Verify Step by Step? Well, going back to that original 2021 verifier paper, they said this. The problem that they noticed with their approach back in 2021 was that their models were rewarding correct solutions, but sometimes there would be false positives.

That is, getting to the correct final answer using flawed reasoning. They knew this was a problem and so they worked on it. And then in May of this year, they came out with Let's Verify Step by Step. In this paper, by getting a verifier or reward model to focus on the process (the P in PRM) instead of the outcome (the O in ORM), results were far more dramatic.

Next, notice how the graph is continuing to rise. If they just had more, let's say test time compute, this could continue rising higher. And I actually speculated on that back on June the 1st. That difference of about 10% is more than half of the difference between GPT-3 and GPT-4.

And also, is it me or is that line continuing to grow? Suggesting that when more compute is available, the difference could be even more stark. Imagine a future where GPT-4 or 5 can sample, say, a trillion (10^12) solutions. So you're beginning to see my hypothesis emerging.

A new and improved Let's Verify Step by Step called Q*, drawing upon enhanced inference-time compute to push the graph toward 100%. If you want more details on that process reward model, check out the video I did back then, called Double the Performance. But the very short version is that they trained a reward model to notice the individual steps in a reasoning sequence.

That reward model then got very good at spotting erroneous steps. Furthermore, when that model concluded that there were no erroneous steps, as we've seen from the graphs, that was highly indicative of a correct solution. Notice also that sometimes it could pick out such a correct solution when the original generator, GPT-4, only outputted that correct solution one time in a thousand.
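
As a rough illustration of how such a process reward model gets used, here is a minimal sketch of ranking candidate solutions by the probability that every step is correct (the product of per-step scores). The `step_correct_prob` function is a hypothetical placeholder for the trained reward model, not an actual API.

```python
import math

def step_correct_prob(question: str, steps_so_far: list[str], step: str) -> float:
    """Process reward model: probability that this individual step is correct (placeholder)."""
    raise NotImplementedError  # stand-in for the trained PRM

def solution_score(question: str, steps: list[str]) -> float:
    # Score a whole solution as the probability that all of its steps are correct,
    # so a single erroneous step drags the whole solution down.
    log_score = 0.0
    for i, step in enumerate(steps):
        p = step_correct_prob(question, steps[:i], step)
        log_score += math.log(max(p, 1e-9))
    return math.exp(log_score)

def pick_best(question: str, candidate_solutions: list[list[str]]) -> list[str]:
    # Among many sampled candidates, keep the one whose step-by-step reasoning
    # the process reward model trusts most.
    return max(candidate_solutions, key=lambda steps: solution_score(question, steps))
```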

Furthermore, the method somewhat generalized out of distribution, going beyond mathematics to boost performance in chemistry, physics, and other subjects. So was it potentially the million dollar inference run that spooked the researchers? Or was it the potential to make radical breakthroughs in science? And a further bit of supporting evidence for this theory comes again from the information.

It gives us a rough timeline: in the months following the breakthrough, Sutskever had reservations about the technology, and in July he formed the superalignment team. So the original breakthrough, whatever it was, had to have come well before July. That would fit much more with it being associated with Let's Verify Step by Step.

Or again, maybe that combination of process reward modeling and inference-time compute. And in a genuinely fascinating recent conference call, this is what Ilya Sutskever said in reply to a question about whether models could produce mathematical conjectures. How original or creative are the latest large language models? Of course we know that, for instance, AlphaGo did some pretty creative moves when it won its match in South Korea.

So that's possible. But to be very concrete, do you think the existing models or some, you know, the next GPT-4, say GPT-5 or so, would be able to state a new non-trivial mathematical conjecture? I'm not saying proving it. I'm saying stating it. Who thinks it's possible within the next five years?

Are you sure that the current model cannot do it? I'm not sure, absolutely. Do you know whether it can? I mean, let me give you an example of something creative that GPT-4 can already do. Obviously we would have all loved to hear Ilya Sutskever's full answer to that question, but we never got a chance.

But here's where we get to arguably the strongest bit of evidence that I've got. Remember that name again: Lukasz Kaiser, cited in the exclusive report from The Information. If you remember, he had that key role on the GPT-Zero project. He was a co-author on both of the papers that I've brought up, and he was focused on test-time computation to boost language models' problem-solving abilities.

Well, as the article says, even more significantly, he was one of the co-authors of the original "Attention Is All You Need" Transformer paper. And so presumably, with that much pedigree, it would take quite a lot to get him excited. So I'm now going to show you some extracts from two YouTube videos, both of which have hardly any views, despite him being mentioned in these exclusives about Q*.

And Lukasz Kaiser will describe in these videos how significant he thinks a variation of "let's think step by step" could be. First he'll give you some further background on the breakthrough. It needs to do this thinking inside its layers. That may not be enough time and space to do it.

Like tell me what you're thinking and only then give the answer. And if you do that, there is a recent paper that says the model can basically do any computation. It can even execute programs. It's Turing complete. And does this really help? So on mathematics, you can tell the model, hey, do this thinking, but do it like number each step, like one, two, three, four, five, six, seven, as you see here, and be very precise about each step.

And then you can try to verify each of these steps of thinking separately. You can even ask the model, well, was step three correct? Was step four correct? And when you do that, like this MATH dataset, which is a little bit tougher math problems than like the pure arithmetic, it was especially made to show like what the models cannot do.

If you add this thinking, you can get to like almost 80% just by allowing the model to think. And at least on mathematical questions, this gives like insane gains. And insane is the fair word to use. Like if you-- a transformer has a limitation in running time. It runs in n squared time for input of n, and that's it.

When you allow it to produce chains of thought, it's as computationally powerful as anything you can imagine. But, okay, so these two ingredients give you something that generalizes. Could we make them even more powerful? And this is called chain of thought mostly, and chain of hindsight, and programs of thought and so on.

But I think this has turned out to be the method that makes transformers more powerful. And it's not just mathematics where you can build this thinking. He even goes as far as describing it as a major focus for deep learning in 2024. If you think what is coming in the next years, I think there'll be a lot of focus on doing this thinking thing with deep learning, with language models, probably with chains of thought, but also these chains of thought currently, they're just prompted, but maybe you need to do some fine tuning, some learning.

There'll be a lot of work there, but this is a very hard problem to solve. And we'll start with much simpler exercises and probably move forward. But I think this is something that the coming years will see a lot of work. And in a sign of the significance he attributes to this method, he said it could even be revolutionary for multimodality.

I think you also need these chains of thought that, like you need to give the model the ability to think longer than it has layers. But it can be combined with multimodality. So in the future, the models will have this knowledge of the world and this generation, which we call chain of thought and text.

But multimodality, this means just it's a chain of frames of what's going to happen to the world, which is basically how we sometimes think, you know, what will, if I go, what will happen to me? And I think that will indeed be, so it will be multimodality and this ability to generate sequences of things before you give an answer that will resemble much more what we call reasoning.

He then described how this method will help models to generalize from much less data. Layers are not the end, right? You can do chains of thought to extend them. You can do GNNs, you can do recurrence in depth. How do you see the next two years of deep learning?

Yeah, I think there'll be as interesting as any previous two years or even more. I think there'll be a lot on the chain of thought, but very generally speaking. So also on the agents, building libraries of knowledge, possibly multimodal where the chain of thought is basically a simulation of the world.

So I think that will be like one big topic and I think this will make the models generalize much better from less data too. And that might remind you of something, going back to Reuters and The Information. Sutskever's breakthrough allowed OpenAI to overcome limitations on obtaining enough high-quality data to train new models.

According to the insider with knowledge, that had been a major obstacle for developing next-generation models. So according to my theory, this breakthrough is less about generating trillions and trillions of tokens' worth of synthetic data, and more about using the data you've got much more efficiently. But now, alas, we must get to the bits that my theory can't explain, namely the name.

The Information cites two top researchers at OpenAI building, on top of Sutskever's method, a model called Q*. Now I've tested every link I could possibly find between the name Q* and my theory about Let's Verify Step by Step. And while I do have some ideas, honestly, it's still an open question.

And of course, I like everyone has to admit that there's a chance that I'm entirely wrong. When I put my idea to a senior ML researcher at a top AGI lab, he thought it had real legs. It was a genuine possibility. And he said one link to the name Q* could be in a generic sense.

Without getting too technical, Q* refers to the optimal Q function or optimal policy. Another possibility is that the Q references Q-learning. Generically, that's a reinforcement learning technique where an agent learns to make optimal decisions by exploring its environment. An agent chooses actions, sees how they go, and then updates its policy.

Basically trial and error, trading off exploration of new steps, new actions, versus exploitation of actions you know have some good reward. And here's where the analogy gets a little bit tenuous. Picking the reasoning steps in Let's Verify Step by Step could be like choosing an action. After all, in the original paper, using test-time compute in this way was described as a kind of search.
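
To unpack the jargon, here is generic tabular Q-learning in a few lines. Q(s, a) estimates the total future reward from taking action a in state s, and Q* denotes the optimal such function. Treating "state = the partial reasoning so far" and "action = the next reasoning step" is purely my analogy for illustration, not a confirmed detail of whatever OpenAI built.

```python
import random
from collections import defaultdict

Q = defaultdict(float)                  # Q[(state, action)] -> estimated value
alpha, gamma, epsilon = 0.1, 0.99, 0.1  # learning rate, discount factor, exploration rate

def choose_action(state, actions):
    # Explore new actions occasionally, otherwise exploit the best-known one.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state, next_actions):
    # Standard Q-learning update: nudge Q(s, a) toward reward + discounted best future value.
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```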

And in Let's Verify, they hinted at a step forward involving reinforcement learning. They said they do not attempt to improve the generator, the model coming up with solutions, with reinforcement learning, and they do not discuss any supervision the generator would receive from the reward model if trained with RL. And here's the key sentence.

Although fine tuning the generator with reinforcement learning is a natural next step, it is intentionally not the focus of this work. Is that the follow up work that they did? I mean, you can kind of think of Q learning for process supervision as minimizing the cumulative probability of failure, which is the equivalent of maximizing the probability of success.
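
One way to make that equivalence concrete (my own framing, not something spelled out in the paper): if the reward for reasoning step $t$ is $r_t = \log p_t$, where $p_t$ is the probability that step $t$ is correct, then

$$\sum_t r_t = \sum_t \log p_t = \log \prod_t p_t = \log P(\text{every step is correct}),$$

so maximizing the cumulative reward over the steps is the same as maximizing the probability that the whole chain of reasoning succeeds, or equivalently minimizing the cumulative chance of failure along the way.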

And after all, maximizing a sum of rewards over multiple steps is exactly what Q learning aims to do. If any of you, though, have better guesses for the analogy, and I'm sure you do, do let me know in the comments. But what about the star? Well, again, here I am truly speculating.

Unlike the earlier parts of the video, in which I am much more confident, this is much more speculative and tenuous. Peter Liu of Google DeepMind had this idea. Remember, the leak talked about acing math tests. He said, "Sounds like OpenAI got some good numbers on GSM8K." Remember, that's the set of questions made for that original verifier paper back in 2021.

He said he's speculating, but there's a STaR in this paper, a technique that fine-tunes a model on its own better outputs. In a nutshell, it involves fine-tuning a model on the outputs it generated that happened to work. Keep going until you generate rationales that get the correct answer and then fine-tune on all of those rationales.

And they say that STaR significantly improves performance on multiple datasets compared to a model fine-tuned to directly predict final answers. Does that remind you of Let's Verify? And that it performs comparably to fine-tuning a 30x larger state-of-the-art language model. He went on that GSM8K and the MATH benchmark featured in Let's Verify are great testbeds for self-improvement because model outputs can be evaluated for correctness more or less automatically.
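
For concreteness, here is a minimal sketch of that STaR-style self-improvement loop as described above (omitting the paper's extra "rationalization" step, where failed questions get the answer as a hint). The `sample_rationale` and `fine_tune` functions are hypothetical placeholders, not any particular library's API.

```python
def sample_rationale(model, question):
    """Sample one (rationale, final_answer) pair from the model (placeholder)."""
    raise NotImplementedError

def fine_tune(model, examples):
    """Fine-tune the model on (question, rationale, answer) triples (placeholder)."""
    raise NotImplementedError

def star_loop(model, dataset, n_iterations: int = 3, n_samples: int = 8):
    for _ in range(n_iterations):
        keep = []
        for question, correct_answer in dataset:
            for _ in range(n_samples):
                rationale, answer = sample_rationale(model, question)
                if answer == correct_answer:   # cheap, automatic correctness check
                    keep.append((question, rationale, answer))
                    break                      # one good rationale per question is enough
        model = fine_tune(model, keep)         # train on the model's own successful rationales
    return model
```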

This brings us on to another strand in what all of this actually means for us. He said for more open-ended generation, humans often provide the feedback. However, as LLMs have gotten more capable, an interesting emerging ability is that they're getting better at evaluation for other things, not just math.

At some point, if self-evaluation or self-critique works reliably, you get general self-improvement beyond math. So this is a further possibility for what might have freaked out those researchers. Generalizing beyond math, though, is hard, as Andrej Karpathy pointed out this week. I think a lot of people are broadly inspired by what happened with AlphaGo.

In AlphaGo, this was a Go playing program developed by DeepMind. AlphaGo actually had two major stages, the first release of it did. In the first stage, you learn by imitating human expert players. So you take lots of games that were played by humans, you kind of like just filter to the games played by really good humans, and you learn by imitation.

You're getting the neural network to just imitate really good players. This works and this gives you a pretty good Go playing program, but it can't surpass human. It's only as good as the best human that gives you the training data. So DeepMind figured out a way to actually surpass humans, and the way this was done is by self-improvement.

Now in the case of Go, this is a simple closed sandbox environment. You have a game and you can play lots of games in the sandbox and you can have a very simple reward function, which is just winning the game. So you can query this reward function that tells you if whatever you've done was good or bad, did you win, yes or no.

This is something that is available, very cheap to evaluate, and automatic. And so because of that, you can play millions and millions of games and kind of perfect the system just based on the probability of winning. So there's no need to imitate, you can go beyond human. And that's in fact what the system ended up doing.

So here on the right, we have the Elo rating and AlphaGo took 40 days in this case to overcome some of the best human players by self-improvement. So I think a lot of people are kind of interested in what is the equivalent of this step number two for large language models, because today we're only doing step one.

We are imitating humans. As I mentioned, there are human labelers writing out these answers and we're imitating their responses. And we can have very good human labelers, but fundamentally, it would be hard to go above sort of human response accuracy if we only train on the humans. So that's the big question.

What is the step two equivalent in the domain of open language modeling? And the main challenge here is that there's a lack of reward criterion in the general case. So because we are in a space of language, everything is a lot more open and there's all these different types of tasks.

And fundamentally, there's no like simple reward function you can access that just tells you if whatever you did, whatever you sampled was good or bad. There's no easy-to-evaluate, fast criterion or reward function. If models can get good at generalization using reinforcement learning with any of these techniques, Ilya Sutskever has a slight warning that he put out earlier this year.

He compared the creative results we might get to outbursts from Bing Sydney. Reinforcement learning has a much more significant challenge. It is creative. Reinforcement learning is actually creative. Every single stunning example of creativity in AI comes from a reinforcement learning system. For example, AlphaZero has invented a whole new way of playing a game that humans have perfected for thousands of years.

It is reinforcement learning that can come up with creative solutions to problems, solutions which we might not be able to understand at all. And so what happens if you do reinforcement learning on a long or even medium time horizon, when your AI is interacting with the real world, trying to achieve some kind of a beneficial outcome, let's say, as judged by us, but while being very, very, very creative?

This does not mean that this problem is unsolvable, but it means that it is a problem. And it means that some of the more naive approaches will suffer from some unexpected creativity that will make the antics of Sydney seem very modest. So that's as far as I've gotten. I might be completely wrong.

Let me know what you think in the comments. I think the development is likely a big step forward for narrow domains like mathematics, but is in no way yet a solution for AGI. The world is still a bit too complex for this to work yet. Anyway, time to move on to something more positive.

After all, even Sam Altman can now get along with Adam D'Angelo. So anything is possible. I'm going to end with some positive and amazing news about music generation. But first I want to introduce you to the AI Explained bot. If you're feeling bored, you may even want to discuss the contents of this video and Q* with the AI Explained bot.

It has access to the transcripts of my videos, including this one. I'm proud to announce that Assembly AI are sponsoring this video, and their playground is honestly amazing. In fact, I reached out to them about sponsorship. It's that good. Their playground is super easy to use, even if you're not from a coding background.

And as they know, their speech-to-text model, Conformer-2, is state of the art, and it is particularly good on alphanumerics. A perfect example of that is how it can transcribe "GPT-4". That is something in my transcripts that so many models struggled with. Anyway, I have honestly thought that their playground is amazing for anyone to use for months now.

And yes, it's literally just clicking to upload your audio file and then pressing transcribe. Anyway, thanks to Assembly AI, you can now play about with the AI Explained bot. After such a heavy video, I think it's only appropriate to end with a bit of music, but not any music, music generated by Google DeepMind.

Their new Lyria model can convert your hums into an orchestra. As always, thank you so much for watching and whatever happens, have a wonderful day.