
Stanford CS25: V5 | Large Language Model Reasoning, Denny Zhou of Google DeepMind


Transcript

All right. So hi, everyone. We're going to get started. For today's CS25 lecture, it's a great pleasure to have Denny Zhou from Google DeepMind here to give a talk on large language model reasoning. Denny founded the reasoning team at Google Brain, which is now part of Google DeepMind.

His group is renowned for pioneering chain-of-thought prompting and self-consistency, as well as developing the mathematical foundations of in-context learning and chain-of-thought reasoning. His team also created core capabilities that power Gemini's reasoning. Further, Denny co-founded the Conference on Language Modeling, or CoLM, and served as general chair for the 2024 conference.

So yeah, I'll let Denny take it from here. Yeah, I'm glad to see many of you already believe LLMs can kind of reason. Actually, you may wonder what my answer to this question is. To me, honestly, I don't know. That really depends on the definition of reasoning.

So, for my talk today, we have a very specific definition of reasoning. I know there are many debates about whether LLMs can reason. I never joined those debates, because without a definition of reasoning, I have no idea what they are about. By LLM reasoning, we particularly mean the intermediate tokens generated between input and output. This idea actually is not very new.

Even in 2017, DeepMind already published a paper on using intermediate tokens to solve math problems. At that time, I think the community was mostly excited about AlphaGo and AlphaZero. But this paper is a really ground-breaking paper. If you haven't read it before, I strongly encourage you to look at it.

So, they introduced natural language to solve math problems. However, in the literature at that time, I think everyone else just used symbolic approaches or search. This idea is also very common in the neuro-symbolic literature, where it's very common to use an intermediate process to solve reasoning problems.

Here's an example of LLM reasoning. When I founded the reasoning team at Google Brain, I created this task, called last-letter concatenation, and used it as a motivating example of whether transformer models could solve it at that time. So, what's the output when concatenating the last letter of each word of "artificial intelligence"?

So, if there's no reasoning process, the output would just be, "The answer is LE." If there's a reasoning process, the model would output something like, "The last letter of 'artificial' is L, the last letter of 'intelligence' is E, concatenating L and E gives LE." The highlighted text here is what we call reasoning.
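For reference, the ground truth for this task is trivial to compute in code (a tiny illustration, not from the talk):

```python
# Ground truth for the last-letter concatenation task described above:
# take the last letter of each word and join them.
def last_letter_concat(phrase: str) -> str:
    return "".join(word[-1] for word in phrase.split())

print(last_letter_concat("artificial intelligence"))  # -> "le"
```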

So, if you are familiar with program synthesis or neuro-symbolic reasoning, you wouldn't be surprised by this task design. Of course, you can imagine that I tried other options. For example, why didn't I use the first letter? The reason is that I tried first letters, and all large models could solve that problem quite well.

Because there are so many acronyms and initialisms on the web, the model has already learned how to concatenate first letters. Then I switched to last letters, and all models failed. I know many people say, "Oh yeah, this is so natural, right? We need intermediate steps, just like humans." I know, these days, you may see LLMs as very similar to humans.

But for us, as researchers, we should always keep in mind that LLMs are just probabilistic models; they are not humans. If you always keep this in mind, it will be easier for you to understand a lot of new techniques. So, why do intermediate tokens matter? We have a theoretical work on this, a collaboration with Professor Tengyu Ma at Stanford and his students.

So, for any problem solvable by a Boolean circuit of size T, a constant-size transformer can solve it by generating O(T) intermediate tokens. It's a very powerful result. The size here means the number of logic gates. For example, if we think of a GPU cluster, that would be tens of millions of gates, right?

Even billions or trillions. If we directly generate final answers instead, the model either requires a huge depth or cannot solve the problem at all. That's how we understand reasoning from a theoretical perspective. Later in this lecture, I will come back to this theoretical argument. There's a common belief about LLM reasoning: that pre-trained LLMs cannot reason without further prompt engineering, like CoT prompting, or fine-tuning; these days, everyone talks about RL fine-tuning, right?

Is that true? Do you agree with that? Okay. I believe it's wrong. Very wrong. Pre-trained LLMs are ready to reason; all we need is the decoding process, no matter how fancy those other techniques look these days.

So, here's an example. If I have three apples, and my dad has two more apples than me, how many apples do we have in total? If you have any pre-trained model, like Llama, DeepSeek, or Qwen (I haven't tried all of those, okay), you can type this question into the pre-trained model and see what happens.

It's very likely you'll see an answer like "five apples." Of course, the answer is wrong here. This is greedy decoding. You might say, okay, you're right: for pre-trained models, there's no reasoning. But the problem is really about decoding, because we use greedy decoding by default. Now look at the second candidate for the first token; we have a big vocabulary, so there are many candidates. The response will start from "I," and we can see what happens.

We just continue the decoding process from there, and we'll see: "I have three apples, my dad has two more apples than me, so he has five apples, and three plus five equals eight." It's perfect, right? We just need to look at more candidates. That's amazing. And there's another choice: the third candidate for the first token is "We," so let's see what happens there.

"We have eight apples in total." Somehow it's also correct. And the fourth candidate is "You"; we continue decoding and see what happens. Again, you can clearly see a chain of thought in this response, and the final answer is correct. And this is the fifth candidate for the first token, which just says "5," and that's wrong, okay.

You can see that the reasoning paths are actually already in the output space. In particular, the second response and the fourth response are based on chain-of-thought reasoning. The problem is how to select the best response, right? If we just look at the examples here, you might say we could select by output length: if the model does some thinking, the output will be longer, because it contains reasoning tokens.

But actually, we have a better idea for selecting the response: by its answer confidence. Confidence means that, because the model is just a probabilistic model, we can look at the probability of the tokens in the answer prediction. A very interesting thing happens for responses with chain-of-thought reasoning.

The answer token has way higher confidence. For this example, for the token "8," the model's confidence is nearly 98%. You can imagine that's huge, right? Because we have a huge vocabulary size, usually, for each token, the probability is nearly zero. So this process is called chain-of-thought decoding.

Basically, it consists of two steps. Step one, we go beyond greedy decoding by checking more generation candidates. Step two, we choose the candidate which has the highest confidence on the final answer. Chain-of-thought decoding is a very simple approach.
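As a rough sketch of those two steps (an illustration, not the exact procedure from the paper), assuming a Hugging Face causal LM; the model name and the answer-confidence heuristic below are assumptions:

```python
# A minimal sketch of chain-of-thought decoding. "gpt2" is a placeholder model;
# any causal LM loads the same way. The confidence heuristic here (mean
# top-token probability over the last few generated tokens) is a simplification.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def cot_decode(prompt: str, k: int = 5, max_new_tokens: int = 64):
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        next_logits = model(**inputs).logits[0, -1]            # next-token logits
    topk = torch.topk(torch.softmax(next_logits, dim=-1), k)   # k candidate first tokens

    candidates = []
    for first_token in topk.indices:
        # Step 1: branch on each candidate first token, then continue greedily.
        branch = torch.cat([inputs["input_ids"][0], first_token.view(1)]).unsqueeze(0)
        out = model.generate(
            branch,
            max_new_tokens=max_new_tokens,
            do_sample=False,                  # greedy continuation
            output_scores=True,
            return_dict_in_generate=True,
        )
        continuation = tok.decode(
            out.sequences[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        # Step 2: score the branch by the model's confidence near the end,
        # as a crude stand-in for confidence on the final answer tokens.
        top_probs = [torch.softmax(s[0], dim=-1).max().item() for s in out.scores[-5:]]
        candidates.append((sum(top_probs) / len(top_probs), continuation))

    return max(candidates)  # (confidence, continuation) with the highest confidence
```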

But still, it needs some programming work. And I hear that these days people just want to use natural language, right? No one writes code. Of course, you guys are exceptions. So we have to ask: can we reshape the model's output distribution so that chain-of-thought responses naturally rank first?

If the chain-of-thought response is ranked first, then greedy decoding can naturally find it, right? So now let's look at chain-of-thought prompting. If you know chain-of-thought prompting, now you can see why it works. Chain-of-thought prompting is a very simple approach.

Given your problem, you use another, similar problem with its step-by-step solution as an example, and put that before your question. Then the model will magically follow the reasoning style and generate a step-by-step solution. Now you can see why chain-of-thought prompting works: it changes the output distribution to push the chain-of-thought solutions in the output space to the top position.
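As a tiny illustration of what such a prompt looks like (the exemplar text below is made up, not from the talk), a few-shot chain-of-thought prompt simply prepends a worked example; the simpler zero-shot variant discussed next just appends "Let's think step by step":

```python
# A hypothetical few-shot chain-of-thought prompt: one worked example with its
# step-by-step solution is placed before the actual question.
exemplar = (
    "Q: I have 4 pens and my friend has 3 more pens than me. "
    "How many pens do we have in total?\n"
    "A: My friend has 4 + 3 = 7 pens. Together we have 4 + 7 = 11 pens. "
    "The answer is 11.\n\n"
)
question = (
    "Q: I have three apples, and my dad has two more apples than me. "
    "How many apples do we have in total?\n"
    "A:"
)
few_shot_cot_prompt = exemplar + question                       # chain-of-thought prompting
zero_shot_cot_prompt = question + " Let's think step by step."  # zero-shot variant
```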

There's an even simpler approach, called "Let's think step by step." It's another amazing work in reasoning. When that paper came out, I thought it was a joke. How could that be possible? At that time, the Google Brain team had built a model called PaLM, and I tried "Let's think step by step" on our PaLM model, because of course I know how PaLM was built.

It's definitely not related to this magic trick. And then I found it works on PaLM. I was so shocked. This paper really inspired me a lot in reasoning research. Those prompting approaches are really simple, and prompting really works. But we can also see some pitfalls.

Take CoT prompting: it needs task-specific examples. To me, I don't feel comfortable about that. If I have a question to ask someone, and I already know a similar problem, then I can solve it by myself, right? Why should I ask other people? And the other approach is "Let's think step by step."

It's generic, okay? You don't have to find similar examples. You just say "Let's think step by step," and the magic comes out. Unfortunately, it performs much worse than few-shot prompting. And, as I just mentioned, both approaches also look weird, right? Even "Let's think step by step" is weird, right?

If I ask somebody a question, do they have to be told "Let's think step by step"? Otherwise they couldn't think anymore, right? That's not what we'd expect. So, how do we fix it? There's a popular approach called supervised fine-tuning, SFT. The idea is actually very simple: we collect a set of problems and step-by-step solutions from human annotators.

And then we maximize the likelihood of the human solutions. Maximum likelihood is exactly how LLMs are trained, by predicting the next token; it's just maximizing likelihood. And after that, we can apply the model everywhere. I listed the DeepMind paper from 2017, the one I mentioned at the very beginning. They did exactly something like that.

They collected a set of math word problems along with human-annotated step-by-step solutions, and then they trained a sequence-to-sequence model to solve math problems. In 2021, OpenAI further extended that approach and built a much larger dataset called GSM8K, grade-school math problems. They then used that dataset to fine-tune GPT-3 models.

So, here, let me give an example of how it works. You just put problems here. For example, at the beginning I mentioned last-letter concatenation; you can put that here as a problem with its answer. Another one could be a grade-school math problem.

You can put in as many of these as you want, use that as training data to fine-tune your model, and then test the model with a new question: how many R's are in "strawberry"? You probably know why I particularly chose this problem. On social media, many people believe it's a good question for testing whether AGI has arrived or not.
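As a rough sketch of that training step (assuming a Hugging Face causal LM; "gpt2" is only a placeholder), supervised fine-tuning is just masked next-token maximum likelihood on the (problem, solution) pairs:

```python
# A minimal sketch of supervised fine-tuning on (problem, step-by-step solution)
# pairs: standard next-token maximum likelihood, with the loss masked so that
# only the solution tokens contribute. Token-boundary effects are ignored here
# for simplicity.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def sft_step(problem: str, solution: str) -> float:
    prompt_ids = tok(problem, return_tensors="pt").input_ids
    full_ids = tok(problem + " " + solution, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100           # ignore the loss on problem tokens
    loss = model(input_ids=full_ids, labels=labels).loss  # cross-entropy on solution tokens
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```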

Yeah, and SFT is a really generic approach. Once you train the model, you can apply it anywhere. And if that could solve reasoning, my talk would be done here, right? We wouldn't have to talk more. Just collect more examples from those brilliant minds at Stanford, train the model, and it's done.

But actually, it doesn't generalize well. We realized this issue in the summer of 2021, when we found it didn't work well on reasoning. What could we do? Scaling, scaling, scaling: get more data to train the model and see how it works. The lesson here is, don't scale blindly.

Once the paradigm is wrong, no matter how you scale, it doesn't work. So, how do we fix this generalization failure of SFT? Let's look at the SFT procedure here: just two steps. Where's the mistake? The mistaken part actually comes from humans. If you didn't know that before, you'll be surprised, right?

If human annotations are wrong, how does Scale AI make money? Actually, one of my team members who invented RL fine-tuning once told me that responses generated by machines could be even better for training than human data. I was really surprised at the very beginning.

So, the first attempt is called self-improvement. Exactly, just change that one step: instead of collecting data from humans, we let the model generate the data. Collect a set of problems, then let your model generate step-by-step solutions, and then, again, maximize the likelihood, but only on solutions with correct answers. For math problems, you may know the final answer, right?

You know the ground-truth answer, but you don't have step-by-step solutions. Okay, let the model generate step-by-step solutions, and then use the true answer to decide which responses to keep. If the answer from a solution is correct, keep it; otherwise reject it. It's called rejection sampling.
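A rough sketch of this rejection-sampling loop (the `generate_solution` and `extract_answer` helpers are assumptions, not a real API):

```python
# A minimal sketch of rejection sampling for self-improvement: sample several
# step-by-step solutions per problem, keep only those whose final answer
# matches the known ground truth, and reuse them as fine-tuning data.
def build_self_training_set(problems, answers, generate_solution, extract_answer,
                            samples_per_problem: int = 8):
    dataset = []
    for problem, gold in zip(problems, answers):
        for _ in range(samples_per_problem):
            solution = generate_solution(problem)       # sampled CoT + answer
            if extract_answer(solution) == gold:        # verifier: exact match
                dataset.append((problem, solution))     # accept
            # otherwise reject the sample
    return dataset  # then fine-tune on this set exactly as in SFT
```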

And then you can use this dataset to fine-tune your model, exactly as you did in SFT. The only difference is that the data comes from your model, not from humans. This approach was proposed by Eric, Tony, and also Noah. The paper is called STaR.

Yeah, the STaR approach; it's a very amazing paper. Actually, in the STaR paper, when they proposed the approach, they considered using it to save labeling cost, because human labeling is really expensive. But these days, we understand this approach from a different perspective. Once the training data is generated by the model, the model can be self-improved, right?

And after the model improves, we can collect data again. Then this approach becomes just the same as the RL fine-tuning approach we see today. I put a paper here; I think it's a paper by researchers at ByteDance published in January 2024. I think this is the earliest academic publication I have noticed about RL fine-tuning.

The paper title is literally "Reasoning with Reinforced Fine-Tuning." After OpenAI's o1 got popular, everyone in the public began to realize the power of RL fine-tuning. I believe multiple institutions independently discovered this idea. It's such a simple idea, but it works really well. Of course, after seeing this RL fine-tuning process, you see that we need a verifier in the training loop; the verifier tells us which response is correct.

Because we know the final answer, we just need to use it to select the step-by-step reasoning paths. So, a reliable verifier is the most crucial thing in RL fine-tuning, not the RL algorithms. I know that nowadays so many people talk about different algorithms, tons of variants of PPO or REINFORCE.

If anyone finds an algorithm that is significantly better than the others, please let me know; probably I missed something. I really like what Richard Sutton said here: verification, the key to AI. That's the title of an article by Richard Sutton from 2001. OK, now a very interesting question: why generate the data from the model instead of from humans?

That's a really interesting question, right? It's not about saving cost, it's about performance. Does anyone have an idea here? Yeah, is it about consistency of the chain-of-thought structure? How about consistency, OK. Yeah. The distribution is closer to what the model itself generates, so it's easier to train the model.

Yeah, excellent, thanks. This relates to the first principle in machine learning: directly optimize what we want. I don't know if anyone still remembers their machine learning basics here. Of course, you guys should remember that. So, if we want to build a model for reasoning, right?

Or, just in general, for generating something interesting, we need to optimize the metric that measures generation quality. Those metrics could be very different. For example, if we're solving math problems, we care about correctness, whether the answer is correct or not. For machine translation, you would optimize the BLEU score.

Or just any metric that measures the quality of the generations. Once you have a metric, all we need is to compute gradients of that metric and do backpropagation. Mathematically, we can write this as a formula: we need a function R to measure the response quality, given the problem, and we have the model parameters theta.

OK. Of course, you can say R is a reward, or R is your answer accuracy, or R is your BLEU score. It doesn't matter; you can define any R you want. That's your target, right? And then compute the gradient. Since the model is a probabilistic model, we need to maximize the expected value of the metric.
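In symbols (a hedged restatement of the objective just described; the notation is illustrative), the goal is to maximize the expected quality R of responses y sampled from the model, and the sampled gradient estimator is exactly the policy gradient mentioned next:

```latex
% Maximize the expected response quality under the model p_\theta(y | x);
% the gradient can be estimated by sampling (the REINFORCE / policy gradient form).
\[
J(\theta) = \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}\bigl[R(x, y)\bigr],
\qquad
\nabla_\theta J(\theta) = \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}
\bigl[R(x, y)\, \nabla_\theta \log p_\theta(y \mid x)\bigr].
\]
```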

So, how do we do it? We need to do sampling to compute the expectation. That's how you get the policy gradient. That's how it works. If you understand the mathematical principles here, there's no magic. I know some people would like to talk about this in a more magical way.

For example, how to incentivize your model to think, incentivize your model to reason. I don't use those words. I just use standard machine learning words: define your metric, compute the gradient, and do backpropagation. That's all. Of course, once you find that your paradigm works well, you need to scale your approach.

Not a problem; there's a lot to scale. And the interesting thing is that for this RL fine-tuning approach, we scale the output length, the length of the CoT, rather than necessarily scaling the model depth. Because, from our theoretical analysis, as long as your CoT is long enough, the model can solve nearly every computable problem.

So, that's amazing. You don't have to scale your model size; you just need a minimal, constant-size transformer. And that's fine. Actually, if you look at the literature, people mostly talk about scaling model size, not about scaling the length of the chain of thought. Now I want to give an example of why LLM reasoning is so different from classical AI.

In December 2024, Google released a model called Gemini 2.0 Flash Thinking. Of course, 2.5 Pro is much more powerful now, but I used that model for a particular reason. In December 2024, right after the model was released, I tried a math problem, just to ensure the problem was not in our training set, okay.

Because I used the number 2025, which was the next year at the time; now it's this year, okay. Use the numbers from 1 to 10 to make 2025, using each number once and only the operations of addition and multiplication. Of course, one can write a Python program, do an exhaustive search, and get the result, right?
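As a rough illustration of that kind of exhaustive search (not from the talk), a memoized subset search with pruning can check whether 2025 is reachable; it reads "each number once" as "at most once," and it is unoptimized, so it may take a minute or two to run:

```python
# Brute-force check: can some of the numbers 1..10, each used at most once, be
# combined with + and * to make 2025? Values above the target are pruned, which
# is safe because all operands are positive integers, so values never decrease.
from functools import lru_cache

NUMBERS = list(range(1, 11))
TARGET = 2025
FULL_MASK = (1 << len(NUMBERS)) - 1

@lru_cache(maxsize=None)
def reachable(mask: int) -> frozenset:
    """All values obtainable by combining the numbers selected by `mask`."""
    selected = [NUMBERS[i] for i in range(len(NUMBERS)) if mask & (1 << i)]
    if len(selected) == 1:
        return frozenset(selected)
    values = set()
    sub = (mask - 1) & mask
    while sub:                      # enumerate splits into two non-empty halves
        other = mask ^ sub
        if sub < other:             # visit each unordered split exactly once
            for a in reachable(sub):
                for b in reachable(other):
                    if a + b <= TARGET:
                        values.add(a + b)
                    if a * b <= TARGET:
                        values.add(a * b)
        sub = (sub - 1) & mask
    return frozenset(values)

print(any(TARGET in reachable(m) for m in range(1, FULL_MASK + 1)))  # expected: True
```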

Let's look at the thinking process in the red panel, generated by the model. For Gemini models, you can actually inspect the thinking process; it's very interesting to look at. Let's see how the model did the thinking. It's done by search. See that? At the very beginning, the model said, okay, this is a relatively large number.

It suggests multiplication will be heavily involved. It's just like human thinking, right? It even said, okay, it's worth noting that 2025 is 45 squared, 45 times 45. Actually, when I made this question, even I didn't realize that. That's a huge hint. And it said, okay, the target is large, so I'll start thinking about how to get large intermediate products using multiplication.

And it says, blah, blah, let's aim for products that get us close to the square root of 2025, which is 45. You see that? After that, I actually made a cutoff here; the thinking is very, very long. That's why we want long CoT in RL fine-tuning.

And you can find the answer. After thinking, the model showed the final answer, and it exactly followed its thinking process. You see, let's break it down. The first part: 10 times 4 plus 5 equals 40 plus 5, which equals 45. The second part is also, again, 45, and then 45 times 45 gives 2025.

That's amazing, right? We don't need any search. I don't know if anyone has read another paper related to chain-of-thought prompting, called tree-of-thoughts prompting. Anyone read that paper? Great, yeah. In that paper, there's a very interesting example, the Game of 24. This 2025 problem is way harder than the Game of 24.

In tree-of-thoughts prompting, they combine search with prompting to solve the Game of 24. But now you don't need that at all, right? The model can solve the Game of 24 just by natural language. Let's see that. This is why chains of thought are so powerful. It's amazing. And again, I would like to cite Richard Sutton here.

You see, in the bitter lesson, right, the core idea is: building in our discoveries only makes it harder to see how the discovering process can be done. I think Richard Sutton wrote the bitter lesson after he joined DeepMind and saw the success of AlphaGo and AlphaZero.

And he said, okay, only two methods are really scalable: one is learning, the other is search. But here, I would like to emphasize only one thing: learning is scalable. We just need learning. Now, for RL fine-tuning, the big advantage is that it generalizes so well, but only for automatically verifiable tasks.

Because we need a verifier in the loop; there's no way to put a human in the loop there. And of course, not all tasks are automatically verifiable. Can anyone give examples of non-verifiable tasks? Yeah, creative writing. Creative writing, right? Yes.

Yeah, great example. That's the big restriction for RL fine-tuning at this point. I know so many people are really interested in creating RL algorithms to improve the approach. I really want to see us spend more time thinking about how to solve those non-verifiable tasks.

Many real problems are actually non-verifiable, like creative writing, and even coding. I know some people say coding will be solved by AI in a few years; I think it will be very challenging to solve. When they talk about programming, they usually only talk about competitive programming.

Competitive programming is not like our daily programming work, right? When we write code, we care about design, readability, how to collaborate with other people, not just producing a final answer. Okay, I have talked about all of these ideas. At the very beginning, I talked about CoT decoding, okay?

The reasoning paths are already in the output space, and all we need to do is decoding: reshape the output distribution such that greedy decoding can find them, okay? Then I talked about chain-of-thought prompting, or "Let's think step by step," which can reshape the output distribution.

And then SFT, and then RL fine-tuning. RL fine-tuning is so powerful. But we still have a chance to improve this process. Basically, I want to talk about two key ideas: one is aggregation, the other is retrieval. We have seen that LLM reasoning is really powerful, right? But there is a decoding issue in the paradigm of generating reasoning tokens and then finding the answer.

Right? It's so natural: given the problem, generate the intermediate tokens, and then find the answer. Does anyone see any problem in this process? Any problem? Yeah? It's the design of the model: the model is just designed to predict the next token. The challenge is the way it predicts the next token.

That's what creates a situation where the outcome may not be aligned with the expected outcome. Yeah, great, thanks. The model is originally designed just to predict the next token, yeah. So we need to always keep in mind that LLMs are probabilistic models. They are not humans.

What does that mean mathematically? Let's think about what an LLM does in decoding: given the problem, it generates reasoning and then finds an answer, and the response is found by greedy decoding. What does greedy decoding mean? Argmax of the probability of the whole response, right? However, what we actually want is to argmax over the final answer.

Choose the answer with the maximum probability, that is, choose the most confident answer. So the two are not aligned, right? It's just simple high-school conditional probability, but it's really useful for understanding the decoding process. So let's fix it. We just need to go one step further, okay.

If we generate reasoning paths, we should sum over all reasoning paths to find the probability of the final answer. In machine learning terms, this is called marginalization: just sum over all of them, because the reasoning paths are essentially latent variables. And of course, if we have studied machine learning, we know this sum can be computed by sampling.
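A hedged formalization of that mismatch (the notation is illustrative): greedy decoding approximately picks the single most likely pair of reasoning path r and answer a, while what we want is the answer with the highest marginal probability, summing over reasoning paths:

```latex
% Greedy decoding vs. the marginal-probability answer we actually want.
\[
(\hat r, \hat a)_{\text{greedy}} \approx \arg\max_{r,\, a}\; p_\theta(r, a \mid x),
\qquad
\hat a = \arg\max_{a}\; \sum_{r} p_\theta(r, a \mid x).
\]
```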

And once you get this idea, you see that it's exactly the motivation for another popular approach, called self-consistency: generate multiple responses by random sampling, and then choose the answer that appears most frequently. Let me show a simple example. For this math problem, you could sample the response many times.

For the first response, you might get, let's say, $18. For the second one, you might get $26. And again, you might get $18, right? Then we look at the final answers and choose the most frequent one. That's exactly the process that implements marginalization over probabilities.

We don't look at the reasoning paths; we only choose the most frequent answer, not the most frequent reasoning path. That's the trick. That's marginalization, done empirically. And if you apply this approach, you see a huge improvement. That's really surprising. I know these days you may think that to get a huge improvement, you probably need to spend a lot of time building sophisticated mathematical formulations.
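A minimal sketch of that recipe (the `sample_response` and `extract_answer` helpers are assumptions, not a real API):

```python
# A minimal sketch of self-consistency: sample several reasoning paths, extract
# only the final answer from each, and return the most frequent answer.
from collections import Counter

def self_consistency(problem, sample_response, extract_answer, n_samples: int = 16):
    answers = []
    for _ in range(n_samples):
        response = sample_response(problem, temperature=0.7)  # random sampling
        answers.append(extract_answer(response))              # ignore the path itself
    # Majority vote over final answers: empirical marginalization over paths.
    return Counter(answers).most_common(1)[0][0]
```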

We don't have to. Okay. For GSM8K problems, we can see that fine-tuned GPT-3 models got about 33%. Then OpenAI used a verifier to get about 55%. That's amazing, a big performance gain from the verifier. However, the most surprising thing is that after applying self-consistency, the accuracy jumped to about 75%.

The relative improvement is nearly 50%. And using PaLM-2 gets an accuracy of even 92%. Of course, one may say, okay, that's for PaLM models, models from several years ago. Sounds like 10 years ago, but these days every year is like a decade.

The whole field is moving so fast. Actually, look at the o1 model. I forget when OpenAI released the o1 models, probably October last year, right? They also showed results with aggregation. See that "consensus@64"? We still see a great improvement from aggregation, or self-consistency.

Yeah. Of course, self-consistency is more expensive. Yes, great question. Self-consistency uses more samples, so it is more expensive and uses more tokens. People see that as a kind of inference-time scaling. There are many ways to do inference-time scaling.

If you use a longer CoT, that will also increase inference time. So actually, when some people told me about inference-time scaling earlier, I didn't know what it exactly meant, unless they could say precisely what is being scaled. And self-consistency is definitely a way to scale up.

Also, self-consistency is naturally self-calibrated: higher consistency indicates higher accuracy. This is for the GSM8K benchmark. Actually, when the self-consistency is more than 80%, the accuracy is nearly 100%. So, if you care about uncertainty or confidence in predictions, you can simply try sampling multiple times.

I have two short questions here, to make sure everyone got the key ideas in self-consistency; I hope you have a lot of fun using this simple idea. The first question: when the LLM outputs a direct answer without intermediate steps, would you still sample several times and then choose the most common answer?

Would you? Does anyone have an answer? If the model just directly generates a final answer, what do we do? Yeah, go ahead. You can just read off the probabilities. Exactly. Yes, exactly. Just like what we did in classical machine learning, right?

We would use logistic regression to get p(y|x); we just need to take the answer with the maximum probability there. That's why you don't see self-consistency in the old machine learning literature: it's unnecessary. It's only useful for LLM reasoning. Once we have reasoning, then we need self-consistency.

The second question: what about changing self-consistency by letting the LLM generate multiple responses in one pass, instead of sampling multiple times, and then choosing the most common answer? Does this make sense? You just tell the model to generate five answers instead of sampling five times, right? So actually, when you think about it, again, for everything we just need to follow the machine learning principle.

This principle is called max marginal inference: you just choose the final answer with the maximum marginal probability. That's all we need to know. You don't have to think about fancy stories about LLMs, and you don't have to compare them with humans. Math is all we need here.

Of course, self-consistency has a limitation: it needs a unique final answer, right? You check the frequency of each unique answer. For general problems, it's hard to expect the answer to be a single number or short string. For example, for this problem, you may see that all the answers are different.

Okay? In this case, we have an extension of self-consistency called universal self-consistency. For the problem here, you can see the second response is the most common one, because the three countries it lists appear in all the other answers, right? And we just need to let the LLM choose the most consistent response.
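A minimal sketch of that idea (the `call_llm` helper and the selection-prompt wording are assumptions, not a real API):

```python
# A minimal sketch of universal self-consistency: instead of majority-voting
# over exact answer strings, show the model its own sampled responses and ask
# it to pick the one most consistent with the others.
def universal_self_consistency(problem, call_llm, n_samples: int = 8):
    responses = [call_llm(problem, temperature=0.7) for _ in range(n_samples)]
    numbered = "\n\n".join(f"Response {i + 1}:\n{r}" for i, r in enumerate(responses))
    selection_prompt = (
        f"Question:\n{problem}\n\n"
        f"Here are {n_samples} candidate responses:\n\n{numbered}\n\n"
        "Select the response that is most consistent with the others. "
        "Reply with only its number."
    )
    choice = call_llm(selection_prompt, temperature=0.0).strip()
    idx = int("".join(ch for ch in choice if ch.isdigit()) or 1) - 1
    return responses[max(0, min(idx, n_samples - 1))]
```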

Okay. I've talked about how to use aggregation to improve reasoning. The other way is retrieval. I know there's a lot of debate about LLM reasoning; people say LLMs may just do retrieval instead of reasoning. I saw that debate on social media.

Actually, to me, it's always hard to differentiate retrieval and reasoning. I'm a senior area chair for the major conferences almost every year, and we always have to talk about the novelty of each paper; actually, that's similar to the retrieval-versus-reasoning debate, right? Yeah?

Similar to the concept of self-consistency, I saw an experiment trying different models run in parallel, like running GPT-4 concurrently with Gemini 2.5, all different models in parallel on the same question, and then at the end just having something select the most consistent result.

Yes, yes. If you generate responses from different models, that would be more like a model ensembling approach, with many models and combined results, like a random forest. The mathematical principle is not exactly the same as self-consistency,

but the implementation is the same. Yes, great point. Again, I'm not interested in the debate about retrieval versus reasoning. I work in industry, and I really just care about performance. So to me, just do retrieval plus reasoning; why bother with the debate, right?

Yeah. So in 2024, we had a paper about analogical reasoning. I can use this small example to show why retrieval is important in reasoning. For this problem, see, what's the area of the square with these four vertices, blah, blah, blah. The highlighted text is added by me.

It's a prompt, okay: recall a related problem, and then solve this one. At that time, I tried GPT-3.5 and also our own model, and they failed to solve this problem. After adding the prompt about recalling related problems, the model can solve it.

Okay. Let's see what happened here. After telling the model to recall related problems, the model did find a related problem. A related problem doesn't mean the same problem; it's indeed just a related one. You can see the related problem here is finding the distance between two points on a coordinate plane.

And there's a formula for that. Then the model says, oh yeah, now I know how to compute the distance, and then how to compute the area. It's just a small case showing how retrieval helps reasoning. Here's another example, called step-back, for physics problems.

Before solving the problem, we just give a few-shot examples to show the model: you can take a step back to consider a more abstract problem, get the principle, and then solve it. That's how retrieval works for reasoning. And now everyone knows deep research. Deep research is exactly the same idea, right?

Okay. So we have a team doing deep research, and there's also OpenAI's deep research. One of OpenAI's deep research leads was my intern; after his PhD, he joined OpenAI and invented deep research. And you can see how deep research works: it can find similar problems or relevant knowledge to solve the problem.

Yeah. The basic idea is very simple. Okay, now I can give a summary. Forget about the debate over whether LLMs can reason or not. For LLMs, reasoning is always better than no reasoning. RL fine-tuning is better than SFT. Aggregating multiple answers is better than a single answer.

Of course, that will be more costly. And retrieval plus reasoning is better than reasoning only. And that's the end of my talk. For the next breakthroughs, I really want to see how to solve tasks beyond those with unique, verifiable answers.

And I also want to see how people build real applications instead of just solving benchmarks. I think all benchmarks will be saturated soon. And I know you guys are very passionate about AGI or building LLMs. I would like to quote Richard Feynman here: "The truth always turns out to be simpler than you thought." I think that's particularly true for LLM research.

I see so many academic papers that always try to make things complicated. That's why I tried to give this talk as simply as possible. It is indeed simple. That's it. Yeah. Thank you. Thanks, Denny, for the very insightful as well as interesting talk. So now we'll be taking questions.

We have some questions online from Slido and Zoom, but also in person. So we can maybe start with some in-person questions. Hi. Thank you for the talk. So earlier on in the lecture, you talked about confidence. And like a common way to do this is like just taking the average log probabilities of output token sequences.

Yeah. So my question is, do you think there are better ways to do this? And also, is this a good indicator for hallucinations? Oh, for the first part: when I talked about confidence, there's no aggregation; it's just the probability from next-token prediction, the conditional probability of the generation.

Yeah. You can just look at the log probs from the model to see the probability. And do you think this is a good indicator for hallucinations? Yeah, it seems so, from our empirical observations. We can see that after a reasoning path, there's a huge jump in confidence for the final answer.

Yeah. Thank you. Hello. Earlier you mentioned that, for example, Richard Sutton said it's about scaling learning and search, and your opinion is more that scaling learning is all you need. I'd just like you to expand more on that and why you believe that search is not as necessary. That's why I used that example.

Okay, so let me make it more concrete. When you build models, you don't have to keep search in mind. But after the model is built, you can use search as a tool; it's a special case of tool use. Like tree-of-thoughts prompting: they just integrate symbolic search with the model.

They can just integrate symbolic search with the model. Yeah. So, but for reasoning research, I just care about the fundamental abilities. Yeah. For example, if we want to solve this problem, the model could be motivated to write a pattern program to solve those problems by search. But for the reasoning process, we don't need to search.

It's just, how to say it. Of course, we can always search over everything; that's why, if you use search to solve a problem, you can get higher accuracy. I don't know; that really depends on what you want, intelligence or just search. Hi. Thank you for the talk.

You mentioned in the case where there's no reasoning that it's not necessary to sample because you can simply look at the logits. But wouldn't sampling converge on a different distribution in the case, for example, where the most likely next token leads to a diffuse distribution for the following token and the different paths spread out.

Whereas if you were to sample and a less likely token were to lead to a sharper distribution, you could actually have a more likely path of tokens there. So, wouldn't these two methods fundamentally differ? Good question. Yeah. The problem is, actually, we still don't know how the distributions are reshaped during the training stage.

It's very unclear there. So, to me, it's very hard to answer this question. We still don't have a good explanation of how those distributions get reshaped into the final distribution. Thank you. Hi. Thank you for the talk. So, how do we differentiate reasoning and answer? Do we need to extract that number from the tokens, from the final output string?

What if the answer is, like, a program? Then how do we differentiate the reasoning and the answer? Yeah, great question. If the answer is a program, it will be harder to extract. So, when people use RL fine-tuning, that's why you see them mostly talk about math problems or competitive programming problems.

Yeah. So, I think for the general case, you have to write a very careful parser for the final answer. I see. And also, what if the problem is very challenging, such that the lower-confidence answer might actually be the correct one? It's possible. Then how can I use self-consistency better?

Self-consistency is not perfect. If it were perfect, everything would be done, right? It's not perfect. All right. Okay. Thank you. So, considering the conversations that AGI is coming, you know, like two to five years from now, if it's true, then let's say 90% of jobs are automated.

What skills do you develop in kids to give them a shot at surviving in the future that is coming? That's a big question. Who said AGI will come in five years? I mean, there's AI 2027, right? By Daniel Kokotajlo. There are lots of conversations in the AI community giving a timeline of two to five years.

I was at ICLR last year; there was a workshop, and I remember an audience member asked me a question on the panel. He said, okay, AI is moving so fast, and what would be the most scary thing in the next few years?

And yeah, I remember some people did talk about the risks of AI. But my answer was: to me, the most scary thing is that the AI winter comes back and then I lose my job. Actually, I see many restrictions in the current approaches. I know many people like the chatbots, the LLM sort of things.

I really want to see real killer applications come out of current AI research. I don't know if anyone really needs all this AI stuff, or if it's just for fun. I'm not quite sure about it. I do know the AI models are really good for programming; they can be a good assistant for coding.

And that's about all I know. We should be fine. Yeah, I think we're out of time. But thanks, everybody, for your great questions. And thanks again to Denny for the great talk. Thank you very much. Thank you.