Stanford CS25: V5 | Large Language Model Reasoning, Denny Zhou of Google DeepMind

00:00:08.960 |
So for today's CS25 lecture, it's a great pleasure 00:00:14.560 |
to have Denny Zhou give a talk on large language model reasoning. 00:00:17.880 |
And so Denny founded the reasoning team at Google Brain. 00:00:23.800 |
His group is renowned for pioneering work 00:00:31.560 |
on in-context learning and chain-of-thought reasoning 00:00:37.240 |
that powered Gemini's reasoning capabilities. 00:00:42.760 |
He also co-founded the Conference on Language Modeling, or CoLM, and served as general chair. 00:00:56.540 |
Yeah, I'm glad to see many of you guys have already 00:01:04.740 |
Actually, you may wonder what's my answer for this question. 00:01:16.580 |
That really depends on the definition of reasoning. 00:01:20.900 |
So, for my talk today, we have a very specific definition about reasoning. 00:01:27.100 |
So, I know there are many debates about whether LLMs can reason. 00:01:33.460 |
Because without a definition of reasoning, I have no idea how to answer those debates. 00:01:39.720 |
But by LLM reasoning, we specifically mean generating intermediate tokens between the input and the output. 00:02:02.340 |
Even in 2017, DeepMind had already published a paper on using intermediate tokens to solve math problems. 00:02:14.620 |
So, at that time, I think the community was quite happy about AlphaGo, AlphaZero. 00:02:22.260 |
But this paper is a really ground-breaking paper. 00:02:25.100 |
If you haven't read that paper before, I strongly encourage you to look at that paper. 00:02:32.100 |
So, they introduced natural language to solve math problems. 00:02:38.700 |
However, in the literature at that time, I think everyone else just used symbolic approaches or search. 00:02:49.420 |
So, this idea actually is also very common in the neurosymbolic literature. 00:02:56.060 |
In the neurosymbolic literature, actually, it's very common to use intermediate processes. 00:03:03.980 |
Here's an example of what we mean by LLM reasoning. 00:03:11.980 |
When I founded the reasoning team in Google Brain, I created this task. 00:03:25.300 |
At that time, no transformer models could solve this task. 00:03:30.540 |
So, what's the output when concatenating the last letter of each word of "artificial intelligence"? 00:03:36.780 |
Without a reasoning process, the model would just say, "Okay, the answer is LE." 00:03:40.780 |
With a reasoning process, the model would output, say, "The last letter of 'artificial' 00:03:46.780 |
is L, the last letter of 'intelligence' is E; concatenating L and E leads to LE," or something like that. 00:03:55.020 |
So, the highlighted text here is called reasoning. 00:03:59.020 |
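For reference, the ground truth for this toy task is trivial to compute in code; a minimal Python sketch (the function name is mine, for illustration):

```python
def last_letter_concat(phrase: str) -> str:
    # Take the last letter of each word and join them.
    return "".join(word[-1] for word in phrase.split())

# "artificial" -> "l", "intelligence" -> "e"
assert last_letter_concat("artificial intelligence") == "le"
```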
So, if you are familiar with program synthesis or neurosymbolic reasoning, you wouldn't be surprised about this task design. 00:04:15.260 |
Of course, you can imagine that I tried other options. 00:04:21.500 |
The reason is that I tried first letters, and all the language models could solve that problem quite well. 00:04:29.500 |
Because there are so many initials on the web, the models have already learned how to concatenate first letters. 00:04:35.500 |
Then I switched to last letters, and all models failed. 00:04:41.740 |
I know many people say, "Oh yeah, this is so natural, right? 00:04:51.980 |
We need intermediate steps, just like humans." 00:04:55.980 |
I know, these days, you may see LLMs as very similar to humans. 00:05:01.980 |
But for us, as researchers, we should always keep in mind that LLMs are just probabilistic models. 00:05:11.980 |
And if you always keep this in mind, it will be much easier to understand a lot of new techniques. 00:05:23.340 |
Okay, we have a theoretical work, a collaboration with Professor Tengyu Ma at Stanford and his students. 00:05:31.340 |
So, for any problem solvable by a Boolean circuit of size T, 00:05:38.220 |
a constant-size transformer can solve it by generating O(T) intermediate tokens. 00:05:47.580 |
So, the size here means the number of logic gates. 00:05:53.500 |
So, for example, if we used a GPU cluster, that would be tens of millions of gates, right? 00:06:04.060 |
If we directly generate the final answer, the transformer either requires a huge depth or cannot solve it at all. 00:06:12.780 |
That's how we understand reasoning from a theoretical perspective. 00:06:20.700 |
So, later in this lecture, I will come back to this theoretical argument. 00:06:30.060 |
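As a rough formal sketch of that claim (my paraphrase, not the paper's exact statement):

```latex
\textbf{Claim (informal).} Let $f$ be a function computable by a
Boolean circuit of size $T$, where size counts the number of logic
gates. Then there exists a transformer of constant size, independent
of $T$, that computes $f$ by autoregressively generating $O(T)$
intermediate tokens before emitting the final answer.
```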
There's a common belief about LLM reasoning: that pre-trained LLMs cannot reason without further prompt engineering, 00:06:41.260 |
like CoT prompting, or fine-tuning; you know, these days, everyone talks about RL fine-tuning, right? 00:07:04.780 |
So, pre-trained LLMs are ready to reason, and all we need is decoding; it's just about the decoding process. 00:07:13.180 |
So, yeah, no matter how fancy those techniques look these days. 00:07:24.540 |
If I have three apples, and my dad has two more apples than me, how many apples do we have in total? 00:07:35.660 |
So, if you have any pre-trained model, like Llama, DeepSeek, or Qwen, or something, and I didn't try those models, okay. 00:07:47.100 |
If you have any pre-trained model, you can type this question into it and see what happens. 00:07:53.260 |
Probably, it's very likely you'll see an answer like five apples. 00:07:56.460 |
Of course, the answer is wrong here, okay; this is called greedy decoding. 00:08:00.300 |
You might say, okay, yeah, you're right; for pre-trained models, there's no reasoning, right? 00:08:05.580 |
The problem is about decoding, because we use greedy decoding by default. 00:08:11.820 |
But you can look at the second candidate for the first token, because you have a big vocabulary size, right. 00:08:24.780 |
The response will start from "I", and we'll see what happens. 00:08:29.020 |
We'll just then continue the decoding process. 00:08:33.020 |
We'll see, okay, I have three apples, and my dad has two more apples than me, so he has five apples, and three plus five equals eight. 00:08:45.020 |
We just need to look for more candidates, that's amazing. 00:08:51.260 |
And there's another choice: the third candidate for the first token is "We", and we'll see what happens here. 00:09:03.260 |
And probably the fourth candidate will be "You"; we'll continue decoding, and we'll see what happens here. 00:09:09.500 |
Again, yeah, you can clearly see a chain of thought in this response, and the final answer is correct. 00:09:21.740 |
And this is the fifth candidate for the first token, "5", and as I said, five is wrong, okay, yeah. 00:09:28.140 |
You can see that actually the reasoning path is already in the output space. 00:09:34.940 |
And in particular here, the second response and the fourth response are based on chain-of-thought reasoning. 00:09:50.940 |
The problem is how to select the best response, right? 00:09:57.420 |
If we just look at the examples here, you may think, okay, we could select by output length: 00:10:04.220 |
if the model does some thinking, the output will be longer, because it contains reasoning tokens. 00:10:20.140 |
But we have a better idea for selecting the response: by its answer confidence. 00:10:26.620 |
Confidence means: because the model is just a probabilistic model, we can look at the probability of the token prediction. 00:10:38.620 |
A very interesting thing is that for the responses with chain-of-thought reasoning, the confidence in the final answer is much higher. 00:10:55.100 |
For this example, for the answer token 8, the model's confidence is nearly 98%. 00:11:11.580 |
Usually, without reasoning, the probability of each answer token is nearly zero. 00:11:18.060 |
So this process is called chain-of-thought decoding. 00:11:30.540 |
Step one, we just go beyond greedy decoding by checking more generation candidates. 00:11:38.540 |
And in the second step, we choose the candidate which has the highest confidence on the final answer. 00:11:55.020 |
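A minimal sketch of these two steps with Hugging Face transformers (the model choice and the answer-scoring step are placeholders for illustration; the talk's experiments used much larger pre-trained models):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = ("Q: I have 3 apples, my dad has 2 more apples than me. "
          "How many apples do we have in total?\nA:")
ids = tok(prompt, return_tensors="pt").input_ids

# Step 1: go beyond greedy decoding. Branch on the top-k candidates
# for the FIRST generated token, then decode each branch greedily.
with torch.no_grad():
    first_logits = model(ids).logits[0, -1]
branches = []
for t in torch.topk(first_logits, k=5).indices:
    seq = torch.cat([ids, t.view(1, 1)], dim=-1)
    out = model.generate(seq, max_new_tokens=60, do_sample=False,
                         output_scores=True,
                         return_dict_in_generate=True)
    branches.append(out)

# Step 2: choose the branch whose final-answer tokens have the highest
# probability. Locating the answer span is task-specific (e.g., parse
# the trailing number); here we just print per-token confidences.
for out in branches:
    confs = [torch.softmax(s[0], -1).max().item() for s in out.scores]
    print(tok.decode(out.sequences[0][ids.shape[1]:]), confs[-3:])
```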
Chain-of-thought decoding is a very simple approach. 00:12:05.500 |
And I heard that, these days, people just want to use natural language, right? 00:12:13.340 |
So we have to ask: okay, can we reshape the model's output distribution so that chain-of-thought responses naturally rank first? 00:12:25.660 |
If the chain-of-thought response is ranked first, then greedy decoding can naturally find it, right? 00:12:34.700 |
So now let's look at chain-of-thought prompting. 00:12:44.060 |
If you know chain-of-thought prompting, now you can see why it works. 00:12:50.060 |
Chain-of-thought prompting is a very simple approach. 00:12:54.220 |
So given this problem, you simply use another, similar problem and its solution as an example. 00:13:09.660 |
And then the model will magically follow the reasoning style and generate a step-by-step solution. 00:13:20.540 |
Yeah, now you can see why chain-of-thought prompting works. 00:13:26.540 |
Because it changes the output distribution, to push the chain-of-thought solutions that are already in the output space to the top position. 00:14:00.700 |
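A minimal sketch of what such a few-shot CoT prompt looks like (the exemplar text is mine, for illustration):

```python
# One worked exemplar teaches the step-by-step style; the model then
# imitates it when it reaches the real question.
prompt = """\
Q: Concatenate the last letters of the words in "machine learning".
A: The last letter of "machine" is "e". The last letter of
"learning" is "g". Concatenating "e" and "g" gives "eg".
The answer is eg.

Q: Concatenate the last letters of the words in "artificial intelligence".
A:"""
```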
And at that time, the Google Brain team built a model called PaLM. 00:14:08.860 |
And I tried "let's think step by step" on our PaLM model, because, of course, I know how PaLM was built. 00:14:18.940 |
It's definitely not related to this magic trick. 00:14:28.700 |
So this paper really inspired me a lot on reasoning research. 00:14:33.740 |
Those prompting approaches, you know, are really simple. 00:15:05.580 |
If I have a question to ask someone, and I already know similar problems, then I can solve it by myself, right? 00:15:20.220 |
And the other approach is called "let's think step by step". 00:15:27.260 |
You just say "let's think step by step", and then the magic will come out. 00:15:33.340 |
Unfortunately, it performs much worse than few-shot prompting. 00:15:54.700 |
Even "let's think step by step" is also weird, right? 00:15:59.260 |
If I ask somebody a question, do they have to follow my words, "let's think step by step"? 00:16:05.020 |
Otherwise, they couldn't think anymore, right? 00:16:15.180 |
So, there's a popular approach called supervised fine-tuning. 00:16:22.780 |
So, for this approach, the idea actually is very simple. 00:16:29.500 |
We collect a set of problems and the step-by-step solutions from human annotators. 00:16:39.180 |
And then we maximize the likelihood of human solutions. 00:16:44.940 |
Maximum likelihood is actually exactly LLM training: predicting the next token. 00:16:55.740 |
And after that, we can apply the model everywhere. 00:17:09.980 |
I mentioned that paper at the very beginning. 00:17:13.500 |
They collected a set of math word problems and also human-annotated step-by-step solutions. 00:17:22.700 |
And then they trained a sequence-to-sequence model to solve math problems. 00:17:26.060 |
In 2021, OpenAI actually further extended that approach, and 00:17:33.260 |
built a much larger dataset called GSM8K: grade school math problems. 00:17:43.260 |
And then they used those data sets to fine-tune GPT-3 models. 00:17:49.260 |
So, here, let me give an example of how it works, okay. 00:17:57.500 |
Like, for example, at the beginning, I said, okay, we can do last-letter concatenation. 00:18:01.740 |
And you can put this example here, the problem and the answer, okay. 00:18:12.300 |
And then use that as a training data to fine-tune your model. 00:18:15.900 |
And then you can test the model with a new question. 00:18:22.300 |
You probably know why I particularly chose this problem here. 00:18:27.740 |
Because on social media, many people believe that it's a good question to test 00:18:46.860 |
Once you train the model, you can apply it anywhere, right. 00:18:50.780 |
And if that could solve reasoning, my talk would be done here, right. 00:18:57.980 |
Just collect more examples from those brilliant minds at Stanford, right. 00:19:08.540 |
And I realized this issue in the summer of 2021, 00:19:23.260 |
when we kept getting more data to train the model to see how it worked. 00:19:29.660 |
The lesson here is, you know, don't scale blindly. 00:19:37.420 |
Once the paradigm is wrong, no matter how you scale, it doesn't work. 00:19:49.500 |
So, how do we fix this generalization failure of SFT? Let's look at the SFT procedure here, right. 00:20:13.420 |
So, if you don't know that before, you'll be surprised, right. 00:20:17.180 |
Human annotations can be wrong; and, well, that's how Scale AI makes money, right. 00:20:23.740 |
And, actually, one of my team members invented RL fine-tuning. 00:20:33.340 |
Actually, he told me that responses generated by machines could be even better for training than human data. 00:20:45.740 |
I was really surprised at the very beginning, yeah. 00:20:55.580 |
Okay, instead of collecting data from humans, we can just let a model generate data. 00:21:03.020 |
So, collect a set of problems, and then let your model generate step-by-step solutions. 00:21:12.380 |
And then, again, maximize the likelihood of correct answers. 00:21:19.980 |
So, like math problems, you may know the final answer, right. 00:21:25.340 |
You know the ground truth answer, but you don't have step-by-step solutions. 00:21:30.460 |
Okay, let a model generate step-by-step solutions. 00:21:34.300 |
And then you can use the true answer to decide which responses to use. 00:21:41.580 |
If the solution leads to the correct answer, then choose it; otherwise, reject it. 00:21:49.980 |
And then you can use this dataset to fine-tune your model, okay. 00:21:56.620 |
Exactly as you would have done in SFT; the only difference is that the data comes from your model. 00:22:04.220 |
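A minimal sketch of this self-training loop (sample_solution, final_answer, and fine_tune are hypothetical stand-ins for your sampling call, answer parser, and trainer):

```python
def build_self_training_set(model, problems, ground_truth, k=8):
    # For each problem, sample k step-by-step solutions from the model
    # and keep only those whose final answer matches the known ground
    # truth (rejection sampling); the rest are discarded.
    dataset = []
    for problem in problems:
        for _ in range(k):
            solution = sample_solution(model, problem)    # hypothetical
            if final_answer(solution) == ground_truth[problem]:
                dataset.append((problem, solution))
    return dataset

# Fine-tune exactly as in SFT, but on model-generated data; then
# optionally iterate, regenerating data with the improved model.
# model = fine_tune(model, build_self_training_set(model, problems, gt))
```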
And this approach actually was proposed by Eric Zelikman, Tony Wu, and their collaborators. 00:22:23.180 |
Yeah, the STaR approach; it's a very amazing paper. 00:22:27.340 |
Actually, in the STaR paper, when they proposed the approach, they considered using it to save labeling cost. 00:22:42.380 |
But in the current days, we understand this approach from different perspectives. 00:22:54.060 |
Okay, once the responses, the training data, are generated by the model, the model can be self-improved, right? 00:23:05.980 |
And after the model is improved, we can collect data again. 00:23:11.980 |
And then this approach is just the same as the RL fine-tuning approach of the current days. 00:23:30.940 |
I put a paper here; I think it's a paper by researchers at ByteDance, published in January 2024. 00:23:43.820 |
I think this is the earliest academic publication I have noticed about RL fine-tuning. 00:23:54.220 |
Even the paper title is called Reasoning with Reinforced Fine-Tuning. 00:24:00.620 |
After OpenAI o1 got popular, everyone began to recognize RL fine-tuning in public. 00:24:21.580 |
I believe multiple institutions independently discovered this idea, such a simple idea, yeah. 00:24:42.460 |
So, of course, after seeing this RL fine-tuning process, 00:24:49.180 |
you can see that we need a verifier in this training loop; the verifier can tell us which response is correct. 00:25:02.140 |
Because we know the final answer, we just need to use that to select the step-by-step reasoning path. 00:25:09.820 |
So, a reliable verifier is the most crucial part of RL fine-tuning. 00:25:17.900 |
I know that currently so many people talk about different RL algorithms, 00:25:22.140 |
and tons of variants of PPO, or REINFORCE, you know. 00:25:29.580 |
If anyone found some algorithms are significantly better than another one, 00:25:37.420 |
please let me know, probably I missed something. 00:25:41.580 |
I really like what Richard Sutton said here. 00:25:46.700 |
It's from an article Richard Sutton wrote in 2001. 00:25:56.860 |
OK, now a very interesting question is: why use data generated by the model, instead of data from humans? 00:26:12.380 |
It's not about saving cost, it's about performance. 00:26:25.260 |
Is it the consistency of the chain-of-thought structure, versus the variation in human writing? 00:26:37.100 |
The distribution is closer to what the model generates, so it's easier to train the model. 00:26:57.020 |
So, yeah, this is related to the first principles of machine learning. 00:27:04.940 |
I don't know if anyone still remembers some machine learning stuff here. 00:27:11.420 |
Of course, you guys should remember that, yeah. 00:27:17.340 |
So, if we want to build a model for reasoning, right? 00:27:21.020 |
Or just in general, about generating something interesting, right? 00:27:26.460 |
We need to optimize the metric of measuring generation quality. 00:27:33.020 |
Those metrics could be very different, right? 00:27:40.220 |
we would care about the correctness, if the answer is correct or not. 00:27:45.500 |
For machine translation, you would optimize the BLEU score. 00:27:49.660 |
Or just about a metric to measure the quality of the generations, OK. 00:27:55.900 |
Once you have a metric, all we need is to compute gradients of the metric 00:28:06.060 |
So, mathematically, we can write this formula, right? 00:28:13.900 |
So, we need a function R to measure the response quality, 00:28:18.700 |
given the problem, and also your model parameter, theta. 00:28:25.980 |
Of course, you can see R as a reward, or R as your accuracy metric, 00:28:46.700 |
and we need to maximize the expected value of the metric. 00:28:59.020 |
We need to do sampling to compute the expectation. 00:29:08.380 |
There's no magic, if you understand all the mathematical principles here. 00:29:13.420 |
I know some people would like to talk about something in a more magical way. 00:29:18.380 |
So, for example, how to incentivize your model to think, incentivize your model to reason. 00:29:28.460 |
Define your metric, compute gradient, and do back propagation. 00:29:47.580 |
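In symbols, this is just the standard policy-gradient setup (a sketch, with x the problem, y a sampled response, and R the quality metric):

```latex
\max_{\theta} \; J(\theta) =
  \mathbb{E}_{y \sim p_{\theta}(\cdot \mid x)}\left[ R(x, y) \right],
\qquad
\nabla_{\theta} J(\theta) =
  \mathbb{E}_{y \sim p_{\theta}(\cdot \mid x)}\left[ R(x, y)\,
  \nabla_{\theta} \log p_{\theta}(y \mid x) \right],
```

where the expectation, and hence the gradient, is estimated by sampling responses y from the model.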
Once you find your paradigm works well, you need to scale your approach. 00:30:04.380 |
And the interesting thing is that, for this RL fine-tuning approach, 00:30:08.380 |
we scale the output length, that is, the length of the CoT. 00:30:25.980 |
the model can solve nearly every computable problem. 00:30:35.900 |
You just need a minimal, constant-size transformer model. 00:30:53.980 |
If you look at the literature, people usually talk about scaling the model size. 00:30:58.700 |
Here, instead, we scale the length of the chain of thought. 00:32:21.940 |
So, actually, I used that quote to say the theory is still useful. 00:32:28.940 |
Actually, I want to give an example here about why 00:32:32.940 |
LLM reasoning is so different from classical AI. 00:32:38.940 |
In December 2024, Google released a model called 00:32:47.940 |
So, of course, 2.5 Pro is much more powerful, okay. 00:32:55.940 |
So, in December 2024, after the model was released, 00:32:58.940 |
I tried a math problem, just to ensure this problem was new to the model, 00:33:06.940 |
because I used the number 2025, for the next year. 00:33:18.940 |
And using each number once and the primary operations, plus, 00:33:28.940 |
you could write a program to do an exhaustive search and get the results, right. 00:33:31.940 |
Let's look at the thinking process in the panel here. 00:33:38.940 |
Actually, for Gemini models, you can check the thinking process. 00:33:48.940 |
Let's see how the model did the thinking, right. 00:34:00.940 |
Suggesting multiplication will be heavily involved. 00:34:11.940 |
And I even see: okay, it's worth noting that 2025 is 45 squared. 00:34:23.940 |
Actually, when I made this question, even I didn't realize that. 00:34:36.940 |
Then it started thinking about how to get large intermediate products, 00:34:43.940 |
and see, blah, blah: let's aim for products that get us closer. 00:34:57.940 |
That's what we get from the long CoT in RL fine-tuning. 00:35:04.940 |
After thinking, the model showed the final answer, right. 00:35:12.940 |
Okay, the first part: 10 times 4 plus 5 equals 40 plus 5, which equals 45. 00:35:21.940 |
And the second part is also, again, 45; and then 45 times 45 equals 2025. 00:35:37.940 |
I don't know if anyone has read another paper related to chain-of-thought prompting. 00:35:46.940 |
In that paper, actually, there's a very interesting example. 00:35:54.940 |
In tree-of-thoughts prompting, they combined search with prompting. 00:36:04.940 |
Here, the model can solve the Game of 24 just by natural language. 00:36:13.940 |
And again, I would like to cite Richard Sutton here. 00:36:28.940 |
"Building in our discoveries only makes it harder to see how the discovering process can be done." 00:36:34.940 |
That's, I think, the bitter lesson Richard Sutton drew after he joined Google DeepMind. 00:36:46.940 |
And he saw the success of AlphaGo and AlphaZero. 00:36:50.940 |
And he said, okay, only two processes are really scalable: learning and search. 00:36:58.940 |
But here, I would like to emphasize only one thing: learning. 00:37:14.940 |
And the big advantage is that it generalizes so well; but only for automatically verifiable tasks. 00:37:32.940 |
There's no way to put a human in the loop there. 00:37:35.940 |
And of course, not all tasks are automatically verifiable. 00:38:00.940 |
That's the big restrictions for RL fine tuning at this point. 00:38:08.940 |
I know so many people are really interested in creating RL algorithms to improve the approach. 00:38:15.940 |
I really want to see us spend more time thinking about, you know, how to solve those non-verifiable tasks. 00:38:26.940 |
Many real problems are actually non-verifiable, like creative writing, even like coding. 00:38:32.940 |
I know some people say, okay, coding will be solved by AI in a few years. 00:38:39.940 |
And I think it will be very challenging to be solved. 00:38:44.940 |
Actually, when they talk about programming, they only talk about competitive programming. 00:38:50.940 |
Competitive programming is not like our daily programming work, right? 00:38:55.940 |
When we write code, we care about the design, the readability, right? 00:39:12.940 |
I have talked about, you know, a lot of ideas. 00:39:15.940 |
Actually, at the very beginning, I talked about CoT decoding, okay? 00:39:18.940 |
Actually, the reasoning path is already in the output space. 00:39:24.940 |
We reshape the output distribution such that greedy decoding can find it, okay? 00:39:29.940 |
And then I talked about chain-of-thought prompting, or "let's think step by step", which can reshape the output distribution. 00:39:43.940 |
But we still have a chance to improve those processes. 00:39:50.940 |
Basically, I want to talk about two key ideas. 00:40:01.940 |
And we have seen that LLM reasoning is really powerful, right? 00:40:06.940 |
But there's a decoding issue in the paradigm of generating reasoning tokens and then the final answer. 00:40:17.940 |
Given the problem, we generate intermediate tokens, and then find an answer. 00:40:29.940 |
It's the design of the model: the model is just designed to predict the next token. 00:40:35.940 |
The challenge is the way it predicts the next token. 00:40:39.940 |
That's what creates a situation where the outcome will not be aligned with the expected outcome. 00:40:50.940 |
Right, the model is originally designed just for predicting next tokens, yeah. 00:40:57.940 |
We need to always keep in mind here that LLMs are probabilistic models. 00:41:13.940 |
Let's think about what an LLM does in decoding, right? 00:41:17.940 |
Given the problem, and it generates reasoning, and then find an answer. 00:41:22.940 |
And then the response is found by greedy decoding. 00:41:34.940 |
However, for us, right, we need to argmax over the final answer: 00:41:41.940 |
Choose the answer with the maximum probability, right? 00:41:55.940 |
It's just simple high-school conditional probability math here. 00:42:00.940 |
But it's really useful for us to understand the decoding process. 00:42:12.940 |
If we generate reasoning paths, we should sum over all reasoning paths to get the probability of each final answer. 00:42:24.940 |
In terms of machine learning, it's called marginalization. 00:42:27.940 |
Just sum over all of them, because the reasoning paths are essentially just latent variables. 00:42:33.940 |
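Written out, with x the problem, r a reasoning path, and a the final answer:

```latex
P_{\theta}(a \mid x) \;=\; \sum_{r} P_{\theta}(r, a \mid x),
\qquad
\hat{a} \;=\; \arg\max_{a} \sum_{r} P_{\theta}(r, a \mid x).
```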
And, of course, if we've studied machine learning, then we know this. 00:42:45.940 |
And once you get this idea, then you see, okay, that's exactly the motivation 00:42:57.940 |
underlying another prompting approach, called self-consistency. 00:43:02.940 |
So, generate multiple responses by random sampling. 00:43:06.940 |
And then choose the answer that appears most frequently. 00:43:20.940 |
For this math problem, you know, you could sample the response many times. 00:43:26.940 |
For the first response, you would get, let's say, $18. 00:43:45.940 |
So that's exactly the process implementing marginalization in probability. 00:43:58.940 |
We don't look at the reasoning paths; we only choose the most frequent answer. 00:44:10.940 |
So that's called marginalization empirically. 00:44:17.940 |
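A minimal sketch of self-consistency (sample_response and extract_answer are hypothetical stand-ins for your sampling call and answer parser):

```python
from collections import Counter

def self_consistency(model, problem, n=20, temperature=0.7):
    # Sample n reasoning paths, extract only the final answer from
    # each, and majority-vote: an empirical marginalization over the
    # latent reasoning paths.
    answers = []
    for _ in range(n):
        response = sample_response(model, problem,
                                   temperature=temperature)  # hypothetical
        answers.append(extract_answer(response))             # hypothetical
    return Counter(answers).most_common(1)[0][0]
```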
And if you apply this approach, you can see a huge improvement. 00:44:23.940 |
I know, in the current days, you may think, okay, if you want to get a huge improvement, 00:44:28.940 |
you probably need to spend a lot of time building sophisticated math formulations. 00:44:37.940 |
So for GSM8K problems, we can see that, right? 00:44:47.940 |
Even fine-tuned GPT-3 models actually got only 33%. 00:44:52.940 |
And then OpenAI used a verifier to actually get 55%. 00:44:58.940 |
That's a much better performance, with the verifier. 00:45:07.940 |
However, the most surprising thing is that after applying self-consistency, 00:45:19.940 |
using PaLM 2, we could even get an accuracy of 92%. 00:45:30.940 |
And of course, one may say, okay, yeah, that's for PaLM models, 00:45:33.940 |
you know, models from several years ago. 00:45:38.940 |
But in the current days, every year is just like one decade. 00:46:00.940 |
And actually, they also showed the results by aggregation. 00:46:08.940 |
And then we still see a great improvement from aggregation, or self-consistency. 00:46:16.940 |
Of course, self-consistency is more expensive: 00:46:33.940 |
self-consistency uses more samples, so it will cost more. 00:46:40.940 |
And people see that as a kind of inference-time scaling. 00:46:46.940 |
There are so many ways of inference-time scaling. 00:46:55.940 |
So actually, when some people told me about inference-time scaling earlier, I didn't get it, 00:47:03.940 |
unless they could clearly say what is being scaled. 00:47:08.940 |
And self-consistency is definitely a way to scale up. 00:47:13.940 |
And also, self-consistency is naturally self-calibrated: 00:47:28.940 |
Higher consistency indicates higher accuracy. 00:47:35.940 |
Actually, when the self-consistency is more than 80%, 00:47:42.940 |
So I know some people care about uncertainty or confidence in prediction. 00:47:50.940 |
And you can just simply try sampling multiple times. 00:48:02.940 |
So, to make sure everyone really got the key ideas of self-consistency, 00:48:07.940 |
let me ask a couple of questions; I hope you find a lot of fun using this simple idea. 00:48:13.940 |
So the first question is: okay, when the LLM outputs a direct answer without intermediate steps, 00:48:22.940 |
will you still sample several times and then choose the most common answer? 00:48:36.940 |
If the model just directly generates a final answer, what do we do? 00:48:43.940 |
In this case, you can just read off the probabilities directly. 00:48:53.940 |
Just like exactly what we did in the classical machine learning, right? 00:48:57.940 |
We would use logistic regression to get P(y|x). 00:49:02.940 |
We just need to take the answer with the maximum probability there. 00:49:05.940 |
That's why you don't see self-consistency in the classical machine learning literature. 00:49:21.940 |
Only after we have reasoning do we need self-consistency here. 00:49:26.940 |
And the second question is: can we change self-consistency by letting the LLM generate multiple responses, instead of sampling multiple times, and then choosing the most common answer? 00:49:42.940 |
That is, you just tell the model to generate five answers, instead of sampling five times. 00:49:53.940 |
So actually, you can try that; but again, you know, for everything, we just need to follow the machine learning principle. 00:50:03.940 |
Actually, this principle is called max marginal inference. 00:50:12.940 |
You just need to choose the final answer with the maximum probability. 00:50:17.940 |
You don't have to think about any fancy things about LLMs. 00:50:28.940 |
And, naturally, of course, self-consistency has a problem: 00:50:34.940 |
it checks the frequency of unique answers. 00:50:37.940 |
And for general problems, it's hard to expect the answer to be a single number or string. 00:50:46.940 |
And for example, for this problem, you will see all the answers are different. 00:50:54.940 |
In this case, we have a generalized version of self-consistency. 00:51:01.940 |
And for this problem here, you can see the second response is the most common one, 00:51:08.940 |
because all these three countries appear in all the other answers, right? 00:51:15.940 |
And we just need to let the LLM choose the most consistent response. 00:51:24.940 |
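A minimal sketch of that generalized selection step (scored here by crude word overlap, for illustration; the idea in the talk is to let the LLM itself judge which response is most consistent with the others):

```python
def most_consistent(responses):
    # Return the response that agrees most with all the others,
    # using Jaccard word overlap as a crude stand-in for an
    # LLM-based consistency judgment.
    def overlap(a, b):
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / max(1, len(wa | wb))
    return max(responses,
               key=lambda r: sum(overlap(r, o)
                                 for o in responses if o is not r))
```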
I've talked about how to use aggregation to improve reasoning. 00:51:34.940 |
So I know there's a lot of debate about LLM reasoning. 00:51:38.940 |
People say, okay, LLMs may just do retrieval instead of reasoning. 00:51:44.940 |
So I know many people saw that debate on social media. 00:51:52.940 |
Actually, to me, it's always hard to differentiate retrieval and reasoning. 00:51:59.940 |
And I serve as a senior area chair for the conferences almost every year. 00:52:06.940 |
And we always have to talk about the novelty of each paper. 00:52:10.940 |
And actually, it's similar to the retrieval-versus-reasoning debate, right? 00:52:21.940 |
I saw an experiment trying different models to run in parallel. 00:52:25.940 |
And then, running in parallel, they literally, like, run GPT-4 00:52:30.940 |
concurrently with running Gemini 2.5, 00:52:35.940 |
like, all different models in parallel for the same question. 00:52:38.940 |
And then at the end, just having, like, a classifier 00:52:41.940 |
to find the most consistent results. 00:52:49.940 |
If you generate responses from different models, that would be more like a model 00:52:56.940 |
ensembling approach: many models, with combined results. 00:53:02.940 |
The mathematical principle is not exactly the same as self-consistency. 00:53:12.940 |
Actually, again, I'm not interested in the debate about retrieval versus reasoning. 00:53:20.940 |
And for people working on applications, well, I work in industry, 00:53:26.940 |
so to me, you know, it's just retrieval plus reasoning; forget the debate, right? 00:53:32.940 |
So in 2024, we had a paper about analogical reasoning. 00:53:43.940 |
So I can just use this small example to show why retrieval is important in reasoning. 00:53:49.940 |
So for this problem, see, what's the area of the square? 00:54:06.940 |
We recall a related problem and then solve this one. 00:54:10.940 |
So at that moment, I tried GPT-3.5 and also our own model, and they failed to solve this problem. 00:54:20.940 |
After adding the prompt of recalling related problems, 00:54:32.940 |
the model did find a related problem. 00:54:41.940 |
Related problem doesn't mean the same problem. 00:54:46.940 |
You can see the related problem here is finding the distance between two points on a coordinate plane. 00:54:55.940 |
And then the model says: oh yeah, now I know how to compute the distance, and then how to compute the area. 00:55:01.940 |
It's just a small case to show how retrieval is important in reasoning. 00:55:07.940 |
Here's another example, called step-back, for physics problems. 00:55:16.940 |
And before solving this problem, we just prompt the model: 00:55:21.940 |
we give a few-shot example to show the model that, 00:55:25.940 |
before solving this problem, it can take a step back to consider a more abstract problem. 00:55:48.940 |
Deep research is exactly the same idea here, right? 00:55:55.940 |
So we have Gemini Deep Research and also OpenAI's deep research. 00:56:00.940 |
And one of OpenAI's deep research leads was my intern. 00:56:07.940 |
After his PhD, he joined OpenAI, and he invented deep research. 00:56:13.940 |
And you can see how deep research works: it can find similar problems or knowledge to solve the task. 00:56:34.940 |
Actually, you know, forget about the debate over whether LLMs can reason or not. 00:56:39.940 |
For LLMs, reasoning is always better than no reasoning. 00:56:47.940 |
Aggregating multiple answers is better than one answer. 00:56:53.940 |
And retrieval plus reasoning is better than reasoning only. 00:57:05.940 |
And for the next breakthroughs, you know, I really want to see, okay, how to solve those non-verifiable tasks. 00:57:16.940 |
And I also want to see how people build real applications, instead of just solving benchmarks. 00:57:25.940 |
I think all benchmarks will be saturated soon. 00:57:28.940 |
And I know, you know, all you guys are very passionate about AGI or building LLMs. 00:57:43.940 |
"The truth always turns out to be simpler than you thought." 00:57:48.940 |
And I think that's particularly true for LLM research. 00:57:54.940 |
And I saw so many academic papers that always try to complicate things. 00:57:59.940 |
So that's why I made this talk as simple as possible. 00:58:06.940 |
Thanks, Denny, for the very insightful as well as interesting talk. 00:58:23.940 |
We have some questions online from Slido and Zoom, but also in person. 00:58:27.940 |
So we can maybe start with some in-person questions. 00:58:35.940 |
So earlier on in the lecture, you talked about confidence. 00:58:39.940 |
And like a common way to do this is like just taking the average log probabilities of output token sequences. 00:58:47.940 |
So like my question is, do you think there are better ways to do this? 00:58:50.940 |
And also, is this a good indicator for hallucinations? 00:58:54.940 |
Oh, for the first slide, when I talked about confidence: there's no aggregation, just the probability of the next-token prediction. 00:59:03.940 |
Just a conditional probability for the generation. 00:59:06.940 |
You can just look at the log probs from the model, and you can see the probability. 00:59:16.940 |
And like do you think this is a good indicator for hallucinations? 00:59:26.940 |
And we can see that after a reasoning path, there's a huge jump in confidence for the final answer. 00:59:36.940 |
Earlier you mentioned that, for example, Richard Sutton said that it's scaling learning and search, 00:59:52.940 |
and your opinion is more like scaling learning is all you need. 00:59:57.940 |
I'd just like to expand more on that and why you believe that search is not as necessary. 01:00:07.940 |
And actually, okay, so actually I should make it more concrete. 01:00:13.940 |
When you build models, you don't have to keep search in mind. 01:00:17.940 |
But after the model is built, you can use search as a tool. 01:00:27.940 |
You can just integrate symbolic search with the model. 01:00:33.940 |
So, but for reasoning research, I just care about the fundamental abilities. 01:00:40.940 |
For example, if we want to solve this problem, the model could be motivated to write a Python program 01:00:50.940 |
But for the reasoning process, we don't need to search. 01:01:01.940 |
That's why, if you use search to solve those problems, you can get a higher accuracy. 01:01:15.940 |
You mentioned in the case where there's no reasoning that it's not necessary to sample 01:01:21.940 |
But wouldn't sampling converge on a different distribution in the case, for example, where 01:01:26.940 |
the most likely next token leads to a diffuse distribution for the following token and the 01:01:32.940 |
Whereas if you were to sample and a less likely token were to lead to a sharper distribution, 01:01:37.940 |
you could actually have a more likely path of tokens there. 01:01:40.940 |
So, wouldn't these two methods fundamentally differ? 01:01:48.940 |
The problem is, actually, we still don't know how the distributions are reshaped during the training process. 01:01:57.940 |
So, to me, it's very hard to answer this question. 01:02:00.940 |
But we still don't have a good explanation of how those distributions are reshaped for these models. 01:02:16.940 |
So, how do we differentiate reasoning and answer? 01:02:21.940 |
Like, do we need to extract that number from the tokens, from the final strings, the output 01:02:33.940 |
Then how do we differentiate the reasoning and the answer? 01:02:39.940 |
If the answer is a program, it will be harder to extract. 01:02:50.940 |
So, when people use RL fine-tuning, that's why you just see those guys talking about 01:02:54.940 |
math problems or competitive programming problems. 01:03:00.940 |
So, I think for the general case, you have to write a very careful parser for the final answer. 01:03:13.940 |
And also, what if the problem is very challenging, such that the lower-confidence answer is actually the correct one? 01:03:25.940 |
Then how can I use the self-consistency better? 01:03:37.940 |
So, considering the, you know, conversations that AGI is coming, like, from two 01:03:42.940 |
to five years from now, basically, if it's true, then let's say: 01:03:51.940 |
What skills do you, you know, develop in kids to give them a shot to survive in the future that 01:04:08.940 |
Like, there are lots of conversations in the AI community that's giving like the timeline 01:04:16.940 |
I was at ICLR last year; there was a workshop. 01:04:26.940 |
And I remember one audience member asked me a question during the panel. 01:04:32.940 |
And he said, okay, AI is moving so fast, right, you know? 01:04:39.940 |
And what would be the most scary thing in the, in the future, in the next few years? 01:04:48.940 |
And yeah, I remember some people did talk about the, the risk of AI. 01:04:54.940 |
But my answer is: to me, the most scary thing is, yeah, the winter comes back. 01:05:08.940 |
Actually, I see many restrictions of the current approach. 01:05:11.940 |
So, actually, I know many people like the chatbots, the LLM sort of things. 01:05:19.940 |
I actually really want to see real killer applications from this kind of AI research. 01:05:28.940 |
I don't know if anyone really needs this AI stuff, or if it's just for fun. 01:05:34.940 |
I know, actually, the AI models are really good for programming. 01:05:53.940 |
But thanks, everybody, for your great questions. 01:05:57.940 |
And thanks again to Denny for the great talk.