[Paper Club] 🍓 On Reasoning: Q-STaR and Friends!
I was surprised to see that it's actually basically just one guy's work. 00:00:23.200 |
And this is his website if you want to go see it. 00:00:35.160 |
Mostly just because they have star in the name. 00:00:36.960 |
But also they seem to be most mentioned by the people that were throwing around all these 00:00:46.080 |
I believe, Eugene Yan, you've already covered this before. 00:00:49.520 |
But I think this is the most foundational and oldest. 00:00:54.720 |
And then I like V-STaR second, and Q-STaR the least. 00:00:59.880 |
So the general idea of STAR is that we have this bootstrapping cycle of creating a rationale 00:01:10.320 |
So when a question is asked of a language model, it is trained to think of a rationale first. 00:01:19.560 |
So it's basically a form of chain of thought: before you just spout the answer, you reason through a rationale. 00:01:26.960 |
So it's very related to the chain of thought literature. 00:01:33.840 |
And the interesting thing is that they establish a positive loop where 00:01:39.440 |
the rationales -- you know, they generate a bunch of candidate rationales, basically. 00:01:44.160 |
And the rationales that lead to correct answers are viewed to be better rationales than the ones that don't. 00:01:51.480 |
And that leads to a fine-tune of the language model. 00:01:56.320 |
There's a second loop, where if the rationale leads to a wrong answer, it can be rationalized with a hint. 00:02:03.040 |
So these two words are pretty similar, rationale and rationalization. 00:02:07.360 |
They're two different words as far as the paper is concerned. 00:02:10.840 |
And the rationalization, if it does lead to a correct answer, then it also gets fed back into the fine-tuning set. 00:02:15.340 |
And that signal from the wrongly answered questions is captured a little bit. 00:02:19.360 |
We'll see later that there are actually ways to do this better than the original STaR paper did. 00:02:27.360 |
Propose a bootstrapping mechanism to iteratively generate a rationale dataset from a few initial 00:02:32.120 |
examples with rationales without needing to check new rationales' correctness. 00:02:37.160 |
These kinds of ideas, where you can bootstrap from a small set of data and you don't need to check every new rationale's correctness, are really powerful. 00:02:46.240 |
Because that enables you to scale pretty massively. 00:02:49.840 |
We complement rationale generation with rationalization, where a model is tasked with justifying an 00:02:56.720 |
answer and fine-tuned as if it had come up with the rationale without any hint. 00:03:05.280 |
Because when I was doing the livestream, people were asking me, what is this? 00:03:09.120 |
Is rationalization the same thing as rationales? 00:03:17.400 |
So it's very common for language models to get kind of stuck in cycles where they just keep failing the same problems and get no new training signal. 00:03:29.240 |
So to overcome this issue, they propose rationalization. 00:03:32.640 |
For each problem that the model fails to answer correctly, we generate a new rationale by providing the model with the correct answer as a hint. 00:03:45.600 |
And here we're doing only the positive fine-tuning. 00:03:48.480 |
The next thing we do is we give a hint by giving it the answer and then backward-rationalizing 00:03:53.280 |
what rationale would have led to the right answer, and then throwing that into the training data as well. 00:04:01.720 |
Rationalization accelerates and improves the bootstrapping process. 00:04:03.960 |
So we have a little chart here showing STaR without rationalization on an addition 00:04:14.120 |
problem of n digits, and showing that with rationalization, it actually trains much faster. 00:04:30.920 |
I copied this out mostly because this is basically very nice pseudocode. 00:04:34.440 |
I don't like the way it's presented, but I think for formalism, this makes sense to include. 00:04:42.580 |
I don't really have any other comments on this, apart from -- I think when I originally 00:04:51.640 |
read the paper, when it was presented like this, first we had the 00:04:56.840 |
positive bootstrapping, then we had the negative rationalization, and so on. 00:05:03.720 |
It seemed like the first loop is we'll fine-tune on correct answers, and the second loop is we'll fine-tune on the rationalized wrong answers. 00:05:13.400 |
But the algorithm that they present in pseudocode does everything in one loop. 00:05:18.680 |
So there's one code path: you generate rationales, maybe you rationalize, you filter for correctness, and you fine-tune. 00:05:30.160 |
So they're performing both loops in one pass, which seems very efficient. 00:05:37.960 |
It's not how exactly I would do it, but this is probably more efficient. 00:05:52.480 |
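To make the "both loops in one pass" point concrete, here is a minimal Python sketch of how I read the STaR outer loop; generate_rationale and finetune are hypothetical placeholders for the prompting and training steps, not the paper's actual code.

```python
# Minimal sketch of the STaR outer loop as described above.
def star(base_model, dataset, generate_rationale, finetune, n_iterations=10):
    model = base_model
    for _ in range(n_iterations):
        finetune_set = []
        for question, answer in dataset:
            # First path: try to reason to the answer with no hint.
            rationale, predicted = generate_rationale(model, question, hint=None)
            if predicted == answer:
                finetune_set.append((question, rationale, answer))
                continue
            # Second path (rationalization): give the answer as a hint and keep
            # the backward-justified rationale only if it now reaches the answer.
            rationale, predicted = generate_rationale(model, question, hint=answer)
            if predicted == answer:
                finetune_set.append((question, rationale, answer))
        # Each iteration fine-tunes from the original base model on the filtered
        # data (as in the paper), rather than from the last checkpoint.
        model = finetune(base_model, finetune_set)
    return model
```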
They had a few datasets: arithmetic, CommonsenseQA, and GSM8K. 00:06:03.160 |
I don't think -- I wish the descriptions were better. 00:06:06.640 |
They referenced Jason Wei's paper on chain of thought, but they didn't actually show 00:06:12.360 |
the kind of few-shot examples that they were doing. 00:06:15.840 |
So in some sense, this is a badly written paper in that it is going to be very hard 00:06:19.640 |
to reproduce, because they did not show the full sample of what they actually prompted with. 00:06:27.160 |
They did have some transparency on some of the questions. 00:06:29.680 |
So this is fun, because there's some audience participation points. 00:06:37.160 |
I'll just ask you the question without showing the answer. 00:06:41.280 |
So here's the task that demonstrates the value of Q* and the subtle nuances of what you're 00:06:51.880 |
being asked to do that you might take for granted. 00:06:55.720 |
So here's a multiple-choice question, and then three candidate rationales for the answer. 00:07:00.480 |
So I want -- whoever's listening or watching -- oh, Jimmy, you're here. 00:07:11.240 |
So on their hike, they brought a filtering straw. 00:07:16.160 |
Answer choices -- make sick, doctor, water, stream, and mouth. 00:07:23.880 |
If they brought a filtering straw, they were worried about germs in the water. 00:07:28.480 |
Now the question is how to teach the machine to reason their way into understanding that 00:07:33.240 |
water is the right choice, and you don't want to just give the right answer -- you don't 00:07:38.200 |
want to just get the right answer with the wrong rationale. 00:07:40.760 |
You want to have the right rationale as well. 00:07:42.840 |
So for example, answer A, the answer must be something that can filter out germs. 00:07:47.640 |
Filtering straws are used to filter out germs, therefore the answer is filtering straw C. 00:07:53.280 |
It's like they got the right answer, which is C. C is the right answer, but it's the 00:07:59.800 |
Because when you say therefore the answer is filtering straw, that is the wrong reason. 00:08:05.520 |
B, the answer must be something that would cause someone to bring a filtering straw on a hike. 00:08:10.840 |
Filtering straws are used to filter water, therefore the answer is C. This is a good rationale. 00:08:19.200 |
And the third: a straw is something you use to drink water, therefore the answer is water, C, right? 00:08:27.720 |
So Eric says -- Cosmin, yeah, the slides are in the Discord. 00:08:37.400 |
I think the more intuitive thing -- this is a very classic NLP entailment thing. 00:08:43.880 |
The more likely answer is water, because there's no reason to assume stream, which is more specific than water. 00:08:48.720 |
So when in doubt, pick the more generally probable answer. 00:08:55.140 |
So anyway, the human-rated task is what is the best actual answer that you want if you're 00:09:01.160 |
trying to train for a dataset that has a reasoning trace, right? 00:09:09.440 |
Because A3 jumps straight to the answer, right, and A1 has the right answer but 00:09:16.080 |
for the wrong reasons, or like it has flawed reasoning. 00:09:21.220 |
So this STaR paper actually used human raters to choose between answers that were correct. 00:09:28.200 |
And I think that's an unusual way to use human raters. 00:09:31.960 |
Usually you use human raters to choose correct answers from wrong answers. 00:09:38.200 |
But here the human raters are being asked to evaluate reasoning and the quality of reasoning. 00:09:47.000 |
Eugene Chow says, "Is there an issue of using all three?" 00:09:53.320 |
In the context of training, why can't we just train all three? 00:09:56.960 |
As long as they are good enough, is there a problem? 00:10:03.560 |
Because A1 has faulty reasoning, A3 has not enough reasoning, right? 00:10:13.880 |
Like, the logical flow of A2 cannot be faulted -- super basic, probably too verbose, but you cannot fault it. 00:10:22.560 |
So if you're going to fine-tune a reasoning model, this A2 is the kind of dataset that you want. 00:10:28.880 |
And the STaR authors employed human raters to find this. 00:10:40.400 |
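For concreteness, here is a hedged sketch of the kind of training record this implies, built from the filtering-straw example and the A2-style rationale above; the field names and prompt template are my own illustration, not the paper's exact format.

```python
# Illustrative only: one rationale-augmented training record in the spirit of A2.
record = {
    "question": (
        "On their hike they brought a filtering straw; they were worried "
        "about germs in the what? "
        "Choices: (a) make sick (b) doctor (c) water (d) stream (e) mouth"
    ),
    "rationale": (
        "The answer must be something that would cause someone to bring a "
        "filtering straw on a hike. Filtering straws are used to filter water."
    ),
    "answer": "(c) water",
}

# Concatenate question, rationale, and answer into one fine-tuning string.
training_text = (
    f"Q: {record['question']}\n"
    f"A: {record['rationale']} Therefore, the answer is {record['answer']}."
)
print(training_text)
```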
I'll give you a little bit more on the details here. 00:10:44.240 |
But when the human raters were given this, they were all randomized. 00:10:50.260 |
So imagine just going through and picking A1, A2, A3, A1, A2, A3, A1, A2, A3, for question after question. 00:11:04.880 |
The human always would have fun making up questions for the AI overlords. 00:11:11.840 |
Answer choices, do enjoy, eat cake, enjoy living, get laid, enjoyable. 00:11:17.120 |
I think it's worthwhile going through these kinds of questions to, you know -- key catchphrase -- look at your data. 00:11:23.960 |
When you look at your data, you really understand how inane and mind-numbing, but also nuanced, this work is. 00:11:35.920 |
The human always had fun making up questions for the AI overlords. 00:11:39.200 |
And this is also meta because it's making up questions for the AI overlords. 00:11:50.240 |
You know, I said that D is the answer that I wish would happen if I, you know, if I answer 00:11:57.040 |
But actually, the answer is E. That's actually, I think, the most grammatically correct answer. 00:12:01.920 |
So this is actually a grammar question rather than anything. 00:12:08.080 |
So again, A1, the answer must be something that human would enjoy doing, blah, blah, 00:12:15.720 |
So it's like they all got the right answer, but they all took different paths to get there, 00:12:24.880 |
And B, the answer must be something that the human found enjoyable, making enjoyable, blah, 00:12:29.680 |
So you can see, like, this is very laborious. 00:12:33.840 |
And at the end of this whole thing, the big reveal is where each of these chains of thought came from. 00:12:47.240 |
So in the presented results -- 00:12:50.200 |
the paper has a few dozen of these, by the way -- 00:12:53.120 |
the first answer is always GPT-J, un-fine-tuned. 00:12:56.640 |
The last answer is the human entry, a human answering it. 00:13:01.040 |
And you can see, like, the human reasoning is-- humans are really bad at showing rationales. 00:13:06.120 |
They always just jump straight to the answer, which is really funny. 00:13:08.720 |
I'll show you a counterexample at the end where this is the opposite. 00:13:13.480 |
And then B was the STaR answer: STaR, generally fine-tuned to show reasoning for any task. 00:13:26.800 |
I'll go for one more example in the reasoning domain. 00:13:37.220 |
I have to stress, like, we are so early in teaching language models to reason. 00:13:43.000 |
Like this is the pinnacle of reasoning, right? 00:13:45.380 |
This is, like, not fucking reasoning at all as far as my IQ test is concerned. 00:13:49.260 |
But as far as GPT-J 6B is concerned, they're good on this, I guess. 00:13:58.340 |
OK, here's a math reasoning using natural language, right? 00:14:02.300 |
Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. 00:14:07.060 |
How many clips did Natalia sell altogether in April and May? 00:14:14.860 |
Please feel free to think out loud and jump on the mic. 00:14:30.140 |
Betty is saving money for a new wallet, which costs $100. 00:14:33.740 |
Betty only has half the money that she needs. 00:14:35.740 |
Her parents decided to give her $15 for that purpose, and her grandparents twice as much as her parents. 00:14:39.900 |
How much more money does Betty need to buy the wallet? 00:14:42.580 |
Can someone do a live chain of thought while solving this? 00:14:53.900 |
She only has half of the money she needs, so she has $50. 00:14:58.620 |
Her parents decided to give her $15 for that purpose. 00:15:05.100 |
Her grandparents give her twice as much, so twice as much as $15 is $30. 00:15:11.860 |
And 65 plus 30 is 95, so she needs five more dollars to reach $100 for the wallet. 00:15:21.420 |
And both of you gave a little bit of chain of thought. 00:15:24.580 |
Would you be surprised that a language model can do that? 00:15:26.620 |
So these are the generated answers of Star, showing what Eugene said and what Eric said. 00:15:42.420 |
I actually would not have put it on the screen. 00:15:46.700 |
>> That's pretty good, actually, to do this level of arithmetic. 00:15:52.780 |
It is also just that flexible natural language understanding of whatever you throw at it. 00:16:01.820 |
And it was really just the fine-tuned rationalization, step-by-step, thinking step-by-step, but not 00:16:09.500 |
just the lazy kind of thinking step-by-step, like, what do I need to know first? 00:16:14.740 |
How to combine those pieces of information, how to copy, how to calculate? 00:16:20.500 |
And the paper has quite a few dozen examples of this going on. 00:16:30.300 |
>> They have an N here on how many times it's fine-tuned. 00:16:33.980 |
They don't specify the N. I looked for that number. 00:16:38.820 |
I think, like, max is, like, 20 or 30 iterations, not a ton of iterations. 00:16:43.660 |
But the N is a flexible hyperparameter that they used. 00:16:50.860 |
Their problems go all the way up to, like, eight steps, which is pretty impressive. 00:16:55.320 |
So that lets you generate a chart of human versus machine reasoning. 00:17:02.140 |
So Eugene and Eric, in answering those two questions, they produced two steps. 00:17:08.940 |
And then we can also compare against the model-produced number of steps, and there's a correlation 00:17:15.340 |
of 53% -- 53 to 57% -- in the sense that when you give it a GSM8K question, STaR tends to take a similar number of steps as a human would. 00:17:27.300 |
Obviously it could be a lot better than 53, but it's surprising that there's a general correlation at all. 00:17:35.540 |
And I think, basically, this is a way of understanding reasoning in a structured format, which I thought was valuable. 00:17:48.700 |
Because once you can do something like this, where I can say I can give you a measurably 00:17:53.940 |
harder problem -- because an eight-step problem is measurably harder than a two-step problem. 00:17:58.740 |
If I can give you a measurably harder problem and I can roughly grade the calculator on 00:18:03.220 |
its ability to get there, then I can improve it. 00:18:10.540 |
There is -- so I think I'm about to finish this paper. 00:18:14.620 |
There are some cases where the dataset was actually bad -- where GSM8K was bad. 00:18:23.700 |
Here's an example of a really stupidly confusing question. 00:18:28.380 |
A van is delivering 180 bottles of drinks to a neighborhood. 00:18:31.500 |
Each bottle contains either cider or beer or a mixture of two. 00:18:34.180 |
Out of the 180 bottles, 40 contain only cider, 80 contain only beer, and the rest a mixture of the two. 00:18:39.460 |
If the delivery man gives half the number of each bottle of drink to the first house, how many bottles does the first house get? 00:18:46.080 |
So there's this whole -- there's a lot of random context. 00:18:50.020 |
But actually it's asking you to divide 180 by 2. 00:19:01.900 |
The human didn't read all the way to the end. 00:19:10.480 |
>> So this is good out-of-domain generalization in the sense that -- we all know datasets have bad data in them. 00:19:19.900 |
It's good out-of-domain correction of bad data inside of the dataset. 00:19:27.960 |
Like, STaR understood the question better than the human, which is really, really interesting. 00:19:33.120 |
So I think the relevance here for O1 is that if we were to generate reasoning traces, we 00:19:40.780 |
would have to do work like this, where the rationale would have to be exposed into step-by-step 00:19:49.180 |
thinking and we would have to grade it in a way that makes sense, right? 00:20:02.500 |
I have a question about the -- you said it was a dataset of 50,000, did I hear that right 00:20:08.460 |
It was literally, like, 1+1, 2+2, 3+3, 11+11, 111+111, you know, stuff like that. 00:20:20.340 |
It's even shitty to call it synthetic data because there's no LLMs involved. 00:20:27.140 |
>> What's great about math is it's very cheap to generate. 00:20:39.100 |
But the math stuff lets us do things like this with a very high degree of certainty. 00:20:46.100 |
We had a little debate in the Discord yesterday about n-digit summation. 00:20:49.100 |
So this is about adding one-digit numbers together, it learns it very quickly. 00:20:53.560 |
Adding two-digit numbers together, it takes a bit more time. 00:20:55.860 |
Adding five-digit takes the longest time, but it eventually learns it as well. 00:21:01.060 |
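Since the point is that this arithmetic data is trivially cheap to generate, here's a minimal sketch of how such an n-digit addition dataset could be produced; this is my guess at the recipe, not the authors' actual script.

```python
import random

def make_addition_examples(n_digits, n_examples, seed=0):
    """Generate toy n-digit addition problems as (input, target) strings."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n_examples):
        lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
        a, b = rng.randint(lo, hi), rng.randint(lo, hi)
        examples.append({"input": f"{a} + {b}", "target": str(a + b)})
    return examples

# e.g. a curriculum from 1-digit up to 5-digit sums
dataset = [ex for d in range(1, 6) for ex in make_addition_examples(d, 1000, seed=d)]
```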
Do you feel that right now STaR, and I think maybe V-STaR, goes along this same line? 00:21:07.100 |
Do you feel like this is only limited, at least in the fine-tuning stage, to math and code? 00:21:13.900 |
I know they use Q&A, but it's limited to cases where we have the correct answer. 00:21:18.180 |
So when it's math and code, you can infinitely generate as many as you want. 00:21:28.300 |
>> No, it's like, do you feel like this can generalize beyond solely math and code? 00:21:35.980 |
>> Like, maybe there are some things like subjective answers? 00:21:44.380 |
This is the one thing whereby the answer is very objective. 00:21:50.300 |
I said, maybe I wonder if this could generalize to relevance. 00:21:53.620 |
Maybe if I'm searching for an iPhone, should I be showing an iPhone or showing an iPhone 00:21:57.780 |
case or the new iPhone or showing an iPhone that's not... 00:22:02.580 |
It's like things like that where it's a little bit more general. 00:22:15.620 |
>> But this is very impressive for a 2022 paper. 00:22:19.620 |
>> Because it is obviously, you know, something that we need to do. 00:22:26.100 |
I think people have been talking in the chat. 00:22:31.820 |
Would be super interested in seeing how the rationale traces end up in an analysis like this, 00:22:36.700 |
and the particular inner function that empowers using a language-defined world model for rationalization. 00:22:53.220 |
It's a small sub 1B model that's able to do math operations. 00:23:00.220 |
It is along the same idea of like we generate the basic math operations and we just train on them. 00:23:07.340 |
And it works with this giant humongous chain of thought for the multiplication and summation. 00:23:13.860 |
The crazy thing that we did was that we inverted the numbers during the calculation and it 00:23:20.820 |
>> So instead of like, you know, like 1,200 is 1,200, it does a chain of thought of 0,021. 00:23:28.820 |
>> Oh, I mean, that I can back-rationalize, because when we do addition, we do it starting from the last digit, carrying as we go. 00:23:38.460 |
And the model generates from the first digit to the last digit, so reversing the number lines the generation order up with the carries. 00:23:50.860 |
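A hedged sketch of the digit-reversal trick being described: write the running sum least-significant-digit first, so the carry is already known at the moment each digit is generated. The exact trace format here is an assumption on my part, not the actual training data.

```python
def reversed_addition_trace(a: int, b: int) -> str:
    """Emit an addition chain of thought with digits reversed (least
    significant digit first), so each digit written depends only on digits
    already written plus the running carry."""
    da, db = str(a)[::-1], str(b)[::-1]
    carry, out_digits, steps = 0, [], []
    for i in range(max(len(da), len(db))):
        x = int(da[i]) if i < len(da) else 0
        y = int(db[i]) if i < len(db) else 0
        total = x + y + carry
        steps.append(f"{x} + {y} + carry {carry} = {total}: write {total % 10}")
        out_digits.append(str(total % 10))
        carry = total // 10
    if carry:
        out_digits.append(str(carry))
    reversed_answer = "".join(out_digits)
    steps.append(f"reversed result: {reversed_answer}")
    steps.append(f"final answer: {reversed_answer[::-1]}")
    return "\n".join(steps)

print(reversed_addition_trace(1200, 345))  # the 1,200-style example from above
```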
>> But I think the highlight is how small the model is. 00:23:57.340 |
So does it only do random events or does it only do... 00:24:04.060 |
>> I think like here, you're basically just testing universal function approximation. 00:24:33.060 |
>> So if everything works at this scale, right, it's just like adding that layer, then you're 00:24:39.060 |
able to do the rationalization, then everything will be able to chain up. 00:24:42.900 |
That is my point of view on like why even 8Bs can actually do decent chain of thought 00:24:53.060 |
Someone is talking about medical domain, I guess, you know, remains to be seen. 00:25:00.700 |
But I think you can basically just take the methodology from here about the rating the 00:25:05.020 |
answers and all that, and feeding it into the fine tune, and it will probably work. 00:25:11.660 |
Aditya says postfix notation plus inversion sounds smart for reasoning traces. 00:25:18.660 |
Andrei says, can this kind of fine-tune be applied to any general model, like Llama 3? 00:25:24.100 |
This is a method that is general, and I would be very surprised if they did not use this 00:25:35.620 |
I am about to shit horribly on this paper because it was a waste of time. 00:25:41.620 |
So this is the same author as the STaR paper, this guy, two years after the fact. 00:25:50.060 |
And basically he is trying to extend it; he is criticizing himself and saying, like, we inferred 00:25:56.860 |
rationales and learned from those that lead to a correct answer. 00:26:01.580 |
Ideally, a language model could instead learn to infer unstated rationales in arbitrary text. 00:26:06.220 |
So he starts to have this idea of internal rationales and external rationales. 00:26:10.860 |
All of these rationales are externalized in the sense that you can see the chain of thought 00:26:19.060 |
Now he basically read the pause token paper and wanted to apply it to STaR. 00:26:26.660 |
So: we present Quiet-STaR, a generalization of STaR in which LMs learn to generate rationales at each token to explain future text. 00:26:33.820 |
This is ColBERT-level crazy: like, why don't you just do it at every single token? 00:26:42.380 |
So the problem with, obviously, generating chain of thought at each token is that it's expensive. 00:26:50.300 |
The LM also doesn't initially know how to use those internal thoughts. 00:26:55.980 |
And you also need to look ahead a little bit more than just the next token. 00:27:03.260 |
I don't super 1,000% get it, but they have a really nice graphic, which I'm going to show you. 00:27:12.380 |
So given a text with token, token, token, token, token, he's just really trying to show 00:27:18.740 |
that you predict, you know, in a very sort of speculative-decoding way, in parallel. 00:27:28.900 |
You generate a bunch of parallel thoughts, and maybe a bunch of tokens in each of these thoughts. 00:27:35.840 |
You end the thought process, and then you cut out whatever doesn't work, and then you continue predicting the next tokens. 00:27:50.580 |
But this is all he has, and he has a bit more of predictions here. 00:27:56.920 |
So basically, it's, like, you have -- let me see if I can say this in a better sentence. 00:28:06.600 |
He has this kind of graphic, which is no help at all. 00:28:09.960 |
But basically, it's kind of, like, token, token, token, and you can generate thoughts 00:28:14.440 |
for each token, but then also continue the other tokens in process as well. 00:28:19.880 |
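Here's my rough mental model of the "generate thoughts at every token" idea, as a hedged sketch: after each position, branch off a short thought wrapped in start/end-of-thought markers (done serially here for clarity, in parallel in the paper), and use the thought only to help predict the tokens that follow. The sample_thought callable and the marker strings are placeholders, not the paper's implementation.

```python
def quiet_star_think(tokens, sample_thought, thought_len=8):
    """Sketch of the 'think' stage: for each prefix of the sequence, generate a
    short thought. sample_thought(prefix, max_len) is a hypothetical callable
    returning a list of thought tokens conditioned on that prefix."""
    START, END = "<start_of_thought>", "<end_of_thought>"
    branches = []
    for i in range(1, len(tokens) + 1):
        prefix = tokens[:i]
        thought = sample_thought(prefix + [START], thought_len)
        # The thought is hidden from the final output; it only conditions the
        # prediction of the tokens that come after position i.
        branches.append(prefix + [START] + thought + [END])
    return branches
```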
I feel like there's something to this idea, but it is very hard to communicate. 00:28:25.880 |
But I think the way that I would explain it is you have to read the pause token paper 00:28:37.920 |
You have to read this one, then go over here. 00:28:42.960 |
So let me get there first, before I go too hard on this. 00:28:50.280 |
I do like the way that he introduces his papers, though, because it really helps you focus, 00:28:55.840 |
like, on what he thinks is novel about his paper, right? 00:28:58.840 |
So with the star paper, he offered these four things. 00:29:04.920 |
For me, I personally highlighted the first two, because the last two are just evals. 00:29:16.120 |
I think I would highlight maybe the first three, maybe the first four as relevant. 00:29:25.480 |
And honestly, I think three is the main idea. 00:29:32.240 |
So he's basically saying Quiet-STaR generalizes STaR to learn reasoning from diverse unstructured text data. 00:29:38.120 |
To our knowledge, this is the first work explicitly training LMs to reason generally from text, rather than on curated reasoning tasks. 00:29:49.280 |
Whoever asked about generalizing beyond math. 00:29:52.800 |
Everyone wants to generalize beyond math, right? 00:29:58.160 |
I'm not sure it is scalable or usable, but it's a way. 00:30:02.880 |
Second one is parallel sampling, which is the graphic I already showed you. 00:30:06.640 |
It's a parallelization technique, nothing more. 00:30:12.800 |
Fourth, we apply a mixing head to mix the next-token prediction from the thought into the base next-token prediction. 00:30:20.840 |
So it's a little bit of like, I don't know, it's like speculative chain of thought decoding 00:30:26.720 |
Fifth, non-myopic loss, including multiple tokens ahead. 00:30:31.160 |
So there's a look-ahead effect, which we'll cover later. 00:30:37.600 |
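A hedged sketch of how I picture the mixing head and the non-myopic loss: the model produces a next-token distribution with and without the thought, a small learned head interpolates between them, and the loss averages over several true tokens ahead instead of only the immediate next token. Shapes and names here are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MixingHead(nn.Module):
    """Toy mixing head: a shallow MLP that outputs a weight in (0, 1) for
    interpolating the base and post-thought next-token distributions."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, h_base, h_thought, logits_base, logits_thought):
        # Weight depends on the hidden states with and without the thought.
        w = self.mlp(torch.cat([h_base, h_thought], dim=-1))   # (batch, 1)
        probs_base = logits_base.softmax(dim=-1)                # (batch, vocab)
        probs_thought = logits_thought.softmax(dim=-1)
        return (1 - w) * probs_base + w * probs_thought

def non_myopic_nll(mixed_probs_ahead, true_tokens_ahead):
    """Toy stand-in for the multi-token objective: average the negative
    log-likelihood over several true tokens ahead, not just the next one."""
    nll = torch.tensor(0.0)
    for probs, tok in zip(mixed_probs_ahead, true_tokens_ahead):
        nll = nll - torch.log(probs[..., tok] + 1e-9).mean()
    return nll / len(true_tokens_ahead)
```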
So is everyone familiar with the pause-before-you-think paper? 00:30:47.980 |
Without two, you pay a high cost, with two, it comes for free. 00:30:53.920 |
I think it only comes for free if you have the full GPU and it doesn't already take up the full GPU, right? 00:31:00.440 |
Because I understand autoregressive sampling, like you're batching everything, right? 00:31:03.880 |
RJ, is that what-- you're saying the efficiency comes from batching. 00:31:10.240 |
My take is sort of that-- and I agree it was hard to understand, but I thought that the 00:31:14.960 |
attention mechanism, where you already have these tokens in the-- I don't think it's even 00:31:21.160 |
related to the batching, per se, because you already have the whole sequence for that one 00:31:27.680 |
set of-- or for that one piece of text, and you're just masking some of it out with the 00:31:34.520 |
So I think they're using the portions in the unused attention mask to-- yeah, this diagram. 00:31:43.080 |
So I think they're taking advantage of the areas that would have been masked out by the 00:31:47.040 |
attention mask and doing inference in that region, and therefore it comes for free. 00:32:06.460 |
In normal inference economics, I don't know -- if I were an API provider, would I be able to do this? 00:32:15.540 |
I guess I would want to dig in, but my question would be, if this is actually the case, what 00:32:23.140 |
I'm saying is, then why isn't everyone doing this, right? 00:32:25.420 |
Because it seems like a little bit of complexity and a lot of benefit, or maybe not huge, but still. 00:32:40.420 |
But I thought Claude is doing the thinking token thing? 00:32:49.220 |
I don't think anyone actually believes that it is actually doing thinking tokens. 00:33:03.620 |
Does anyone have the Claude Artifacts prompt? 00:33:20.060 |
They're prompted to include thinking tags, and then the UI manually removes them. 00:33:34.420 |
I think it's just a question of tokenization, because if the open-bracket thinking close-bracket 00:33:39.300 |
is a token itself, then it is a thinking token. 00:33:47.700 |
But thinking tokens in the context of this research, both Q* and the actual thinking 00:33:57.180 |
tokens paper treat their thinking tokens very differently. 00:34:04.700 |
I understand, I don't know, that it may be a distinction without a difference. 00:34:15.060 |
I'll try researching on this, because I have a separate chain of thought on this. 00:34:22.260 |
I would also recommend people, there's a Backspace token paper. 00:34:25.380 |
The Backspace token paper is not called the Backspace token paper. 00:34:28.180 |
But anyway, there was kind of one observation in the wild on GPT-4o, where Yann LeCun always 00:34:35.420 |
says autoregressive LLMs, once they start on the wrong path, they will continue down that path. 00:34:42.660 |
And ChatGPT actually, for the first time, displayed its ability to self-correct in the middle of a generation. 00:34:56.260 |
So this generated a little bit of discussion as well. 00:35:00.400 |
We don't know if it's able to Backspace or search. 00:35:07.740 |
There was an interview with John Schulman, where he was mentioning models correcting themselves. 00:35:15.540 |
I think it was on the Dwarkesh podcast, and he mentioned that, if I remember correctly, they just put 00:35:23.420 |
like 30 examples, I think, in pre-training, where you would have some discussion, like 00:35:30.260 |
I'm solving this problem, oh, I'm wrong, and then fixing. 00:35:33.900 |
And he said that just having a few of these examples actually allowed the model to learn 00:35:40.720 |
this ability to kind of, OK, double-check its reasoning. 00:35:46.380 |
There's also some theoretical work by a researcher at Meta looking into the internals of transformers. 00:35:58.980 |
And he says the models kind of get in some state where they kind of know they're wrong. 00:36:04.380 |
But if you don't train them to kind of explicitly say they're wrong, then they keep going. 00:36:10.160 |
So he mentioned -- I'll post it in Discord -- the whole talk by Zeyuan is really good. 00:36:19.220 |
But this particular part, it seems like you can fix some of the facts that are wrong by 00:36:23.580 |
just allowing them to say, hey, I'm changing my mind. 00:36:37.880 |
But if you find it, drop it in the Discord, I guess. 00:36:42.860 |
I got to plow along because we've got nine minutes left. 00:36:53.640 |
So there are three stages-- think, talk, and learn. 00:36:58.980 |
I tried to turn this really dense algorithm thing into better pseudocode that is a bit more readable. 00:37:13.500 |
And then there's a bit of the mixing and updating the model params with the teacher forcing. 00:37:27.100 |
I don't think it's super insightful, apart from -- you don't want to rely on the thought-conditioned prediction alone. 00:37:35.140 |
Do you want to introduce a mixing layer in the middle? 00:37:39.460 |
When they say mixing with and without thoughts, what does it mean? 00:37:42.040 |
Is it to mix in the original output without the thought and also mix in the output with the thought? 00:37:53.060 |
I think the demonstrated thoughts are very, very small. 00:37:58.300 |
This is not a truncation just for the graphic. 00:38:10.420 |
So the amount of testing, the amount of thinking ahead is actually not a lot. 00:38:24.500 |
They had a bit in a paper talking about how they're trying to do thinking ahead on every 00:38:31.980 |
single token, but then they also recognize that it's probably not useful to think on 00:38:35.380 |
every single token and you probably want to trim it. 00:38:37.380 |
So I think the MLP is just for filtering out. 00:38:43.460 |
What is curious to me is why they had to mix it. 00:38:45.620 |
Why not just use the one that has a thought alone? 00:38:48.000 |
I guess maybe the one that has a thought alone is too far from distribution and therefore 00:38:52.580 |
You know, something similar to how you would do with KL Divergence. 00:38:55.820 |
That intuition was just, I couldn't get that when I was reading the paper yesterday. 00:39:00.180 |
But maybe I'll read through it again and again. 00:39:03.060 |
And I mean, thanks to your suggestion on the pause-before-you-think paper, that was something. 00:39:09.700 |
I feel like he just read this paper and he was like, I also want this too, but I'll do it at every token. 00:39:18.020 |
It produced some really nice charts where you can extend the thinking ahead and you get better results. 00:39:26.700 |
So having a nice tunable hyperparameter for thinking ahead, you know, lets you tune it. 00:39:37.820 |
He did this on Mistral 7B with OpenWebMath and C4. 00:39:42.980 |
So for the people asking why isn't everyone doing this: 00:39:53.740 |
I don't know if I'm supposed to be impressed by the results. 00:40:02.620 |
So here, base Mistral 7B -- here are some examples. 00:40:08.460 |
Base Mistral 7B takes this question from GSM8K: Janet's ducks lay 16 eggs, she eats three and bakes with four, 00:40:16.660 |
and then she sells the remainder at $2 per egg. 00:40:19.720 |
So base Mistral 7B gives the wrong answer, because she's supposed to take nine times two. 00:40:27.400 |
Instead it gives us 12 times two -- the 12 is hallucinated -- which gives us 24. 00:40:31.840 |
Whereas Quiet-STaR breaks it out step by step with all these reasoning chains and gives us the right answer. 00:40:40.160 |
So there's a lot of examples of this, where Quiet-STaR, trained with a lot more 00:40:45.960 |
thinking ahead, maybe reduces hallucination, and that is the entire source of the advantage. 00:40:55.120 |
Yeah, so I think the lift is not that impressive. 00:40:59.960 |
I think the ability to deploy this in production is not that impressive. 00:41:05.160 |
So, sorry, I'm jumping around, but I'm trying to look for the numbers. 00:41:12.760 |
I think with the performance numbers in the end, the juice 00:41:15.940 |
is not worth the squeeze, as a famous member of the paper club has said. 00:41:22.160 |
Like, it's kind of theoretically cool, but, you know, we probably need something better. 00:41:28.320 |
So V-STaR, a February paper done by a Mila PhD student, unrelated to the other guy, takes 00:41:39.200 |
STaR in a different direction, which I like a lot. 00:41:43.960 |
So V-STaR makes the same criticism: that STaR is not 00:41:49.220 |
gaining enough information from the incorrect solutions. 00:41:53.420 |
So it's potentially neglecting valuable information. 00:41:57.180 |
So VSTAR utilizes both correct and incorrect solutions generated during the self-improvement 00:42:01.660 |
process to train a verifier using DPO that judges correctness of model-generated solutions. 00:42:11.180 |
The verifier is used at inference time to select one solution among many candidate solutions. 00:42:20.500 |
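A minimal sketch of what using such a verifier at inference could look like, versus the majority-voting baseline it gets compared against; generate, verifier_score, and extract_answer are hypothetical callables, not the paper's code.

```python
from collections import Counter

def best_of_k(question, generate, verifier_score, k=16):
    """Pick one of k sampled solutions using a trained verifier."""
    candidates = [generate(question) for _ in range(k)]
    return max(candidates, key=lambda sol: verifier_score(question, sol))

def majority_vote(question, generate, extract_answer, k=16):
    """Baseline: pick the most common final answer, ignoring the reasoning."""
    candidates = [generate(question) for _ in range(k)]
    answers = Counter(extract_answer(sol) for sol in candidates)
    top_answer, _ = answers.most_common(1)[0]
    # Return any candidate whose final answer matches the majority answer.
    return next(sol for sol in candidates if extract_answer(sol) == top_answer)
```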
But the improvements are so much more worthwhile than Quiet-STaR's that I think it's worth covering. 00:42:32.960 |
So V-STaR is demonstrated to improve across a lot of these angles. 00:42:43.920 |
I didn't have time to like make a mock diagram of this. 00:42:47.200 |
But basically, training a verifier to judge between model-generated solutions -- let me just show you. 00:42:56.480 |
I'm pretty bullish on training verifiers as a part of your process and then 00:43:01.040 |
using that as an artifact to run during production. 00:43:13.280 |
So they were comparing against all these other baselines, V-STaR versus all the others. 00:43:22.300 |
And I really like that you can just kind of apply it versus majority voting and basically beat it. 00:43:30.240 |
So V-STaR is able to scale with the number of K candidates, because 00:43:38.960 |
in the training process you're already training the verifier, and the verifier 00:43:43.080 |
you can use separately from the raw model itself, which is kind of cool. 00:43:48.720 |
So you can basically pick out the right answers. 00:43:51.600 |
Oh yeah, this is what I was trying to show. 00:43:55.600 |
The figure in this paper is not really great, but here's an example of the kind of verifier at work. 00:44:02.800 |
So here's a GSM8K question, and here's two candidate answers that were generated by STaR. 00:44:09.040 |
V-STaR adds a verifier on top of that, which is basically trained to detect the right answer from the wrong one. 00:44:15.040 |
So each candidate would get a verifier score, and it would do something 00:44:19.640 |
like this, where you'll take a question, you'll have a solution, 00:44:26.340 |
and you'll pick among a list of candidate solutions. 00:44:29.400 |
Majority voting would pick the most common solution rather than the most correct one. 00:44:35.160 |
And V-STaR uses the DPO-trained verifier to pick the most correct solution. 00:44:42.800 |
So I guess the question is, how do they even train VSTAR? 00:44:51.840 |
The input label is correct solution and wrong solution, but how would they distinguish? 00:44:56.440 |
Here's the algorithm -- at least they label correctness, I see. 00:45:00.760 |
A little bit more readable than the other guy. 00:45:07.640 |
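On the training question: my reading is that the generated solutions are checked against the known gold answer, and correct versus incorrect solutions for the same question are then paired up as DPO preference data for the verifier. A rough sketch of that pairing step, with hypothetical names:

```python
def build_dpo_pairs(question, solutions, gold_answer, extract_answer, max_pairs=8):
    """Pair correct vs. incorrect generated solutions for the same question,
    to be used as DPO preference data for the verifier."""
    correct = [s for s in solutions if extract_answer(s) == gold_answer]
    incorrect = [s for s in solutions if extract_answer(s) != gold_answer]
    pairs = []
    for good, bad in zip(correct, incorrect):
        pairs.append({"prompt": question, "chosen": good, "rejected": bad})
        if len(pairs) >= max_pairs:
            break
    return pairs
```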
So I like this just because you want to maximize information from your generations. 00:45:15.760 |
And it was obvious that the original STaR people threw away a lot of the incorrect stuff. 00:45:19.540 |
And like, the original star people's insight was to do rationalizations. 00:45:23.160 |
But here, we're actually using that; we're training that into a verifier model, 00:45:28.440 |
which can get better over time, but then also be deployed. 00:45:33.080 |
And I like that idea, that we can deploy this and not have to rely on the hope 00:45:39.560 |
that we can fine-tune it into the base model. 00:45:48.080 |
This ties in a lot with the, let's verify step-by-step from OpenAI. 00:45:52.560 |
So this is where I see us going: from STaR and Quiet-STaR into V-STaR. 00:46:02.760 |
Does the verifier verify the entire thing, or -- because Let's Verify Step by Step verifies each step? 00:46:16.680 |
So the one in V-STaR, is it process reward or is it full outcome? 00:46:29.600 |
But I think we could use it to, to create process reward as well. 00:46:37.440 |
It only talks about the label correctness at the end. 00:46:42.880 |
So this is, this is outcome reward model, not process reward. 00:46:51.680 |
Like, there's a body of literature that is coming together, and it's like, you know, 00:46:54.960 |
oh, it probably will use some combination of all these things. 00:47:04.600 |
I'll stop the recording here and we can open up for other Q&As or whatever.