[Paper Club] 🍓 On Reasoning: Q-STaR and Friends!

And I've lost many recordings to that. OK. All right. Well, OK. So I'll just go ahead. So STAR is a 2022 paper. I was surprised to see that it's actually basically just one guy's work. Eric now works at XAI. And this is his website if you want to go see it.

He is responsible for the first two papers. We're doing STAR. We're doing QuietSTAR. We're doing VSTAR. Mostly just because they have star in the name. But also they seem to be most mentioned by the people that were throwing around all these survey papers and stuff. OK. So STAR. I believe Eugene, Yen, you've already covered this before.

But I think this is the most foundational and oldest. So I liked it the most, I think. And then I like VSTAR the second and then QSTAR the least. So the general idea of STAR is that we have this bootstrapping cycle of creating a rationale for each answer. So when a question is asked of a language model, it is trained to think of a rationale before answering.

So it's basically a form of chain of thought that before you just spout the answer, you think a little bit first. So it's very related to the chain of thought literature. I think it's very related to ORCA as well. And the interesting thing is that they take the -- they establish a positive loop where the rationales -- you know, they generate a bunch of candidate rationales, basically.

And the rationales that lead to correct answers are viewed to be better rationales than the rationales that lead to wrong answers. And then our -- and that leads to a fine-tune of the language model. There's a second loop, which -- where if the rationale leads to a wrong answer, it can generate a rationalization.

So these two words are pretty similar, rationale and rationalization. They're two different words as far as the paper is concerned. And the rationalization, if it does lead to a correct answer, then it also gets fed back in. And that wrong information is captured a little bit. We'll see later that there's actually ways to do this better that the original style paper missed.

So methodology. Propose a bootstrapping mechanism to iteratively generate a rationale dataset from a few initial examples with rationales without needing to check new rationales' correctness. I really like these ideas. These kinds of ideas where you can bootstrap from a small set of data and you don't need to really check the new set of data.

Because that enables you to scale pretty massively. We complement rationalization with rationalization, where a model is tasked with justifying an answer and fine-tune as if it had come off the rationale without any hint. So I think I have a slide on this. Okay. Because when I was doing the livestream, people were asking me, what is this?

Is rationalization the same thing as rationales? Basically, it's not. It's kind of back to front. So it's very common for language models to get kind of stuck in cycles where it just answers the same question. It gets kind of stuck in a loop. So to overcome this issue, we basically -- they propose rationalization.

For each problem that the model fails to answer correctly, we generate a new rationale by providing the model with a correct answer. This lets the model reason backward. So okay. You fail to do the correct path, right? And here we're doing only the positive fine-tuning. The next thing we do is we give a hint by giving it the answer and then backwards rationalizing what rationale would have led to the right answer and then throwing that into the data set for fine-tuning.

Rationalization accelerates and improves the bootstrapping process. So we have a little chart here showing the star without rationalization on an addition problem of n digits and showing that with rationalization, it actually gets training a lot, lot faster than without. So this is a pretty effective idea, I think. And we'll see how to improve on that later.

I copied this out mostly because this is basically a very nice pseudocode. I don't like the way it's presented, but I think for formalism, this makes sense for some people. I don't really have any other comments on this, apart from -- I think when I originally read the paper, when it started -- when it was presented like this, first we had the positive bootstrapping, then we have the negative rationalization, and we have this and we have this.

It seemed like two loops. It seemed like the first loop is we'll fine-tune on correct answers, and the second loop is we'll correct wrong answers. But the algorithm that they present in pseudocode does everything in one loop. So you split the code path from -- you generate rationales, maybe you rationalize, you filter rationales using ground truth.

So they're performing both loops in one pass, which seems very efficient for some reason. I don't know. It's not how exactly I would do it, but this is probably more efficient. Okay. This is a 2022 paper. It actually started with GPTJ6B. So super small model by modern standards. They had a few data sets, math, common sense QA, and GSM 8K.

I don't really have any quarrels with this. I don't think -- I wish the descriptions were better. They referenced Jason Wei's paper on chain of thought, but they didn't actually show the kind of few-shot examples that they were doing. So in some sense, this is a badly written paper in that it is going to be very hard to reproduce, because they did not show a lot of the -- the full sample of what they did.

They did have some transparency on some of the questions. So this is fun, because there's some audience participation points. All right. I'll just ask you the question without showing the answer. So here's the task that demonstrates the value of Q* and the subtle nuances of what you're being asked to do that you might take for granted.

So here's a question with a multiple choice and three possible answers. So I want -- whoever's listening or watching -- oh, Jimmy, you're here. Hey, Jimmy. Sorry. I said you weren't active anymore. That was a lie. Okay. So on their hike, they brought a filtering straw. They were worried about germs in the what?

Answer choices -- make sick, doctor, water, stream, and mouth. So the correct answer is water, right? If they brought a filtering straw, they were worried about germs in the water. We as humans know this. Now the question is how to teach the machine to reason their way into understanding that water is the right choice, and you don't want to just give the right answer -- you don't want to just get the right answer with the wrong rationale.

You want to have the right rationale as well. So for example, answer A, the answer must be something that can filter out germs. Filtering straws are used to filter out germs, therefore the answer is filtering straw C. This is wrong, right? It's like they got the right answer, which is C.

C is the right answer, but it's the wrong reasoning. Because when you say therefore the answer is filtering straw, that is the wrong reason. B, the answer must be something that would cause someone to bring a filtering straw on a hike. Filtering straws are used to filter water, therefore the answer is C.

This is a good reasoning trace. The last answer. Straw is -- there's a typo here. Straw is something to use to drink water, therefore the answer is water C, right? So which is the best -- right. So Eric says -- Cosmin, yeah, the slides are in the Discord. Eric says the answer is C and D overlap.

They do overlap. I think the more intuitive thing -- this is a very classic NLP entailment thing. What is the more likely answer? The more likely answer is water, because there's no assumption that stream is more specific than water. So when in doubt, pick the more generally probable answer.

So anyway, the human-rated task is what is the best actual answer that you want if you're trying to train for a dataset that has a reasoning trace, right? Is it A1, A2, A3? It is actually A2, right? Because A3 jumps straight to the answer, right, and A1 jumps -- A1 has the right answer but for the wrong reasons, or like it has flawed data.

So this STAAR paper actually used human raters to choose between answers that were correct. And I think that's an unusual way to use human raters. Usually you use human raters to choose correct answers from wrong answers. But here the human raters are being asked to evaluate reasoning and the quality of reasoning.

Eugene Chow says, "Is there an issue of using all three?" What do you mean? In the context of training, why can't we just train all three? As long as they are good enough, it's their problem. Well, this is bad. This is bad. A1 and A3 are bad. Because A1 has faulty reasoning, A3 has not enough reasoning, right?

So A2 has just enough. Like logical flow cannot -- like super basic, probably too verbose, but like you cannot argue with any of the steps. So if you're to fine-tune a reasoning model, this A2 is the kind of dataset that you want. And the star paper, star authors employed human raters to find this.

So okay. I'll give you a little bit more on the details here. But when the human raters were given this, they were all randomized. So imagine just going through and picking A1, A2, A3, A1, A2, A3, A1, A2, A3, for like 50,000 answers. It was very laborious. But it's kind of fun.

Okay. This one is another one that's kind of fun. Again, I'll just run it through. The human always would have fun making up questions for the AI overlords. He found the task quite what? Answer choices, do enjoy, eat cake, enjoy living, get laid, enjoyable. I think it's worthwhile going through these kinds of questions to, you know, key catchphrase, look at your data.

When you look at your data, you really understand how inane and mind numbing but also nuanced some of these choices are, right? So what is the right choice? I had trouble parsing this. The human always had fun making up questions for the AI overlords. He found the task. And this is also meta because it's making up questions for the AI overlords.

He found the task quite what? He found the task quite do enjoy. No, that's not grammatical. He found the task quite eat cake. No. He found the task quite get laid. You know, I said that D is the answer that I wish would happen if I, you know, if I answer enough questions for the AIs, I'll get laid.

But actually, the answer is E. That's actually, I think, the most grammatically correct answer. So this is actually a grammar question rather than anything. So again, A1, the answer must be something that human would enjoy doing, blah, blah, blah. Therefore, the answer is enjoyable. So it's like they all got the right answer, but they all took different paths to get there, right?

So the last question, the last answer. Having fun is enjoyable. Therefore, the answer is enjoyable. And B, the answer must be something that the human found enjoyable, making enjoyable, blah, blah, blah. So you can see, like, this is very laborious. Everyone's kind of reading this through it. And at the end of this whole thing, then you're-- then the big reveal is that this is chain of thoughted, un-fine-tuned GPT-J.

So the first answer is in the presenter results. The paper has a few dozen of these, by the way. In the first answer, always GPT-J, un-fine-tuned. The last answer is human entry, a human answering it. And you can see, like, the human reasoning is-- humans are really bad at showing rationales.

They always just jump straight to the answer, which is really funny. I'll show you a counterexample at the end where this is the opposite. And then B was the star answer, star generally fine-tuned to show reasoning for any task very, very well. So I thought this was very impressive.

I'll go for one more example in the reasoning domain. This is a math question. So we're jumping from simple logic. So this is super simple. I have to stress, like, we are so early in teaching language models to reason. Like this is the pinnacle of reasoning, right? This is, like, not fucking reasoning at all as far as my IQ test is concerned.

But as far as GPT-J6B is concerned, they're good on this, I guess. We cannot take this for granted. OK, here's a math reasoning using natural language, right? Natalia sold clips to 48 of her friends in April. Then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?

Does anyone want to solve this? Please feel free to think out loud and jump on the mic. I want to make this interactive. I don't want to make this, like, a lecture. >> 72. >> All right. So 4 plus 24, 72, right? OK. Next question. Betty is saving money for a new wallet, which costs $100.

Betty only has half the money that she needs. Her parents decided to give her $15 for that purpose, and her grandparents twice as much as the parents. How much more money does Betty need to buy the wallet? Can someone do a live chain of thought while solving this? Eric?

>> Sure. So Betty needs $100. She only has half of the money she needs, so she has $50. Her parents decided to give her $15 for that purpose. So in total, she has $65. Her grandparents give her twice as much, so twice as much as $15 is $30. And 65 plus 30 is 95, so she needs five more dollars to reach $100 for the wallet.

>> Perfect. So the answer is 72 and 5. And both of you gave a little bit of chain of thought. Would you be surprised that a language model can do that? So these are the generated answers of Star, showing what Eugene said and what Eric said. I thought it was pretty cool.

I actually would not have put it on the screen. >> This is a 6B model? >> Yes. >> That's pretty good, actually, to do this level of arithmetic. It is also that just flexible, natural language understanding of just whatever, just throw it in there. It got it. And it was really just the fine-tuned rationalization, step-by-step, thinking step-by-step, but not just the lazy kind of thinking step-by-step, like, what do I need to know first?

What do I need to know next? How to combine those pieces of information, how to copy, how to calculate? It's really good. And the paper has quite a few dozen examples of this going on. >> This is after fine-tuning, right? Or is it before? >> Yeah, it's after fine-tuning.

>> Oh, after fine-tuning. That's actually kind of impressive. >> They have an N here on how many times it's fine-tuned. They don't specify the N. I looked for that number. I don't think it's that many. I think, like, max is, like, 20 or 30 iterations, not a ton of iterations.

But the N is a flexible hyperparameter that they used. So this is a two-step problem. This is a three-step problem. Their problem is going all the way up to, like, eight steps, which is pretty impressive. So that lets you generate a chart of human versus machine reasoning. So Eugene and Eric, in answering those two questions, they produced two steps.

And then we can also compare against the model-produced number of steps, and there's a correlation of 53% -- 53 to 57% -- in the sense that when you give them a JSM-AK question, STAR tends to think very, very human-like. Obviously it could be a lot better than 53, but it's surprising that there's a general correlation at all.

And I think, basically, this is a way of understanding reasoning in a structured format that I thought was insightful that I had not seen before. Because once you can do something like this, where I can say I can give you a measurably harder problem -- because I give you an eight-step problem, it's a harder problem than a two-step problem.

If I can give you a measurably harder problem and I can roughly grade the calculator on its ability to get there, then I can improve it. So I thought that was pretty cool. There is -- so I think I'm about to finish this paper. There are some cases where the model dataset was actually bad -- or JSM-AK was bad.

Here's an example of a really stupidly confusing question. A van is delivering 180 bottles of drinks to a neighborhood. Each bottle contains either cider or beer or a mixture of two. Out of the 180 bottles, 40 contain only cider, 80 contain only beer, the rest a mixture of two drinks.

If every man gives half the number of each bottle of drink to the first house, how many bottles does the first house get? So there's this whole -- there's a lot of random context. But actually it's asking you to divide 180 by 2. So the human gave this. And star gave this.

>> No, seriously, that's the human? >> Yeah. >> Like W. Okay. The human didn't read all the way to the end. >> So this is good out-of-domain generalization in the sense that it -- we all know datasets have errors. So this, like, star improved on human. It's good out-of-domain correction of bad data inside of the dataset.

So it's kind of nice. Like star understood better than human, which is really, really interesting. So I think the relevance here for O1 is that if we were to generate reasoning traces, we would have to do work like this, where the rationale would have to be exposed into step-by-step thinking and we would have to grade it in a way that makes sense, right?

So that's my TLDR. Any questions on star? >> Yeah. I have a question about the -- you said it was a dataset of 50,000, did I hear that right earlier? >> Yeah, generated dataset. It was literally, like, 1+1, 2+2, 3+3, 11+11, 111+111, you know, stuff like that. >> Synthetic data.

It was all? >> Yeah. Sure. It's even shitty to call it synthetic data because there's no LLMs involved. It's math. It's a for-loop. >> I see. >> What's great about math is it's very cheap to generate. We absolutely know the right answer. But the math stuff lets us do things like this with a very high degree of certainty.

We had a little debate in the Discord yesterday about n-digit summarization. So this is about adding one-digit numbers together, it learns it very quickly. Adding two-digit numbers together, it takes a bit more time. Adding five-digit takes the longest time, but it eventually learns it as well. >> Yeah. Do you feel that I think right now star and I think maybe V* actually goes along this track.

Do you feel like this is only limited at least in the fine-tuning stage to math and code? >> I know they use GSMK. I know they use Q&A, but it's limited to we have to have the correct answer. So when it's math and code, you can infinitely generate as many as you want.

Sort of like what REST-EM did. >> Right. What's the question? I don't know. Is it... >> No, it's like, do you feel like this can generalize beyond solely math and code? >> I do think so. I do. I do think so. >> Like, maybe there are some things like subjective answers?

>> Yeah. Like, this is not math or code, you know? >> Yeah. That's the thing. This is the one thing whereby the answer is very objective. Okay. Yeah. It's very objective. And you rely on... I said, maybe I wonder if this could generalize to relevance. Maybe if I'm searching for an iPhone, should I be showing an iPhone or showing a iPhone case or the new iPhone or showing an iPhone that's not...

You cannot order, but it says pre-order. It's like things like that where it's a little bit more general. I just wonder how that would work. But maybe there's no answer to this as well. >> Work left to future readers, I'm sure. >> Yeah. >> But this is very impressive for a 2022 paper.

>> It is. >> Because it is obviously, you know, something that we need to do. Okay. Well, there's some questions. I think people have been talking in the chat. Would be super... Alex says... Would be super interested in seeing how the rational traces end up in an analysis like scaling monosemanticity.

If the particular inner function that empowers using a language-defined world model for rationalization. Yeah. Let us know when you do. Eugene Chia, what is this? Oh, from Blink as well. Yes. >> Yeah. It's a small sub 1B model that's able to do math operations. It is along the same idea of like we generate the basic math operations and we just train it.

And it works with this giant humongous chain of thought for the multiplication summation. The crazy one... The crazy thing that we did was that we inverted the numbers during the calculation and it seems to work better. >> What is inverted? >> So instead of like, you know, like 1,200 is 1,200, it does a chain of thought of 0,021.

>> Oh, I mean, that I can back rationalize that because when we do addition, we do it from right to left. Right? >> Correct. And we generate from the first digit to the last digit. I'm mixing up my right and left right now. >> Yeah. Okay. Interesting. Yeah. >> But I think the highlight is how small the model is.

>> Yeah. I mean... Yeah. Sorry. >> Yeah. So does it only do random events or does it only do... >> No, only math. This is just a pure math model. >> I think like here, you're basically just testing universal function approximation and we know it does that. So, like, you know, that's all I got.

Do you... What is the tokenizer for rwkv? Do you tokenize each number separately? >> Oh, this is a character encoding. It was a toy model that we experimented on. >> Right, right, right. That makes sense. Okay. Yeah. Of course it will do it. Nice proof. Anything else? >> So if everything works at this scale, right, it's just like adding that layer, then you're able to do the rationalization, then everything will be able to chain up.

That is my point of view on like why even 8Bs can actually do decent chain of thought math. >> Yeah. Yeah. Yeah. Yeah. Someone is talking about medical domain, I guess, you know, remains to be seen. Somebody needs to try it. But I think you can basically just take the methodology from here about the rating the answers and all that, and feeding it into the fine tune, and it will probably work.

Aditya says postfix notation plus inversion sounds smart for reasoning traces. Okay. Yep. Agreed. Andrei says can this kind of fine tune be applied to any general model like lambda 3? Of course. Absolutely. Yeah. This is a method that is general, and I would be very surprised if they did not use this for O1.

Okay. Moving on. Quiet star. I am about to shit horribly on this paper because it was a waste of time. So this is the same author as the star paper author, this guy, two years after the fact of the original star paper. And basically he is trying to extend, he is criticizing himself and saying, like, we inferred rationales and learned from those that lead to a correct answer.

This is highly constrained setting. Ideally a language model could instead learn to infer unstated rationales in arbitrary text. So he starts to have this idea of internal rationales and external rationales. All of these rationales are externalized in the sense that you can see the chain of thought that's going on in here.

Now he wants to, he basically read Paul's token paper and wanted to apply it to star. So we present quiet star, a generalization of star in which LMs learn to generate rationales at each token. This is, like, crazy. This is Colbert-level crazy of, like, why don't you just throw at every single token?

What would happen there? So the problem with, obviously, generating chain of thought at each token is that you're, like, it costs a lot. The LM doesn't know how to do anything with those internal thoughts as well. And you also need to look ahead a little bit more than just the next token.

So they have a parallel sampling algorithm. I don't super 1,000% get it, but they have a really nice graphic, which I'm going to show now. So given a text with token, token, token, token, token, he's just really trying to show that you predict, you know, in a very sort of speculative decoding way in parallel, but then you add a start thought token.

You generate a bunch of parallel thoughts and maybe a bunch of tokens in each of these things. Only, like, up to 12 tokens, by the way. You end the thought process, and then you cut out whatever doesn't work, and then you generate the next set of tokens. This GIF is all there is.

I wish the animation was better. But this is all he has, and he has a bit more of predictions here. So basically, it's, like, you have -- let me see if I can get you this better sentence. He has this kind of graphic, which is no help at all.

But basically, it's kind of, like, token, token, token, and you can generate thoughts for each token, but then also continue the other tokens in process as well. I feel like there's something to this idea, but it is very hard to communicate. But I think the way that I would explain it is you have to read the pause token paper first.

This one. So maybe I'll rearrange this slightly. I'll just say... You have to read this one, then go over here. Okay. So let me get there first, before I go too hard on this. I do like the way that he introduces his papers, though, because it really helps you focus, like, on what he thinks is novel about his paper, right?

So with the star paper, he offered these four things. He said these are the four things. For me, I personally highlighted the first two, because the last two are just evals. Here he's highlighting six things. I think I would highlight maybe the first three, maybe the first four as relevant.

And honestly, I think three is the main idea. So he's basically saying Q* generalizes star to learn from reasoning from diverse unstructured text data. To our knowledge, this is the first work explicitly training LLMs to reason generally from text rather than on curator reasoning tasks. So this is very relevant to...

What's that guy? Whoever asked about generalizing beyond math. I think Eugene asked about it. Everyone wants to generalize beyond math, right? This is a way to do it. I'm not sure it is scalable or usable, but it's a way. Second one is parallel sampling, which is the graphic I already showed you.

It's a parallelization technique, nothing more. Third, we introduce custom metatokens. This is what we'll dive into next. Fourth, we apply mixing head to mix the next token prediction from the thought into the next token prediction. So it's a little bit of like, I don't know, it's like speculative chain of thought decoding or whatever.

Fifth, non-myopic loss, including multiple tokens ahead. So there's a look-ahead effect, which we'll cover later. And then six, there's a bit of evals. So is everyone familiar with the pause-before-you-think paper? Two is the main idea. Without two, you pay a high cost, with two, it comes for free. RJ says two is the main idea.

I think it's the main idea if you have the full GPU and it doesn't take up the full GPU, I guess. I don't know. Because I understand autoregressive sampling, like you're batching everything, right? RJ, is that what-- you're saying the efficiency comes from batching. My take is sort of that-- and I agree it was hard to understand, but I thought that the attention mechanism, where you already have these tokens in the-- I don't think it's even related to the batching, per se, because you already have the whole sequence for that one set of-- or for that one piece of text, and you're just masking some of it out with the attention mask.

So I think they're using the portions in the unused attention mask to-- yeah, this diagram. So I think they're taking advantage of the areas that would have been masked out by the attention mask and doing inference in that region, and therefore it comes for free. That was my-- On the same hardware?

Yeah. Yeah. OK. Yeah, I mean, that's cool, I guess. In normal inference economics, I don't know if I was an API provider, I'd be able to do this for my customers. Yeah, that's unclear to me, too. I guess I would want to dig in, but my question would be, if this is actually the case, what I'm saying, then why isn't everyone doing this, right?

Because it seems like a little bit of complexity and a lot of benefit, or maybe not huge, but at least marginal benefit. I mean, correct. It's not worth it. That's my guess. Yeah. So that's my, I guess, example. Yeah. But I thought Claude is doing the thinking token thing?

It is simulating thinking tokens. I don't think anyone actually believes that it is actually doing thinking tokens. Yes, that would be my statement on that. Why not? Could you explain a bit? What would they do instead? They are prompted. Does anyone have the Claude Artifacts prompt? There we go.

They're prompted to include and thinking tags, and then the UI manually removes and linking tags. So this is not a thinking token. This is a prompt. Yeah. I think it's just a question of tokenize, because if the open bracket and thinking close bracket is a token itself, then it is a thinking token.

Well, typically, so sure. But thinking tokens in the context of this research, both Q* and the actual thinking tokens paper treat their thinking tokens very differently. They are never emitted. So that would be my two cents on that. I understand, I don't know, that it may be a distinction without a difference.

Yeah, I suspect it might be the case. I'll try researching on this, because I have a separate chain of thought on this. Sure. Agreed. I would also recommend people, there's a Backspace token paper. The Backspace token paper is not called the Backspace token paper. But anyway, there was kind of one observation in the wild on 4.0, where Yann LeCun always says autoregressive LLMs, once they start on the wrong path, they will continue down the wrong path.

And ChaiGBT actually, for the first time, displayed its ability to self-correct in the middle of its own thinking. And that's kind of cool. So this generated a little bit of discussion as well. We don't know if it's able to Backspace or search. Maybe it just got lucky. But there's-- There was an interview with John Shulman, where he was mentioning models correcting themselves.

I think in the Dvorkesh, and he mentioned that, if I remember correctly, they just put like 30 examples, I think, in pre-training, where you would have some discussion, like I'm solving this problem, oh, I'm wrong, and then fixing. And he said that just having a few of these examples actually allowed the model to learn this ability to kind of, OK, double-check their meanings.

There's also some theoretical work by a researcher at Meta looking into the internals of transformers. And he says the models kind of get in some state where they kind of know they're wrong. But if you don't train them to kind of explicitly say they're wrong, then they keep going.

So he mentioned-- I'll post in Discord-- he mentioned the whole talk by Zeyuan is really interesting. But this particular part, it seems like you can fix some of the facts that are wrong by just allowing them to say, hey, I'm changing my mind. Let me explore this other part.

I thought it's relevant to your point here. Thanks. Yeah. Yeah. I'd be interested in that second one. I don't have a link for that. But if you find it, drop it in the Discord, I guess. OK. I got to plow along because we've got nine minutes left. What else can I say about this?

OK. So there are three stages-- think, talk, and learn. I tried to turn this really dense algorithm thing into a better pseudocode that is a bit more accessible. There's a lot of parallel thinking. There's a lot of adding the thinking tokens. And then there's a bit of the mixing and updating the model params with the teacher forcing.

So thinking, we already talked about it. Talking, it's an MLP. It's a three-layer MLP. I don't think it's super insightful apart from-- you don't want to have the thought token always influence the next state. Do you want to introduce a mixing layer in the middle? So it's on this, right?

When they say mixing with and without thoughts, what does it mean? Is it to mix in the original output without the thought and also mixing in the thought plus the expected output? Yeah, with and without thoughts. I think the demonstrated thoughts are very, very small. This is not a truncation just for the graphic.

This is actually the thought. Yeah, it's short. It's very short. It's so short. It's like 16 to 16 tokens, right? 12. Is it 12 or 16? Oh, 24. Sorry. Yeah. So the amount of testing, the amount of thinking ahead is actually not a lot. OK, OK. So yeah, I don't know.

This one makes sense. They had a bit in the paper talking about how they're trying to do thinking ahead on every single token, but then they also recognize that it's probably not useful to think on every single token and you probably want to trim it. So I think the MLP is just for filtering out.

Yeah. Yeah. What is curious to me is why they had to mix it. Why not just use the one that has a thought alone? I guess maybe the one that has a thought alone is too far from distribution and therefore they had to mix it. You know, something similar to how you would do it with KL divergence.
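To make the "with and without thoughts" mixing concrete, here is a minimal sketch in plain Python. It assumes the mixing is a weighted interpolation between the next-token logits computed without a thought and those computed after a thought. In the paper that weight comes from the small MLP head; here it is just a hand-set number, and the function names and logit values are made up for illustration.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_logits(base_logits, thought_logits, weight):
    """Interpolate between logits without a thought and logits with a thought.

    weight = 0.0 ignores the thought entirely; weight = 1.0 uses only the
    post-thought prediction. In the real method a learned head would produce
    this weight; here it is a plain argument (an assumption of this sketch).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return [(1.0 - weight) * b + weight * t
            for b, t in zip(base_logits, thought_logits)]


# Made-up logits over a three-token vocabulary.
base = [2.0, 0.0, -1.0]      # prediction without any thought
thought = [0.0, 3.0, -1.0]   # prediction after a thought

for w in (0.0, 0.5, 1.0):
    probs = softmax(mix_logits(base, thought, w))
    print(w, [round(p, 3) for p in probs])
```

With a small weight the output stays close to the base model's distribution, which is one way to read the "too far from distribution" intuition above: the thought only moves the prediction as far as the mixing weight allows.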

That intuition was just, I couldn't get that when I was reading the paper yesterday. But maybe I'll read through it again and again. And I mean, thanks to your suggestion on the pause before the paper, that was something I was missing. It's obvious. I feel like he just read this paper and he was like, I also want this too, but I'll do the star version.

Yeah. So. I see. I see. So I mean, it is nice. It produced some really nice charts where you can extend the thinking ahead and you see that accuracy improves. So having a nice tunable hyperparameter for thinking ahead, you know, lets you tune up your accuracy, which is kind of cool.

Obviously the cost will increase. He did this on Mistral 7B with OpenWebMath and C4. So people were asking, why isn't everyone doing this? Well, the improvement isn't that much. It's like 10% on CQA, 5% on GSM8K. Cool? I don't know if I'm supposed to be impressed by that.

Okay. So here's base Mistral 7B, here are some examples. I got to move on to VSTAR. Base Mistral 7B takes this question from GSM8K: Janet's ducks lay 16 eggs, minus three, minus four. So that's nine. And then she sells the remainder at $2 per egg. How much in dollars does she make?

So base Mistral 7B gives the wrong answer, because it's supposed to take nine times two and give us 18. Instead it gives us 12 times two, the 12 is hallucinated, and it gives us 24. Whereas QSTAR breaks it out step by step with all these reasoning chains and gives us the correct answer at the end.
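For reference, the arithmetic in that example works out as follows. This is just the calculation described above written as Python; the variable names are mine, not from the paper.

```python
eggs_laid = 16
first_deduction = 3    # "minus three"
second_deduction = 4   # "minus four"
price_per_egg = 2      # dollars

remaining = eggs_laid - first_deduction - second_deduction
revenue = remaining * price_per_egg
print(remaining, revenue)  # 9 18

# The wrong answer starts from a remainder of 12 that appears nowhere
# in the problem, so the final multiplication is fine but the input is not.
hallucinated_remaining = 12
wrong_revenue = hallucinated_remaining * price_per_egg
print(wrong_revenue)  # 24
```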

So there are a lot of examples of this where QSTAR, trained with a lot more thinking ahead, maybe reduces hallucination, and that is the entire source of advantage for QSTAR. Okay. I think the lift is not that impressive. I think the ability to deploy in production is not that impressive.

Sorry, I'm jumping around, but I'm trying to look for the numbers. Yeah. I think with the performance numbers in the end, the juice is not worth the squeeze, as a famous member of the paper club has said.

It's kind of theoretically cool, but you know, we probably need something better than this. So VSTAR, a February paper done by a Mila PhD student, unrelated to the other guy, takes STAR in a different direction, which I like a lot. So STAR, again, takes the same criticism: it's not gaining enough information from the incorrect solutions.

So it's potentially neglecting valuable information. So VSTAR utilizes both correct and incorrect solutions generated during the self-improvement process to train a verifier using DPO that judges correctness of model-generated solutions. You could even call this an LLM as a judge. The verifier is used at inference time to select one solution among many candidate solutions.

The diagram is shit, so I made it small. Okay. But the improvements are so much more worthwhile than QSTAR that I think we should just look at VSTAR instead. So VSTAR is demonstrated to improve across a lot of these angles. I wish I had a better diagram.

I didn't have time to make a mock diagram of this. But basically it's training a verifier to judge between model outputs. Let me just, where's the paper? I'm pretty bullish on training verifiers as a part of your process and then using that as an artifact to run during production.

Where can I show this? So they were comparing against all these other guys, VSTAR versus all the others. And I really like that you can just apply it versus majority voting and basically destroy it. So VSTAR is able to scale with the number of candidates K, because in the training process you're already training the verifier, and the verifier you can use separately from the raw model itself, which is kind of cool.

So you can basically pick out the right answers. So let me show some examples. Oh yeah, this is what I was trying to show. Not really great, but here's an example of the kind of verifier they trained, right? So here's a GSM8K question and here are two candidate answers that were generated by STAR.

VSTAR adds a verifier on top of that, which is basically trained to detect the right answer from these, right? So it would get a verifier score and it would do something like this, where you take a question and you pick a solution among a list of candidate solutions.

Majority voting would pick the most common solution rather than the most correct solution. And VSTAR uses the verifier trained with DPO to pick the most correct solution. I hope I explained that correctly. So I guess the question is, how do they even train VSTAR? What do you mean?
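The difference between the two selection rules can be sketched in a few lines of Python. This is only an illustration: the candidate answers and the verifier scores below are invented, and the `score` field stands in for whatever the trained verifier would output.

```python
from collections import Counter


def majority_vote(candidates):
    """Pick the final answer that appears most often among the candidates."""
    counts = Counter(c["answer"] for c in candidates)
    return counts.most_common(1)[0][0]


def verifier_pick(candidates):
    """Pick the answer of the single candidate the verifier scores highest."""
    return max(candidates, key=lambda c: c["score"])["answer"]


# Hypothetical candidates for the egg question; scores are made up.
candidates = [
    {"answer": "24", "score": 0.10},
    {"answer": "24", "score": 0.15},
    {"answer": "18", "score": 0.90},
]

print(majority_vote(candidates))  # 24
print(verifier_pick(candidates))  # 18
```

When the common answer is the wrong one, majority voting has no way to recover, while a verifier only needs one correct candidate in the pool, which is why it can keep improving as the number of candidates grows.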

What was the input label? The input label is correct solution and wrong solution, but how would they distinguish? Here's the algorithm. At least label correctness, I see. A little bit more readable than the other guy. I'll be going into this. Yeah. So I like this just because you want to maximize information from your data set, your information gain.

And it was obvious that the original STAR people threw away a lot of the incorrect stuff. And the original STAR people's insight was to do rationalizations. But here we're actually training that into a verifier model, which can get better over time, but then also be deployed.
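One way to picture how the labeled solutions turn into verifier training data is the sketch below. It assumes correctness is decided by comparing each solution's final answer with the gold answer, and that every correct solution is paired with every incorrect one for the same question. The exact pairing scheme in the paper may differ, and the `prompt` / `chosen` / `rejected` field names are just a common convention for preference data, not something taken from the paper.

```python
from itertools import product


def build_preference_pairs(question, solutions, gold_answer):
    """Pair each correct solution (chosen) with each incorrect one (rejected)."""
    correct = [s for s in solutions if s["answer"] == gold_answer]
    incorrect = [s for s in solutions if s["answer"] != gold_answer]
    return [
        {"prompt": question, "chosen": c["text"], "rejected": r["text"]}
        for c, r in product(correct, incorrect)
    ]


# Hypothetical sampled solutions for one question.
question = "16 eggs, minus 3, minus 4, remainder sold at $2 each. Revenue?"
solutions = [
    {"text": "16 - 3 - 4 = 9; 9 * 2 = 18", "answer": "18"},
    {"text": "12 * 2 = 24", "answer": "24"},
    {"text": "16 * 2 = 32", "answer": "32"},
]

pairs = build_preference_pairs(question, solutions, gold_answer="18")
for pair in pairs:
    print(pair["chosen"], ">", pair["rejected"])
```

Nothing is thrown away here: the incorrect solutions become the rejected side of each pair instead of being discarded.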

And I like that idea that we can deploy this and not have to rely on the hope that we can fine-tune it into the base model. This ties in. But now you have two models, right? Now you have two models, yeah. This ties in a lot with Let's Verify Step by Step from OpenAI.

So this is where I see us going from STAR, QSTAR into VSTAR. Yeah. Does the verifier verify the entire thing? Because Let's Verify Step by Step verifies sub-parts, right? Yeah. It's a process reward model. It verifies parts along the way. Yeah. So the one in VSTAR is process reward or is full?

Is VSTAR process reward? That's a question. I don't think so. I don't think it's process reward. But I think we could use it to create process reward as well. But this is a relatively simple paper. It only talks about the label correctness at the end.

So this is an outcome reward model, not process reward. Thank you. So I don't know. There's a body of literature that is coming together and, you know, it probably will use some combination of all these things. I think we ran out of time.

Okay. I'll stop here. Stop the recording here and we can open up for other Q and A's or whatever. Oops. No, no, no, no, no, no, no.
