[Paper Club] 🍓 On Reasoning: Q-STaR and Friends!
I was surprised to see that it's actually basically just one guy's work. 00:00:23.200 |
And this is his website if you want to go see it. 00:00:35.160 |
Mostly just because they have star in the name. 00:00:36.960 |
But also they seem to be most mentioned by the people that were throwing around all these 00:00:46.080 |
I believe, Eugene Yan, you've already covered this before. 00:00:49.520 |
But I think this is the most foundational and oldest. 00:00:54.720 |
And then I like V-STaR second, and Q-STaR the least. 00:00:59.880 |
So the general idea of STAR is that we have this bootstrapping cycle of creating a rationale 00:01:10.320 |
So when a question is asked of a language model, it is trained to think of a rationale first. 00:01:19.560 |
So it's basically a form of chain of thought: before you just spout the answer, you reason through a rationale. 00:01:26.960 |
So it's very related to the chain of thought literature. 00:01:33.840 |
And the interesting thing is that they establish a positive loop where 00:01:39.440 |
the rationales -- you know, they generate a bunch of candidate rationales, basically. 00:01:44.160 |
And the rationales that lead to correct answers are viewed to be better rationales than the ones that don't. 00:01:51.480 |
And that leads to a fine-tune of the language model. 00:01:56.320 |
There's a second loop, where if the rationale leads to a wrong answer, it can be rationalized with a hint. 00:02:03.040 |
So these two words are pretty similar, rationale and rationalization. 00:02:07.360 |
They're two different words as far as the paper is concerned. 00:02:10.840 |
And the rationalization, if it does lead to a correct answer, then it also gets fed back into the fine-tuning set. 00:02:15.340 |
And that signal from the wrongly answered questions is captured a little bit. 00:02:19.360 |
We'll see later that there are actually ways to do this better than the original STaR paper did. 00:02:27.360 |
Propose a bootstrapping mechanism to iteratively generate a rationale dataset from a few initial 00:02:32.120 |
examples with rationales without needing to check new rationales' correctness. 00:02:37.160 |
These kinds of ideas, where you can bootstrap from a small set of data and you don't need to check every new rationale's correctness, are really powerful. 00:02:46.240 |
Because that enables you to scale pretty massively. 00:02:49.840 |
We complement rationale generation with rationalization, where a model is tasked with justifying an 00:02:56.720 |
answer and fine-tuned as if it had come up with the rationale without any hint. 00:03:05.280 |
Because when I was doing the livestream, people were asking me, what is this? 00:03:09.120 |
Is rationalization the same thing as rationales? 00:03:17.400 |
So it's very common for language models to get kind of stuck in cycles where they just keep failing the same problems and get no new training signal. 00:03:29.240 |
So to overcome this issue, they propose rationalization. 00:03:32.640 |
For each problem that the model fails to answer correctly, we generate a new rationale by providing the model with the correct answer as a hint. 00:03:45.600 |
And here we're doing only the positive fine-tuning. 00:03:48.480 |
The next thing we do is we give a hint by giving it the answer and then backward-rationalizing 00:03:53.280 |
what rationale would have led to the right answer, and then throwing that into the training data as well. 00:04:01.720 |
Rationalization accelerates and improves the bootstrapping process. 00:04:03.960 |
So we have a little chart here showing STaR without rationalization on an addition 00:04:14.120 |
problem of n digits, and showing that with rationalization, it actually trains much faster. 00:04:30.920 |
I copied this out mostly because this is basically very nice pseudocode. 00:04:34.440 |
I don't like the way it's presented, but I think for formalism, this makes sense to include. 00:04:42.580 |
I don't really have any other comments on this, apart from -- I think when I originally 00:04:51.640 |
read the paper, when it was presented like this, first we had the 00:04:56.840 |
positive bootstrapping, then we had the negative rationalization, and so on. 00:05:03.720 |
It seemed like the first loop is we'll fine-tune on correct answers, and the second loop is we'll fine-tune on the rationalized wrong answers. 00:05:13.400 |
But the algorithm that they present in pseudocode does everything in one loop. 00:05:18.680 |
So there's one code path: you generate rationales, maybe you rationalize, you filter for correctness, and you fine-tune. 00:05:30.160 |
So they're performing both loops in one pass, which seems very efficient. 00:05:37.960 |
It's not how exactly I would do it, but this is probably more efficient. 00:05:52.480 |
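To make the "both loops in one pass" point concrete, here is a minimal Python sketch of how I read the STaR outer loop; generate_rationale and finetune are hypothetical placeholders for the prompting and training steps, not the paper's actual code.

```python
# Minimal sketch of the STaR outer loop as described above.
def star(base_model, dataset, generate_rationale, finetune, n_iterations=10):
    model = base_model
    for _ in range(n_iterations):
        finetune_set = []
        for question, answer in dataset:
            # First path: try to reason to the answer with no hint.
            rationale, predicted = generate_rationale(model, question, hint=None)
            if predicted == answer:
                finetune_set.append((question, rationale, answer))
                continue
            # Second path (rationalization): give the answer as a hint and keep
            # the backward-justified rationale only if it now reaches the answer.
            rationale, predicted = generate_rationale(model, question, hint=answer)
            if predicted == answer:
                finetune_set.append((question, rationale, answer))
        # Each iteration fine-tunes from the original base model on the filtered
        # data (as in the paper), rather than from the last checkpoint.
        model = finetune(base_model, finetune_set)
    return model
```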
They had a few datasets: arithmetic, CommonsenseQA, and GSM8K. 00:06:03.160 |
I don't think -- I wish the descriptions were better. 00:06:06.640 |
They referenced Jason Wei's paper on chain of thought, but they didn't actually show 00:06:12.360 |
the kind of few-shot examples that they were doing. 00:06:15.840 |
So in some sense, this is a badly written paper in that it is going to be very hard 00:06:19.640 |
to reproduce, because they did not show the full sample of what they actually prompted with. 00:06:27.160 |
They did have some transparency on some of the questions. 00:06:29.680 |
So this is fun, because there's some audience participation points. 00:06:37.160 |
I'll just ask you the question without showing the answer. 00:06:41.280 |
So here's the task that demonstrates the value of Q* and the subtle nuances of what you're 00:06:51.880 |
being asked to do that you might take for granted. 00:06:55.720 |
So here's a multiple-choice question, and then three candidate rationales for the answer. 00:07:00.480 |
So I want -- whoever's listening or watching -- oh, Jimmy, you're here. 00:07:11.240 |
So on their hike, they brought a filtering straw. 00:07:16.160 |
Answer choices -- make sick, doctor, water, stream, and mouth. 00:07:23.880 |
If they brought a filtering straw, they were worried about germs in the water. 00:07:28.480 |
Now the question is how to teach the machine to reason their way into understanding that 00:07:33.240 |
water is the right choice, and you don't want to just give the right answer -- you don't 00:07:38.200 |
want to just get the right answer with the wrong rationale. 00:07:40.760 |
You want to have the right rationale as well. 00:07:42.840 |
So for example, answer A, the answer must be something that can filter out germs. 00:07:47.640 |
Filtering straws are used to filter out germs, therefore the answer is filtering straw C. 00:07:53.280 |
It's like they got the right answer, which is C. C is the right answer, but it's the 00:07:59.800 |
Because when you say therefore the answer is filtering straw, that is the wrong reason. 00:08:05.520 |
B, the answer must be something that would cause someone to bring a filtering straw on a hike. 00:08:10.840 |
Filtering straws are used to filter water, therefore the answer is C. This is a good rationale. 00:08:19.200 |
And the third: a straw is something you use to drink water, therefore the answer is water, C, right? 00:08:27.720 |
So Eric says -- Cosmin, yeah, the slides are in the Discord. 00:08:37.400 |
I think the more intuitive thing -- this is a very classic NLP entailment thing. 00:08:43.880 |
The more likely answer is water, because there's no reason to assume stream, which is more specific than water. 00:08:48.720 |
So when in doubt, pick the more generally probable answer. 00:08:55.140 |
So anyway, the human-rated task is what is the best actual answer that you want if you're 00:09:01.160 |
trying to train for a dataset that has a reasoning trace, right? 00:09:09.440 |
Because A3 jumps straight to the answer, right, and A1 has the right answer but 00:09:16.080 |
for the wrong reasons, or like it has flawed reasoning. 00:09:21.220 |
So this STaR paper actually used human raters to choose between answers that were correct. 00:09:28.200 |
And I think that's an unusual way to use human raters. 00:09:31.960 |
Usually you use human raters to choose correct answers from wrong answers. 00:09:38.200 |
But here the human raters are being asked to evaluate reasoning and the quality of reasoning. 00:09:47.000 |
Eugene Chow says, "Is there an issue of using all three?" 00:09:53.320 |
In the context of training, why can't we just train all three? 00:09:56.960 |
As long as they are good enough, is there a problem? 00:10:03.560 |
Because A1 has faulty reasoning, A3 has not enough reasoning, right? 00:10:13.880 |
Like, the logical flow of A2 cannot be faulted -- super basic, probably too verbose, but you cannot fault it. 00:10:22.560 |
So if you're going to fine-tune a reasoning model, this A2 is the kind of dataset that you want. 00:10:28.880 |
And the STaR authors employed human raters to find this. 00:10:40.400 |
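For concreteness, here is a hedged sketch of the kind of training record this implies, built from the filtering-straw example and the A2-style rationale above; the field names and prompt template are my own illustration, not the paper's exact format.

```python
# Illustrative only: one rationale-augmented training record in the spirit of A2.
record = {
    "question": (
        "On their hike they brought a filtering straw; they were worried "
        "about germs in the what? "
        "Choices: (a) make sick (b) doctor (c) water (d) stream (e) mouth"
    ),
    "rationale": (
        "The answer must be something that would cause someone to bring a "
        "filtering straw on a hike. Filtering straws are used to filter water."
    ),
    "answer": "(c) water",
}

# Concatenate question, rationale, and answer into one fine-tuning string.
training_text = (
    f"Q: {record['question']}\n"
    f"A: {record['rationale']} Therefore, the answer is {record['answer']}."
)
print(training_text)
```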
I'll give you a little bit more on the details here. 00:10:44.240 |
But when the human raters were given this, they were all randomized. 00:10:50.260 |
So imagine just going through and picking A1, A2, A3, A1, A2, A3, A1, A2, A3, for question after question. 00:11:04.880 |
The human always would have fun making up questions for the AI overlords. 00:11:11.840 |
Answer choices, do enjoy, eat cake, enjoy living, get laid, enjoyable. 00:11:17.120 |
I think it's worthwhile going through these kinds of questions to, you know -- key catchphrase -- look at your data. 00:11:23.960 |
When you look at your data, you really understand how inane and mind-numbing, but also nuanced, this work is. 00:11:35.920 |
The human always had fun making up questions for the AI overlords. 00:11:39.200 |
And this is also meta because it's making up questions for the AI overlords. 00:11:50.240 |
You know, I said that D is the answer that I wish would happen if I, you know, if I answer 00:11:57.040 |
But actually, the answer is E. That's actually, I think, the most grammatically correct answer. 00:12:01.920 |
So this is actually a grammar question rather than anything. 00:12:08.080 |
So again, A1, the answer must be something that human would enjoy doing, blah, blah, 00:12:15.720 |
So it's like they all got the right answer, but they all took different paths to get there, 00:12:24.880 |
And B, the answer must be something that the human found enjoyable, making enjoyable, blah, 00:12:29.680 |
So you can see, like, this is very laborious. 00:12:33.840 |
And at the end of this whole thing, the big reveal is where each of these chains of thought came from. 00:12:47.240 |
So in the presented results -- 00:12:50.200 |
the paper has a few dozen of these, by the way -- 00:12:53.120 |
the first answer is always GPT-J, un-fine-tuned. 00:12:56.640 |
The last answer is the human entry, a human answering it. 00:13:01.040 |
And you can see, like, the human reasoning is-- humans are really bad at showing rationales. 00:13:06.120 |
They always just jump straight to the answer, which is really funny. 00:13:08.720 |
I'll show you a counterexample at the end where this is the opposite. 00:13:13.480 |
And then B was the STaR answer: STaR, generally fine-tuned to show reasoning for any task. 00:13:26.800 |
I'll go for one more example in the reasoning domain. 00:13:37.220 |
I have to stress, like, we are so early in teaching language models to reason. 00:13:43.000 |
Like this is the pinnacle of reasoning, right? 00:13:45.380 |
This is, like, not fucking reasoning at all as far as my IQ test is concerned. 00:13:49.260 |
But as far as GPT-J 6B is concerned, they're good on this, I guess. 00:13:58.340 |
OK, here's a math reasoning using natural language, right? 00:14:02.300 |
Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. 00:14:07.060 |
How many clips did Natalia sell altogether in April and May? 00:14:14.860 |
Please feel free to think out loud and jump on the mic. 00:14:30.140 |
Betty is saving money for a new wallet, which costs $100. 00:14:33.740 |
Betty only has half the money that she needs. 00:14:35.740 |
Her parents decided to give her $15 for that purpose, and her grandparents twice as much as her parents. 00:14:39.900 |
How much more money does Betty need to buy the wallet? 00:14:42.580 |
Can someone do a live chain of thought while solving this? 00:14:53.900 |
She only has half of the money she needs, so she has $50. 00:14:58.620 |
Her parents decided to give her $15 for that purpose. 00:15:05.100 |
Her grandparents give her twice as much, so twice as much as $15 is $30. 00:15:11.860 |
And 65 plus 30 is 95, so she needs five more dollars to reach $100 for the wallet. 00:15:21.420 |
And both of you gave a little bit of chain of thought. 00:15:24.580 |
Would you be surprised that a language model can do that? 00:15:26.620 |
So these are the generated answers of Star, showing what Eugene said and what Eric said. 00:15:42.420 |
I actually would not have put it on the screen. 00:15:46.700 |
>> That's pretty good, actually, to do this level of arithmetic. 00:15:52.780 |
It is also just that flexible natural language understanding of whatever you throw at it. 00:16:01.820 |
And it was really just the fine-tuned rationalization, step-by-step, thinking step-by-step, but not 00:16:09.500 |
just the lazy kind of thinking step-by-step, like, what do I need to know first? 00:16:14.740 |
How to combine those pieces of information, how to copy, how to calculate? 00:16:20.500 |
And the paper has quite a few dozen examples of this going on. 00:16:30.300 |
>> They have an N here on how many times it's fine-tuned. 00:16:33.980 |
They don't specify the N. I looked for that number. 00:16:38.820 |
I think, like, max is, like, 20 or 30 iterations, not a ton of iterations. 00:16:43.660 |
But the N is a flexible hyperparameter that they used. 00:16:50.860 |
Their problems go all the way up to, like, eight steps, which is pretty impressive. 00:16:55.320 |
So that lets you generate a chart of human versus machine reasoning. 00:17:02.140 |
So Eugene and Eric, in answering those two questions, they produced two steps. 00:17:08.940 |
And then we can also compare against the model-produced number of steps, and there's a correlation 00:17:15.340 |
of 53% -- 53 to 57% -- in the sense that when you give it a GSM8K question, STaR tends to take a similar number of steps as a human would. 00:17:27.300 |
Obviously it could be a lot better than 53, but it's surprising that there's a general correlation at all. 00:17:35.540 |
And I think, basically, this is a way of understanding reasoning in a structured format, which I thought was valuable. 00:17:48.700 |
Because once you can do something like this, where I can say I can give you a measurably 00:17:53.940 |
harder problem -- because an eight-step problem is measurably harder than a two-step problem. 00:17:58.740 |
If I can give you a measurably harder problem and I can roughly grade the calculator on 00:18:03.220 |
its ability to get there, then I can improve it. 00:18:10.540 |
There is -- so I think I'm about to finish this paper. 00:18:14.620 |
There are some cases where the dataset was actually bad -- where GSM8K was bad. 00:18:23.700 |
Here's an example of a really stupidly confusing question. 00:18:28.380 |
A van is delivering 180 bottles of drinks to a neighborhood. 00:18:31.500 |
Each bottle contains either cider or beer or a mixture of two. 00:18:34.180 |
Out of the 180 bottles, 40 contain only cider, 80 contain only beer, and the rest a mixture of the two. 00:18:39.460 |
If the delivery man gives half the number of each bottle of drink to the first house, how many bottles does the first house get? 00:18:46.080 |
So there's this whole -- there's a lot of random context. 00:18:50.020 |
But actually it's asking you to divide 180 by 2. 00:19:01.900 |
The human didn't read all the way to the end. 00:19:10.480 |
>> So this is good out-of-domain generalization in the sense that -- we all know datasets have bad data in them. 00:19:19.900 |
It's good out-of-domain correction of bad data inside of the dataset. 00:19:27.960 |
Like, STaR understood the question better than the human, which is really, really interesting. 00:19:33.120 |
So I think the relevance here for O1 is that if we were to generate reasoning traces, we 00:19:40.780 |
would have to do work like this, where the rationale would have to be exposed into step-by-step 00:19:49.180 |
thinking and we would have to grade it in a way that makes sense, right? 00:20:02.500 |
I have a question about the -- you said it was a dataset of 50,000, did I hear that right 00:20:08.460 |
It was literally, like, 1+1, 2+2, 3+3, 11+11, 111+111, you know, stuff like that. 00:20:20.340 |
It's even shitty to call it synthetic data because there's no LLMs involved. 00:20:27.140 |
>> What's great about math is it's very cheap to generate. 00:20:39.100 |
But the math stuff lets us do things like this with a very high degree of certainty. 00:20:46.100 |
We had a little debate in the Discord yesterday about n-digit summation. 00:20:49.100 |
So this is about adding one-digit numbers together, it learns it very quickly. 00:20:53.560 |
Adding two-digit numbers together, it takes a bit more time. 00:20:55.860 |
Adding five-digit takes the longest time, but it eventually learns it as well. 00:21:01.060 |
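Since the point is that this arithmetic data is trivially cheap to generate, here's a minimal sketch of how such an n-digit addition dataset could be produced; this is my guess at the recipe, not the authors' actual script.

```python
import random

def make_addition_examples(n_digits, n_examples, seed=0):
    """Generate toy n-digit addition problems as (input, target) strings."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n_examples):
        lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
        a, b = rng.randint(lo, hi), rng.randint(lo, hi)
        examples.append({"input": f"{a} + {b}", "target": str(a + b)})
    return examples

# e.g. a curriculum from 1-digit up to 5-digit sums
dataset = [ex for d in range(1, 6) for ex in make_addition_examples(d, 1000, seed=d)]
```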
Do you feel that right now STaR, and I think maybe V-STaR, goes along this same line? 00:21:07.100 |
Do you feel like this is only limited, at least in the fine-tuning stage, to math and code? 00:21:13.900 |
I know they use Q&A, but it's limited to cases where we have the correct answer. 00:21:18.180 |
So when it's math and code, you can infinitely generate as many as you want. 00:21:28.300 |
>> No, it's like, do you feel like this can generalize beyond solely math and code? 00:21:35.980 |
>> Like, maybe there are some things like subjective answers? 00:21:44.380 |
This is the one thing whereby the answer is very objective. 00:21:50.300 |
I said, maybe I wonder if this could generalize to relevance. 00:21:53.620 |
Maybe if I'm searching for an iPhone, should I be showing an iPhone or showing an iPhone 00:21:57.780 |
case or the new iPhone or showing an iPhone that's not... 00:22:02.580 |
It's like things like that where it's a little bit more general. 00:22:15.620 |
>> But this is very impressive for a 2022 paper. 00:22:19.620 |
>> Because it is obviously, you know, something that we need to do. 00:22:26.100 |
I think people have been talking in the chat. 00:22:31.820 |
Would be super interested in seeing how the rationale traces end up in an analysis like this, 00:22:36.700 |
and the particular inner function that empowers using a language-defined world model for rationalization. 00:22:53.220 |
It's a small sub 1B model that's able to do math operations. 00:23:00.220 |
It is along the same idea of like we generate the basic math operations and we just train on them. 00:23:07.340 |
And it works with this giant humongous chain of thought for the multiplication and summation. 00:23:13.860 |
The crazy thing that we did was that we inverted the numbers during the calculation and it 00:23:20.820 |
>> So instead of like, you know, like 1,200 is 1,200, it does a chain of thought of 0,021. 00:23:28.820 |
>> Oh, I mean, that I can back-rationalize, because when we do addition, we do it starting from the last digit, carrying as we go. 00:23:38.460 |
And the model generates from the first digit to the last digit, so reversing the number lines the generation order up with the carries. 00:23:50.860 |
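A hedged sketch of the digit-reversal trick being described: write the running sum least-significant-digit first, so the carry is already known at the moment each digit is generated. The exact trace format here is an assumption on my part, not the actual training data.

```python
def reversed_addition_trace(a: int, b: int) -> str:
    """Emit an addition chain of thought with digits reversed (least
    significant digit first), so each digit written depends only on digits
    already written plus the running carry."""
    da, db = str(a)[::-1], str(b)[::-1]
    carry, out_digits, steps = 0, [], []
    for i in range(max(len(da), len(db))):
        x = int(da[i]) if i < len(da) else 0
        y = int(db[i]) if i < len(db) else 0
        total = x + y + carry
        steps.append(f"{x} + {y} + carry {carry} = {total}: write {total % 10}")
        out_digits.append(str(total % 10))
        carry = total // 10
    if carry:
        out_digits.append(str(carry))
    reversed_answer = "".join(out_digits)
    steps.append(f"reversed result: {reversed_answer}")
    steps.append(f"final answer: {reversed_answer[::-1]}")
    return "\n".join(steps)

print(reversed_addition_trace(1200, 345))  # the 1,200-style example from above
```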
>> But I think the highlight is how small the model is. 00:23:57.340 |
So does it only do random events or does it only do... 00:24:04.060 |
>> I think like here, you're basically just testing universal function approximation. 00:24:33.060 |
>> So if everything works at this scale, right, it's just like adding that layer, then you're 00:24:39.060 |
able to do the rationalization, then everything will be able to chain up. 00:24:42.900 |
That is my point of view on like why even 8Bs can actually do decent chain of thought 00:24:53.060 |
Someone is talking about medical domain, I guess, you know, remains to be seen. 00:25:00.700 |
But I think you can basically just take the methodology from here about the rating the 00:25:05.020 |
answers and all that, and feeding it into the fine tune, and it will probably work. 00:25:11.660 |
Aditya says postfix notation plus inversion sounds smart for reasoning traces. 00:25:18.660 |
Andrei says, can this kind of fine-tune be applied to any general model, like Llama 3? 00:25:24.100 |
This is a method that is general, and I would be very surprised if they did not use this 00:25:35.620 |
I am about to shit horribly on this paper because it was a waste of time. 00:25:41.620 |
So this is the same author as the STaR paper, this guy, two years after the fact. 00:25:50.060 |
And basically he is trying to extend it; he is criticizing himself and saying, like, we inferred 00:25:56.860 |
rationales and learned from those that lead to a correct answer. 00:26:01.580 |
Ideally, a language model could instead learn to infer unstated rationales in arbitrary text. 00:26:06.220 |
So he starts to have this idea of internal rationales and external rationales. 00:26:10.860 |
All of these rationales are externalized in the sense that you can see the chain of thought 00:26:19.060 |
Now he basically read the pause token paper and wanted to apply it to STaR. 00:26:26.660 |
So: we present Quiet-STaR, a generalization of STaR in which LMs learn to generate rationales at each token to explain future text. 00:26:33.820 |
This is ColBERT-level crazy: like, why don't you just do it at every single token? 00:26:42.380 |
So the problem with, obviously, generating chain of thought at each token is that it's expensive. 00:26:50.300 |
The LM also doesn't initially know how to use those internal thoughts. 00:26:55.980 |
And you also need to look ahead a little bit more than just the next token. 00:27:03.260 |
I don't super 1,000% get it, but they have a really nice graphic, which I'm going to show you. 00:27:12.380 |
So given a text with token, token, token, token, token, he's just really trying to show 00:27:18.740 |
that you predict, you know, in a very sort of speculative-decoding way, in parallel. 00:27:28.900 |
You generate a bunch of parallel thoughts, and maybe a bunch of tokens in each of these thoughts. 00:27:35.840 |
You end the thought process, and then you cut out whatever doesn't work, and then you continue predicting the next tokens. 00:27:50.580 |
But this is all he has, and he has a bit more of predictions here. 00:27:56.920 |
So basically, it's, like, you have -- let me see if I can say this in a better sentence. 00:28:06.600 |
He has this kind of graphic, which is no help at all. 00:28:09.960 |
But basically, it's kind of, like, token, token, token, and you can generate thoughts 00:28:14.440 |
for each token, but then also continue the other tokens in process as well. 00:28:19.880 |
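Here's my rough mental model of the "generate thoughts at every token" idea, as a hedged sketch: after each position, branch off a short thought wrapped in start/end-of-thought markers (done serially here for clarity, in parallel in the paper), and use the thought only to help predict the tokens that follow. The sample_thought callable and the marker strings are placeholders, not the paper's implementation.

```python
def quiet_star_think(tokens, sample_thought, thought_len=8):
    """Sketch of the 'think' stage: for each prefix of the sequence, generate a
    short thought. sample_thought(prefix, max_len) is a hypothetical callable
    returning a list of thought tokens conditioned on that prefix."""
    START, END = "<start_of_thought>", "<end_of_thought>"
    branches = []
    for i in range(1, len(tokens) + 1):
        prefix = tokens[:i]
        thought = sample_thought(prefix + [START], thought_len)
        # The thought is hidden from the final output; it only conditions the
        # prediction of the tokens that come after position i.
        branches.append(prefix + [START] + thought + [END])
    return branches
```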
I feel like there's something to this idea, but it is very hard to communicate. 00:28:25.880 |
But I think the way that I would explain it is you have to read the pause token paper 00:28:37.920 |
You have to read this one, then go over here. 00:28:42.960 |
So let me get there first, before I go too hard on this. 00:28:50.280 |
I do like the way that he introduces his papers, though, because it really helps you focus, 00:28:55.840 |
like, on what he thinks is novel about his paper, right? 00:28:58.840 |
So with the star paper, he offered these four things. 00:29:04.920 |
For me, I personally highlighted the first two, because the last two are just evals. 00:29:16.120 |
I think I would highlight maybe the first three, maybe the first four as relevant. 00:29:25.480 |
And honestly, I think three is the main idea. 00:29:32.240 |
So he's basically saying Quiet-STaR generalizes STaR to learn reasoning from diverse unstructured text data. 00:29:38.120 |
To our knowledge, this is the first work explicitly training LMs to reason generally from text, rather than on curated reasoning tasks. 00:29:49.280 |
Whoever asked about generalizing beyond math. 00:29:52.800 |
Everyone wants to generalize beyond math, right? 00:29:58.160 |
I'm not sure it is scalable or usable, but it's a way. 00:30:02.880 |
Second one is parallel sampling, which is the graphic I already showed you. 00:30:06.640 |
It's a parallelization technique, nothing more. 00:30:12.800 |
Fourth, we apply a mixing head to mix the next-token prediction from the thought into the base next-token prediction. 00:30:20.840 |
So it's a little bit of like, I don't know, it's like speculative chain of thought decoding 00:30:26.720 |
Fifth, non-myopic loss, including multiple tokens ahead. 00:30:31.160 |
So there's a look-ahead effect, which we'll cover later. 00:30:37.600 |
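A hedged sketch of how I picture the mixing head and the non-myopic loss: the model produces a next-token distribution with and without the thought, a small learned head interpolates between them, and the loss averages over several true tokens ahead instead of only the immediate next token. Shapes and names here are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MixingHead(nn.Module):
    """Toy mixing head: a shallow MLP that outputs a weight in (0, 1) for
    interpolating the base and post-thought next-token distributions."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, h_base, h_thought, logits_base, logits_thought):
        # Weight depends on the hidden states with and without the thought.
        w = self.mlp(torch.cat([h_base, h_thought], dim=-1))   # (batch, 1)
        probs_base = logits_base.softmax(dim=-1)                # (batch, vocab)
        probs_thought = logits_thought.softmax(dim=-1)
        return (1 - w) * probs_base + w * probs_thought

def non_myopic_nll(mixed_probs_ahead, true_tokens_ahead):
    """Toy stand-in for the multi-token objective: average the negative
    log-likelihood over several true tokens ahead, not just the next one."""
    nll = torch.tensor(0.0)
    for probs, tok in zip(mixed_probs_ahead, true_tokens_ahead):
        nll = nll - torch.log(probs[..., tok] + 1e-9).mean()
    return nll / len(true_tokens_ahead)
```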
So is everyone familiar with the pause-before-you-think paper? 00:30:47.980 |
Without two, you pay a high cost, with two, it comes for free. 00:30:53.920 |
I think it only comes for free if you have the full GPU and it doesn't already take up the full GPU, right? 00:31:00.440 |
Because I understand autoregressive sampling, like you're batching everything, right? 00:31:03.880 |
RJ, is that what-- you're saying the efficiency comes from batching. 00:31:10.240 |
My take is sort of that-- and I agree it was hard to understand, but I thought that the 00:31:14.960 |
attention mechanism, where you already have these tokens in the-- I don't think it's even 00:31:21.160 |
related to the batching, per se, because you already have the whole sequence for that one 00:31:27.680 |
set of-- or for that one piece of text, and you're just masking some of it out with the 00:31:34.520 |
So I think they're using the portions in the unused attention mask to-- yeah, this diagram. 00:31:43.080 |
So I think they're taking advantage of the areas that would have been masked out by the 00:31:47.040 |
attention mask and doing inference in that region, and therefore it comes for free. 00:32:06.460 |
In normal inference economics, I don't know -- if I were an API provider, would I be able to do this? 00:32:15.540 |
I guess I would want to dig in, but my question would be, if this is actually the case, what 00:32:23.140 |
I'm saying is, then why isn't everyone doing this, right? 00:32:25.420 |
Because it seems like a little bit of complexity and a lot of benefit, or maybe not huge, but still. 00:32:40.420 |
But I thought Claude is doing the thinking token thing? 00:32:49.220 |
I don't think anyone actually believes that it is actually doing thinking tokens. 00:33:03.620 |
Does anyone have the Claude Artifacts prompt? 00:33:20.060 |
They're prompted to include thinking tags, and then the UI manually removes them. 00:33:34.420 |
I think it's just a question of tokenization, because if the open-bracket thinking close-bracket 00:33:39.300 |
is a token itself, then it is a thinking token. 00:33:47.700 |
But thinking tokens in the context of this research, both Q* and the actual thinking 00:33:57.180 |
tokens paper treat their thinking tokens very differently. 00:34:04.700 |
I understand, I don't know, that it may be a distinction without a difference. 00:34:15.060 |
I'll try researching on this, because I have a separate chain of thought on this. 00:34:22.260 |
I would also recommend people, there's a Backspace token paper. 00:34:25.380 |
The Backspace token paper is not called the Backspace token paper. 00:34:28.180 |
But anyway, there was kind of one observation in the wild on GPT-4o, where Yann LeCun always 00:34:35.420 |
says autoregressive LLMs, once they start on the wrong path, they will continue down that path. 00:34:42.660 |
And ChatGPT actually, for the first time, displayed its ability to self-correct in the middle of a generation. 00:34:56.260 |
So this generated a little bit of discussion as well. 00:35:00.400 |
We don't know if it's able to Backspace or search. 00:35:07.740 |
There was an interview with John Schulman, where he was mentioning models correcting themselves. 00:35:15.540 |
I think it was on the Dwarkesh podcast, and he mentioned that, if I remember correctly, they just put 00:35:23.420 |
like 30 examples, I think, in pre-training, where you would have some discussion, like 00:35:30.260 |
I'm solving this problem, oh, I'm wrong, and then fixing. 00:35:33.900 |
And he said that just having a few of these examples actually allowed the model to learn 00:35:40.720 |
this ability to kind of, OK, double-check its reasoning. 00:35:46.380 |
There's also some theoretical work by a researcher at Meta looking into the internals of transformers. 00:35:58.980 |
And he says the models kind of get in some state where they kind of know they're wrong. 00:36:04.380 |
But if you don't train them to kind of explicitly say they're wrong, then they keep going. 00:36:10.160 |
So he mentioned -- I'll post it in Discord -- the whole talk by Zeyuan is really good. 00:36:19.220 |
But this particular part, it seems like you can fix some of the facts that are wrong by 00:36:23.580 |
just allowing them to say, hey, I'm changing my mind. 00:36:37.880 |
But if you find it, drop it in the Discord, I guess. 00:36:42.860 |
I got to plow along because we've got nine minutes left. 00:36:53.640 |
So there are three stages-- think, talk, and learn. 00:36:58.980 |
I tried to turn this really dense algorithm thing into better pseudocode that is a bit more readable. 00:37:13.500 |
And then there's a bit of the mixing and updating the model params with the teacher forcing. 00:37:27.100 |
I don't think it's super insightful, apart from -- you don't want to rely on the thought-conditioned prediction alone. 00:37:35.140 |
Do you want to introduce a mixing layer in the middle? 00:37:39.460 |
When they say mixing with and without thoughts, what does it mean? 00:37:42.040 |
Is it to mix in the original output without the thought and also mix in the output with the thought? 00:37:53.060 |
I think the demonstrated thoughts are very, very small. 00:37:58.300 |
This is not a truncation just for the graphic. 00:38:10.420 |
So the amount of testing, the amount of thinking ahead is actually not a lot. 00:38:24.500 |
They had a bit in a paper talking about how they're trying to do thinking ahead on every 00:38:31.980 |
single token, but then they also recognize that it's probably not useful to think on 00:38:35.380 |
every single token and you probably want to trim it. 00:38:37.380 |
So I think the MLP is just for filtering out. 00:38:43.460 |
What is curious to me is why they had to mix it. 00:38:45.620 |
Why not just use the one that has a thought alone? 00:38:48.000 |
I guess maybe the one that has a thought alone is too far from distribution and therefore 00:38:52.580 |
You know, something similar to how you would do with KL Divergence. 00:38:55.820 |
That intuition was just, I couldn't get that when I was reading the paper yesterday. 00:39:00.180 |
But maybe I'll read through it again and again. 00:39:03.060 |
And I mean, thanks to your suggestion on the pause-before-you-think paper, that was something. 00:39:09.700 |
I feel like he just read this paper and he was like, I also want this too, but I'll do it at every token. 00:39:18.020 |
It produced some really nice charts where you can extend the thinking ahead and you get better results. 00:39:26.700 |
So having a nice tunable hyperparameter for thinking ahead, you know, lets you tune it. 00:39:37.820 |
He did this on Mistral 7B with OpenWebMath and C4. 00:39:42.980 |
So for the people asking why isn't everyone doing this: 00:39:53.740 |
I don't know if I'm supposed to be impressed by the results. 00:40:02.620 |
So here, base Mistral 7B -- here are some examples. 00:40:08.460 |
Base Mistral 7B takes this question from GSM8K: Janet's ducks lay 16 eggs, she eats three and bakes with four, 00:40:16.660 |
and then she sells the remainder at $2 per egg. 00:40:19.720 |
So base Mistral 7B gives the wrong answer, because she's supposed to take nine times two. 00:40:27.400 |
Instead it gives us 12 times two -- the 12 is hallucinated -- which gives us 24. 00:40:31.840 |
Whereas Quiet-STaR breaks it out step by step with all these reasoning chains and gives us the right answer. 00:40:40.160 |
So there's a lot of examples of this, where Quiet-STaR, trained with a lot more 00:40:45.960 |
thinking ahead, maybe reduces hallucination, and that is the entire source of the advantage. 00:40:55.120 |
Yeah, so I think the lift is not that impressive. 00:40:59.960 |
I think the ability to deploy this in production is not that impressive. 00:41:05.160 |
So, sorry, I'm jumping around, but I'm trying to look for the numbers. 00:41:12.760 |
I think with the performance numbers in the end, the juice 00:41:15.940 |
is not worth the squeeze, as a famous member of the paper club has said. 00:41:22.160 |
Like, it's kind of theoretically cool, but, you know, we probably need something better. 00:41:28.320 |
So V-STaR, a February paper done by a Mila PhD student, unrelated to the other guy, takes 00:41:39.200 |
STaR in a different direction, which I like a lot. 00:41:43.960 |
So V-STaR makes the same criticism: that STaR is not 00:41:49.220 |
gaining enough information from the incorrect solutions. 00:41:53.420 |
So it's potentially neglecting valuable information. 00:41:57.180 |
So VSTAR utilizes both correct and incorrect solutions generated during the self-improvement 00:42:01.660 |
process to train a verifier using DPO that judges correctness of model-generated solutions. 00:42:11.180 |
The verifier is used at inference time to select one solution among many candidate solutions. 00:42:20.500 |
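A minimal sketch of what using such a verifier at inference could look like, versus the majority-voting baseline it gets compared against; generate, verifier_score, and extract_answer are hypothetical callables, not the paper's code.

```python
from collections import Counter

def best_of_k(question, generate, verifier_score, k=16):
    """Pick one of k sampled solutions using a trained verifier."""
    candidates = [generate(question) for _ in range(k)]
    return max(candidates, key=lambda sol: verifier_score(question, sol))

def majority_vote(question, generate, extract_answer, k=16):
    """Baseline: pick the most common final answer, ignoring the reasoning."""
    candidates = [generate(question) for _ in range(k)]
    answers = Counter(extract_answer(sol) for sol in candidates)
    top_answer, _ = answers.most_common(1)[0]
    # Return any candidate whose final answer matches the majority answer.
    return next(sol for sol in candidates if extract_answer(sol) == top_answer)
```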
But the improvements are so much more worthwhile than Quiet-STaR's that I think it's worth covering. 00:42:32.960 |
So V-STaR is demonstrated to improve across a lot of these angles. 00:42:43.920 |
I didn't have time to like make a mock diagram of this. 00:42:47.200 |
But basically, training a verifier to judge between model-generated solutions -- let me just show you. 00:42:56.480 |
I'm pretty bullish on training verifiers as a part of your process and then 00:43:01.040 |
using that as an artifact to run during production. 00:43:13.280 |
So they were comparing against all these other baselines, V-STaR versus all the others. 00:43:22.300 |
And I really like that you can just kind of apply it versus majority voting and basically beat it. 00:43:30.240 |
So V-STaR is able to scale with the number of K candidates, because 00:43:38.960 |
in the training process you're already training the verifier, and the verifier 00:43:43.080 |
you can use separately from the raw model itself, which is kind of cool. 00:43:48.720 |
So you can basically pick out the right answers. 00:43:51.600 |
Oh yeah, this is what I was trying to show. 00:43:55.600 |
The figure in this paper is not really great, but here's an example of the kind of verifier at work. 00:44:02.800 |
So here's a GSM8K question, and here's two candidate answers that were generated by STaR. 00:44:09.040 |
V-STaR adds a verifier on top of that, which is basically trained to detect the right answer from the wrong one. 00:44:15.040 |
So each candidate would get a verifier score, and it would do something 00:44:19.640 |
like this, where you'll take a question, you'll have a solution, 00:44:26.340 |
and you'll pick among a list of candidate solutions. 00:44:29.400 |
Majority voting would pick the most common solution rather than the most correct one. 00:44:35.160 |
And V-STaR uses the DPO-trained verifier to pick the most correct solution. 00:44:42.800 |
So I guess the question is, how do they even train VSTAR? 00:44:51.840 |
The input label is correct solution and wrong solution, but how would they distinguish? 00:44:56.440 |
Here's the algorithm -- at least they label correctness, I see. 00:45:00.760 |
A little bit more readable than the other guy. 00:45:07.640 |
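On the training question: my reading is that the generated solutions are checked against the known gold answer, and correct versus incorrect solutions for the same question are then paired up as DPO preference data for the verifier. A rough sketch of that pairing step, with hypothetical names:

```python
def build_dpo_pairs(question, solutions, gold_answer, extract_answer, max_pairs=8):
    """Pair correct vs. incorrect generated solutions for the same question,
    to be used as DPO preference data for the verifier."""
    correct = [s for s in solutions if extract_answer(s) == gold_answer]
    incorrect = [s for s in solutions if extract_answer(s) != gold_answer]
    pairs = []
    for good, bad in zip(correct, incorrect):
        pairs.append({"prompt": question, "chosen": good, "rejected": bad})
        if len(pairs) >= max_pairs:
            break
    return pairs
```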
So I like this just because you want to maximize information from your generations. 00:45:15.760 |
And it was obvious that the original STaR people threw away a lot of the incorrect stuff. 00:45:19.540 |
And like, the original star people's insight was to do rationalizations. 00:45:23.160 |
But here, we're actually using that; we're training that into a verifier model, 00:45:28.440 |
which can get better over time, but then also be deployed. 00:45:33.080 |
And I like that idea, that we can deploy this and not have to rely on the hope 00:45:39.560 |
that we can fine-tune it into the base model. 00:45:48.080 |
This ties in a lot with the, let's verify step-by-step from OpenAI. 00:45:52.560 |
So this is where I see us going: from STaR and Quiet-STaR into V-STaR. 00:46:02.760 |
Does the verifier verify the entire thing, or -- because Let's Verify Step by Step verifies each step? 00:46:16.680 |
So the one in V-STaR, is it process reward or is it full outcome? 00:46:29.600 |
But I think we could use it to, to create process reward as well. 00:46:37.440 |
It only talks about the label correctness at the end. 00:46:42.880 |
So this is, this is outcome reward model, not process reward. 00:46:51.680 |
Like, there's a body of literature that is coming together, and it's like, you know, 00:46:54.960 |
oh, it probably will use some combination of all these things. 00:47:04.600 |
I'll stop the recording here and we can open up for other Q&As or whatever.