All right, I'll kick it off. So last week we picked this as the hottest paper at the time. I will say there haven't been as many new hot papers lately, which might be an argument for us to do a test-of-time paper club. We can't always do "the paper that was published last week, we'll cover it this week." Anyway, this one is a survey paper. Basically, the authors read 400 papers on post-training and produced this nice chart. Everything here is nice, but the LLM layer that classifies each paper into a single category of why it outperforms seems a little forced. I don't know, though; it's a pretty graphic. Okay. So I'm going to start off discussing the paper by not discussing the paper, by actually discussing this other blog post that came out recently. It's mostly about why there's so much focus on post-training now, and this paper is basically diving into reasoning post-training. I think stages one and two are close to solved problems, and now the only thing left to do there is scale up. Stage three is still at the point where techniques actually matter, and there are still reasonable upsets when you discover new techniques. Okay. I also thought this was kind of interesting and relevant. I think it'd be really cool to be able to add a column and browse into each paper and extract stuff; this is basically how Elicit should work. What else can I comment on here?
I didn't have much else to comment on. Oh, sorry, I just opened the chat. Eugene says it's surprising how much the fundamentals still hold up. Yeah. Oh yeah, I was just voicing support for Vibu's test-of-time idea. I mean, we talk about it, but then we always do the hot paper because it's hot. Maybe a second paper, since we're doing test of time. I'm putting together a curriculum; I'll share details on this. Yes, I think the curriculum will be the bulk of the value. I think swyx has a list, right? You created that in December. Yeah. But he may want to make different choices.
Yeah. Okay. I'll let you know if anything in the chat is important, but that's it for now. So I think this chart is kind of old-school-ish, in the sense that you have RL methods and then you have fine-tuning, and those mostly track the major hyperparameters of what's going on for each of these models. But I can't shake the feeling that it's almost going to be out of date. Anyway, this is the first major explainer chart from the paper. We're basically just going chart by chart, but you can also use this as a menu: do I know this terminology? If you don't know something, you can double-click on it; they probably have a one-to-two-paragraph explanation of it, which is kind of cool. So, about this paper: I like it when papers break down, side by side, things that were introduced independently.
For example, if you read the PPO paper, you get one view of the world. If you read the DPO paper, you get another view of the world. It takes people like DeepSeek, and papers like this one, to put them side by side in the same framework, so you can compare vertically across them and see how they differ. I do think they should probably use more of a color gradient or something to indicate which component is an LLM, or which parts are models, because as it stands it doesn't really capture the distinctions between DPO and RLHF, and it doesn't really capture the distinctions between GRPO and all the other methods either. Let me see if there are any comments here. Okay, Robert has an interesting comment: "I don't know if the pre- and post-training distinction is super clear in the context of these reasoning models.
One thing they mentioned is using test-time scaling to generate high-quality labels to do supervised training on." Yeah, all of that is post-training, right? So you can go back here: pre-training, you should only be doing on very, very large corpora, and your only job is to get the perplexity down, the loss down, on predicting the next word. Is that the boundary between pre-training and post-training? Yeah, there's a famous Karpathy chart. This is a window into how I index things online. There we go; see how quick my access is to anything Karpathy-related. So pre-training is trillions of words and thousands of GPUs, right? Anything beyond that is in the hundreds of thousands in terms of dataset size. Not even comparable; you're three orders of magnitude off. Yeah. But in terms of the intention, it's that you have a small set of data that you really want to index on.
I wouldn't even use those words. The in-vogue phrase people use now is behavior elicitation: pre-training loads the model with a ton of different behaviors, and the fine-tuning or post-training process elicits the behaviors you want to select for. The reason I have a negative reaction to what you just said is that you're overly focusing on the data, and that's the opposite of what you want. You actually want to generalize beyond your data. It's a slightly subtle point: you're trying to provide data that vaguely points toward the behavior you want, and you want the model to take the hint and keep going. Whereas what you just said was more like "focus on this dataset," and you probably want to generalize out of domain, at least as far as the foundation model labs are concerned.
Okay, I might need to think about that a little more, but I think another way to think about it, very simply... Let me cut you off for a second, I just thought of another example. These labs will happily add more math datasets in the hope that it will generalize to broader reasoning. Yes, obviously the model will do better on math because you gave it more math data, but the idea is that you're eliciting behavior with more reasoning in it; you just happen to use math to do that. Yeah. I think what I was going to say is that maybe where I was getting tripped up is the distinction between the intention and the technology. If you're just doing supervised fine-tuning, the actual implementation details of what you're doing are maybe not that different from the pre-training stage, and then there are the reinforcement learning techniques, which are a bit different. Yes, but what you're describing there is the intention. So that's kind of separate from the actual implementation of what's happening. The intention will show through in the choices you make. But yeah, sure, absolutely. I was just reacting to the data focus you had. Okay.
Tyler Cross, what does he say? "Reading the V3 technical report... probably both in the costs." Yeah. Okay, this part of the DeepSeek stuff is almost certainly a psyop. They know that reporting final-run-only numbers is meaningless; anyone who knows anything knows it's meaningless. That's why you always report total GPU hours. Yeah. Is this the last run? Actually, no, this is all of it, right? Huh? It's the last run. Oh, it's the last run. Okay. Well, they say in the text that it's the last run. I have a video where I back into these numbers in a spreadsheet. Oh, can you drop that video? Yeah, sure, I can drop the spreadsheet too. Cool. Now the fun question is Dario. Dario, on his blog, leaked some alpha on what Claude Sonnet cost, and it's not clear.
Where is it? Where's the quote? It's not clear. Ah, there we go: "Claude 3.5 Sonnet is a mid-sized model that cost a few $10Ms to train." Which one is this? What number does this refer to? It's unclear. Also, "a few tens of millions" is quite a range, you know? I mean, five, right? Plus or minus two; it's in the ballpark. That's not the final run. Sorry to interrupt, I don't think that's the final run. So that's tens of millions, and there's a screenshot from a talk given by somebody at OpenAI, I think it was actually Noam Brown, saying ChatGPT was something like 50 million. Let me find you that spreadsheet, but yeah, that's more likely closer to the final number. Dario's number and the DeepSeek paper's number are what I call apples and oranges; they're not measuring the same thing. That's my interpretation of it. Totally fair. Honestly, when you're in the business of raising $4 billion, this is just, whatever.
Okay. Yeah, feel free to drop the spreadsheet when you find it, but I'm just going to keep going. Is there anything interesting in these slides? I would highlight that GRPO has obviously gotten a lot of attention. When I originally read the DeepSeekMath paper, it seemed like they were saying these methods are all roughly equivalent. If you read DeepSeekMath, where they introduced GRPO, let me show you the exact chart... they had some ablations of different things and were like, okay, it's good. No, actually, I don't think this is it; this isn't what I'm looking for. There are some other ablations; maybe it's in the R1 paper, maybe it's in the V3 paper, I don't know. No, that's not the one. Sorry, let me just... I'm pretty confident I can find it quickly, which is why I'm persisting instead of giving up. Why did they release this on GitHub? This is super annoying. Okay, I can't find it here, but there are ablations showing that, at the time, they considered GRPO roughly equivalent to the other methods.
It's just the more practical way to do it. For those who are members of the Discord, you can just search for GRPO; I've been keeping track of a lot of GRPO explainers over there. I definitely think it has emerged as the thing people focus on for improving training efficiency. Right. I think the main knock on PPO is that there are models everywhere, and there's a lot of data you have to chew through to get anywhere useful, whereas DeepSeek is all about efficiency.
Okay, let's catch up on the chat. Right, Ishan has dropped his spreadsheet. Actually, Ishan, is there anything you want to talk through? I don't know if there's much. Go to the second tab: I just use the number of FLOPs to back into the estimate and come up with an MFU, the model FLOP utilization you see on line 16, of about 17%, which is a little low but within reason. That's the only part where you take the number of FLOPs and use it to calculate things, because the whole thing is very fixed. It's not even a calculator, it's an equation, right? One giant equation, so you know how much compute was needed in theory. And that number actually looks pretty low given all their efficiency work. That's the only thing that seems a little odd, but it is a reasonable number compared to other models I could find proof points for. Okay, I can walk through the whole thing if you want, but I don't want to sidetrack us, because this is all still pre-training and this is supposed to be a post-training conversation. I'm happy to go through the details, or you can just watch the video where I do it. Yeah, I'd say let's just send people to the video.
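For anyone who wants to sanity-check the idea without opening the spreadsheet, here's a minimal sketch of backing into MFU from FLOPs and GPU-hours. All of the numbers below are illustrative placeholders, not the report's or the spreadsheet's actual inputs, so don't expect it to reproduce the ~17% figure.

```python
# Sketch of the MFU back-of-envelope. Placeholder numbers only.

def training_flops(n_params: float, n_tokens: float) -> float:
    """Standard ~6*N*D approximation for dense training FLOPs."""
    return 6.0 * n_params * n_tokens

def mfu(model_flops: float, gpu_hours: float, peak_flops_per_gpu: float) -> float:
    """Model FLOP utilization = useful FLOPs / theoretically available FLOPs."""
    available = gpu_hours * 3600.0 * peak_flops_per_gpu
    return model_flops / available

if __name__ == "__main__":
    N = 40e9          # activated parameters (placeholder)
    D = 15e12         # training tokens (placeholder)
    GPU_HOURS = 3e6   # reported GPU-hours (placeholder)
    PEAK = 1e15       # assumed peak FLOP/s per GPU (placeholder)
    print(f"MFU ~ {mfu(training_flops(N, D), GPU_HOURS, PEAK):.1%}")
```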
Yeah, we can move on, but always appreciate the knowledge drops. I have a slight comment regarding the FLOPs thing: is it because it's the H800, so it's memory-bound rather than compute-bound on the DeepSeek side? Maybe we sidebar that in the chat; I don't want to tangent too far. Okay, sidebars are fine. Like I said, I don't think I'll take the whole hour. This was weird: I spotted this in the chart where they start talking about an "episodic memory agent," whereas everything else is a reasonable technique you'd recognize from the literature. I don't know what the hell an episodic memory agent is. I didn't really look it up. It's only mentioned in Figure 2, so I feel like they just made it up and stuck it in there as a hallucination test, just to see if you're awake. Anyway. Okay, this is a further deep dive into the optimization techniques. I definitely think some high-level familiarity with these is good; this chart is probably the top three you should be familiar with. I don't know what else I can say about that.
The overall lesson I've taken away is that data efficiency is king, because alignment data is rare and human annotation is very expensive, so you have to get as much mileage out of it as you can. For example, the prevailing view at the time of the DPO paper, maybe half a year or a year ago, was that it was probably a bit worse than PPO, but because it got rid of one of the models inside the PPO toolchain, it's that much more efficient to train, and therefore you could get a lot out of it.
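To make the "one fewer model" point concrete, here's a minimal sketch of the DPO loss; the tensor names are mine, but the formula is the standard one from the DPO paper. There's no separate reward model, just the policy and a frozen reference.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Inputs are summed log-probs of whole responses, each of shape [batch]."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # Reward the policy for preferring the chosen response more strongly
    # than the frozen reference model does.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```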
Okay, I'm just going to keep going unless people have comments on the GRPO stuff. Anyway, all of this is prelude to R1. Obviously there have been a lot of explainers of R1 in the last couple of months. I'd say this paper has a pretty decent one-page explanation of the RL stages, as far as summarizing the R1 paper goes. I would definitely go through that, although we already did an R1 session, so we don't have to. But I think this is something Ilya has mentioned, in the sense that there's always this cycle in ML where you tack on increasingly complicated things; to some extent this is the opposite. You start with something complicated and look for ways to simplify, and then you find a self-play, pure-RL approach and that wins. So we have an article on what Ilya saw, and I remember this bit being very interesting.
This is a direct quote from an email Ilya wrote back in 2016, when they were founding OpenAI. He said: we realized self-play in multi-agent environments is magical. If you place agents into an environment, no matter how smart they are, the environment will provide them with the exact right level of challenge, which can be faced only by outsmarting the competition, which is again a clone of itself. For example, if you have a group of children, they'll find each other's company suitably challenging. If you have a collection of superintelligences, they will also compete with each other. The point is, self-play is king, and I think that's what R1-Zero, and obviously AlphaZero, also discovered. Self-play lets us get something out of nothing: the rules of a competitive game can be simple, but the best strategy for playing the game can be immensely complex. I think that last sentence is really, really insightful. The rules of chess are simple, but the strategy for playing chess can get really complicated. That's a powerful insight, and it makes me think a lot of this machinery wouldn't have to exist if we just had the right reward function or whatever. I definitely think R1-Zero found a little bit of that. Obviously that's not everything we need for R1, but a lot of people will say R1-Zero is more profound than R1 itself, because R1-Zero just did the pure thing, which is really cool. Okay, there seems to be a lot going on in the chat.
Eugene is talking about some training stuff. Okay, we're doing post-training, let's keep going. Obligatory LoRA and fine-tuning discussion. I don't know what this is doing here. These are important parts of the ecosystem to know, but this part of the paper felt very much like they felt they had to include it, and it's not really post-training; this is just PEFT. And then they put that alongside DeepSpeed, which is a completely different part of the stack. This one just felt weird; I don't know what this section of the paper was trying to do. Yeah, why do you say PEFT and LoRA aren't part of post-training? If we look at Apple Intelligence, a bunch of LoRA adapters on a base model, they use that as a form of post-training, right? It's not post-training in the sense that none of these frontier labs use PEFT.
Apple does. Yeah, I don't know. Microsoft too... well, Microsoft is not a frontier lab, but they did LoRA. Okay, so you picked the two worst model families in the whole set. Benchmarks are all you need: for Phi, you train on all the benchmarks to look like you know what you're doing. For Apple, you train on all the benchmarks because that's what your users care about, and then you ship three years later. Yeah, that's great, good strategy. All right, I'm going to keep going, because I didn't super like this part of the paper. Test-time scaling.
Very cool, very hot. I do think this is probably a decent set of things to know. I actually never really thought about sequential revision being a thing, but this is what I think about when I work in ChatGPT Canvas, so I'm default pro this. This one looks completely out of place, though: compute-optimal scaling is Chinchilla. I don't know what it's doing in here. Maybe they mean something else, but the term is overloaded enough, it's basically the title of the Chinchilla paper, that they should probably use something else if that's what they're trying to say. I'll keep going on terms. Chain of thought is obviously something we've known works. There was this period of a year, a year and a half, when everyone was doing skeleton-of-thought, tree-of-thought, graph-of-thought, whatever-of-thought.
It was all just minor iterations on "let's search over thoughts in different ways." For those keen to catch up, we actually did a session with the tree-of-thoughts author, who now works at OpenAI on reasoning. I would say chain of thought is worth including here, even though it looks so low-tech, so basic, because the original chain of thought was just "let's think step by step." That was the prompt, and it improved performance a lot. But in some ways, the fact that we can distill DeepSeek R1 into Qwen, into Llama, and still do enormously well, that's basically just sparkling chain of thought. I don't see how it's fundamentally any different, which is kind of disturbing: why did it take us this long to get here?
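For folks newer to this, a rough sketch of what "distilling R1 into Qwen or Llama" means in practice: you collect the teacher's reasoning traces and fine-tune the student on them as ordinary SFT targets. The tag format below is an assumption for illustration, not DeepSeek's actual template.

```python
# Build an SFT example from a teacher model's reasoning trace.
# The <think> tag format is an assumption for illustration.

def make_sft_example(question: str, reasoning: str, answer: str) -> dict:
    target = f"<think>\n{reasoning}\n</think>\n{answer}"
    return {"prompt": question, "completion": target}

example = make_sft_example(
    question="What is 17 * 24?",
    reasoning="17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
    answer="408",
)
```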
I don't know if people have any comments on this chain-of-thought distillation part. Nope, silence. Okay. Sorry, what was the question? We've known chain of thought was good for a long time; how come it took two or three years from the original chain-of-thought paper to roundabout discover R1 and so on, and to finally use quote-unquote thinking models to distill into non-thinking models and have them suddenly be a lot better? I think there are two factors. One was, I think there was a Facebook paper very early on saying you can distill chain of thought, and everyone kind of knew that. But the other factor was that when chain of thought started being used, the general consensus was that it's not a good way to do evals. There were even debates about it early on, within EleutherAI and a few other eval communities, that if we allowed chain-of-thought evals to happen, people would just go to longer and longer chains, and then how do you compare one model against another? Obviously all of that went out the window when OpenAI said this is the new standard. It's really that moment that changed the perception, because it was like, oh great, this model can only be evaluated through chain of thought, we don't have a choice now. I think that was the forcing function. For a long time people didn't want to do it, at least within the eval community, and that removed the incentive. I suspect that was the case.
Oh, that's fascinating. I was actually going to do a blog post on this, but I have a Twitter thread I'll drop in where I've been collecting posts from folks on why they think this emerged right now. The one Eugene mentioned is really fascinating: we just weren't evaluating it this way. Some people have said the models weren't powerful enough. There's one guy who, for two years, roughly every six months, tried doing something like R1-Zero and it never worked for him, and he's got a whole thread on what he thinks. There's another one from Ross Taylor, who was Meta's reasoning lead, on why he thinks it works now. And another one, I forget her name, Vogel is the handle, I'm sure a bunch of you recognize her, with the theory that there's basically been chain-of-thought data leakage into the training of today's models, so now there's a hint in there that we can elicit or summon out of the model that wasn't there before. Those are the three most popular theories I've seen so far. I've been collecting them in this thread; if folks see others, let me know and I'll add them. But I've wondered a lot about how we missed this. It just seems crazy. Yeah, it does. That's likely to be all we get; we just wonder. All right. Sorry, can I ask one thing? From my understanding, these techniques are really reliant on verifiable stuff, like math problems.
Yeah, as in it's scored by an oracle to do RL. So I wonder if it's just that the data has accumulated over time to the point where the RL works where it didn't before. I think it's a mix of things, right? Outside of just the RL, you also need a good base model; the base model has to reach some level of intelligence to do this stuff at all. Up until about two years ago we were optimizing for, sorry, for train-time compute, but now, starting from base Llama models is when people even had the ability to start replicating some of this stuff. So first we had that shift: models are now over-trained to be good base models. From that, they reach a base level of intelligence where self-play and all these tricks start to work. Even with the Claude Pokemon thing: 3.5 Sonnet would not play Pokemon at all, it would get nowhere; 3.7 can. There are these base-level frontiers, the next stepping stones. So as much as people tried this sort of RL, the base models just weren't that good a while ago. Yeah, 3.0 Sonnet sucked. Okay.
Question from Bing: is there any info on how much CoT data was used to train DeepSeek and its distilled models? Ha. I think they stated a rough number, and that's all we get. Yeah, I wouldn't search for that; I mean, you can search the R1 paper. Suffice to say, I don't think we have much info on the dataset. They said a little about how many samples, something like 800,000 samples, how much SFT versus RL they did. I don't remember the numbers off the top of my head, but they do say the SFT-to-RL split and so on; they just don't tell you how many pre-training tokens. They give a little: 600K reasoning, 200K general, which is what someone in chat is saying, 800K samples total. I think that's for the distilled models, but it's something. So again, in the Discord we have the R1 thread, and there's a better chart with all the datasets visualized. Let me see, where is it? Wow, we have a lot. Nope. There it is. Okay, here. Here are all the datasets: 600K here, 200K here, but then there was also this SFT stage. Oh, I guess it was just combined. Okay, got it. There you go. Okay. Yeah, the thing I was going to highlight was verifiers.
I guess the thing I was going to push back on Robert about is that we've always had the ability to do these verifiers. By the way, OpenAI calls them graders; they're going to launch this in GA by the World's Fair in June. And I think what Will from the New York summit is working on is basically the open-source version of whatever OpenAI has in-house. So I think it's important. But the other tricky thing is that we're very unsure how this generalizes. The hope is that you get better at coding and math and then suddenly you're just better at writing poems. I don't know. It's still a very open question how this actually improves reasoning beyond the domains it's specifically evaluating. Yeah, I don't know, but it's a way to generate more data for free. So why am I focusing on verifiers here? Because it's down here, and I would say this is the single hottest place in the post-training stack right now: everything else seems relatively scoped out, and now it's just literally a matter of doing this stuff.
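A minimal sketch of what one of these verifiable rewards (graders) looks like for math-style data, just to ground the idea; this is illustrative, not OpenAI's graders or any particular library:

```python
import re

def last_number(text: str) -> str | None:
    """Pull the final number out of a model completion."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else None

def reward(completion: str, ground_truth: str) -> float:
    """Binary verifiable reward: 1.0 if the extracted answer matches."""
    return 1.0 if last_number(completion) == ground_truth else 0.0
```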
Cool. I don't know if there are any other comments, but I'm basically about done. The paper ends with this really opaque classification of the 400 papers they surveyed, and everything is just going up and to the right because the whole field is going up and to the right, so it's a little hard to tell what's going on. I think it's a bit ironic that MCTS is the biggest spike, only for us to find that MCTS, according to DeepSeek, is not useful. And it sounds like the other frontier labs have agreed: yeah, we found that too, we just never reported it, but DeepSeek did. So that is the... oh, actually, no, sorry, I think I was thinking of process reward modeling. I mean, both of these were sort of pooh-poohed in the DeepSeek paper, and that's a bit sad for the state of research: you cannot use raw counts of published papers to estimate how rewarding a field is going to be to actually do research in.
Yeah, and I think that's the paper summarized. If you need a place to start and want to know which terms and papers to look up, this is a very good summary of the state of the art. It might actually be a better use of time to look at Sebastian's work, because he focuses on only about ten papers and gives a bit more explanation, but you will miss some stuff. This one does a much better job of being an actual survey paper than Sebastian's; they just have different goals. Cool. Akash says: I'm new to this, I want to learn how reasoning works and more about RL; should I first learn classic RL and then come to RL applied to LLMs, is that a good approach? Okay, it looks like people are discussing that in the chat. Yeah, I don't know of any other open questions people are interested in. I think this is a decent survey, but I'm also clearly not wowed; I do think it's a decent survey. So one question I have: for the test-time scaling approaches used in the leading-edge models, do they mostly just use more thinking tokens, or is search still a big component?
Zero search at inference. We're not sure how much search there is for generating the training data. So, okay, if you look here: this is a form of search, this is not search, but you can do whatever you want in the synthetic data generation process, including search and then rejection sampling over the search results, right over here. By the way, I think they have good definitions of all these things, so if you want to understand beam search, rejection sampling, and so on, you can do that here. But that is the synthetic data stage, and it does not apply at inference, for sure.
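As a sketch of what that synthetic-data-stage filtering looks like, here's rejection sampling in miniature; `generate` is a stand-in for whatever sampling call you use, and `reward` is a checker like the one sketched earlier:

```python
# Sample many candidates per prompt, keep only the ones the verifier
# accepts, and use the survivors as SFT data.

def rejection_sample(prompt: str, ground_truth: str, generate, reward,
                     n_samples: int = 16) -> list[str]:
    candidates = [generate(prompt) for _ in range(n_samples)]
    return [c for c in candidates if reward(c, ground_truth) == 1.0]
```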
We do not search during inference. I think a lot of people, right up to the release of o1 and maybe a little beyond, were like, surely Strawberry has to be Q*, and Q* has to involve some kind of search because of the Q-star algorithm. And that was a complete psyop. Yeah. So one question I asked in the chat that I was trying to figure out is whether there's a value function we get out of the generative model as part of GRPO, and I think the answer is no. So it's difficult to map that onto the AlphaZero, or sorry, AlphaGo paradigm, where you're using MCTS and sampling the trajectories you take based on the policy itself. But it sounds like that's not something that's actually needed in the final reasoning model, at least for inference. Yeah, I would definitely say that.
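That matches my understanding of GRPO; a minimal sketch of the idea, where the baseline is just the mean reward of a group of samples for the same prompt rather than a learned critic (names are mine):

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """group_rewards: shape [G], rewards for G completions of a single prompt.
    No value function: the advantage is just the group-normalized reward."""
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + eps)
```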
Eugene is talking about Atari and Q-learning. I mean, Q-learning, honestly, is one of those abstractions where you think, I don't know if this adds any value, and then you find that it does. And it's kind of a mind-fuck, because now nothing makes sense anymore; you just have to put things in black boxes and hope it works. Yeah. Maybe I'm old school, but I think learning about Q-learning and policy and value iteration really helped me understand what it was trying to learn and how discounted rewards matter. It's clearly a bit divorced from what we're doing with LLMs right now, like DPO and PPO and all that, but I thought it was really fun. Yeah. So I think you and I did this during the... Yes, yeah, the Georgia Tech thing; they covered this in OMSCS. I don't know. In some ways PPO is kind of a similar thing, with the KL constraint.
Yeah, I don't know if it's worth going through all of this; honestly, I don't know. Shabai in the chat asks about 1B or 0.5B models, whether small models alone can be useful; he's seen small language models achieve great results. Sure. The harder thing, though, is the 0.1B. Oh, actually, we have Eugene on the call. Eugene, what is Blink doing? Huh? He's just working on the 0.1B for RWKV-7, lots of experiments on it. Yeah. So isn't that more interesting than 1B?
0.1B is better, it's more interesting. No, it's fun. But the only reason I'm always lukewarm on it is that every time we launch a 0.1B and show that, hey, it can create funny and interesting text, it goes viral and everything's great. And then I see the usage numbers and no one uses it; everyone just says, okay, we'll use the 1.5B. This is the first time I've seen it. You mean there were previous attempts? Yeah, our previous 0.1Bs were coherent as well. Oh, okay. Well, this is the first time it's actually reached my attention; I just didn't know about the previous ones. Yeah, so I think the big shift is that this is the first 0.1B we did with thinking tokens. Right. Oh, there you go: thinking tokens are all you need, man. Yeah. If you look at the verifiers codebase, I was just browsing through it,
and it's kind of cool to see. Go look at the prompts for how you format these tokens. You can just do this: there's an XML parser with a reasoning field and an answer field, so all the data he generates has the reasoning wrapped in a reasoning field and the answer wrapped in an answer field. That's how he's doing his data generation, and it's kind of cool. I would never have resorted to an XML parser for this, I would have just created some text, but, you know, whatever you need to do. Funny that little angle brackets unlock so much reasoning.
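Roughly, the pattern looks like the sketch below; the actual verifiers repo has its own XMLParser, so treat this regex version as an illustration of the format, not the real implementation:

```python
import re

PATTERN = re.compile(
    r"<reasoning>(.*?)</reasoning>\s*<answer>(.*?)</answer>", re.DOTALL
)

def parse(completion: str) -> dict | None:
    """Split a completion into its reasoning and answer fields."""
    match = PATTERN.search(completion)
    if match is None:
        return None  # malformed output: no credit, or a format penalty
    return {"reasoning": match.group(1).strip(), "answer": match.group(2).strip()}
```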
Also, if people really like 0.1B models, there's an entire subdomain of more specialized use cases. We have a 0.1B Sudoku solver, and I think there's another 0.1B that's basically a Go player, for the game of Go. It's quite impressive how far you can push the reasoning, basically, on a small game. Hmm, that makes sense. Cool. Shabai asks about distills. Okay, this is a classic question; let me make sure people can see it on the recording. It's the classic question of: given the same amount of data, is it better to train a large model and distill down, or just train a small model for as long as possible? And the answer is: train a large model and distill down. We don't fully know why yet, but it has been demonstrated enough that it just seems to be true.
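For anyone who hasn't seen it, the textbook form of distillation is matching the student's token distribution to the teacher's; a minimal sketch is below. (In the R1 case the "distillation" was really just SFT on the teacher's generated traces, but the large-to-small direction is the same idea.)

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """KL(teacher || student) over the vocabulary, Hinton-style."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * (t * t)
```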
I think it's a little bit the PPO effect again: you train a model, and preference and reward models are some form of distillation anyway. So yeah. Okay. Wait, we have to interrupt this message for the doggo report.
Vibu, where is your doggo? Bye, everyone. Bye-bye.