
Let's try Sonnet 4.5, sorry, Haiku 4.5, the small model. How is it? Is it as good as this one? I think it's fast, which is a pretty big thing; speed matters a lot. Recently I've been trying Codex CLI a lot more, and frankly, it's very slow. Trying GPT-5 Pro, which has a noticeable performance difference, stuff is taking 20 minutes.
And it's very async in the CLI, but Claude Code has a hybrid mode. So, you know, planning with Opus or Sonnet 4.5 and then doing a lot of coding with a fast model, I'm pretty excited for it. But, you know, performance-wise on the numbers, it's pretty good. I find it like, where is this chart?
I find it interesting that, you know, GPT-5 High still doesn't compare to Haiku 4.5 on coding, on SWE-bench Verified. Like, the Codex-specific model, sure, but yeah, Anthropic is doing well with coding. Yeah, that is a really nice result, actually. So, it seems good. It seems fast. Slick tweeted this somewhere.
Basically, we did the "speed as AGI" take, but otherwise, I don't know, it's nice to have it out. Eugene Yan was asking about this the other day. He still has Haiku in production for speed and cost, and he's like, none of the open-source models fill the gap. They don't have tail-end reliability.
So, he was looking forward to this. Interesting to see people still use Haiku 3.5, you know? But, okay, I think I can stop sharing, unless anyone else has fun stuff. I guess there's a system card. It's ASL2, model page, docs. Oh, system card is loading. Nothing crazy here. But, I'm sure someone will dig into this, see what other fun stuff they did.
But, you know, not the paper for today. This week, we have three papers with a volunteer, so I'm going to pass it off. swyx is speaking right now. He is at a Cloudflare conference, so he's literally on stage at this minute. So, we're good to start whenever I've got the recording going.
You want to do a quick intro? Yeah. Just want to make sure my mic is working. Yep, you're good. Cool. And your screen's here, too. Awesome. So, yeah, I'm Rob. I moved to San Francisco about a year ago. Started with a startup, it died. I used to work at Google as a machine learning engineer, and I've just been kind of deep-diving some RLVR papers. Oh, I wanted to present three, but I realized I didn't understand them super well, so I slimmed it down to just the one that I really got nailed down.
So, I think it's quite a nice and elegant paper, and, yeah, I'm going to talk about that. And we have slides. Yeah, there we go. Very nice. Cool. Yeah, so feel free to interrupt me whenever. I'm happy to, you know, go slow, chat, whatever, answer questions.
So, first, I'm going to go through a basic high-level overview of reinforcement learning from verified rewards, because if you don't know it, not a lot of this is going to make sense. And then I'll dive into specifically the Darling paper from Meta. So, yeah, to set up RLVR for a sec, first, a quote from Rich Sutton: "An AI system can create and maintain knowledge only to the extent that it can verify the knowledge itself." This is an interesting take for Rich Sutton, who's not even really that much of an LLM guy, and I think he made this quote more than 20 years ago, but he, you know, saw a bit of the future.
RLVR was the key unlock that enabled OpenAI to have their o1 series, where the model, instead of just instantly spitting out an answer a token at a time, goes through this thinking process where it generates a lot of internal tokens and then gives a high-quality answer.
It recently enabled gold-medal performance in the collegiate programming contest, the ICPC, and in other math competitions, high-school-level math competitions, where I think it got almost top performance. And RLVR is a key tool for training agentic tool use. So, yeah, I don't know if there's a chat I can see, if anyone... There's some chat, not much yet. Someone was asking if you can share the slides, but I think once we go into the media feed, you can share them later.
Yeah, so this... oh, should I just share the link to the slides, or do you not see my slides themselves? No, no, we see them, we see them. Okay. Someone was just asking where you can download them later. Oh yeah, yeah, I just put it in... sorry, I put it in the main Latent Space Discord chat.
Perfect. If you guys don't know who Rich Sutton is, there's this essay, The Bitter Lesson, that everyone likes to cite; you should read it. Recently he came back on the Dwarkesh podcast, and he's been having some hot takes, like, these LLMs are not going to do AGI, they don't actively learn. There are some interesting takes in that one.
Yeah, and actually, if you look at that quote and think about it for a sec, it's saying the verifiers are important, but the deployed language models don't have verifiers, right? We use them during training, but at inference time the verifiers are gone. So Rich might still object to the current state of the art because of this: they only use verifiers during training, so at inference they might still be misleading themselves or misleading you. It's also an interesting question how deep verifiers go. We've got very fuzzy verifiers now too, right? All the hype is on rubrics as a better LLM-as-a-judge, and you're abstracting away a strict definition of verification. A lot of labs are trying to do reasoning without verification, asking how much of the verification we can strip away, and then you've also got this idea that you're only as good as what you can verify. So it's an interesting little divergence we're taking.

Yeah. So, well, what is a verifier? It's a function that, given a prompt and a response, tells you true, the response was correct. So it's very clean and binary. For some math problems where you just want an exact answer, you can force the model to say, "here's my answer," after it's done all this thinking. For code, if you just want to implement a function, you can run the unit tests for that function and see if it passes them all. You can do other clever things too: I think OpenAI has a benchmark called WebBench or something where they got interesting information from the web and generated true/false questions from that data. So they use the web as a way of generating complicated questions, throw away some details, and then encourage the LLM agent to use a web browser and try to figure it out. There are lots of different ways you can fit interesting problems into this verified-rewards domain.

This should also be kind of review, reinforcement learning versus pre-training. When you're pre-training, for any given prefix you know what the next token is, so you have a lot of dense reward, and it's not really hackable, you can't really cheat it. But since it's teacher-forced, the model isn't recursively generating from its own output, so it stays on the rails. It's efficient for ingesting large amounts of information, but unlike reinforcement learning, it's not learning from its own outputs or its own mistakes.

RLHF, which was very big and important about three years ago for unlocking usable chatbots, trains a separate reward model that doesn't say yes or no, this is absolutely good or absolutely bad, but tries to encode human preferences. So you still have a dense reward, and you can judge partial answers, but it's kind of easy to hack this reward; you can just look for exploits in the reward model. I think in the original RLHF paper, when they don't keep the model close to the original pre-trained model, it starts just spamming words like "amazing" and "beautiful" and "generous" and doesn't even generate coherent sentences; it just spams out these nice adjectives and the reward model likes it.
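To make that verifier idea concrete, here is a minimal sketch, my own illustration rather than code from any lab or from the paper, of the two kinds of binary verifiers described above: exact-match on a final math answer, and running unit tests against generated code. The "####" answer delimiter and the function names are assumptions for the example.

```python
# Minimal sketch of binary verifiers for RLVR (illustrative assumptions only).

import os
import subprocess
import tempfile


def verify_math(response: str, gold_answer: str) -> bool:
    """True iff the model's final answer line exactly matches the gold answer."""
    # Assume the model is prompted to finish with a line like "#### 42".
    for line in reversed(response.strip().splitlines()):
        if line.startswith("####"):
            return line.removeprefix("####").strip() == gold_answer.strip()
    return False  # no parseable final answer -> no reward


def verify_code(solution_code: str, test_code: str, timeout_s: int = 10) -> bool:
    """True iff the generated function passes all of its unit tests."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code + "\n\n" + test_code)
        path = f.name
    try:
        # Tests are plain asserts; a non-zero exit code means at least one failed.
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout_s)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)
```

Either function returns a single true/false signal per rollout, which is exactly the sparse, hard-to-hack reward the discussion below contrasts with a learned reward model.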
Reinforcement learning from verified rewards doesn't have this problem where you can hack the model, because you actually have to generate the specific token, or whatever sequence of tokens, that says you really nailed the problem. So you've got a clean reward signal, but now you only get feedback once you've finished your entire generation, so the reward signal is quite sparse. If you guys know Sholto Douglas, he works at Anthropic, he says he's scaling RL, he's one of the guys heavily scaling reinforcement learning in every domain. He's giving a talk this week, and one of his quotes is "RLHF is fake RL"; he had a little skit about how RLHF is fake RL.

Someone in the chat is asking if you can elaborate on reward hackability. I think it's pretty important to understand what that is in RL. Do you want to take it?

Yeah, and you can also take a shot at this, but when you're doing RLHF, you basically train a separate reward model. You collect, I don't know, a thousand or ten thousand prompts, get pairs of answers from your model, and send those pairs to humans. Humans will say, for this prompt I liked the one on the right better, for this one I liked the one on the left better. So you get this training set of pairs of responses to a given prompt, and then you learn a model that tries to encode what a human would prefer. So you have a model that can tell you, given two responses, which one is better. But you don't have tons and tons of data for this, and that reward model itself might have misaligned preferences or bugs or make mistakes. And as you train your main model to generate responses that the reward model likes, as you train it more and more, or let it drift further and further from the base model, you stop getting the real signal, where "response A is better than B" is something real humans would agree with, and you start picking up on whatever is weird or erroneous in that reward model. That's called reward hacking, and it's a big problem with using models to judge the reward.

Yeah, there was also this concept of RLAIF, where you have another model judge the outputs. And reward hacking isn't always "add random stuff and the reward goes up because the models disagree"; sometimes it's that little changes are hard to square with your high-level preference. Some examples: emojis. Some of the models got really over-excited with emojis, where basically anything you'd ask, it would respond with emojis, fire emojis, happy faces. It learned that through RL because that optimized the mathematical RL reward: okay, in some domains emojis are good, or em dashes, let's always add em dashes. Same with stuff like sycophancy: you want outputs that seem to agree with you, but then you take it too far, and now the model just agrees with you in every sense and doesn't push back. So there's clearly a trade-off to balance, and it's really hard to get that mixture right, but those are some of the nuanced, real examples of reward hacking. The model starts realizing, okay, you like how I chat, but I can get a little better reward by throwing some emojis in there, and that throws the whole thing off balance, so now you're asking it to explain a math formula and you've got a bunch of emojis. It can be subtle, and it can also be very, very obvious: you just start churning out longer responses and get more reward, because that's what the policy says, and you've found this trick, let me just add noise because long responses are good, or the opposite, let me be really short because short responses are good. Those are examples that are easier to understand, where you can see how in some sense it's a good reward signal, but then it throws the whole thing off. Reward hacking is where the model really fixates on "this is giving me a lot of reward, let me start doing this a lot more," but you might not want it.

Yeah, and another interesting problem with reward modeling is sycophancy, where people tend to like responses that agree with them, maybe even if they're not true. So this can slip into your reward model and then also slip into your chatbot. If you have vaguely equivalent prompts, but one leans on expecting the answer to go a certain way, models sometimes pick up on that too much. It's another important problem with RLHF and reward hacking.

Last year, at the end of April... actually, no, this year, what am I talking about, in April, OpenAI had to roll back a version of GPT-4o. They put out a good blog post, I just shared it. Basically they're like, yeah, we made a little change to GPT-4o and it's loving to agree with people, not what we want, we've got to roll it back. That was the first big lab saying, okay, here's a pressing issue that we're going to have to deal with.

Yeah, and I'm sure that's one where they screwed it up badly enough that they had to roll back a model, but I think it's something you always have to fight with when you're using these reward models. I'm going to kind of move on.

So about a year ago, DeepSeek released R3, or, I think it was R1, I forget the names, but it was a very high-quality open-source model, and they also released a paper where they described group relative policy optimization, GRPO, which has become, I think, a bedrock of work in the research world and the open-source AI world. What it did was basically replace that reward model with simple verifiers, so this only works in domains where you have verifiers. You're given a prompt, you generate a bunch of responses, you score them with the verifiers, and you say the reward is just some sum of your scores from the verifiers; then every token in a response that is good kind of shares that reward, and every token in responses that are bad gets penalized. So it's a very broad brush, but it's a clean signal.
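Here is a minimal sketch of that group-relative bookkeeping, under the simplifying assumption that the reward is just the binary verifier score. Real GRPO implementations also add a KL penalty, PPO-style clipping, and so on; this only shows the reward-to-advantage step.

```python
# Minimal sketch of GRPO's group-relative advantage (reward = binary verifier score).

import statistics


def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """For one prompt's group of rollouts, normalize each reward against the group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Example: 8 rollouts, 3 pass the verifier. Every token in a passing rollout
# shares its positive advantage; every token in a failing rollout is penalized.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
print(grpo_advantages(rewards))
```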
This, I think, allowed them to get quite close to the quality of OpenAI's thinking models at the time it was released, and it became the basis for a lot of additional work because it worked well. And then one interesting problem people found with these verified-reward, GRPO-trained models is that, though it helped improve the quality of the responses, especially if you're only going to take a single response, it actually hurt the diversity of the generations. It kind of compressed the response space of what it could find as the good answer, or a good answer, but it maybe hacked away interesting and useful bits of the model. So if we look at this plot here, it's from a different paper: the base model is this Qwen 2.5 7B model, and then there's that model trained with GRPO through more and more steps. On the x-axis is the number of samples you generate, and on the y-axis is so-called pass@k: if you generate, say, 256 responses, do any of them pass the verifier? And we can see that as the GRPO training goes on and on, the pass@1 gets better and better, but as k increases up to 256, by the time you get out here the base model is actually maybe doing a little bit better than the GRPO-trained model. So, empirically, maybe one way to say this is that you're getting rid of, or suppressing, some of the model's true intelligence in some way. Maybe I should slow down here, or people can ask questions, but I think this is a very important idea to understand.

Is it suppressing, or is it channeling? Sorry, say that again. Is it suppressing, or is it channeling? Channeling? Yeah, I mean, I'm not sure what the difference would be. So let's say a base model is super creative, and then the RLVR model is less creative but more capable. Yeah, yeah, I think that's a fair characterization. These terms get tricky.

Could you explain this again? Okay, yeah. So I'll slow down and try to say this again in a different way, or just slowly explain this graph. What we're going to do is: we have one prompt, or actually let's say we have 10 prompts. For each of those prompts we generate, say, 256 answers. If any of those responses pass the verifier, you get credit for passing at 256. So you can be right only one out of 256 times for a given prompt, and that still gives you a point. For a given prompt, with its 256 answers, just imagine OR-ing them: false, false, false, false, true, false... if there's any true, you get credit for that prompt. If seven out of those ten prompts have at least one passing answer, that's your pass@256. That's how you make this plot. And empirically, what people have found in RLVR is that as you continue to train, when k is small you have a big advantage, you do much better than your base model, but as you generate tons and tons of responses, you're actually less likely to solve really hard problems, or problems that require some clever spark of inspiration or some weird way of solving the problem.
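As a small sketch of that OR-style bookkeeping (my own illustration, not code from any of these papers, and the simple version rather than the unbiased estimator some papers use): a prompt counts as solved if any of its k samples passes the verifier, and pass@k is the fraction of prompts solved.

```python
# Sketch of the simple "OR" version of pass@k described above.

def pass_at_k(results_per_prompt: list[list[bool]], k: int) -> float:
    """results_per_prompt[i] holds verifier outcomes for >= k samples of prompt i."""
    solved = sum(any(outcomes[:k]) for outcomes in results_per_prompt)
    return solved / len(results_per_prompt)


# Toy example: 3 prompts, 4 samples each.
results = [
    [False, False, True, False],   # solved once -> counts toward pass@4
    [False, False, False, False],  # never solved
    [True, True, False, True],     # solved at every k
]
print(pass_at_k(results, k=1), pass_at_k(results, k=4))
```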
So the training process is kind of smoothing out or getting rid of these rough edges that don't tend to work, but at the cost of some diversity or creativity in the model. There are similar findings for RLHF in terms of diversity of responses. All right, I think I'm going to move on here; I think I've explained this basically as well as I can, but having other people take a shot at explaining it would be worthwhile if it's not clear.

So, yeah, this is the paper that really motivated this talk: Jointly Reinforcing Diversity and Quality in Language Model Generations. It came out of Meta, mostly out of Meta, I think one of the authors is at Johns Hopkins. And this is just a beautiful slide that explains the process quite well. So, okay, this is maybe not a verified-reward kind of problem: "write a story about a programmer with superpowers." These blue rollouts are considered identical and good, and this green rollout is considered good, it's actually a story about superpowers, and it's different from the blue ones, which are considered basically the same. And then this red one just doesn't pass the verifier. We want to penalize the red one in both GRPO and Darling, so its advantage is negative, off to the left, whereas both the blue and green ones have a positive advantage, off to the right. But what Darling says is: since this green response is different from the blue responses, and it doesn't have any other similar responses, we're going to upweight it. The advantage, the reward, the score that we assign to the green one, we're going to say it's more important, better than the blue ones, even though they all pass the verifier, just because it's distinct. GRPO doesn't have any mechanism like this; it only uses the verifier, and your score is independent of how different you are, of any kind of diversity measure against the other responses.

Oh, can I make this slide bigger? Oh, I should have gone into presentation mode, I'm sorry, I don't actually give many presentations. Full screen is there, next to slideshow, right here. If I just put this... okay, well, it's huge.

I will say, just for people to have more background information, the original GRPO was very vanilla: you output stuff, you pick whatever passes the bar, and you average over multiple if they do. But there have been a lot of iterations on GRPO. Some of them change the KL divergence penalty; there are three or four different papers we've done in paper club; Kimi K2 did a different instance. The original GRPO came out before DeepSeek R1: DeepSeek put out DeepSeek Math in February or so of 2024, which had a slightly different GRPO policy. But the base premise is always the same: you generate a bunch of outputs, you have rollouts, and then you want some way to give preference towards something. Now it's known that you lose diversity in responses, so this paper is specifically tackling that, but there are other variations as well.

Yeah, I think there's been a ton of work on improving GRPO, and a nice thing about them doing so well and open-sourcing it is that it spurred a bunch of research. So, for example, this one is focusing on diversity: you cluster outputs and you add more weight to the unique cluster, the one that only has one sample, and that promotes diversity. Some of the other stuff people have done is budget controls for output length, so you can give more reward at different lengths, stuff like that. That's just another example of something you could prefer: you don't want simple questions to have really long rollouts, so you do a similar thing where, if an answer is shorter, you give it more reward, or more if it's longer, for certain domains. There's stuff like that.

Yeah, and then here's just a restatement of that slide to emphasize the difference. We have the input prompts, they go through the verifier; in GRPO we do reward aggregation, and then we use basically the same algorithm that's used for RLHF to do the updating once we've got the rewards. For Darling, we take the inputs, we generate our rollouts, now we apply our diversity classifier, we rewrite the rewards for answers that are both correct and different from the other correct responses, and then we do the PPO update. So it's not really that much of a change on GRPO; it just focuses on boosting answers that are both correct and different from the other correct answers.

Sorry to interrupt, how is this different from... I read a paper called GRPO Done Right, I think it's abbreviated as Dr. GRPO? I don't know Dr. GRPO that well, but I think it does some calculation that changes how much reward you get based on the length of the response or something; it changes the reward aggregation, but it doesn't do anything that compares responses against each other for their semantic content and then rewards the ones that are different. So I think Dr. GRPO is a smaller change to GRPO than this; it doesn't require any other classifiers or anything like that. Got it, thanks.

Okay, so what does the paper do? Well, they open-source their code, which I love. They show that you can use this learned classifier to do this reward reweighting, and, I'll show you some results later, promoting diversity actually seems to get rid of this problem of losing the base model's intelligence or creativity at large k, with lots of rollouts. So I hope you guys can see the graphs, but the general gist is: the base model is green. It's a similar plot to the one I showed a few slides ago, but this one is actually from the paper. The number of rollouts is on the x-axis, your pass@k is on the y-axis, and we see the same trend for the green base model versus the orange GRPO models, where, as k increases, the base model tends to catch up to, or in some cases even pass, the GRPO-trained model. But Darling, this idea that rewards diversity, tends to keep its edge over GRPO and over the base model even at large k. So, at least in this math domain, we can see that it's really attacking this problem, and giving benefits most of the time even at pass@1, which is maybe what you care about the most, because that's what you'll tend to ship to users: you only generate one response.

Okay, so here's the question of how they train this math equivalence classifier. The problem it solves is: given two rollouts, quickly, maybe somewhat vaguely, determine whether they're saying essentially the same thing. Not a string match, but are they solving the problem in the same way. The way they did this was to get a diverse solution set for different problems by generating rollouts from, I think, five or six different models, mostly from the Qwen family, some from Llama. Then, given pairs of these responses, they ask Llama 70B to be the oracle that generates the right answer, and they take these pairs of responses, with the answer from Llama, and train a classifier on top of a Qwen embedding model to mimic what the big Llama model does. They got pretty good accuracy, I think 89 percent on a holdout set. Here are the prompts they used to generate the training data for this equivalence classifier, which seems pretty well thought out and well reasoned, but it's not actually that long.

So how do the mechanics of this work? When you're doing your rollouts, I think in the paper, or in the code, they said only eight rollouts, they'll call this embedding-based classifier on every pair of rollouts, and if for any pair the classifier says yes, those answers get merged. After you do this, you basically have a partitioning of the responses: if A matches B and B matches C, even if A doesn't match C, you still consider A and C equivalent. Then you form a diversity score that's just a simple function of the size of the partition, which says how many other correct responses you are distinct from. If your cluster is big, you lose some credit, because the model tended to like generating that style of answer; whereas if you got the answer correct and you're distinct from all the other responses, you get substantially higher reward. So maybe, in some sense, it's trying to upweight these more creative or weird or subtle reasoning paths, to keep that diversity in the model. And note that this is super simple: we do these n-squared classifications, we binarize the score from the classifier, and we form clusters. So we do quite a bit of work and then get a very simple output to change the reward with.

And then this is the actual math. Basically, this Darling score is a reweighting of the base reward scores from the verifiers by the diversity score, with some normalization. They tried other ways of incorporating the diversity score, using addition rather than multiplication, and they found that the multiplicative score works best. And then, as that previous slide showed, once you're done with this, you've got your reward scores calculated, and it's identical to GRPO from here.
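Here is a rough sketch of that partition-and-reweight step. The pairwise `is_equivalent` callable stands in for the trained embedding-based classifier, and the 1/cluster_size diversity bonus is an assumed, illustrative normalization rather than the paper's exact formula.

```python
# Sketch of Darling-style diversity reweighting (classifier and normalization are stand-ins).

from itertools import combinations


def cluster_rollouts(rollouts: list[str], is_equivalent) -> list[int]:
    """Merge pairwise-equivalent rollouts (transitively) and return a cluster id per rollout."""
    parent = list(range(len(rollouts)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(rollouts)), 2):
        if is_equivalent(rollouts[i], rollouts[j]):
            parent[find(i)] = find(j)  # if A~B and B~C, A and C end up in one cluster
    return [find(i) for i in range(len(rollouts))]


def darling_rewards(rollouts, verifier_scores, is_equivalent):
    """Multiply each rollout's verifier reward by a diversity bonus."""
    clusters = cluster_rollouts(rollouts, is_equivalent)
    sizes = {c: clusters.count(c) for c in clusters}
    diversity = [1.0 / sizes[c] for c in clusters]  # unique correct rollout -> bonus 1.0
    return [r * d for r, d in zip(verifier_scores, diversity)]
```

From the reweighted rewards onward, the update is the same group-relative step sketched earlier for GRPO.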
Okay, so what did I like about the paper? It's a well-motivated problem, and I think the approach is pretty simple. It's maybe not as simple as Dr. GRPO, which, as far as I understand, just changes some length normalization; here you have to introduce this new classifier. But I think, fundamentally, if you're going to attack diversity, you need to be able to compare responses, the semantic content of responses, against each other, so it's some necessary complexity. It has good results, and they open-sourced the code.

What don't I like? I actually tried to run the code, and you quickly hit the problem that they don't include the semantic classifier for you, so you can't just run it out of the box. Maybe the diversity calculation is overly simple: you could imagine you've got four responses, A is equivalent to B is equivalent to C, and then D is equivalent to B but not to A and C. So you've got this big cluster and then one little spoke connecting to it, and the calculation will say D is exactly equivalent to A, B, and C, even though the graph structure kind of makes you think D is probably a little different from A, B, and C, so maybe you should actually give D more credit than A, B, and C. The paper also only talks about using one particular diversity classifier for the math problems; they didn't look at what happens if you train that classifier for longer, on more data, or use a simpler model. And they also don't say how much time goes into doing this diversity classification. It could be that if you spent that time just generating more rollouts, maybe you would have done better. But all in all, I was pretty happy with how this paper was written, the idea, how it attacks the problem.

I will say it's a pretty fun exercise: there are seven or eight different easy-to-implement GRPO papers that don't actually share the code and really open-source it, but the models are really good at converting this stuff to code. It's one thing to take a paper that doesn't have an open-source implementation and code it up from scratch, but GRPO variants are actually pretty easy, because all these papers spend a very long section explaining the changes to the math: okay, we're getting rid of this normalization, we're making these little changes. Then you go into Claude Code or whatever you want to use, you get a base GRPO implementation, you share the changes, and you have it implement them, plus tests. They're pretty good at this, because you not only have the math formulas, you also have two pages of here's why, what's changing, and how it changes. The math is a little hard, at least for me, to follow and implement myself in code, it would take a lot longer, but the models are pretty good at this stuff. So if any of you are going through the nanoGPT or nanochat stuff and you want to do some RL training from scratch, implementing GRPO and its iterations with a coding copilot really helps you understand what's going on, and it's a lot easier than it seems. But, you know, kind of irrelevant to the paper, sorry.

No, no, it's cool. I kind of want to get my hands dirty hacking on this stuff, so maybe I should pull out some other papers and not get stunted by, oh, this big missing piece of the paper isn't there. Yeah, this is the end of my talk, happy to answer more questions. Very nice slides, by the way. Yeah, thanks. And thank you for offering to present. Yeah, I initially wanted to present three and then realized that was a lot of work, and I didn't really understand the details of the other two papers super well, so I punted and only did the one I feel like I really wrapped my head around. Okay, any other questions or comments from anyone?

I have a question about the open-ended nature of task verification for, let's say, LLMs, and the fact that that's what's led to not having good verifiers. What's the state of the art in terms of developing closed tasks of some kind, benchmarking them, and using them as a harder type of verifier?

Yeah, I don't know this stuff super well. I know Karpathy has this tweet from about three months ago saying everyone should try coding up verifiers for their problems, and that this would really help the models.

If you guys have seen the current meme trend of everyone starting an RL environments company, that's kind of what this is. Everyone's starting companies to build environments and sell to the labs. I'll give you a bit of background on what it's really like: the labs want exclusivity, they want your unique environment to be just for them. So you can be the Mercor that's the big player and does stuff for everyone, or, if you're a small startup, you're basically a little consulting arm creating little interactive environments for one lab, and then it's a whole thing to manage doing multiple unique ones, and they don't want you to work with other labs. But on the other side, a lot of the pilots, and where they start, is actually making environments for evals. Say you're working not with a lab doing RL but with a company that wants to do RL or test their agent or whatever. What you're really doing is making a sandbox environment that tests whatever they want to test in a unique setting, and then they're really just using it for evals, even though the conversation is, oh, I want to do RL, this and that. It kind of always starts with "I want evals for my thing, and we'll help you create unique evals." It's an evals-as-a-service game. So instead of running on SWE-bench, or even instead of running on real GitHub, how about we make you a sandbox with stuff coming in through a Slack-like chatbot, all end to end, and you can see where your agent fails. But I low-key forgot the question you were asking when I started this.

Yeah, what's the state of the art of... yeah, exactly. Yeah, so a lot of the companies trying to make environments are going at it from that angle, or they're all trying to just sell to a lab, which is cool too. Outside of that, not many people outside the labs are doing the real RL yet, so it's interesting that it covers both domains. But that's kind of where it is. And then it's very, very saturated: you want RL environments, we'll send you a list of 20 companies that'll build you a sandbox that tries to mimic your thing. But there's no value in evals for it, right? It's stuff that your thing has probably never seen.

Yeah, it seems like one approach would have just been: hopefully RL scales all the way to some sort of all-encompassing god model, but, short of that, we'll take RL for task-specific domains and just max out each task domain.
Basically, I think RL is very general; it does scale out toward a god model. That's what all the thinking models are, right, like o1, o3, GPT-5 Thinking; those are very good general-purpose reasoning models, they're not trained for just one thing. Now, on the other hand, if you look at scaling laws and the generalization of this stuff, it just so happens that when you do this type of RL on math and coding, it actually generalizes: look at a chess benchmark, we're not doing RL on chess, but you RL on math and code and your chess gets really good. It's just that general: roll out a chain of thought, step by step. And that's not to say there aren't issues. Right now we can't do a lot of parallel RL, because you have to stay on policy; the pre-training side is very parallelized, but RL post-training we don't know how to do that well in parallel across a budget, I can't just parallelize the RL. Then there's other stuff, like diffusion, I can't do diffusion RL, or native voice spectrograms; there's still stuff there. And I guess maybe my chess example isn't great, but the premise still holds: there is generalization in RL. AlphaZero was a specific model for chess, but you can RL on just math and see that it generalizes to other domains. There are quite a few papers on this, and a lot of small models where they show the RL dataset; I think even the Phi models explain what their data mixture is, and there are definitely a few that are just all math and code, and then you run the benchmark on regular GSM8K and stuff like that, and it still goes up. So it does generalize; it's also just that math and code are the most verifiable data right now.

Yeah, that's fair, and I think what I was thinking about was the AlphaGo debut, and that novel move that happened as a result of just RL-maxing all the way until it invented a new move. We haven't seen that level of RL-maxing in the other areas, right? Yeah, move 37, you should look into move 37, pretty fun stuff. It'd be interesting what the move 37 of, like, "triage my inbox" is, you know. We'll see. Some people are using it for proofs; Christian Szegedy is doing a new company, Math Inc., trying to solve math. And I guess the move 37 for interesting stuff is something like Periodic Labs from Liam: they're doing RL for material sciences, you know, predicting new materials, silicon and stuff, all on mid-trained RL models, and hopefully you find something cool. Harmonic, I don't know what that is, but there are people doing it in different domains; it's just different problems. I think it reflects back on hard versus soft verifiers, and the RL-for-bio one we did last week is sort of related as well, this idea of what's a hard verifier versus a soft one.

Cool, more comments, questions, thoughts? Okay, if not, we can always give people some time back on their Wednesday. Next week... actually, what exactly is mid-training? Does anybody have a good, succinct description of that? Yikes, you do? Yikes knows what mid-training is. Yeah, I don't have a good succinct one. Normally I would assume it's essentially the RL process, right? Where, if post-training is going to be the chat or instruction tune, and pre-training is going to be the foundation, then mid-training, I guess, should come before the instruction or chat tune. So, am I wrong?

Yeah, my understanding of mid-training is that you're selecting really high-quality documents for pre-training. You'll first pre-train on just everything, and then you'll select, I don't know, the top one percent or ten percent or something of really good documents and finish your pre-training there. Yeah, so there are interesting problems there in figuring out what the really good data is. So we have initial pre-training, then we have mid-training with a refined dataset, then we have the chat or instruction tune with a different dataset, and then we have RL on top of that, which is building the chain of thought to be more reward-y. One wonders how many other junctures there are; presumably you can extend the principle in some way or another.

Typically, though, it's a little odd to just say mid-training for general use, because after you do your trillions of tokens of pre-training, you do save the best for last. But when you think of it in a specific domain, like, say, Periodic doing RL for material sciences, they'll take a base model, trained from scratch, preferably not SFTed, and then they'll license a bunch of data in the domain, material sciences books, articles, patents, as many domain-specific tokens as they can, and do mid-training, where you take a base Qwen model and domain-adapt it toward material sciences with however many high-quality tokens you have. Then they'll do SFT, RL, whatever. So that's the mid-training phase. In general-purpose models, it's more like, okay, here's basic tool use, here's code, here's whatever use cases we find interesting, let's start training in on those, and then do our SFT and RL as well.

Because the thing that's coming to mind is the Galactica model, and maybe slightly BloombergGPT, where the general idea was: okay, we need a new foundation model because we want to do domain-specific stuff, and the issue they were running into was that the domain-specific jargon effectively needed its own tokenizer. So I'm curious, since Galactica is kind of old news at this point, whether we've gotten to: hey, what we should actually do is not really worry about the tokenizer stuff, we should just take a pretty capable generalized model and then train it so that it's more capable in this domain. I'd want, like, a Galactica++ bench, I guess, to see if Galactica is one of the ones on the line.

Yeah, that's basically a good example, actually, the BloombergGPT one. It's not necessarily mid-training, because that thing was so undertrained, but towards the end of it they did train on a bunch of unique Bloomberg dataset stuff. I think we actually did a paper club where I covered BloombergGPT, so it should be recorded; that paper was basically all just dataset. But they're like, you know, we have hundreds of billions of tokens of financial datasets, news, filings, press, Bloomberg stuff, and the order in which they do it is pretty interesting. I think you get a lot more conceptual understanding of mid-training by reading small language model papers, and we've gone through a lot of those,
because a big component of them is basically their training mixture and the order in which they do the training. So just find any of the new small models and look at their training stuff.

So was BloombergGPT effectively mid-trained with the Bloomberg stuff? Because I feel like they said they had a new foundation model, right? Are they mid-training it and then calling it a new foundation model, or did they actually say, okay, we handpicked all these tokens and now we're just training from scratch? I don't know. I think the problem with considering BloombergGPT mid-training is that half the dataset was financial. I can pull it up, but I quite strongly remember that half of the pre-training dataset was financial as well; it was a very clear split, half of it was Pile, C4, books or something, and the other half was the financial news data. So it was just a pre-train with a lot of financial data. Yeah, and I think Galactica is the same way: yeah, we messed with the tokenizer, we decided to do the dataset this way, and then trained the whole thing. So I'm interested; there are a lot of decisions you could make there that didn't really occur to me as decisions you could make architecturally, or ones that people don't often take. I'm curious if it's just not effective, or if mid-training is kind of the better way to do that.

Regarding the nomenclature, mid-training seems pretty similar to continued pre-training; what's the difference? Yeah, I'll quote Jeremy Howard on the Latent Space podcast: there's no concept of post-training, it's all just training. At some level, post-training is still training, RL is still training, it's all just training. It's just that the boundary we define is about the quality of the data. You typically consider chat, instruction following, a separate category because the data is, per se, instruction following, it's user-assistant chat pairs, but you're not doing anything different, you're just training, typically still doing SFT. So the words go into the magic sand and that's what it is. In a pure, literal sense it is still training; pre-training and post-training are the same. The useful boundary is that the majority of your 15 trillion tokens will just be random tokens, understand the world, English. The reason we call it post-training is not because you're doing any fancy training; it's just because that's where you start doing instruction following and user-assistant chat. So there's a boundary in terms of what we're trying to accomplish, but in the literal sense it's all just training: we're not changing the training objective, we're just predicting tokens.

Yeah, the way I look at it is that pre-training is mostly unsupervised, including continued pre-training or mid-training; the data might be changing for domain adaptation, but in post-training you're adding some supervised or reinforcement element, it's not completely unsupervised. Yeah, there are ways to look at it.

And I'm curious, do you see any companies or use cases where RL is seeping into agentic workflows? I almost see it as one way to squeeze the most out of the foundation models. I'm curious if you have any interesting use cases you came across. Yeah, a lot of the labs have done cool RL for specific stuff: Deep Research is RL on OpenAI's models, and there are coding ones that have come out; we've shared quite a few in Discord. So there are cool use cases of people doing RL for a task and that turning straight into a product.

Okay guys, thanks again, Rob, for sharing, we're kind of at that time. Next week we have real-time detection of hallucinations; you can find it in Discord, I shared the link here. But thank you, thank you for sharing, really good slides, good presentation.