RLVR DARLING: Reinforcing Diversity & Quality in LLM Generations (Paper Club Oct 15)

Chapters
0:00 Introduction to Sonnet 4.5 and Haiku 4.5 Performance
1:47 Eugene Yan's Use of Haiku 3.5 and Tail-End Reliability
2:50 Introduction to Rob and the RLVR Paper
3:55 Basic Overview of Reinforcement Learning from Verified Rewards (RLVR)
4:21 Rich Sutton's Quote on Knowledge Verification
4:47 OpenAI's o1 Series and Gold Medal Performance in Programming Contests
6:14 Rich Sutton's "Bitter Lesson" and Hot Takes on LLMs
7:48 Defining the Verifier in RLVR
9:15 Reinforcement Learning vs. Pre-training: Dense vs. Sparse Rewards
9:54 RLHF: Reward Hacking and Human Preferences
11:34 Sholto Douglas on RLHF being "Fake RL"
11:55 Elaboration on Reward Hackability in RLHF
16:20 The Problem of Sycophancy in Reward Modeling
17:35 DeepSeek R1 and Group Relative Policy Optimization (GRPO)
19:24 The Diversity Problem in GRPO-trained Models (Pass@K Plot)
21:47 Discussion: Suppressing vs. Channeling Creativity
22:44 Detailed Explanation of the Pass@K Graph
25:45 Introduction to the DARLING Paper
26:00 Explaining DARLING's Approach to Diversity and Reward Upweighting
28:20 Context on GRPO Iterations and Budget Controls
30:28 GRPO vs. DARLING Workflow Comparison
31:13 DARLING vs. DrGRPO (Diversity in Reward Aggregation)
32:12 DARLING's Contributions: Open-sourcing, Reward Reweighting, and Results
32:55 DARLING's Improved Pass@K Results at Large K
34:25 Training the Math Equivalence Classifier for Diversity
36:22 Mechanics of Diversity Calculation and Partitioning in DARLING
38:40 The Multiplicative DARLING Score
39:29 Pros and Cons of the DARLING Paper
41:47 Coding GRPO Implementations from Scratch
44:26 Open-Ended Task Verification and the State-of-the-Art in Verifiers
45:06 The Rise of RL Environments Companies
47:32 Generalization of RL in Math and Coding to Other Domains
49:08 Parallel RL and Other RL Limitations
49:50 AlphaGo's "Move 37" and Novelty in RL
50:31 RL for Proofs and Material Sciences
51:43 Mid-Training Definition and Purpose
53:11 Mid-Training in Domain-Specific Models (e.g., Material Sciences)
54:28 Galactica and Bloomberg GPT: Tokenizers and Domain Adaptation
57:42 Mid-Training vs. Continued Pre-training
59:47 RL in Agentic Workflows
00:00:00.000 |
Let's try Sonnet 4.5, I'm sorry, Haiku 4.5, the small model. 00:00:36.880 |
Recently, I've been trying Codex CLI a lot more, and frankly, it's very slow. 00:00:45.880 |
Trying GPT-5 Pro, which has a noticeable performance difference, stuff is taking 20 minutes. 00:00:54.160 |
And it's very async in the CLI, but Claude Code has a hybrid mode. 00:01:00.200 |
So, you know, planning with Opus or Sonnet 4.5 and then doing a lot of coding with a fast model, I'm pretty excited for it. 00:01:09.500 |
But, you know, performance-wise on numbers, it's pretty good. 00:01:15.340 |
I find it interesting that, you know, GPT-5 High still doesn't compare to Haiku 4.5 on coding, on SWE-bench Verified. 00:01:24.780 |
Like, the Codex-specific model, sure, but like, yeah, Anthropic is doing good with coding. 00:01:30.900 |
Yeah, that is a really nice result, actually. 00:01:39.600 |
Basically, we did speed as AGA, but otherwise, I don't know, it's nice to have it out. 00:01:48.420 |
Eugene Yan was asking about this the other day. 00:01:51.260 |
He still has Haiku in production for speed and cost, and he's like, none of the open-source models fill the gap. 00:02:03.880 |
Interesting to see people still use Haiku 3.5, you know? 00:02:11.880 |
But, okay, I think I can stop sharing, unless anyone else has fun stuff. 00:02:27.140 |
But, I'm sure someone will dig into this, see what other fun stuff they did. 00:02:35.500 |
This week, we have three papers with a volunteer, so I'm going to pass it off. 00:02:43.160 |
He is at a Cloudflare conference, so he's literally on stage at this minute. 00:02:47.880 |
So, we're good to start whenever I've got the recording going. 00:03:10.560 |
I used to work at Google as a machine learning engineer, and I've just been kind of deep diving some RLVR papers, and this is, oh, I wanted to present three, but I realized I didn't understand them super well, so I slimmed it down to just the one that I really got nailed down. 00:03:28.940 |
So, I think it's a quite nice and elegant paper, and, yeah, I'm going to talk about that. 00:03:45.760 |
I'm happy to, you know, go slow, chat, whatever, answer questions. 00:03:53.120 |
So, first, I'm going to go through just, like, a basic high-level overview of reinforcement learning from verified rewards that probably, if you don't know it, not a lot of stuff is going to make sense. 00:04:04.800 |
And then I'll dive into specifically the Darling paper from Meta. 00:04:11.060 |
So, yeah, to tee up RLVR for a sec, first, a quote from Rich Sutton, "An AI system can create and maintain knowledge only to the extent that it can verify the knowledge itself." 00:04:33.040 |
So, this is an interesting take for Rich Sutton, who's not even really that much of an LLM guy, and I think he made this quote more than 20 years ago, but he, you know, saw a bit of the future. 00:04:43.900 |
RLVR, it was the sort of key unlock that, you know, enabled OpenAI to have their o1 series, where the model would, instead of just instantly sort of 00:05:01.280 |
spitting out an answer, you know, a token at a time, 00:05:06.440 |
go through this thinking process where it will generate a lot of internal tokens and then give a high quality answer. 00:05:13.940 |
It enabled, recently, gold medal performance in both this collegiate programming contest, the ICPC, 00:05:23.680 |
and other kind of math, high school math competitions, I think it did almost top performance, and RLVR is like a key tool for training agentic tool use. 00:05:42.440 |
So, yeah, is there, is there, I don't know if there's like a chat I can see, if anyone- 00:05:49.580 |
There's some chat, um, not much yet, someone was asking if you can share slides, but I think once we go into the media feed, you can share later. 00:05:58.340 |
Yeah, so this, oh, should I just share the link to the slides, rather, like, you don't see my slides itself? 00:06:04.480 |
Someone was just asking where you can download them later. 00:06:06.800 |
Oh yeah, yeah, yeah, yeah, I just put it in, I'm sorry, I put it in the main, uh, Latent Space Discord chat. 00:06:12.980 |
Perfect, if you guys don't know who Rich Sutton is, there's this paper, The Bitter Lesson, that everyone likes to cite, um, you should read it, recently he came back on the Dwarkesh podcast, and he's been having some hot takes, like, uh, you know, these LLMs are not going to do AGI, they don't actively learn, there's some interesting takes in that one. 00:06:35.120 |
Yeah, and actually if you've been looking at that quote and think about it a sec, uh, right, it's saying the verifiers are important, but, like, the deployed language models don't have verifiers, right, like we, we use them during training, um, but at inference time, the verifiers are gone, so, you know, Rich might, uh, object still to, uh, the current state of the art, um, because of this, right, they, they only use verifiers during training, and so, um, at inference, they still might be, uh, lying to you, 00:07:05.100 |
or misleading themselves or misleading you, uh, it's an interesting take with how, how deep verifiers go, right, like, we've got very fuzzy verifiers too now, right, like, all the, all the hype is on rubrics, as better LLM as a judge, and, you know, you're abstracting away a strict definition of verifiers, or verification, right, like, you're, you're, I mean, a lot of labs are trying to, trying to do reasoning without verification, how can we strip away as much of the verification as we need, and then you've also got this, 00:07:35.080 |
know like you're only as good as you can verify so interesting little divergence that we're taking 00:07:41.080 |
yeah yeah um yeah so well what is the verifier um it's a function that you're given a prompt and a 00:07:52.200 |
response um tells you um true like the the response was correct so it's it's very clean and binary um 00:08:04.840 |
for sort of some math problems where you just want an exact answer you can force the model to say like 00:08:11.480 |
oh here's here's my answer after it's done all this thinking um for code you can like if you're just 00:08:19.400 |
wanting to implement a function or something you can run the unit tests for that function and see 00:08:23.160 |
if it passes them all um you can do other clever things like i think openai um has a benchmark called 00:08:32.280 |
webbench or something where they um got interesting information from like the web and then generated 00:08:39.880 |
um like true false questions from that data um so they use the web as a way of generating 00:08:47.400 |
um sort of complicated questions and then they just throw away some details and then 00:08:51.960 |
um have like you know you're encouraged or you encourage the lm agent whatever use a web browser 00:08:58.120 |
and try to figure it out and yeah there's just lots of different ways you can sort of fit interesting 00:09:04.600 |
problems um into uh this verified rewards kind of domain 00:09:16.840 |
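To make the "a verifier is just a function from a prompt and a response to true/false" idea concrete, here's a minimal sketch in Python; it's my own illustration rather than anything from the talk or the paper, and the \boxed{} final-answer convention and helper names are assumptions.

```python
import re

def extract_final_answer(response: str) -> str | None:
    """Grab the last \\boxed{...} span in a response (a common final-answer convention)."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    return matches[-1].strip() if matches else None

def math_verifier(prompt: str, response: str, reference_answer: str) -> bool:
    """Binary verifier: True only if the extracted final answer matches the reference."""
    answer = extract_final_answer(response)
    return answer is not None and answer == reference_answer.strip()

# The RLVR reward for a rollout is then just 1.0 or 0.0.
reward = float(math_verifier("What is 2+2?", r"Some thinking... so the answer is \boxed{4}", "4"))
```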
um so this also should be kind of review for uh reinforcement learning versus pre-training um but when you're pre-training you know for any given 00:09:23.480 |
prefix uh you have you know what's the next token so you have a lot of dense reward and it's not 00:09:29.560 |
really like hackable you can't really cheat that too well um but since it's teacher forced the model 00:09:35.720 |
like isn't um sort of recursively generating from its own output so it's really kind of stays on the rails 00:09:42.600 |
but um and it's kind of efficient for just ingesting large amounts of information um but 00:09:51.320 |
um unlike reinforcement learning um it's not learning from its own outputs or its own mistakes 00:09:57.000 |
so um rlhf which you know was very big and important about three years ago to kind of unlock 00:10:05.160 |
usable chat bots um trains a separate reward model that doesn't say yes or no or this is absolutely good 00:10:12.680 |
or this is absolutely bad but you know tries to encode human preferences um so you still have a dense 00:10:20.440 |
reward or you can sort of judge partial answers um but it's kind of easy to hack this reward um you can 00:10:30.520 |
just look for kind of exploits in the reward model and you know i think in the original rlhf paper they 00:10:38.360 |
when they when they don't kind of keep the model close to the original uh pre-trained model it starts 00:10:43.880 |
just like spamming words like you know amazing and beautiful and generous and doesn't even generate 00:10:48.840 |
coherent sentences it just uh spams out these kind of nice adjectives and the reward model likes it um 00:10:55.480 |
reinforcement learning from verified rewards doesn't have this problem where uh you know you 00:11:01.640 |
can kind of hack the model because you actually have to really generate you know a single or like a 00:11:08.280 |
specific um token that says or you know whatever sequence of tokens let's say you really nailed the 00:11:15.400 |
problem so you've got a clean reward signal um but now you only get feedback um once you've done your 00:11:24.920 |
your entire generation right so the reward signal is quite sparse um um what's his name if you guys know 00:11:36.200 |
sholto douglas he works at anthropic he's been he says he's scaling rl he's one of the guys that's heavily 00:11:44.120 |
scaling reinforcement learning in every domain he's giving a talk this week and one of his quotes is uh 00:11:50.680 |
rlhf is fake rl so he had like a little skit about how rlhf is fake rl uh someone in the chat is asking if 00:11:59.240 |
if you can elaborate on reward hackability i think it's a i think it's pretty important to understand 00:12:04.920 |
what that is in rl do you wanna do you wanna take it yeah yeah i mean you can also take a shot at this but 00:12:09.800 |
um so when you're doing rlhf um basically you'll you'll train a sort of separate reward model so you'll 00:12:17.560 |
say collect i don't know a thousand prompts or ten thousand prompts get um pairs of answers from your model 00:12:24.680 |
um and then send those pairs to humans humans will say you know for this prompt i 00:12:33.160 |
liked the the one on the right better for this one i like the one on the left better so you get this uh 00:12:38.680 |
this kind of training set of pairs of uh responses to a given prompt and then you want to learn a model 00:12:46.040 |
that tries to encode like what would a human uh what would they prefer right and so you have um 00:12:56.040 |
a kind of a model that can tell you given two responses which one is better 00:13:03.400 |
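As a rough sketch of what "learn a model that encodes which of two responses a human would prefer" looks like, here's the standard Bradley-Terry style pairwise loss; this is a generic illustration that assumes some reward_model mapping a prompt/response pair to a scalar score, not code from any particular RLHF paper.

```python
import torch.nn.functional as F

def pairwise_preference_loss(reward_model, prompts, chosen, rejected):
    """Bradley-Terry style objective: push the scalar reward of the human-preferred
    response above the reward of the rejected one for the same prompt."""
    r_chosen = reward_model(prompts, chosen)      # shape: (batch,)
    r_rejected = reward_model(prompts, rejected)  # shape: (batch,)
    # -log sigmoid(r_chosen - r_rejected) is minimized when the chosen response scores higher.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```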
um but you know you don't have tons and tons of data for this and that that reward model itself might um have like maybe 00:13:12.280 |
misaligned preferences or bugs or make mistakes um and so as you train your model um your sort of main 00:13:21.320 |
chat model to generate responses that the the reward model likes um increasingly as you as you train it 00:13:29.720 |
more and more or as you let it go further and further from the base model um you stop getting the the real 00:13:37.320 |
signal like you know response a is better than b um for real like uh humans would agree with that and 00:13:44.600 |
you start picking up on just kind of um whatever is weird or errors in the in that reward model so you 00:13:53.080 |
kind of uh this is called reward hacking um and this is like a big problem with um using models to to 00:14:04.200 |
judge the reward um where yeah there was there was this concept of rlaif so you have like a another 00:14:11.880 |
model judge the output of another one and um it's not always like reward hacking just really really like 00:14:20.360 |
leads to you add random stuff and then reward goes up because the models disagree but sometimes it's 00:14:25.880 |
like it's hard to model your high level preference the little changes right so some examples are like 00:14:32.840 |
emojis right um some of the models got really over excited with emojis like basically anything you'd ask 00:14:40.040 |
them it would respond with emojis fire emojis happy faces and you know it learned that through rl because that 00:14:47.000 |
optimized the mathematical like rl reward policy that okay you know in some domains like emojis are good 00:14:54.920 |
or em dashes like let's always add em dashes and then same thing with some stuff like sycophancy right you 00:15:00.680 |
want better outputs that seem like they agree with you but then you take it too far and now the model just 00:15:06.760 |
agrees with you in every sense and it doesn't push back so like there's clearly like a trade-off balance and 00:15:12.040 |
it's really hard to get that mixture right but those are like some of the nuanced real examples of reward 00:15:19.000 |
hacking right like it's starting to realize like okay you like how i chat but i can get a little better 00:15:25.960 |
reward by throwing some emojis in there and then that like there was the whole thing off balance so now 00:15:30.040 |
you're asking it about like explain a math formula and you've got a bunch of emojis right so it can be 00:15:35.080 |
subtle and it can also be like very very obvious like you know um you just start shitting out longer 00:15:42.520 |
responses and you get more rewards because that's what the policy is like saying and you found out this 00:15:48.520 |
trick right let me just add noise because long response is good or the opposite let me let me be really 00:15:53.960 |
short because short response is good so those are some examples that are like easier to understand right where you could 00:15:59.720 |
understand how in some cases that that's like giving a good reward signal because it's good for some 00:16:06.120 |
senses but then it kind of throws the whole thing off and reward hacking is where the model like really 00:16:11.000 |
fits in on okay this is like this is giving me a lot of reward let me let me start doing this a lot more 00:16:16.840 |
but you might not want it yeah and another sort of interesting problem with um reward modeling is like this 00:16:23.400 |
problem is syncophancy right where people tend to like responses that agree with them maybe even if 00:16:28.920 |
they're not true so this can slip into your reward model and then also slip into your your chat bot right 00:16:35.480 |
so if you sort of have vaguely equivalent prompts but one that you know leans on expecting the answer to 00:16:42.040 |
go a certain way uh models sometimes pick up on that too much uh it's another like important problem with uh 00:16:50.520 |
rlhf and reward hacking last year in the end of april openai actually no this year what am i 00:16:58.040 |
talking about in april they had to roll back a version of gpt4o they put out a good blog post i just 00:17:03.560 |
shared it basically they're like um yeah gpt4o we did a little change and it's loving to agree with people 00:17:10.920 |
um not what we want we got to roll it back so that was like the first you know big lab like okay here's 00:17:17.640 |
here's a persistent issue that we're gonna have to deal with yeah yeah and i'm sure it's like you know 00:17:17.640 |
that this is one where they really screwed it up so badly that they had to roll back a model but i 00:17:25.800 |
think it's something that you kind of always have to fight with uh when you're using these reward models 00:17:34.920 |
um i'm gonna kind of move on i think that's um so about a year ago um deep seek released uh r3 and i 00:17:45.880 |
think or it was r1 i forget the names of it but it was a very high quality open source model um and they 00:17:53.080 |
also released uh a paper uh where they described this group relative policy optimization 00:18:00.120 |
which has become um i think a bedrock of uh work in the research world and open source ai world um 00:18:08.440 |
so uh what it did was basically replace that reward model um with like simple verifiers right so this 00:18:19.240 |
only works in domains where you have uh verifiers but um you've given a prompt you generate a bunch of 00:18:27.000 |
responses um you score them with the verifiers and then you say uh you know the reward uh is just 00:18:37.800 |
some uh like a sum of your scores from your verifiers and then you say every token um 00:18:45.160 |
in the response that is good kind of sort of shares that reward um and every token in 00:18:51.880 |
responses that are bad uh gets penalized so it's it's a very kind of broad brush but it's a clean signal 00:18:59.560 |
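Here's a minimal sketch of that group-relative, whole-sequence credit assignment, as a paraphrase of the GRPO idea rather than DeepSeek's exact implementation: score every rollout in the group with the verifier, normalize within the group, and let every token of a rollout share its advantage.

```python
import numpy as np

def grpo_advantages(group_rewards: list[float], eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages: each rollout's verifier reward is normalized against
    the other rollouts for the same prompt; every token in that rollout shares the value."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: 4 rollouts for one prompt, the verifier passes only the last two.
print(grpo_advantages([0.0, 0.0, 1.0, 1.0]))  # wrong rollouts negative, correct ones positive
```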
and this i think allowed them to get quite close to uh the quality of openai's thinking models 00:19:08.280 |
at the time it was released um and uh became not just this basis for a lot of additional work uh because 00:19:22.120 |
and then uh one one interesting problem people found uh with uh these verified reward 00:19:31.080 |
or grpo trained models right is though it helped uh improve the quality of the responses 00:19:39.400 |
especially if you're going to get you know just a single response um 00:19:45.560 |
it actually hurt uh the diversity of the generations like it kind of compressed uh the 00:19:54.680 |
response space of what it could find as the kind of the good answer uh or a good answer um but it 00:20:03.400 |
maybe hacked away interesting and useful bits of the model so if we look at this uh this plot here it's 00:20:12.120 |
from a different paper um you see uh the base model is this qwen uh 2.5 7b model and then there's um that 00:20:24.760 |
model trained with grpo through more and more steps um and on the x-axis is uh the number of samples you 00:20:34.200 |
generate and then on the y-axis is uh so-called pass at k so does does any you know if any of the 00:20:42.040 |
those k let's say in this 256 uh like x-axis you're going to generate 256 responses do any of them 00:20:52.680 |
pass the verifier right and we could see that as the grpo training goes on and on right the pass at one 00:21:03.960 |
gets better and better um but as the k increases here up to 00:21:11.960 |
256 by the time you get here the base model actually is maybe doing a little bit better than 00:21:19.320 |
the grpo-trained model right so um empirically maybe one way to say this is um you're kind of 00:21:28.840 |
getting rid of or um suppressing some of the models uh kind of true intelligence in some way right like um 00:21:41.880 |
it's yeah i don't know maybe i should slow down here or people can ask questions but 00:21:46.680 |
i think this is a very kind of important idea to understand is it suppressing or is it channeling 00:21:53.800 |
sorry say that again is it suppressing or is it channeling channeling yeah i mean i'm not sure 00:22:03.720 |
what what would the difference be so let's say a base model is super creative uh and then rl hf sorry 00:22:11.800 |
rlvr model is less creative but more capable yeah yeah yeah i think that's a fair characterization 00:22:27.320 |
yeah yeah yeah i mean this terms get tricky but yeah yeah um 00:22:37.880 |
um could i explain this exactly okay yeah yeah so i'll sort of slow down and try to say this again 00:22:50.120 |
um in a different way um or like we'll just try to slowly explain this graph um 00:22:56.920 |
so what we're going to do is we have one prompt or we'll actually be able to say we'll have 10 prompts 00:23:02.760 |
right and then for each of those prompts we're going to generate let's say um 256 answers right 00:23:15.800 |
and if any of those answers are like passing the verifier right then you get credit for um 00:23:25.800 |
covering or passing at 256 so you can you can be right simply one out of 256 times 00:23:33.400 |
for a given prompt um and that will give you a point there right so if you you know in 00:23:39.160 |
seven out of those ten prompts you'll have you know seven times 256 answers uh for a given prompt 00:23:47.080 |
you know just imagine oring right say false false false false false true false false false if that or 00:23:52.520 |
right if there's any true you get credit per prompt right that's how you make this plot 00:23:59.320 |
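A small sketch of how a plot like this is computed; this is the simple empirical version, ignoring the unbiased pass@k estimator that papers usually use.

```python
def pass_at_k(per_prompt_results: list[list[bool]], k: int) -> float:
    """per_prompt_results[i] holds the verifier outcomes for the samples drawn for prompt i.
    A prompt counts as solved at k if ANY of its first k samples passes (a logical OR)."""
    solved = [any(results[:k]) for results in per_prompt_results]
    return sum(solved) / len(solved)

# Example: 3 prompts, 8 samples each; the last prompt is only solved by one lucky sample.
results = [
    [True, False, True, False, False, False, True, False],
    [False] * 8,
    [False, False, False, False, False, False, False, True],
]
print(pass_at_k(results, 1))  # 1/3 of prompts solved with a single sample
print(pass_at_k(results, 8))  # 2/3 of prompts solved when you take 8 samples
```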
um and yeah empirically what people have found in rl vr is um right as you continue to train um 00:24:11.320 |
you know when when k is small you have a big advantage you do much better 00:24:18.520 |
than your your base model but as you generate just tons and tons of responses right 00:24:26.280 |
um you're actually less likely to solve kind of maybe really hard problems or problems that require like 00:24:34.840 |
um some maybe clever spark of inspiration or some kind of weird way of solving the problem right so 00:24:44.520 |
um the the training process uh is kind of smoothing out or getting rid of these kind of rough edges that 00:24:54.040 |
don't tend to work but at the cost uh of some diversity or creativity in the models 00:25:01.880 |
there are similar findings for rlhf in terms of um like i think like diversity of uh responses right if you 00:25:29.960 |
all right i think i'm gonna move on here i think i've got this explained basically as well as i can but 00:25:36.680 |
um having other people take a shot at explaining this would be worthwhile if if it's not clear 00:25:44.920 |
um so yeah this is the paper that uh you know really motivated this talk um jointly reinforcing 00:25:53.400 |
diversity and quality in language model generations um came out of meta mostly out of meta i think one 00:26:00.520 |
of the authors is at johns hopkins um and uh this is just like a beautiful slide that explains 00:26:00.520 |
um the process quite well um so okay this may be not a verified reward kind of problem right 00:26:21.080 |
write a story about a programmer with superpowers um these um blue rollouts 00:26:28.680 |
are considered identical and good and this green rollout is considered good right it's actually a story 00:26:38.680 |
about superpowers um and it's different than the blue ones which are considered kind of the same 00:26:45.320 |
and then this red one just doesn't pass the verifier right so we want to penalize 00:26:52.120 |
the red one in both grpo and and in darling right so that that advantage is kind of negative off to the 00:26:58.520 |
left um whereas both the blue and green ones have advantage that's positive you know kind of to the right 00:27:05.480 |
um but what darling says is since this uh this green response is different than the blue responses and 00:27:13.560 |
you know it doesn't have any other um similar responses we're going to upweight it all right so 00:27:20.840 |
uh the advantage uh the advantage the kind of reward the score that we assign to the green one is 00:27:26.040 |
going we're going to we're going to say it's it's more important or better than the blue ones 00:27:30.760 |
even though um you know they all pass the verifier um just because it's distinct 00:27:37.240 |
grpo doesn't have any mechanism like this it only uses the verifier and your score is independent of 00:27:46.440 |
um how different or any any kind of diversity measure from uh from the other responses 00:27:53.160 |
oh can i make this slide bigger um oh i should have went into presentation mode i'm sorry 00:28:09.240 |
some uh full screen is there next to a slideshow right right here if i just put this okay well it's 00:28:18.280 |
huge um i will say just for people to have more background information the original grpo was like 00:28:26.040 |
very vanilla right you output stuff you you pick whatever passes the bar and you average out multiple 00:28:34.360 |
if they do but there have been a lot of iterations on grpo right uh some of them change the kl divergence 00:28:41.400 |
penalty some of them like there's like three or four different papers we've done in paper club 00:28:46.040 |
kimi k2 did a different instance the original um grpo came out before deep seek r1 deep seek put out 00:28:54.120 |
deep seek math in february or so of 2024 that had a slightly different grpo policy but the base premise 00:29:03.320 |
is always the same right you have you generate a bunch of outputs and you have rollouts per se and then 00:29:07.960 |
you want some way to give preference towards something now it's known that like you lose 00:29:15.800 |
diversity and responses so this is specifically tackling that but there are other um variations as well 00:29:24.280 |
yeah yeah i mean i think there's been a ton of work on improving grpo and a nice thing about them doing 00:29:30.520 |
so well and open sourcing it i think is you know that it spurred a bunch of research yeah so so like for 00:29:35.800 |
example um this one is focusing on diversity right you cluster outputs and you add more weight to the 00:29:43.720 |
unique cluster right the one that only has one sample that will promote diversity some of the other stuff 00:29:49.640 |
they've done is uh they have budget controls for output length right so um you can give more reward 00:29:57.080 |
at different lengths um stuff like that right and then you can yeah so so that's just like another 00:30:04.040 |
example of stuff that you could prefer right you don't want simple questions to have really long 00:30:09.640 |
rollouts um you'll you'll basically do a similar thing where if an answer is shorter you give it more 00:30:15.320 |
reward or or if it's longer for certain domain so there's stuff like that yeah and then here's yeah 00:30:22.840 |
just a restatement of that that slide um to emphasize the difference right we have their input 00:30:28.840 |
prompts they go through the verifier in grpo we do reward aggregation and then we use actually kind 00:30:34.840 |
of the same algorithm that is used for rlhf to do the updating once we've got the rewards um for darling 00:30:41.800 |
we take the inputs uh we generate our rollouts now we apply our diversity classifier um we reweight our 00:30:50.840 |
rewards for um answers that are both correct and different than other correct responses and then we 00:30:57.720 |
do the ppo update so it's it's not really that much uh of a change on grpo it just really focuses on 00:31:05.640 |
boosting um answers that are both correct and different than other correct answers 00:31:12.280 |
what is this sorry to interrupt how is this different from um uh i read a paper called uh 00:31:20.280 |
grpo done right or i think it's abbreviated as a dr grpo um i think i i i don't know dr grpo that well 00:31:31.640 |
but i think this like dr dr um does some calculation that changes like how much reward you get based on 00:31:41.960 |
the length of the response or something like it's okay it changes the reward aggregation um but it 00:31:47.320 |
doesn't do anything that like compares responses against each other for their their semantic content 00:31:54.440 |
and then reward like ones that are different right so i think um dr grpo is like a smaller change 00:32:02.280 |
to grpo than this doesn't doesn't require any other classifiers or anything like that yeah got it thanks 00:32:10.120 |
okay um so yeah what does the paper do well they open source their code which you know i love 00:32:17.880 |
um they show that you can use this learned classifier to do this reward reweighting and um now i'll show 00:32:29.560 |
you some results later but they um like promoting diversity actually um seems to get rid of this problem of 00:32:39.000 |
uh uh for large k like lots of rollouts uh losing your like the this kind of base model intelligence or 00:32:55.000 |
so i hope you guys can can see this and see the graphs but the general gist is the base model is 00:33:04.200 |
green um here it's a similar um plot to the the one that i showed a few slides ago but this is actually 00:33:11.800 |
from the paper um and yeah so the number of rollouts is on this x-axis your pass at k is on the y-axis 00:33:19.640 |
and we see this kind of same trend for the green like the base model versus the orange models is grpo 00:33:26.440 |
versus where um as k increases kind of the base model tends to catch up to or i guess in some cases even 00:33:34.200 |
pass um the grpo trains model um but um darling the this idea that um rewards diversity tends to keep this 00:33:48.840 |
this edge over grpo and over the base model even at large k uh so we can you know at least in this math 00:34:00.760 |
domain uh maybe see that it's really attacking this this problem and giving benefits even most of the 00:34:10.200 |
time at for pass at one right which is maybe what you care about the most right because this is what 00:34:17.240 |
you'll tend to ship to users right you only generate one response um 00:34:29.560 |
here's a question of like how do they train um this math equivalence classifier right so um the problem 00:34:40.600 |
it solves is given given two rollouts um tries to maybe vaguely quickly determine you know are they 00:34:48.120 |
saying essentially the same thing right not not not string match but um you know are they solving 00:34:54.120 |
the problem in the same way um so the way they did this was they tried to get um a bunch of uh 00:35:02.840 |
like a diverse solution set for different problems by generating uh rollouts from i think there's like 00:35:10.120 |
five or six different models they're mostly it's from the qwen family some from llama 00:35:15.320 |
and then um given pairs of these responses they ask uh llama 70b to be like this oracle that generates the 00:35:26.200 |
right answer and then um they take these pairs of responses with the uh the answer 00:35:37.400 |
with the answer from llama and they train a classifier on top of a qwen embedding model to be 00:35:44.680 |
able to um mimic llama mimic what this big llama model does um and they got pretty good accuracy i 00:35:53.480 |
guess 89 percent on a holdout set um here's the prompts that they uh use to generate the training 00:36:06.120 |
data for this equivalence classifier uh which i don't know seems pretty well thought out and 00:36:12.200 |
uh well reasoned but it's not not actually that long um 00:36:17.880 |
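As a rough sketch of what "a classifier on top of an embedding model that mimics the big Llama judge" might look like; the architecture, dimensions, and feature construction here are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

class EquivalenceHead(nn.Module):
    """Tiny binary classifier over a pair of response embeddings: are these two
    rollouts semantically equivalent? Labels are distilled from a large LLM judge."""
    def __init__(self, emb_dim: int = 1024, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4 * emb_dim, hidden),  # features: [e1, e2, |e1 - e2|, e1 * e2]
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, e1: torch.Tensor, e2: torch.Tensor) -> torch.Tensor:
        feats = torch.cat([e1, e2, (e1 - e2).abs(), e1 * e2], dim=-1)
        return self.mlp(feats).squeeze(-1)  # logit; train with BCEWithLogitsLoss

# Training (not shown) would minimize binary cross-entropy against the judge's 0/1 labels.
```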
so right so how does the mechanics of this work right when you're doing your rollouts i think in 00:36:29.160 |
the paper or in the code they said only eight um rollouts um they'll actually call this uh this 00:36:39.080 |
embedding based classifier on every pair of inputs um and then if for any pair the 00:36:50.520 |
classifier says yes um those those answers get merged um and then after you do this basically 00:37:01.240 |
you'll have a partitioning of those responses right so um if a matches with b and b matches with c 00:37:09.400 |
even if a doesn't match with c you still include uh consider a and c equivalent um and then you form this 00:37:16.200 |
diversity score that's just like a simple function of the size of the partition um which just says like 00:37:24.600 |
how many uh other correct responses are you um distinct from right so if if your cluster is big you kind of 00:37:37.400 |
lose some credit uh because um you know the model tended to like to generate that style of answering um 00:37:46.920 |
whereas if you got the answer correct and you are distinct from all of the other responses um you'll get 00:37:55.320 |
uh substantially higher reward so maybe in some sense it's trying to um reward or uh upweight these kind of 00:38:06.520 |
more creative or weird or subtle kind of reasoning paths uh to keep that diversity in the model 00:38:15.400 |
uh yeah note that this is like super super simple right we do these n squared classifications um and 00:38:23.800 |
then we just binarize the score from the classifier and then form clusters right so we kind of do quite a 00:38:29.400 |
a bit of work and then just get very kind of simple outputs to um change the reward using that um yeah 00:38:40.520 |
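Putting those mechanics together, here's a minimal sketch of the clustering and reweighting step as described above; the transitive merging and the "smaller cluster means more credit" logic follow the talk, but the exact diversity formula and normalization are simplified assumptions, not the paper's precise math.

```python
from itertools import combinations

def darling_rewards(rollouts, base_rewards, is_equivalent):
    """Cluster rollouts by (transitive) pairwise equivalence, then upweight the rewards
    of rollouts that sit in small clusters, i.e. that are distinct from the other rollouts."""
    n = len(rollouts)
    parent = list(range(n))

    def find(i):
        # Union-find with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # O(n^2) calls to the learned equivalence classifier, then merge equivalent pairs.
    for i, j in combinations(range(n), 2):
        if is_equivalent(rollouts[i], rollouts[j]):
            parent[find(i)] = find(j)

    cluster_sizes: dict[int, int] = {}
    for i in range(n):
        root = find(i)
        cluster_sizes[root] = cluster_sizes.get(root, 0) + 1

    # Diversity weight in (0, 1]: a rollout alone in its cluster keeps its full reward.
    diversity = [1.0 / cluster_sizes[find(i)] for i in range(n)]
    # Multiplicative combination of quality (verifier reward) and diversity.
    return [r * d for r, d in zip(base_rewards, diversity)]
```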
and then this is the actual uh math basically this darling score is a re-weighting of the base um 00:38:51.640 |
you know reward scores from the verifiers um with some normalization um they tried some other ways of 00:39:01.080 |
incorporating the diversity score using addition rather than multiplication uh they found that this 00:39:05.880 |
uh multiplicative score um works the best and then yeah as that previous slide showed right um once you're 00:39:14.360 |
done with this now you've got your reward scores calculated and it's it's identical to uh grpo from 00:39:22.280 |
here um okay so what did i like about the paper right so it's it's a well-motivated problem i think 00:39:33.880 |
this approach is is pretty simple right it's maybe not as simple as like dr grpo which just as far as 00:39:39.400 |
i understand like changes some like length normalization right you have to introduce this new 00:39:43.240 |
classifier um but i think fundamentally if you're going to attack diversity you need to be able to 00:39:47.880 |
you know compare responses and the semantic quality content of responses against each other so it's sort 00:39:54.520 |
of some necessary complexity um has you know good results and um they open source the code 00:40:05.000 |
um what don't i like so i actually tried to run the code um and then you quickly uh hit this problem 00:40:12.120 |
where they don't actually include this semantic classifier for you so you can't just out of the box 00:40:17.960 |
run it um maybe this diversity calculation is overly simple right um you could imagine maybe you've got 00:40:28.040 |
four responses and a is equivalent to b is equivalent to c and then just you know d is equivalent to b 00:40:35.400 |
but not to a and c right so you've got this little cluster or this big cluster and then just one little 00:40:41.240 |
spoke coming out connecting them now the calculation will say d is exactly equivalent to a b and 00:40:49.240 |
c even though even like this graph structure kind of makes you think probably d is a little bit different 00:40:55.560 |
than a b and c so maybe you should actually give d more credit than a b and c the paper only talks about 00:41:05.160 |
using you know one particular diversity classifier for the math problems they didn't say you know see 00:41:11.640 |
what happens if they were to train that classifier for longer on more data or use like a simpler model 00:41:18.680 |
and they also don't actually say how much time is going into doing this diversity classification 00:41:25.640 |
right um could be the case that if you were to spend that time uh you know just generating more rollouts 00:41:32.760 |
uh you know maybe you would have done better right um but yeah all in all i was pretty happy with uh 00:41:40.600 |
how this paper was written you know the idea how to attack the problem 00:41:44.360 |
uh i will say it's a pretty fun exercise there's like seven or eight different like easy to implement 00:41:54.680 |
grpo papers that don't share the code and really open source it but like the models are really good 00:42:02.360 |
at converting this stuff to code so like it's one thing to take a paper that doesn't have an open 00:42:09.240 |
source implementation and like code it up from scratch but grpo is one of those things that's 00:42:14.200 |
actually very easy because they spend a very long section of this paper of like all the papers 00:42:19.320 |
explaining the changes to math right like okay we're getting rid of this like normalization 00:42:26.040 |
distribution we're doing these little changes and then you know you go into claude code or whatever 00:42:31.080 |
you want to use and you you get a base grpo implementation you share the changes and then 00:42:36.520 |
you just have it kind of implement it right tests and like they're pretty good at this because you not 00:42:41.400 |
only have math formulas you also have like two pages of here's why and what's changing and how it changes and 00:42:48.120 |
it's a little hard math at least for me to follow like and implement myself in code it would take a lot 00:42:53.800 |
longer but like and while they're pretty good at this stuff so uh if any of you guys are going through 00:42:58.760 |
like the nanogpt nanochat stuff or whatever and you want to like do some rl training from scratch 00:43:04.440 |
implementing grpo with like a coding co-pilot and iterations of it really helps you understand what's 00:43:13.000 |
but like you know quite irrelevant to the paper sorry no no no it's cool i mean i yeah i kind of want 00:43:24.680 |
to get my hands dirty like you know hacking on this stuff um so yeah maybe i should kind of pull out some 00:43:30.840 |
other papers and uh you know not get stunted by oh you know this big missing piece of them uh the 00:43:37.480 |
paper is it's not there uh yeah this is the end of my talk happy to answer more questions 00:43:45.240 |
very nice slides by the way yeah thanks and thank you for offering to present yeah i mean i i initially 00:43:59.560 |
wanted to present three and then uh realized that was a lot of work and i didn't like really understand 00:44:05.320 |
the details of the other two papers super well so i punted it and only did the one that you know i 00:44:09.880 |
feel like i really wrapped my head around so um okay any other questions comments from anyone 00:44:26.120 |
i have a question about like the open-ended nature of task verification for let's say llms 00:44:33.800 |
and the fact that that's what's led to not having good verifiers like uh what's the state of the art in 00:44:39.080 |
terms of like developing closed tasks of some kind and benchmarking them and using them as like a harder 00:44:44.920 |
type of type of verifier yeah i don't know this stuff super well um yeah i know karpathy has this 00:44:56.040 |
tweet like three months ago that says like you know everyone should try to be coding up verifiers for 00:44:59.960 |
their problems and this would you know really help the models if you guys have seen the current meme trend 00:45:07.720 |
of everyone starting an rl environments company that's kind of what this is right uh everyone's 00:45:13.480 |
starting up companies to sell to the labs and make environments uh i'll give you a bit of the background 00:45:19.720 |
of what it's really like the labs want exclusivity right they want their environment like your unique 00:45:26.200 |
environment to be just for them so uh you know you can be the mercor that's like you're the player and 00:45:33.000 |
you can do stuff for everyone or if you're a small startup you're basically a little consulting arm 00:45:37.880 |
creating little interactive environments for one lab and then it's like a whole thing to manage to do 00:45:44.040 |
multiple unique ones and then they don't want you to work with other ones but on the other side um a lot 00:45:51.160 |
of the pilots and where they start is actually you're making environments for evals so say you're 00:45:59.080 |
working with not a lab doing rl you're you're working or you are you're working with like a company that 00:46:05.240 |
wants to do rl or test their agent or whatever what you're really doing is you're making like a sandbox 00:46:11.320 |
environment that should test whatever they want to test in like a unique setting and then they're really 00:46:17.000 |
just using it for evals even though the conversations like oh i want to do rl this that it kind of always 00:46:23.560 |
starts with um i want evals for my thing and we'll help you create unique evals right like it's a eval as 00:46:31.320 |
a service game so instead of you can run on swe-bench how about even instead of like running on real 00:46:38.920 |
github how about we make you a sandbox of like random coming in through a slack like chatbot and this 00:46:45.160 |
all end to end and you can see where your agent fails but um i low-key forgot the question you 00:46:51.960 |
were asking where i started this yeah what's the state of the art of like yeah exactly yeah yeah yeah 00:46:57.640 |
so uh a lot of the companies that are trying to try to make environments and stuff um you know they're 00:47:03.720 |
going at it from that angle or or they're all trying to just sell to a lab which is cool too um outside of 00:47:11.000 |
that like you know no one not many people outside of labs are doing the real rl yet so it's interesting 00:47:17.000 |
that it covers both domains right um but that's kind of where it is and then it's very very saturated like 00:47:23.320 |
you want rl environments we'll send you a list of 20 that'll build you a sandbox that tries to mimic 00:47:29.000 |
your thing but there's no value in evals for it right it's stuff that your thing has probably never seen 00:47:33.320 |
yeah it seems like one approach would have just been like hopefully rl just scaled all the way to some 00:47:38.200 |
sort of all-encompassing god model but you know short of that uh we'll take rl for task specific uh 00:47:44.040 |
and just just um max out each task domain basically i think rl is very general right it scales out 00:47:51.960 |
as a god model that's what all the thinking models are right like uh o1 o3 gpt-5 thinking those 00:48:01.160 |
are very good general purpose reasoning models they're not per se trained to just one thing now on the other 00:48:07.640 |
and if you look at scaling laws and um generalization of this stuff it just so happens that when you do 00:48:15.640 |
this type of rl on math and coding it actually generalizes like uh look at a chess benchmark we're 00:48:22.840 |
not doing rl on chess but you rl on math and code your chess gets really good um it's it's just that 00:48:29.080 |
general roll out a chain of thought step by step and it's not to say that there aren't issues right like 00:48:34.520 |
right now we can't do um we can't do a lot of um parallel rl because you have to stay on policy right 00:48:44.280 |
so the pre-training side is very parallelized um but rl post-training per se we don't know how to do it 00:48:53.720 |
that well in parallel across a budget you know like i can't just optimize parallel rl then there's other stuff 00:48:59.960 |
like diffusion i can't diffuse rl i can't do um native voice spectrogram or also there's stuff there 00:49:07.240 |
and i guess maybe my chess example isn't good yikes but the the the topic still uh you know the premise 00:49:14.520 |
still holds like there's there is generalization in rl uh alpha zero was a specific model on 00:49:20.760 |
chess but you can you can rl on just math and you can see that it generalizes to other domains there's like 00:49:27.560 |
quite a few papers on this there's a lot of small models where they like they show rl just on the 00:49:33.560 |
data set make sure like i think even the mixture models uh the phi models they explain what their 00:49:39.080 |
data set is and there's definitely a few that are just all math and code and then you run the benchmark 00:49:44.600 |
on regular gsm8k and you know stuff like that and it still goes up so it does generalize it's also just 00:49:44.600 |
that that's the most verifiable data right now yeah that's fair and i think uh what i was thinking 00:49:56.120 |
about was like the alpha go uh debut and like that you know that that novel move that happened as a 00:50:01.160 |
result of just rl maxing all the way till it invented a new move that's we haven't seen that level of rl 00:50:06.840 |
maxing on like the other areas right so yeah move 37 you should look into move 37 pretty pretty fun stuff 00:50:18.920 |
like it'd be interesting what is the move 37 of like triage my inbox or something you know we'll see 00:50:31.560 |
uh some people are using it for proofs christian szegedy is doing a new math inc company trying to 00:50:39.240 |
solve math um the i guess the move 37 for interesting stuff is like periodic labs from liam they're doing 00:50:48.840 |
it for rl for material sciences um you know predict new materials silicon and stuff all on mid-trained rl 00:50:57.240 |
models and hopefully you find something cool harmonic i don't know what that is but there's there's people 00:51:04.200 |
doing in different domains it's just it's just different problems i think it reflects back on 00:51:11.160 |
the hard or soft verifiers and like the bio one that we did last week is sort of related as well 00:51:16.120 |
uh this idea of like what's a hard verifier versus soft and yeah 00:51:37.240 |
okay if not we can always give people some time back on their wednesday um next week 00:51:43.640 |
uh there was a question of what exactly is mid training does anybody have a good succinct uh description of that 00:51:50.840 |
yikes you do yikes knows what mid training is i was gonna yeah i don't have a good um succinct one uh 00:52:01.080 |
normally i would assume it's it's the essentially the rl process right where it's if post training is 00:52:07.560 |
going to be the chat or instruction tune and then pre-training is going to be the foundation 00:52:11.240 |
then mid training i guess should come before the instruction or chat tune so am i wrong yeah yeah my 00:52:20.200 |
understanding of mid training is like you're selecting really high quality documents for pre-training so you know 00:52:26.680 |
you know you'll first pre-training on like just everything and then you'll select i don't know 00:52:31.320 |
the top one percent or ten percent or something of like really good documents and and finish your 00:52:36.600 |
pre-training there oh and then yeah yeah so there's a yeah just interesting problems of like figuring out 00:52:43.480 |
you know what's what's the real good data um and yeah that and um it's in so we have we have pre-training 00:52:52.200 |
we have initial pre-training then we have mid training with a refined data set then we have 00:52:55.800 |
chat or instruction tune with a different data set and then we have rl on top of that which is building 00:53:01.080 |
the chain of thought to be more rewardy um one wonders like how many other junctures there are could be 00:53:07.720 |
different stuff uh presumably i can extend the principle in some way or another yeah typically it's a 00:53:15.640 |
mid training stage isn't like it's very odd to just say mid training for general use right because 00:53:21.960 |
yeah after you do your trillions of tokens of pre-training you do save the best for last but when 00:53:28.760 |
you think of it like in a specific domain right like say periodic or are all for material sciences um 00:53:36.120 |
they'll take a base model like trained from scratch and preferably not sft then they themselves like 00:53:44.120 |
they'll license a bunch of data in different domains like material sciences books articles patents like 00:53:51.320 |
as many domain specific tokens as they can do mid training where you take like a base quen model 00:53:59.000 |
you domain adapt it towards your your domain of material sciences and you have how many ever high 00:54:05.720 |
quality tokens then they'll do a sft rl whatever um so that's that's kind of the mid training phase 00:54:14.360 |
uh in general purpose models it's more so like okay you know here's here's basic tool use here's code 00:54:21.480 |
here's whatever use cases we find interesting let's let's start training in on those then do our sft and rl as well 00:54:28.200 |
um i because i remember the thing that's coming to mind is the um like the galactica model and then 00:54:34.200 |
maybe like slightly bloomberg gpt where they have the general idea was like okay we need a new foundation 00:54:40.920 |
model because we want to do domain specific stuff and the issue that they were running into was that 00:54:45.800 |
the domain specific jargon uh like needed its own tokenizer effectively um so i'm curious 00:54:55.640 |
as to because galactic is kind of old news at this point so i'm curious if we got to hey what we 00:55:01.640 |
should actually do is not really worry about the tokenizer stuff and we should just take like a pretty 00:55:05.960 |
you know a capable generalized model then train it so that it's more more capable in this domain and 00:55:13.800 |
i yeah i'm curious i won't like uh uh galactica plus plus bench i guess to see if uh galactic and fire the 00:55:22.520 |
one of the ones on the line yeah that's that's basically a good example actually the the bloomberg 00:55:29.800 |
gpt that it's not necessarily mid training because that thing was so under trained but um you know 00:55:36.920 |
towards towards the end of it they they did train on a bunch of unique bloomberg data set stuff i think we 00:55:45.080 |
actually did a paper club i covered it on bloomberg gpt so it should be recorded that paper was basically 00:55:51.640 |
all just data set uh but but then they're like you know we have hundreds of billions of tokens of 00:55:57.000 |
financial data sets news filings press bloomberg stuff and the order in which they do it is pretty 00:56:02.840 |
interesting i think you get a lot more conceptual understanding of mid training by reading small 00:56:09.080 |
language model papers um so we've gone through a lot of those because they're a big component is 00:56:15.720 |
basically their training mixture and the um order in which they do training so just find any of the 00:56:23.640 |
new small small models and then look at look at their training stuff so was was bloomberg gpt effectively 00:56:30.200 |
like mid-trained with the bloomberg stuff because i feel like they said that they had a new foundation 00:56:35.800 |
model right so were they just mid training it and then calling it a new foundation model or did 00:56:40.120 |
they actually like okay we handpicked all these tokens and now we're just rolling it from scratch i don't 00:56:44.520 |
know i think the problem with considering bloomberg gpt mid training is that half the data set was 00:56:50.920 |
financial like i can pull it up but i quite strongly remember that half that data set for pre-training 00:57:00.040 |
as well like it was a very clear split half of it was like pile c4 books or something the other half was 00:57:07.720 |
40 financial news so you know it was just a pre-train with a lot of financial yeah yeah and i think 00:57:16.440 |
galactica is the same way where it's like uh yeah we mess with the tokenizer we you know uh decided to do 00:57:23.240 |
the data set this way and then train the whole thing so i'm yeah i'm interested that's an interesting like 00:57:29.560 |
there's a lot of decisions that you could make there that uh didn't really occur to me as like 00:57:33.240 |
decisions that you could really make architecturally or like ones that people don't often take i'm 00:57:37.800 |
curious if it's just like not effective or if it's um mid training is kind of the better way to do that 00:57:42.680 |
regarding the nomenclature mid training seems pretty similar to continued pre-training like 00:57:51.080 |
what's the difference yeah there's uh i'll quote jeremy howard on the latent space podcast there's 00:57:59.480 |
no concept of post training it's all just training right like yeah at the at some level like post 00:58:06.040 |
training is still training rl is still training it's all it's all just training it's just that the the 00:58:11.320 |
bound that we define is like quality of data right you typically consider uh like chat um instruction 00:58:20.760 |
following a separate category because the data is per se instruction following right and it's like 00:58:28.200 |
chat user assistant pairs but you're not doing anything different you're just training you're still doing 00:58:34.360 |
typically sft right so the words go in the magic sand and that's what it is um in some actually like 00:58:42.280 |
in a pure literal sense it is still training there's pre-training post training is the same 00:58:48.040 |
the the useful bounds difference is that you know the majority of your 15 trillion tokens will be 00:58:54.360 |
just random token understand the world english the reason we call it post training is not because 00:59:01.080 |
you're doing any fancy training it's just because that's where you're starting to do instruction 00:59:05.640 |
following and you know user assistant chat so there's a there's a bound in terms of what the 00:59:12.840 |
you know what we're trying to accomplish but in the literal sense it's all just training right it's 00:59:19.560 |
we're not changing the training objective we're just predicting tokens right so yeah the way i look 00:59:26.280 |
at it is pre-training is mostly unsupervised right like including continued pre-training or mid-training 00:59:31.880 |
the data might be changing for domain adaptation but post-training you're adding in some supervised 00:59:39.080 |
element or reinforcement element um not completely unsupervised uh yeah yeah there are ways to look at it 00:59:53.480 |
yeah and i'm curious like do you see any uh companies or you know um use cases where uh 00:59:59.560 |
rl is seeping into agentic workflows uh uh you know i almost see it as one way to squeeze 01:00:08.440 |
you know the most out of the foundation models um yeah i'm curious if you have any interesting use cases that you came across 01:00:19.960 |
yeah a lot of the labs have done cool rl for specific stuff right like deep research is rl 01:00:26.440 |
on openai's model there's coding ones that have come out uh we've shared quite a few in discord but 01:00:32.760 |
there are cool use cases of um people doing rl for a task and and that turning straight into product 01:00:41.400 |
um yeah yeah yeah okay guys thanks again rob for sharing we're kind of at that time next week we have 01:00:51.080 |
uh real-time detection of hallucinations you can find it in discord i shared it in the link here but 01:00:56.600 |
thank you thank you for sharing really good slides good presentation