
Reflect, Retry, Reward: Self-Improving LLMs


Transcript

Cool, I'm going to record. You're welcome to set some context. Yeah, go ahead. Cool. And everyone can see my slides and everything? Well, I can see your page. It looks like a paper, but I guess it's a slide. Yeah, this is just the paper cut and pasted onto the first slide.

Cool. Well, thank you all for being here. This is really exciting, and I'm super excited to chat about this. So I'm Shelly, one of the AI engineers and researchers at Writer. And one of the things we've been talking and thinking about a lot in the last six to nine months has been self-evolving, self-improving large language models.

And I think this is a term or phrase that's been coming up more and more across the board for lots of different companies, right? How can we functionally fine-tune small language models for specific customers in a way that adapts to their use cases, so that they can learn from their mistakes and become more and more useful?

And I have a couple of canonical examples that I always use when I'm giving less technical talks about the subject. These are the sorts of tasks you would expect a large language model to be able to do, knowing that they've been trained on reasoning and math and coding and all of these things, but there are still so many gaps that enterprise customers, and also we in our everyday lives, feel.

So, for example, one of the examples I give: I recently moved, so I have a new apartment, and the leasing office gave me a floor plan. And I asked Claude Sonnet 4, okay, if I have this floor plan, can Claude put furniture onto the floor plan for me?

Like, put a bed here that's the right size, a couch there. And, you know, the bed ends up in the bathroom, the couch is sideways, all the walls are moving around, all these crazy things. And it's one of those things where it's not intuitive that this doesn't work; it feels like it should work.

And so, you know, when we think about self-evolving or self-improving models, what we're talking about is how can we, between big model releases, create large language models that can learn and adapt to these use cases that should be in distribution, but for whatever reason aren't. So I'm going to spend maybe 20 minutes walking through the paper and then hopefully we can just kind of have a greater discussion about self-improvement, about GRPO, like lots of cool research in this space in the last couple of months, especially after sort of the May conference submission deadlines.

So, yeah, I'll just get into it. This is the paper that we released in late May, early June, called Reflect, Retry, Reward: Self-Improving Large Language Models via Reinforcement Learning. And it's motivated by what I was speaking about before, this desire to improve the performance of large language models under certain constraints.

We were focusing on use cases where, as I said, all models do poorly, generally speaking, even very, very large models. So you don't necessarily have other models that you can use as a judge or for synthetic data. That's kind of how we ended up at reinforcement learning, right?

We instead want tasks that are easily verifiable, but that we don't necessarily have large ground truth datasets for, or ways to use judge models or other synthetic data techniques. And we also stuck to binary reward settings, right? We wanted to keep the reward as simple as possible.

And so we ended up on tasks that have very simple expressions of reward. Let me see. Okay, right. So this is, again, the setting of what we're trying to achieve with this paper. You have this standard flow: you generate some output given a task.

And if you succeed, that's great. But when you fail, there isn't really anything you can do to get better, aside from collecting large amounts of data, generating large amounts of data, and doing an SFT run. We want to think about how we can incrementally do better and have models that learn from their mistakes and are adaptive and evolve.

So we were quite inspired by a 2023 work called Self-Refine. I believe it was CMU and maybe the University of Washington, or I'm probably getting that wrong, but a couple of organizations came together on this work. It's part of a larger narrative around self-reflection with large language models, but they were one of the seminal papers showing that if you ask a model to self-critique, or to provide feedback on its own output, and then use that feedback to refine its answer, you can see up to 20% performance gains.

And I think a lot of their experiments were on whatever was powering ChatGPT at the time. At the bottom is an illustrative example of how that would look in practice. So you have a user talking about table tennis. There's a kind of mediocre response from the model, but then you prompt the model to say, what are the issues with this?

Like what is missing? And it says, oh, there's no information about how to play table tennis and there's a lack of user understanding. And so then the refined response is qualitatively much better. So we're definitely inspired by this known property of large language models that they can to some degree, like provide self-feedback or self-critiques.

So this is what that flow would look like if we were just prompting for self-reflection, right? So on the left side, if you succeed, great. On the right side, if you fail, right? If we detected a failure, as I talked about, with some sort of verification, then we could generate a self-reflection in this style, like ask the model to provide feedback and then have the model retry the task, right?
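To make that concrete, here is a minimal sketch of the prompting-only version of this loop; `generate` and `verify` are hypothetical stand-ins for a chat-completion call and a task-specific checker, not the paper's actual code.

```python
# Minimal sketch of the prompting-only reflect-and-retry loop (no training).
# `generate` and `verify` are hypothetical stand-ins for a chat-completion call
# and a task-specific checker; they are not the paper's actual code.

REFLECT_PROMPT = (
    "Your previous answer was wrong. Reflect on what went wrong and how you "
    "could do better before trying again."
)

def reflect_and_retry(task_prompt, generate, verify):
    messages = [{"role": "user", "content": task_prompt}]
    first_answer = generate(messages)                 # attempt 1
    if verify(first_answer):
        return first_answer                           # success: nothing more to do

    messages.append({"role": "assistant", "content": first_answer})
    messages.append({"role": "user", "content": REFLECT_PROMPT})
    reflection = generate(messages)                   # self-reflection, same conversation

    messages.append({"role": "assistant", "content": reflection})
    messages.append({"role": "user", "content": task_prompt})
    return generate(messages)                         # attempt 2, conditioned on the reflection
```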

And we see pretty much immediate improvements from this on the tasks that we're talking about, right? So this is cool and all; that's how prompted reflection works. But then comes the learning aspect, or the evolution aspect, right? The self-reflection prompting is static and the model doesn't learn anything.

And so we're using reinforcement learning again, which is why the verifiable rewards are lovely to teach the model how to reflect better. And our specific formulation is that we specifically rewarded the self-reflection tokens. So what we were trying to do here with this approach is incentivize the model to learn how to self-reflect better, not to get the right answer.

So there was no reward for the answer tokens. There was only reward for the self-reflection tokens. And so the flow on the right now is when we fail, we generate a self-reflection, we retry the task. And then if we are on that specific fail, retry success path, just that pathway where a self-reflection led to success, then we reward the self-reflection tokens because that means that is a high-quality self-reflection that led to a success when there was previously a failure.
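As a rough illustration of that credit assignment (a sketch of the idea, not the released training code): the reward is binary, it fires only on the fail, reflect, retry, success path, and it is attached only to the tokens of the self-reflection.

```python
# Sketch of the binary reward and a token mask that confines credit to the
# self-reflection span. Names are illustrative, not from the released code.

def episode_reward(first_try_correct: bool, second_try_correct: bool) -> float:
    """1.0 only on the fail -> reflect -> retry -> success path, else 0.0."""
    return 1.0 if (not first_try_correct and second_try_correct) else 0.0

def reflection_token_mask(num_tokens: int, reflection_start: int, reflection_end: int):
    """1 for tokens inside the self-reflection span, 0 everywhere else
    (including the second-attempt answer tokens, which are never rewarded)."""
    return [1 if reflection_start <= i < reflection_end else 0 for i in range(num_tokens)]
```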

Any questions at this point? I don't know if there's somewhere I can see them. No, nothing specifically. Okay, perfect. I'll keep moving, and Sam, if you want to jump in from the chat, that's great. Awesome, okay. So that's the basic formulation, right?

It's a modification to a standard reinforcement learning approach, with this extra step of generating and rewarding self-reflections. And so we're incentivizing the model to self-reflect better. Okay, so in the paper itself, we focused on two tasks that fit the description I gave before: tasks with a verifiable reward, and tasks that you would think should be in distribution but aren't.

So the first one is function calling. We used APIGen, the Salesforce function calling dataset. It's about 60,000 data points that came out mid last year and that everyone has been using. And then we also used this task called countdown, which I actually recently learned is based on a British game show, which I didn't realize.

I didn't know where that name came from. But this is a task that got pretty popular in January of this year, because there was a project showing that RL can dramatically improve the ability to do this particular task; it was a GRPO experiment. But models are surprisingly bad at this task.

The formulation is that you get a list of three or four numbers, and you have to create an equation that equals some other target number. You can only use the basic arithmetic operations, plus, minus, times, divide, and you can only use each number once. All of the questions are of this format; the numbers are different, but it's always just create this equation.

It's actually really hard. Yeah. Again, there's a British game show where the entire premise is that people go and try to do this quickly and it's very difficult. So these are the two tasks that we experimented on because again, like large models are surprisingly not as high accuracy on these tasks as you would think.
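For a concrete, made-up instance of what a countdown problem and a correct answer look like:

```python
# A made-up countdown instance, just to show the shape of the task.
numbers = [2, 3, 7]      # each number may be used at most once
target = 13
candidate = "2 * 3 + 7"  # one valid solution, since 2*3 + 7 == 13

assert eval(candidate) == target
```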

So we did very standard GRPO training for this, of course, with that modification of self-reflection and I'm excited to see like when we get to the end of this, like what people think about, you know, all of the recent advancements and all of the building upon GRPO and I have a couple papers that I want to talk about too.

But throughout our training process, at the start we saw qualitatively that these self-reflections were very long, very verbose, they kind of repeat the same thing over and over again, and they're very tied to one specific problem, right? And then we saw this really cool thing happen where, as you get to around a thousand steps, you start to see much shorter, clearer, and more general self-reflections; they're just higher quality, right?

They're very precise, they're very effective, they're very short. So it was a cool qualitative result. We can see that the models are sort of becoming, quote unquote, better at self-reflection, in our opinion. Okay. And then I'll talk through actual results. So this is on the function calling task.

So it's a pretty standard function calling task, right? You provide some tools, you have a user query and then the model has to pick a tool and provide the right parameters to that tool. So pretty standard function calling. And we stuck to pretty small models, particularly since we started this research around March when, you know, most of the GRPO research was being done on like under 10 billion parameter models.

So the top half of this table is the models that we actually trained. We stuck to a diverse set of open source models between one and a half billion and eight billion parameters. And just to walk you through what the different columns mean: vanilla first try is pass at one on function calling, right?

So you can see surprisingly low and even with like the very big models towards the bottom, 72 billion or 70 billion parameter models, like they're only around like 70% accuracy, right? And then the second column, plus reflection, second try is with that prompting technique, right? So how much of a raise do you get just by asking for self-reflection and then giving the models a second try at the task, right?

And then for the top half of these models, the smaller models, we trained them with this approach I've been talking about of incentivizing better self-reflection. And we see pretty immediate performance gains on pass at one, so trained first try, and then if you also give it a prompt and a retry, you see even more of a performance jump.

So the bottom half of the table is our baseline. The way we wanted to think about this was: how effective can we be at bringing small models up to the performance of large models? And one thing I want to highlight here: if you look at, for example, the Qwen 2 7 billion instruct row, the second row of the table, the model's vanilla first try is at 66.4%, and then with the training, and with giving it a second try, you bring it all the way up past 77%.

And if you compare this to the plus reflection, second try column of the large models, the Qwen 2 72 billion and the Llama 70 billion, that 77% is actually higher than when you prompt those very large Qwen and Llama models for self-reflection, right? So what we're showing is that you can use this training recipe to bring very small models past the performance of models 10 times their size, just by using a particular training recipe, which is really cool, right?

Because I think that there has been a movement that I really believe in about like, you know, the power of small models and why personalization can be really powerful. And I think this idea that you can bring a small model up past the performance of a very general out-of-the-box model is a cool thing to see.

Cool. So this is the function calling task. Oh, sorry, do you mind if I ask a question? I noticed that you had this sentence in your write-up as well, that small models can outperform larger models. I just want to clarify what you're saying: that small models can outperform larger models?

Small models that are fine-tuned can outperform larger models without fine-tuning, exactly. Thank you. Cool. And then these are the countdown results, so that second, math equation writing task. There are a few more models on here, a slightly different set. One thing that we did for academic integrity: based on when the dataset was released, we only chose to train models that were released prior to the dataset's release, because obviously with models like Qwen and Llama, you don't know exactly what they're trained on.

And so we wanted to make sure that this data wasn't already in their training data, right? Because that would definitely skew results. So that's why there's a slightly different model set here, and also some older models, right? But we again see really similar results. You can see the prompting technique helps, and then the training helps.

And again, you can bring Qwen 2.5 7 billion, with the training and the reflection, up to over 50%, which handily beats Qwen 72 billion and gets close to the performance of Palmyra X4, which is cool. Awesome. Another thing we wanted to look at, as sort of a side effect or a desirable property, was catastrophic forgetting, investigating how much of it we saw, and we luckily saw very little.

This is a lot of numbers and I'm not going to walk through them too much, but at a high level, we were seeing that there was not much of a performance drop, and in particular not a statistically significant one, across a wide variety of tasks, even though you're doing this approach to fine-tuning.

And we credit that a little bit to how general our fine-tuning is, right? Because again, we're incentivizing self-reflection and reasoning in general as opposed to specific answers on a specific task. We were seeing relatively low catastrophic forgetting, which is cool, and it's evidence that you can train models in this way and still use them for lots of different things.

Cool. This is my last slide; I guess I got to it quite quickly. But as I mentioned, in May and June there has been a lot of work released on GRPO and adjacent methods, and it has kind of given us a lens for where we collectively should go from here.

And I just wanted to highlight some papers that came out in the couple of months since this paper was released that I thought were really interesting and cool. The first, Spurious Rewards, is, I believe, from Allen AI, and they showed that for Qwen models specifically, when you do RL training, the reward function doesn't necessarily need to be correlated with the right answer in order to get strong mathematical reasoning.

And this was speaking, I think, to a general potential under-training of Qwen models, and also to this idea that we need to be investigating more carefully what it is we're surfacing, and how, in model performance through RL. I think it really speaks to what is actually going on here, because there's this one research angle, which is let's design really good reward functions, and they kind of showed that you actually maybe don't need to design really good reward functions, especially for certain models.

And it's worth noting that these results primarily held for Qwen and didn't really hold for Llama and other models. So potentially it's something specific that Qwen is doing, but yeah, it's a cool paper. The next one I thought was really interesting; the next two are basically about alternate approaches to reward.

And the second one is about using self-certainty as a reward signal. So you can basically get rid of verifiers by simply using the model's own confidence, and this is very related to what we did, right? So I thought it was a really cool paper. And then the last paper uses RL to directly maximize the probability of generating a reference answer.

So again, they have a technique where they're not using a verifier at all. And so I think it's showcasing that we can build upon GRPO to cut out even more models and even more verification steps.

And I think that's really promising and interesting. Yeah, I think that's everything I had for you all. Happy to take questions, and happy to just have a greater discussion about GRPO, reinforcement learning, self-improvement, all of the things we mentioned here. So yeah, thanks. Yes, Chris? Yeah, I found it interesting how you also looked at the Spurious Rewards: Rethinking Training Signals in RLVR paper.

I saw that when you did the benchmarks, you used models other than Qwen for the improvements. Was that to account for the issue from that paper, where they showed how the Qwen models were mostly what these RLVR papers focused on, to prevent that mistake? Yeah, so this Spurious Rewards paper actually came out around or after when our paper came out.

So I wasn't aware of this research, but I do think that in general, we wanted to show that our technique holds across different model families, different sizes of models, all of those things, right? Just for like rigor and like to sort of prove out the technique. And I think in general, I appreciate a lot when papers do this to just show that it's not something specific to a particular model family or a particular recipe that one company was using.

But yeah, we weren't aware of this at the time. I think Ted had a question. I don't know, Ted, if you want to come on camera and ask it, or Mark can read it aloud for you. Yeah, so specifically, you mentioned this other Spurious Rewards paper, and I can't remember where, but I thought I saw somebody posting that different RL papers had baselines that were lower than what the model authors were able to achieve.

So the baselines used in the papers were suboptimal. And basically, if you use the model suboptimally and then use RL, you can sort of correct for your suboptimality, but you're not actually improving the performance. For this one in particular, which I thought had a very interesting conclusion, saying that you don't really need great rewards, maybe that result was more about fixing suboptimality versus improving reasoning.

I don't know if you've seen that discussion. Yeah, I haven't, but that sounds interesting, and I would love to get a link to that. Yeah, sorry, I don't have the reference at my fingertips. You're good. So, with the TRL framework that applied GRPO to the model, is that like fine-tuning?

Is that like directly updating the weights, or is it some kind of wrapper on the model, or how does that GRPO trainer work? Yeah, so we are directly updating the weights. Okay. So, yeah, generally speaking, you do need an open-source model because you are directly updating weights. Got it.

Thank you. Hey, Shelly, thanks for the great presentation. Thanks for joining us. I was curious, I don't think I saw anything about this in the paper, but maybe you've thought about this. You know, it seems like the larger goal is to be able to sort of continuously update the model as new information comes in and maybe distribution shift and so forth.

Have you thought about, and you know, the catastrophic forgetting part is maybe addressing that to some extent. Have you thought about like what happens when you do many rounds of this kind of like self-reinforced training? And maybe you did some experiments. I'm curious to hear your thoughts. Yeah, that's a really good question.

I do think that is a very natural extension of this work. And one of the ideas that we talked about internally was switching between rounds of RL and SFT, right? Which is kind of a proven thing: do some SFT, do some RL, and then keep doing that over and over again, because you elicit slightly new behavior with each round that way.

What we were seeing as we were training, so we kind of let the models over-train a little bit, right? Like, we let them take a dataset and run until they converge, and we were seeing that they leveled out and potentially started to get a little bit worse, usually around, I mean, there are exact sample numbers in the paper, about 20 to 25,000 samples, though probably we could have been even more sample efficient if we tried, and about a thousand steps.

So my sense is that you probably need to interject something more in between those rounds of RL in order to squeeze out more performance, because I feel like I've seen some reports where GRPO will indefinitely go up, and I haven't seen that be the case. Yeah, and in terms of distribution shift in general, or how catastrophic forgetting changes, we weren't able to run any super long-term experiments, but I think that would be a very interesting extension.

Great, thank you very much. I think we have a question from Xiaobai. I don't know, Xiaobai, if you want to voice over your question? Yeah, thanks. I think the question I have is, it seems like multiple tries are very dependent on the temperature. So I guess at temperature zero, no matter how many times you retry, you won't actually get a higher success rate.

So I wonder, have you done any analysis on how actually temperature is going to impact your experiment result? Yeah, that's a really good point. I believe I would have to go back. It's been a couple months since I looked at the code. I believe I set temperature to like 0.9 or something like that.

or maybe a little lower than that. But yeah, you definitely need some variance in your results, because especially in the countdown case, you need your model to explore multiple paths so that you can reward the right one, right? You don't want it to fail the same way multiple times because it always goes down the exact same reasoning path no matter what; you need to elicit enough branches that something ends up in a success.

We didn't do any specific experiments playing with the temperature a lot and seeing how that changes things. But intuitively we did agree that if you set temperature to zero and it's very deterministic all the time, that isn't going to work. But yeah, I would be interested; in general, temperature with RL seems like something that we could all investigate a little bit more.

That makes sense. Yeah, thanks. Do you want to go ahead with your chain of thought, your slew of questions? Yeah, let me finish clicking this button maybe. Okay, that should do it. So, let's see. For the technique that you used to find the thing that you're training for, I didn't quite catch it all the way, but it sounds like with what we were doing, the basic principle is you could just continue thinking forever and ever,

and it turns out that the more length we train for, the better performance we tend to get; that has so far kind of been the verdict. And so I'm curious, for this technique, I guess I should go back to the paper a little bit. Oh yeah: how do we do better if we fail?

So for the self-reflection thing, the basic principle here, let's see: generate output, success, do nothing; fail, generate self-reflection, retry task, same conversation. Okay, so you do a think block and then a task attempt, and then if the task attempt is insufficient, you do another think block and another task attempt. And that's all in the same window? Yeah.

And that's all in the same window. Yeah. and then you reward until you get it or do you do it? So, so yeah, it's just one, one extra step. So it's two task retries total. Um, and there's actually not, we weren't specifically using, uh, reasoning models. Uh, so, so there isn't an explicit think block that this is just like for, uh, I mean, yeah, this was a lot of these models were released like early 2024, like reasoning models weren't as much of a thing at that point in time.

So, yeah. Do you have any intuition on how, or if, this kind of technique might apply to reasoning models and the reasoning RL paradigm? Yeah, that's a good question. My intuition is that similar approaches are already being used to train good reasoning.

And so I don't know how much this would necessarily build upon a reasoning model that has already been hyper-optimized to, for example, probably self-reflect really well. That's probably a side quest, or an adjacent RL target, relative to the reasoning rewards that, generally speaking, these large labs use.

So my sense is it will be less effective on reasoning models, because this is almost an approach to incentivizing good reasoning in pre-reasoning models. Yeah. Well, I think that's the interesting part to me, largely because with the reasoning models, or at least from what I've seen, if you just train them to think longer, that's really all we're doing; the performance we get is just more think tokens or fewer think tokens, essentially.

And then there seems to be some contention on whether the content of those think tokens actually matters. There are a couple of papers that sort of indicate, no, these reasoning traces can actually be completely incoherent, but you still get the performance increase.

And then some papers that say, oh, if the reasoning traces change in this particular way, then you see a more significant performance increase than you otherwise would. And I'm trying to get to the bottom of that. So yeah, I guess it doesn't quite apply here since we don't have a reasoning model to work with, but I would be interested to basically just run this identical pipeline, except with a reasoning model, because you would have reflect, task, output, fail, and then the second self-reflection there might have some weird stuff in the reasoning block. That's where I'm headed, I think.

But yeah, I think that does it for me, basically. Other than, well, I guess it also applies to reasoning, but there's been some interesting stuff, namely papers like Absolute Zero, and then I think Sakana has another one along the lines of, hey, we secretly don't even need teacher models.

But I think those all technically apply to reasoning models. So, did you see any unexpected behavior in the self-reflection block, or when you're reading it, does it all make sense? Like, oh, okay, it looks like the model is self-reflecting on this and then it generates a new output that is correct, that kind of thing.

Yeah. To your first point, before I answer this question: I think this falls very neatly into the greater research area of meta prompting, and I think that's really cool, because we're telling the model a very specific way to use its tokens, where we're saying, create a self-reflection.

But a lot of where I would love to head with this is: what if you just give the models more tokens in general, and lots of different ways to use them? What is it about self-reflection specifically, or is there nothing specific about self-reflection and it's just more thinking space?

What gets the job done, period? Maybe you just give it more tokens. So yeah, curious about that; hopefully over time I'll have some bandwidth to pull that apart and think about how it relates. I think your question was... sorry, I'm blanking on your question.

No, you're good. I don't remember it now either; I just had a different thought. So: task, generate output, succeed, do nothing, and then on the other side we have the chain. For the benchmarking that you did afterward, did you have the model that you were benchmarking against try twice as well, or did you measure performance on both first tries?

Okay, so you did have it just fire two attempts, without the training, as opposed to the other model that got trained to have two. So for the model you're benchmarking against, does it have both of its attempts in the context? Is the experiment identical?

Okay. So the model that doesn't have training has two attempts in the window, and the model that does have training also has two attempts in the window.

And model number two was trained to have two attempts, and so is being rewarded for getting it right on the second attempt. Did you benchmark, or see, any meaningful difference in attempt number one? It makes sense to me that if you're training for the shape of two attempts and succeeding, then it should be better at that. Yeah, so just to unravel this table a little bit more.

Because it's a lot of numbers. Okay. So if we compare vanilla first try to trained first try, that's seeing how much better the model got at pass at one through this training process, right? So in the first line, Qwen 2 1.5 billion goes from 32.6% on a first task try to 48.6% after training, right?

So ignore the reflection, second try columns; just look at the first and third columns. That's how much it got better at the task itself, which is actually a really interesting result, because we never directly incentivized the model to get better at the task. We explicitly only rewarded the self-reflection tokens, right?

So we were never directly rewarding the model's answers and saying, you know, get better answers. I think there is some stuff that falls out of the Spurious Rewards paper that helps explain this a little bit, where it's like, hey, to some extent exposing the model to the data is what matters;

the reward actually matters way less than we think it does, specifically for Qwen models. And then another way to think about this table is that there are two more columns we could have had here, which are basically the diff between the vanilla first try and the reflection second try, before training and after training, right?

So if you look at the first row again, you can see that we go from 32.6% pass at one on the vanilla model, and when you give it that second try, it goes up to 34.8%, right? So that's 2.2 points better. And if we go to the two columns on the right, we go from 48.6% to 52.9%, which is more than 2.2;

it's 4.3 points, right? So the model has gotten better at utilizing the self-reflections; the self-reflections are better, and that second try is right more of the time now. We went from reflection and a second try giving us a 2.2-point improvement to that same prompt, by the end of training, giving us a 4.3-point improvement on the metric.
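Spelling that out from the quoted numbers: the retry was worth 34.8 - 32.6 = 2.2 points before training, versus 52.9 - 48.6 = 4.3 points after training, so the trained reflections roughly double the payoff of the second attempt.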

So that's kind of how you can see that the self-reflections are, quote unquote, objectively better. I'm curious, did you evaluate what a good self-reflection is? How did you deduce that? Yeah. So I think for us, we really looked at it qualitatively, because it is interesting to see what self-reflections lead to that improvement, right?

And we didn't encourage any specific format or anything. We just rewarded it when a self-reflection led to success and saw what happened. And actually, I think this relates to one of the earlier questions: the self-reflections themselves aren't in clean language; they mix languages a lot, like a lot, a lot.

And sometimes it was just pure gibberish for some models. So there is definitely evidence that more thinking space is kind of what the models need, as opposed to these very parsable, human-legible self-reflections. But qualitatively they do sort of seem like, quote unquote, better self-reflections, or more effective and more efficient over time, in many cases.

So that was kind of cool to see, but because we didn't have any format constraints, yeah, sometimes it was gibberish, particularly, again, with certain models, and the language mixing stuff is very interesting. I think Frankie has their hand up. I think it'd be helpful to go back to the tasks: the success field, what actually goes there?

Um, what, what's actually put there? Yeah. So, um, I think it'd be helpful to go back to tasks. So for the function calling dataset, uh, we cheated a little bit in that we like, you theoretically don't need a ground truth dataset, right? You could do something where it's a simple as like, did this function call, uh, create a request that when it hits an API, it gets you back like a two-one.

Uh, like those sorts of things, but in our case, we did actually check to see if the correct answer is the answer from the ground truth dataset because we were like using a, uh, SFT dataset. But generally speaking, for function calling, if you have any sort of binary reward checker, like any way of saying, like, I think this was a good function call versus a bad function call, you should be able to like, um, do this with countdown.

With countdown, this was a little bit more of like a true verifier, um, because, you know, like many math questions, it's very easy to check if a particular equation that the model has generated evaluates to the right number, but it's like hard to generate all the answers, right? Like there's many possible answers.

So what we did here is quite literally ran, like, so we checked to make sure like the numbers that were allowed were the numbers in the equation that the model wrote. Um, and then we just ran eval on it and like, saw if it hit the target number. So it was just like a very basic, like evaluate the function and see if there's success.
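A minimal version of that countdown check might look like the following; this is a sketch of the idea, not the paper's verifier code.

```python
import ast
import re

def verify_countdown(expression: str, allowed_numbers: list[int], target: int) -> bool:
    """Binary reward for countdown: the expression may use each allowed number at
    most once, only + - * / and parentheses, and must evaluate to the target."""
    # Reject anything other than digits, the four operators, parentheses, spaces.
    if "**" in expression or not re.fullmatch(r"[\d+\-*/() ]+", expression):
        return False
    # Every number in the expression must come from the allowed list, used at most once.
    remaining = list(allowed_numbers)
    for token in re.findall(r"\d+", expression):
        value = int(token)
        if value in remaining:
            remaining.remove(value)
        else:
            return False
    # Evaluate the arithmetic expression and compare against the target.
    try:
        result = eval(compile(ast.parse(expression, mode="eval"), "<expr>", "eval"))
    except (SyntaxError, ZeroDivisionError):
        return False
    return abs(result - target) < 1e-9

# Example: "2 * 3 + 7" succeeds for numbers [2, 3, 7] and target 13;
# "7 * 2" fails because it evaluates to 14.
assert verify_countdown("2 * 3 + 7", [2, 3, 7], 13)
assert not verify_countdown("7 * 2", [2, 3, 7], 13)
```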

Yeah, can you go back to the diagram? Sorry, I just wanted to follow along; just a couple of questions there. So on that first block to the right of fail, generate self-reflection: what are you adding directly there when you get a failure? So we have a prompt that is in the paper.

There are prompt templates at the bottom of the paper; it's just a prompt. And then it generates a self-reflection, and then you prompt the question again; you put the original question back in to retry. Got you. Okay. So then, since you follow the success path of the second try, that is, your verifier knows that you got the correct answer,

you're then going to reward the self-reflection tokens. So which tokens specifically are you rewarding? Just the fact that after that first fail you generated some new tokens, that's what you're rewarding for that particular path? Yeah, exactly. So after the failure, we prompt for self-reflection, and the prompt is something like: try again, you got the wrong answer, please reflect on what you did wrong

so you can get the right answer next time; the prompt is something like that. And then whatever the model answers directly after that, that's exactly what we reward. Okay. And you did not do anything with the failed path? No, not at all. Yeah.

Why not? Because you feel like it's not useful to give a negative reward or something? Yeah, pretty much. I mean, I think we found pretty early on that this very simplistic reward formulation worked quite well, and so we didn't do a lot of work on exploring alternate reward formulations.

Also, I think as a team we just really like these simple, intuitive approaches, right? Fail, retry, success is a good thing, right? Fail, retry, fail isn't necessarily a bad thing, because maybe the question is impossible for everyone.

Right. Yes. So I'm just curious to think about the logic here, meaning that you're rewarding something that comes as a result of the follow-up prompt that says, you failed, please try this problem again, and you're rewarding that particular path. So I'm trying to understand, what is it?

Sorry, maybe I don't understand. You did this, but then in your table, you said, oh, we did this training, and that first trained pass went from about 32 to 48%, right? So I'm trying to understand, what do you think is actually going on there?

Because you're rewarding something that doesn't quite... yeah, and I want to understand. Yeah, that's a really good question. I think that we don't super know. And I think some of the new papers that have been coming out recently also speak to the fact that, in general, when we do RL, we don't super know what's going on.

And in particular, I think this first paper starts to partially unlock what might be going on, in that, interestingly, reward functions don't have a lot to do with whether or not these models get better, right? The exact formulation of the reward sort of doesn't matter: even if it has very little correlation with the right answer, even if we're rewarding other tokens, it still seems to work.

It's just kind of exposure to the data, and any sort of reward sometimes leads to pass at one improvement as well. So can I conjecture that, really, when you're rewarding that second attempt, you're just rewarding a better response, right? Because you actually put the same question to it again.

So you're just rewarding the fact that... Sorry, maybe I've misunderstood the main point that you're bringing up.

Can you repeat that again? Okay, sorry. So going back to my question, it feels like you're just rewarding for a better answer in some sense, because the same question was posed on the retry and it gave a response, and you like that response better than the first one.

You're kind of rewarding the fact that it got closer to the answer, because you're only rewarding on the success path, right? So, yeah. And I think, generally speaking, yes, although a thing that is still kind of interesting is that we are not rewarding the answer tokens directly. But I do think in practice what happens is that self-reflections often have the right answer somewhere in them.

And so the answer is leaking into the self-reflection, and then when we reward the self-reflection tokens, at times we are rewarding the answer, because a lot of the self-reflections answer the question. Yeah. Sorry to interrupt again, but have you noticed that the responses actually got maybe longer, or do you have any metrics to figure that out?

Like, after you did your training, what's the quality of the responses in that 48% column? Have you done any simple metrics on those? Honestly, I haven't, but I should. There is an error analysis section of the paper that mostly discusses how errors have changed pre- and post-training,

like what sorts of mistakes models make, which I found pretty interesting. But no, something like speed to response, how many tokens it takes to get the right answer, hopefully we would see that go lower over time. So yeah, that'd be cool to look at.

Well, yeah, if you saved that information, if you still have the traces of the runs, I would be quite interested to see. What I would be inclined to check is just the raw token count, because my suspicion would be that if that number is larger, then you're going to see the perf increase,

and if it's lower, then you don't, would be my guess. Yeah, that makes sense. But an alternative hypothesis here is that the self-reflection induces the model, early in the answer process, to use the language from the self-reflections in its initial response,

and that then induces a better reasoning process about the answer. I mean, that would be what you would hope would happen. So maybe that is a credible alternative hypothesis here. For the initial prompt, did you ask for a chain of thought first, or just say, one-shot this for me, kind of thing?

Yeah. These were all one shot, because a lot of these models were early enough that chain of thought prompting, or models specifically optimized for chain of thought prompting, wasn't as much of a thing yet. Our function calling dataset was from June 2024,

and so we were using Qwen 2 models, which are not necessarily super optimized for reasoning. The issue is that a lot of newer models were trained on this dataset, right? When you have a really high quality dataset and it's open source and public, model companies just swallow it up pretty quickly.

And then all of your results are skewed, because it's like, okay, we're training on data the model has already seen, and so how much is this training recipe actually valid versus just reinforcing something that already exists from SFT? So we wanted to keep the data really pure.

And so, yeah, these are slightly older models. So, Shelly, I had another question, directly from the paper. In section 4.3, you're talking about the decision to emphasize the failure-only path.

And you say in the last sentence that it is otherwise functionally equivalent to learning from a real-world scenario where we believe we receive both successful and failed responses. And that seems unintuitive to me, because it seems like if you use the successes, you're going to be maybe overtraining on the successful responses and therefore you might have more catastrophic forgetting.

So I wanted to hear what you have to say about that. Is section 4.3 the part that discusses the failure dataset? Yeah, exactly. Okay, cool. So for context for everyone here, because I didn't really talk about this: one of the things we did to make GRPO training a lot more tractable was to pre-generate a dataset of failures, right?

So this whole top half you can do offline, and then you just do GRPO on the second half, right? And the reason we did that is because we were seeing that if you run full-scale GRPO with this entire pathway, a very low number of trajectories specifically hit fail, retry, success initially,

and so it was incredibly resource intensive for what it is. So instead, what we did was offline-store the task prompt and output pairs in the case of failure. And what we gauged is basically that this is, in our opinion, functionally equivalent; the way we were training this was very similar to if you hadn't done this offline step. But you're right, RJ.
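A rough sketch of that offline failure-collection step; the model name, sampling settings, and attempt count here are assumptions for illustration, not the paper's exact setup. The idea is to sample several first attempts per task at a nonzero temperature, keep only the ones the verifier rejects, and use those task and failed-attempt pairs as prompts for the GRPO stage.

```python
# Illustrative sketch of pre-generating the failure dataset offline. The model
# name, temperature, and attempt count are assumptions, not the paper's values.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

def collect_failures(tasks, verify, attempts_per_task=8):
    """Return (task, failed_attempt) records to use as prompts in the GRPO stage."""
    failures = []
    for task in tasks:
        inputs = tokenizer(task, return_tensors="pt").to(model.device)
        outputs = model.generate(
            **inputs,
            do_sample=True,             # nonzero temperature so the attempts differ
            temperature=0.9,
            num_return_sequences=attempts_per_task,
            max_new_tokens=256,
        )
        attempts = tokenizer.batch_decode(
            outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        for attempt in attempts:
            if not verify(attempt):     # keep only the verified failures
                failures.append({"task": task, "failed_attempt": attempt})
    return failures
```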

And there are actually differences, because the model could drift over time, and we're anchoring on this offline dataset that was generated from the model at step zero, right? We're not adapting to the model potentially having new failures over time. And so there isn't a lot of preventative stuff in place here to prevent catastrophic forgetting with respect to other data points in the training dataset.

I think our sense was that that wasn't as big of a problem, especially since we saw low catastrophic forgetting in general. And then of course, when you evaluate, you see that, no matter what, you are strictly better at the task than you were before.

But I think it's definitely possible that for certain tasks you could see this thing happen where, as you train, things you were succeeding at before you somehow start failing at, and this offline dataset wouldn't capture that. You're correct; actually, that's a good point.

I was actually thinking the opposite: it seems like this is an important feature of your methodology and not just a practical one. You're basically saying, things that I used to get wrong and now got right by re-prompting,

and you're identifying that very specific subset as the important thing to train on. Whereas if you were to use the successes as well, then you wouldn't have honed in on that specific subset. Yeah, that makes a lot of sense. Like, we could have been rewarding first-try success as well, pretty continuously.

Yeah, with our approach we were really keen on this idea that we can do this kind of meta learning: don't specialize for a specific task, just incentivize self-reflection. And if we were rewarding initial successes, we'd be rewarding the task; we're not rewarding self-reflection ability.

But I agree that there are a lot of ways to extend this, to reward both the task and self-reflection capability, and hopefully see both things get better, and potentially get better at the task faster. Cool. Ted, go ahead. Hey, thanks again, Shelly, for joining and coming and discussing this.

I hope this is a quick question. Can you say how you formed your batches when you were doing GRPO? Did you mix the original and the new successes, did you just randomly permute or shuffle them, or do anything special? It was pretty random, honestly.

I think we stuck to between eight and 14 generations per failure. So for task to generate output to fail, what would happen is, for each task, the number of times we generated pathways for that first attempt varied depending on model capability, right?

For smaller models you give them fewer tries, because more of them are actually failures. So we gathered the failure dataset by just generating a bunch of times and saving the ones that were failures. And then with the actual GRPO training, I would say nothing special: between eight and 14 generations, not a lot of mixing, pretty standard GRPO training, especially since, again, this was around February, March.

And I feel like we, as a community, were still figuring out GRPO, or at least I personally was. There's also this thing in the paper where we kind of capped it at less than 10 billion parameters, and there wasn't infrastructure to train on more than that; there was no multi-node implementation of GRPO publicly available until after I ran these experiments.

So yeah, it's a process for sure. I'm sure there are many papers that have come out since that would optimize the GRPO approach specifically. Yeah. Cool. Thanks. Okay, we have one more question. Vishvesh? Hi, Shelly. Thank you for the presentation.

I was thinking about your motivation, that you want to learn the self-reflection process rather than the specific task. So, looking at this particular experimental setup, do you think a good ablation would be to mix chosen and rejected pairs for this setup, where the first failure and the eventual success form a chosen and rejected pair, and then do direct preference optimization to compare whether it is really focusing on self-reflection and not on the task? Some form of ablation with other post-training, to compare whether just rewarding the self-reflection tokens is the best way forward.

Okay, so I didn't quite catch all of that, because I think maybe your connection or my connection wasn't amazing for a bit there, but what I caught was basically ablation studies comparing rewarding self-reflection specifically to rewarding both self-reflection and the task, or just rewarding the task itself, and seeing how performance changes.

Yeah, I super agree that that would be an interesting ablation. I think we pretty intentionally stepped away in this paper from directly comparing to things like other reward functions, and instead approached it as, let's compare to larger models, right? Let's use our baselines to ask how far this training recipe can bring us up toward bigger models.

But I super agree that head-to-head, this approach versus standard GRPO where you reward the answer, or a combination of both, or whatever, that would be very interesting, and hopefully something the Writer team, or anyone else, can get to in the next few months.

Cool. Awesome. Just one more point: do you plan to make the code open source at some point, maybe after you've submitted it somewhere? Yes, the goal is definitely to make the code open source. We are waiting on a few things before we do that.

But we did actually document within the paper, in my opinion relatively well, hopefully, how we did this; it's actually a pretty straightforward modification to open source libraries. So I definitely encourage people to just try it out and implement it, and of course email me if they have questions; I'm happy to help.

But hopefully all the pieces are there so that if you would like to implement it on your own, you can, and eventually, hopefully, we will release the code as well. Thank you so much. Awesome. Sam, do you want to close this out? I don't have much to say.

Just thanks. Thanks a lot, Shelly, I really appreciate it. And thanks, everybody, for joining and asking such great questions. And also vote on Hugging Face, apparently. Yes, also vote on Hugging Face. I didn't know they had voting. Yeah, they have paper of the day.

It's fun. Yeah, it's paper of the day, and then paper of the week, and then paper of the month. So, shameless self-promotion. There's still time. All right, you got two or nine votes now. Okay, well, I'll drop the links in the YouTube. Thanks, everyone.

Thanks everybody. Thank you, everyone. Bye. Bye. Bye.