Reflect, Retry, Reward: Self-Improving LLMs

And everyone can see my slides and everything. 00:00:11.540 |
It looks like a paper, but I guess it's a slide. 00:00:13.700 |
Yeah, this is just cut and paste to the first slide. 00:00:33.940 |
and thinking about a lot in the last six to nine months 00:00:37.360 |
has been self-evolving large language models, 00:00:44.960 |
across the board for lots of different companies, right? 00:00:50.060 |
How can we functionally fine-tune small large language models 00:00:59.360 |
to learn from their mistakes and become more and more useful? 00:01:05.080 |
that I always use when I'm doing less technical talks 00:01:07.280 |
about the subject, or I talk about these sorts of tasks 00:01:23.200 |
but then also like us in our everyday lives feel. 00:01:26.920 |
So like, for example, one of the examples I give, 00:01:30.000 |
I recently moved and so I have a new apartment 00:01:35.240 |
And so I asked Claude Sonnet 4, basically, 00:01:40.780 |
can Claude put furniture onto the floor plan for me, right? 00:01:44.260 |
Like, put a bed here that's the right size, and so on. 00:01:46.840 |
And like, you know, the bed ends up in the bathroom 00:01:52.960 |
And it's like one of those things where it's like, 00:01:59.560 |
And so, you know, when we think about self-evolving 00:02:18.820 |
have a greater discussion about self-improvement, 00:37:06.800 |
So, um, I think it'd be helpful to go back to tasks. 00:37:10.260 |
So for the function calling dataset, we cheated a little bit, in that you theoretically don't need a ground truth dataset, right? You could do something as simple as: did this function call create a request that, when it hits an API, gets you back a 200? 00:37:19.200 |
Those sorts of things. But in our case, we did actually check to see whether the answer matched the answer from the ground truth dataset, because we were using an SFT dataset. 00:37:32.540 |
But generally speaking, for function calling, if you have any sort of binary reward checker, any way of saying, I think this was a good function call versus a bad function call, you should be able to do this. 00:37:56.780 |
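To make the binary reward idea concrete, here is a minimal sketch of a ground-truth-based checker in the spirit described above; the function name and the JSON call format are illustrative assumptions rather than the paper's actual code, and a live-API variant could instead check for a success status code.

```python
import json

def function_call_reward(model_output: str, expected_call: dict) -> float:
    """Binary reward: 1.0 if the model's function call matches the ground-truth
    call, 0.0 otherwise. Parsing as JSON makes argument order irrelevant."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return 0.0  # unparseable output counts as a bad call
    same_name = call.get("name") == expected_call.get("name")
    same_args = call.get("arguments") == expected_call.get("arguments")
    return 1.0 if (same_name and same_args) else 0.0
```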
With countdown, this was a little bit more of a true verifier, because, like many math questions, it's very easy to check whether a particular equation the model has generated evaluates to the right number, but it's hard to generate all the answers, right? 00:38:15.560 |
So what we did here was quite literal: we checked to make sure the numbers in the equation the model wrote were among the numbers that were allowed. 00:38:24.400 |
And then we just ran eval on it and saw whether it hit the target number. 00:38:28.060 |
So it was just a very basic: evaluate the expression and see if it succeeds. 00:38:34.400 |
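A minimal sketch of that kind of Countdown verifier is below, assuming the model's answer is a bare arithmetic expression and that each allowed number may be used at most once; the exact parsing and reuse rules in the paper's code may differ.

```python
import ast
import operator

# Allowed binary operators for Countdown expressions.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def countdown_reward(equation: str, allowed: list[int], target: int) -> float:
    """Binary reward: 1.0 if the equation uses only allowed numbers (each at
    most once) and evaluates to the target, else 0.0."""
    try:
        tree = ast.parse(equation, mode="eval")
    except SyntaxError:
        return 0.0
    used: list[int] = []

    def ev(node):
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            used.append(node.value)
            return node.value
        raise ValueError("disallowed syntax")

    try:
        value = ev(tree.body)
    except (ValueError, ZeroDivisionError):
        return 0.0

    pool = list(allowed)  # treat the allowed numbers as a multiset
    for n in used:
        if n not in pool:
            return 0.0
        pool.remove(n)
    return 1.0 if abs(value - target) < 1e-6 else 0.0

# Example: countdown_reward("(100 - 4) * 5 / 2", [100, 4, 5, 2], 240) -> 1.0
```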
I just wanted to follow along with just a couple of questions there. 00:38:36.520 |
So, on that first block to the right of fail, generate self-reflection: 00:38:40.480 |
What are you adding directly there when you get a failure? 00:38:47.960 |
There are prompt templates at the bottom; it's just a prompt. 00:38:51.340 |
And then it generates a self-reflection, and then you prompt the question again. 00:38:55.300 |
You put the original question back in to retry. 00:38:59.380 |
So then, since you follow the success path on the second retry, that is, your verifier knows that you got the correct answer. 00:39:12.360 |
And then you're going to say reward the self-reflection tokens. 00:39:15.300 |
So which tokens specifically are you rewarding? 00:39:18.240 |
Is it just the fact that, on that first fail, you generated some new tokens from that? 00:39:23.340 |
Is that what you're rewarding for that particular path? 00:39:28.800 |
So after the failure, we prompt for self-reflection, and the prompt is something like: you got the wrong answer, please reflect on what you did wrong 00:39:39.340 |
so you can get the right answer next time. The prompt is something like that. 00:39:42.360 |
And then whatever the model answers directly after that, that's exactly what we reward. 00:39:48.360 |
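As an illustration of that loop, here is a minimal sketch of one fail, reflect, retry episode; the prompt wording, the function names, and the way the prompts are concatenated are assumptions for illustration, not the paper's exact templates.

```python
from typing import Callable, Optional

# Illustrative reflection prompt; the actual template wording may differ.
REFLECTION_PROMPT = (
    "\nYour previous answer was wrong. Reflect briefly on what went wrong "
    "so you can answer correctly next time.\n"
)

def reflect_retry_episode(
    generate: Callable[[str], str],   # the policy model: prompt text -> completion
    verify: Callable[[str], bool],    # binary verifier for this task
    question: str,
) -> Optional[dict]:
    """Run one fail -> reflect -> retry episode and return what gets rewarded."""
    first = generate(question)
    if verify(first):
        return None                   # first-try success: nothing rewarded here

    reflection = generate(question + first + REFLECTION_PROMPT)
    second = generate(question + first + REFLECTION_PROMPT + reflection + "\n" + question)

    if not verify(second):
        return None                   # fail -> fail: no reward signal

    # Success after reflecting: only the self-reflection tokens receive reward
    # (in GRPO terms, the advantage is masked onto the reflection span).
    return {"rewarded_text": reflection, "reward": 1.0}
```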
And you did not do anything with the failed path? 00:39:54.860 |
Because you feel like it's not useful to give a negative reward or something? 00:40:03.140 |
I mean, I think we found pretty early on that this very simplistic reward formulation worked quite well. 00:40:08.000 |
And so we didn't do a lot of work on like exploring alternate reward formulations. 00:40:12.380 |
Because I think this one also feels very intuitive. I think, as a team, 00:40:17.360 |
we just really like these simple, intuitive approaches, right? 00:40:20.720 |
Like, fail, retry, success is a good thing, right? And fail, retry, fail 00:40:25.640 |
isn't necessarily a bad thing, because maybe the question is impossible for everyone. 00:40:32.760 |
So I'm curious about the logic here, meaning that you're rewarding something that comes as a result of the follow-up prompt that says you failed, right? 00:40:52.880 |
Like, you did this, but then in your table you said, oh, we did this training. 00:40:59.300 |
And that first training pass, like, went from 32 to 48%, right? 00:41:04.520 |
So I'm trying to understand: what do you think is actually going on there, right? 00:41:08.460 |
Because you're rewarding something that doesn't quite... 00:41:21.680 |
And I think some of the new papers that have been coming out recently also speak to the fact that, in general, when we do RL, we don't really know what's going on. 00:41:29.440 |
And in particular, I think this first paper starts to unlock partially what might be going on, in that, interestingly, reward functions don't have a lot to do with 00:41:43.840 |
whether or not these models get better, right? 00:41:45.760 |
Like, even if the exact formulation of the reward has very little correlation with the right answer, just the fact that we're rewarding in the space of the right answer, even though we're rewarding other tokens, it sort of doesn't matter. 00:41:58.840 |
It's just exposure to the data, and any sort of reward sometimes leads to pass@1 improvement as well. 00:42:04.480 |
So can I conjecture that, uh, really when you're rewarding that second attempt, you're just rewarding a better response, right? 00:42:12.820 |
Because you actually put in the same question to it again. 00:42:17.620 |
Okay, sorry. Yeah. So going back to my question, it feels like you're just rewarding for a better answer in some sense, because 00:43:17.080 |
I guess the fact that, you know, the same question was posed on the retry, and it gave a response, and you like that response better than the first one. 00:43:27.220 |
You're kind of rewarding the fact that it got closer to the answer, because you're only rewarding on the success path. 00:43:36.660 |
Yeah. And I think, generally speaking, yes, although I think a thing that is still kind of interesting is that we are not rewarding the answer tokens directly, but I do think in practice what happens is often that self-reflections have the right answer somewhere in them. 00:43:54.340 |
And so the answer is leaking into the self-reflection, and so then, when we reward the self-reflection tokens, at times we are rewarding the answer. Because a lot of the self-reflections... 00:44:10.980 |
Have you noticed, sorry to interrupt again, have you noticed that the responses actually got maybe longer, or do you have any metrics to figure that out? Like, after you did your training, what's the quality of the responses in that 48% column? 00:44:26.820 |
Have you done any simple metrics on those? 00:44:30.540 |
Honestly, I haven't, but I should. There is an error analysis section of the paper that mostly discusses how errors have changed pre- and post-training, 00:44:40.820 |
like what sorts of mistakes the models make, which I found pretty interesting. But no, something like speed to response, how many tokens it takes to get the right answer? Hopefully we would see it get lower over time. So yeah, that'd be cool to look at. 00:44:55.220 |
Well, yeah, if you saved that information, if you still have the traces of the runs, I would be quite interested to see. Because what I would be inclined to check is just the raw token count. Because my suspicion would be that, 00:45:10.660 |
if that number is larger, then you are going to see the perf increase, and if it's lower, then you don't. That would be my guess. 00:45:19.700 |
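That check is easy to run on saved traces; a minimal sketch, assuming the Hugging Face transformers tokenizer and a placeholder model name:

```python
from transformers import AutoTokenizer

def mean_token_count(responses: list[str],
                     tokenizer_name: str = "Qwen/Qwen2-7B-Instruct") -> float:
    """Average response length in tokens, e.g. for comparing traces saved
    before and after training."""
    tok = AutoTokenizer.from_pretrained(tokenizer_name)
    lengths = [len(tok(text)["input_ids"]) for text in responses]
    return sum(lengths) / max(len(lengths), 1)

# Usage (hypothetical variable names):
# print(mean_token_count(pre_training_traces), mean_token_count(post_training_traces))
```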
I mean, an alternative hypothesis here is that the self-reflection induces the model, early on in the answer process, to use 00:45:33.780 |
sort of the language from the self-reflections in its initial response, right? And so that then induces... 00:45:45.940 |
So that, I mean, would be what you would hope would happen. So maybe that is, I think, a credible alternative hypothesis here. 00:45:56.500 |
For the initial prompt, did you ask for a chain of thought prior, or just say, like, one-shot this for 00:46:01.940 |
me, kind of thing? Yeah, these were all one-shot, because a lot of these models were early enough that, 00:46:07.300 |
yeah, chain of thought prompting, or models specifically optimized for chain of thought prompting, 00:46:14.020 |
wasn't as much of a thing yet. Because our function calling dataset was from June 2024. 00:46:19.620 |
And so we were using Qwen 2 models, which are not necessarily super optimized for reasoning, because 00:46:25.300 |
the issue is that a lot of these models were trained on this data set, right? Like when you have a really 00:46:30.020 |
high quality dataset and it's open source and it's public, model companies just swallow it up 00:46:35.140 |
pretty quickly. And then all of your results are skewed, because it's like, okay, we're training on data 00:46:38.340 |
that it has already seen. And so, how much is this training recipe actually valid versus just 00:46:41.860 |
reinforcing something that already exists that you SFT on? So we wanted to keep the data really pure. 00:46:47.300 |
And so, yeah, these are slightly older models. 00:46:49.060 |
So, Shelley, I had another question, directly from the paper. So, in section 4.3, 00:47:00.340 |
you're talking about the, you know, sort of decision to emphasize the 00:47:10.740 |
failure-only path. And you say in the last sentence that it is otherwise 00:47:19.060 |
functionally equivalent to learning from a real-world scenario where we believe we receive both successful 00:47:24.500 |
and failed responses. And that seems unintuitive to me, because it seems like if you use the 00:47:31.540 |
successes, you're going to be maybe overtraining on the successful responses, and therefore 00:47:38.340 |
you might have more catastrophic forgetting. So I wanted to hear what you have to say 00:47:46.500 |
about that. Is section 4.3 the part that discusses the failure dataset? 00:47:52.580 |
Yeah, yeah, exactly. Okay, cool. Yeah. So for context for everyone here, because I didn't really 00:48:00.900 |
talk about this. One of the things we did to make GRPO training a lot more tractable was we pre-generated 00:48:08.180 |
a dataset of failures, right? So this whole top half, you can do offline, and then you just do 00:48:18.420 |
GRPO on the second half, right? And the reason that we did that is because we were seeing that, 00:48:25.700 |
if you run full-scale GRPO with this entire pathway, there was a very low number of 00:48:31.380 |
trajectories that specifically hit fail, retry, success initially. And so it was really 00:48:37.860 |
incredibly resource intensive for what it is, right? And so instead, what we did was we just 00:48:42.180 |
stored offline the task prompt and output pairs in the case of failure. And what 00:48:53.700 |
we gauged is basically that, yeah, this is functionally equivalent in our opinion, or the way we 00:48:59.300 |
were training this was very similar to if you didn't do this offline thing. But you're right, RJ, 00:49:05.140 |
that there are actually differences, because the model could drift over 00:49:09.540 |
time, and we're anchoring on this offline dataset that was generated from the model at step 00:49:15.300 |
zero, right? We're not adapting to the model over time potentially having new failures. And so there 00:49:22.580 |
isn't a lot of preventative stuff in place here to prevent catastrophic forgetting with respect to other 00:49:33.700 |
data points in the training dataset. I think our sense was that that wasn't as big of a problem, 00:49:41.060 |
especially since we saw low catastrophic forgetting in general. And then of course, when you evaluate, 00:49:46.900 |
you see that, no matter what, you are strictly better at the task than you were before. But I 00:49:50.420 |
think it's definitely possible that for certain tasks, you could see this thing happen where, 00:49:53.780 |
as you train, things that you were succeeding at before you somehow start failing at. And this 00:49:59.140 |
offline dataset, you're correct, wouldn't capture that. Actually, that's a good point. 00:50:04.420 |
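For readers following along, here is a minimal sketch of that offline "top half", with assumed callables standing in for the model and the verifier; the real pipeline's storage format and sampling counts are only loosely described in the talk.

```python
from typing import Callable

def build_failure_set(
    generate: Callable[[str], str],   # policy model at step zero
    verify: Callable[[str], bool],    # binary task verifier
    tasks: list[str],
    attempts_per_task: int = 8,       # fewer attempts for smaller models, since more of them fail
) -> list[dict]:
    """Pre-generate first attempts offline and keep only the failures, so that
    GRPO later only has to run the reflect-and-retry portion of the pathway."""
    failures = []
    for question in tasks:
        for _ in range(attempts_per_task):
            answer = generate(question)
            if not verify(answer):
                failures.append({"question": question, "failed_answer": answer})
    return failures
```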
I was actually thinking the opposite: it seems like this is an important feature of your 00:50:09.460 |
methodology, and not just incidental. It seems like a functional feature of the methodology, 00:50:15.540 |
because you're basically saying: things that I used to get wrong, 00:50:21.860 |
I now got right by re-prompting, right? And so you're basically identifying 00:50:28.180 |
that very specific subset as the important thing to train on. Whereas if you were to use 00:50:34.340 |
the successes as well, then you wouldn't have honed in on that specific subset. 00:50:39.780 |
Yeah, that makes a lot of sense. Like, we could be rewarding first-try success as well, pretty 00:50:44.980 |
continuously. Yeah, I think with our approach, we were really keen on this idea that we can 00:50:53.060 |
do this meta-learning: don't specialize for a specific task, just incentivize self-reflection. 00:50:58.820 |
And so if we were rewarding initial successes, we're rewarding the task. We're not 00:51:02.980 |
rewarding self-reflection ability. But I agree that there are a lot of ways to extend this, 00:51:09.060 |
to both reward the task and self-reflection capability, and hopefully see both things get 00:51:14.180 |
better, and potentially you get better at the task faster. Cool. Ted, go ahead. 00:51:20.660 |
Hey, thanks again, Shelley, for joining and coming to discuss this. I hope this is a 00:51:26.580 |
quick question. Can you say how you formed your batches when you were doing GRPO? Did you mix 00:51:35.060 |
the original and the new successes, or did you just randomly permute them or shuffle them, or do 00:51:44.100 |
anything special? It was pretty random, honestly. I think we stuck to between eight and 14 00:51:51.380 |
generations per failure. So, for the (task, generated output, fail) entries, what would happen is, for each task, 00:51:59.780 |
I want to say the number of times we generated pathways for that first attempt 00:52:05.620 |
varied depending on model capability, right? Smaller models, you give them fewer tries, because 00:52:12.340 |
actually more of them are failures. So we gathered the failure dataset by just 00:52:16.420 |
generating a bunch of times and saving the ones that were failures. And then, yeah, with the actual 00:52:20.820 |
GRPO training, I would say nothing special: between eight and 14 generations, 00:52:26.820 |
not a lot of mixing, pretty standard GRPO training, especially since, again, this 00:52:32.340 |
was February, March. And I feel like we as a community were still figuring out GRPO, 00:52:35.860 |
or at least I personally was. And so there's also this thing in the paper where it's like, 00:52:41.140 |
oh, and we kind of stopped at less than 10 billion parameters. And there wasn't infrastructure to 00:52:44.980 |
train on more: there was no multi-node implementation of GRPO publicly available until 00:52:52.500 |
after I ran these experiments. So yeah, it's a process for sure. I'm sure that there are 00:52:57.940 |
many papers that have come out since that would optimize specifically the GRPO approach. Yeah. 00:53:05.220 |
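A minimal sketch of that batch formation, with `rollout` and `verify` as assumed callables and the group size as a parameter (8 to 14 generations per failure in the experiments described above); the normalization follows the standard GRPO group-relative advantage, not necessarily the paper's exact implementation.

```python
import random
from statistics import mean, pstdev
from typing import Callable

def group_advantages(rewards: list[float]) -> list[float]:
    """Standard GRPO-style group-relative advantages: normalize each rollout's
    reward by the mean and standard deviation of its own group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-6) for r in rewards]

def form_grpo_batch(
    failure_set: list[dict],
    rollout: Callable[[dict], str],   # one reflect-and-retry episode for a stored failure
    verify: Callable[[str], bool],    # binary success check on the retried answer
    num_failures: int = 4,
    group_size: int = 8,              # 8-14 generations per failure in the talk
):
    """Sample stored failures at random (no special mixing) and build one group
    of rollouts per failure, with binary rewards and group-relative advantages."""
    batch = []
    for failure in random.sample(failure_set, k=min(num_failures, len(failure_set))):
        group = [rollout(failure) for _ in range(group_size)]
        rewards = [1.0 if verify(g) else 0.0 for g in group]
        batch.append({"failure": failure, "rollouts": group,
                      "advantages": group_advantages(rewards)})
    return batch
```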
Okay. Yeah, of course we have, we have one more question. 00:53:09.860 |
Hi, Shelly. Yeah, thank you for the presentation. I was just thinking about, 00:53:14.980 |
I was thinking about your motivation that you want to learn the self-reflection process rather 00:53:22.500 |
than the specific task. So, looking at this particular experiment and this setup, 00:53:34.980 |
do you think a good ablation would be to make chosen and rejected pairs 00:53:45.060 |
for this particular setup, where the first fail and the later success make a chosen-rejected pair, 00:53:50.740 |
and then do direct preference optimization, to compare whether it is focusing on self-reflection and 00:53:59.460 |
not on the task? Like some form of ablation where I also have post-training, to compare 00:54:05.700 |
whether just rewarding the self-reflection tokens is the best way forward. 00:54:12.740 |
Yeah. I, okay. So I didn't quite catch all of that because I think maybe your connection or my 00:54:18.020 |
connection wasn't amazing for a bit there, but what I caught was basically ablation studies around 00:54:23.140 |
comparing, uh, rewarding self-reflection specifically to rewarding both self-reflection 00:54:29.380 |
and the task or just rewarding the task itself and seeing how performance changes. Um, yeah, 00:54:34.500 |
super agree that that would be an interesting ablation. I think like we pretty intentionally 00:54:37.940 |
in this paper stepped away from directly comparing to things like, uh, like other reward functions and 00:54:43.300 |
instead approached it as: let's compare to larger models, right? Let's have our baselines be, like, 00:54:47.860 |
how much can this training recipe bring us up toward bigger models. But I super agree that, like, 00:54:53.860 |
head-to-head, this approach versus standard GRPO where you reward the answer, or other variants, like 00:55:02.660 |
combining it with both, or whatever. Yeah, that would be very interesting, and 00:55:06.260 |
hopefully something the Writer team, or anyone else, can get to in the next few months. 00:55:10.900 |
Cool. Awesome. Just one more point. Do you plan to make the code open source at some point, 00:55:18.500 |
like maybe after you've submitted it somewhere, then you plan to do it? Yes, the 00:55:22.980 |
goal is definitely to make the code open source. Yeah, we are waiting on a few things before 00:55:29.540 |
we do that. But we did actually document, in my opinion, relatively well, hopefully, within 00:55:37.780 |
the paper, how we did this. It's actually a pretty straightforward modification to 00:55:41.940 |
open source libraries. And so I definitely encourage people to just try it out and implement 00:55:47.220 |
it. And of course, email me if they have questions, and I'm happy to help, yeah, super happy 00:55:52.420 |
to answer. But hopefully all the pieces should be there, where if you would like to implement it on 00:55:59.780 |
your own, you can. But yeah, eventually, hopefully, we will release the code as well. 00:56:17.620 |
Uh, I don't have much to say. Just thanks. Thanks a lot, Shelly. I really appreciate it. And 00:56:24.180 |
thanks everybody for joining and asking such great questions. 00:56:26.980 |
And also vote on Hugging Face, apparently. Yes, and also vote on Hugging Face. 00:56:31.220 |
I didn't know they had voting. That's, uh... 00:56:34.740 |
Yeah, they have paper of the day. Yeah, it's paper of the day, and then paper of the week, 00:56:38.820 |
and then paper of the month. So, shameless self-promotion. 00:56:41.780 |
Yeah, there's still time. Still got time. All right. You got two or nine votes now. 00:56:47.140 |
Okay. Well, I'll drop the links in the YouTube. Thanks, everyone. Thanks, everybody.