
Reflect, Retry, Reward: Self-Improving LLMs



00:00:00.640 | Cool.
00:00:01.040 | I'm going to record.
00:00:02.460 | You're welcome to set some context.
00:00:04.540 | Yeah, go ahead.
00:00:05.200 | Cool.
00:00:06.400 | And everyone can see my slides and everything.
00:00:08.640 | Well, I can see your page.
00:00:11.540 | It looks like a paper, but I guess it's a slide.
00:00:13.400 | Yeah.
00:00:13.700 | Yeah, this is just cut and paste to the first slide.
00:00:17.620 | Cool.
00:00:18.960 | Well, thank you all for being here.
00:00:20.080 | This is really exciting.
00:00:20.960 | And yeah, super excited to chat about this.
00:00:25.140 | So I'm Shelly.
00:00:27.180 | I'm one of our AI engineers and researchers
00:00:29.640 | at Writer.
00:00:30.160 | And one of the things we've been talking
00:00:33.940 | and thinking about a lot in the last six to nine months
00:00:37.360 | has been self-evolving large language models,
00:00:40.140 | self-improving large language models.
00:00:41.640 | And I think this is a term or phrase
00:00:43.300 | that's been coming up more and more
00:00:44.960 | across the board for lots of different companies, right?
00:00:50.060 | How can we functionally fine-tune small, large language models
00:00:54.040 | for specific customers in a way that adapts
00:00:57.360 | to their use cases so that they can learn
00:00:59.360 | from their mistakes and become more and more useful?
00:01:03.260 | And I have a couple of canonical examples
00:01:05.080 | that I always use when I'm doing less technical talks
00:01:07.280 | about the subject, or I talk about these sorts of tasks
00:01:10.120 | that you would expect a large language model
00:01:14.060 | to be able to do based on knowing that,
00:01:16.140 | you know, they've been trained on reasoning
00:01:17.860 | and math and coding and all of these things,
00:01:19.640 | but there are still so many gaps
00:01:21.540 | that enterprise customers,
00:01:23.200 | but then also like us in our everyday lives feel.
00:01:26.920 | So like, for example, one of the examples I give,
00:01:30.000 | I recently moved and so I have a new apartment
00:01:33.120 | and the leasing office gave me a floor plan.
00:01:35.240 | And so I asked Claude Sonnet 4 to basically take,
00:01:38.520 | I was like, okay, if I have this floor plan,
00:01:40.780 | can Claude put furniture onto the floor plan for me, right?
00:01:44.260 | Like put a bed here that's the right size, and things like that.
00:01:46.840 | And like, you know, the bed ends up in the bathroom
00:01:48.860 | and the couch is like sideways
00:01:50.160 | and like all these crazy things,
00:01:51.420 | like all the walls are moving around.
00:01:52.960 | And it's like one of those things where it's like,
00:01:54.400 | you know, it's not necessarily intuitive
00:01:56.920 | that this doesn't work, like it should work.
00:01:59.560 | And so, you know, when we think about self-evolving
00:02:02.040 | or self-improving models,
00:02:03.040 | what we're talking about is how can we,
00:02:05.200 | between big model releases,
00:02:07.360 | create large language models
00:02:08.920 | that can learn and adapt to these use cases
00:02:10.680 | that should be in distribution,
00:02:11.980 | but for whatever reason aren't.
00:02:13.260 | So I'm going to spend maybe 20 minutes
00:02:16.840 | walking through the paper
00:02:17.680 | and then hopefully we can just kind of
00:02:18.820 | have a greater discussion about self-improvement,
00:02:21.260 | about GRPO,
00:02:22.260 | like lots of cool research in this space
00:02:23.840 | in the last couple of months,
00:02:25.000 | especially after sort of the May conference
00:02:27.380 | submission deadlines.
00:02:28.120 | So, yeah, I'll just kind of get into it.
00:02:30.520 | So this is the paper that we released
00:02:32.860 | in late May, early June,
00:02:34.460 | called Reflect, Retry, Reward,
00:02:37.540 | Self-improving Large Language Models
00:02:39.200 | via Reinforcement Learning.
00:02:40.240 | And it's kind of motivated
00:02:42.980 | by what I was speaking about before,
00:02:44.400 | this desire to improve the performance
00:02:47.320 | of large language models
00:02:48.320 | under certain constraints.
00:02:49.540 | We were focusing on use cases
00:02:52.000 | where, as I said,
00:02:52.720 | tasks that all models do poorly on,
00:02:55.140 | generally speaking,
00:02:55.760 | even very, very large models.
00:02:57.120 | So, you know,
00:02:57.720 | you don't necessarily have other models
00:03:00.020 | that you can use as a judge
00:03:01.280 | or for synthetic data.
00:03:02.580 | So that's kind of how we ended up
00:03:03.800 | at Reinforcement Learning, right?
00:03:04.980 | We want just instead tasks
00:03:07.240 | that are easily verifiable,
00:03:08.400 | but that we don't necessarily
00:03:09.360 | have large ground truth data sets for
00:03:10.920 | or ways to use judge models
00:03:13.400 | or other synthetic data techniques.
00:03:15.360 | And we also stuck to
00:03:17.940 | binary reward settings, right?
00:03:20.140 | We wanted to keep the reward
00:03:21.020 | as simple as possible.
00:03:21.940 | And so we kind of ended up on tasks
00:03:24.840 | that have very simple expressions
00:03:27.480 | of reward.
00:03:28.000 | Let me see.
00:03:31.820 | Okay, right.
00:03:32.320 | So this is like, again,
00:03:33.980 | the sort of setting of
00:03:34.940 | what is it that we're trying
00:03:36.600 | to achieve with this paper?
00:03:38.080 | when you have this standard flow,
00:03:39.620 | you generate some output
00:03:42.020 | given a task.
00:03:42.780 | And if you succeed, that's great.
00:03:44.600 | But when you fail,
00:03:45.220 | you don't really have anything
00:03:46.180 | we can do to get better.
00:03:47.840 | Aside from collecting
00:03:49.120 | large amounts of data,
00:03:49.900 | generating large amounts of data
00:03:51.300 | and doing an SFT run,
00:03:52.320 | we want to think about
00:03:53.100 | how can we incrementally do better
00:03:54.760 | and have models that learn
00:03:56.780 | from their mistakes
00:03:57.460 | and are adaptive and evolve.
00:04:01.300 | So we were quite inspired
00:04:03.000 | by a 2023 work
00:04:05.680 | called Self-Refine.
00:04:08.120 | I believe it was like CMU
00:04:10.660 | and maybe the University of Washington
00:04:12.360 | or I'm probably getting that wrong,
00:04:14.340 | but a couple of organizations
00:04:16.240 | came together to do this work,
00:04:17.580 | which the premise is,
00:04:19.660 | well, so they showed,
00:04:20.460 | and this is part of a larger narrative
00:04:22.260 | around self-reflection
00:04:23.860 | with large language models,
00:04:25.040 | but they were one of the seminal papers
00:04:27.800 | that showed that if you ask a model
00:04:29.800 | to self-critique
00:04:30.620 | or to provide feedback
00:04:31.740 | on its own output
00:04:32.720 | and then use that feedback
00:04:34.600 | to refine its answer,
00:04:36.900 | you can see like up to 20%
00:04:38.800 | performance gains.
00:04:39.620 | And I think a lot of their experiments
00:04:40.860 | were on whatever chatGPT,
00:04:42.340 | whatever was powering chatGPT
00:04:45.220 | at the time.
00:04:45.800 | And so at the bottom,
00:04:48.260 | this is an illustrative example
00:04:50.020 | of how that would look in practice.
00:04:52.340 | So you have a user
00:04:53.800 | talking about table tennis.
00:04:54.980 | There's like kind of a mediocre response
00:04:57.060 | from the model,
00:04:59.020 | but then you tell the model,
00:05:00.900 | you prompt the model to say,
00:05:02.920 | what are the issues with this?
00:05:04.360 | Like what is missing?
00:05:07.500 | And it says,
00:05:08.840 | oh, there's no information
00:05:10.100 | about how to play table tennis
00:05:11.120 | and there's a lack of user understanding.
00:05:12.580 | And so then the refined response
00:05:14.380 | is qualitatively much better.
00:05:15.720 | So we're definitely inspired
00:05:17.180 | by this known property
00:05:18.640 | of large language models
00:05:19.540 | that they can to some degree,
00:05:20.880 | like provide self-feedback
00:05:22.380 | or self-critiques.
00:05:25.160 | So this is what that flow
00:05:26.420 | would look like
00:05:27.040 | if we were just prompting
00:05:28.960 | for self-reflection, right?
00:05:30.040 | So on the left side,
00:05:31.240 | if you succeed, great.
00:05:32.280 | On the right side,
00:05:33.140 | if you fail, right?
00:05:34.920 | If we detected a failure,
00:05:36.320 | as I talked about,
00:05:37.080 | with some sort of verification,
00:05:40.280 | then we could generate
00:05:42.380 | a self-reflection in this style,
00:05:43.760 | like ask the model
00:05:44.400 | to provide feedback
00:05:44.980 | and then have the model
00:05:46.140 | retry the task, right?
00:05:47.360 | And we see pretty much
00:05:48.340 | like pretty immediate
00:05:49.800 | improvements from this
00:05:51.700 | on the task
00:05:52.220 | that we're talking about, right?
00:05:53.240 | So this is cool and all.
00:05:54.360 | That's how reflection actually works.
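As a rough sketch of the prompting-only flow just described (generate, verify, and on failure ask for a self-reflection and retry in the same conversation); the prompt wording and the generate/verify helpers are illustrative assumptions, not the paper's exact prompts or code:

```python
# Minimal sketch of the prompt-only reflect-and-retry loop described above.
# `generate` (an LLM call) and `verify` (a task-specific binary checker) are
# hypothetical stand-ins, not the paper's actual prompts or implementation.
def reflect_and_retry(task_prompt, generate, verify):
    messages = [{"role": "user", "content": task_prompt}]
    first_answer = generate(messages)
    if verify(first_answer):
        return first_answer                      # success on the first try

    # On failure, ask the model to critique its own attempt...
    messages += [
        {"role": "assistant", "content": first_answer},
        {"role": "user", "content": "That attempt failed. Reflect on what "
                                    "went wrong and how to fix it."},
    ]
    reflection = generate(messages)

    # ...then retry the task in the same conversation, with the reflection in context.
    messages += [
        {"role": "assistant", "content": reflection},
        {"role": "user", "content": "Now try the task again."},
    ]
    return generate(messages)
```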
00:05:56.640 | But then sort of comes
00:05:59.360 | in the learning aspect
00:06:00.320 | or the evolution aspect, right?
00:06:01.940 | The self-reflection prompting
00:06:03.120 | is static
00:06:03.600 | and the model doesn't learn anything.
00:06:05.100 | And so we're using
00:06:06.060 | reinforcement learning again,
00:06:07.520 | which is why the verifiable
00:06:09.560 | rewards are lovely
00:06:10.320 | to teach the model
00:06:11.840 | how to reflect better.
00:06:12.860 | And our specific formulation
00:06:14.960 | is that we specifically rewarded
00:06:18.440 | the self-reflection tokens.
00:06:19.700 | So what we were trying
00:06:20.700 | to do here with this approach
00:06:21.980 | is incentivize the model
00:06:23.640 | to learn how to self-reflect better,
00:06:26.340 | not to get the right answer.
00:06:27.840 | So there was no reward
00:06:30.060 | for the answer tokens.
00:06:31.460 | There was only reward
00:06:32.460 | for the self-reflection tokens.
00:06:33.640 | And so the flow on the right now
00:06:37.660 | is when we fail,
00:06:38.700 | we generate a self-reflection,
00:06:39.920 | we retry the task.
00:06:40.980 | And then if we are
00:06:41.840 | on that specific fail,
00:06:43.180 | retry success path,
00:06:44.940 | just that pathway
00:06:45.940 | where a self-reflection
00:06:47.080 | led to success,
00:06:50.420 | then we reward
00:06:51.360 | the self-reflection tokens
00:06:52.360 | because that means
00:06:53.020 | that is a high-quality
00:06:53.880 | self-reflection
00:06:54.480 | that led to a success
00:06:56.020 | when there was previously
00:06:56.720 | a failure.
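A minimal sketch of that reward rule, assuming a binary task verifier; the helper names and token-role bookkeeping are illustrative, not the paper's implementation:

```python
# Sketch of the reward described above: a binary reward granted only on the
# fail -> reflect -> retry -> success path, and applied only to the
# self-reflection tokens (answer tokens get zero credit).
def episode_reward(first_try_correct: bool, retry_correct: bool) -> float:
    if (not first_try_correct) and retry_correct:
        return 1.0          # the reflection "worked": a failure became a success
    return 0.0              # every other path gets no reward

def token_rewards(reward: float, token_roles: list[str]) -> list[float]:
    # Credit only the reflection tokens, so the model is pushed to reflect
    # better rather than to memorize answers for one specific task.
    return [reward if role == "reflection" else 0.0 for role in token_roles]
```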
00:06:57.060 | Any questions at this point?
00:07:00.240 | I don't know
00:07:01.560 | if there's somewhere
00:07:01.980 | I can see.
00:07:02.600 | There's...
00:07:07.860 | No, no, no,
00:07:08.820 | not specifically.
00:07:10.240 | Okay, perfect.
00:07:10.580 | Okay, okay.
00:07:11.480 | I'll keep moving.
00:07:12.560 | Okay, awesome.
00:07:13.600 | I'll just keep moving
00:07:14.220 | and Sam,
00:07:15.020 | if you want to jump in
00:07:15.940 | on the chat,
00:07:16.260 | you're great.
00:07:16.680 | Awesome, okay.
00:07:18.280 | So that's like
00:07:19.080 | the basic formulation, right?
00:07:20.100 | It's a modification
00:07:20.660 | to a sort of standard
00:07:22.360 | reinforcement learning approach
00:07:24.820 | with this extra
00:07:25.640 | self-reflection
00:07:27.060 | and rewarding
00:07:27.660 | self-reflection step.
00:07:28.540 | And so we're incentivizing
00:07:29.400 | the model
00:07:29.740 | to self-reflect better.
00:07:30.600 | Okay, so in the paper itself,
00:07:32.840 | we focused on two tasks
00:07:33.980 | that again fit
00:07:34.860 | that description
00:07:35.660 | that I've spoken about
00:07:36.580 | previously,
00:07:37.020 | two things,
00:07:37.600 | one, verifiable reward
00:07:38.780 | into like tasks
00:07:40.020 | that you would think
00:07:40.640 | should be in distribution
00:07:41.520 | but aren't.
00:07:42.140 | So the first one
00:07:42.960 | is function calling.
00:07:43.660 | We use the API gen,
00:07:46.760 | like the Salesforce
00:07:47.420 | function calling data set.
00:07:48.560 | It's about 60,000 data points
00:07:49.820 | that came out
00:07:51.020 | mid last year
00:07:51.820 | and that everyone
00:07:52.520 | has been using.
00:07:53.160 | And then we also use
00:07:55.920 | this task called countdown,
00:07:57.200 | which I actually
00:07:58.240 | recently learned
00:07:59.680 | is based on
00:08:01.120 | a British game show,
00:08:02.020 | which I didn't realize.
00:08:02.880 | I didn't know
00:08:03.260 | where that name came from.
00:08:04.120 | but this is a task
00:08:05.520 | that got pretty popular
00:08:07.340 | in January of this year
00:08:08.400 | because there was
00:08:08.920 | like a project
00:08:09.500 | that showed
00:08:10.000 | that RL can like
00:08:11.480 | dramatically improve
00:08:13.340 | ability to do
00:08:14.280 | this particular task.
00:08:15.140 | And it was like
00:08:15.480 | a GRPO experiment.
00:08:16.360 | But models are
00:08:18.120 | surprisingly bad
00:08:18.920 | at this task.
00:08:19.580 | The formulation
00:08:20.180 | is that you get
00:08:22.220 | a list of numbers
00:08:23.420 | between three
00:08:24.240 | to four numbers
00:08:24.880 | and you have to create
00:08:25.820 | an equation
00:08:26.260 | that equals
00:08:26.720 | some other target number
00:08:27.760 | and you can only use
00:08:29.160 | like basic arithmetic
00:08:30.420 | operations plus
00:08:31.460 | minus times divide
00:08:32.780 | and you can only use
00:08:34.100 | each number once.
00:08:34.780 | And so all of the questions
00:08:36.780 | are of this format
00:08:37.560 | where like the numbers
00:08:38.380 | are different,
00:08:38.880 | but like it's always
00:08:39.600 | just create this equation.
00:08:40.620 | It's actually really hard.
00:08:43.200 | Yeah.
00:08:43.780 | Again, there's a British game show
00:08:45.840 | where the entire premise
00:08:46.940 | is that people go
00:08:47.800 | and try to do this quickly
00:08:49.020 | and it's very difficult.
00:08:50.100 | So these are the two tasks
00:08:52.460 | that we experimented on
00:08:53.380 | because again,
00:08:54.080 | like large models
00:08:55.100 | are surprisingly
00:08:55.720 | not as high accuracy
00:08:57.980 | on these tasks
00:08:58.660 | as you would think.
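To make the countdown task concrete, here is a toy illustration with made-up numbers; the brute force below only tries left-to-right groupings, so it is a demo of the search space, not anything from the paper:

```python
# Illustration of the countdown task: given three or four numbers and a target,
# build an equation using +, -, *, / with each number used at most once.
# The sample numbers and target are made up; this brute force only explores
# left-to-right groupings, so it is a demo, not a complete solver.
from itertools import permutations, product
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

def solve_countdown(numbers, target):
    for r in range(1, len(numbers) + 1):                 # a subset may be used
        for perm in permutations(numbers, r):
            for ops in product(OPS, repeat=len(perm) - 1):
                value, expr = perm[0], str(perm[0])
                try:
                    for num, op in zip(perm[1:], ops):
                        value = OPS[op](value, num)      # evaluate left to right
                        expr = f"({expr} {op} {num})"
                except ZeroDivisionError:
                    continue
                if abs(value - target) < 1e-9:
                    return expr
    return None

print(solve_countdown([3, 7, 25, 50], 78))   # prints a valid expression, e.g. ((3 + 25) + 50)
```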
00:09:02.720 | So we did very standard
00:09:04.340 | GRPO training for this,
00:09:05.600 | of course,
00:09:05.900 | with that modification
00:09:06.640 | of self-reflection
00:09:07.560 | and I'm excited to see
00:09:08.980 | like when we get
00:09:10.520 | to the end of this,
00:09:11.460 | like what people think
00:09:12.600 | about, you know,
00:09:13.480 | all of the recent advancements
00:09:15.000 | and all of the building
00:09:16.600 | upon GRPO
00:09:17.340 | and I have a couple papers
00:09:19.060 | that I want to talk about too.
00:09:20.320 | But throughout our training process,
00:09:23.620 | when we started,
00:09:24.420 | we saw qualitatively
00:09:26.360 | that these self-reflections
00:09:27.400 | were very specific, right?
00:09:28.720 | They're very long,
00:09:29.680 | they're very like verbose
00:09:31.400 | and they kind of repeat
00:09:33.120 | the same thing
00:09:33.580 | over and over again
00:09:34.160 | and they're very like specific
00:09:35.280 | to a specific problem, right?
00:09:36.720 | And then we saw
00:09:38.040 | this really cool thing happen
00:09:38.980 | where as you get
00:09:39.660 | to like a thousand steps,
00:09:40.700 | you start to see
00:09:41.680 | these much shorter,
00:09:42.420 | clearer and general examples
00:09:45.460 | or these better,
00:09:47.400 | like they're just higher quality
00:09:48.480 | self-reflections, right?
00:09:49.320 | They're very succinct,
00:09:50.120 | they're very effective,
00:09:51.380 | they're very short.
00:09:52.500 | So it was a cool
00:09:53.620 | qualitative result.
00:09:54.340 | Like we can see
00:09:55.120 | that the models
00:09:55.680 | are sort of becoming
00:09:57.160 | quote unquote
00:09:57.580 | better at self-reflection
00:09:58.800 | in our opinion.
00:09:59.660 | Okay.
00:10:02.120 | And then I'll talk
00:10:03.480 | through actual results.
00:10:05.000 | So this is on
00:10:06.920 | the function calling task.
00:10:07.880 | So it's a pretty standard
00:10:08.840 | function calling task, right?
00:10:09.820 | You provide some tools,
00:10:10.760 | you have a user query
00:10:12.840 | and then the model
00:10:13.840 | has to pick a tool
00:10:15.080 | and provide the right parameters
00:10:17.280 | to that tool.
00:10:17.760 | So pretty standard
00:10:18.380 | function calling.
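Purely as an illustration of what one function-calling example looks like; the tool schema and query below are made up, not taken from the Salesforce dataset mentioned earlier:

```python
# Illustrative shape of a single function-calling example; the tool and query
# are invented for illustration. The model sees the tool definitions plus the
# user query and must emit the right tool name with the right parameters.
tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {"city": {"type": "string"}, "unit": {"type": "string"}},
}]

user_query = "What's the weather like in Paris, in celsius?"

# A correct completion is a call like this, which can be checked field by field
# against the ground-truth call, giving a simple binary (right/wrong) signal.
expected_call = {"name": "get_weather",
                 "arguments": {"city": "Paris", "unit": "celsius"}}
```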
00:10:18.960 | And we stuck
00:10:19.840 | to pretty small models,
00:10:20.840 | particularly since
00:10:21.720 | we started this research
00:10:22.600 | around March
00:10:23.960 | when, you know,
00:10:24.680 | most of the GRPO
00:10:26.020 | research was being done
00:10:27.640 | on like under 10 billion
00:10:28.720 | parameter models.
00:10:29.540 | So the top half
00:10:31.660 | of this table
00:10:32.300 | is the models
00:10:34.920 | that we actually trained.
00:10:35.840 | We stuck to like
00:10:37.640 | a diverse set
00:10:38.640 | of open source models
00:10:39.760 | between one and a half
00:10:41.680 | billion and eight billion
00:10:42.420 | parameters.
00:10:42.840 | And just to walk you
00:10:44.720 | through what the
00:10:45.100 | different columns here mean,
00:10:46.520 | vanilla first try
00:10:48.020 | is like pass at one
00:10:49.120 | on function column,
00:10:50.200 | right?
00:10:50.460 | So you can see
00:10:51.180 | surprisingly low
00:10:52.380 | and even with like
00:10:53.140 | the very big models
00:10:53.960 | towards the bottom,
00:10:54.620 | 72 billion
00:10:56.760 | or 70 billion
00:10:57.360 | parameter models,
00:10:57.940 | like they're only
00:10:58.880 | around like 70%
00:10:59.940 | accuracy, right?
00:11:00.680 | And then the second
00:11:02.660 | column,
00:11:03.140 | plus reflection,
00:11:04.040 | second try
00:11:04.720 | is with that
00:11:05.300 | prompting technique,
00:11:06.020 | right?
00:11:06.300 | So how much of a raise
00:11:07.220 | do you get just by
00:11:08.000 | asking for self-reflection
00:11:09.340 | and then giving the models
00:11:10.440 | a second try
00:11:11.060 | at the task, right?
00:11:11.840 | And then for the top
00:11:13.480 | half of these models,
00:11:14.480 | the smaller models,
00:11:15.180 | we trained them
00:11:15.880 | with this approach
00:11:18.560 | I've been talking
00:11:19.060 | about of incentivizing
00:11:20.020 | better self-reflection.
00:11:21.600 | and we see
00:11:22.500 | pretty immediate
00:11:23.380 | performance gains
00:11:26.620 | both on
00:11:27.580 | pass at one,
00:11:29.340 | so trained first try
00:11:30.520 | and then also
00:11:31.220 | if you give it
00:11:32.200 | a prompt
00:11:32.700 | and a retry,
00:11:33.640 | you see even more
00:11:35.560 | more of a performance
00:11:37.400 | jump.
00:11:37.700 | So the bottom half
00:11:39.720 | of the table
00:11:40.160 | is our
00:11:42.000 | sort of baseline.
00:11:43.320 | The way that we
00:11:44.420 | wanted to think
00:11:44.920 | about this
00:11:45.360 | was how effective
00:11:46.700 | can we be
00:11:48.340 | at bringing
00:11:48.900 | small models
00:11:49.540 | up to the performance
00:11:50.320 | of large models?
00:11:51.020 | And one thing
00:11:53.060 | I want to highlight
00:11:53.500 | here is if you look
00:11:54.260 | at, for example,
00:11:54.880 | the Qwen2 7 billion
00:11:56.240 | instruct rows,
00:11:57.020 | so that's the second
00:11:57.640 | row of the table,
00:11:58.340 | you see the model
00:12:00.560 | vanilla first try
00:12:01.620 | you're at 66.4%
00:12:02.940 | and then with the training
00:12:04.140 | and then with giving
00:12:05.020 | you a second try
00:12:05.640 | you bring it all
00:12:06.180 | the way up
00:12:06.680 | past 77%.
00:12:07.800 | And if you compare
00:12:09.040 | this to plus reflection
00:12:10.980 | second try
00:12:11.720 | of the large models,
00:12:12.740 | the Qwen2 72 billion
00:12:14.380 | and the Llama 70 billion,
00:12:16.300 | like that 77%
00:12:17.860 | is actually higher
00:12:18.800 | than when you
00:12:21.480 | prompt Qwen and Llama
00:12:21.480 | very large models
00:12:22.500 | for self-reflection,
00:12:23.340 | right?
00:12:23.580 | So what we're showing
00:12:24.100 | is that you can use
00:12:24.800 | this training recipe
00:12:25.420 | to bring very small
00:12:26.340 | models past the
00:12:27.740 | performance of models
00:12:28.580 | 10 times their size
00:12:29.520 | just by like using
00:12:30.900 | a particular training
00:12:31.560 | recipe, which is
00:12:32.780 | really cool, right?
00:12:33.340 | Because I think that
00:12:34.080 | there has been a
00:12:35.940 | movement that I
00:12:37.760 | really believe in
00:12:38.580 | about like, you know,
00:12:39.300 | the power of small
00:12:39.940 | models and why
00:12:40.660 | personalization can be
00:12:41.640 | really powerful.
00:12:42.220 | And I think this idea
00:12:43.520 | that you can bring
00:12:44.280 | a small model up
00:12:45.100 | past the performance
00:12:45.960 | of a very general
00:12:47.620 | out-of-the-box model
00:12:48.980 | is a cool thing
00:12:50.100 | to see.
00:12:50.480 | Cool.
00:12:51.640 | So this is the
00:12:52.800 | function calling.
00:12:52.800 | Oh, sorry.
00:12:53.760 | Do you mind if I ask
00:12:54.280 | a question?
00:12:54.520 | Yeah, I noticed
00:12:55.320 | that you had this
00:12:56.000 | sentence in your
00:12:57.340 | write-up as well
00:12:58.680 | that small models
00:12:59.600 | can outperform
00:13:00.480 | larger models.
00:13:00.980 | I guess I just
00:13:02.640 | want to clarify.
00:13:03.060 | This means that
00:13:03.760 | what you're saying
00:13:04.360 | is that small models
00:13:05.120 | can outperform
00:13:05.620 | larger models.
00:13:06.360 | Small models
00:13:08.660 | that are fine-tuned
00:13:10.780 | can outperform
00:13:11.360 | larger models
00:13:12.000 | without fine-tuning.
00:13:13.500 | exactly.
00:13:13.940 | Thank you.
00:13:14.840 | Cool.
00:13:18.720 | And then these
00:13:19.820 | are the countdown
00:13:20.440 | results.
00:13:20.820 | So that second sort
00:13:21.700 | of math equation
00:13:22.320 | writing task.
00:13:23.080 | There's a few more
00:13:24.500 | models on here,
00:13:25.400 | just a slightly
00:13:26.060 | different set.
00:13:26.640 | One thing that we
00:13:28.000 | did for sort of
00:13:29.220 | academic integrity
00:13:29.980 | is like, based on
00:13:31.260 | when the data set
00:13:32.880 | was released,
00:13:33.700 | we only chose
00:13:34.620 | models to train
00:13:35.280 | that were released
00:13:35.880 | prior to that data
00:13:36.700 | set being released
00:13:38.000 | because obviously
00:13:38.660 | with models like
00:13:39.940 | Quan and LLAMA,
00:13:40.340 | you don't know
00:13:40.920 | exactly what they're
00:13:41.500 | trained upon.
00:13:42.000 | And so we wanted
00:13:43.560 | to make sure that
00:13:44.080 | this data wasn't
00:13:45.240 | already in their
00:13:47.560 | training data, right?
00:13:49.780 | Because that would
00:13:50.600 | definitely skew results.
00:13:51.500 | So that's why
00:13:52.420 | there's a slightly
00:13:53.200 | different model set
00:13:54.280 | here and then also
00:13:55.000 | some older models,
00:13:56.580 | right?
00:13:56.860 | But we, again,
00:14:00.860 | see some really
00:14:01.620 | similar results.
00:14:02.300 | You can see the
00:14:03.080 | prompting technique
00:14:03.760 | helps and then the
00:14:04.480 | training helps.
00:14:05.080 | And again, you can
00:14:06.960 | bring Qwen 2.5 7
00:14:08.820 | billion with the
00:14:11.140 | training, with the
00:14:11.780 | reflection up to
00:14:12.920 | over 50%, which
00:14:14.620 | again, like handily
00:14:15.900 | beats Qwen 72
00:14:17.000 | billion and gets
00:14:19.060 | close, like, yeah,
00:14:20.260 | it just gets close
00:14:21.380 | to the performance
00:14:21.960 | of Palmyra X4,
00:14:24.300 | which is cool.
00:14:24.780 | Awesome.
00:14:26.280 | So another thing
00:14:29.260 | we wanted to look
00:14:29.860 | at as sort of a
00:14:30.700 | side effect or a
00:14:31.620 | desirable property
00:14:32.860 | was catastrophic
00:14:33.880 | forgetting and
00:14:34.800 | investigating how much
00:14:36.460 | we saw and we
00:14:37.300 | luckily saw very
00:14:37.900 | little.
00:14:38.220 | This is a lot of
00:14:40.260 | numbers and I'm not
00:14:41.000 | going to walk
00:14:41.340 | through them too
00:14:41.760 | much, but just at a
00:14:43.060 | high level, like we
00:14:43.780 | were seeing that there
00:14:44.700 | was not much of a
00:14:45.520 | performance drop,
00:14:46.240 | particularly around a
00:14:47.060 | statistically significant
00:14:47.900 | one across a wide
00:14:48.920 | variety of tasks, even
00:14:50.500 | though you're doing
00:14:50.920 | this approach to
00:14:51.440 | fine-tuning.
00:14:51.820 | And we sort of credit
00:14:53.180 | that a little bit to
00:14:54.080 | how generally we're
00:14:55.660 | fine-tuning, right?
00:14:56.320 | Because again, we're
00:14:56.960 | incentivizing self-reflection
00:14:58.320 | and reasoning in general
00:14:59.560 | as opposed to
00:15:00.820 | specific answers on a
00:15:03.180 | specific task.
00:15:04.000 | we were seeing
00:15:04.500 | relatively low
00:15:05.140 | catastrophic forgetting,
00:15:05.940 | which is cool and
00:15:06.660 | kind of evidence is
00:15:07.440 | that you can train
00:15:09.860 | models in this way and
00:15:10.900 | still use them for
00:15:11.820 | lots of different
00:15:12.320 | things.
00:15:12.740 | Cool.
00:15:16.080 | This is my last slide.
00:15:17.800 | I guess I moved to
00:15:18.460 | this actually quite
00:15:19.180 | quickly.
00:15:19.440 | But I just, I think
00:15:21.700 | in, you know, as I
00:15:24.280 | mentioned in May and
00:15:25.000 | June, there has been a
00:15:25.980 | lot of work that has
00:15:26.820 | been released on GRPO
00:15:28.660 | and adjacent methods
00:15:29.640 | and has kind of given
00:15:32.000 | us a lens for like
00:15:33.180 | where collectively we
00:15:34.840 | should go from here.
00:15:35.540 | And I just wanted to
00:15:36.200 | highlight some papers that
00:15:37.200 | came out in the last
00:15:37.980 | couple months since this
00:15:38.940 | paper was released that I
00:15:39.860 | thought were really
00:15:40.320 | interesting and cool.
00:15:41.100 | So the first Spurious
00:15:44.880 | Rewards is, I believe,
00:15:46.940 | Allen AI.
00:15:49.380 | and they showed that
00:15:50.680 | for Qwen models
00:15:51.380 | specifically, when you
00:15:54.800 | do RL training, it's
00:15:56.180 | like the reward function
00:15:57.540 | doesn't necessarily need
00:15:58.960 | to be correlated for the
00:15:59.800 | right answer in order to
00:16:00.740 | get strong mathematical
00:16:01.740 | reasoning.
00:16:02.160 | And this was kind of
00:16:03.500 | speaking, I think, to a
00:16:04.500 | general sort of potential
00:16:07.820 | under-training of Qwen
00:16:08.800 | models and then also to
00:16:10.420 | this idea that we need to
00:16:12.960 | be investigating more
00:16:13.900 | carefully, like what it
00:16:16.160 | is or how we're
00:16:17.340 | surfacing
00:16:17.920 | aspects of model
00:16:20.320 | performance through RL.
00:16:21.460 | Like what, like I think
00:16:22.420 | it really speaks to like
00:16:23.080 | what is actually going on
00:16:24.000 | here because there's this
00:16:24.740 | one, you know, research
00:16:27.340 | angle, which is like let's
00:16:28.420 | design really good reward
00:16:29.340 | functions and they kind of
00:16:30.200 | showed like you actually
00:16:31.280 | maybe don't need to
00:16:32.160 | design really good reward
00:16:33.080 | functions, especially for
00:16:34.620 | certain models.
00:16:35.120 | And it's worth noting that
00:16:36.440 | these results primarily
00:16:37.540 | held for Qwen and didn't
00:16:38.440 | really hold for Llama and
00:16:39.400 | other models.
00:16:39.860 | So, you know, potentially
00:16:41.180 | it's something specific
00:16:41.900 | that Qwen is doing, but
00:16:43.140 | yeah, it's a cool paper.
00:16:44.480 | The next one.
00:16:47.040 | I thought was really
00:16:48.460 | interesting.
00:16:48.960 | The next two are basically
00:16:51.760 | about alternate
00:16:52.740 | approaches to reward.
00:16:54.100 | And the second one is
00:16:57.960 | about using self-certainty
00:16:59.140 | as a reward signal.
00:16:59.920 | So you can basically get
00:17:05.260 | rid of verifiers by simply
00:17:07.420 | just using self-certainty, and this is
00:17:08.760 | like very related to what
00:17:09.920 | we did, right?
00:17:10.460 | So I thought it was really
00:17:11.360 | cool paper.
00:17:12.020 | And then the last paper
00:17:13.340 | is using RL to directly
00:17:15.380 | maximize the probability
00:17:16.420 | of generating a reference
00:17:17.420 | answer.
00:17:17.700 | So again, they have a
00:17:18.520 | technique where they're
00:17:21.300 | not using a verifier at all.
00:17:21.300 | And so I think it's
00:17:22.440 | showcasing that like we
00:17:23.500 | can build upon GRPO where
00:17:25.200 | we got to cut out a model
00:17:26.220 | and like cut out even more
00:17:27.360 | models and cut out even
00:17:28.360 | more like sort of
00:17:29.520 | verification steps.
00:17:30.760 | And I think that's
00:17:32.140 | really promising and
00:17:33.140 | interesting.
00:17:33.460 | Yeah, I think that's
00:17:36.000 | everything I had for you
00:17:37.340 | Happy to take questions.
00:17:38.420 | Happy to just have a
00:17:41.160 | greater discussion about
00:17:42.260 | GRPO, reinforcement
00:17:43.800 | learning, self-improvement,
00:17:44.880 | all of the things we
00:17:45.560 | mentioned here.
00:17:46.000 | So yeah, thanks.
00:17:47.220 | Yes, Chris?
00:17:53.920 | Yeah, I found it
00:17:54.960 | interesting how you
00:17:55.680 | also looked at the
00:17:56.540 | Spurious Rewards:
00:17:57.580 | Rethinking Training
00:17:58.800 | Signals in RLVR
00:18:00.300 | paper.
00:18:00.740 | I saw how when you
00:18:04.480 | did the benchmarks that
00:18:05.440 | you did models other
00:18:06.780 | than Qwen for the
00:18:08.540 | improvements.
00:18:09.920 | Was that to account for
00:18:11.140 | that error from like the
00:18:12.820 | previous one where they
00:18:13.560 | showed how the Qwen
00:18:14.520 | models were mostly
00:18:15.980 | focused on in these
00:18:16.960 | RLVR papers to prevent
00:18:18.360 | that mistake?
00:18:20.360 | Yeah, so this
00:18:22.380 | Spurious Reward
00:18:23.400 | paper actually came
00:18:24.300 | out around slash
00:18:26.160 | after when our paper
00:18:26.980 | came out.
00:18:27.420 | So I wasn't aware of
00:18:28.500 | this research, but I
00:18:29.760 | do think that in
00:18:30.560 | general, we wanted to
00:18:31.660 | show that our
00:18:32.940 | technique holds across
00:18:34.240 | different model
00:18:35.720 | families, different
00:18:36.420 | sizes of models, all
00:18:37.380 | of those things, right?
00:18:38.100 | Just for like rigor and
00:18:39.380 | like to sort of prove
00:18:40.220 | out the technique.
00:18:40.820 | And I think in
00:18:41.460 | general, I appreciate
00:18:44.880 | a lot when papers do
00:18:46.000 | this to just show that
00:18:47.480 | it's not something
00:18:48.300 | specific to a
00:18:49.040 | particular model
00:18:49.540 | family or a
00:18:50.160 | particular recipe
00:18:50.740 | that one company
00:18:51.720 | was using.
00:18:52.180 | But yeah, we
00:18:53.640 | weren't aware of
00:18:54.080 | this at the time.
00:18:54.700 | I think Ted had
00:19:01.020 | a question.
00:19:01.440 | I don't know, Ted,
00:19:02.340 | if you want to come
00:19:02.860 | on camera and ask
00:19:03.760 | Mark can read
00:19:06.160 | it all for you.
00:19:06.740 | Yeah, so
00:19:08.880 | specifically, you
00:19:10.240 | mentioned this
00:19:11.680 | other Spurious Rewards
00:19:12.720 | paper, and I can't
00:19:13.580 | remember where, but
00:19:14.200 | I thought I saw
00:19:14.980 | somebody was posting
00:19:17.240 | different papers
00:19:19.360 | RL papers, RL papers
00:19:21.680 | had baselines
00:19:23.160 | that were lower
00:19:24.360 | than the model
00:19:26.380 | authors were able
00:19:27.440 | to achieve.
00:19:28.460 | So the baselines
00:19:29.920 | used in the paper
00:19:30.740 | were suboptimal.
00:19:31.800 | And that basically,
00:19:33.420 | if you suboptimally
00:19:36.480 | use the model
00:19:36.980 | and then use RL,
00:19:37.780 | you can sort of
00:19:38.380 | correct for your
00:19:39.960 | suboptimality, but
00:19:41.100 | you're not actually
00:19:41.940 | improving the performance.
00:19:44.040 | this one, which I
00:19:46.740 | thought was a very
00:19:47.360 | interesting conclusion
00:19:50.260 | saying that you don't
00:19:51.180 | really need great
00:19:52.060 | rewards, that maybe
00:19:54.200 | that, in particular,
00:19:55.140 | that result was more
00:19:56.940 | about fixing
00:19:58.380 | suboptimality versus
00:20:00.200 | improving reasoning.
00:20:01.240 | I don't know if you've
00:20:02.820 | seen that discussion.
00:20:06.720 | Yeah, I haven't, but
00:20:08.100 | that sounds
00:20:08.820 | interesting, and
00:20:10.380 | yeah, I would love
00:20:11.860 | to get a link to
00:20:12.740 | that.
00:20:12.980 | Yeah, sorry, I
00:20:14.160 | don't have the
00:20:14.800 | reference at my
00:20:16.140 | fingertips.
00:20:16.640 | You're good.
00:20:17.400 | with the TRL
00:20:30.880 | framework that
00:20:31.820 | applied the
00:20:32.780 | GRPO to the
00:20:34.260 | model.
00:20:34.640 | So is that
00:20:36.560 | like fine-tuning?
00:20:38.120 | Is that like
00:20:39.140 | directly updating
00:20:40.080 | the weights, or is
00:20:40.820 | it some kind of
00:20:41.440 | wrapper on the
00:20:42.400 | model, or how
00:20:43.600 | does that GRPO
00:20:44.820 | trainer work?
00:20:46.440 | Yeah, so we
00:20:48.360 | are directly
00:20:49.020 | updating the
00:20:49.600 | weights.
00:20:49.880 | Okay.
00:20:50.980 | So, yeah,
00:20:52.300 | generally speaking,
00:20:52.940 | you do need an
00:20:53.500 | open-source model
00:20:54.160 | because you are
00:20:55.060 | directly updating
00:20:56.720 | weights.
00:20:57.000 | Got it.
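For anyone curious what that looks like in practice, here is a hedged sketch of a GRPO run with TRL; the argument names follow recent TRL releases and may differ by version, and the reward function is a crude placeholder rather than the reflection reward from the talk:

```python
# Hedged sketch of a GRPO run with TRL that directly updates the model weights.
# Exact argument names may vary across TRL versions; the reward function below
# is a placeholder, not the reflection-token reward described in the talk.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

def binary_reward(completions, target, **kwargs):
    # One score per sampled completion: 1.0 if the target string appears, else 0.0.
    return [1.0 if t in c else 0.0 for c, t in zip(completions, target)]

dataset = Dataset.from_dict({
    "prompt": ["Using 3, 25, 50 once each, make 78 with + - * /."],
    "target": ["78"],
})

trainer = GRPOTrainer(
    model="Qwen/Qwen2-1.5B-Instruct",       # a small open-weights model
    reward_funcs=binary_reward,
    args=GRPOConfig(output_dir="grpo-run", num_generations=8, temperature=0.9),
    train_dataset=dataset,
)
trainer.train()                              # gradient updates to the weights themselves
```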
00:20:58.440 | Thank you.
00:20:58.980 | Hey, Shelly,
00:21:02.100 | thanks for the
00:21:03.400 | great presentation.
00:21:04.400 | Thanks for joining
00:21:05.420 | I was curious,
00:21:07.580 | I don't think I
00:21:09.540 | saw anything about
00:21:10.060 | this in the paper,
00:21:10.520 | but maybe you've
00:21:11.760 | thought about this.
00:21:12.240 | You know, it
00:21:14.480 | seems like the
00:21:15.060 | larger goal is to
00:21:16.420 | be able to
00:21:17.260 | sort of
00:21:17.620 | continuously
00:21:18.140 | update the
00:21:19.280 | model as
00:21:20.560 | new information
00:21:21.540 | comes in and
00:21:23.220 | maybe distribution
00:21:24.040 | shift and so
00:21:24.820 | forth.
00:21:25.060 | Have you thought
00:21:26.140 | about, and
00:21:26.760 | you know, the
00:21:27.360 | catastrophic
00:21:27.880 | forgetting part is
00:21:29.060 | maybe addressing
00:21:30.000 | that to some
00:21:30.560 | extent.
00:21:30.920 | Have you thought
00:21:31.780 | about like what
00:21:32.440 | happens when you
00:21:33.160 | do many rounds
00:21:34.140 | of this kind of
00:21:35.860 | like self-reinforced
00:21:37.420 | training?
00:21:38.020 | And maybe you
00:21:40.200 | did some
00:21:40.500 | experiments.
00:21:40.980 | I'm curious to
00:21:41.800 | hear your thoughts.
00:21:42.380 | Yeah, that's a
00:21:44.120 | really good question.
00:21:45.040 | I do think that
00:21:45.540 | is a very
00:21:45.920 | natural extension
00:21:46.680 | of this work.
00:21:47.360 | And one of the
00:21:47.940 | ideas that we
00:21:48.560 | talked about
00:21:48.980 | internally was
00:21:49.820 | switching between
00:21:50.820 | rounds of RL and
00:21:51.680 | SFT, right?
00:21:52.400 | Which is kind of
00:21:52.920 | a proven thing
00:21:53.540 | like do some
00:21:54.340 | SFT, do some
00:21:55.380 | RL, then you
00:21:56.600 | keep doing that
00:21:57.300 | over and over
00:21:57.700 | again because you
00:21:58.220 | kind of elicit
00:21:59.220 | slightly new
00:21:59.800 | behavior with
00:22:00.460 | each round that
00:22:01.460 | What we were
00:22:05.700 | seeing, we were
00:22:06.340 | training, so we
00:22:07.220 | kind of let the
00:22:09.080 | models over-train a
00:22:09.940 | little bit, right?
00:22:10.560 | like we let
00:22:12.160 | them take a
00:22:13.680 | data set and run
00:22:14.440 | until they
00:22:14.860 | converge and we
00:22:15.580 | were seeing that
00:22:16.820 | they kind of
00:22:17.400 | leveled out and
00:22:18.260 | potentially started
00:22:18.860 | to get a little
00:22:19.240 | bit worse, usually
00:22:21.280 | around like, I
00:22:24.020 | mean, there's like
00:22:24.480 | sample numbers in
00:22:25.140 | the paper, like
00:22:25.920 | about 20 to 25,000
00:22:28.840 | samples, but probably
00:22:30.460 | we could have been
00:22:30.980 | even more sample
00:22:31.600 | efficient if we
00:22:32.140 | tried, and about
00:22:34.000 | a thousand steps.
00:22:35.200 | So my sense is
00:22:36.980 | that you need to
00:22:37.560 | interject probably
00:22:39.240 | something more in
00:22:40.600 | between those rounds
00:22:41.520 | of RL in order to
00:22:42.580 | squeeze out more
00:22:43.100 | performance because I
00:22:43.940 | feel like I've seen
00:22:44.460 | some reports where
00:22:45.140 | it's like GRPO will
00:22:46.380 | like indefinitely go
00:22:47.260 | up, and I haven't
00:22:48.320 | seen that be the
00:22:48.860 | case.
00:22:49.180 | Yeah, and in terms
00:22:53.620 | of distribution
00:22:54.080 | shift in general or
00:22:55.000 | like how catastrophic
00:22:55.740 | we're getting changes,
00:22:56.460 | we weren't able to
00:22:57.460 | run any super long-term
00:22:58.480 | experiments, but I
00:22:59.320 | think that would be
00:22:59.940 | very interesting and
00:23:00.580 | very interesting
00:23:01.000 | extension.
00:23:01.420 | Great, thank you
00:23:04.280 | very much.
00:23:04.780 | I think we have a
00:23:10.560 | question from
00:23:11.180 | Xiaobai.
00:23:11.620 | I don't know, Xiaobai,
00:23:12.280 | if you want to
00:23:13.520 | voice over your
00:23:14.320 | question?
00:23:14.620 | Yeah, thanks.
00:23:18.800 | Yeah, I think the
00:23:19.480 | question I have is
00:23:20.320 | like, it seems like
00:23:22.420 | multiple tries are
00:23:23.560 | very dependent on
00:23:24.900 | the temperature.
00:23:25.800 | So I guess like
00:23:27.600 | when temperature is
00:23:28.060 | zero, no matter
00:23:29.480 | how many times you
00:23:30.200 | retry, you won't
00:23:31.140 | actually get a higher
00:23:32.320 | success rate.
00:23:33.920 | So I wonder, have
00:23:34.780 | you done any
00:23:35.740 | analysis on how
00:23:36.800 | actually temperature
00:23:37.620 | is going to
00:23:38.740 | impact your
00:23:39.900 | experiment result?
00:23:40.880 | Yeah, that's a
00:23:43.580 | really good point.
00:23:44.240 | I believe I would
00:23:46.820 | have to go back.
00:23:47.360 | It's been a couple
00:23:48.840 | months since I
00:23:49.400 | looked at the code.
00:23:49.960 | I believe I set
00:23:50.820 | temperature to like
00:23:51.480 | 0.9 or something
00:23:53.780 | like that.
00:23:54.200 | so you get
00:23:54.840 | maybe a little
00:24:00.300 | lower than that.
00:24:00.940 | But like, yeah, you
00:24:01.560 | definitely need some
00:24:02.300 | variance in your
00:24:04.820 | results because you
00:24:05.900 | need, like especially
00:24:07.820 | in the countdown case,
00:24:08.680 | like you need your
00:24:09.280 | model to explore
00:24:10.020 | multiple paths so
00:24:10.860 | that you can reward
00:24:11.460 | the right one, right?
00:24:12.480 | Like you need it to
00:24:14.180 | kind of, you don't
00:24:16.900 | want it to fail the
00:24:17.600 | same way multiple
00:24:18.460 | times because it
00:24:19.320 | always goes down the
00:24:20.100 | exact same reasoning
00:24:20.760 | path no matter what
00:24:21.660 | because you need
00:24:23.360 | to hopefully like
00:24:24.240 | elicit enough
00:24:26.000 | branches such that
00:24:28.300 | like something ends
00:24:28.940 | up in a success.
00:24:29.660 | We didn't do any
00:24:30.680 | specific experiments on
00:24:32.120 | like playing with the
00:24:33.380 | temperature a lot and
00:24:34.080 | seeing how that
00:24:34.520 | changes things.
00:24:35.160 | But intuitively we
00:24:36.520 | did agree that if you
00:24:37.940 | set temperature to
00:24:38.660 | zero and it's like
00:24:39.440 | very deterministic all
00:24:40.480 | the time, that isn't
00:24:41.940 | going to work.
00:24:42.400 | But yeah, it would
00:24:43.980 | be interested in
00:24:45.020 | in general, like
00:24:45.740 | temperature with RL
00:24:47.140 | seems like something
00:24:48.000 | that we could all
00:24:48.500 | investigate a little
00:24:49.140 | bit more.
00:24:49.500 | That makes sense.
00:24:52.560 | Yeah, thanks.
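As a small illustration of the point about non-zero temperature; the model name and generation settings here are assumptions, and 0.9 is just the value recalled in the talk:

```python
# With do_sample=True and temperature around 0.9, repeated generations differ,
# so a failed first try can be followed by a different, possibly successful
# retry that training can then reward. At temperature ~0 every attempt
# collapses to the same output. Model name and settings are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2-1.5B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tok("Using 3, 25, 50 once each, make 78 with + - * /.", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,            # stochastic decoding -> diverse rollouts
    temperature=0.9,           # roughly the value mentioned in the talk
    num_return_sequences=4,    # several attempts per prompt
    max_new_tokens=64,
)
for seq in outputs:
    print(tok.decode(seq, skip_special_tokens=True))
```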
00:24:53.280 | Yikes.
00:24:58.300 | Do you want to go
00:24:59.500 | ahead with your
00:25:00.840 | chain of thought?
00:25:01.920 | A slew of
00:25:03.000 | questions?
00:25:03.680 | Yeah.
00:25:04.140 | Let me finish
00:25:05.720 | clicking this button
00:25:08.460 | maybe.
00:25:08.880 | Okay, that should do
00:25:10.320 | that.
00:25:10.700 | And then that.
00:25:12.520 | And then this one.
00:25:13.340 | Okay.
00:25:13.600 | Yeah, so let's
00:25:17.460 | for the technique
00:25:21.580 | that you guys did
00:25:22.720 | to find the thing
00:25:24.980 | that you're training
00:25:25.940 | for, I didn't quite
00:25:26.840 | like catch it all
00:25:28.440 | the way, but it
00:25:28.980 | sounds like what
00:25:30.700 | we were doing, the
00:25:31.820 | basic principle is
00:25:34.240 | you could just
00:25:37.640 | continue thinking
00:25:39.300 | forever and ever.
00:25:40.520 | and it turns out
00:25:42.100 | that the more
00:25:43.760 | length we train
00:25:44.760 | for, the better
00:25:45.400 | performance we tend
00:25:46.540 | to get, as so
00:25:48.540 | far kind of been
00:25:49.820 | the verdict.
00:25:50.500 | And so I'm
00:25:53.960 | curious for this
00:25:55.240 | technique, I guess
00:25:57.080 | I should like go
00:25:57.760 | back to the paper a
00:25:59.040 | little bit.
00:25:59.260 | Oh, yeah.
00:25:59.660 | Can we, how do we
00:26:01.000 | do better if we fail?
00:26:01.800 | So for the self
00:26:02.640 | reflection thing,
00:26:04.000 | um, the sort
00:26:04.860 | of like basic
00:26:06.320 | principle here, let's
00:26:07.760 | see, generate
00:26:08.120 | output, success, do
00:26:09.340 | nothing, fail,
00:26:10.000 | generate self
00:26:10.920 | reflection, retry
00:26:12.620 | tasks, same
00:26:13.660 | conversation.
00:26:14.460 | Okay.
00:26:15.440 | So you do a
00:26:16.160 | think block and
00:26:18.020 | then a task
00:26:19.520 | attempt.
00:26:20.040 | And then if the
00:26:21.000 | task is insufficient,
00:26:22.020 | then you do
00:26:22.560 | another think block
00:26:23.860 | and another task
00:26:25.200 | attempt.
00:26:25.540 | And that's all in
00:26:26.440 | the same window.
00:26:27.740 | Yeah.
00:26:28.140 | and then you
00:26:28.600 | reward until you
00:26:30.380 | get it or do you
00:26:31.560 | do it?
00:26:31.820 | So, so yeah, it's
00:26:34.140 | just one, one extra
00:26:35.220 | step.
00:26:35.560 | So it's two task
00:26:36.660 | retries total.
00:26:37.620 | Um, and there's
00:26:39.080 | actually not, we
00:26:39.740 | weren't specifically
00:26:40.540 | using, uh, reasoning
00:26:43.480 | models.
00:26:43.980 | Uh, so, so there
00:26:45.800 | isn't an explicit
00:26:46.620 | think block that this
00:26:47.740 | is just like for, uh,
00:26:50.880 | I mean, yeah, this
00:26:51.540 | was a lot of these
00:26:52.440 | models were released
00:26:53.140 | like early 2024,
00:26:54.220 | like reasoning
00:26:54.920 | models weren't as
00:26:55.640 | much of a thing at
00:26:56.540 | that point in time.
00:26:57.200 | So, um, yeah.
00:27:00.080 | Do you have any
00:27:01.220 | intuition on how, or
00:27:02.760 | if this kind of
00:27:03.940 | technique might apply
00:27:05.060 | to reasoning models
00:27:05.900 | and kind of the
00:27:06.460 | reasoning RL paradigm?
00:27:07.620 | Yeah, that's a good
00:27:10.260 | question.
00:27:10.640 | I mean, my, my intuition
00:27:14.000 | is that these sort
00:27:16.020 | like this similar
00:27:17.860 | approaches are already
00:27:18.620 | being used to trade
00:27:19.680 | good reasoning.
00:27:20.440 | Right.
00:27:20.840 | And so I don't know
00:27:21.460 | how much this would
00:27:22.060 | necessarily build upon
00:27:22.980 | a reasoning model
00:27:23.600 | that has already, uh,
00:27:26.740 | been hyper-optimized
00:27:28.220 | to, for example,
00:27:29.560 | probably like self-reflect
00:27:31.180 | really well.
00:27:31.740 | Right.
00:27:31.960 | That's probably a side,
00:27:32.920 | uh, quest or an
00:27:34.840 | adjacent RL target to
00:27:37.940 | what we were shooting
00:27:39.480 | Right.
00:27:39.780 | Like it is like the
00:27:40.960 | reasoning rewards that
00:27:42.800 | generally speaking,
00:27:43.700 | these large labs use.
00:27:44.720 | So my sense is it will
00:27:46.940 | be less effective on
00:27:48.580 | reasoning models because
00:27:49.360 | this is almost an
00:27:50.640 | approach to like
00:27:51.200 | incentivizing good
00:27:51.920 | reasoning in pre-reasoning
00:27:53.040 | models.
00:27:53.380 | Let's yeah.
00:27:57.360 | Well, so yeah, that's,
00:27:58.540 | I think that's the
00:27:59.220 | interesting part to me,
00:28:00.800 | the largely because the
00:28:03.600 | reasoning models, or
00:28:04.680 | at least from what I've
00:28:05.440 | seen, if you just train
00:28:07.120 | them to think longer,
00:28:08.120 | like that's really all
00:28:09.220 | we're doing is, is the,
00:28:10.540 | the more performance we
00:28:11.580 | get is just like more
00:28:12.780 | think tokens or less
00:28:13.880 | think tokens essentially.
00:28:15.760 | Um, and then there's
00:28:16.820 | very, or like there
00:28:18.040 | seems to be some
00:28:19.000 | contention on, um, does
00:28:23.380 | the content of those
00:28:24.880 | think tokens actually
00:28:26.320 | matter?
00:28:26.860 | There's like a couple
00:28:27.920 | of, um, papers that
00:28:30.760 | sort of indicate like,
00:28:32.400 | no, it actually like
00:28:33.540 | these reasoning traces can
00:28:34.820 | be completely incoherent.
00:28:36.340 | Um, but you still get
00:28:37.880 | the performance increase.
00:28:39.100 | Um, and then some papers
00:28:41.480 | that are like, oh, if
00:28:42.740 | the, if the reasoning
00:28:44.000 | traces change in this
00:28:45.420 | particular way, then you
00:28:47.340 | see a more significant,
00:28:48.600 | um, performance increase
00:28:50.280 | than you otherwise would.
00:28:51.620 | Um, and I'm sort of like
00:28:53.360 | trying to dial in the
00:28:54.400 | bottom of that.
00:28:55.240 | So yeah, I guess like it
00:28:57.100 | doesn't quite apply here
00:28:58.960 | since we don't have a
00:29:00.140 | reasoning model to work
00:29:01.600 | with, but I'm, I would
00:29:02.840 | be interested to see the
00:29:06.140 | I would, I would be
00:29:08.640 | interested to sit and to
00:29:09.460 | like basically just run
00:29:10.620 | this as identical,
00:29:13.100 | identical pipeline, except
00:29:15.080 | with a reasoning model
00:29:16.620 | because you have, you
00:29:17.540 | would have like reflect
00:29:19.060 | task, output, fail, like
00:29:22.220 | then the second self
00:29:24.640 | reflection there might
00:29:25.860 | have like some weird
00:29:27.000 | stuff in the reasoning
00:29:28.000 | block is like where I'm,
00:29:29.060 | where am I, where I'm
00:29:30.360 | headed, I think.
00:29:31.000 | Um, but yeah, no, I
00:29:34.600 | think that, that, that
00:29:35.560 | does it for me basically.
00:29:36.500 | Um, other than, well, and
00:29:39.300 | then I guess it also
00:29:40.100 | applies to reasoning, but
00:29:41.100 | there's, there's been some
00:29:42.000 | interesting stuff.
00:29:42.880 | Um, namely the paper
00:29:44.720 | like absolute zero.
00:29:45.860 | And then I think Sakana
00:29:46.960 | has another, um, like,
00:29:48.960 | Hey, we secretly actually
00:29:49.940 | don't even need teacher
00:29:50.800 | models kind of thing.
00:29:52.120 | Um, but I think those all
00:29:53.840 | technically apply to
00:29:54.940 | reasoning models.
00:29:55.920 | So that's, um, um, did
00:29:59.460 | you see any like, uh,
00:30:01.080 | unexpected behavior in the
00:30:04.220 | self reflection block or did
00:30:06.200 | it like when you're reading
00:30:07.140 | it, does it all like make
00:30:08.240 | sense?
00:30:08.580 | Like, Oh, okay.
00:30:09.220 | It looks like the model
00:30:10.020 | is self-reflecting on this
00:30:11.240 | and then it generates a new
00:30:12.160 | output that is correct kind
00:30:13.720 | of thing.
00:30:14.020 | Yeah.
00:30:16.420 | Um, yeah.
00:30:17.420 | To your first point, like
00:30:19.000 | before I answer this
00:30:20.060 | question, like to your
00:30:20.900 | first point, I think this
00:30:24.340 | falls very neatly into like
00:30:25.920 | this greater research area
00:30:27.620 | of meta prompting.
00:30:28.360 | And I think that's really
00:30:29.020 | cool.
00:30:29.280 | Right.
00:30:29.560 | Because we're telling the
00:30:30.660 | model a very specific way
00:30:33.160 | to use its tokens where
00:30:34.280 | we're saying like create a
00:30:35.120 | self-reflection.
00:30:35.700 | but I think a lot of where
00:30:37.180 | I would love to head with
00:30:37.980 | this is what if you just
00:30:40.640 | give the models more tokens
00:30:42.460 | in general, right.
00:30:43.200 | And give it lots of
00:30:43.980 | different ways that it can
00:30:44.680 | prompt, like, what is it
00:30:45.480 | about self-reflection
00:30:46.260 | specifically, or is it, if
00:30:47.520 | there are nothing specific
00:30:48.500 | about self-reflection, it's
00:30:49.500 | just, yeah, more thinking
00:30:50.340 | spaces.
00:30:50.720 | What is what gets the job
00:30:52.140 | done period, right.
00:30:52.880 | You just give it more
00:30:53.700 | tokens.
00:30:54.100 | Um, so yeah, curious to, I
00:30:57.540 | think like over time,
00:30:58.500 | hopefully like have some
00:30:59.480 | bandwidth to pull that
00:31:00.200 | apart, um, and, and
00:31:01.340 | think about, uh, how
00:31:03.340 | that relates.
00:31:04.900 | I think your question
00:31:08.840 | was, um, sorry, I'm
00:31:11.200 | blanking on your question.
00:31:12.180 | Uh, no, you're good.
00:31:13.980 | The, um, it was, uh, uh,
00:31:16.220 | uh, I don't remember it
00:31:18.220 | now either.
00:31:18.660 | I just had a different,
00:31:19.380 | a different thought.
00:31:20.520 | The, um, did you try, so
00:31:23.300 | task, generate, output,
00:31:24.440 | succeed, do nothing.
00:31:25.500 | And then other side, we
00:31:27.480 | have the chain.
00:31:28.260 | Um, and then for the, for
00:31:32.480 | the benchmarking that you
00:31:33.560 | did after, did you, did you
00:31:36.320 | have the model that you
00:31:37.260 | were benchmarking against
00:31:38.440 | try twice or did you, did
00:31:41.920 | you have the, the, let's
00:31:44.220 | see, uh, performance on
00:31:46.220 | both first.
00:31:46.860 | Okay.
00:31:47.300 | So you did have it just
00:31:48.400 | like fire two attempts and
00:31:50.100 | like without the training and
00:31:52.000 | as opposed to the other
00:31:52.920 | model who got trained to
00:31:56.480 | have two attempts.
00:31:58.060 | Oh, wait, go to 900.
00:32:00.820 | Uh, we'll see.
00:32:02.080 | Uh, what, huh?
00:32:04.700 | Um, okay.
00:32:09.160 | Oh, okay.
00:32:09.760 | Got it.
00:32:10.060 | Uh, did the, the, so for the
00:32:12.860 | model, you're benchmarking
00:32:13.780 | against a, does it have both of
00:32:16.120 | its attempts in the context?
00:32:17.320 | Like is the, is the
00:32:18.160 | experiment identical?
00:32:19.300 | Okay.
00:32:21.580 | So, so the, the model that
00:32:22.980 | doesn't have training has two
00:32:25.160 | attempts in the window.
00:32:26.100 | The model that does have
00:32:26.980 | training also has two attempts
00:32:28.520 | in the window.
00:32:29.140 | Model number two.
00:32:31.080 | Was trained, um, to have
00:32:35.140 | two attempts.
00:32:36.000 | And so is being rewarded for
00:32:42.080 | getting it right on the
00:32:43.020 | second attempt.
00:32:43.840 | Does it, did you, did you
00:32:46.360 | benchmark or like see any
00:32:47.840 | meaningful difference in
00:32:48.980 | attempt number one?
00:32:50.320 | Like it's, it's like, it
00:32:51.680 | makes sense to me that like
00:32:52.740 | if you're training for the
00:32:53.620 | shape of two attempts and
00:32:55.520 | succeed, then like it should
00:32:58.760 | be better at that.
00:33:00.060 | Um, yeah, so, so just to
00:33:02.720 | like unravel this table a
00:33:03.880 | little bit more.
00:33:04.480 | So the, the, cause it's,
00:33:06.720 | it's, it's, it's a, it's a
00:33:08.300 | lot of numbers.
00:33:08.880 | Um, okay.
00:33:10.380 | So if we compare vanilla
00:33:11.700 | first try to trained first
00:33:13.120 | try, that's seeing how much
00:33:15.880 | better the model got at pass
00:33:17.680 | at one through this training
00:33:18.660 | process.
00:33:19.060 | Right.
00:33:19.480 | So like first line, Qwen2
00:33:21.220 | 1.5 billion goes from 32.6%
00:33:24.540 | on a first task try after
00:33:27.500 | training, it's at 48.6%, right?
00:33:29.640 | So ignore the reflection, second
00:33:30.820 | try columns, just look at like
00:33:32.100 | first and third column.
00:33:33.040 | That's how much it got better
00:33:34.400 | at the task itself, which is
00:33:35.920 | actually a really interesting
00:33:36.820 | result because we never
00:33:37.860 | directly incentivize the model
00:33:39.180 | to get better at the task.
00:33:40.260 | We just, uh, we explicitly only
00:33:47.120 | rewarded the self-reflection
00:33:48.380 | tokens, right?
00:33:49.160 | Like, so we were never directly
00:33:50.520 | rewarding the model's answers
00:33:53.020 | and being like, you know, get
00:33:53.860 | better answers.
00:33:54.480 | I think this, there is some
00:33:56.960 | stuff that falls out of like
00:33:59.440 | the spurious rewards paper that
00:34:00.700 | helps kind of explain this, right?
00:34:02.040 | A little bit where it's like,
00:34:03.080 | Hey, like to some extent
00:34:04.700 | exposing the model to data is
00:34:07.120 | what matters.
00:34:07.600 | the reward actually matters
00:34:08.580 | way less than we think it does
00:34:09.600 | specifically for Qwen models.
00:34:10.800 | Um, and then another way of,
00:34:15.720 | or another sort of thing about
00:34:17.640 | to think about what this table,
00:34:18.920 | right, is another column we
00:34:21.220 | could have had here.
00:34:21.940 | We could have had two more
00:34:23.080 | columns, which is basically the
00:34:24.040 | diff, um, between the vanilla
00:34:26.220 | first try and the reflection
00:34:27.220 | second try before training and
00:34:28.680 | after training, right?
00:34:29.500 | So if you look at, again, the
00:34:31.080 | first row, you can see that we
00:34:32.540 | go, um, so 32.6% pass at one
00:34:37.500 | on the vanilla model.
00:34:38.960 | And then when you give it that
00:34:39.840 | second try, it goes up to 34.8%,
00:34:42.040 | right?
00:34:42.320 | So that's 2.2% better, right?
00:34:45.200 | And if we go to the two columns
00:34:48.140 | on the right, we go from 48.6%
00:34:50.540 | to 52.9%, which is more than 2.2.
00:34:53.120 | It's, uh, 4.3%, right?
00:34:56.180 | So the model has gotten better at
00:34:57.920 | utilizing that self, like the
00:34:59.620 | self-reflections are better.
00:35:00.640 | And so that second try is right
00:35:05.300 | more of the time now, right?
00:35:07.000 | You have like, you, we went from
00:35:09.040 | reflection and a second try gives
00:35:12.160 | us 2.2% improvement to that, that
00:35:14.480 | prompt by the end of training now
00:35:15.760 | gives us 4.3% improvement on the
00:35:17.780 | metrics.
00:35:18.080 | So that's kind of how you could, you
00:35:19.780 | could see that, uh, the self-reflections
00:35:22.300 | are quote unquote better, right?
00:35:23.400 | Uh, objectively.
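To make those deltas concrete, here is a tiny sketch that just redoes the arithmetic from the four numbers quoted for that first row; only the percentages come from the discussion, nothing else is assumed.

```python
# Recompute the reflection gains quoted for the first row of the table.
rows = {
    "vanilla": {"first_try": 32.6, "with_reflection": 34.8},
    "trained": {"first_try": 48.6, "with_reflection": 52.9},
}

for name, r in rows.items():
    gain = r["with_reflection"] - r["first_try"]
    print(f"{name}: pass@1 {r['first_try']}% -> second try {r['with_reflection']}% "
          f"(reflection adds {gain:.1f} points)")

# vanilla: reflection adds 2.2 points; trained: reflection adds 4.3 points.
# The trained model both starts higher and benefits more from its own reflections.
```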
00:35:24.200 | I'm curious, did you evaluate what is
00:35:27.040 | good self-reflection?
00:35:28.440 | How did you induce that?
00:35:29.680 | Yeah.
00:35:31.600 | So I think for us, we really looked
00:35:34.640 | at it qualitatively, um, cause it is
00:35:37.880 | interesting to like, sort of see
00:35:39.320 | what self-reflections lead to that
00:35:43.300 | improvement, right?
00:35:44.060 | And we didn't encourage any specific
00:35:46.520 | format or anything.
00:35:47.300 | We just, uh, rewarded it when a
00:35:50.500 | self-reflection led to success and
00:35:52.080 | saw what happened.
00:35:52.700 | And actually, I think this relates to one of
00:35:54.400 | the earlier questions: the
00:35:56.540 | self-reflections themselves
00:35:58.340 | language-mix a lot,
00:36:00.080 | like a lot, a lot.
00:36:00.980 | And sometimes it was just pure
00:36:02.340 | gibberish for some models.
00:36:03.900 | So there is definitely
00:36:05.740 | evidence that more thinking space
00:36:08.040 | is kind of what the models need, as
00:36:09.300 | opposed to these very
00:36:10.140 | parsable, human-legible self-
00:36:13.340 | reflections.
00:36:14.000 | Um, but qualitatively they do sort of
00:36:17.080 | seem like quote unquote better self-
00:36:18.880 | reflections or like more effective,
00:36:20.680 | more efficient over time, um, in many
00:36:23.200 | cases.
00:36:23.600 | So that was kind of cool to see, but
00:36:25.040 | because we didn't have any format
00:36:26.200 | constraints, um, yeah.
00:36:28.540 | Sometimes it was gibberish,
00:36:29.840 | particularly, again, with some models, and
00:36:31.800 | the language stuff is very
00:36:33.200 | interesting.
00:36:33.740 | I think Frankie has their hand up.
00:36:43.400 | I think it'd be helpful to go back to tasks.
00:37:01.200 | The success field,
00:37:02.540 | what's actually put there?
00:37:05.460 | Yeah.
00:37:06.800 | So, um, I think it'd be helpful to go back to tasks.
00:37:10.260 | So for the function calling dataset, we cheated a little bit, in that you theoretically don't need a ground truth dataset, right? You could do something as simple as: did this function call create a request that, when it hits an API, gets you back a 201?
00:37:19.200 | Like those sorts of things, but in our case, we did actually check whether the answer matched the answer from the ground truth dataset, because we were using an SFT dataset.
00:37:32.540 | But generally speaking, for function calling, if you have any sort of binary reward checker, any way of saying, I think this was a good function call versus a bad function call, you should be able to do this.
00:37:56.780 | With countdown, this was a little bit more of like a true verifier, um, because, you know, like many math questions, it's very easy to check if a particular equation that the model has generated evaluates to the right number, but it's like hard to generate all the answers, right?
00:38:14.240 | Like there's many possible answers.
00:38:15.560 | So what we did here was quite literal: we checked to make sure the numbers used in the equation the model wrote were the allowed numbers.
00:38:24.400 | Um, and then we just ran eval on it and like, saw if it hit the target number.
00:38:28.060 | So it was just like a very basic, like evaluate the function and see if there's success.
00:38:32.500 | Yeah.
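As a rough illustration of the kind of binary checker being described for countdown, here is a minimal sketch: it assumes the model's final answer arrives as a bare arithmetic expression, checks that it only uses the allowed numbers (each at most as many times as given), evaluates it safely, and compares against the target. The function name and parsing details are illustrative, not the paper's actual implementation.

```python
import ast
import re

def verify_countdown(expression: str, allowed_numbers: list[int], target: int) -> bool:
    """Binary reward: True only if the expression uses allowed numbers and hits the target."""
    # Reject anything that isn't plain arithmetic before evaluating it.
    if not re.fullmatch(r"[\d\s\+\-\*\/\(\)\.]+", expression):
        return False

    # Check that the multiset of numbers used is drawn from the allowed numbers.
    used = [int(n) for n in re.findall(r"\d+", expression)]
    remaining = list(allowed_numbers)
    for n in used:
        if n in remaining:
            remaining.remove(n)
        else:
            return False

    # Safely evaluate the arithmetic expression and compare to the target.
    try:
        value = eval(compile(ast.parse(expression, mode="eval"), "<expr>", "eval"),
                     {"__builtins__": {}}, {})
    except Exception:
        return False
    return abs(value - target) < 1e-6

# Example: numbers 3, 7, 25, 50 with target 75.
print(verify_countdown("(7 - 3) * 25 - 50 + 25", [3, 7, 25, 50], 75))  # False: 25 used twice
print(verify_countdown("50 + 25", [3, 7, 25, 50], 75))                 # True
```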
00:38:33.040 | Go back to the background.
00:38:34.120 | Sorry.
00:38:34.400 | I just wanted to follow along, just a couple of questions there.
00:38:36.520 | So on that first block to the right of fail, generate self-reflection.
00:38:40.480 | What are you adding directly there when you get a failure?
00:38:43.480 | So we have a prompt that is in the paper.
00:38:47.960 | There are prompt templates at the bottom; it's just a prompt.
00:38:51.340 | And then it generates a self-reflection, and then you prompt the question again.
00:38:55.300 | You put the original question back in to retry.
00:38:58.540 | Got you.
00:38:58.900 | Okay.
00:38:59.380 | So then, since you follow the success path of the second retry, that is, your verifier knows that you got the correct answer.
00:39:12.360 | And then you're going to say reward the self-reflection tokens.
00:39:15.300 | So which tokens specifically are you rewarding?
00:39:18.240 | Just the fact that on that first fail, you generated some new tokens from that.
00:39:23.340 | That's, um, that's what you're rewarding for that particular path.
00:39:26.940 | Yeah, exactly.
00:39:28.800 | So after the failure, we prompt for self-reflection, and the prompt is something like: you got the wrong answer, please reflect on what you did wrong
00:39:39.340 | so you can get the right answer next time. The prompt is something like that.
00:39:42.360 | And then whatever the model answers directly after that, that's exactly what we reward.
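A minimal sketch of that two-turn flow, just to pin down which tokens get rewarded; `generate` and `verify` are hypothetical stand-ins, and the reflection prompt below is paraphrased from the discussion rather than taken from the paper's template.

```python
# Hypothetical sketch of the reflect-retry loop; `generate` and `verify` are stand-ins.
def reflect_retry_episode(task_prompt, generate, verify):
    messages = [{"role": "user", "content": task_prompt}]
    first_answer = generate(messages)
    if verify(first_answer):
        return None  # first-try success: nothing to reward in this setup

    # Failure: ask for a self-reflection, then re-ask the original question.
    messages += [
        {"role": "assistant", "content": first_answer},
        {"role": "user", "content": "You got the wrong answer. Reflect on what went wrong "
                                     "so you can answer correctly next time."},
    ]
    reflection = generate(messages)          # <-- these are the only tokens that get rewarded
    messages += [
        {"role": "assistant", "content": reflection},
        {"role": "user", "content": task_prompt},   # original question again
    ]
    second_answer = generate(messages)

    # Binary reward on the reflection tokens only, credited when the retry succeeds.
    reward = 1.0 if verify(second_answer) else 0.0
    return {"reflection": reflection, "reward": reward}
```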
00:39:48.080 | Okay.
00:39:48.360 | And you did not do anything with the failed path.
00:39:51.460 | That is, not at all.
00:39:53.300 | Yeah.
00:39:53.720 | Why not?
00:39:54.860 | Because you feel like that's not useful to say negative reward or something.
00:40:00.560 | Yeah, pretty much, pretty much.
00:40:03.140 | I mean, I think we found pretty early on that this very simplistic reward formulation worked quite well.
00:40:08.000 | And so we didn't do a lot of work on like exploring alternate reward formulations.
00:40:12.380 | Because I think this one also feels very intuitive; I think as a team,
00:40:17.360 | we just really like these simple, intuitive approaches, right?
00:40:20.720 | Like fail retry success is a good thing, right?
00:40:24.080 | Um, fail retry fail.
00:40:25.640 | Isn't necessarily a bad thing because maybe the question is impossible for everyone.
00:40:30.140 | Right.
00:40:30.480 | Like, um, yes.
00:40:32.760 | So I just, I'm curious to think about the logic here, meaning that you're rewarding something that comes as a result of the follow-up prompt that says you failed, right?
00:40:44.240 | Please try this problem again.
00:40:46.220 | And you're rewarding that particular path.
00:40:48.620 | So I'm trying to understand what is it?
00:40:51.380 | Sorry.
00:40:51.580 | Maybe I don't understand.
00:40:52.880 | Like you did this, but then in your table, you said, oh, we did this training,
00:40:59.300 | and that first trained pass went from like 32% to 48%, right?
00:41:04.520 | So I'm trying to understand what is actually, what do you think is going on there, right?
00:41:08.460 | Because you're rewarding something that doesn't quite...
00:41:11.800 | Yeah.
00:41:12.020 | And I want to understand.
00:41:13.740 | Yeah.
00:41:14.240 | Yeah, that's a really good question.
00:41:16.380 | I think that, uh, we don't super know.
00:41:21.680 | And I think like some of the new papers that have been coming out recently also speak to like in general, when we do RL, we don't super know what's going on.
00:41:29.440 | And in particular, I think that spurious rewards paper starts to partially unlock what might be going on, in that, interestingly, reward functions don't have a lot to do with
00:41:43.840 | whether or not these models get better, right?
00:41:45.760 | Like the exact formulation of the reward, even if it has very little correlation with the right answer, and just the fact that we're rewarding in the space of the right answer, or rewarding other tokens, it sort of doesn't matter.
00:41:58.840 | It's just kind of exposure to the data, and any sort of reward sometimes is leading to pass at one improvement as well.
00:42:04.480 | So can I conjecture that, uh, really when you're rewarding that second attempt, you're just rewarding a better response, right?
00:42:12.820 | Because you actually put in the same question to it again.
00:42:15.280 | So you're just rewarding the fact that
00:42:34.960 | Sorry, maybe I've misunderstood the main point that you're bringing up. Can you repeat that again?
00:43:08.920 | Okay, sorry. Yeah. So going back to my question, it feels like you're just rewarding for a better answer in some sense, because
00:43:17.080 | I guess the fact that, you know, the same question, right, was posed on the retry, and it gave a response, and you like that response better than the first one.
00:43:27.220 | You're kind of rewarding the fact that it got closer to the answer, because you're only rewarding on the success path.
00:43:33.480 | Right. So, yeah.
00:43:36.660 | Yeah. And I think, generally speaking, yes, although a thing that is still kind of interesting is that we are not rewarding the answer tokens directly. But I do think, in practice, what happens is that self-reflections often have the right answer somewhere in them.
00:43:54.340 | And so the answer is leaking into the self-reflection, and then when we reward the self-reflection tokens, at times we are rewarding the answer, because a lot of the self-reflections
00:44:07.220 | answer the question. Yeah.
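One cheap way to quantify that leakage, as a sketch: check how often the ground-truth answer string appears verbatim inside the reflection text. This is a crude proxy (fuzzier matching would catch more), and it is not something reported in the paper.

```python
def answer_leak_rate(reflections, gold_answers):
    """Fraction of reflections that literally contain the gold answer string."""
    hits = sum(str(gold).strip() in reflection
               for reflection, gold in zip(reflections, gold_answers))
    return hits / max(len(reflections), 1)

# e.g. answer_leak_rate(["... so the result should be 75 ...", "no idea"], [75, 42]) -> 0.5
```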
00:44:10.980 | Sorry to interrupt again. Have you noticed that the responses actually got maybe longer, or do you have any metrics to figure that out? Like after you did your training, what's the quality of the responses in that 48% column?
00:44:26.820 | Have you done any simple metrics on those?
00:44:30.540 | Honestly, I haven't, but I should, there is an error analysis, um, section of the paper that mostly discusses like how errors have changed pre and post-training.
00:44:40.820 | Like what the sorts of mistakes models make are, um, that I found pretty interesting. But, but no, like, I think like speed to response, like how many tokens does it take to get the right answer? Hopefully we would see, it would be lower, uh, over time. So yeah, that'd be cool to look at.
00:44:55.220 | Well, yeah, if you saved that information, like if you still have the traces of the runs, I would be quite interested to see. What I would be inclined to check is just the raw token count. Because my suspicion would be that,
00:45:10.660 | if that number is larger, then you are going to see the perf increase; if it's lower, then you don't, would be my guess.
00:45:16.100 | Yeah. It makes sense.
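The check being proposed is cheap if the traces were saved; here is a sketch assuming a list of completion strings and any tokenizer exposing an `encode` method (both are assumptions, nothing here is from the paper).

```python
def mean_completion_tokens(completions, tokenizer):
    """Average number of tokens per completion; compare pre- vs post-training traces."""
    lengths = [len(tokenizer.encode(text)) for text in completions]
    return sum(lengths) / max(len(lengths), 1)
```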
00:45:19.700 | I mean, but an alternative hypothesis here is that the self-reflection induces the model, early on in the answer process, to use
00:45:33.780 | sort of the language from the self-reflections in its initial response, right? And that then induces
00:45:42.260 | a better reasoning process about the answer.
00:45:45.940 | So that, I mean, that would be what you would hope would happen. So maybe that is, I think, a credible alternative hypothesis here.
00:45:56.500 | For the initial prompt, did you ask for a chain of thought prior to, or just say like, one-shot this for
00:46:01.940 | me, kind of thing? Yeah. These were all one shot, because a lot of these models were early enough that,
00:46:07.300 | yeah, chain of thought prompting, or models specifically optimized for chain of thought prompting,
00:46:14.020 | wasn't as much of a thing yet. Because our function calling data set was June 2024.
00:46:19.620 | And so we were using Qwen 2 models, which are not necessarily super optimized for reasoning, because
00:46:25.300 | the issue is that a lot of these models were trained on this data set, right? Like when you have a really
00:46:30.020 | high quality data set and it's open source and it's public, model companies just swallow it up
00:46:35.140 | pretty quickly. And then all of your results are skewed because it's like, okay, we're training on data
00:46:38.340 | that it's already seen. And so like, how much is this training recipe actually valid versus just like
00:46:41.860 | reinforcing something that already exists that you SFT on. So we wanted to keep the data really pure.
00:46:47.300 | And so, yeah, there, there are slightly older models.
00:46:49.060 | So, Shelley, I had another question, directly from the paper. In section 4.3,
00:47:00.340 | you're talking about the, you know, sort of decision to emphasize the
00:47:10.740 | failure-only path. And you say in the last sentence that it is otherwise
00:47:19.060 | functionally equivalent to learning from a real-world scenario where we believe we receive both successes
00:47:24.500 | and failed responses. And that seems unintuitive to me, because it seems like if you use the
00:47:31.540 | successes, you're going to be maybe overtraining on the successful responses and therefore
00:47:38.340 | you might have more catastrophic forgetting. So I wanted to hear what you have to say
00:47:46.500 | about that. Is section 4.3 the part that discusses the failure data set?
00:47:52.580 | Yeah, yeah, exactly. Okay, cool. Yeah. So for context for everyone here, because I didn't really
00:48:00.900 | talk about this. One of the things we did to make GRPO training a lot more tractable was we pre-generated
00:48:08.180 | a data set of failures, right? So like this whole top half, you can do offline and then you just do
00:48:18.420 | GRPO on the second half, right. And the reason that we did that is because we were seeing,
00:48:25.700 | if you run full-scale GRPO with this entire pathway, a very low number of
00:48:31.380 | trajectories that specifically hit failure, retry, success initially. And so it was really,
00:48:37.860 | incredibly resource intensive for what it is, right. And so instead, what we did was we just
00:48:42.180 | offline stored task prompt and output pairs in the case of failure. And what
00:48:53.700 | we gauged is basically, yeah, this is functionally equivalent in our opinion to, or the way we
00:48:59.300 | were training this was very similar to if you didn't do this offline thing. But you're right, RJ,
00:49:05.140 | in that there are actually differences, because the model could drift over
00:49:09.540 | time, and we're anchoring on this offline dataset that was generated from the model at step
00:49:15.300 | zero, right. We're not adapting to the model over time potentially having new failures, right. And so there
00:49:22.580 | isn't a lot of preventative stuff in place here to prevent catastrophic forgetting with respect to other
00:49:33.700 | data points in the training dataset, right. I think our sense was that that wasn't as big of a problem,
00:49:41.060 | especially since we saw low catastrophic forgetting in general. Um, and then of course, like when you evaluate,
00:49:46.900 | you see that like, no matter what, you are strictly better at the task than you were before. But I
00:49:50.420 | think it's definitely possible that for certain tasks, you could see this thing happen where like,
00:49:53.780 | as you train, things that you were succeeding at before, you somehow start failing at. And this
00:49:59.140 | offline dataset, you're correct, wouldn't capture that. Actually, that's a good point.
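For concreteness, a minimal sketch of that offline stage as described: sample several first attempts per task with the step-zero model, keep only the ones the verifier marks as failures, and store the (task, failed output) pairs for the GRPO stage. `generate` and `verify` are hypothetical stand-ins.

```python
import json

def build_failure_dataset(tasks, generate, verify, attempts_per_task=8, out_path="failures.jsonl"):
    """Offline pass with the step-zero model: keep only failed first attempts."""
    with open(out_path, "w") as f:
        for task in tasks:
            for _ in range(attempts_per_task):
                output = generate(task["prompt"])
                if not verify(output, task):  # failure -> becomes a GRPO training example
                    f.write(json.dumps({"prompt": task["prompt"], "failed_output": output}) + "\n")
```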
00:50:04.420 | I was actually thinking the opposite: that this seems like an important feature of your
00:50:09.460 | methodology and not just a functional one. It seems like a functional feature of the methodology
00:50:15.540 | because you're really only, you're basically saying: things that I used to get wrong
00:50:21.860 | and now got right by re-prompting, right? And so you're basically identifying
00:50:28.180 | that very specific subset as the important thing to train on. Whereas if you were to use
00:50:34.340 | the successes as well, then you wouldn't have honed in on that specific subset.
00:50:39.780 | Yeah. That makes a lot of sense. Like we could be rewarding first try success as well, pretty
00:50:44.980 | continuously. Um, yeah, I think our, our approach, like we were really keen on this idea that we can
00:50:53.060 | do this like meta learning, like don't specialize for a specific task, like just incentivize self-reflection.
00:50:58.820 | And so if we were rewarding initial successes, there's, we're rewarding the task. We're not
00:51:02.980 | rewarding self-reflection ability. Um, but I agree that there are like a lot of ways to extend this,
00:51:09.060 | to like both reward the task and self-reflection capability, and hopefully see both things get
00:51:14.180 | better and potentially like you get better at the task faster. Cool. Um, Ted, go ahead.
00:51:20.660 | Hey, thanks again, Shelley, for, for joining and, and coming and discussing this. I hope this is a
00:51:26.580 | quick question. Um, can you say how you formed your batches when you were doing GRPO? Did you like mix
00:51:35.060 | the original and the new success or, or, and did you just randomly permute them or shuffle them or do
00:51:44.100 | anything special? It was pretty random, honestly. Um, I think we stuck to like between eight and 14
00:51:51.380 | generations per failure. So, task to generated output to failure: what would happen is, for each task,
00:51:59.780 | the number of times we generated pathways for that first attempt
00:52:05.620 | varied depending on model capability, right? Smaller models, you give them fewer tries because
00:52:12.340 | actually more of them are failures. So we gathered the failure data set by just
00:52:16.420 | generating a bunch of times and saving the ones that were failures. And then, yeah, with the actual
00:52:20.820 | GRPO training, I would say nothing special: between eight and 14 generations,
00:52:26.820 | not a lot of mixing, pretty standard GRPO training. Especially since, again, this
00:52:32.340 | was like February, March, and I feel like we as a community were still figuring out GRPO,
00:52:35.860 | or at least I personally was. And so there's also this thing in the paper where it's like,
00:52:41.140 | oh, and we kind of capped it at less than 10 billion parameters. And there wasn't infrastructure to
00:52:44.980 | train on more than that; there was no multi-node implementation of GRPO publicly available until
00:52:52.500 | after I ran these experiments. So yeah, it's a process for sure. I'm sure that there are
00:52:57.940 | many papers that have come out since that would optimize specifically the GRPO approach. Yeah.
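For reference, the setup she describes maps roughly onto a standard TRL-style GRPO run. This is a hedged sketch assuming the Hugging Face TRL `GRPOTrainer`/`GRPOConfig` interface (argument names vary between releases), with the 8 to 14 generations per stored failure mentioned above, a placeholder reward function, and an illustrative choice of model and prompt; it is not the paper's released code.

```python
# Sketch only: assumes Hugging Face TRL's GRPOTrainer/GRPOConfig (names may differ by version).
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Stand-in for the offline failure set from the earlier sketch: prompts the step-zero model failed.
failure_dataset = Dataset.from_list([
    {"prompt": "Using the numbers 3, 7, 25, 50 at most once each, reach 75."},
])

def retry_success_reward(completions, **kwargs):
    # Placeholder reward: in the real setup this would run the reflect-retry rollout and
    # return 1.0 only when the second attempt passes the verifier.
    return [0.0 for _ in completions]

config = GRPOConfig(
    output_dir="reflect-retry-grpo",
    num_generations=8,              # the talk mentions 8 to 14 generations per stored failure
    per_device_train_batch_size=8,
    max_completion_length=512,
)
trainer = GRPOTrainer(
    model="Qwen/Qwen2-1.5B-Instruct",   # illustrative small model
    reward_funcs=retry_success_reward,
    args=config,
    train_dataset=failure_dataset,
)
trainer.train()
```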
00:53:03.620 | Cool. Thanks.
00:53:05.220 | Okay. Yeah, of course we have, we have one more question.
00:53:07.860 | Vishvesh?
00:53:09.860 | Hi, Shelly. Yeah, thank you for the presentation. I was just thinking about
00:53:14.980 | your motivation that you want to learn the self-reflection process rather
00:53:22.500 | than the specific task. So, looking at this particular experiment and setup,
00:53:34.980 | do you think a good ablation would be to mix chosen and rejected pairs?
00:53:45.060 | For this particular setup, the first fail and the eventual success could make a chosen-rejected pair,
00:53:50.740 | and then you could do direct preference optimization, to compare whether it is focusing on self-reflection and
00:53:59.460 | not on the task; some form of ablation with that post-training, to compare
00:54:05.700 | whether just rewarding the self-reflection tokens is the best way forward.
00:54:12.740 | Yeah. I, okay. So I didn't quite catch all of that because I think maybe your connection or my
00:54:18.020 | connection wasn't amazing for a bit there, but what I caught was basically ablation studies around
00:54:23.140 | comparing, uh, rewarding self-reflection specifically to rewarding both self-reflection
00:54:29.380 | and the task or just rewarding the task itself and seeing how performance changes. Um, yeah,
00:54:34.500 | super agree that that would be an interesting ablation. I think we pretty intentionally
00:54:37.940 | in this paper stepped away from directly comparing to things like other reward functions and
00:54:43.300 | instead approached it as: let's compare to larger models, right? Let's use our baselines to be,
00:54:47.860 | how much can this training recipe bring us up to bigger models. But I super agree that, like,
00:54:53.860 | head-to-head, this approach versus standard GRPO where you reward the answer, or
00:55:02.660 | combining it with both, or whatever. Yeah, that would be very interesting and
00:55:06.260 | hopefully something the Writer team can get to in the next few months, or anyone else.
00:55:10.900 | Cool. Awesome. Just one more point. Do you plan to make the code open source anytime,
00:55:18.500 | after, like, maybe you've submitted somewhere and then you plan to do it? Yes, the
00:55:22.980 | goal is definitely to make the code open source. Yeah, we are waiting on a few things before
00:55:29.540 | we do that. But we did actually document, in my opinion, relatively well, hopefully, within
00:55:37.780 | the paper, how we did this; it's actually a pretty straightforward modification to
00:55:41.940 | open source libraries. And so I definitely encourage people to just try it out and implement
00:55:47.220 | it. And of course, email me if they have questions and I'm happy to help; yeah, super happy to
00:55:52.420 | answer. But hopefully all the pieces should be there, where if you would like to implement it on
00:55:59.780 | your own, you can. But yeah, eventually, hopefully we will release the code as well.
00:56:10.660 | Thank you so much. Awesome.
00:56:13.060 | Sam, do you want to close this out?
00:56:17.620 | Uh, I don't have much to say. Just thanks. Thanks a lot, Shelly. I really appreciate it. And
00:56:24.180 | thanks everybody for joining and asking such great questions.
00:56:26.980 | And also vote on hugging face apparently. Yes. And also vote on hugging face.
00:56:31.220 | I didn't know. I didn't know they had voting. That's, uh,
00:56:34.740 | Yeah. They have paper of the day. Yeah, it's paper of the day and then paper of the week
00:56:38.820 | and then paper of the month. So shameless self-promotion.
00:56:41.780 | Yeah. It's still time. Still got time. All right. You got two or nine votes now.
00:56:47.140 | Okay. Well, um, I'll drop the links in the YouTube. Thanks everyone. Thanks everybody.
00:56:54.660 | Thank you, everyone.