Paper: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

okay, let's start guys. the goal of today's paper reading is to go through the DeepSeek-R1 paper. the biggest difficulty people have in understanding this paper is the reinforcement learning part, and the reason it is difficult is that they use an algorithm called GRPO. so the goal of the initial part of this paper reading is to give you the background knowledge that is needed to understand the paper. i will be using the slides that i made for one of my videos, to understand what is the connection between language models and reinforcement learning, so let's go to those slides.
so let's review language models very, very quickly. as you know, language models are generative models that have one simple objective: to tell us what is the next likely token given an input prompt. if we feed the prompt "Shanghai is a city in" to the language model, it will generate a probability distribution over what it thinks the next likely token is, coherent with the prompt we have given it. for example, it may tell us that the next likely token is the word "China", or "Beijing", or "cat", or "pizza", and then we choose the one we believe is most likely based on its probability. then we take this word, put it back in the prompt, and ask the language model again what the next word is, and we do this job iteratively to generate a full text. for example, to generate the response to "where is Shanghai?", we first ask the language model, which may generate "Shanghai"; then we put it back in the input of the language model and it will tell us the next likely token, etc. i'm also making a simplifying assumption here that each token is a word and each word is a token. this is not the case in real language models, but for our explanation we will think of it this way.
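just to make this concrete, here is a minimal sketch of that iterative next-token loop, written with the Hugging Face transformers API; the model choice (gpt2) and the sampling details are my own illustrative assumptions, not anything from the paper:

```python
# Minimal sketch of autoregressive generation: predict a distribution over
# the next token, sample one, append it to the prompt, repeat.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Shanghai is a city in"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(10):  # generate 10 tokens
        logits = model(input_ids).logits              # (1, seq_len, vocab_size)
        probs = torch.softmax(logits[0, -1], dim=-1)  # distribution over the next token
        next_token = torch.multinomial(probs, num_samples=1)  # sample (could also use argmax)
        input_ids = torch.cat([input_ids, next_token.unsqueeze(0)], dim=1)

print(tokenizer.decode(input_ids[0]))
```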
now that we know what a language model is, let's go very quickly through what reinforcement learning is, and then we will find the connection between language models and reinforcement learning. reinforcement learning is an area of artificial intelligence (a subfield of machine learning) that is tasked with optimizing the behavior of an agent. the behavior of an agent, its decision making, is called a policy, and we optimize it in such a way that the agent performs actions that maximize the reward it gets from performing those actions in an environment.
for example, i have a cat, and the cat is in my house. the cat can be considered a reinforcement learning agent: it can make decisions about how it wants to move in the house, and this decision making is the policy of the cat. so the policy tells the cat whether it should move up, down, left or right in the house. you should think of the house as a grid environment made up of cells (you cannot see the borders, so imagine the grid lines). what is the goal of the cat? the goal of the cat is to arrive at the meat somewhere in the house. to do that it can make some choices, or, as we say technically, perform some actions, and the actions the cat can choose at each position in the house are: move up, down, left or right. what we want is to make sure that the cat performs actions that very likely lead it to the meat while avoiding the things that the cat doesn't like.
so first of all we design a reward system for this reinforcement learning agent, because we want to train it to choose which actions to perform in each position in the house based on some reward. one possible reward model could be this one: if the cat moves to an empty cell it receives a reward of zero; if the cat moves to the broom it receives a reward of minus one; if it moves to the bathtub it receives a reward of minus 10; however, if after performing a series of actions the cat arrives at the meat, then the cat receives a big reward of plus 100.
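a toy sketch of this grid world, just to make the reward system concrete; the grid layout and cell contents are invented for illustration, only the reward values (0, -1, -10, +100) come from the example above:

```python
# Toy version of the cat grid world described above.
REWARDS = {"empty": 0, "broom": -1, "bathtub": -10, "meat": 100}

GRID = [
    ["empty", "empty",   "broom", "empty"],
    ["empty", "bathtub", "empty", "empty"],
    ["empty", "empty",   "empty", "meat"],
]

def step_reward(row: int, col: int) -> int:
    """Reward the cat receives when it moves into cell (row, col)."""
    return REWARDS[GRID[row][col]]

print(step_reward(2, 3))  # moving onto the meat -> 100
```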
the decision making of this cat is governed by a model that we call the policy of this cat, and the goal of the policy is to choose an action given the current state. the policy is stochastic, which means that it gives us a distribution over all the possible actions that the cat can take. imagine we have a very well optimized policy: if the cat is here, a good policy should tell us that with very high probability we should move down, because that is one way to maximize the reward, and with very low probability we should move left, because that would take us away from the meat. another thing this policy should do, for example, is: if the cat is here, it should not move right, so the probability associated with the action "move right" should be low, and maybe the probability associated with the action "move down" should be high.
now, what is the connection between language models and reinforcement learning? the goal of reinforcement learning is to train this policy, the decision making of this agent, so that it chooses the proper actions at every possible state in the environment, such that it maximizes the reward the agent can get. so a good policy for the cat would be one that always leads the cat to the meat, no matter where the cat is. now let's connect reinforcement learning with language models.
a language model is also a kind of policy, because every time you feed a prompt to the language model it has to choose an action to perform, namely which token should come after this prompt. so in this case we talk about state and action: the state is the state the reinforcement learning agent is in, which in the case of the cat is its position inside the house, and in the case of the language model is the prompt itself that you feed in; the action is a choice from the distribution over all the next tokens that the language model can pick. so we also want to train the language model to perform its actions, that is, to choose the next tokens, in a particular way, according to some reward that we define.
specifically, in the case of reinforcement learning from human feedback, we want the language model to generate text following particular rules, for example when we do language model alignment. first of all, how are language models usually trained? we have a pre-training stage where we feed a lot of information to the language model, we throw a lot of data at it, like the entire wikipedia, the entire web, i don't know, stack overflow, leetcode, everything, and the language model learns the structure of the language; it learns a little bit of chinese, a little bit of english, a little bit of japanese, because we throw every piece of data we have at it. then we do a little bit of fine-tuning, where we train the language model on high quality data, so we increase the likelihood of generating high quality outputs instead of just whatever is on the internet. but then we also do an alignment part, in which we want the language model to follow instructions and adhere to some standards. one example is instruction fine-tuning, where we train the language model to follow a particular format and particular rules: be helpful, never use a curse word, etc. this alignment job is done through reinforcement learning from human feedback, which includes many kinds of algorithms like PPO, DPO, etc., and GRPO is one of them.
what we usually do to train the language model to follow instructions is we build a dataset of preferences: we first have a list of questions, then we ask the language model to generate some answers, and then we ask professional annotators to choose which answer they would like the language model to generate more, and which one they would not like it to generate. the goal of reinforcement learning from human feedback is to make sure that the language model becomes more likely to generate answers like the ones chosen by the annotators, and less likely to generate answers that were not chosen. this is the purpose of reinforcement learning from human feedback.
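to make the idea of a preference dataset concrete, a single record could look roughly like this; the prompt and the two answers are invented here just for illustration:

```python
# Hypothetical example of one record in a preference dataset:
# a prompt, the answer the annotator preferred, and the one they rejected.
preference_example = {
    "prompt": "Where is Shanghai?",
    "chosen": "Shanghai is a city in China, on the country's east coast.",
    "rejected": "Shanghai is beautiful.",
}
```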
okay, now that we have understood a little bit of the connection between language models and the reinforcement learning framework, let's move on to the paper. just a quick review of what we have seen so far. we know what language models are: models that generate a probability distribution over what the next token should be, based on the input. we know what reinforcement learning is: a framework for training the policy of an agent so that it chooses actions that maximize its reward. and the connection between reinforcement learning and language models is that the language model itself is a policy, because it takes actions by choosing the next token, and we want it to choose the next tokens in such a way that it follows some standards, defined according to a dataset of preferences. this dataset of preferences is converted into a reward model, but we will not be covering the reward model for now, at least.
so now let's go to the DeepSeek paper. in the DeepSeek-R1 paper, what they do is they start with a pre-trained model, the DeepSeek-V3 base, and they want it to perform better at reasoning. what does it mean to perform better at reasoning? we want the language model to work through a problem in steps that are easier for the language model to manage, and the way they do it is through reinforcement learning. let's go to the paper. first of all they say that in this section they explore the potential of language models to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure reinforcement learning process.
usually when we train language models we also try to build a very high quality dataset of what the language model should be generating, because as we said before we have multiple stages of training: one is the pre-training, in which we just throw random data from the web at the language model, and then we want the language model to generate high quality data, so we have the supervised fine-tuning part. here they skip this part: they just take the base model and they want it to develop reasoning capability just by using reinforcement learning, which means we want to incentivize the language model, through a reward system, to develop by itself the sequence of tokens that leads to a correct answer, because that is how the language model acquires as much reward as possible. it's like i take my cat and i want the cat to solve math problems: what i can do is just play with how many biscuits i give the cat, so that the cat is incentivized to solve math problems. because the cat wants to get the biscuits, it will develop whatever skill it needs in order to maximize the number of biscuits it gets. now of course the cat will never manage, because it lacks the capability of learning certain things, but that's not a problem for language models, because big language models have a lot of capacity and knowledge from pre-training.
now, the algorithm that they use in DeepSeek-R1 is called GRPO. in reinforcement learning from human feedback, historically we have always used the PPO algorithm, and more recently the DPO algorithm; there are others as well. what they use here is the GRPO algorithm, which is very similar to PPO but slightly different, and we will see how. let's see what the GRPO algorithm does. as we saw before, when we do reinforcement learning on a language model, we have a dataset of preferences: we have some questions, we ask the language model to generate multiple answers, and we ask annotators to choose which answer they prefer. that gives a signal to the language model about whether the answers it is generating are good or bad. in the case of GRPO, we have the following objective. now, if you have never seen PPO, this looks quite scary, but let's try to break it down.
step by step: what we are doing here is optimizing a policy, and the policy is the language model itself. the policy is always denoted with the greek letter pi, so this π_θ is the policy, the language model we are trying to optimize, and we want to train it to maximize the following objective. what is this objective saying? it says that if i have a list of questions, and we sample some outputs from our policy using these questions, then, based on the reward that each output gets from our reward system, we should train the language model to give more weight to those actions that result in a good reward, and to give less weight (make the language model less likely to take those actions) when they result in a bad reward. the way this is written is as follows. here you see an old policy and a new policy; for now let's ignore that, it's there because we do something called off-policy learning, which we will cover later. so for now ignore the denominator, the π_old, and just concentrate on the term π_θ: what we are saying is that we use the log probabilities of the output. what is the log probability of an output? let's do it step by step, actually.
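for reference, the GRPO objective being described looks roughly like this (reproduced from memory of the DeepSeek-R1/DeepSeekMath papers, so treat the exact notation as approximate):

```latex
J_{GRPO}(\theta)
= \mathbb{E}_{q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O \mid q)}
\Bigg[ \frac{1}{G} \sum_{i=1}^{G}
  \Big( \min\Big( \tfrac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)} A_i,\;
        \mathrm{clip}\Big(\tfrac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)},\, 1-\varepsilon,\, 1+\varepsilon\Big) A_i \Big)
  - \beta\, \mathbb{D}_{KL}\big(\pi_\theta \,\|\, \pi_{ref}\big) \Big) \Bigg]
```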
for example, imagine we ask the language model "where is Shanghai?". as we saw, we sample a few questions from a database of questions and then we generate multiple outputs for each question using our language model. these outputs are called o, and there are G of them; this is the "group" in GRPO. maybe the first time the language model will generate "Shanghai is in China"; another output could be something that has nothing to do with Shanghai; and another output could be "Shanghai is beautiful". imagine we have some magic reward system that assigns rewards, or imagine that our reward system is a human being: this human being would very likely give a very high reward to the first answer, zero reward to the second, and a nearly zero (but maybe not completely zero) score to the third. why? because the third at least talks about Shanghai, the second doesn't talk about anything related to Shanghai, and the first actually answers the question.
now, when we have a question and the generated answer, we have what is known as a trajectory. the trajectory is the list of actions that the language model has taken: the language model was given this question as input and it chose an action, generating the first token, "Shanghai"; then "Shanghai" was put back into the language model and it chose another action, the token "is"; then "is" was put back in and it chose another action, "in", etc. this is a trajectory: at each step the model chose an action based on the probability distribution that it generated. actually, it's not the language model that chooses, it's our sampling strategy that chooses the particular token: usually we can use the greedy strategy, or the top-p strategy, or whatever. so at each step we have chosen some action based on the distribution the language model generated for each state, and we therefore have a probability associated with the word "Shanghai" conditioned on the question "where is Shanghai?", a probability associated with the word "is" conditioned on "where is Shanghai? Shanghai", a probability associated with the word "in" conditioned on "where is Shanghai? Shanghai is", etc. so we have a list of probabilities.
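written out, the (log) probability the model assigns to a whole output o given the question q is just the sum of these per-token log probabilities, e.g. for q = "where is Shanghai?" and o = "Shanghai is in China":

```latex
\log \pi_\theta(o \mid q) = \sum_{t=1}^{|o|} \log \pi_\theta(o_t \mid q, o_{<t})
= \log \pi_\theta(\text{Shanghai} \mid q)
+ \log \pi_\theta(\text{is} \mid q, \text{Shanghai})
+ \log \pi_\theta(\text{in} \mid q, \text{Shanghai is})
+ \log \pi_\theta(\text{China} \mid q, \text{Shanghai is in})
```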
what we use here is the sum of all the log probabilities (equivalently, the product of the probabilities) generated at each step for a particular output o_i, which is maybe the first answer. then we ask the language model the same question again, it comes up with another output, and this output also has a list of log probabilities associated with it: these are the probabilities you see in the formula. furthermore, each term is weighted by an advantage term. the advantage term basically tells me how much better it is to choose a particular token, given a particular input, compared to all the other tokens that are available. for example, given the input "where is Shanghai?", is it better to choose the word "Shanghai" or some other word? i believe it is better to choose the word "Shanghai", so the advantage of choosing it is high: it will result in a better long-term reward for the language model, because it will lead to a good answer. up to now we have modeled the reward model as a human being who tells the language model "this is a good answer, this is a bad answer", but actually later we will see that the reward is usually also given by a language model, and in the case of DeepSeek-R1 they actually use a rule-based reward model. so let's go back: we have a list of questions, we generate multiple answers for each of these questions using our language model, and for each of these answers we have the log probabilities associated with it, which are just the sum of the log probabilities of choosing each token. we weight each of these terms by an advantage, which tells us how good it is to choose this particular token over all the others available for this particular input, and we train our language model to maximize this objective.
let's see what it means to maximize this objective, and let's also see what this "old" is. the language model that we will be training is our DeepSeek base model, right? so at the beginning, suppose that this π_old is the base version of DeepSeek. what we do is basically: we want to refine it, so we sample outputs from it, and then, through the reward model, we want to tell it "okay, do more of this, do less of that". this is one of the advantages of using reinforcement learning: if you do supervised fine-tuning, you're just telling the language model "i want this, so generate exactly this"; if you do reinforcement learning, you have the ability to tell the language model "i like this more than that". at each iteration the language model should be optimized in the following way: if the language model at the current iteration is giving a response that resulted in a good reward, the advantage term will be positive and the objective is telling the language model "do more of this"; if it is giving probability to an action that resulted in a bad reward, the advantage term will be negative, and by maximizing this objective the language model will learn to do less of that. i know that i didn't explain the π_θ and the π_θ_old very well; i believe i can do that later when we talk about off-policy learning.
moreover, in GRPO you find this KL divergence term. for people who already know what the KL divergence is, this is super easy, but for people who don't: the KL divergence is a way of measuring how different two distributions are, how far apart they are. what we want is for the language model to be fine-tuned to do more of the things that lead to a high reward and less of the things that lead to a low reward, but at the same time we don't want the language model to change its behavior too much. imagine a reward model that tells the language model to be more polite, and imagine that being polite, in my example, means always saying thank you. what could happen, if we don't enforce the KL divergence, is that the language model could just cheat and always generate "thank you thank you thank you" at every response, a list of thank yous, because that obviously results in a high reward; but the language model would stop doing its main job, which is to generate something factual and useful. so we want the language model to change a little bit, to become more polite, but not to just generate a bunch of thank yous: change, but change a little. that's why we add the KL divergence term. otherwise the language model will do what is known as reward hacking: it will try to find a shortcut to maximize its reward without actually being useful. if you want a parallel example, it's like a tax code: if the tax code is very complex, people will always find a way to cheat on it, whereas if it's very simple, made up of a few rules, you cannot cheat.
okay, what are we missing here? first of all i didn't explain off-policy learning, and i didn't explain the clip part, i.e. why we are clipping here. the clipping part basically says: if the language model is trying to change its probabilities by being too confident about the change, we don't want to let it be overly confident. what this means is: this π_θ is the language model at the current iteration, and π_old you can think of as the model at the previous iteration of the training process. if the language model at the current iteration is very confident that saying the word "Shanghai" will result in a better reward, then even if it's a good choice we don't want the language model to be overly confident, so we clip this ratio of the probabilities to at most 1 + ε, or, in the lower case, 1 − ε. because if the language model chooses the next word as "Shanghai" given this question, okay, we are lucky and it's good; but imagine the language model is overly confident and the next word is, i don't know, "coffee": then we don't want the language model to take too big a step. we want the model to learn gradually, and we control this by choosing this ε term, which is something we can tune, and this β term for the KL penalty.
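a small sketch of what the per-sample clipped term computes, assuming the log probabilities and the advantage are already available (the function and variable names here are mine, not from the paper):

```python
import math

def clipped_surrogate(logp_new: float, logp_old: float, advantage: float,
                      eps: float = 0.2) -> float:
    """PPO/GRPO-style clipped term for one sampled output.

    ratio = pi_theta(o|q) / pi_theta_old(o|q); clipping keeps the update
    small even when the new policy is very confident about its change.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(min(ratio, 1 + eps), 1 - eps)
    # keep the more pessimistic (smaller) of the two candidate objectives
    return min(ratio * advantage, clipped_ratio * advantage)

print(clipped_surrogate(logp_new=-1.0, logp_old=-2.0, advantage=1.5))
```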
okay, i believe i have covered a little bit of this. so, last review, guys: we are trying to optimize the language model iteratively by telling it to do more of what we want and less of what we don't want. how does the model know what we want and what we don't want? it's the reward we give it, according to our reward model. now let's talk about the reward model.
historically, in PPO (let's go to the other slide here), the reward model was a model with the same structure as the language model we are trying to optimize, and it gives a reward to each answer. how can we assign a numeric reward to a particular answer? if you remember, when we talked about PPO i said that we usually start with some questions, we generate a few answers, and then we ask annotators to tell us which answers they like and which they don't like. how do we convert this dataset of preferences into a number? we do that by training a language model which has the same architecture as the policy we are trying to train, just with a different head on top: instead of generating the logits of the next token, it generates a numeric reward. and we do it with what is known as the Bradley-Terry model, which is basically this loss here. you don't have to understand this loss in detail; it basically means that if we train the model on this loss, this linear head on top of the language model will learn to generate a very high value for the answers that were chosen in the dataset of preferences, and a low value for the answers that were not chosen by the professional annotators. this is how it was done with PPO: we have an objective that increases the likelihood of the things that give a high reward, and a reward model that generates a numeric reward as a signal for the language model to understand what it should do more of and what it should do less of.
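the Bradley-Terry style loss being referred to is, roughly, the standard RLHF reward-model loss (written here from memory, not copied from the slide), where y_w is the chosen answer, y_l the rejected one, and r_φ the scalar output of the reward head:

```latex
\mathcal{L}(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim D}
\Big[ \log \sigma\big( r_\phi(x, y_w) - r_\phi(x, y_l) \big) \Big]
```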
in the case of DeepSeek-R1, instead, to generate the number that gives the model a signal about what we like and what we don't like, they used a rule-based system, as you can see here. this rule-based system works for all the tasks that you can verify automatically. for example, for coding problems (leetcode-style problems): how can you check if the answer generated by the model is good? well, you just run it, and if it compiles and it runs within a predefined time limit, then it is a good answer; it doesn't matter how it came to be, if it works it's a good answer. the same goes for math: for most math problems we know the answer we expect the model to generate, so we can compare the actual generated answer with the expected one. so they create a rule-based reward system: they ask the language model to generate an output for a given problem, and then they assign the reward just by following rules. if it's a leetcode problem, run it and check whether it works: if it runs, good reward, if it doesn't, zero; if it's a math problem, compare the output of the model with the expected result and assign the reward based on that: if the answer matches what we expected, good reward, otherwise zero, etc. they also assign a reward if the language model formats the output in a certain way, for example if it follows the format of putting all its reasoning between <think> and </think> tags.
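a minimal sketch of what such a rule-based reward could look like for the math case plus a format check; the exact scores, regex and tag handling here are my own illustration, not the paper's implementation:

```python
import re

def rule_based_reward(model_output: str, expected_answer: str) -> float:
    """Toy rule-based reward: format check + answer check, in the spirit of R1."""
    reward = 0.0

    # format reward: reasoning wrapped in <think> ... </think>
    if re.search(r"<think>.*?</think>", model_output, flags=re.DOTALL):
        reward += 0.5

    # accuracy reward: compare the final answer with the expected one
    answer_part = model_output.split("</think>")[-1].strip()
    if answer_part == expected_answer.strip():
        reward += 1.0

    return reward

out = "<think>2+2 is 4 because ...</think> 4"
print(rule_based_reward(out, "4"))  # 1.5
```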
so basically, just with this, they train the language model: they let it generate answers, they score these answers with the rule-based reward system, and the language model basically automatically, by itself, learns to generate the thought process that is necessary to perform the tasks it is being asked to do. so the language model learns, through reinforcement learning with this reward system, to generate the thought process that leads to the right code for the leetcode problems, and the right thought process that is necessary for solving math problems, etc.
let me check what else we need to know from here. okay, here they show some results; i think you can check the results by yourself. what is interesting is what happens during the training of R1, just with reinforcement learning: as i said, they took the base model and added this reinforcement learning step on top of it, which is a massive reinforcement learning step (usually the alignment part is not so big), and the more they run this reinforcement learning, the more the model automatically learns to generate longer responses, because to solve problems you need to generate a longer chain of thought. through this reward system the model learns that in order to get reward it should generate longer responses. so they didn't tell the language model, with supervised fine-tuning, to generate that kind of data, with that particular format, with that kind of thought process; just with reinforcement learning, with the right incentives, the language model learned to do that. how did it do that? it learned that for a particular input it should generate a particular token, which in the long term results in a good reward. the beauty of reinforcement learning is that you not only learn to do something based on the immediate reward you get, but also based on the long-term reward you will get, because sometimes the reward only arrives later: as you can see here, the reward only applies when the entire answer is produced, so the signal the model gets is only for the entire output (actually, through the advantage term, this signal is propagated back to each single token, but okay, we can skip that). in reinforcement learning, the beauty is that sometimes the reward you get is not for a single action you take. in the case of my cat, for example, the cat will receive the reward only after taking many steps that lead it to the meat, so it doesn't immediately know whether each single action was good; but the whole sequence of actions was a good one, so we reward each single action in such a way that the cat, when it is here, will be very likely to choose to go right. this also happens in the case of language models, through the advantage term here, which is actually computed for each token.
okay, now we can go into a little more technical detail about why they chose GRPO over PPO. with PPO, to compute this advantage term you need another function, called the value function, and to compute this value function you need to train another model. in GRPO, the advantage term is calculated without a value function, by the following formula here, which is basically just based on the rewards that are already given by the reward model, which in the case of DeepSeek-R1 is rule-based. so they don't need to train this other model to estimate the value function, which would be expensive.
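the "formula here" is the group-relative advantage: each output's reward is normalized against the other outputs sampled for the same question (as far as i recall from the paper, it is written as):

```latex
A_i = \frac{r_i - \mathrm{mean}\big(\{r_1, r_2, \ldots, r_G\}\big)}{\mathrm{std}\big(\{r_1, r_2, \ldots, r_G\}\big)}
```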
okay, so now we have seen what reinforcement learning is, the connection between language models and reinforcement learning, and a little bit of the GRPO objective. let's explore this loss exactly, because the rest of the paper is basically "okay, we tried just reinforcement learning, now let's go deeper". all right, first of all, the most interesting thing about reinforcement learning, especially in the case of PPO and reinforcement learning from human feedback, is this thing called policy gradient optimization, so let's go back to the slides. because the rest of the DeepSeek paper is basically: they took this R1 trained with pure reinforcement learning and said, okay, instead of doing only reinforcement learning, let's do supervised fine-tuning, then reinforcement learning, then supervised fine-tuning again, then reinforcement learning again, and this leads to better outcomes. but that is not technically difficult to understand; i think most of you already have some background in this, and if you're here it's because you roughly understand what we are talking about. so let's go through the parts that some people may have difficulty with, which i believe are this part and this part. okay, so let's go.
imagine i want to train my cat to reach the meat. as we saw before, my cat is just an agent with a policy, and the policy tells the cat what action to take given its position inside the house; you need to think of this house as a grid (you cannot see the grid lines, but it doesn't matter). the policy, as we saw before, tells the cat what action to take given a particular position. our goal in reinforcement learning is to select the policy that maximizes the reward the agent gets when using it. so this is the objective: we want to select, among all the possible policies we could have, the one that maximizes the expected reward we get when acting according to that policy. it means that if i put this policy, this decision-making machinery, inside the brain of my cat, the best policy will tell my cat to move down here, then right, right, down, down, and so on until it arrives at the meat. how do we actually train this policy? we do what is known as policy gradient optimization.
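in symbols, "select the best policy" is the standard statement of maximizing the expected return of trajectories sampled from a parameterized policy:

```latex
\pi^{*} = \arg\max_{\pi_\theta} \; J(\theta),
\qquad
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[ R(\tau) \big]
```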
but before we get to policy gradient optimization, we need to understand a few terms. first of all, what is a trajectory? a trajectory is basically a list of states and actions. so if the cat is initially here, let's call this state number zero; the cat can choose some action, let's call it action number zero; when the cat takes action number zero it arrives in a new state, maybe here, which becomes state number one; here the cat can take another action according to its policy, action number one, which leads it into a new state s2, in which it can take another action, action number two, etc. so the trajectory is the list of states and actions that the agent goes through. what we want is: if we sample a trajectory according to our policy, we want to maximize the reward that we get from the trajectory.
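as a formula, a trajectory and its (discounted) total reward look like this; the discount factor γ is the "discount factor in the reward sum" mentioned in a moment:

```latex
\tau = (s_0, a_0, s_1, a_1, s_2, a_2, \ldots),
\qquad
R(\tau) = \sum_{t \ge 0} \gamma^{t}\, r_t
```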
how do we do that? as you know, in deep learning we are always trying to either maximize or minimize something: in language model training we usually minimize a cost function; in this case we want to maximize the expected reward when the agent acts according to this policy. this policy, however, is not just any policy: it is a particular policy made up of parameters, which we call θ. to give you a parallel of what is happening here: imagine you have a company and you are the CEO, and this company is made up of many actors, many functions, many departments. each of these things is, let's say, a parameter of your company: they define your company, because you can tune them and the way the company functions will change. how people talk to each other, how they collaborate, and so on are all parameters of your company, which define the outcome, the profit, of your company. so what you do is learn to tune all of these parameters: you tell people to behave in a particular way, or to collaborate in a particular way, or to work on other projects. this is what we do in policy gradient optimization: we calculate the gradient of this particular objective function with respect to the parameters.
(yes, later we'll talk about the discount factor in the reward sum, okay.) so what is the gradient? the gradient basically tells us how the objective will change if we change the parameters a little bit; more precisely, the gradient gives us the direction of maximum ascent of a function with respect to the variables we compute it against. so in this case we are computing the gradient of the objective function with respect to the parameters, which tells us how we should change the parameters to increase this objective function, which is exactly the expected reward. because the gradient tells us how we should change the parameters to increase the objective, we then change the parameters in the direction of the gradient, and this is what we do here: we take the expected reward when acting according to this policy, we calculate its gradient with respect to the parameters, which tells us how to change the parameters to increase the expected reward, and then we change the parameters in the direction of the gradient, and we do this iteratively.
this is called policy gradient optimization, and it is actually a beautiful result: whatever policy my cat has right now, i can just sample some trajectories from this policy by letting my cat move around, calculate the gradient of the expected reward estimated on these trajectories, and then tell the cat "hey, you should do more of this because it led to a good reward, and less of that because it led to a bad reward".
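the result being referred to is the standard policy gradient (REINFORCE) identity, together with the gradient-ascent update:

```latex
\nabla_\theta J(\theta)
= \mathbb{E}_{\tau \sim \pi_\theta}\Big[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\; R(\tau) \Big],
\qquad
\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)
```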
now, policy gradient optimization has some problems (let's skip the math; if you want the math i made a video about it, it's on youtube, so you can watch it tomorrow). the problem is that the search space of trajectories is enormous: there are a lot of trajectories the cat can take to go from here to the meat, it can go like this, or like this, or go here, come back, then go down, etc., so there are many, many trajectories. to compute this objective exactly we should actually check all the possible trajectories to get the true direction of the gradient, but this is intractable. in the case of a language model, it would mean asking the language model to generate every possible output for a particular question, which is intractable: at each position the language model can choose among, say, 30,000 tokens, so 30,000 possibilities for the first token, 30,000 for the second, 30,000 for the third, etc., and checking all of them is computationally impossible. so we approximate this with monte carlo estimation: we sample a few trajectories. however, this means we are deciding how to change our cat's policy using only a few trajectories, which, as you can see, is risky: we are making a hard decision on how to change the policy without checking the whole search space. this basically means that we have high variance, and there are many ways to reduce this variance. so when you read the term "baseline" in the DeepSeek paper, that is one of the ways to reduce this variance, because we are trying to push the language model towards certain patterns, certain chains of thought, certain sequences of tokens, without exploring all the possible generations.
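concretely, the expectation is replaced by an average over a few sampled trajectories, and a baseline b is subtracted to reduce the variance of the estimate:

```latex
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t}
\nabla_\theta \log \pi_\theta\big(a_t^{(n)} \mid s_t^{(n)}\big)\,
\big( R(\tau^{(n)}) - b \big)
```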
so let me see how we can simplify this. in order to reduce this variance, so in order to make sure we can optimize the language model without exploring all possible generations, while still getting a useful gradient (the direction that tells us which parameters to change, and in which direction, to increase the expected reward), we can introduce this advantage term here. the advantage term, for each token, tells the language model how much better it is to choose this token over all the other tokens it could choose in this position. for example, in the example we saw before, "where is Shanghai?": should the language model choose the word "Shanghai", or the word "coffee", or the word "pizza"? well, it's much more advantageous to choose the word "Shanghai", because it's very likely that the language model will then complete it as "Shanghai is in China", or "Shanghai is a city in China", or "Shanghai is located in China", etc. so the advantage of choosing the word "Shanghai" is higher compared to all the other tokens in that position. in the case of the cat, it means that when the cat is here it is very advantageous to choose "go down", because it will result in going to the meat, compared with all the other actions. this doesn't mean that choosing "up" will lead you to die or to get no reward, because you can always go up, change direction, and come back down; but it's much more advantageous to just go down. this is the advantage term, and it's the same advantage term that you see in the GRPO loss. now, the difference between the advantage term in PPO and in GRPO is that PPO estimates it using what is known as the value function, while in GRPO they compute this advantage in a different way, without this value function estimation, so GRPO is computationally more efficient in this sense.
okay, let me see what else we skipped: off-policy learning, right. so what is off-policy learning? imagine we have the cat (let's go back to the cat). to do policy gradient optimization, we saw that we need to sample some trajectories, and we don't have to sample all the possible trajectories because we are approximating. when we sample these trajectories, we are sampling from a policy, which is the current brain of the cat: we ask the current brain of the cat to choose some actions and generate some trajectories, meaning we ask the cat to just navigate the house and we see what it does. then, after the cat has navigated the house, we look at the trajectories it has taken, we give rewards to the cat based on those trajectories, and we optimize the policy, meaning we update the brain of the cat in the direction of the gradient, based on the reward it has received.
but now the brain of the cat is a new brain, because it has changed compared to before, which means that for the next iteration of optimizing the cat's decision-making we need to sample new trajectories: we need to ask the cat again to go through the house and make a few choices, then we check these choices, again tell the cat "you did well here, you didn't do well there", and the cat optimizes its policy again, which results in yet another new policy, and then again we need to sample from this new policy. as you can see, every time we do an optimization step we need to sample these trajectories again, and in the case of language models this means: first you sample some responses, then you score these responses with your reward model, and then you need to sample new responses, because the policy has changed, since we updated the language model. sampling is expensive, so to avoid repeating it at every step we introduce off-policy learning (let me show you here, where is it): we take the language model and we sample from it once, meaning we ask it to generate some responses; then we take a mini-batch of these responses and fine-tune the language model based on the reward we got on them; then we don't sample new responses, we just take another mini-batch of the trajectories we sampled initially and do another optimization step, and we keep doing this for n steps; only then do we sample new trajectories. this results in much more efficient training: we don't change the policy a little bit and then immediately sample new trajectories from it; instead we sample a lot of trajectories initially, keep them in some buffer in memory, and keep optimizing the policy using the trajectories we sampled initially.
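a rough pseudocode-style sketch of that sample-once, optimize-several-times loop; all the names here (`generate_group`, `reward_fn`, `grpo_step`, `snapshot`) are assumed helpers i made up to show the structure, not an actual library API:

```python
# Sketch of off-policy reuse: sample a group of outputs once with the
# "old" policy, then run several optimization steps on mini-batches of
# those same samples before sampling again.
def train(policy, questions, num_rounds=100, inner_steps=4, group_size=8):
    for _ in range(num_rounds):
        # 1) sample once with the current (soon-to-be "old") policy
        old_policy = policy.snapshot()                 # frozen copy, gives pi_old
        batch = []
        for q in questions.sample_minibatch():
            outputs = generate_group(old_policy, q, group_size)
            rewards = [reward_fn(q, o) for o in outputs]
            batch.append((q, outputs, rewards))

        # 2) reuse the same samples for several gradient steps
        for _ in range(inner_steps):
            for q, outputs, rewards in batch:
                grpo_step(policy, old_policy, q, outputs, rewards)
    return policy
```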
and now we can go back to the DeepSeek-R1 paper to finally understand the loss in its entirety. what we have here is the policy at the current optimization step, and the policy from which we sampled the outputs. as you can see written here, we first sample a question from our database of questions, then for each question we generate a list of outputs, of responses: we prompt the model with the question and ask it to generate multiple responses, like we saw before. we ask the language model "where is Shanghai?" and it generates one response, then we ask it again and maybe this time it says something else, and so on, so we generate a list of outputs for the same question. then we compute the following loss, which contains the ratio of the probabilities (or, equivalently, the difference of the log probabilities) assigned by the language model at the current optimization iteration with respect to the language model from which we sampled. when we initially sample the trajectories we already compute the probabilities (we can save them while sampling), so this π_old is always available.
now, what does this ratio mean? imagine this ratio is more than one: it means that at the current iteration the language model is more likely to choose this output than the old one was, i.e. it wants to choose it more. what we want is: if the advantage of choosing this action is positive, meaning it is advantageous to choose it because it results in a good reward, and at the same time the language model is more likely to choose it, then this term will be positive and big, and because we are maximizing, it will contribute positively to our objective, so we want to do more of this. on the other hand, if the ratio is less than one, the language model is trying to do less of something, and if at the same time that something is disadvantageous, that is fine too. okay, let me restate it, i got a little lost: if this ratio is above one and i get a good advantage, the language model becomes more likely to do it; if something results in a bad advantage and the language model is still doing it, that term contributes negatively to our objective, so the language model will become less likely to do it.
at the same time we don't want the language model to make big decisions 00:59:22.560 |
at every step we want to clip limit the decision making of the language 00:59:28.880 |
model at each step so it means that at each step of it of 00:59:32.480 |
optimization even if the language model is very 00:59:35.200 |
confident that something is good then we don't 00:59:40.880 |
limit its confidence by clipping this ratio here between one minus epsilon and 00:59:49.920 |
at the same time we also don't want the language model to change too much so we 00:59:54.320 |
have initial frozen model here which is a pyref 00:59:57.680 |
pyref basically means the original model so in the case of r1 01:00:07.760 |
language model that has never been trained with reinforcement learning 01:00:11.920 |
so we want because we want reinforcement learning framework we want to 01:00:15.920 |
change the language model a little bit to learn 01:00:19.120 |
reasoning but we don't want the language model to forget 01:00:22.320 |
everything or to just not behave like a language model anymore 01:00:26.560 |
otherwise the language model could just do like a reward hacking 01:00:34.400 |
the deep seek v3 base so the current policy so the current 01:00:41.360 |
language model should try to be as close as possible to the 01:00:44.640 |
original model but at the same time it should try to change according to the 01:00:48.640 |
reward it gets the advantages of the reward it gets 01:00:53.440 |
now how is the advantage term computed here basically it is 01:00:58.160 |
each reward normalized so it's basically it's like 01:01:05.360 |
centered on the mean of zero and the standard division of one 01:01:10.880 |
why do we want this because it means first of all 01:01:14.800 |
we don't want the numeric value of the reward to affect the training process 01:01:23.600 |
rewards to affect the the training process not the magnitude of its value 01:01:30.160 |
okay now that we have seen this i believe you should have 01:01:40.320 |
all of the paper because the other part that they do is okay instead of 01:01:47.040 |
just taking the language model and just training with the reinforcement learning 01:01:55.120 |
first of all let's try to introduce some very high quality super 01:01:59.280 |
fine tuning data and then we do another step of reinforcement learning and then 01:02:02.640 |
we do another step of fine tuning and then do another 01:02:06.000 |
step of reinforcement learning and this actually leads to a better 01:02:10.720 |
better model another actually interesting part of the 01:02:14.240 |
the paper is the distillation so the distillation i don't know if 01:02:17.600 |
most people are familiar with what is distillation and how it works so if you 01:02:21.440 |
want i can talk about it a little bit otherwise i think let's go to sleep guys 01:02:33.280 |
okay okay okay okay no problem please give big picture overview i mean 01:02:48.000 |
okay so distillation basically means this imagine you have a 01:03:03.440 |
okay imagine you are trying to teach yourself 01:03:10.720 |
graduate maths that's one thing right imagine you don't have any background on 01:03:18.000 |
university teacher it's two different thing right because if 01:03:21.520 |
you try to learn it by yourself you need to come up with all the strategies to 01:03:24.400 |
learn math but if you learn it from a professor then the professor can also 01:03:27.520 |
give you hints on how to learn it faster so in the case of models we have usually 01:03:37.760 |
and then we have a smaller model what we want 01:03:41.520 |
is and then we have a data set let's call it a data set so data set 01:03:49.360 |
what we do is we prompt the big brother so the 01:03:52.800 |
big model let's call it the big model actually i don't want to confuse people 01:04:10.800 |
what how do we train language models first of all 01:04:13.200 |
when we train language models we do that in a self-supervised way 01:04:17.520 |
what does it mean it means that there is no kind 01:04:21.040 |
it means that the the language model is trained 01:04:25.200 |
without annotating the data it means that we sample a lot of text 01:04:29.680 |
and we force the language model to learn to predict the next token given the 01:04:33.760 |
the context that comes before it it means that imagine you want to train 01:04:37.360 |
the language model on the following text for example 01:04:42.720 |
the following sentence for distilled model we 01:04:47.120 |
report representative results blah blah imagine we want to train the language 01:04:51.600 |
model only on this green text here what we will do is we will give it the 01:04:56.720 |
word for and we ask it to learn to predict the word distilled when it sees 01:05:00.880 |
for then we give it the word for distilled and we force it to learn to 01:05:05.520 |
predict the word models when it sees for distilled the beauty of the 01:05:09.440 |
transformer is that all this process can be done in parallel so we don't have to 01:05:20.240 |
on reasoning just by itself it will have much more difficulties however if you 01:05:25.360 |
have a big model that has already been trained on a specific task 01:05:29.360 |
then you can use the big model to help the small model learn faster 01:05:33.680 |
how when we train a language model on raw data so in this case for example we 01:05:39.680 |
said to the language model when you see four you should choose distilled 01:05:44.480 |
when you see four distilled you should models 01:05:48.080 |
this basically is called the next token prediction task 01:05:52.320 |
and the way we do it is we force the distribution that 01:05:56.000 |
the language model outputs and this distribution is the goal of the language 01:06:01.360 |
generates a distribution over all the possible next words 01:06:04.880 |
so it will assign a probability to the first word 01:06:08.320 |
next word and to the second next word and the third next word and the fourth 01:06:12.400 |
next word and we ask it okay when you see four 01:06:16.400 |
distilled you should exactly this particular word 01:06:19.840 |
which is the word models so you should choose the word 01:06:25.200 |
models and all the other words should not be chosen 01:06:28.880 |
so zero zero zero zero zero this one should be chosen with 100 percent 01:06:35.120 |
score this is how we force the language model 01:06:39.520 |
by doing it on many many many many texts the language model 01:06:42.720 |
learns to generate a distribution that very likely will generate this word 01:06:47.840 |
models when it is four distilled and less likely to generate the others 01:06:53.600 |
if we do the same job with the small model we will see that the small model 01:06:56.880 |
will have a lot of difficulties learning at the 01:07:00.800 |
same pace as the big model why because the big model has more parameters so it 01:07:04.320 |
has more flexibility in learning complex tasks 01:07:07.760 |
while the small model does not have this flexibility so it will must be much slow 01:07:16.000 |
the knowledge of the big model into the small model 01:07:19.760 |
we do it as follows we take a data set of prompts so a list of prompts 01:07:26.240 |
we feed it to the big model and then we ask the big model to generate 01:07:30.080 |
an answer not only the answer we also ask it to generate the log probabilities 01:07:38.720 |
before we use the sentence where is shanghai shanghai is in china right 01:07:43.200 |
so imagine we are dealing with the big model where is shanghai 01:07:47.280 |
the model will generate the shanghai is in china but 01:07:50.720 |
with the because we have the word shanghai we also have the log probability 01:07:55.280 |
of not only the word shanghai but also all the other words that 01:08:02.640 |
we forced the small model to learn not only to generate shanghai but we force 01:08:07.520 |
it to learn the same distribution that the big model generated for the 01:08:12.800 |
same position so let me do it again so imagine now we 01:08:17.040 |
do a concrete example imagine we have a sentence where is shanghai 01:08:30.000 |
the big model will generate an answer what does it mean to generate an answer 01:08:32.960 |
it means that it will generate first of all a distribution over what is the next 01:08:39.440 |
probabilities where the word shanghai will have a very high score so maybe 0.6 01:08:45.280 |
and maybe the word pizza will be 0.1 and the word the cat will be 0.05 etc 01:08:52.400 |
0.05 etc etc then we choose the word shanghai 01:08:56.480 |
and we ask the language model again where is shanghai 01:08:59.600 |
question mark shanghai then it will choose the word is but it will not just 01:09:03.760 |
choose the word is it will actually give us a distribution and we choose the word 01:09:07.520 |
is it will give us a distribution that looks like this for example it will say 01:09:11.360 |
okay the word is is very likely so it's maybe 01:09:15.440 |
70 probability the word i don't know road is unlikely because it 01:09:22.400 |
doesn't make sense and the word i don't know hello 01:09:30.160 |
likely so for each position we can generate the 01:09:33.600 |
log probabilities from the big model and we force the small model not to 01:09:38.240 |
learn to just generate shanghai but to learn this entire distribution 01:09:42.160 |
why because this gives a much stronger signal to the small model on what it 01:09:48.080 |
should do what it can do and what it must never do 01:09:52.000 |
instead of just telling you should do this it's a bigger signal for the small 01:09:56.240 |
model to learn faster and what they say in the 01:10:01.120 |
in the deep seek r1 paper is that by distilling 01:10:04.880 |
you can create a much stronger model than just by 01:10:12.960 |
a bit more context on the probabilities for each forward pass well 01:10:28.320 |
uh no distillation is used on all the tokens so it's you're telling for 01:10:34.960 |
each possible token you learn the entire distribution for 01:10:53.040 |
so how we assign reward for each action okay in the case 01:10:56.800 |
of um in the case of ppo by the way if you watch my video on 01:11:03.840 |
ppo i i explain all of that so i don't want to kind of uh 01:11:09.120 |
i don't know why nobody ever watched that video it's my one of my masterpieces 01:11:15.520 |
slides come from my video that i done in 2023 01:11:19.040 |
so um the way we generate the rewards is basically we the reward are usually 01:11:26.320 |
only generated for the okay there are two kind of rewards that you can 01:11:30.080 |
generate one is called the outcome based reward 01:11:36.480 |
in the case of ppo we generate the outcome based reward and then through 01:11:41.760 |
the advantage term is it is a kind of distributed on all the 01:11:46.800 |
previous tokens because we sample a response and then we judge this response 01:11:51.600 |
according to our reward model then with the advantage terms that you see in the 01:12:00.880 |
previous tokens so each token carries information on how 01:12:05.280 |
likely it is going to lead to the final reward 01:12:18.080 |
okay there is another part of way i think we forgot one part guys 01:12:24.800 |
um the unsuccessful attempts this is interesting also so uh they 01:12:33.680 |
model basically um okay the reward model that we saw before 01:12:37.360 |
is a rule-based model that is outcome driven means that the model 01:12:41.600 |
have to generate the entire process and the for example in 01:12:44.960 |
the case of read code problems the model has to generate the final code 01:12:48.320 |
to make it runnable then we run the code we see how it performs and then we have 01:12:53.600 |
signal on the reward however there are other ways of 01:12:57.520 |
generating reward and one of them is called the process reward model where 01:13:01.040 |
you divide the problem in sub-problems and then you have a model which is 01:13:08.080 |
called the process reward model that assigns a reward to each single step 01:13:12.400 |
however they in the paper they say that assigning first of all 01:13:16.480 |
dividing a problem into sub-steps it's difficult because every problem is 01:13:20.880 |
different and we don't want to kind of tell the 01:13:23.920 |
model you should follow this pattern it should 01:13:26.640 |
the model that should come try to come up with this pattern and 01:13:29.440 |
actually it does and secondly they say that it's 01:13:32.560 |
difficult to judge each single step so with sometimes we don't even know if 01:13:36.880 |
that step is good or not like when you're trying to solve a math proof 01:13:40.480 |
sometimes you do some intermitted steps that may not lead to the 01:13:46.080 |
to the you may not immediately see that they will lead to the 01:13:52.160 |
final conclusion but they are necessary right 01:13:55.520 |
so it's difficult to give them a reward until you arrive to the end so it's 01:14:00.720 |
difficult to kind of give a reward to each single step and 01:14:04.640 |
there is another technique called the multi-carlot research 01:14:10.640 |
each intermediate node is given each intermediate node is also a step 01:14:16.480 |
and each of these steps each of these nodes actually has a kind of a score 01:14:20.720 |
associated with it which increases the more time that 01:14:24.720 |
particular step leads to a successful solution so imagine for 01:14:29.360 |
example you are doing a lit code and if you a lit code problem so if you 01:14:37.200 |
of course the code will not compile so any style anything that starts with that 01:14:43.520 |
solution so that node will never be explored further 01:14:46.480 |
so multi-carlot tree search basically forces the language model to explore 01:14:50.480 |
more of the solutions that lead to successful 01:14:54.080 |
attempts and less to the one that don't succeed 01:14:58.400 |
however even by doing multi-carlot research they they get 01:15:09.520 |
the beauty of this actually is also in this sentence which is 01:15:12.880 |
rather than explicitly teaching the model on how to first of all divide the 01:15:16.720 |
problem into sub problems and how to solve the problem or tell the 01:15:20.880 |
model what is the format it should follow to solve the problem 01:15:24.080 |
the model just learns to solve to just learns to come up with the right format 01:15:29.520 |
with the right chain of thought to solve the problem of course they 01:15:35.920 |
tell the reward that the language model should 01:15:39.920 |
follow a particular format and the language model should also 01:15:43.200 |
what is it should be accurate etc etc so you can always play with a little bit 01:15:47.680 |
the reward and another thing that they notice is 01:15:49.920 |
actually if you train the language model only on the 01:15:52.560 |
reinforcement learning then it leads to kind of the language model 01:15:57.120 |
with some side effects for example the fact that the language 01:16:01.760 |
model mixes languages because nobody told the language model 01:16:05.440 |
that it has to only speak english to solve a problem that is 01:16:09.040 |
stated in english so imagine i ask you to solve a problem like 01:16:15.920 |
and imagine you are like a mixed kid and you can speak chinese and english or you 01:16:21.040 |
are just a chinese who knows also english etc 01:16:25.520 |
maybe in your head you you speak english and the chinese 01:16:29.040 |
you think in english and in chinese and then you come up with the solution and 01:16:32.400 |
that's totally correct right so unless i tell you 01:16:35.280 |
explicitly that you should never think in chinese 01:16:38.720 |
then you you why you should not take the freedom of doing it right 01:16:42.480 |
so if you never tell the model to not do something 01:16:46.160 |
the model probably will do it and because in the reward 01:16:49.520 |
here they never told the language model to never think in other languages the 01:16:52.880 |
language actually started thinking in other languages 01:16:56.000 |
anything to get the job done that's the beauty of reinforcement learning 01:17:00.640 |
so you give the right incentive and the language model will come up with the 01:17:04.000 |
right way of reaching that goal as long as the 01:17:10.880 |
yes you need a massive cluster of gpus i think i believe so because 01:17:19.600 |
you still have a like you're still running gradient descent on the language 01:17:26.240 |
fine-tuning the language model you are changing its weights so 01:17:30.640 |
okay guys um before we talk about other questions okay uh all this lecture was 01:17:57.280 |
reinforcement learning from human feedback is quite a complicated topic 01:17:59.840 |
and you need to kind of derive everything step by step this is what i 01:18:02.400 |
do in my video on reinforcement learning from human feedback here i kind of 01:18:05.600 |
sometimes skipped some steps and went back etc so i hope i didn't create 01:18:09.600 |
confusion so if there is kind of some parts about 01:18:17.440 |
think we can talk about it otherwise let's call it a day guys 01:18:22.960 |
maybe you could record for youtube more prepared i could but 01:18:26.800 |
i have a daughter guys now i'm super busy and i also have a full-time job and 01:18:51.680 |
yeah i think everyone now should have at least the basic understanding on how to 01:18:55.760 |
read the paper like when you read the paper you should 01:18:58.960 |
have a clear idea of what's happening and if you kind of need more 01:19:02.720 |
background you are uh you can check like my previous works 01:19:09.440 |
i think let's stop the recording and have a good night guys