
Paper: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning


Whisper Transcript | Transcript Only Page

00:00:13.520 | okay let's start guys um so the goal of today's paper reading
00:00:24.960 | is to go through the deep seek r1 paper and what we will be seeing is
00:00:32.320 | first of all the biggest difficulty people have
00:00:34.560 | in order to understand this paper is the reinforcement learning part
00:00:39.360 | and why it is difficult: because they use
00:00:42.880 | another algorithm called grpo and the goal of this
00:00:46.080 | initial part of this paper reading is actually to give you the background
00:00:49.440 | knowledge that is needed to understand the paper
00:00:53.840 | so i will be using the slides that i used for making my video on
00:00:58.400 | reinforcement learning from human feedback
00:01:00.560 | so to understand what is the connection between language models and
00:01:04.160 | reinforcement learning so let's go to that slides
00:01:08.480 | so let's review very fast very very fast language models
00:01:12.320 | so as you know language models are generative models
00:01:15.600 | which only have one simple function which
00:01:18.560 | only have one simple objective which is to tell us what is the next
00:01:22.400 | likely token given an input prompt so if we are given for example the input
00:01:26.640 | prompt shanghai is a city in and we feed it to the language model
00:01:29.920 | the language model will generate a probability distribution
00:01:33.040 | over what it thinks is the
00:01:38.000 | next likely token that is coherent with the prompt that we
00:01:43.680 | have given the language model and usually we sample from it for example
00:01:48.960 | if we feed this prompt to the language model it will give us that maybe the
00:01:51.840 | next likely token is the word china or beijing or cat or pizza
00:01:55.920 | and then we choose what we believe is the most likely based on its probability
00:02:00.160 | or probability score
00:02:03.120 | then we take this word we put it back in the prompt and we ask again the language
00:02:08.480 | model what the next word is and we do this job
00:02:12.720 | iteratively to generate a full text for example to generate the
00:02:20.160 | response to where is shanghai we first ask the language model where is shanghai
00:02:23.200 | it will tell us okay the next likely word is
00:02:26.240 | shanghai then we put it back in the input of the language model and it will
00:02:29.600 | tell us the next likely token is etc etc i'm also making a
00:02:33.360 | simplified assumption here that says that each token is a word and
00:02:38.080 | each word is a token this is not the case in language models but for our
00:02:41.920 | explanation we will think like in this way okay
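To make that loop concrete, here is a minimal Python sketch of the iterative generation just described, under the same one-token-per-word simplification; `next_token_distribution` is a hypothetical toy stand-in for a real language model, not anything from the talk or the paper:

```python
def next_token_distribution(tokens):
    # Hypothetical toy "model": a real LM runs a neural network forward pass
    # and returns a probability distribution over its whole vocabulary.
    if tokens[-1] == "in":
        return {"china": 0.85, "beijing": 0.10, "pizza": 0.05}
    return {"in": 0.9, "beautiful": 0.1}

def generate(prompt_tokens, max_new_tokens=3):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        next_token = max(dist, key=dist.get)  # greedy sampling strategy
        tokens.append(next_token)             # feed the chosen token back in
    return tokens

print(generate(["shanghai", "is", "a", "city", "in"], max_new_tokens=1))
# ['shanghai', 'is', 'a', 'city', 'in', 'china']
```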
00:02:46.560 | now we know what a language model is what is reinforcement learning
00:02:49.840 | we go very very simply through what reinforcement learning is and then we find
00:02:53.520 | the connection between language models and reinforcement learning
00:02:56.720 | now reinforcement learning is an area of artificial intelligence i don't remember
00:03:00.560 | if it belongs to machine learning in particular but it's an area of
00:03:03.280 | artificial intelligence that is tasked with optimizing the behavior of
00:03:08.720 | an agent and the behavior of an agent is called a policy
00:03:12.560 | which is the decision making of an agent in such a way that the agent
00:03:16.960 | performs actions that maximize the reward it gets from performing these
00:03:21.920 | actions in an environment for example for example i have a cat so
00:03:27.440 | this cat likes to eat meat like most cats
00:03:31.200 | and the cat is in my house and the cat can be considered a
00:03:35.760 | reinforcement learning agent the cat can make some decisions on how it
00:03:40.720 | wants to move in the house and this will be the policy of the cat
00:03:44.960 | so the policy tells the cat if it should move
00:03:48.160 | up down left or right in the house now this is
00:03:51.840 | my house you cannot see the borders so i will draw them because i don't know why
00:03:55.440 | the borders are not drawn here and you
00:03:59.600 | should think of this environment as being a grid environment made up of
00:04:03.760 | cells so like the following
00:04:09.120 | like this what is the goal of the cat the goal of the cat is to arrive to the
00:04:15.920 | meat so the cat at each position in
00:04:20.000 | the house can make some choices or perform some actions as we say
00:04:24.320 | technically and the actions that the cat can
00:04:28.240 | choose at each position in the house are move up
00:04:31.200 | down left or right what we want is to make sure that the cat
00:04:36.400 | learns to perform the series of actions that
00:04:39.840 | very likely lead it to the meat while avoiding the things that the cat
00:04:46.320 | is scared about which is the broom and the
00:04:48.800 | bathtub because no cat likes to take a shower
00:04:52.400 | so we designed first of all a reward system for this reinforcement learning
00:04:56.640 | agent because we want to train this reinforcement learning agent to choose
00:04:59.840 | which actions to perform in each position in the house
00:05:02.560 | based on some reward so one reward
00:05:06.880 | model could be this one for example if the
00:05:10.080 | cat moves to an empty cell it receives a reward of zero
00:05:13.200 | if the cat moves to the broom it receives a reward of minus one if it
00:05:17.680 | moves to the bathtub it receives a reward of minus 10
00:05:21.440 | however if after performing a series of actions the cat arrives to the meat
00:05:25.920 | then the cat receives a big reward of plus 100
00:05:30.320 | the decision making of this cat is governed by
00:05:33.600 | a model that we will call the policy of this cat and the goal of the policy
00:05:37.920 | is to choose an action given the current state
00:05:41.600 | and the action that the cat can choose is
00:05:44.880 | stochastic it means that this policy gives us a distribution over all the
00:05:49.600 | possible actions that the cat can take so imagine we have a very
00:05:54.480 | optimized policy if the cat is here a good policy should
00:05:58.800 | tell us that with very high probability score we should
00:06:02.240 | move down because that's one way to maximize the reward
00:06:05.280 | and with very low probability we should move left because that will take us
00:06:10.000 | towards the the bathtub
00:06:13.600 | another thing for example that this policy should
00:06:17.840 | do is if the cat is here it should not move right so the
00:06:22.960 | probability associated to the action move right should be low
00:06:26.240 | and maybe the probability associated with the action move down should be a
00:06:29.920 | little higher this is what the policy is
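As a toy sketch of this setup (the rewards, the position, and the probabilities below are invented for illustration, not taken from the slides): a reward table over cell types, and a stochastic policy that, given a position, returns a probability distribution over the four moves:

```python
import random

# reward the agent receives for moving into a cell of a given type
REWARDS = {"empty": 0, "broom": -1, "bathtub": -10, "meat": 100}

def policy(position):
    # A trained policy would output different probabilities for every position;
    # here the numbers are hard-coded for one illustrative position of the cat.
    return {"up": 0.05, "down": 0.70, "left": 0.05, "right": 0.20}

# The policy is stochastic: an action is sampled from the distribution it returns.
probs = policy(position=(1, 2))
action = random.choices(list(probs), weights=list(probs.values()), k=1)[0]
print(action)              # most often "down", the action that leads toward the meat
print(REWARDS["bathtub"])  # -10, the penalty for ending up in the bathtub
```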
00:06:34.320 | now what is the connection between language models and reinforcement
00:06:37.680 | learning so first of all the goal in
00:06:40.080 | reinforcement learning is to train this policy so to train
00:06:43.680 | the decision making of this agent in order to choose the proper
00:06:48.320 | actions at every possible state in the environment
00:06:52.080 | such that it maximizes the reward that the agent can get
00:06:57.360 | so a good policy for the cat would be a policy that always leads the cat to the
00:07:01.440 | meat no matter where the cat is now let's go let's connect the
00:07:06.320 | reinforcement learning with language models so
00:07:08.640 | language model is also kind of a policy because the language model every time
00:07:13.040 | you feed a prompt to the language model the language model has to choose an
00:07:16.400 | action to perform which is what token should come after this
00:07:20.080 | prompt so in this case we talk about the state
00:07:25.600 | and action the state is the state in which the
00:07:28.320 | reinforcement agent is in the case of the cat is the position of the cat inside
00:07:32.560 | of the house in case of the language model the state
00:07:35.120 | is the prompt itself that you feed and the action is the
00:07:38.080 | distribution over all the next token that the language model can choose from
00:07:42.960 | in the case of language models so we also want to train the language model to
00:07:46.240 | perform its actions or to choose the next
00:07:50.480 | tokens in a particular way according to some reward
00:07:55.040 | model that we can build
00:07:57.360 | specifically in the case of reinforcement learning from human
00:08:00.400 | feedback we want the language model to
00:08:04.560 | generate text following particular rules for example when we do language model
00:08:10.160 | alignment so first of all how language models
00:08:12.960 | are trained usually we have a pre-training part where
00:08:16.000 | we feed a lot of information to the language model so we throw a lot of data
00:08:19.840 | like the entire wikipedia the entire web the entire i don't know stack overflow
00:08:23.840 | and the leet code everything and the language model learns how to
00:08:27.920 | kind of the structure of the language it learns a little bit of chinese a little
00:08:31.920 | bit of english a little bit of japanese because we throw every data
00:08:35.120 | possible that we have at the language model then we do a little bit of
00:08:38.960 | fine-tuning so we train the language model to generate high quality data so
00:08:42.160 | we increase the likelihood of generating high quality outputs instead of just
00:08:45.920 | throwing whatever is on the internet but then we do also an alignment part
00:08:50.480 | in which we want the language model to follow instructions so we want the
00:08:53.440 | language model to adhere to some standards for example
00:08:56.960 | what makes the language model conversational
00:08:59.040 | is the instruction fine-tuning
00:09:02.320 | which means that we train the language model to follow a particular format so
00:09:06.400 | always for example greet the user always
00:09:10.160 | be helpful always never use a curse word etc etc etc
00:09:14.640 | and this job is done through the reinforcement learning from human
00:09:17.040 | feedback which includes many kind of algorithms like ppo dpo etc
00:09:20.960 | and grpo is one of them what we do usually in language models
00:09:25.920 | to train the language model to follow instructions is we
00:09:30.400 | generate a data set of instructions in which we first
00:09:37.840 | have some list of questions and then we ask the language model to generate some
00:09:41.600 | answers and then we ask some professional annotators to
00:09:45.360 | choose which answer they would like the language model to
00:09:48.480 | generate more and which one they don't like the language model to generate
00:09:53.200 | and the goal of reinforcement learning from human feedback is to make
00:09:57.200 | sure that the language model will generate more answers like the
00:10:01.920 | ones that are chosen by the annotators and be less likely to
00:10:05.200 | generate answers that are not chosen by the annotators
00:10:10.080 | this is called the reward model
00:10:15.680 | of reinforcement learning from human feedback
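To make this concrete, here is a minimal sketch (in Python, with made-up field names and examples, not anything from the paper) of what such a preference dataset looks like:

```python
# A minimal sketch of a preference dataset for RLHF: for each prompt we keep
# the answer the annotators preferred and one they rejected. The field names
# and the examples are illustrative, not taken from the paper.
preference_data = [
    {
        "prompt": "where is shanghai?",
        "chosen": "shanghai is a city in china.",
        "rejected": "the sky is blue.",
    },
    # ... more (prompt, chosen, rejected) triples collected from annotators
]
```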
00:10:20.320 | okay now that we have understood a little bit the connection between
00:10:24.400 | language models and the reinforcement learning framework let's
00:10:28.400 | move on to the paper i just want to do a little
00:10:31.520 | review of what we have seen so far so now we know what language models are
00:10:35.360 | they are models that generate the probability
00:10:39.680 | distribution over
00:10:45.680 | what the next token should be based on the input
00:10:49.040 | we know what is reinforcement learning which is a framework for training the
00:10:52.240 | policy of an agent in order to choose actions
00:10:55.760 | that maximize its reward what is the connection between
00:10:59.440 | reinforcement learning and language model is that the language model itself
00:11:02.480 | is a policy because it makes decisions it takes
00:11:05.600 | actions in choosing what is the next token and we want the language model to
00:11:09.600 | choose tokens so the next token in such a way that
00:11:13.120 | it follows some standards which are according to a data set of
00:11:17.360 | preferences that we usually build
00:11:21.040 | this data set of preferences is converted into a reward model but we
00:11:25.280 | will not be covering the reward model for now at least
00:11:28.320 | so now let's go to the deepseek paper now in the
00:11:31.760 | deepseek r1 paper what they do is they start with
00:11:36.240 | a pre-trained model they use the deepseek v3
00:11:39.760 | base model which i believe is roughly a 600 billion
00:11:42.720 | parameter model and then they want
00:11:46.400 | this model to perform better at reasoning
00:11:51.280 | what does it mean to perform better at reasoning we want the language model
00:11:55.120 | to find a way to solve complex problems by
00:11:59.040 | breaking them into smaller steps
00:12:04.240 | that are easier for the language model to manage and the way they do it
00:12:08.160 | is also through reinforcement learning let's go
00:12:12.240 | to the paper so let's go here let's go here let's go here okay
00:12:18.640 | first of all they say that
00:12:22.400 | in this section we explore the potential of
00:12:25.840 | language models to develop reasoning capabilities without any supervised
00:12:29.920 | data focusing on their self-evolution through a pure reinforcement learning
00:12:34.080 | process so what is supervised data when we
00:12:36.800 | train language models we also try to build some
00:12:39.680 | very high quality data set of what the language model
00:12:43.200 | should be generating because as we said before we have like
00:12:46.720 | multiple stages of training one is the pre-training in which we just throw
00:12:50.400 | random data from the web to the language model but then we want the language
00:12:53.520 | model to generate high quality data so we have this kind of the supervised
00:12:57.840 | fine-tuning part and they skip this part here they just
00:13:01.920 | take the base model and then they want the base model to
00:13:05.840 | develop the reasoning capability just by using
00:13:09.680 | reinforcement learning which means that we want to incentivize the language
00:13:14.080 | model through a reward system to develop by itself
00:13:18.400 | what is the sequence of token that should lead to so
00:13:22.320 | for the language model to acquire as much reward as possible
00:13:25.840 | it's like i take my cat and i want the cat to solve math problems
00:13:32.560 | and i what i can do i can just play with how many biscuits i can give to the cat
00:13:37.440 | so if i build my reward model in such a way
00:13:40.960 | that the cat is incentivized to solve math problems
00:13:46.560 | then by the because the cat wants to get the biscuit the cat will develop
00:13:51.120 | whatever skill it needs to develop in order to get
00:13:54.240 | maximize the number of biscuits it gets now of course the cat will never
00:13:57.440 | develop it because its underlying brain kind of
00:13:59.680 | lacks the capability of learning certain things but that's not the problem in
00:14:03.600 | language models because big language models have a lot of
00:14:07.600 | capabilities in developing novel skills
00:14:11.840 | now the algorithm that they use in DeepSeek R1 is called the GRPO
00:14:18.320 | algorithm if you look at my video on
00:14:21.520 | reinforcement learning from human feedback historically we have always
00:14:25.440 | used the PPO algorithm more recently the DPO algorithm there are
00:14:29.760 | also other algorithms like the ORPO
00:14:33.360 | what they use here is called the GRPO algorithm and it's
00:14:37.040 | very similar to the PPO but slightly different
00:14:40.160 | and we will see how let's see what the GRPO algorithm does
00:14:47.280 | well as we saw before when we do reinforcement learning on a
00:14:54.720 | language model we have a data set of preferences so we
00:14:58.160 | have some questions and then we ask the language model to generate
00:15:01.120 | multiple answers and then we ask annotators to choose which answer they
00:15:04.720 | like and then we train a reward
00:15:07.760 | model
00:15:11.120 | that gives the signal to the language model to understand
00:15:16.000 | whether the answer that the language model is generating
00:15:20.400 | is good or bad in this case with GRPO we have the
00:15:26.240 | following objective now if you have never seen
00:15:31.120 | PPO and its clipped objective this
00:15:34.320 | sounds quite scary but let's try to break it down
00:15:38.000 | step by step what are we doing here is we want
00:15:41.920 | to optimize a policy so the policy is the language model itself and the policy
00:15:47.040 | is always denoted with the letter pi like the greek letter pi so this
00:15:51.840 | pi of theta is the policy so it is the language model that we are trying to
00:15:55.280 | optimize we want this policy to be trained to
00:15:59.120 | maximize the following objective what is this objective and this
00:16:04.160 | objective is saying that if i have a list of questions that
00:16:08.560 | belong to some database of questions and
00:16:13.840 | we sample some outputs from our policy using these questions then
00:16:21.680 | based on the reward that each output of the language
00:16:28.880 | model gets from our reward system we should train the language model to
00:16:34.080 | give more weight to those actions that result
00:16:38.080 | in good reward and to give less weight
00:16:42.480 | so the language model should be less likely to take those actions that
00:16:46.560 | lead to bad reward and the way we do it
00:16:51.120 | is as follows so basically what we are doing is
00:16:56.400 | okay here you see the old policy and the new policy
00:17:00.240 | for now let's ignore that that's because we do something called
00:17:03.760 | offline learning so we will not be covering that at least now
00:17:08.160 | for now ignore the denominator here so the pi old
00:17:15.600 | just concentrate on the term pi what we are saying is that we
00:17:21.840 | generate the log probabilities of the output so what is the log
00:17:26.080 | probability of the output let's do it step by step actually okay
00:17:32.320 | imagine we have the following question
00:17:36.560 | for example imagine we ask the language model where is shanghai so
00:17:42.160 | and the language model generates because as we saw we sample a few questions
00:17:53.120 | from a database of questions and then we generate
00:17:57.760 | multiple outputs for this question using our
00:18:00.720 | language model so these outputs are called o and there are g of them this is
00:18:05.680 | the group in grpo maybe the language model the first time
00:18:10.320 | will generate let's say uh shanghai is in china so
00:18:15.360 | let me just write shanghai is in china another output could be for
00:18:22.240 | example the sky is blue
00:18:29.200 | and another output could be shanghai is beautiful
00:18:36.640 | imagine we have some magic reward system that assigns or imagine that our reward
00:18:45.120 | system is a human being this human being will very likely to
00:18:50.000 | give a very high reward to this answer zero reward to this answer and maybe
00:18:56.000 | not completely zero but nearly zero score to this answer
00:19:01.040 | why because this at least talks about shanghai this doesn't
00:19:04.400 | talk about anything related to shanghai and this actually
00:19:07.520 | answers the questions now when we have a language model we have
00:19:13.440 | a question and the generated answer we have what is known as a trajectory
00:19:19.600 | the trajectory is a list of actions that the language model has taken
00:19:24.000 | why because the language model was given this question
00:19:27.280 | as input and the language model took an action chose
00:19:31.600 | an action to generate the first token which is shanghai
00:19:34.640 | then this shanghai was put back into the language model and then the language
00:19:38.080 | model chose another action which is the token
00:19:40.480 | is and then this was put back into the language model and then the language
00:19:44.480 | model chose another action which is in etc etc this is a trajectory at each
00:19:49.840 | step of the generation process the language
00:19:52.240 | model chose an action based on the probability
00:19:55.600 | distribution that it generated actually it's not the
00:19:58.240 | language model that chooses it's our sampling strategy that chooses the
00:20:01.760 | particular token so usually we can use the greedy strategy or the top-p
00:20:05.120 | strategy or whatever so at each step we have chosen some
00:20:10.960 | action based on the distribution of the language that the language model
00:20:14.800 | generated for each state so we have a
00:20:20.560 | probability associated with the word shanghai
00:20:24.560 | conditioned on the question where is shanghai
00:20:27.680 | we have a probability associated with the word is
00:20:30.640 | conditioned on the input where is shanghai
00:20:33.680 | question mark shanghai then we have a probability associated with the word
00:20:37.840 | in conditioned on the input where is shanghai
00:20:41.200 | question mark shanghai is blah blah blah etc etc so we have a list of
00:20:45.280 | probabilities what we are doing here is that we want
00:20:50.400 | this which is the product of all the probabilities or the sum of the log probabilities
00:20:55.600 | that are generated at each step of the particular output
00:20:59.280 | o i here which is maybe the first answer here
00:21:03.440 | then we ask the language model again the same question and the language model
00:21:06.560 | will come up with another output and this output will also have
00:21:10.480 | associated with it a list of log probabilities and these are the log
00:21:14.160 | probability that you see here each log probability
00:21:17.520 | furthermore is weighted by an advantage term the advantage term
00:21:24.960 | is basically telling me how much better it is to choose a particular token
00:21:31.040 | given a particular input over all the tokens that are available
00:21:34.800 | for example imagine we have the input where is shanghai
00:21:39.360 | is it better to choose the word shanghai as the
00:21:43.200 | first word of the response or the word pizza
00:21:47.520 | i believe it's better to choose the word shanghai so the advantage of choosing
00:21:51.520 | the word shanghai would result in a better long-term
00:21:55.680 | reward for the policy so for the language
00:21:58.480 | model because it will result in a good answer so it will result in a
00:22:02.720 | high reward from our reward model for now we
00:22:05.840 | just have modeled the reward model as a human being who tells that the
00:22:09.600 | language model okay this is a good answer this is a bad answer
00:22:12.960 | but actually later we will see that the reward is actually also a language model
00:22:16.480 | and in the case of grpo they actually used
00:22:19.440 | in the case of the deep seek r1 they use actually a rule-based reward
00:22:23.120 | model so let's go back we have a list of questions
00:22:30.240 | we generate multiple answers with each of these questions
00:22:34.320 | using our language model and then we for each of these
00:22:38.400 | answers we have the log probabilities associated with this answer
00:22:42.720 | which is just the product of all the probabilities of choosing that
00:22:46.160 | particular token given that particular input
00:22:49.600 | we weight each of these log probabilities by an advantage term
00:22:54.320 | which basically tells us how good choosing this particular token is over
00:22:59.680 | all the others that are available for this particular input
00:23:03.120 | and we train our language model to maximize this objective
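For reference, the objective being walked through is the GRPO objective; my transcription of it from memory of the paper (so treat the exact notation as an approximation of the slide) is roughly:

```latex
J_{\mathrm{GRPO}}(\theta)
= \mathbb{E}_{\,q \sim P(Q),\;\{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)}
\left[
\frac{1}{G} \sum_{i=1}^{G}
\left(
\min\!\left(
\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}\, A_i,\;
\operatorname{clip}\!\left(
\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\,
1-\varepsilon,\, 1+\varepsilon
\right) A_i
\right)
- \beta\, \mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)
\right)
\right]
```

Here q is a question sampled from the dataset, o_1, ..., o_G are the G outputs sampled from the old policy (the "group" in GRPO), A_i is the advantage of output o_i, epsilon controls the clipping, and beta the strength of the KL penalty against the reference policy. Each of these pieces (the ratio, the clip, the KL term, the advantage) is discussed in the next minutes of the talk.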
00:23:07.120 | let's see what does it mean to maximize this objective and now let's also see
00:23:10.640 | what is this old here the language model that we will be
00:23:15.360 | training is our deep seek base right so at the
00:23:19.760 | beginning suppose that this pi old is the base
00:23:23.200 | version of deepseek what we do is basically we want to refine
00:23:28.560 | it iteratively by keep generating
00:23:32.480 | outputs from it and then through the reward model we want to tell it okay
00:23:36.480 | this was a good output so do
00:23:40.480 | more of this or this was a bad output so do
00:23:44.240 | less of this this is one of the advantages of using reinforcement learning
00:23:48.000 | because if you do supervised fine tuning you're
00:23:50.560 | just telling the language model to i want this so generate this if you are
00:23:54.560 | doing reinforcement learning you have the ability to tell the language model i
00:23:57.600 | want more of this and i want less of this
00:24:00.800 | what we are doing is that
00:24:04.320 | at each iteration the language model should be optimized in the following way
00:24:12.080 | if the language model at the current iteration is giving
00:24:15.120 | more probability more likelihood to generate
00:24:19.120 | a response that resulted in a good reward and the advantage term will be
00:24:25.040 | high in that case then this policy is this
00:24:30.160 | objective is telling the language model do more of this
00:24:33.440 | however if right now the language model is
00:24:36.800 | giving probability to an action that resulted in
00:24:41.360 | a bad reward and the advantage term will be negative in
00:24:46.560 | that case then by optimizing this objective here
00:24:50.400 | by maximizing the objective here the language model
00:24:53.040 | will learn to do less of that i know that i didn't explain very well
00:24:58.640 | the pi theta and the pi old theta i believe i can do that later
00:25:03.200 | if we have time by talking about
00:25:08.960 | offline learning in the case of reinforcement learning
00:25:15.440 | moreover in the grpo objective you find this KL divergence term
00:25:19.760 | now for people who already know what the KL divergence is this will be super
00:25:22.960 | easy but for people who don't know basically the KL divergence is a way
00:25:26.480 | of measuring how different two distributions are so how far
00:25:30.240 | apart they are what we want is we want the language
00:25:33.360 | model to be fine tuned to generate more of things that lead to a better
00:25:38.000 | reward to do less of things that result in
00:25:41.600 | low reward but at the same time we don't want the
00:25:44.240 | language model to change too much its behavior
00:25:46.880 | for example imagine we have a reward model
00:25:49.920 | that tells the language model to be more polite
00:25:53.760 | what the language model could do and imagine that by saying
00:25:56.880 | being polite for example in my case means that i always say thank you right
00:26:00.880 | so what could happen if we don't enforce the KL divergence is that the
00:26:04.160 | language model could just cheat and always generate thank you thank you
00:26:07.440 | thank you thank you thank you thank you at every response
00:26:09.760 | like a list of thank yous because that results obviously in a high reward
00:26:13.920 | but the language model would stop doing its main job which is to generate
00:26:17.360 | something factual and useful so we want the language model to change
00:26:21.280 | a little bit so to be more polite but not
00:26:27.840 | just generate a bunch of thank yous so
00:26:30.880 | change but change a little bit that's why we add the KL divergence
00:26:35.360 | otherwise what the language model will do it will
00:26:37.440 | do what is known as a reward hacking which is it will try to find
00:26:41.200 | a way to just learn what is a trick to get
00:26:45.440 | maximize its reward without actually being useful
00:26:50.240 | this is also for example if you want a parallel example it's like
00:26:54.240 | you have a tax code and it's very complex people will always find a way to
00:26:58.880 | cheat on it and if you have a tax code that is very simple made up of few rules
00:27:03.120 | then it's very unlikely that people will be
00:27:05.200 | able to cheat on it
00:27:08.880 | so in that case you cannot cheat
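To make the "how far apart are two distributions" idea concrete, here is a tiny Python sketch of the KL divergence between two next-token distributions; the numbers are invented for illustration and are not from any real model:

```python
import math

def kl_divergence(p, q):
    # D_KL(p || q) = sum_x p(x) * log(p(x) / q(x))
    return sum(p[tok] * math.log(p[tok] / q[tok]) for tok in p)

reference_model = {"china": 0.80, "beijing": 0.15, "pizza": 0.05}
fine_tuned      = {"china": 0.70, "beijing": 0.25, "pizza": 0.05}
drifted         = {"china": 0.05, "beijing": 0.05, "pizza": 0.90}

print(kl_divergence(fine_tuned, reference_model))  # small: behaviour barely changed
print(kl_divergence(drifted, reference_model))     # large: the model changed too much
```

In the GRPO objective this penalty is the KL between the policy being trained and a frozen reference policy, scaled by the coefficient beta mentioned a bit later.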
00:27:14.480 | okay what are we missing here so first of all i didn't
00:27:23.520 | explain the offline policy learning and i didn't
00:27:30.800 | explain the clip part so why are we clipping here
00:27:36.400 | in the clipping part basically we are saying that if the language model is
00:27:39.680 | trying to change its
00:27:44.400 | probabilities by being
00:27:48.160 | too confident about its change then we don't want to
00:27:53.680 | let the model be overly confident basically it means that if
00:27:58.160 | this is the language model at the current iteration and this
00:28:01.520 | you can think of it at the previous iteration in the training process
00:28:06.080 | if the language model at the current iteration is very confident that by
00:28:10.160 | saying the word shanghai will result in a better
00:28:13.600 | reward even if it's a good choice we don't want the language model to be
00:28:18.880 | overly confident so we clip this ratio between the probabilities up to
00:28:24.960 | one plus epsilon or in the lower case down to one minus epsilon
00:28:30.800 | because if the language model chooses the next
00:28:34.320 | word as shanghai given this question then okay
00:28:37.440 | we are lucky and it's good but imagine the language model is overly
00:28:41.040 | confident that the next word is i don't know coffee then we don't want
00:28:45.920 | the language model to make too big a step we want the model to learn as
00:28:49.760 | slowly as possible by choosing accordingly this epsilon
00:28:54.960 | term which is something that we can choose and this beta term
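Here is a small Python sketch of that clipping idea, in the spirit of the PPO-style clipped term used in GRPO; the numbers and the function name are made up for illustration:

```python
import math

def clipped_term(logp_new, logp_old, advantage, epsilon=0.2):
    ratio = math.exp(logp_new - logp_old)            # pi_theta / pi_theta_old
    clipped_ratio = max(1 - epsilon, min(ratio, 1 + epsilon))
    # take the more pessimistic of the two terms, as in PPO/GRPO
    return min(ratio * advantage, clipped_ratio * advantage)

# Even if the new policy is 3x more confident than the old one, the
# contribution is capped at (1 + epsilon) * advantage:
print(clipped_term(logp_new=math.log(0.9), logp_old=math.log(0.3), advantage=1.0))  # 1.2
```

So no matter how confident a single update makes the model, one optimization step cannot push the policy too far from the one that generated the samples.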
00:28:59.280 | okay i believe i have covered a little bit of this
00:29:03.760 | so last review guys so we are trying to optimize the language model
00:29:07.840 | iteratively by telling it to make more of something that we want
00:29:11.840 | and to do less of what we don't want how does the model know what we want and
00:29:15.920 | what we don't want it's the reward that we give it
00:29:18.720 | according to our reward model now let's talk about the reward model
00:29:22.960 | the reward model historically in ppo let's go to the other here
00:29:28.720 | was a model which was of the same structure as the
00:29:33.120 | language model that we are trying to optimize
00:29:35.680 | in which we add a linear layer on top that
00:29:38.960 | gives a reward to each answer how can we assign a numeric reward to
00:29:45.680 | a particular answer so if you remember when we talk
00:29:52.160 | about the bpo i said that usually we start with some
00:29:55.680 | questions then we generate a few answers then we ask some annotators to choose
00:29:59.360 | which answers we like and they and to also tell us which answers they
00:30:02.800 | don't like how do we convert this data set of
00:30:05.840 | preferences into a number we do that by training a
00:30:10.720 | language model which has the same architecture as the
00:30:14.160 | policy that we are trying to train just a different head on top that instead
00:30:18.160 | of generating the log probabilities of the next token generates a numeric
00:30:21.600 | reward and we do it with what is known as the bradley-terry model which is
00:30:27.040 | basically this loss here you don't have to understand this loss it
00:30:30.240 | doesn't matter it basically means that if we train the language
00:30:34.400 | model on this loss it will generate a very high reward
00:30:37.840 | this is the bradley-terry model
00:30:42.000 | so if we train our language model on this loss it will basically result
00:30:49.520 | in this head here this linear head on top of the language model
00:30:53.360 | to generate a very high value
00:30:57.200 | for the answers that were chosen in the data set of preferences and
00:31:00.800 | a very low value for the answers that were not
00:31:06.800 | chosen by the professional annotators this is how it was done with ppo
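For reference, the pairwise loss being described is the standard Bradley-Terry reward-model loss; written from memory (so treat the notation as an approximation of the slide), it is roughly:

```latex
\mathcal{L}(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim D}
\left[ \log \sigma\!\left( r_\phi(x, y_w) - r_\phi(x, y_l) \right) \right]
```

where x is the question, y_w the answer chosen by the annotators, y_l the rejected one, r_phi the scalar head on top of the language model, and sigma the sigmoid; minimizing it pushes the reward of chosen answers above the reward of rejected ones.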
00:31:13.120 | we have an objective which increases the likelihood of the things that give
00:31:18.000 | high reward we also have a reward model which
00:31:20.960 | generates a numeric reward as a signal for the language
00:31:24.480 | model to understand what it should do more and what it should
00:31:27.520 | do less in deepseek r1 they do something
00:31:30.800 | different instead of using a model as a
00:31:35.200 | reward they use a rule-based reward system so
00:31:38.640 | they don't train a neural network to
00:31:41.440 | generate a number that gives a signal to the model to
00:31:44.800 | understand what we like and what we don't like
00:31:48.960 | they used a rule-based system you can see here
00:31:52.800 | and this rule-based system you can use for all the tasks that you can kind
00:31:57.600 | of verify so for example for the leetcode
00:32:00.960 | problems how can you check if the answer
00:32:05.200 | generated by the model is good well you just run it and if it
00:32:08.400 | first compiles and secondly it
00:32:12.000 | runs within a predefined time limit then it is a good answer
00:32:16.000 | it doesn't matter how it came to be if it works it's a good answer
00:32:20.480 | just like also for example math for most
00:32:24.000 | math problems we do have the answer that we expect the model to generate so we
00:32:28.160 | can compare the actual generated answer
00:32:30.800 | with what we expect the model to generate
00:32:35.920 | so they create a rule-based reward system in a way that they
00:32:40.480 | ask the language model to generate some output for a given problem
00:32:44.160 | and then they can assign a reward by just following rules so check
00:32:47.600 | okay if it's a leetcode problem just run it and check if it runs
00:32:51.280 | okay good reward doesn't run okay zero and if it's a math problem they just
00:32:56.480 | take the output of the model compare it with the expected
00:32:59.520 | result and assign a reward based on that so if
00:33:02.800 | the answer matches what we expected good reward otherwise
00:33:06.400 | zero etc etc they also assign a reward
00:33:11.200 | to the language model if it formats the output in a certain way
00:33:17.120 | for example they give a reward to the language model
00:33:21.840 | if it follows the format of putting all
00:33:26.320 | the thought process between the tags think and
00:33:30.160 | slash think etc so basically just with this they train the
00:33:37.040 | language model so they train the language model to
00:33:40.240 | generate answers and they score these answers with a reward
00:33:46.960 | system which is rule-based and they keep
00:33:50.960 | training it
00:33:53.600 | and the language model basically automatically by itself
00:33:57.760 | learns to generate the thought process that is necessary to perform
00:34:03.200 | the tasks that it is being asked so the language model learns by
00:34:09.520 | itself just with reinforcement
00:34:12.800 | learning and with this reward system to generate the thought process
00:34:17.920 | that leads to generating the right code for the
00:34:21.920 | leetcode problems and to generate the right thought process
00:34:25.920 | that is necessary for solving math problems etc etc
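As a toy illustration of such a rule-based reward (the exact rules, the reward values, and the run_and_check helper below are my own invented stand-ins, not the paper's implementation), a sketch could look like this:

```python
import re

def run_and_check(code_str, test_cases):
    # Hypothetical stub: a real implementation would compile and execute the
    # generated code in a sandbox and verify its outputs within a time limit.
    return False

def rule_based_reward(problem, completion):
    reward = 0.0
    # Format reward: the reasoning must be wrapped in <think> ... </think> tags.
    if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL):
        reward += 0.1
    final_answer = completion.split("</think>")[-1].strip()
    if problem["type"] == "math":
        # Accuracy reward: compare against the known reference answer.
        if final_answer == problem["reference_answer"]:
            reward += 1.0
    elif problem["type"] == "code":
        # Accuracy reward: run the generated code against test cases.
        if run_and_check(final_answer, problem["test_cases"]):
            reward += 1.0
    return reward
```

No neural network is needed to produce this signal, which is the point the speaker is making.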
00:34:31.440 | let me check what else we need to know from here
00:34:36.400 | okay here they show some results i think the results you can check by yourself
00:34:41.600 | what is very interesting i think it's this
00:34:45.040 | during the training of r1 just with reinforcement learning so as i want
00:34:50.480 | to remind you they took deepseek v3 base
00:34:54.800 | and added this reinforcement learning step on top of it which is a massive
00:35:00.560 | reinforcement learning step usually the alignment part is not so big
00:35:05.120 | and the more they run this reinforcement learning
00:35:09.680 | step they saw that the language
00:35:14.080 | model automatically learns to generate longer responses
00:35:19.520 | because to solve problems you need to generate a longer chain of
00:35:23.200 | thought so the language model because of the
00:35:25.680 | reward system learns that in order to get reward
00:35:29.600 | it should generate longer responses so they didn't tell the language model
00:35:35.040 | with supervised fine tuning to generate that kind of
00:35:38.240 | data with that particular format with that kind of thought process
00:35:42.720 | just with reinforcement learning with the right incentives the language model
00:35:46.320 | learned to do that how did it do that it learned that at a particular input it
00:35:51.600 | should generate that particular token which in the long term results in a good
00:35:56.320 | reward so the beauty of
00:35:58.960 | reinforcement learning is that you not only learn to do something
00:36:03.360 | based on the immediate reward that you get
00:36:06.640 | but also on the long-term reward that you will get because sometimes the reward
00:36:11.280 | as you can see here the reward model only applies when the entire answer is
00:36:17.680 | produced so the signal
00:36:21.360 | that the model gets is only for the entire output
00:36:26.240 | actually okay through the advantage term this signal is propagated back to each
00:36:30.000 | single token but okay we can skip that in the in the reinforcement learning
00:36:34.160 | actually the beauty is that sometimes the reward you get is not for
00:36:37.840 | the single action you take so in the case of my cat for example
00:36:41.200 | let's say here so let's go back to my cat
00:36:45.280 | the cat will receive reward only after taking many steps that lead it to the
00:36:50.080 | meat so only when the cat is here it will
00:36:52.960 | know okay all this sequence of actions was a good
00:36:56.480 | choice but this signal is propagated back to
00:37:00.880 | each single action in such a way that when the cat will
00:37:04.480 | be here it will be very likely to choose i need to go
00:37:08.080 | down and less likely to choose i need to go
00:37:11.040 | right this also happens in the case of language
00:37:14.080 | models in this case through the advantage term here
00:37:18.160 | where is it through this advantage term here
00:37:22.080 | which is actually done for each token in the case of
00:37:25.120 | okay now we can go into a little bit more technical detail
00:37:29.520 | on why they choose grpo over ppo first of all
00:37:33.840 | well with ppo basically this advantage term here
00:37:37.600 | requires another function to compute called the value function and
00:37:42.320 | this value function basically to be computed requires the training of
00:37:45.600 | another model whereas in grpo
00:37:49.920 | this advantage term is calculated without the value
00:37:53.680 | function but by the following formula here which is
00:37:56.240 | basically just based on the rewards which is already
00:38:02.720 | given by the reward model which we have which in the case of
00:38:07.760 | deepseek r1 is rule-based so they don't need to train this other
00:38:11.280 | model to estimate the value function which
00:38:15.680 | adds more complexity to the system
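The "formula based only on the rewards" the speaker points at is, as far as I recall from the paper, the group-relative advantage, where each output's reward is normalized against the other outputs of the same group:

```latex
A_i = \frac{r_i - \mathrm{mean}\big(\{r_1, r_2, \ldots, r_G\}\big)}{\mathrm{std}\big(\{r_1, r_2, \ldots, r_G\}\big)}
```

So an output only gets a positive advantage if it scored better than the average of its own group, and that group mean plays the role of the baseline/value function that PPO would otherwise have to learn.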
00:38:15.680 | okay so now we have seen what is the reinforcement learning we have seen what
00:38:19.520 | is the connection between language models and reinforcement learning we
00:38:21.760 | have seen a little bit what is the the grpo objective
00:38:29.280 | okay i believe let's actually do a poll guys
00:38:33.120 | do you want me to go deeper in the grpo like
00:38:37.120 | let's explore exactly this loss because actually the the rest of the paper is
00:38:43.920 | okay we tried just reinforcement learning okay
00:38:47.200 | so people like deep so let's go deep okay all right first of all the most
00:38:52.240 | interesting thing about reinforcement learning
00:38:55.760 | especially in the case of ppo and reinforcement learning from human feedback is
00:39:00.480 | this thing called policy gradient optimization so let's go back to the
00:39:03.360 | other slide and then we go back to the paper
00:39:05.920 | because the rest of the deepseek paper is basically they took
00:39:09.680 | this r1 and then said okay instead of just doing reinforcement learning let's
00:39:13.920 | do maybe multiple steps of reinforcement
00:39:16.240 | learning and supervised fine tuning and then reinforcement learning again then
00:39:19.200 | supervised fine tuning again and it leads to better outcomes but that is not technically difficult to
00:39:23.360 | understand i think because most of you already kind of have backgrounds in this
00:39:27.680 | so if you're here it's because you kind of understand what we are talking about
00:39:31.840 | so let's do all the things that maybe some people have difficulties with which
00:39:35.200 | is i believe this and this part okay so let's go
00:39:37.840 | deeper so um okay so let's go back to my cat
00:39:42.560 | and imagine i want to train my cat so i want to train my cat
00:39:46.080 | to reach the meat so as we saw before my cat is just
00:39:52.320 | an agent with a policy a policy is what tells the cat what action to take
00:39:56.240 | given the position of the cat inside of the house
00:39:58.800 | you need to think of this house as a grid so like
00:40:01.920 | the following you cannot see the grid lines because i don't know
00:40:05.840 | it's not showing but okay it doesn't
00:40:08.240 | matter what is the policy the policy as we saw
00:40:12.080 | before tells the cat what action to take given a particular position
00:40:15.280 | our goal in reinforcement learning is to select a policy that
00:40:18.800 | maximizes the reward that the agent gets when using this policy so
00:40:25.040 | this is the objective we want to select among
00:40:29.040 | all the possible policies that we can have
00:40:32.560 | the one that maximizes an objective what is this objective it is the expected
00:40:38.320 | reward that we can get when using this policy
00:40:41.040 | it means that if i put inside the brain of my cat
00:40:44.080 | this policy which is the decision making stuff
00:40:48.080 | then this best policy will tell my cat to move down here
00:40:52.960 | and move right here and move right right down down down
00:40:57.040 | until it arrives at the meat this should be the
00:41:00.800 | policy how do we actually train this policy we do what is
00:41:09.360 | known as policy gradient optimization
00:41:14.160 | but before we understand policy gradient optimization we need to understand a few terms
00:41:17.920 | so we want to first of all learn what
00:41:23.520 | trajectories are a trajectory is basically a list of states and
00:41:26.800 | actions so
00:41:30.560 | okay the trajectories are a list of states and actions so if the cat
00:41:35.520 | is here it's in the state uh let me draw the
00:41:37.920 | lines otherwise it's too bad for you guys if you cannot see it
00:41:41.920 | here here here here
00:41:45.840 | okay this is the the cat is initially here so this let's call it the state
00:41:51.040 | number zero and the cat can choose some action and
00:41:53.440 | let's call it the action number zero when the cat takes the action number
00:41:56.960 | zero it will arrive in a new state so maybe the cat will arrive here and it
00:42:00.560 | will become the state number one of the cat and then the cat here we can take
00:42:03.680 | another action according to its policy let's call it action number one and
00:42:07.520 | which will lead the cat into a new state so we'll call it s
00:42:10.480 | s2 in which it can take
00:42:15.040 | another action and let's call it the action number two etc etc so the
00:42:18.240 | trajectory is a list of states and action that the agent can take
00:42:21.600 | inside of the environment uh what we want
00:42:25.040 | is if we sample a trajectory according to our policy we want to
00:42:29.840 | maximize the reward that we get from each of the
00:42:32.960 | possible trajectories that we can take
00:42:36.800 | how do we do that let's do it here so basically what we do is the
00:42:43.680 | following this is our objective
00:42:48.240 | as you know in deep learning what we are doing is
00:42:53.520 | we are trying to either maximize something or to minimize
00:42:57.280 | something in the case of
00:43:00.960 | model training we usually always minimize a cost function in this case we
00:43:06.000 | want to maximize the expected rewards when the agent acts
00:43:10.480 | according to this policy this policy however is not just any policy it is a
00:43:15.040 | particular policy made up of some parameters that we will call
00:43:18.160 | theta to give you a parallel on what is happening here
00:43:23.120 | is the following imagine you have a company
00:43:26.800 | and you are like the ceo of the company and this company is made up of
00:43:34.720 | many actors and many functions and many departments
00:43:38.480 | so each of these things let's say are parameters of
00:43:43.440 | your company they define your company because you can tune them and the
00:43:46.800 | company function will change so how people talk to each other how
00:43:51.200 | people work how the departments work how the
00:43:54.560 | logistics work how the office works etc
00:43:57.040 | they are all parameters of your company which define the outcome of your
00:44:00.640 | company and imagine you want to maximize the
00:44:04.320 | profit of your company so what you do you learn to tune all of
00:44:08.160 | these parameters so you learn to for example tell people to behave in a
00:44:11.760 | particular way or you tell the people to collaborate in a particular way or to
00:44:15.760 | work on some projects and not work on some
00:44:17.760 | other projects this is what we do in
00:44:21.120 | policy gradient optimization we calculate the gradient with respect
00:44:26.480 | to the parameters of this particular objective function
00:44:31.280 | which what is the gradient
00:44:35.120 | yes later we talk about the discount factor in the word sum okay
00:44:38.560 | so what is the gradient the gradient basically tells us
00:44:44.400 | how the how the objective will change if we change the
00:44:51.040 | parameter a little bit the gradient always tell us how it will
00:44:57.040 | increase so the gradient tells us what is the
00:45:00.080 | the direction of max the maximum ascent of a particular
00:45:03.680 | function with respect to the variable in to which you
00:45:06.960 | calculate it so in this case we are computing the
00:45:09.440 | gradient with respect to the parameters of the objective function which tells us
00:45:17.280 | how should i change the parameters to increase
00:45:20.480 | this objective function which is exactly the expected reward
00:45:24.480 | that we want we can get from this policy so
00:45:28.640 | because the the the gradient tells us how we should change the parameters to
00:45:34.240 | increase the objective then we change the parameters
00:45:37.600 | according to the direction of the gradient and this is what we do here
00:45:41.600 | so we have an objective which is tells us
00:45:44.960 | the expected reward when acting according to this policy
00:45:48.480 | we calculate its gradient with respect to the parameters which tells us
00:45:52.320 | how we should change these parameters to increase this objective so to increase
00:45:56.640 | the expected reward and then we change the parameters in the same direction of
00:46:01.120 | the gradient and we do it iteratively this is called
00:46:05.680 | policy gradient optimization and actually this is a beautiful result
00:46:09.280 | because it means that i can just use my cat
00:46:13.600 | whatever policy my cat has right now i can just
00:46:18.560 | sample some trajectories from this policy when i ask my cat to move
00:46:22.080 | around check what kind of reward i get
00:46:25.840 | calculate the gradient with respect to the expected reward according to these
00:46:30.240 | trajectories and then tell the cat hey you should do
00:46:33.920 | more of this because this led to a better reward
00:46:37.120 | or you should do less of this because it leads to a bad reward
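Written out, the objective and the policy-gradient result being described are the standard ones; this is my own summary in notation, not the slide verbatim:

```latex
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \right], \qquad
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right], \qquad
\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)
```

So the gradient of the expected reward can be estimated purely from trajectories sampled with the current policy, the rewards they collect, and the log-probabilities of the actions taken, which is exactly the "beautiful result" mentioned next.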
00:46:41.680 | this is policy gradient optimization now it has some problems because policy
00:46:47.440 | gradient optimization basically okay as you can
00:46:50.400 | let's skip the math because if you want the math i i made a video
00:46:53.760 | it's on youtube so you can watch it tomorrow but it has some problems
00:46:57.840 | because as you can see the search space of the
00:47:01.840 | cat is enormous because the possible
00:47:05.280 | trajectories that the cat can take to go from here
00:47:08.080 | to the meat are a lot because the cat can go like this
00:47:11.840 | it can go like this it can go like this it can go here then come back then go
00:47:17.200 | down etc etc so there are many many many
00:47:21.120 | trajectories that the cat can take to go to the meat
00:47:26.400 | however to compute this objective we should actually check all the
00:47:32.320 | possible trajectories to get the direction of the gradient
00:47:37.760 | however this is intractable it means that in the case of a language model we should
00:47:42.000 | ask the language model to generate all the possible outputs
00:47:46.320 | given a particular question which is intractable because
00:47:49.520 | at each position the language model can choose any token so
00:47:54.000 | let's say the vocabulary size is 30 000
00:47:57.040 | then the language model can choose 30 000 possibilities for the first
00:48:00.560 | token then 30 000 for the second 30 000 for the third etc etc
00:48:05.200 | and to check all of them it's computationally
00:48:08.400 | impossible so we can always approximate this with
00:48:12.160 | a sample with a sample this is called the
00:48:18.880 | monte carlo estimation however this results in because we are
00:48:23.840 | not checking all the possible trajectories
00:48:25.680 | but we are making the decision of optimizing our cat
00:48:28.880 | using only a few trajectories of course as you can see it's a risky situation
00:48:33.440 | so it means that we are making a hard decision on how we should change
00:48:37.520 | a policy without checking all the possible search space
00:48:41.920 | this basically means that we have high variance
00:48:45.200 | and there are many ways to reduce this variance
00:48:48.880 | so when you read the term baseline in the deep seek paper
00:48:53.040 | this is one of the ways to reduce this variance because we
00:48:56.160 | are trying to optimize the language model into choosing
00:48:59.920 | certain patterns into choosing certain chain of thoughts into choosing
00:49:04.000 | certain sequence of tokens without exploring all the possible generation
00:49:08.160 | that the language model can have
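Concretely, the Monte Carlo estimate with a baseline that the term "baseline" refers to looks, in standard notation (again my summary, not the slide), like:

```latex
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N}
\big( R(\tau_i) - b \big) \sum_{t} \nabla_\theta \log \pi_\theta\!\big(a_{t}^{(i)} \mid s_{t}^{(i)}\big)
```

where the N trajectories are the sampled ones and b is a baseline, for example the average reward of the sampled trajectories; subtracting b does not change the expected value of the gradient but reduces its variance, and the group mean inside the GRPO advantage plays exactly this role.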
00:49:11.280 | um so so let me see how can we simplify this one
00:49:18.720 | we blah blah okay so in order to reduce this variance so
00:49:25.760 | in order to make sure that we optimize the language
00:49:29.600 | model even without checking all the possible
00:49:33.040 | generations but still making sure that we make the
00:49:37.040 | gradient that we get so the direction that tells us
00:49:39.680 | what parameter we should change and in which direction in order to increase the
00:49:43.920 | expected reward we can introduce this advantage term
00:49:47.680 | here this advantage term basically for each token tells the language model
00:49:52.480 | how much better choosing this token is over all the other tokens that it can choose
00:49:57.520 | in this position for example in the example that we saw
00:50:01.680 | before so where is shanghai should the language
00:50:06.000 | model choose the word shanghai or should it choose the word coffee or should
00:50:09.680 | it choose the word pizza well it's much more advantageous
00:50:14.000 | to choose the word shanghai because it's very likely that the
00:50:18.160 | language model will then complete it as shanghai is in china or
00:50:21.440 | shanghai is a city in china or shanghai is located in china etc etc
00:50:25.840 | so choosing the word shanghai
00:50:30.400 | results in the long term in a much better
00:50:34.640 | reward so the advantage of choosing shanghai
00:50:38.160 | is higher compared to all the other tokens in that
00:50:41.280 | position in the case of the cat it means that when the cat is
00:50:46.640 | here it's very advantageous to choose go down because
00:50:51.840 | it will result in going to the meat over all the other
00:50:57.040 | actions this doesn't mean that
00:51:00.080 | choosing up will lead you to die or to get no reward because you can always go
00:51:04.240 | up and then change direction and go down but it's
00:51:07.200 | much more advantageous to just go down this is
00:51:13.440 | the meaning of the advantage term
00:51:16.480 | and this is the same advantage term that you see in the
00:51:20.880 | grpo loss now the difference between the
00:51:24.800 | advantage term that you see in ppo and in grpo is that
00:51:28.240 | the advantage term in ppo requires
00:51:31.600 | what is known as the value function while in grpo they
00:51:36.560 | compute this advantage in a different way
00:51:40.560 | which still results in variance reduction
00:51:45.120 | but without having this value function estimation
00:51:48.240 | so grpo is computationally more efficient i would say in this sense
00:52:01.360 | okay let me see what else we need we've also skipped the part on
00:52:08.320 | off-policy learning right so off-policy learning
00:52:13.760 | so what is off-policy learning imagine we have the cat
00:52:21.760 | let's go to the cat actually okay to compute this loss
00:52:25.840 | okay for policy gradient optimization we saw that we need to
00:52:28.480 | sample some trajectories right and we don't have to sample
00:52:33.280 | all the possible trajectories because we are trying to approximate it
00:52:33.280 | um so when we sample these trajectories we are sampling from
00:52:37.440 | a policy which is the current brain of the cat
00:52:40.800 | so we ask the current brain of the cat to choose some actions and
00:52:44.880 | generate some trajectories it means that we ask the cat to just
00:52:48.160 | navigate the house and let's see what it does
00:52:52.720 | okay so we ask the the cat to just navigate the house and
00:52:59.920 | see what it does and then after the cat has navigated the house
00:53:04.160 | we look at what they are the trajectory that the cat has taken and then we give
00:53:08.400 | reward to the cat based on the trajectory it has taken
00:53:12.160 | and then we optimize the policy which means that we optimize the
00:53:15.600 | brain of the cat uh with the whatever it has learned with the with
00:53:22.000 | the direction of the gradient based on the reward it has received
00:53:25.200 | but now the brain of the cat is a new brain because it has
00:53:29.120 | changed compared to the past which means that for the next iteration of
00:53:34.000 | optimizing the brain of the cat or its decision making skills
00:53:38.320 | we need to sample new trajectories so we need to ask the cat again
00:53:42.160 | to go through the house and make a few choices
00:53:45.520 | and then we check these choices and again we tell the cat hey
00:53:49.120 | here you did well here you didn't do well so the cat will learn
00:53:52.880 | to optimize its policy which will result in a new policy
00:53:59.360 | and then again we need to sample from this policy but as you can see every
00:54:02.400 | time we do an optimization step we need to sample these trajectories again
00:54:06.240 | and in the case of language models this means that first you need to sample some
00:54:10.720 | responses then you reward these responses based on your reward model
00:54:16.240 | and then you need to sample new responses because now the policy
00:54:21.360 | has changed because we updated the language model in order to avoid
00:54:26.560 | this sampling process which is expensive we introduce
00:54:30.560 | off-policy learning in which let me show you here where is it
00:54:36.880 | okay in which basically we take the language model
00:54:41.200 | we ask it to generate a lot of trajectories
00:54:44.480 | and we do it once then we sample some of these trajectories which means
00:54:48.720 | basically we ask it to generate some responses then we sample
00:54:51.760 | some of these responses and we fine-tune the language model
00:54:55.040 | based on the reward we got on these responses then we don't sample
00:54:58.800 | new trajectories or
00:55:01.680 | responses we just take another mini batch of the
00:55:04.720 | trajectories that we sampled initially and again we do another step of
00:55:08.320 | optimization and we keep doing it for n steps only then we sample new
00:55:14.080 | trajectories this results in a much more efficient
00:55:18.320 | training so it's not like we
00:55:22.960 | change the policy a little bit and then sample new trajectories
00:55:26.960 | from this policy no we sample a lot of trajectories
00:55:29.600 | initially we keep them in some database in memory or whatever you want
00:55:34.080 | then we keep optimizing the policy
00:55:37.200 | using the trajectories that we have sampled initially
00:55:41.040 | and this basically results in much more efficient training
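Here is a minimal sketch of that off-policy reuse pattern, under my own toy assumptions (a single-step "policy" over a small vocabulary instead of a real language model): trajectories and their log-probabilities are sampled once with the old policy, then reused for several mini-batch optimization steps before re-sampling.

```python
import torch
import torch.nn.functional as F

# Toy sketch of off-policy reuse: sample once with the old policy, store log-probs,
# then run several optimization steps over mini-batches before sampling again.
# The "policy" here is a single softmax over 8 tokens, purely for illustration.

vocab_size, num_samples = 8, 64
logits = torch.nn.Parameter(torch.zeros(vocab_size))         # toy policy parameters
optimizer = torch.optim.Adam([logits], lr=1e-2)

for outer in range(10):                                       # expensive sampling happens only here
    with torch.no_grad():
        old_log_probs = F.log_softmax(logits, dim=-1)
        actions = torch.multinomial(old_log_probs.exp(), num_samples, replacement=True)
        old_logp_a = old_log_probs[actions]                   # log-prob of each sampled action
        rewards = (actions == 3).float()                      # pretend token 3 is the "good" answer
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    for epoch in range(4):                                    # reuse the same samples several times
        for idx in torch.randperm(num_samples).split(16):     # mini-batches
            new_logp_a = F.log_softmax(logits, dim=-1)[actions[idx]]
            ratio = (new_logp_a - old_logp_a[idx]).exp()      # importance ratio pi_new / pi_old
            loss = -(ratio * advantages[idx]).mean()          # maximize ratio-weighted advantage
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```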
00:55:47.920 | and now we can go back to the DPO sorry the DeepSeek R1 paper here
00:55:54.640 | to understand finally the loss in its entirety
00:55:58.480 | so what we are doing here is we have a policy
00:56:02.560 | at the current step of optimization and then we have the policy from which
00:56:07.680 | we sampled the trajectories so
00:56:11.840 | as you can see it's written here so we sample
00:56:14.960 | first a question from our database of questions
00:56:18.160 | then for each question we generate a list of outputs
00:56:21.680 | of responses so we prompt the model basically with the question and then we
00:56:26.320 | ask it to generate multiple responses and you can generate
00:56:29.200 | multiple responses like we saw here like we ask the language model where is Shanghai
00:56:32.240 | and the language model will generate one response then
00:56:34.560 | we ask it again where is Shanghai and maybe this time the language model will say
00:56:37.440 | something else and then we ask it again etc etc so we
00:56:40.000 | generate a list of outputs for the same question then we
00:56:44.480 | compute the following loss which is basically
00:56:47.440 | the ratio of the probabilities or equivalently the
00:56:52.240 | difference of the log probabilities so the ratio of the probabilities
00:56:57.360 | assigned to the output by the current
00:57:02.400 | iteration so the iteration at which we are currently
00:57:07.120 | optimizing the language model with respect to
00:57:11.120 | the language model from which we sampled so
00:57:14.400 | basically when we initially sample the trajectories we already pre-compute the
00:57:17.920 | probabilities we can compute the probabilities
00:57:22.080 | while sampling and save them so we always have available this
00:57:26.640 | pi old now what does this ratio mean imagine that at the current
00:57:33.040 | iteration this ratio is more
00:57:35.360 | than one it means that at the current iteration
00:57:37.920 | the language model is telling me i want to choose this
00:57:41.680 | output more because i am more likely to choose this one now what we want is
00:57:48.240 | if the advantage of choosing this action is good
00:57:52.960 | a positive advantage means that it's
00:57:56.160 | advantageous to choose this
00:57:59.120 | action because it results in a good reward so
00:58:02.400 | if the language model is more likely to choose something
00:58:05.840 | and at the same time this something also results in a good advantage
00:58:10.640 | then this term here will be positive and it will be
00:58:14.800 | big and because we are maximizing this will contribute positively to our
00:58:19.840 | objective because we are trying to maximize it so we want to do
00:58:22.720 | more of this however if the language model
00:58:26.160 | on the other hand is trying to do less of something meaning that this ratio
00:58:30.960 | will be less than one and at the same time this
00:58:34.640 | results in something that is disadvantageous it
00:58:39.200 | means that it's not advantageous to do this
00:58:42.640 | then the language model should do even less of it okay let me restate it so
00:58:51.120 | i don't get lost if this ratio is high
00:58:55.680 | and i get a good advantage it means that the language model will
00:59:01.680 | be more likely to do it if something results in a bad
00:59:06.240 | advantage and the language model is still doing it then it will be a big
00:59:10.880 | negative term which will contribute negatively to our
00:59:14.960 | objective so the language model will be less likely to do it
00:59:18.640 | at the same time we don't want the language model to make big changes
00:59:22.560 | at every step we want to clip to limit the decision making of the language
00:59:28.880 | model at each step so it means that at each step of
00:59:32.480 | optimization even if the language model is very
00:59:35.200 | confident that something is good we don't
00:59:38.800 | care how confident the language model is we want to
00:59:40.880 | limit its confidence by clipping this ratio here between one minus epsilon and
00:59:46.880 | one plus epsilon
00:59:49.920 | at the same time we also don't want the language model to change too much so we
00:59:54.320 | have an initial frozen model here which is pi ref
00:59:57.680 | pi ref basically means the original model so in the case of r1
01:00:02.320 | it is the DeepSeek-V3 base which is the
01:00:07.760 | language model that has never been trained with reinforcement learning
01:00:11.920 | so with this reinforcement learning framework we want to
01:00:15.920 | change the language model a little bit to learn
01:00:19.120 | reasoning but we don't want the language model to forget
01:00:22.320 | everything or to just not behave like a language model anymore
01:00:26.560 | otherwise the language model could just do reward hacking
01:00:29.680 | so pi ref is basically
01:00:34.400 | the DeepSeek-V3 base so the current policy so the current
01:00:41.360 | language model should try to stay as close as possible to the
01:00:44.640 | original model but at the same time it should try to change according to the
01:00:48.640 | reward it gets the advantages of the reward it gets
01:00:53.440 | now how is the advantage term computed here basically it is
01:00:58.160 | each reward normalized so it's basically like
01:01:03.200 | each reward coming out from a distribution
01:01:05.360 | centered on a mean of zero with a standard deviation of one
01:01:10.880 | why do we want this because it means first of all
01:01:14.800 | we don't want the numeric value of the reward to affect the training process
01:01:19.760 | but rather how much better it is compared to the other
01:01:23.600 | rewards to affect the training process not the magnitude of its value
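Putting the pieces together, here is a hedged sketch of the GRPO objective for one question with a group of sampled outputs. The names, shapes, and random log-probabilities are my own illustrative assumptions (one log-probability per output for brevity, while the paper averages over tokens), and the clip range epsilon and KL weight beta are typical values rather than necessarily the ones used for R1.

```python
import torch

# Hedged sketch of the GRPO objective for one question with G sampled outputs.
# Per-output log-probs for brevity (the paper averages over tokens); values are toy.

G = 4                                                   # group size: outputs sampled per question
logp_new = torch.randn(G, requires_grad=True)           # log-prob of each output under the current policy
logp_old = logp_new.detach() + 0.1 * torch.randn(G)     # log-prob under the sampling (old) policy
logp_ref = logp_new.detach() + 0.1 * torch.randn(G)     # log-prob under the frozen reference model
rewards  = torch.tensor([1.0, 0.0, 1.0, 0.0])           # rule-based outcome rewards for the outputs
eps, beta = 0.2, 0.04                                   # clip range and KL weight (illustrative values)

# Group-relative advantage: each reward normalized by the group mean and standard deviation.
advantage = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Clipped surrogate: ratio of new to old probability, limited to [1 - eps, 1 + eps].
ratio = (logp_new - logp_old).exp()
surrogate = torch.min(ratio * advantage, torch.clamp(ratio, 1 - eps, 1 + eps) * advantage)

# KL penalty that keeps the current policy close to the frozen reference model
# (the unbiased estimator used in the GRPO formulation: r - log r - 1 with r = pi_ref / pi_new).
log_r = logp_ref - logp_new
kl = log_r.exp() - log_r - 1

objective = (surrogate - beta * kl).mean()              # we maximize this, so the loss is its negative
loss = -objective
loss.backward()
```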
01:01:30.160 | okay now that we have seen this i believe you should have
01:01:37.120 | most of the knowledge to understand
01:01:40.320 | all of the paper because the other part that they do is okay instead of
01:01:47.040 | just taking the language model and just training it with reinforcement learning
01:01:50.880 | let's first of all
01:01:55.120 | try to introduce some very high quality supervised
01:01:59.280 | fine-tuning data and then we do another step of reinforcement learning and then
01:02:02.640 | we do another step of fine-tuning and then another
01:02:06.000 | step of reinforcement learning and this actually leads to a better
01:02:10.720 | model another actually interesting part of
01:02:14.240 | the paper is the distillation so i don't know if
01:02:17.600 | most people are familiar with what distillation is and how it works so if you
01:02:21.440 | want i can talk about it a little bit otherwise i think let's go to sleep guys
01:02:25.840 | let's see
01:02:28.240 | so yes means let's talk about it
01:02:33.280 | okay okay okay okay no problem please give big picture overview i mean
01:02:39.760 | bro you can just read the paper yourself
01:02:42.720 | okay yes okay more on distillation
01:02:48.000 | okay so distillation basically means this
01:02:53.680 | imagine you are trying to
01:02:56.720 | teach yourself
01:03:03.440 | graduate maths
01:03:10.720 | that's one thing right imagine you don't have any background on
01:03:15.600 | math and now imagine instead you learn it from a
01:03:18.000 | university teacher those are two different things right because if
01:03:21.520 | you try to learn it by yourself you need to come up with all the strategies to
01:03:24.400 | learn math but if you learn it from a professor then the professor can also
01:03:27.520 | give you hints on how to learn it faster so in the case of models we have usually
01:03:33.360 | a big model so let's call it big brother
01:03:37.760 | and then we have a smaller model
01:03:41.520 | and then we have a data set let's call it a data set
01:03:49.360 | what we do is we prompt the big brother so the
01:03:52.800 | big model let's call it the big model actually i don't want to confuse people
01:03:58.080 | so this is the big model
01:04:01.920 | and this is the small model so
01:04:06.480 | so now how do we train language models
01:04:10.800 | first of all
01:04:13.200 | when we train language models we do that in a self-supervised way
01:04:17.520 | what does it mean it means that
01:04:21.040 | the language model is trained
01:04:25.200 | without annotating the data it means that we sample a lot of text
01:04:29.680 | and we force the language model to learn to predict the next token given
01:04:33.760 | the context that comes before it imagine you want to train
01:04:37.360 | the language model on the following text for example
01:04:42.720 | the following sentence for distilled models we
01:04:47.120 | report representative results blah blah imagine we want to train the language
01:04:51.600 | model only on this green text here what we will do is we will give it the
01:04:56.720 | word for and we ask it to learn to predict the word distilled when it sees
01:05:00.880 | for then we give it the words for distilled and we force it to learn to
01:05:05.520 | predict the word models when it sees for distilled the beauty of the
01:05:09.440 | transformer is that all this process can be done in parallel so we don't have to
01:05:12.640 | do it one word at a time
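As a small illustration of how those training pairs are formed (a toy sketch assuming each word is one token, as in the explanation above; real tokenizers split differently), the inputs are just the text and the targets are the same text shifted by one position, which is what lets the transformer learn every position in parallel:

```python
# Toy sketch of next-token training pairs, assuming each word is one token.
sentence = ["for", "distilled", "models", "we", "report", "representative", "results"]

inputs = sentence[:-1]   # what the model sees
targets = sentence[1:]   # what it must predict at each position, all learned in parallel

for i, target in enumerate(targets):
    print(inputs[: i + 1], "->", target)
# ['for'] -> distilled
# ['for', 'distilled'] -> models
# ...
```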
01:05:15.920 | if you do the job of training a small model
01:05:20.240 | on reasoning just by itself it will have much more difficulty however if you
01:05:25.360 | have a big model that has already been trained on a specific task
01:05:29.360 | then you can use the big model to help the small model learn faster
01:05:33.680 | how well when we train a language model on raw data so in this case for example we
01:05:39.680 | said to the language model when you see for you should choose distilled
01:05:44.480 | when you see for distilled you should choose models
01:05:48.080 | this basically is called the next token prediction task
01:05:52.320 | and the way we do it is we force the distribution that
01:05:56.000 | the language model outputs and this distribution is the output of the language
01:05:59.520 | model so as we saw before the language model
01:06:01.360 | generates a distribution over all the possible next words
01:06:04.880 | so it will assign a probability to the first possible
01:06:08.320 | next word and to the second next word and the third next word and the fourth
01:06:12.400 | next word and we ask it okay when you see for
01:06:16.400 | distilled you should choose exactly this particular word
01:06:19.840 | which is the word models so you should choose the word
01:06:25.200 | models and all the other words should not be chosen
01:06:28.880 | so zero zero zero zero zero and this one should be chosen with 100 percent
01:06:35.120 | score this is how we force the language model
01:06:39.520 | by doing it on many many many texts the language model
01:06:42.720 | learns to generate a distribution that very likely will generate the word
01:06:47.840 | models when it sees for distilled and less likely to generate the others
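Concretely, this "put 100 percent on the right word, zero on everything else" target is just the cross-entropy loss against a one-hot distribution; here is a minimal sketch with a made-up five-word vocabulary:

```python
import torch
import torch.nn.functional as F

# Toy sketch: next-token prediction as cross-entropy against a one-hot target.
# Hypothetical 5-word vocabulary; index 2 plays the role of the word "models".
vocab = ["pizza", "cat", "models", "china", "hello"]

logits = torch.randn(1, len(vocab), requires_grad=True)   # model scores for the position after "for distilled"
target = torch.tensor([2])                                # the correct next word: "models"

# cross_entropy compares the model's softmax distribution with a distribution
# that puts probability 1 on "models" and 0 on every other word.
loss = F.cross_entropy(logits, target)
loss.backward()                                           # pushes probability mass toward "models"
```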
01:06:53.600 | if we do the same job with the small model we will see that the small model
01:06:56.880 | will have a lot of difficulty learning at the
01:07:00.800 | same pace as the big model why because the big model has more parameters so it
01:07:04.320 | has more flexibility in learning complex tasks
01:07:07.760 | while the small model does not have this flexibility so it will be a much slower
01:07:11.600 | learner so how can we distill
01:07:16.000 | the knowledge of the big model into the small model
01:07:19.760 | we do it as follows we take a data set of prompts so a list of prompts
01:07:26.240 | we feed it to the big model and then we ask the big model to generate
01:07:30.080 | an answer not only the answer we also ask it to generate the log probabilities
01:07:35.440 | at each step of the generation process so
01:07:38.720 | before we used the sentence where is shanghai shanghai is in china right
01:07:43.200 | so imagine we are dealing with the big model where is shanghai
01:07:47.280 | the model will generate shanghai is in china but
01:07:50.720 | along with the word shanghai we also have the log probability
01:07:55.280 | of not only the word shanghai but also all the other words that
01:07:59.360 | we could have chosen in that position
01:08:02.640 | we force the small model to learn not only to generate shanghai but we force
01:08:07.520 | it to learn the same distribution that the big model generated for the
01:08:12.800 | same position so let me do it again so imagine now we
01:08:17.040 | do a concrete example imagine we have a sentence where is shanghai
01:08:24.160 | we give it to the big model
01:08:30.000 | the big model will generate an answer what does it mean to generate an answer
01:08:32.960 | it means that it will generate first of all a distribution over what is the next
01:08:36.400 | likely next token which will be a list of
01:08:39.440 | probabilities where the word shanghai will have a very high score so maybe 0.6
01:08:45.280 | and maybe the word pizza will be 0.1 and the word the cat will be 0.05 etc
01:08:52.400 | 0.05 etc etc then we choose the word shanghai
01:08:56.480 | and we ask the language model again where is shanghai
01:08:59.600 | question mark shanghai then it will choose the word is but it will not just
01:09:03.760 | choose the word is it will actually give us a distribution
01:09:07.520 | that looks like this for example it will say
01:09:11.360 | okay the word is is very likely so it's maybe
01:09:15.440 | 70 percent probability the word i don't know road is unlikely because it
01:09:22.400 | doesn't make sense and the word i don't know hello
01:09:26.720 | also doesn't make sense so it is even less
01:09:30.160 | likely so for each position we can generate the
01:09:33.600 | log probabilities from the big model and we force the small model not to
01:09:38.240 | learn to just generate shanghai but to learn this entire distribution
01:09:42.160 | why because this gives a much stronger signal to the small model on what it
01:09:48.080 | should do what it can do and what it must never do
01:09:52.000 | instead of just telling it you should do this it's a bigger signal that helps the small
01:09:56.240 | model learn faster and what they say in
01:10:01.120 | the DeepSeek R1 paper is that by distilling
01:10:04.880 | you can create a much stronger small model than by just
01:10:08.640 | training it from scratch
01:10:12.960 | a bit more context on the probabilities for each forward pass well
01:10:28.320 | no distillation is used on all the tokens so you're telling the small model for
01:10:34.960 | each position you learn the entire distribution
01:10:53.040 | not only the last one
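Here is a minimal sketch of that kind of distribution-level distillation as described above (soft targets from a teacher, a KL loss on every position); the tensor shapes and temperature are my own assumptions, and this illustrates the classic technique being explained rather than the exact recipe used in the paper.

```python
import torch
import torch.nn.functional as F

# Sketch of distribution-level ("soft label") distillation over every position.
# Shapes: batch of 2 sequences, 6 positions, vocabulary of 100 tokens (all made up).
batch, seq_len, vocab = 2, 6, 100
temperature = 2.0                                  # softens both distributions

teacher_logits = torch.randn(batch, seq_len, vocab)                       # from the frozen big model
student_logits = torch.randn(batch, seq_len, vocab, requires_grad=True)   # from the small model

teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)

# KL divergence between the teacher's and the student's next-token distributions,
# averaged over every position, not just the last one.
loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
loss.backward()
```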
01:10:56.800 | so how do we assign a reward for each action okay in the case
01:11:03.840 | of ppo by the way if you watch my video on
01:11:09.120 | ppo i explain all of that
01:11:13.200 | i don't know why nobody ever watched that video it's one of my masterpieces
01:11:15.520 | and nobody ever cared about it but all these
01:11:19.040 | slides come from my video that i made in 2023
01:11:26.320 | so the way we generate the rewards is basically the rewards are usually
01:11:30.080 | only generated for the final outcome okay there are two kinds of rewards that you can
01:11:30.080 | generate one is called the outcome based reward
01:11:33.680 | and one is called the process based reward
01:11:41.760 | in the case of ppo we generate the outcome based reward and then through
01:11:46.800 | the advantage term it is kind of distributed over all the
01:11:51.600 | previous tokens because we sample a response and then we judge this response
01:11:56.160 | according to our reward model then with the advantage terms that you see in the
01:12:00.880 | loss this reward is distributed over all the
01:12:05.280 | previous tokens so each token carries information on how
01:12:18.080 | likely it is to lead to the final reward
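A tiny sketch of what "the outcome reward gets distributed over the previous tokens" can look like in the simplest case (my own toy setup, using the GRPO-style group normalization): one scalar reward per sampled response becomes the advantage applied to every token of that response.

```python
import torch

# Toy sketch: one outcome reward per response, broadcast to every token as its advantage.
response_lengths = [5, 3, 4]                      # number of tokens in each sampled response
rewards = torch.tensor([1.0, 0.0, 1.0])           # one outcome-based reward per response

# Normalize across the group (GRPO-style), then give every token of a response
# the same advantage as the response it belongs to.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
per_token_advantages = [a.expand(n) for a, n in zip(advantages, response_lengths)]

for adv in per_token_advantages:
    print(adv)
```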
01:12:24.800 | okay i think we forgot one part guys
01:12:31.280 | the unsuccessful attempts this is interesting also so they
01:12:33.680 | tried what is known as the process reward
01:12:37.360 | model basically okay the reward model that we saw before
01:12:41.600 | is a rule-based model that is outcome driven meaning that the model
01:12:44.960 | has to generate the entire process for example in
01:12:48.320 | the case of LeetCode problems the model has to generate the final code
01:12:53.600 | and make it runnable then we run the code we see how it performs and then we have
01:12:53.600 | signal on the reward however there are other ways of
01:12:57.520 | generating reward and one of them is called the process reward model where
01:13:01.040 | you divide the problem in sub-problems and then you have a model which is
01:13:08.080 | called the process reward model that assigns a reward to each single step
01:13:16.480 | however in the paper they say that first of all
01:13:20.880 | dividing a problem into sub-steps is difficult because every problem is
01:13:23.920 | different and we don't want to tell the
01:13:26.640 | model you should follow this pattern
01:13:29.440 | the model should try to come up with this pattern by itself and
01:13:32.560 | actually it does and secondly they say that it's
01:13:36.880 | difficult to judge each single step sometimes we don't even know if
01:13:40.480 | that step is good or not like when you're trying to solve a math proof
01:13:46.080 | sometimes you do some intermediate steps and
01:13:52.160 | you may not immediately see that they will lead to the
01:13:55.520 | final conclusion but they are necessary right
01:14:00.720 | so it's difficult to give them a reward until you arrive at the end so it's
01:14:04.640 | difficult to give a reward to each single step and
01:14:07.440 | there is another technique called Monte Carlo Tree Search
01:14:10.640 | which basically is a tree search in which
01:14:16.480 | each intermediate node is also a step
01:14:20.720 | and each of these steps each of these nodes actually has a kind of score
01:14:24.720 | associated with it which increases the more often that
01:14:29.360 | particular step leads to a successful solution so imagine for
01:14:33.360 | example you are doing a LeetCode problem so if you
01:14:37.200 | start your code for example with a typo
01:14:41.280 | of course the code will not compile so anything that starts with that
01:14:43.520 | typo will never lead to a successful
01:14:46.480 | solution so that node will never be explored further
01:14:50.480 | so Monte Carlo Tree Search basically forces the language model to explore
01:14:54.080 | more of the solutions that lead to successful
01:14:58.400 | attempts and less of the ones that don't succeed
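To make that "score that increases the more often a step leads to success" concrete, here is a tiny illustrative sketch of the bookkeeping a tree-search node might keep (names and structure are my own, not the paper's experiment):

```python
from dataclasses import dataclass, field

# Tiny sketch of the per-node statistics behind a Monte Carlo style tree search:
# each intermediate step tracks how often passing through it ended in success,
# so the search can favor branches with higher scores.

@dataclass
class StepNode:
    visits: int = 0
    successes: int = 0
    children: dict = field(default_factory=dict)   # next step text -> StepNode

    def update(self, succeeded: bool) -> None:
        # called once per rollout that passed through this step
        self.visits += 1
        self.successes += int(succeeded)

    def score(self) -> float:
        # fraction of rollouts through this step that reached a correct solution
        return self.successes / self.visits if self.visits else 0.0

root = StepNode()
bad_start = root.children.setdefault("pritn(solve())", StepNode())
bad_start.update(succeeded=False)    # code starting with a typo never compiles
print(bad_start.score())             # 0.0 -> this branch will not be explored further
```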
01:15:01.760 | however even by doing Monte Carlo Tree Search they get
01:15:06.400 | results that are not as good as the
01:15:09.520 | reinforcement learning driven one and
01:15:12.880 | the beauty of this actually is also in this sentence which is
01:15:16.720 | rather than explicitly teaching the model how to divide the
01:15:20.880 | problem into sub problems and how to solve the problem or telling the
01:15:24.080 | model what format it should follow to solve the problem
01:15:29.520 | the model just learns to come up with the right format
01:15:33.280 | with the right chain of thought to solve the problem of course they
01:15:35.920 | play with the reward because they
01:15:39.920 | tell the reward that the language model should
01:15:43.200 | follow a particular format and the language model should also
01:15:47.680 | be accurate etc etc so you can always shape the reward a little bit
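For a sense of what such a rule-based reward can look like, here is a hedged sketch of a format-plus-accuracy check (the tag names, weights, and scoring are illustrative assumptions, not the paper's exact implementation):

```python
import re

# Hedged sketch of a rule-based, outcome-driven reward: one part for following
# an expected <think>...</think><answer>...</answer> format, one part for a
# correct final answer. Weights and tag handling are illustrative assumptions.

def rule_based_reward(completion: str, ground_truth: str) -> float:
    reward = 0.0

    # Format reward: did the model wrap its reasoning and answer in the expected tags?
    if re.fullmatch(r"(?s)\s*<think>.*</think>\s*<answer>.*</answer>\s*", completion):
        reward += 0.5

    # Accuracy reward: does the extracted answer match the known ground truth?
    match = re.search(r"(?s)<answer>(.*?)</answer>", completion)
    if match and match.group(1).strip() == ground_truth.strip():
        reward += 1.0

    return reward

print(rule_based_reward("<think>2+2 is 4</think><answer>4</answer>", "4"))  # 1.5
print(rule_based_reward("the answer is 4", "4"))                            # 0.0
```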
01:15:49.920 | and another thing that they notice is
01:15:52.560 | actually if you train the language model only with
01:15:57.120 | reinforcement learning then it leads to a language model
01:16:01.760 | with some side effects for example the fact that the language
01:16:05.440 | model mixes languages because nobody told the language model
01:16:09.040 | that it has to only speak english to solve a problem that is
01:16:09.040 | stated in english so imagine i ask you to solve a problem like
01:16:12.160 | a math problem and i say it in english
01:16:15.920 | and imagine you are like a mixed kid and you can speak chinese and english or you
01:16:21.040 | are just a chinese person who also knows english etc
01:16:25.520 | maybe in your head
01:16:29.040 | you think in english and in chinese and then you come up with the solution and
01:16:32.400 | that's totally correct right so unless i tell you
01:16:35.280 | explicitly that you should never think in chinese
01:16:38.720 | then why should you not take the freedom of doing it right
01:16:42.480 | so if you never tell the model not to do something
01:16:46.160 | the model probably will do it and because in the reward
01:16:49.520 | here they never told the language model to never think in other languages the
01:16:52.880 | language model actually started thinking in other languages
01:16:56.000 | anything to get the job done that's the beauty of reinforcement learning
01:17:00.640 | so you give the right incentive and the language model will come up with the
01:17:04.000 | right way of reaching that goal as long as the
01:17:07.840 | signal is strong enough
01:17:10.880 | yes you need a massive cluster of gpus i believe so because
01:17:19.600 | you're still running gradient descent on the language
01:17:23.120 | model itself so every time you are
01:17:26.240 | fine-tuning the language model you are changing its weights
01:17:30.640 | okay guys before we talk about other questions all this lecture was
01:17:45.200 | totally improvised so i didn't
01:17:48.320 | prepare it so i hope i didn't
01:17:53.440 | make big mistakes because the problem is
01:17:57.280 | reinforcement learning from human feedback is quite a complicated topic
01:17:59.840 | and you need to kind of derive everything step by step this is what i
01:18:02.400 | do in my video on reinforcement learning from human feedback here i kind of
01:18:05.600 | sometimes skipped some steps and went back etc so i hope i didn't create
01:18:09.600 | confusion so if there are some parts about
01:18:14.480 | the rl that you didn't understand then i
01:18:17.440 | think we can talk about it otherwise let's call it a day guys
01:18:22.960 | maybe you could record for youtube more prepared i could but
01:18:26.800 | i have a daughter guys now i'm super busy and i also have a full-time job and
01:18:30.400 | i also have a wife so have some mercy on me
01:18:45.040 | thank you guys
01:18:51.680 | yeah i think everyone now should have at least the basic understanding on how to
01:18:55.760 | read the paper like when you read the paper you should
01:18:58.960 | have a clear idea of what's happening and if you kind of need more
01:19:09.440 | background you can check my previous works
01:19:09.440 | i think let's stop the recording and have a good night guys
01:19:15.520 | and now if you want you can also
01:19:20.800 | how to stop the recording