
Paper: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning


Transcript

perfect wonderful uh do you also are you also handy inside yes yeah wonderful so we can have like three streams or two streams only i don't know how that works okay let's start guys um so the goal of today's paper reading is to go through the deep seek r1 paper and what we will be seeing is first of all the biggest difficulty people have in order to understand this paper is the reinforcement learning part and what why it is difficult because we are you they use another algorithm called the grpo and the goal of this initial part of this paper reading is actually to give you the background knowledge that is is needed to understand the paper so i will be using the slides that i used for making my video on reinforcement learning from human feedback so to understand what is the connection between language models and reinforcement learning so let's go to that slides so let's review very fast very very fast language models so as you know language models are generative models which only have one simple function which only have one simple objective which is to tell us what is the next likely token given an input prompt so if we are given for example the input prompt shanghai is a city in and we feed it to the language model the language model will generate a probability distribution over what it thinks means the what it thinks is the next likely token that is coherent with the prompt that we have given the language model and usually we sample the for example if we feed this prompt to the language model it will give us that maybe the next likely token is the word china or beijing or cat or pizza and then we choose what we believe is the most likely based on its probability or probability score then we take this word we put it back in the prompt and we ask again the language model what is the next next word and we do this this job iteratively to generate a full text for example to generate the response to where is shanghai we first ask the language model where is shanghai it will tell us okay the next likely word is shanghai then we put it back in the input of the language model and it will tell us the next likely token is etc etc i'm also making a simplified assumption here that says that each token is a word and each word is a token this is not the case in language models but for our explanation we will think like in this way okay now we know what is language model what is reinforcement learning we go very very simple through what is reinforcement learning and then we find the connection between language models and reinforcement learning now reinforcement learning is an area of artificial intelligence i don't remember if it belongs to machine learning in particular but it's an area of artificial intelligence that is tasked with optimizing the behavior of an agent and the behavior of an agent is called a policy which is the decision making of an agent in such a way that the agent performs actions that maximize the reward it gets from performing these actions in an environment for example for example i have a cat so this cat likes to eat meat like most cats and the cat is in my house and the cat will be considered of a reinforcement learning agent the cat can make some decisions on how it wants to move in the house and this will be the policy of the cat so the policy tell the cat if the cat should move up down left or right in the house now this is my house you cannot see the border so i will draw them because i don't know why the the borders are not drawn here and you should think of this 
environment as being a grid environment made up of cells so like the following like this what is the goal of the cat the goal of the cat is to arrive to the meat so the the cat can at each position in the house it can make some choices some perform some actions we say technically and the actions that the cat can can choose at each position in the house is a move up down left or right what we want we want to make sure that the cat learns to perform the series of actions that lead him with very likely to the meat while avoiding the things that the cat is scared about which is the broom and the bathtub because no cat likes to take shower so we designed first of all a reward system for this reinforcement learning agent because we want to train this reinforcement learning agent to choose which actions to perform in each position in the house based on some reward so one reward that we could do one reward model could be this one for example if the cat moves to an empty cell it receives a reward of zero if the cat moves to the broom it receives a reward of minus one if it moves to the bathtub it receives a reward of minus 10 however if after performing a series of actions the cat arrives to the meat then the cat receives a big reward of plus 100 the decision making of this cat is governed by a model that we will call the policy of this cat and the goal of the policy is to choose an action given the current state and the action that the cat can choose is stochastic it means that this policy gives us a distribution over all the possible action that the cat can take so if if we have imagine we have a very optimized policy if the cat is here the good policy should tell us that with very high probability score we should move down because that's one way to maximize the reward and with very low probability we should move left because that will take us towards the the bathtub the another thing for example that this policy should do is if the cat is here for example it should not move right so the probability associated to the action move right should be low and maybe the probability associated with the action move down should be a little higher this is what the policy is now what is the connection between language models and reinforcement learning so first of all what is the goal in reinforcement learning is to train this policy so to train this decision making of this cat of this agent in order to choose the proper actions at every possible state in the environment such that it maximizes the reward that the agent can get so a good policy for the cat would be a policy that always leads the cat to the meat no matter where the cat is now let's go let's connect the reinforcement learning with language models so language model is also kind of a policy because the language model every time you feed a prompt to the language model the language model has to choose an action to perform which is what token should come after this prompt so in this case we talk about the state and action the state is the state in which the reinforcement agent is in the case of the cat is the position of the cat inside of the house in case of the language model the state is the prompt itself that you feed and the action is the distribution over all the next token that the language model can choose from in the case of language models so we also want to train the language model to perform its action actions or to choose the next tokens in a particular way according to some reward that we that we can build a reward model that we can 
build in the case specifically case of reinforcement learning from human feedback we want the language model to generate text using particular rules for example when we do language model alignment we are first of all how language models are trained usually we have a pre-training part where we feed a lot of information to the language model so we throw a lot of data like the entire wikipedia the entire web the entire i don't know stack overflow and the leet code everything and the language model learns how to kind of the structure of the language it learns a little bit of chinese a little bit of english a little bit of japanese because we throw every data possible that we have at the language model then we do a little bit of fine-tuning so we train the language model to generate high quality data so we increase the likelihood of generating high quality outputs instead of just throwing whatever is on the internet but then we do also an alignment part in which we want the language model to follow instructions so we want the language model to adhere to some standards for example what makes the language model conversational is the conversation is the instruction fine-tuning which means that we train the language model to follow a particular format so always for example greet the user always be helpful always never use a curse word etc etc etc and this job is done through the reinforcement learning from human feedback which includes many kind of algorithms like ppo dpo etc and grpo is one of them what we do usually in language models to train the language model to follow instructions is we generate a data set of instructions in which we we first have some list of questions and then we ask the language model to generate some answers and then we ask some professional annotators to choose which answer they would like the language model to generate more and which one they don't like the language model to generate and the the goal of reinforcement learning from human feedback is to make sure that the language model will generate more answers like the ones that are chosen by the annotators and the less likely to generate answers that are not chosen by the annotators this is called the reward model of the of the reinforcement learning from human feedback okay now that we have understood a little bit the connection between language models and the reinforcement learning framework let's move on to the paper i just want to do a little review of what we have seen so far so now we know what are language models they are models that generate the probability over what is the next likely what is the next token the probability distribution of what is the what should be the next token based on the input we know what is reinforcement learning which is a framework for training the policy of an agent in order to choose actions that maximize its reward what is the connection between reinforcement learning and language model is that the language model itself is a policy because it makes decisions it takes actions in choosing what is the next token and we want the language model to choose tokens so the next token in such a way that it follows some standards which are according to a data set of preferences that we usually build this data set of preferences is converted into a reward model but we will not be covering the reward model for now at least so now let's go to the deepseq paper now in the deepseq in the deepseq r1 paper what they do is they start with a pre-trained model they they use the deepseq v3 base model 
which I believe is a roughly 600-billion-parameter model, and then they want this model to perform better at reasoning. What does it mean to perform better at reasoning? We want the language model to find a way to solve complex problems by breaking them into smaller steps that are easier for the model to manage, and the way they do it is also through reinforcement learning. Let's go to the paper. First of all they say: in this section we explore the potential of language models to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure reinforcement learning process. So what is supervised data? When we train language models we also try to build a very high quality dataset of what the language model should be generating, because as we said before we have multiple stages of training: one is the pre-training, in which we just throw random data from the web at the language model, but then we want it to generate high quality outputs, so we have the supervised fine-tuning part. They skip this part here: they just take the base model and want it to develop the reasoning capability purely through reinforcement learning, which means we incentivize the language model through a reward system to figure out by itself what sequence of tokens lets it acquire as much reward as possible. It's like I take my cat and I want the cat to solve math problems: all I can do is play with how many biscuits I give the cat, so if I build my reward model in such a way that the cat is incentivized to solve math problems, then because the cat wants the biscuits it will develop whatever skill it needs to maximize the number of biscuits it gets. Of course a cat will never develop that skill, because its underlying brain lacks the capacity to learn certain things, but that's not a problem for big language models, which have a lot of capacity for developing novel skills. Now, the algorithm they use in DeepSeek-R1 is called the GRPO algorithm. If you look at my video on reinforcement learning from human feedback, historically we have always used the PPO algorithm, more recently the DPO algorithm, and there are also other algorithms like ORPO; what they use here is called GRPO, and it is very similar to PPO but slightly different, and we will see how. Let's see what the GRPO algorithm does. As we saw before, when we do reinforcement learning on a language model we have a dataset of preferences: we have some questions, we ask the language model to generate multiple answers, we ask annotators to choose which answer they like, and then we train a reward model that gives a signal to the language model about whether the answer it is generating is good or bad. In the case of GRPO we have the following objective; if you have never seen PPO this looks quite scary, but let's break it down step by step.
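Since the objective is hard to follow by ear, here it is written out, reconstructed from memory of the paper, so treat the exact notation as approximate rather than a verbatim quote:

```latex
\mathcal{J}_{\mathrm{GRPO}}(\theta)
= \mathbb{E}_{\,q \sim P(Q),\; \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)}
\left[
\frac{1}{G} \sum_{i=1}^{G}
\left(
\min\!\left(
\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}\, A_i,\;
\operatorname{clip}\!\left(
\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\,
1-\varepsilon,\, 1+\varepsilon
\right) A_i
\right)
- \beta\, \mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta \,\Vert\, \pi_{\mathrm{ref}}\right)
\right)
\right]
```

Here G is the group size, A_i the group-normalized advantage of output o_i, ε the clipping range and β the weight of the KL penalty against the reference model; the transcript walks through each of these pieces below.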
What are we doing here? We want to optimize a policy, and the policy is the language model itself; the policy is always denoted with the Greek letter pi, so this π_θ is the language model that we are trying to optimize. We want this policy to be trained to maximize the following objective. What is this objective saying? If I have a list of questions that belong to some database of questions, and we sample some outputs from our policy using these questions, then based on the reward that each output gets from our reward system, we should train the language model to give more weight to the actions that result in good reward, and to be less likely to take the actions that lead to bad reward. The way we do it is as follows. Here you see the old policy and the new policy; for now let's ignore that, it's because we do something called off-policy learning, which we will cover later. For now ignore the denominator, the π_old, and just concentrate on the term π. What we are saying is that we compute the log probabilities of the output. What is the log probability of an output? Let's do it step by step. Imagine we ask the language model "where is Shanghai?". As we saw, we sample a few questions from a database of questions and then we generate multiple outputs for each question using our language model; these outputs are called o and there are G of them, and this is the "group" in GRPO. Maybe the first time the language model will generate "Shanghai is in China", another output could be "the sky is blue", and another output could be "Shanghai is beautiful". Imagine we have some magic reward system, or imagine that our reward system is a human being: this human will very likely give a very high reward to the first answer, zero reward to the second, and a nearly-zero but not completely zero score to the third. Why? Because the third at least talks about Shanghai, the second doesn't talk about anything related to Shanghai, and the first actually answers the question. Now, when we have a question and the generated answer, we have what is known as a trajectory. The trajectory is the list of actions that the language model has taken: the language model was given the question as input and it chose an action to generate the first token, which is "Shanghai"; then this "Shanghai" was put back into the language model and it chose another action, the token "is"; then this "is" was put back in and it chose another action, "in", and so on. This is a trajectory. At each step of the generation process the language model chose an action based on the probability distribution it generated; actually it's not the language model that chooses, it's our sampling strategy that chooses the particular token, so usually we use the greedy strategy or the top-p strategy or whatever. So at each step we have chosen some action based on the distribution the language model generated for that state: we have a probability associated with the word "Shanghai" conditioned on the question "where is Shanghai?", a probability associated with the word "is" conditioned on the input "where is Shanghai? Shanghai", then a probability associated with the word "in" conditioned on the input "where is Shanghai? Shanghai is", and so on.
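To make the "log probability of an output" concrete, here is a minimal sketch: `token_log_prob` is a hypothetical stand-in for reading the model's distribution at one position, and the log probability of the whole output is just the sum of the per-token log probabilities (equivalently, the product of the probabilities).

```python
import math

def token_log_prob(prompt, generated_so_far, token):
    # Hypothetical stand-in: log pi(token | prompt + tokens generated so far),
    # read off the model's output distribution at this position.
    return math.log(0.5)  # placeholder value for illustration

def sequence_log_prob(prompt, output_tokens):
    # log pi(o | q): sum over the trajectory of the per-token log probabilities.
    total = 0.0
    for t, token in enumerate(output_tokens):
        total += token_log_prob(prompt, output_tokens[:t], token)
    return total

print(sequence_log_prob("where is Shanghai?", ["Shanghai", "is", "in", "China"]))
```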
So we have a list of probabilities, and what you see here combines all the log probabilities generated at each step of a particular output o_i, which is maybe the first answer here (the sum of the log probabilities, which is the same as the product of the probabilities). Then we ask the language model the same question again, it comes up with another output, and that output also has a list of log probabilities associated with it, and these are the log probabilities that you see here. Each log probability, furthermore, is weighted by an advantage term. The advantage term is basically telling me how much better it is to choose a particular token, given a particular input, over all the other tokens that are available. For example, with the input "where is Shanghai?", is it better to choose the word "Shanghai" as the first word of the response, or the word "pizza"? I believe it's better to choose the word "Shanghai", so the advantage of choosing "Shanghai" is higher: it will result in a better long-term reward for the policy, because it leads to a good answer and therefore a high reward from our reward model. For now we have just modeled the reward model as a human being who tells the language model "this is a good answer, this is a bad answer", but later we will see that the reward can also come from a language model, and in the case of DeepSeek-R1 they actually use a rule-based reward system. So, to recap: we have a list of questions; we generate multiple answers for each question using our language model; for each answer we have the log probabilities associated with it, which combine the probabilities of choosing each particular token given its particular input; we weight each of these log probabilities by an advantage term, which tells us how good it is to choose that particular token over all the others available for that input; and we train our language model to maximize this objective. Let's see what it means to maximize this objective, and let's also see what this "old" is here. The language model we will be training is our DeepSeek base, right? So at the beginning suppose this π_old is the base version of DeepSeek. What we do is refine it iteratively, by continually generating outputs from it and then, through the reward model, telling it: this was a good output, so do more of this; this was a bad output, so do less of this. This is one of the advantages of using reinforcement learning: if you do supervised fine-tuning you are just telling the language model "I want this, so generate this"; with reinforcement learning you have the ability to tell the language model "I want more of this and I want less of this". So at each iteration the language model should be optimized in the following way: if the language model at the current iteration is giving more likelihood to a response that resulted in a good reward, and the advantage term is high in that case, then this objective is telling the language model: do more of this.
However, if the language model is right now assigning higher likelihood to an action that resulted in a bad reward, the advantage term will be negative, and then by maximizing this objective the language model will learn to do less of that. I know that I didn't explain the π_θ and π_θ_old very well; I can do that later, if we have time, by talking about off-policy learning in reinforcement learning. Moreover, in GRPO you find this KL divergence term. For people who already know what the KL divergence is, this will be super easy, but for people who don't: the KL divergence is a way of measuring how different two distributions are, so how far apart they are. What we want is for the language model to be fine-tuned to do more of the things that lead to a better reward and less of the things that result in a low reward, but at the same time we don't want the language model to change its behavior too much. For example, imagine we have a reward model that tells the language model to be more polite, and imagine that being polite, in my example, means always saying "thank you". What could happen, if we don't enforce the KL divergence, is that the language model could just cheat and always generate "thank you thank you thank you" at every response, like a list of thank-yous, because that obviously results in a high reward, but the language model would stop doing its main job, which is to generate something factual and useful. So we want the language model to change a little bit, to become more polite, but not to just generate a bunch of thank-yous: change, but change a little. That's why we add the KL divergence; otherwise the language model will do what is known as reward hacking, which means it will try to find a trick to maximize its reward without actually being useful. If you want a parallel example, it's like a tax code: if it's very complex, people will always find a way to cheat on it, but if it's very simple, made up of a few rules, then it's very unlikely that people will be able to cheat.
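For the KL term, my understanding is that GRPO uses a simple per-token estimator of the divergence from the reference model rather than the exact KL; a minimal numpy sketch under that assumption, with made-up numbers:

```python
import numpy as np

def kl_estimate(logprob_policy, logprob_ref):
    """Per-token KL estimate: r - log(r) - 1 with r = pi_ref / pi_theta.
    It is always >= 0 and equals 0 when the two models agree on the token."""
    log_ratio = logprob_ref - logprob_policy   # log(pi_ref) - log(pi_theta)
    return np.exp(log_ratio) - log_ratio - 1.0

# Toy example: log-probs of the same sampled tokens under the current policy
# and under the frozen reference model (values are invented).
lp_policy = np.array([-0.9, -1.2, -0.3])
lp_ref    = np.array([-1.0, -1.0, -0.5])
print(kl_estimate(lp_policy, lp_ref))   # small positive numbers
```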
Okay, what are we missing here? First of all I didn't explain off-policy learning, and I didn't explain the clip part: why are we clipping here? With the clipping we are basically saying: if the language model is trying to change its probabilities by being too confident about the change, we don't want to let the model be overly confident. This is the language model at the current iteration, and this you can think of as the previous iteration in the training process. If the language model at the current iteration is very confident that saying the word "Shanghai" will result in a better reward, even if it's a good choice we don't want the model to be overly confident, so we clip this ratio between the probabilities at one plus epsilon, or in the lower case at one minus epsilon. Because if the language model chooses the next word as "Shanghai" given this question, okay, we are lucky and it's good; but imagine the language model is overly confident and the next word is, I don't know, "coffee": we don't want the language model to take too big a step, we want the model to learn as slowly as needed, and we control that with this epsilon term, which is something we can choose, and with this beta term. Okay, I believe I have covered a bit of this, so one last review, guys: we are trying to optimize the language model iteratively by telling it to do more of what we want and less of what we don't want. How does the model know what we want and what we don't want? It's the reward that we give it, according to our reward model. So now let's talk about the reward model. Historically, in PPO, the reward model was a model with the same structure as the language model we are trying to optimize, with a linear layer added on top that gives a reward to each answer. How can we assign a numeric reward to a particular answer? If you remember, when we talked about reinforcement learning from human feedback I said that usually we start with some questions, we generate a few answers, and then we ask some annotators to choose which answers they like and which answers they don't like. How do we convert this dataset of preferences into a number? We do that by training a language model which has the same architecture as the policy we are trying to train, just with a different head on top that, instead of generating the log probabilities of the next token, generates a numeric reward, and we train it with what is known as the Bradley-Terry model, which is basically this loss here; you don't have to understand this loss in detail. It basically means that if we train the model on this loss, this linear head on top of the language model will generate a very high value for the answers that were chosen in the dataset of preferences, and a low value for the answers that were not chosen by the professional annotators. This is how it was done with PPO: we have an objective which increases the likelihood of the things that give high reward, and we have a reward model which generates a numeric reward as a signal for the language model to understand what it should do more of and what it should do less of.
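The Bradley-Terry style loss he is pointing at is, roughly, the negative log-likelihood that the chosen answer scores higher than the rejected one (notation mine, not copied from the slide):

```latex
\mathcal{L}(\phi) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\Big[ \log \sigma\big( r_\phi(x, y_w) - r_\phi(x, y_l) \big) \Big]
```

where r_φ is the reward head, y_w the chosen answer, y_l the rejected one and σ the sigmoid; minimizing it pushes the reward of chosen answers above the reward of rejected ones.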
In DeepSeek-R1 they do something different: instead of using a model as the reward, they use a rule-based reward system. They don't train a neural network to generate a number that tells the model what we like and what we don't like; they use a rule-based system, and you can do this for all the tasks that you can somehow verify. For example, for the LeetCode problems: how can you check if the answer generated by the model is good? Well, you just run it, and if it first compiles and secondly runs within a predefined time limit, then it is a good answer; it doesn't matter how it came to be, if it works it's a good answer. Likewise for math: for most math problems we have the answer we expect the model to generate, so we can compare the actual generated answer with the expected one. So they create a rule-based reward system: they ask the language model to generate an output for a given problem, and then they assign the reward just by following rules. If it's a LeetCode problem, run it: runs fine, good reward; doesn't run, zero. If it's a math problem, take the output of the model, compare it with the expected result, and assign the reward based on that: if the answer matches what we expected, good reward, otherwise zero, and so on. They also assign a reward for formatting: for example, they give the model a reward if it follows the format of putting all of its thought process inside the think and /think tags. So basically, just with this, they train the language model: it generates answers, those answers are scored by a rule-based reward system, and by continuing this training the model automatically learns by itself to generate the thought process that is necessary to perform the tasks it is asked to do: the thought process that leads to the right code for the LeetCode problems, the right reasoning for solving math problems, and so on.
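A minimal sketch of what such a rule-based reward could look like; the exact rules, tags and weights here are made up for illustration, not taken from the paper:

```python
import re

def format_reward(response: str) -> float:
    # Reward the expected layout: the thought process wrapped in <think>...</think>.
    return 1.0 if re.search(r"<think>.*</think>", response, re.DOTALL) else 0.0

def accuracy_reward(response: str, expected_answer: str) -> float:
    # For verifiable tasks (e.g. math), compare the final answer with the ground truth.
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match and match.group(1).strip() == expected_answer.strip():
        return 1.0
    return 0.0

def total_reward(response: str, expected_answer: str) -> float:
    # Made-up combination; the point is that no neural reward model is involved.
    return accuracy_reward(response, expected_answer) + format_reward(response)

sample = "<think>2 + 2 is four</think><answer>4</answer>"
print(total_reward(sample, "4"))   # 2.0
```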
Let me check what else we need to know from here. Here they show some results, which I think you can check by yourself. What is very interesting, I think, is this: during the training of R1, just with reinforcement learning (and I want to remind you, they took the DeepSeek-V3 base and added this reinforcement learning step on top of it, which is a massive reinforcement learning step; usually the alignment part is not so big), the longer they run this reinforcement learning step, the more the language model automatically learns to generate longer responses, because to solve problems you need to generate a longer chain of thought. The language model, because of the reward system, learns that in order to get reward it should generate longer responses. They didn't tell the language model with supervised fine-tuning to generate that kind of data, with that particular format, with that kind of thought process; just with reinforcement learning and the right incentives, the language model learned to do that. How? It learned that for a particular input it should generate the particular tokens which, in the long term, result in a good reward. The beauty of reinforcement learning is that you not only learn based on the immediate reward you get but also on the long-term reward, because sometimes, as you can see here, the reward only applies when the entire answer is produced, so the signal the model gets is for the entire output; through the advantage term this signal is propagated back to each single token. The same thing happens with my cat: the cat receives reward only after taking many steps that lead it to the meat, so only when the cat arrives there does it know that this whole sequence of actions was a good choice, but this signal is propagated back to each single action, in such a way that when the cat is here it will be very likely to choose "go down" and less likely to choose "go right". This also happens with language models, through this advantage term here, which is computed for each token. Okay, now we can go into a little more technical detail on why they chose GRPO over PPO. With PPO, computing this advantage term requires another function, called the value function, and computing the value function requires training yet another model. In GRPO the advantage term is calculated without the value function, by the formula here, which is based only on the rewards, which are already given by the reward model we have, and which in the case of DeepSeek-R1 is rule-based; so they don't need to train this other model to estimate the value function, which would add more complexity to the system. Okay, so now we have seen what reinforcement learning is, what the connection between language models and reinforcement learning is, and a little bit of what the GRPO objective is. Let's actually do a poll, guys: do you want me to go deeper into GRPO, to explore exactly this loss? Because the rest of the paper is basically "okay, we tried just reinforcement learning"... okay, people like "deep", so let's go deep. All right. First of all, the most interesting thing about reinforcement learning, especially in the case of PPO and reinforcement learning from human feedback, is this thing called policy gradient optimization, so let's go back to the other slides, and then we will come back to GRPO. Because the rest of the DeepSeek paper is basically: they took this R1 recipe and said, okay, instead of just doing reinforcement learning, let's do multiple steps of reinforcement learning and supervised fine-tuning, then reinforcement learning again, and so on, and it leads to better outcomes; but that is not technically difficult to understand, I think, because most of you already have some background in this, and if you're here it's because you roughly understand what we are talking about. So let's do the things that maybe some people have difficulties with, which I believe are this part and this part. Okay, so let's go deeper. Let's go back to my cat, and imagine I want to train my cat to reach the meat. As we saw before, my cat is just an agent with a policy; the policy tells the cat what action to take given the position of the cat inside the house. You need to think of this house as a grid, like the following (you cannot see the grid lines because they are not showing, but it doesn't matter). What is our goal in reinforcement learning? To select, among all the possible policies we could have, the one that maximizes an objective, and this objective is the expected reward that we get when the agent acts according to this policy. It means that if I put this policy, this decision-making stuff, inside the brain of my cat, then the best policy will tell my cat to move down here, and move right here, and right, right, down, down, down, until it arrives at the meat.
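Written down, the objective he is describing, and the policy-gradient form the next part of the talk builds toward, look roughly like this (standard REINFORCE-style notation, not copied from the slides):

```latex
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[ R(\tau) \big],
\qquad
\nabla_\theta J(\theta)
= \mathbb{E}_{\tau \sim \pi_\theta}
\left[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \right],
\qquad
\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)
```

τ is a trajectory of states and actions sampled by acting with the policy, R(τ) its total reward, and α the learning rate; in practice the expectation is approximated with a handful of sampled trajectories, which is where the variance problems discussed later come from.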
How do we actually train this policy? We do what is known as policy gradient optimization, but before we get there we need to understand a few terms. First of all, what is a trajectory? A trajectory is basically a list of states and actions. If the cat is here (let me draw the lines, otherwise it's too bad for you guys if you cannot see them: here, here, here, here), the cat is initially here, so let's call this state number zero, and the cat can choose some action, let's call it action number zero. When the cat takes action number zero it arrives in a new state, maybe here, which becomes state number one, and then the cat can take another action according to its policy, call it action number one, which leads the cat into a new state, s2, in which it can take another action, action number two, and so on. So the trajectory is a list of states and actions that the agent takes inside the environment. What we want is that, if we sample a trajectory according to our policy, we maximize the reward we get from it. How do we do that? As you know, in deep learning we are always either maximizing or minimizing something; in model training we usually minimize a cost function, and here we want to maximize the expected reward when the agent acts according to this policy. This policy, however, is not just any policy: it is a particular policy made up of some parameters that we call theta. To give you a parallel of what is happening here: imagine you have a company and you are the CEO, and this company is made up of many actors, many functions, many departments. Each of these things is, let's say, a parameter of your company: you can tune them and the way the company functions will change; how people talk to each other, how people work, how the departments work, how the logistics work, how the office works, they are all parameters of your company which define its outcome. Imagine you want to maximize the profit of your company: what you do is learn to tune all of these parameters, for example you tell people to behave in a particular way, or to collaborate in a particular way, or to work on some projects and not on others. This is what we do in policy gradient optimization: we calculate the gradient, with respect to the parameters, of this particular objective function. (Yes, later we can talk about the discount factor in the reward sum.) So what is the gradient? The gradient tells us how the objective will change if we change the parameters a little bit; it points in the direction of maximum ascent of a function with respect to the variables you compute it against. So in this case we are computing the gradient of the objective function with respect to the parameters, which tells us how I should change the parameters to increase this
objective function which is exactly the expected reward that we want we can get from this policy so because the the the gradient tells us how we should change the parameters to increase the objective then we change the parameters according to the direction of the gradient and this is what we do here so we have an objective which is tells us the expected reward when acting according to this policy we calculate its gradient with respect to the parameters which tells us how we should change these parameters to increase this objective so to increase the expected reward and then we change the parameters in the same direction of the gradient and we do it iteratively this is called policy gradient optimization and actually this is a beautiful result because it means that i can just use my cat whatever policies my cat has right now i can just sample some trajectories from these policies when i ask my cat to move around check what kind of reward i get calculate the gradient with respect to the expected reward according to these trajectories and then tell the cat hey you should do more of this because this led to a better reward or you should do less of this because it leads to a bad reward this is policy gradient optimization now it has some problems because policy gradient optimization basically okay as you can let's skip the math because if you want the math i i made a video it's on youtube so you can watch it tomorrow but it has some problems because as you can see the search space of the cat is enormous because at the possible trajectory that the cat can take to go from here to the meat there are a lot because the cat can go like this it can go like this it can go like this it can go here then come back then go down etc etc so there is many many many many many trajectories that the cat can take to go to the meat however to compute this objective we should actually check all the possible trajectories to get the direction of the gradient however this is intractable means that in the case of language model we should ask the language model to generate all the possible output ever possible given a particular question which is intractable because at each token the the language model can choose what let's say the vocabulary size is 30 000 then the language model can choose 30 000 possibilities for the first token then 30 000 for the second 30 000 for the third etc etc and to check all of them it's computationally impossible so we can always approximate this with a sample with a sample this is called the monte carlo estimation however this results in because we are not checking all the possible trajectories but we are making the decision of optimizing our cat using only a few trajectories of course as you can see it's a risky situation so it means that we are making a hard decision on how we should change a policy without checking all the possible search space this basically means that we have high variance and there are many ways to reduce this variance so when you read the term baseline in the deep seek paper this is one of the ways to reduce this variance because we are trying to optimize the language model into choosing certain patterns into choosing certain chain of thoughts into choosing certain sequence of tokens without exploring all the possible generation that the language model can have um so so let me see how can we simplify this one we blah blah okay so in order to reduce this variance so in order to make sure that we optimize the language model even without checking all the possible 
generations but still making sure that we make the gradient that we get so the direction that tells us what parameter we should change and in which direction in order to increase the expected reward we can introduce this advantage term here this advantage term basically for each token tells the language model how better is choosing this token over all the other tokens that i can choose in this position for example in the example in the example that we saw before so where is shanghai is should the language model choose the word shanghai or it should choose the word coffee or should it choose the word pizza well it's very more it's much more advantageous to to to choose the word shanghai because it's very likely that the language model will then complete it as shanghai is in china or shanghai is a city in china or shanghai is located in china etc etc so the choosing the language the the word shanghai results on the long term in a much better reward so the advantage of choosing shanghai is higher compared to all the other tokens in that condition in the case of the cat it means that the cat when is here it should very high it's very advantageous to choose go down because it will result in going to the meat and over all the other all the over all the other actions this doesn't mean that choosing up will lead you to die or to get no reward because you can always go up and then change direction and go down but it's much more advantageous to just go down this is the this is the uh this is the meaning of the advantage term and this is the same advantage term that you see in the grpo loss now what is the difference between the advantage term that you see in the ppo and the grpo is that the advantage term in the ppo requires the what is known as the value function in the grpo they just they they compute this advantage in a different way which still results in in variance reduction but without having this value function estimation so grpo is a computationally more um advanced i would say efficient in this case okay let me see what else we need we've skipped also the part of the let me see um off policy learning right so offline policy learning so what is off policy learning uh imagine we have the cat let's go to the cat actually okay to compute this the the the loss to to okay the the gradient policy optimization we saw that we need to sample some trajectories right and we don't have to sample all the possible trajectories right because we are trying to approximate it um so when we sample these trajectories we are sampling from a policy which is the current brain of the cat so we ask the current brain of the cat to choose some actions and generate some trajectories means that we ask the language the cat to just navigate the house and let's see what it does okay so we ask the the cat to just navigate the house and see what it does and then after the cat has navigated the house we look at what they are the trajectory that the cat has taken and then we give reward to the cat based on the trajectory it has taken and then we optimize the policy which means that we optimize the brain of the cat uh with the whatever it has learned with the with the direction of the gradient based on the reward it has received but now the brain of the cat is a new brain because it has changed compared to the past which means that the next step of iteration of optimizing optimizing the brain of the cat or its decision making skills we need to sample new trajectories so we need to ask the cat again to go through all the house 
make a few choices, and then we check these choices and again we tell the cat: hey, here you did well, here you didn't do well. So the cat learns to optimize its policy, which results in a new policy, and then again we need to sample from this new policy. As you can see, every time we do an optimization step we need to sample trajectories again, and in the case of language models this means that first you sample some responses, then you reward these responses based on your reward model, and then you need to sample new responses, because the policy has changed now that we updated the language model. In order to avoid this sampling process, which is expensive, we introduce off-policy learning, in which basically we take the language model, we ask it to generate a lot of trajectories, and we do it once; then we take a sample of these trajectories, meaning some of these responses, and we fine-tune the language model based on the reward we got on them; then we don't sample new trajectories, we just take another mini-batch of the trajectories we sampled initially and do another step of optimization, and we keep doing this for n steps; only then do we sample new trajectories. This results in much more efficient training: it's not "change the policy a little bit, then sample new trajectories from this policy"; no, we sample a lot of trajectories initially, we keep them in some database, in memory or wherever you want, and then we keep optimizing the policy using the trajectories we sampled initially, and this is much more efficient.
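A minimal sketch of the sample-once, optimize-for-n-steps loop just described; every function name here is a hypothetical placeholder, not code from the paper:

```python
import random

def generate_responses(policy, question, group_size):
    # Stand-in for sampling G outputs (and saving their log-probs under pi_old).
    return [{"text": f"answer {i}", "old_logprob": random.uniform(-20.0, -5.0)}
            for i in range(group_size)]

def reward(response):
    # Stand-in for the rule-based reward (accuracy / format checks).
    return random.choice([0.0, 1.0])

def optimization_step(policy, minibatch):
    # Stand-in for one gradient step on the GRPO objective, reusing the saved
    # old log-probs (no fresh sampling happens inside the inner loop).
    return policy

policy = {"params": "..."}
questions = ["Where is Shanghai?", "Solve 2 + 2"]

for outer_iter in range(3):                       # each outer loop re-samples
    batch = []
    for q in questions:
        group = generate_responses(policy, q, group_size=4)
        for r in group:
            r["reward"] = reward(r)
        batch.append((q, group))
    for inner_step in range(4):                   # reuse the same samples n times
        minibatch = random.sample(batch, k=1)
        policy = optimization_step(policy, minibatch)
```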
And now we can go back to the DeepSeek-R1 paper, to finally understand the loss in its entirety. What we have here is a policy at the current optimization step, and the policy from which we sampled the trajectories, as you can see written here. We first sample a question from our database of questions; then for each question we generate a list of outputs, of responses: we prompt the model with the question and ask it to generate multiple responses, like we saw before; we ask the language model "where is Shanghai?" and it generates one response, then we ask it again and maybe this time it says something else, and we ask it again, and so on, so we generate a list of outputs for the same question. Then we compute the following loss, which is based on the ratio of the probabilities (equivalently, the difference of the log probabilities) assigned to the output by the current iteration, the one at which we are optimizing the language model, with respect to the language model from which we sampled. When we initially sample the trajectories we can compute and save the probabilities while sampling, so this π_old is always available. Now, what does this ratio mean? If the ratio is more than one, it means that at the current iteration the language model is telling me "I want to choose this output more, because I am now more likely to choose it". What we want is this: if the advantage of choosing this action is positive, meaning it is advantageous because it results in a good reward, and the language model is more likely to choose it, then this term will be positive and big, and because we are maximizing, it contributes positively to our objective, so we do more of this. On the other hand, if the language model is becoming more likely to do something that has a negative advantage, something disadvantageous, then this becomes a big negative term which contributes negatively to our objective, so the language model will learn to be less likely to do it. At the same time, we don't want the language model to make big jumps at every step: we want to limit its decision-making at each optimization step, which means that even if the language model is very confident that something is good, we don't care how confident it is, we limit its confidence by clipping this ratio between one minus epsilon and one plus epsilon. We also don't want the language model to change too much overall, so we have an initial frozen model here, π_ref. π_ref is the original model: in the case of R1 it is the DeepSeek-V3 base, the language model that has never been trained with reinforcement learning. We want the language model to change a little, to learn reasoning, but we don't want it to forget everything or to stop behaving like a language model, otherwise it could just do reward hacking; so the current policy should stay as close as possible to the original model while still changing according to the advantages of the rewards it gets. Now, how is the advantage term computed here? Each reward is normalized: it's as if each reward came from a distribution re-centered to a mean of zero and a standard deviation of one. Why do we want this? Because we don't want the numeric magnitude of the reward to affect the training process; what should matter is how much better a reward is compared to the other rewards in the group, not the magnitude of its value.
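Putting the last pieces together, here is a small numpy sketch of the group-relative advantage (rewards normalized within the group to mean zero and standard deviation one) and the clipped ratio term it multiplies; the numbers are made up, and a real implementation would work on per-token log-probabilities inside an autodiff framework:

```python
import numpy as np

# Rewards of G = 4 sampled outputs for the same question (invented values).
rewards = np.array([1.0, 0.0, 0.0, 2.0])

# Group-relative advantage: how much better each output is than its group,
# independent of the absolute magnitude of the rewards.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Log-probabilities of each whole output under the current policy and under
# the old policy that generated the samples (invented values).
logp_new = np.array([-12.0, -15.5, -14.0, -11.0])
logp_old = np.array([-12.5, -15.0, -14.0, -12.0])

ratio = np.exp(logp_new - logp_old)          # > 1 means the output is now more likely
eps = 0.2
clipped = np.clip(ratio, 1 - eps, 1 + eps)

# Per-output surrogate: take the more pessimistic of the two terms.
surrogate = np.minimum(ratio * advantages, clipped * advantages)
objective = surrogate.mean()                 # maximized (minus the KL penalty)
print(advantages, objective)
```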
Okay, now that we have seen this, I believe you should have most of the knowledge needed to understand the whole paper, because the other thing they do is: instead of just taking the language model and training it with reinforcement learning alone, let's introduce some very high quality supervised fine-tuning data, then do another step of reinforcement learning, then another step of fine-tuning, then another step of reinforcement learning, and this actually leads to a better model. Another interesting part of the paper is the distillation. I don't know if most people are familiar with what distillation is and how it works, so if you want I can talk about it a little bit, otherwise let's go to sleep, guys. Let's see... "yes" means let's talk about it, okay, no problem. "Please give a big picture overview"... I mean bro, you can just read the paper yourself. Okay, more on distillation. So distillation basically means this: imagine you are trying to teach yourself graduate maths, that's one thing, right? Now imagine you don't have any background in math and you learn it from a university teacher: these are two different things, because if you try to learn it by yourself you need to come up with all the strategies to learn math, but if you learn it from a professor, the professor can also give you hints on how to learn it faster. In the case of models, we usually have a big model and a smaller model, and a dataset; what we do is prompt the big model (let's just call them the big model and the small model, I don't want to confuse people). Now, how do we train language models in the first place? We do it in a self-supervised way, which means the language model is trained without annotating the data: we sample a lot of text and we force the language model to learn to predict the next token given the context that comes before it. Imagine you want to train the language model on the following sentence, "for distilled models we report representative results...", and imagine we want to train it only on this green text here. What we will do is give it the word "for" and ask it to learn to predict the word "distilled" when it sees "for"; then we give it "for distilled" and force it to learn to predict the word "models" when it sees "for distilled". The beauty of the transformer is that all of this can be done in parallel, so we don't have to do it one word at a time. If you train a small model on reasoning just by itself, it will have much more difficulty; however, if you have a big model that has already been trained on a specific task, you can use the big model to help the small model learn faster. How? When we train a language model on raw data, as in this example, we tell the language model: when you see "for" you should choose "distilled", when you see "for distilled" you should choose "models". This is called the next-token prediction task, and the way we do it is by forcing the distribution that the language model outputs: as we saw before, the language model generates a distribution over all the possible next words, so it assigns a probability to the first possible next word, to the second, to the third, to the fourth, and we tell it: when you see "for distilled" you should choose exactly this particular word, "models", and all the other words should not be chosen, so zero
in other words, the target distribution assigns zero to every other word and a score of 100 percent to "models". this is how we force the language model, and by doing it on many, many texts it learns to generate a distribution that is very likely to produce "models" after "for distilled" and less likely to produce the others.

if we do the same job with the small model, we will see that the small model has a lot of difficulty learning at the same pace as the big model. why? because the big model has more parameters, so it has more flexibility for learning complex tasks, while the small model does not have this flexibility, so it will be a much slower learner.

so how can we distill the knowledge of the big model into the small model? we do it as follows: we take a dataset of prompts, we feed it to the big model, and we ask the big model to generate an answer, and not only the answer, we also ask it for the log probabilities at each step of the generation process. remember the sentence from before: "where is shanghai?" and the answer "shanghai is in china". imagine we give "where is shanghai?" to the big model: it will generate "shanghai is in china", but because we have the word "shanghai" we also have the log probability not only of "shanghai" but of all the other words that could have been chosen in that position. we then force the small model not only to generate "shanghai" but to learn the same distribution that the big model produced for that position.

let me do a concrete example. we give "where is shanghai?" to the big model. generating an answer means it first produces a distribution over the next likely token: a list of probabilities where the word "shanghai" has a very high score, maybe 0.6, the word "pizza" maybe 0.1, the word "cat" 0.05, and so on. we choose "shanghai" and ask the model again with "where is shanghai? shanghai"; it will choose the word "is", but it does not just choose it, it gives us a whole distribution: the word "is" is very likely, maybe 70 percent, the word "road" is unlikely because it doesn't make sense, and the word "hello" makes even less sense, so it is even less likely. for each position we can extract these log probabilities from the big model, and we force the small model to learn this entire distribution, not just the chosen word. why? because this gives a much stronger signal to the small model about what it should do, what it can do, and what it must never do; instead of just saying "output this one word", it is a richer signal, so the small model learns faster. and what they report in the DeepSeek-R1 paper is that by distilling you can create a much stronger small model than by training it from scratch.

question from the chat: "a bit more context on the probabilities for each forward pass". distillation is applied to all the tokens: for each position, not only the last one, you make the small model learn the entire distribution.
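as a rough sketch of what "learning the entire distribution" means, here is the classic distillation loss, a KL divergence between the teacher's and the student's per-position distributions. the tensors are random placeholders standing in for real model outputs, and this is only one common way to implement distillation, not necessarily what the paper's pipeline does internally:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, vocab_size = 5, 8   # toy sizes

# hypothetical logits for the same prompt, one row per position
teacher_logits = torch.randn(seq_len, vocab_size)                       # big model (frozen)
student_logits = torch.randn(seq_len, vocab_size, requires_grad=True)   # small model

# soft targets: the teacher's full probability distribution at every position
teacher_probs = F.softmax(teacher_logits, dim=-1)
student_log_probs = F.log_softmax(student_logits, dim=-1)

# KL(teacher || student), averaged over positions: the student is pushed to match
# the whole shape of the teacher's distribution, not just its top-1 choice
distill_loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
distill_loss.backward()
print(distill_loss.item())
```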
another question: how do we assign a reward to each action? in the case of PPO (by the way, if you watch my video on PPO i explain all of that; i don't know why nobody ever watched that video, it's one of my masterpieces, and all these slides come from the video i made in 2023), the rewards are generated as follows. there are two kinds of rewards you can generate: one is called the outcome-based reward and the other is the process-based reward. in the case of PPO we generate an outcome-based reward, and then, through the advantage term, it is distributed over all the previous tokens: we sample a response, we judge this response with our reward model, and then, with the advantage term you see in the loss, this reward is spread across all the previous tokens, so each token carries information on how likely it is to lead to the final reward.

okay, i think we forgot one part, guys: the unsuccessful attempts. this is interesting too. they tried what is known as a process reward model. the reward model we saw before is a rule-based model that is outcome-driven: the model has to generate the entire solution, for example in the case of LeetCode-style problems it has to produce the final, runnable code; then we run the code, see how it performs, and that gives us the reward signal. however, there are other ways of generating rewards, and one of them is the process reward model, where you divide the problem into sub-problems and a separate model, the process reward model, assigns a reward to each single step. in the paper they say this is problematic. first, dividing a problem into sub-steps is difficult because every problem is different, and we don't want to tell the model "you should follow this pattern"; the model should come up with the pattern itself, and actually it does. second, it is difficult to judge each single step: sometimes we don't even know whether a step is good or not. when you work through a math proof, you sometimes take intermediate steps whose usefulness you cannot see immediately, but they are necessary, so it is hard to reward them before you reach the end.

there is another technique called Monte Carlo tree search, which is basically a tree search in which each intermediate node is a step, and each node has a score associated with it that increases the more often that particular step leads to a successful solution. imagine you are solving a LeetCode problem: if you start your code with a typo, the code will not compile, so anything that starts with that typo will never lead to a successful solution, and that node will never be explored further. Monte Carlo tree search pushes the language model to explore more of the branches that lead to successful attempts and fewer of the ones that don't. however, even with Monte Carlo tree search they got results that are not as good as the reinforcement-learning-driven ones.
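to make the difference between the two kinds of reward concrete, here is a toy sketch. the function names, the `<think>` tag check, the "answer:" convention, and the hypothetical `step_scorer` callable are all simplifications i'm inventing for illustration, not the paper's actual reward implementation:

```python
import re

def outcome_reward(response: str, reference_answer: str) -> float:
    """rule-based, outcome-driven reward: only the final result is judged."""
    reward = 0.0
    # format reward: reasoning should be wrapped in <think> ... </think>
    if re.search(r"<think>.*</think>", response, flags=re.DOTALL):
        reward += 0.5
    # accuracy reward: compare whatever follows the final "answer:" with the reference
    match = re.search(r"answer:\s*(.+)\s*$", response, flags=re.IGNORECASE)
    if match and match.group(1).strip() == reference_answer.strip():
        reward += 1.0
    return reward

def process_rewards(steps: list[str], step_scorer) -> list[float]:
    """process-based reward: a (learned) scorer judges every intermediate step,
    which is exactly the part the paper reports as hard to define and to train."""
    return [step_scorer(step) for step in steps]

response = "<think>2 + 2 = 4</think> answer: 4"
print(outcome_reward(response, "4"))   # 1.5 -> correct format and correct answer
```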
and the beauty of this is captured in one sentence of the paper: rather than explicitly teaching the model how to divide the problem into sub-problems, how to solve it, or what format it should follow, the model just learns to come up with the right format and the right chain of thought on its own. of course they do play with the reward a little bit, because the reward tells the language model to follow a particular format and to be accurate, etc., so you can always shape the reward somewhat.

another thing they notice is that if you train the language model only with reinforcement learning, it ends up with some side effects, for example it mixes languages, because nobody told it that it has to think only in English to solve a problem stated in English. imagine i ask you to solve a math problem and i state it in English, and imagine you are bilingual, say you speak Chinese and English: maybe in your head you think partly in English and partly in Chinese and then you come up with the solution, and that's totally correct, right? unless i explicitly tell you that you should never think in Chinese, why wouldn't you take that freedom? if you never tell the model not to do something, it probably will do it, and because the reward here never told the language model not to think in other languages, it actually started thinking in other languages: anything to get the job done. that's the beauty of reinforcement learning: you give the right incentive and the language model will come up with the right way of reaching that goal, as long as the signal is strong enough.

"do you need a massive cluster of GPUs?" yes, i believe so, because you are still running gradient descent on the language model itself: every time you fine-tune it you are changing its weights.

okay guys, before we take other questions: this whole lecture was totally improvised, i didn't prepare it, so i hope i didn't make big mistakes. the problem is that reinforcement learning from human feedback is quite a complicated topic and you really need to derive everything step by step, which is what i do in my video on reinforcement learning from human feedback; here i sometimes skipped steps and went back, so i hope i didn't create confusion. if there are parts about the RL that you didn't understand, we can talk about them, otherwise let's call it a day, guys. "maybe you could record a more prepared version for YouTube?" i could, but i have a daughter now, i'm super busy, i also have a full-time job and a wife, so have some mercy on me. thank you guys. i think everyone should now have at least a basic understanding of how to read the paper: when you read it you should have a clear idea of what's happening, and if you need more background you can check my previous videos. let's stop the recording and have a good night, guys.