
Paper: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning


Whisper Transcript | Transcript Only Page

00:00:13.520 | okay let's start guys um so the goal of today's paper reading
00:00:24.960 | is to go through the deep seek r1 paper and what we will be seeing is
00:00:32.320 | first of all the biggest difficulty people have
00:00:34.560 | in order to understand this paper is the reinforcement learning part
00:00:39.360 | and why it is difficult: because they use
00:00:42.880 | another algorithm called grpo and the goal of this
00:00:46.080 | initial part of this paper reading is actually to give you the background
00:00:49.440 | knowledge that is needed to understand the paper
00:00:53.840 | so i will be using the slides that i used for making my video on
00:00:58.400 | reinforcement learning from human feedback
00:01:00.560 | so to understand what is the connection between language models and
00:01:04.160 | reinforcement learning so let's go to that slides
00:01:08.480 | so let's review very fast very very fast language models
00:01:12.320 | so as you know language models are generative models
00:01:15.600 | which only have one simple function which
00:01:18.560 | only have one simple objective which is to tell us what is the next
00:01:22.400 | likely token given an input prompt so if we are given for example the input
00:01:26.640 | prompt shanghai is a city in and we feed it to the language model
00:01:29.920 | the language model will generate a probability distribution
00:01:33.040 | over what it thinks is the
00:01:38.000 | next likely token that is coherent with the prompt that we
00:01:43.680 | have given the language model and usually we sample from it for example
00:01:48.960 | if we feed this prompt to the language model it will give us that maybe the
00:01:51.840 | next likely token is the word china or beijing or cat or pizza
00:01:55.920 | and then we choose what we believe is the most likely based on its probability
00:02:00.160 | or probability score
00:02:03.120 | then we take this word we put it back in the prompt and we ask again the language
00:02:08.480 | model what the next word is and we do this job
00:02:12.720 | iteratively to generate a full text for example to generate the
00:02:20.160 | response to where is shanghai we first ask the language model where is shanghai
00:02:23.200 | it will tell us okay the next likely word is
00:02:26.240 | shanghai then we put it back in the input of the language model and it will
00:02:29.600 | tell us the next likely token is etc etc i'm also making a
00:02:33.360 | simplified assumption here that says that each token is a word and
00:02:38.080 | each word is a token this is not the case in language models but for our
00:02:41.920 | explanation we will think like in this way okay
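To make that loop concrete, here is a minimal Python sketch of the iterative generation just described, under the same one-token-per-word simplification; `next_token_distribution` is a hypothetical toy stand-in for a real language model, not anything from the talk or the paper:

```python
def next_token_distribution(tokens):
    # Hypothetical toy "model": a real LM runs a neural network forward pass
    # and returns a probability distribution over its whole vocabulary.
    if tokens[-1] == "in":
        return {"china": 0.85, "beijing": 0.10, "pizza": 0.05}
    return {"in": 0.9, "beautiful": 0.1}

def generate(prompt_tokens, max_new_tokens=3):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        next_token = max(dist, key=dist.get)  # greedy sampling strategy
        tokens.append(next_token)             # feed the chosen token back in
    return tokens

print(generate(["shanghai", "is", "a", "city", "in"], max_new_tokens=1))
# ['shanghai', 'is', 'a', 'city', 'in', 'china']
```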
00:02:46.560 | now we know what a language model is what is reinforcement learning
00:02:49.840 | we go very very simply through what reinforcement learning is and then we find
00:02:53.520 | the connection between language models and reinforcement learning
00:02:56.720 | now reinforcement learning is an area of artificial intelligence i don't remember
00:03:00.560 | if it belongs to machine learning in particular but it's an area of
00:03:03.280 | artificial intelligence that is tasked with optimizing the behavior of
00:03:08.720 | an agent and the behavior of an agent is called a policy
00:03:12.560 | which is the decision making of an agent in such a way that the agent
00:03:16.960 | performs actions that maximize the reward it gets from performing these
00:03:21.920 | actions in an environment for example for example i have a cat so
00:03:27.440 | this cat likes to eat meat like most cats
00:03:31.200 | and the cat is in my house and the cat can be considered a
00:03:35.760 | reinforcement learning agent the cat can make some decisions on how it
00:03:40.720 | wants to move in the house and this will be the policy of the cat
00:03:44.960 | so the policy tells the cat if it should move
00:03:48.160 | up down left or right in the house now this is
00:03:51.840 | my house you cannot see the borders so i will draw them because i don't know why
00:03:55.440 | the borders are not drawn here and you
00:03:59.600 | should think of this environment as being a grid environment made up of
00:04:03.760 | cells so like the following
00:04:09.120 | like this what is the goal of the cat the goal of the cat is to arrive to the
00:04:15.920 | meat so the cat at each position in
00:04:20.000 | the house can make some choices or perform some actions as we say
00:04:24.320 | technically and the actions that the cat can
00:04:28.240 | choose at each position in the house are move up
00:04:31.200 | down left or right what we want is to make sure that the cat
00:04:36.400 | learns to perform the series of actions that
00:04:39.840 | very likely lead it to the meat while avoiding the things that the cat
00:04:46.320 | is scared about which is the broom and the
00:04:48.800 | bathtub because no cat likes to take a shower
00:04:52.400 | so we designed first of all a reward system for this reinforcement learning
00:04:56.640 | agent because we want to train this reinforcement learning agent to choose
00:04:59.840 | which actions to perform in each position in the house
00:05:02.560 | based on some reward so one reward
00:05:06.880 | model could be this one for example if the
00:05:10.080 | cat moves to an empty cell it receives a reward of zero
00:05:13.200 | if the cat moves to the broom it receives a reward of minus one if it
00:05:17.680 | moves to the bathtub it receives a reward of minus 10
00:05:21.440 | however if after performing a series of actions the cat arrives to the meat
00:05:25.920 | then the cat receives a big reward of plus 100
00:05:30.320 | the decision making of this cat is governed by
00:05:33.600 | a model that we will call the policy of this cat and the goal of the policy
00:05:37.920 | is to choose an action given the current state
00:05:41.600 | and the action that the cat can choose is
00:05:44.880 | stochastic it means that this policy gives us a distribution over all the
00:05:49.600 | possible actions that the cat can take so imagine we have a very
00:05:54.480 | optimized policy if the cat is here a good policy should
00:05:58.800 | tell us that with very high probability score we should
00:06:02.240 | move down because that's one way to maximize the reward
00:06:05.280 | and with very low probability we should move left because that will take us
00:06:10.000 | towards the the bathtub
00:06:13.600 | another thing for example that this policy should
00:06:17.840 | do is if the cat is here it should not move right so the
00:06:22.960 | probability associated to the action move right should be low
00:06:26.240 | and maybe the probability associated with the action move down should be a
00:06:29.920 | little higher this is what the policy is
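As a toy sketch of this setup (the rewards, the position, and the probabilities below are invented for illustration, not taken from the slides): a reward table over cell types, and a stochastic policy that, given a position, returns a probability distribution over the four moves:

```python
import random

# reward the agent receives for moving into a cell of a given type
REWARDS = {"empty": 0, "broom": -1, "bathtub": -10, "meat": 100}

def policy(position):
    # A trained policy would output different probabilities for every position;
    # here the numbers are hard-coded for one illustrative position of the cat.
    return {"up": 0.05, "down": 0.70, "left": 0.05, "right": 0.20}

# The policy is stochastic: an action is sampled from the distribution it returns.
probs = policy(position=(1, 2))
action = random.choices(list(probs), weights=list(probs.values()), k=1)[0]
print(action)              # most often "down", the action that leads toward the meat
print(REWARDS["bathtub"])  # -10, the penalty for ending up in the bathtub
```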
00:06:34.320 | now what is the connection between language models and reinforcement
00:06:37.680 | learning so first of all the goal in
00:06:40.080 | reinforcement learning is to train this policy so to train
00:06:43.680 | the decision making of this agent in order to choose the proper
00:06:48.320 | actions at every possible state in the environment
00:06:52.080 | such that it maximizes the reward that the agent can get
00:06:57.360 | so a good policy for the cat would be a policy that always leads the cat to the
00:07:01.440 | meat no matter where the cat is now let's go let's connect the
00:07:06.320 | reinforcement learning with language models so
00:07:08.640 | language model is also kind of a policy because the language model every time
00:07:13.040 | you feed a prompt to the language model the language model has to choose an
00:07:16.400 | action to perform which is what token should come after this
00:07:20.080 | prompt so in this case we talk about the state
00:07:25.600 | and action the state is the state in which the
00:07:28.320 | reinforcement agent is in the case of the cat is the position of the cat inside
00:07:32.560 | of the house in case of the language model the state
00:07:35.120 | is the prompt itself that you feed and the action is the
00:07:38.080 | distribution over all the next token that the language model can choose from
00:07:42.960 | in the case of language models so we also want to train the language model to
00:07:46.240 | perform its actions or to choose the next
00:07:50.480 | tokens in a particular way according to some reward
00:07:55.040 | model that we can build
00:07:57.360 | specifically in the case of reinforcement learning from human
00:08:00.400 | feedback we want the language model to
00:08:04.560 | generate text following particular rules for example when we do language model
00:08:10.160 | alignment so first of all how language models
00:08:12.960 | are trained usually we have a pre-training part where
00:08:16.000 | we feed a lot of information to the language model so we throw a lot of data
00:08:19.840 | like the entire wikipedia the entire web the entire i don't know stack overflow
00:08:23.840 | and the leet code everything and the language model learns how to
00:08:27.920 | kind of the structure of the language it learns a little bit of chinese a little
00:08:31.920 | bit of english a little bit of japanese because we throw every data
00:08:35.120 | possible that we have at the language model then we do a little bit of
00:08:38.960 | fine-tuning so we train the language model to generate high quality data so
00:08:42.160 | we increase the likelihood of generating high quality outputs instead of just
00:08:45.920 | throwing whatever is on the internet but then we do also an alignment part
00:08:50.480 | in which we want the language model to follow instructions so we want the
00:08:53.440 | language model to adhere to some standards for example
00:08:56.960 | what makes the language model conversational
00:08:59.040 | is the instruction fine-tuning
00:09:02.320 | which means that we train the language model to follow a particular format so
00:09:06.400 | always for example greet the user always
00:09:10.160 | be helpful always never use a curse word etc etc etc
00:09:14.640 | and this job is done through the reinforcement learning from human
00:09:17.040 | feedback which includes many kind of algorithms like ppo dpo etc
00:09:20.960 | and grpo is one of them what we do usually in language models
00:09:25.920 | to train the language model to follow instructions is we
00:09:30.400 | generate a data set of instructions in which we first
00:09:37.840 | have some list of questions and then we ask the language model to generate some
00:09:41.600 | answers and then we ask some professional annotators to
00:09:45.360 | choose which answer they would like the language model to
00:09:48.480 | generate more and which one they don't like the language model to generate
00:09:53.200 | and the goal of reinforcement learning from human feedback is to make
00:09:57.200 | sure that the language model will generate more answers like the
00:10:01.920 | ones that are chosen by the annotators and be less likely to
00:10:05.200 | generate answers that are not chosen by the annotators
00:10:10.080 | this is called the reward model
00:10:15.680 | of reinforcement learning from human feedback
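To make this concrete, here is a minimal sketch (in Python, with made-up field names and examples, not anything from the paper) of what such a preference dataset looks like:

```python
# A minimal sketch of a preference dataset for RLHF: for each prompt we keep
# the answer the annotators preferred and one they rejected. The field names
# and the examples are illustrative, not taken from the paper.
preference_data = [
    {
        "prompt": "where is shanghai?",
        "chosen": "shanghai is a city in china.",
        "rejected": "the sky is blue.",
    },
    # ... more (prompt, chosen, rejected) triples collected from annotators
]
```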
00:10:20.320 | okay now that we have understood a little bit the connection between
00:10:24.400 | language models and the reinforcement learning framework let's
00:10:28.400 | move on to the paper i just want to do a little
00:10:31.520 | review of what we have seen so far so now we know what language models are
00:10:35.360 | they are models that generate the probability
00:10:39.680 | distribution over
00:10:45.680 | what the next token should be based on the input
00:10:49.040 | we know what is reinforcement learning which is a framework for training the
00:10:52.240 | policy of an agent in order to choose actions
00:10:55.760 | that maximize its reward what is the connection between
00:10:59.440 | reinforcement learning and language model is that the language model itself
00:11:02.480 | is a policy because it makes decisions it takes
00:11:05.600 | actions in choosing what is the next token and we want the language model to
00:11:09.600 | choose tokens so the next token in such a way that
00:11:13.120 | it follows some standards which are according to a data set of
00:11:17.360 | preferences that we usually build
00:11:21.040 | this data set of preferences is converted into a reward model but we
00:11:25.280 | will not be covering the reward model for now at least
00:11:28.320 | so now let's go to the deepseek paper now in the
00:11:31.760 | deepseek r1 paper what they do is they start with
00:11:36.240 | a pre-trained model they use the deepseek v3
00:11:39.760 | base model which i believe is roughly a 600 billion
00:11:42.720 | parameter model and then they want
00:11:46.400 | this model to perform better at reasoning
00:11:51.280 | what does it mean to perform better at reasoning we want the language model
00:11:55.120 | to find a way to solve complex problems by
00:11:59.040 | breaking them into smaller steps
00:12:04.240 | that are easier for the language model to manage and the way they do it
00:12:08.160 | is also through reinforcement learning let's go
00:12:12.240 | to the paper so let's go here let's go here let's go here okay
00:12:18.640 | first of all they say that
00:12:22.400 | in this section we explore the potential of
00:12:25.840 | language models to develop reasoning capabilities without any supervised
00:12:29.920 | data focusing on their self-evolution through a pure reinforcement learning
00:12:34.080 | process so what is supervised data when we
00:12:36.800 | train language models we also try to build some
00:12:39.680 | very high quality data set of what the language model
00:12:43.200 | should be generating because as we said before we have like
00:12:46.720 | multiple stages of training one is the pre-training in which we just throw
00:12:50.400 | random data from the web to the language model but then we want the language
00:12:53.520 | model to generate high quality data so we have this kind of the supervised
00:12:57.840 | fine-tuning part and they skip this part here they just
00:13:01.920 | take the base model and then they want the base model to
00:13:05.840 | develop the reasoning capability just by using
00:13:09.680 | reinforcement learning which means that we want to incentivize the language
00:13:14.080 | model through a reward system to develop by itself
00:13:18.400 | what is the sequence of token that should lead to so
00:13:22.320 | for the language model to acquire as much reward as possible
00:13:25.840 | it's like i take my cat and i want the cat to solve math problems
00:13:32.560 | and i what i can do i can just play with how many biscuits i can give to the cat
00:13:37.440 | so if i build my reward model in such a way
00:13:40.960 | that the cat is incentivized to solve math problems
00:13:46.560 | then by the because the cat wants to get the biscuit the cat will develop
00:13:51.120 | whatever skill it needs to develop in order to get
00:13:54.240 | maximize the number of biscuits it gets now of course the cat will never
00:13:57.440 | develop it because its underlying brain kind of
00:13:59.680 | lacks the capability of learning certain things but that's not the problem in
00:14:03.600 | language models because big language models have a lot of
00:14:07.600 | capabilities in developing novel skills
00:14:11.840 | now the algorithm that they use in DeepSeek R1 is called the GRPO
00:14:18.320 | algorithm if you look at my video on
00:14:21.520 | reinforcement learning from human feedback historically we have always
00:14:25.440 | used the PPO algorithm more recently the DPO algorithm there are
00:14:29.760 | also other algorithms like the ORPO
00:14:33.360 | what they use here is called the GRPO algorithm and it's
00:14:37.040 | very similar to the PPO but slightly different
00:14:40.160 | and we will see how let's see what the GRPO algorithm does
00:14:47.280 | well as we saw before when we do reinforcement learning on a
00:14:54.720 | language model we have a data set of preferences so we
00:14:58.160 | have some questions and then we ask the language model to generate
00:15:01.120 | multiple answers and then we ask annotators to choose which answer they
00:15:04.720 | like and then we train a reward
00:15:07.760 | model
00:15:11.120 | that gives the signal to the language model to understand
00:15:16.000 | whether the answer that the language model is generating
00:15:20.400 | is good or bad in this case with GRPO we have the
00:15:26.240 | following objective now if you have never seen
00:15:31.120 | PPO and its clipped objective this
00:15:34.320 | sounds quite scary but let's try to break it down
00:15:38.000 | step by step what are we doing here is we want
00:15:41.920 | to optimize a policy so the policy is the language model itself and the policy
00:15:47.040 | is always denoted with the letter pi like the greek letter pi so this
00:15:51.840 | pi of theta is the policy so it is the language model that we are trying to
00:15:55.280 | optimize we want this policy to be trained to
00:15:59.120 | maximize the following objective what is this objective and this
00:16:04.160 | objective is saying that if i have a list of questions that
00:16:08.560 | belong to some database of questions and
00:16:13.840 | we sample some outputs from our policy using these questions then
00:16:21.680 | based on the reward that each output of the language
00:16:28.880 | model gets from our reward system we should train the language model to
00:16:34.080 | give more weight to those actions that result
00:16:38.080 | in good reward and to give less weight
00:16:42.480 | so the language model should be less likely to take those actions that
00:16:46.560 | lead to bad reward and the way we do it
00:16:51.120 | is as follows so basically what we are doing is
00:16:56.400 | okay here you see the old policy and the new policy
00:17:00.240 | for now let's ignore that that's because we do something called
00:17:03.760 | offline learning so we will not be covering that at least now
00:17:08.160 | for now ignore the denominator here so the pi old
00:17:15.600 | just concentrate on the term pi what we are saying is that we
00:17:21.840 | generate the log probabilities of the output so what is the log
00:17:26.080 | probability of the output let's do it step by step actually okay
00:17:32.320 | imagine we have the following question
00:17:36.560 | for example imagine we ask the language model where is shanghai so
00:17:42.160 | and the language model generates because as we saw we sample a few questions
00:17:53.120 | from a database of questions and then we generate
00:17:57.760 | multiple outputs for this question using our
00:18:00.720 | language model so these outputs are called o and there are g of them this is
00:18:05.680 | the group in grpo maybe the language model the first time
00:18:10.320 | will generate let's say uh shanghai is in china so
00:18:15.360 | let me just write shanghai is in china another output could be for
00:18:22.240 | example the sky is blue
00:18:29.200 | and another output could be shanghai is beautiful
00:18:36.640 | imagine we have some magic reward system that assigns or imagine that our reward
00:18:45.120 | system is a human being this human being will very likely to
00:18:50.000 | give a very high reward to this answer zero reward to this answer and maybe
00:18:56.000 | not completely zero but nearly zero score to this answer
00:19:01.040 | why because this at least talks about shanghai this doesn't
00:19:04.400 | talk about anything related to shanghai and this actually
00:19:07.520 | answers the questions now when we have a language model we have
00:19:13.440 | a question and the generated answer we have what is known as a trajectory
00:19:19.600 | the trajectory is a list of actions that the language model has taken
00:19:24.000 | why because the language model was given this question
00:19:27.280 | as input and the language model took an action chose
00:19:31.600 | an action to generate the first token which is shanghai
00:19:34.640 | then this shanghai was put back into the language model and then the language
00:19:38.080 | model chose another action which is the token
00:19:40.480 | is and then this was put back into the language model and then the language
00:19:44.480 | model chose another action which is in etc etc this is a trajectory at each
00:19:49.840 | step of the generation process the language
00:19:52.240 | model chose an action based on the probability
00:19:55.600 | distribution that it generated actually it's not the
00:19:58.240 | language model that chooses it's our sampling strategy that chooses the
00:20:01.760 | particular token so usually we can use the greedy strategy or the top-p
00:20:05.120 | strategy or whatever so at each step we have chosen some
00:20:10.960 | action based on the distribution of the language that the language model
00:20:14.800 | generated for each state so we have a
00:20:20.560 | probability associated with the word shanghai
00:20:24.560 | conditioned on the question where is shanghai
00:20:27.680 | we have a probability associated with the word is
00:20:30.640 | conditioned on the input where is shanghai
00:20:33.680 | question mark shanghai then we have a probability associated with the word
00:20:37.840 | in conditioned on the input where is shanghai
00:20:41.200 | question mark shanghai is blah blah blah etc etc so we have a list of
00:20:45.280 | probabilities what we are doing here is that we want
00:20:50.400 | this which is the product of all the probabilities or the sum of the log probabilities
00:20:55.600 | that are generated at each step of the particular output
00:20:59.280 | o i here which is maybe the first answer here
00:21:03.440 | then we ask the language model again the same question and the language model
00:21:06.560 | will come up with another output and this output will also have
00:21:10.480 | associated with it a list of log probabilities and these are the log
00:21:14.160 | probability that you see here each log probability
00:21:17.520 | furthermore is weighted by an advantage term the advantage term
00:21:24.960 | is basically telling me how much better it is to choose a particular token
00:21:31.040 | given a particular input over all the tokens that are available
00:21:34.800 | for example imagine we have the input where is shanghai
00:21:39.360 | is it better to choose the word shanghai as the
00:21:43.200 | first word of the response or the word pizza
00:21:47.520 | i believe it's better to choose the word shanghai so the advantage of choosing
00:21:51.520 | the word shanghai would result in a better long-term
00:21:55.680 | reward for the policy so for the language
00:21:58.480 | model because it will result in a good answer so it will result in a
00:22:02.720 | high reward from our reward model for now we
00:22:05.840 | just have modeled the reward model as a human being who tells that the
00:22:09.600 | language model okay this is a good answer this is a bad answer
00:22:12.960 | but actually later we will see that the reward is actually also a language model
00:22:16.480 | and in the case of grpo they actually used
00:22:19.440 | in the case of the deep seek r1 they use actually a rule-based reward
00:22:23.120 | model so let's go back we have a list of questions
00:22:30.240 | we generate multiple answers with each of these questions
00:22:34.320 | using our language model and then we for each of these
00:22:38.400 | answers we have the log probabilities associated with this answer
00:22:42.720 | which is just the product of all the probabilities of choosing that
00:22:46.160 | particular token given that particular input
00:22:49.600 | we weight each of these log probabilities by an advantage term
00:22:54.320 | which basically tells us how good choosing this particular token is over
00:22:59.680 | all the others that are available for this particular input
00:23:03.120 | and we train our language model to maximize this objective
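For reference, the objective being walked through is the GRPO objective; my transcription of it from memory of the paper (so treat the exact notation as an approximation of the slide) is roughly:

```latex
J_{\mathrm{GRPO}}(\theta)
= \mathbb{E}_{\,q \sim P(Q),\;\{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)}
\left[
\frac{1}{G} \sum_{i=1}^{G}
\left(
\min\!\left(
\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}\, A_i,\;
\operatorname{clip}\!\left(
\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\,
1-\varepsilon,\, 1+\varepsilon
\right) A_i
\right)
- \beta\, \mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)
\right)
\right]
```

Here q is a question sampled from the dataset, o_1, ..., o_G are the G outputs sampled from the old policy (the "group" in GRPO), A_i is the advantage of output o_i, epsilon controls the clipping, and beta the strength of the KL penalty against the reference policy. Each of these pieces (the ratio, the clip, the KL term, the advantage) is discussed in the next minutes of the talk.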
00:23:07.120 | let's see what does it mean to maximize this objective and now let's also see
00:23:10.640 | what is this old here the language model that we will be
00:23:15.360 | training is our deep seek base right so at the
00:23:19.760 | beginning suppose that this pi old is the base
00:23:23.200 | version of deepseek what we do is basically we want to refine
00:23:28.560 | it iteratively by keep generating
00:23:32.480 | outputs from it and then through the reward model we want to tell it okay
00:23:36.480 | this was a good output so do
00:23:40.480 | more of this or this was a bad output so do
00:23:44.240 | less of this this is one of the advantages of using reinforcement learning
00:23:48.000 | because if you do supervised fine tuning you're
00:23:50.560 | just telling the language model to i want this so generate this if you are
00:23:54.560 | doing reinforcement learning you have the ability to tell the language model i
00:23:57.600 | want more of this and i want less of this
00:24:00.800 | what we are doing is that
00:24:04.320 | at each iteration the language model should be optimized in the following way
00:24:12.080 | if the language model at the current iteration is giving
00:24:15.120 | more probability more likelihood to generate
00:24:19.120 | a response that resulted in a good reward and the advantage term will be
00:24:25.040 | high in that case then this policy is this
00:24:30.160 | objective is telling the language model do more of this
00:24:33.440 | however if right now the language model is
00:24:36.800 | giving probability to an action that resulted in
00:24:41.360 | a bad reward and the advantage term will be negative in
00:24:46.560 | that case then by optimizing this objective here
00:24:50.400 | by maximizing the objective here the language model
00:24:53.040 | will learn to do less of that i know that i didn't explain very well
00:24:58.640 | the pi theta and the pi old theta i believe i can do that later
00:25:03.200 | if we have time by talking about
00:25:08.960 | offline learning in the case of reinforcement learning
00:25:15.440 | moreover in the grpo objective you find this KL divergence term
00:25:19.760 | now for people who already know what the KL divergence is this will be super
00:25:22.960 | easy but for people who don't know basically the KL divergence is a way
00:25:26.480 | of measuring how different two distributions are so how far
00:25:30.240 | apart they are what we want is we want the language
00:25:33.360 | model to be fine tuned to generate more of things that lead to a better
00:25:38.000 | reward to do less of things that result in
00:25:41.600 | low reward but at the same time we don't want the
00:25:44.240 | language model to change too much its behavior
00:25:46.880 | for example imagine we have a reward model
00:25:49.920 | that tells the language model to be more polite
00:25:53.760 | what the language model could do and imagine that by saying
00:25:56.880 | being polite for example in my case means that i always say thank you right
00:26:00.880 | so what could happen if we don't enforce the KL divergence is that the
00:26:04.160 | language model could just cheat and always generate thank you thank you
00:26:07.440 | thank you thank you thank you thank you at every response
00:26:09.760 | like a list of thank yous because that results obviously in a high reward
00:26:13.920 | but the language model would stop doing its main job which is to generate
00:26:17.360 | something factual and useful so we want the language model to change
00:26:21.280 | a little bit so to be more polite but not
00:26:27.840 | just generate a bunch of thank yous so
00:26:30.880 | change but change a little bit that's why we add the KL divergence
00:26:35.360 | otherwise what the language model will do it will
00:26:37.440 | do what is known as a reward hacking which is it will try to find
00:26:41.200 | a way to just learn what is a trick to get
00:26:45.440 | maximize its reward without actually being useful
00:26:50.240 | this is also for example if you want a parallel example it's like
00:26:54.240 | you have a tax code and it's very complex people will always find a way to
00:26:58.880 | cheat on it and if you have a tax code that is very simple made up of few rules
00:27:03.120 | then it's very unlikely that people will be
00:27:05.200 | able to cheat on it
00:27:08.880 | so in that case you cannot cheat
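To make the "how far apart are two distributions" idea concrete, here is a tiny Python sketch of the KL divergence between two next-token distributions; the numbers are invented for illustration and are not from any real model:

```python
import math

def kl_divergence(p, q):
    # D_KL(p || q) = sum_x p(x) * log(p(x) / q(x))
    return sum(p[tok] * math.log(p[tok] / q[tok]) for tok in p)

reference_model = {"china": 0.80, "beijing": 0.15, "pizza": 0.05}
fine_tuned      = {"china": 0.70, "beijing": 0.25, "pizza": 0.05}
drifted         = {"china": 0.05, "beijing": 0.05, "pizza": 0.90}

print(kl_divergence(fine_tuned, reference_model))  # small: behaviour barely changed
print(kl_divergence(drifted, reference_model))     # large: the model changed too much
```

In the GRPO objective this penalty is the KL between the policy being trained and a frozen reference policy, scaled by the coefficient beta mentioned a bit later.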
00:27:14.480 | okay what are we missing here so first of all i didn't
00:27:23.520 | explain the offline policy learning and i didn't
00:27:30.800 | explain the clip part so why are we clipping here
00:27:36.400 | in the clipping part basically we are saying that if the language model is
00:27:39.680 | trying to change its
00:27:44.400 | probabilities by being
00:27:48.160 | too confident about its change then we don't want to
00:27:53.680 | let the model be overly confident basically it means that if
00:27:58.160 | this is the language model at the current iteration and this
00:28:01.520 | you can think of it at the previous iteration in the training process
00:28:06.080 | if the language model at the current iteration is very confident that by
00:28:10.160 | saying the word shanghai will result in a better
00:28:13.600 | reward even if it's a good choice we don't want the language model to be
00:28:18.880 | overly confident so we clip this ratio between the probabilities up to
00:28:24.960 | one plus epsilon or in the lower case down to one minus epsilon
00:28:30.800 | because if the language model chooses the next
00:28:34.320 | word as shanghai given this question then okay
00:28:37.440 | we are lucky and it's good but imagine the language model is overly
00:28:41.040 | confident that the next word is i don't know coffee then we don't want
00:28:45.920 | the language model to make too big a step we want the model to learn as
00:28:49.760 | slowly as possible by choosing accordingly this epsilon
00:28:54.960 | term which is something that we can choose and this beta term
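Here is a small Python sketch of that clipping idea, in the spirit of the PPO-style clipped term used in GRPO; the numbers and the function name are made up for illustration:

```python
import math

def clipped_term(logp_new, logp_old, advantage, epsilon=0.2):
    ratio = math.exp(logp_new - logp_old)            # pi_theta / pi_theta_old
    clipped_ratio = max(1 - epsilon, min(ratio, 1 + epsilon))
    # take the more pessimistic of the two terms, as in PPO/GRPO
    return min(ratio * advantage, clipped_ratio * advantage)

# Even if the new policy is 3x more confident than the old one, the
# contribution is capped at (1 + epsilon) * advantage:
print(clipped_term(logp_new=math.log(0.9), logp_old=math.log(0.3), advantage=1.0))  # 1.2
```

So no matter how confident a single update makes the model, one optimization step cannot push the policy too far from the one that generated the samples.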
00:28:59.280 | okay i believe i have covered a little bit of this
00:29:03.760 | so last review guys so we are trying to optimize the language model
00:29:07.840 | iteratively by telling it to make more of something that we want
00:29:11.840 | and to do less of what we don't want how does the model know what we want and
00:29:15.920 | what we don't want it's the reward that we give it
00:29:18.720 | according to our reward model now let's talk about the reward model
00:29:22.960 | the reward model historically in ppo let's go to the other here
00:29:28.720 | was a model which was of the same structure as the
00:29:33.120 | language model that we are trying to optimize
00:29:35.680 | in which we add a linear layer on top that
00:29:38.960 | gives a reward to each answer how can we assign a numeric reward to
00:29:45.680 | a particular answer so if you remember when we talk
00:29:52.160 | about the bpo i said that usually we start with some
00:29:55.680 | questions then we generate a few answers then we ask some annotators to choose
00:29:59.360 | which answers we like and they and to also tell us which answers they
00:30:02.800 | don't like how do we convert this data set of
00:30:05.840 | preferences into a number we do that by training a
00:30:10.720 | language model which has the same architecture as the
00:30:14.160 | policy that we are trying to train just a different head on top that instead
00:30:18.160 | of generating the log probabilities of the next token generates a numeric
00:30:21.600 | reward and we do it with what is known as the bradley-terry model which is
00:30:27.040 | basically this loss here you don't have to understand this loss it
00:30:30.240 | doesn't matter it basically means that if we train the language
00:30:34.400 | model on this loss it will generate a very high reward
00:30:37.840 | this is the bradley-terry model
00:30:42.000 | so if we train our language model on this loss it will basically result
00:30:49.520 | in this head here this linear head on top of the language model
00:30:53.360 | to generate a very high value
00:30:57.200 | for the answers that were chosen in the data set of preferences and
00:31:00.800 | a very low value for the answers that were not
00:31:06.800 | chosen by the professional annotators this is how it was done with ppo
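For reference, the pairwise loss being described is the standard Bradley-Terry reward-model loss; written from memory (so treat the notation as an approximation of the slide), it is roughly:

```latex
\mathcal{L}(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim D}
\left[ \log \sigma\!\left( r_\phi(x, y_w) - r_\phi(x, y_l) \right) \right]
```

where x is the question, y_w the answer chosen by the annotators, y_l the rejected one, r_phi the scalar head on top of the language model, and sigma the sigmoid; minimizing it pushes the reward of chosen answers above the reward of rejected ones.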
00:31:13.120 | we have an objective which increases the likelihood of the things that give
00:31:18.000 | high reward we also have a reward model which
00:31:20.960 | generates a numeric reward as a signal for the language
00:31:24.480 | model to understand what it should do more and what it should
00:31:27.520 | do less in deepseek r1 they do something
00:31:30.800 | different instead of using a model as a
00:31:35.200 | reward they use a rule-based reward system so
00:31:38.640 | they don't train a neural network to
00:31:41.440 | generate a number that gives a signal to the model to
00:31:44.800 | understand what we like and what we don't like
00:31:48.960 | they used a rule-based system you can see here
00:31:52.800 | and this rule-based system you can use for all the tasks that you can kind
00:31:57.600 | of verify so for example for the leetcode
00:32:00.960 | problems how can you check if the answer
00:32:05.200 | generated by the model is good well you just run it and if it
00:32:08.400 | first compiles and secondly it
00:32:12.000 | runs within a predefined time limit then it is a good answer
00:32:16.000 | it doesn't matter how it came to be if it works it's a good answer
00:32:20.480 | just like also for example math for most
00:32:24.000 | math problems we do have the answer that we expect the model to generate so we
00:32:28.160 | can compare the actual generated answer
00:32:30.800 | with what we expect the model to generate
00:32:35.920 | so they create a rule-based reward system in a way that they
00:32:40.480 | ask the language model to generate some output for a given problem
00:32:44.160 | and then they can assign a reward by just following rules so check
00:32:47.600 | okay if it's a leetcode problem just run it and check if it runs
00:32:51.280 | okay good reward doesn't run okay zero and if it's a math problem they just
00:32:56.480 | take the output of the model compare it with the expected
00:32:59.520 | result and assign a reward based on that so if
00:33:02.800 | the answer matches what we expected good reward otherwise
00:33:06.400 | zero etc etc they also assign a reward
00:33:11.200 | to the language model if it formats the output in a certain way
00:33:17.120 | for example they give a reward to the language model
00:33:21.840 | if it follows the format of putting all
00:33:26.320 | the thought process between the tags think and
00:33:30.160 | slash think etc so basically just with this they train the
00:33:37.040 | language model so they train the language model to
00:33:40.240 | generate answers and they score these answers with a reward
00:33:46.960 | system which is rule-based and they keep
00:33:50.960 | training it
00:33:53.600 | and the language model basically automatically by itself
00:33:57.760 | learns to generate the thought process that is necessary to perform
00:34:03.200 | the tasks that it is being asked so the language model learns by
00:34:09.520 | itself just with reinforcement
00:34:12.800 | learning and with this reward system to generate the thought process
00:34:17.920 | that leads to generating the right code for the
00:34:21.920 | leetcode problems and to generate the right thought process
00:34:25.920 | that is necessary for solving math problems etc etc
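As a toy illustration of such a rule-based reward (the exact rules, the reward values, and the run_and_check helper below are my own invented stand-ins, not the paper's implementation), a sketch could look like this:

```python
import re

def run_and_check(code_str, test_cases):
    # Hypothetical stub: a real implementation would compile and execute the
    # generated code in a sandbox and verify its outputs within a time limit.
    return False

def rule_based_reward(problem, completion):
    reward = 0.0
    # Format reward: the reasoning must be wrapped in <think> ... </think> tags.
    if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL):
        reward += 0.1
    final_answer = completion.split("</think>")[-1].strip()
    if problem["type"] == "math":
        # Accuracy reward: compare against the known reference answer.
        if final_answer == problem["reference_answer"]:
            reward += 1.0
    elif problem["type"] == "code":
        # Accuracy reward: run the generated code against test cases.
        if run_and_check(final_answer, problem["test_cases"]):
            reward += 1.0
    return reward
```

No neural network is needed to produce this signal, which is the point the speaker is making.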
00:34:31.440 | let me check what else we need to know from here
00:34:36.400 | okay here they show some results i think the results you can check by yourself
00:34:41.600 | what is very interesting i think it's this
00:34:45.040 | during the training of r1 just with reinforcement learning so as i want
00:34:50.480 | to remind you they took deepseek v3 base
00:34:54.800 | and added this reinforcement learning step on top of it which is a massive
00:35:00.560 | reinforcement learning step usually the alignment part is not so big
00:35:05.120 | and the more they run this reinforcement learning
00:35:09.680 | step they saw that the language
00:35:14.080 | model automatically learns to generate longer responses
00:35:19.520 | because to solve problems you need to generate a longer chain of
00:35:23.200 | thought so the language model because of the
00:35:25.680 | reward system learns that in order to get reward
00:35:29.600 | it should generate longer responses so they didn't tell the language model
00:35:35.040 | with supervised fine tuning to generate that kind of
00:35:38.240 | data with that particular format with that kind of thought process
00:35:42.720 | just with reinforcement learning with the right incentives the language model
00:35:46.320 | learned to do that how did it do that it learned that at a particular input it
00:35:51.600 | should generate that particular token which in the long term results in a good
00:35:56.320 | reward so the beauty of
00:35:58.960 | reinforcement learning is that you not only learn to do something
00:36:03.360 | based on the immediate reward that you get
00:36:06.640 | but also on the long-term reward that you will get because sometimes the reward
00:36:11.280 | as you can see here the reward model only applies when the entire answer is
00:36:17.680 | produced so the signal
00:36:21.360 | that the model gets is only for the entire output
00:36:26.240 | actually okay through the advantage term this signal is propagated back to each
00:36:30.000 | single token but okay we can skip that in the in the reinforcement learning
00:36:34.160 | actually the beauty is that sometimes the reward you get is not for
00:36:37.840 | the single action you take so in the case of my cat for example
00:36:41.200 | let's say here so let's go back to my cat
00:36:45.280 | the cat will receive reward only after taking many steps that lead it to the
00:36:50.080 | meat so only when the cat is here it will
00:36:52.960 | know okay all this sequence of actions was a good
00:36:56.480 | choice but this signal is propagated back to
00:37:00.880 | each single action in such a way that when the cat will
00:37:04.480 | be here it will be very likely to choose i need to go
00:37:08.080 | down and less likely to choose i need to go
00:37:11.040 | right this also happens in the case of language
00:37:14.080 | models in this case through the advantage term here
00:37:18.160 | where is it through this advantage term here
00:37:22.080 | which is actually done for each token in the case of
00:37:25.120 | okay now we can go into a little bit more technical detail
00:37:29.520 | on why they choose grpo over ppo first of all
00:37:33.840 | well with ppo basically this advantage term here
00:37:37.600 | requires another function to compute called the value function and
00:37:42.320 | this value function basically to be computed requires the training of
00:37:45.600 | another model whereas in grpo
00:37:49.920 | this advantage term is calculated without the value
00:37:53.680 | function but by the following formula here which is
00:37:56.240 | basically just based on the rewards which is already
00:38:02.720 | given by the reward model which we have which in the case of
00:38:07.760 | deepseek r1 is rule-based so they don't need to train this other
00:38:11.280 | model to estimate the value function which
00:38:15.680 | adds more complexity to the system
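The "formula based only on the rewards" the speaker points at is, as far as I recall from the paper, the group-relative advantage, where each output's reward is normalized against the other outputs of the same group:

```latex
A_i = \frac{r_i - \mathrm{mean}\big(\{r_1, r_2, \ldots, r_G\}\big)}{\mathrm{std}\big(\{r_1, r_2, \ldots, r_G\}\big)}
```

So an output only gets a positive advantage if it scored better than the average of its own group, and that group mean plays the role of the baseline/value function that PPO would otherwise have to learn.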
00:38:15.680 | okay so now we have seen what is the reinforcement learning we have seen what
00:38:19.520 | is the connection between language models and reinforcement learning we
00:38:21.760 | have seen a little bit what is the the grpo objective
00:38:29.280 | okay i believe let's actually do a poll guys
00:38:33.120 | do you want me to go deeper in the grpo like
00:38:37.120 | let's explore exactly this loss because actually the the rest of the paper is
00:38:43.920 | okay we tried just reinforcement learning okay
00:38:47.200 | so people like deep so let's go deep okay all right first of all the most
00:38:52.240 | interesting thing about reinforcement learning
00:38:55.760 | especially in the case of ppo and reinforcement learning from human feedback is
00:39:00.480 | this thing called policy gradient optimization so let's go back to the
00:39:03.360 | other slide and then we go back to the paper
00:39:05.920 | because the rest of the deepseek paper is basically they took
00:39:09.680 | this r1 and then said okay instead of just doing reinforcement learning let's
00:39:13.920 | do maybe multiple steps of reinforcement
00:39:16.240 | learning and supervised fine tuning and then reinforcement learning again then
00:39:19.200 | supervised fine tuning again and it leads to better outcomes but that is not technically difficult to
00:39:23.360 | understand i think because most of you already kind of have backgrounds in this
00:39:27.680 | so if you're here it's because you kind of understand what we are talking about
00:39:31.840 | so let's do all the things that maybe some people have difficulties with which
00:39:35.200 | is i believe this and this part okay so let's go
00:39:37.840 | deeper so um okay so let's go back to my cat
00:39:42.560 | and imagine i want to train my cat so i want to train my cat
00:39:46.080 | to reach the meat so as we saw before my cat is just
00:39:52.320 | an agent with a policy a policy is what tells the cat what action to take
00:39:56.240 | given the position of the cat inside of the house
00:39:58.800 | you need to think of this house as a grid so like
00:40:01.920 | the following you cannot see the grid lines because i don't know
00:40:05.840 | it's not showing but okay it doesn't
00:40:08.240 | matter what is the policy the policy as we saw
00:40:12.080 | before tells the cat what action to take given a particular position
00:40:15.280 | our goal in reinforcement learning is to select a policy that
00:40:18.800 | maximizes the reward that the agent gets when using this policy so
00:40:25.040 | this is the objective we want to select among
00:40:29.040 | all the possible policies that we can have
00:40:32.560 | the one that maximizes an objective what is this objective it is the expected
00:40:38.320 | reward that we can get when using this policy
00:40:41.040 | it means that if i put inside the brain of my cat
00:40:44.080 | this policy which is the decision making stuff
00:40:48.080 | then this best policy will tell my cat to move down here
00:40:52.960 | and move right here and move right right down down down
00:40:57.040 | until it arrives at the meat this should be the
00:41:00.800 | policy how do we actually train this policy we do what is
00:41:09.360 | known as policy gradient optimization
00:41:14.160 | but before we understand policy gradient optimization we need to understand a few terms
00:41:17.920 | so we want to first of all learn what
00:41:23.520 | trajectories are a trajectory is basically a list of states and
00:41:26.800 | actions so
00:41:30.560 | okay the trajectories are a list of states and actions so if the cat
00:41:35.520 | is here it's in the state uh let me draw the
00:41:37.920 | lines otherwise it's too bad for you guys if you cannot see it
00:41:41.920 | here here here here
00:41:45.840 | okay this is the the cat is initially here so this let's call it the state
00:41:51.040 | number zero and the cat can choose some action and
00:41:53.440 | let's call it the action number zero when the cat takes the action number
00:41:56.960 | zero it will arrive in a new state so maybe the cat will arrive here and it
00:42:00.560 | will become the state number one of the cat and then the cat here we can take
00:42:03.680 | another action according to its policy let's call it action number one and
00:42:07.520 | which will lead the cat into a new state so we'll call it s
00:42:10.480 | s2 in which it can take
00:42:15.040 | another action and let's call it the action number two etc etc so the
00:42:18.240 | trajectory is a list of states and action that the agent can take
00:42:21.600 | inside of the environment uh what we want
00:42:25.040 | is if we sample a trajectory according to our policy we want to
00:42:29.840 | maximize the reward that we get from each of the
00:42:32.960 | possible trajectories that we can take
00:42:36.800 | how do we do that let's do it here so basically what we do is the
00:42:43.680 | following this is our objective
00:42:48.240 | as you know in deep learning what we are doing is
00:42:53.520 | we are trying to either maximize something or to minimize
00:42:57.280 | something in the case of
00:43:00.960 | model training we usually always minimize a cost function in this case we
00:43:06.000 | want to maximize the expected rewards when the agent acts
00:43:10.480 | according to this policy this policy however is not just any policy it is a
00:43:15.040 | particular policy made up of some parameters that we will call
00:43:18.160 | theta to give you a parallel on what is happening here
00:43:23.120 | is the following imagine you have a company
00:43:26.800 | and you are like the ceo of the company and this company is made up of
00:43:34.720 | many actors and many functions and many departments
00:43:38.480 | so each of these things let's say are parameters of
00:43:43.440 | your company they define your company because you can tune them and the
00:43:46.800 | company function will change so how people talk to each other how
00:43:51.200 | people work how the departments work how the
00:43:54.560 | logistics work how the office works etc
00:43:57.040 | they are all parameters of your company which define the outcome of your
00:44:00.640 | company and imagine you want to maximize the
00:44:04.320 | profit of your company so what you do you learn to tune all of
00:44:08.160 | these parameters so you learn to for example tell people to behave in a
00:44:11.760 | particular way or you tell the people to collaborate in a particular way or to
00:44:15.760 | work on some projects and not work on some
00:44:17.760 | other projects this is what we do in
00:44:21.120 | policy gradient optimization we calculate the gradient with respect
00:44:26.480 | to the parameters of this particular objective function
00:44:31.280 | which what is the gradient
00:44:35.120 | yes later we talk about the discount factor in the word sum okay
00:44:38.560 | so what is the gradient the gradient basically tells us
00:44:44.400 | how the how the objective will change if we change the
00:44:51.040 | parameter a little bit the gradient always tell us how it will
00:44:57.040 | increase so the gradient tells us what is the
00:45:00.080 | the direction of max the maximum ascent of a particular
00:45:03.680 | function with respect to the variable in to which you
00:45:06.960 | calculate it so in this case we are computing the
00:45:09.440 | gradient with respect to the parameters of the objective function which tells us
00:45:17.280 | how should i change the parameters to increase
00:45:20.480 | this objective function which is exactly the expected reward
00:45:24.480 | that we want we can get from this policy so
00:45:28.640 | because the the the gradient tells us how we should change the parameters to
00:45:34.240 | increase the objective then we change the parameters
00:45:37.600 | according to the direction of the gradient and this is what we do here
00:45:41.600 | so we have an objective which is tells us
00:45:44.960 | the expected reward when acting according to this policy
00:45:48.480 | we calculate its gradient with respect to the parameters which tells us
00:45:52.320 | how we should change these parameters to increase this objective so to increase
00:45:56.640 | the expected reward and then we change the parameters in the same direction of
00:46:01.120 | the gradient and we do it iteratively this is called
00:46:05.680 | policy gradient optimization and actually this is a beautiful result
00:46:09.280 | because it means that i can just use my cat
00:46:13.600 | whatever policy my cat has right now i can just
00:46:18.560 | sample some trajectories from this policy when i ask my cat to move
00:46:22.080 | around check what kind of reward i get
00:46:25.840 | calculate the gradient with respect to the expected reward according to these
00:46:30.240 | trajectories and then tell the cat hey you should do
00:46:33.920 | more of this because this led to a better reward
00:46:37.120 | or you should do less of this because it leads to a bad reward
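Written out, the objective and the policy-gradient result being described are the standard ones; this is my own summary in notation, not the slide verbatim:

```latex
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \right], \qquad
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right], \qquad
\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)
```

So the gradient of the expected reward can be estimated purely from trajectories sampled with the current policy, the rewards they collect, and the log-probabilities of the actions taken, which is exactly the "beautiful result" mentioned next.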
00:46:41.680 | this is policy gradient optimization now it has some problems because policy
00:46:47.440 | gradient optimization basically okay as you can
00:46:50.400 | let's skip the math because if you want the math i i made a video
00:46:53.760 | it's on youtube so you can watch it tomorrow but it has some problems
00:46:57.840 | because as you can see the search space of the
00:47:01.840 | cat is enormous because the possible
00:47:05.280 | trajectories that the cat can take to go from here
00:47:08.080 | to the meat are a lot because the cat can go like this
00:47:11.840 | it can go like this it can go like this it can go here then come back then go
00:47:17.200 | down etc etc so there are many many many
00:47:21.120 | trajectories that the cat can take to go to the meat
00:47:26.400 | however to compute this objective we should actually check all the
00:47:32.320 | possible trajectories to get the direction of the gradient
00:47:37.760 | however this is intractable it means that in the case of a language model we should
00:47:42.000 | ask the language model to generate all the possible outputs
00:47:46.320 | given a particular question which is intractable because
00:47:49.520 | at each position the language model can choose any token so
00:47:54.000 | let's say the vocabulary size is 30 000
00:47:57.040 | then the language model can choose 30 000 possibilities for the first
00:48:00.560 | token then 30 000 for the second 30 000 for the third etc etc
00:48:05.200 | and to check all of them it's computationally
00:48:08.400 | impossible so we can always approximate this with
00:48:12.160 | a sample with a sample this is called the
00:48:18.880 | monte carlo estimation however this results in because we are
00:48:23.840 | not checking all the possible trajectories
00:48:25.680 | but we are making the decision of optimizing our cat
00:48:28.880 | using only a few trajectories of course as you can see it's a risky situation
00:48:33.440 | so it means that we are making a hard decision on how we should change
00:48:37.520 | a policy without checking all the possible search space
00:48:41.920 | this basically means that we have high variance
00:48:45.200 | and there are many ways to reduce this variance
00:48:48.880 | so when you read the term baseline in the deep seek paper
00:48:53.040 | this is one of the ways to reduce this variance because we
00:48:56.160 | are trying to optimize the language model into choosing
00:48:59.920 | certain patterns into choosing certain chain of thoughts into choosing
00:49:04.000 | certain sequence of tokens without exploring all the possible generation
00:49:08.160 | that the language model can have
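Concretely, the Monte Carlo estimate with a baseline that the term "baseline" refers to looks, in standard notation (again my summary, not the slide), like:

```latex
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N}
\big( R(\tau_i) - b \big) \sum_{t} \nabla_\theta \log \pi_\theta\!\big(a_{t}^{(i)} \mid s_{t}^{(i)}\big)
```

where the N trajectories are the sampled ones and b is a baseline, for example the average reward of the sampled trajectories; subtracting b does not change the expected value of the gradient but reduces its variance, and the group mean inside the GRPO advantage plays exactly this role.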
00:49:11.280 | um so so let me see how can we simplify this one
00:49:18.720 | we blah blah okay so in order to reduce this variance so
00:49:25.760 | in order to make sure that we optimize the language
00:49:29.600 | model even without checking all the possible
00:49:33.040 | generations but still making sure that we make the
00:49:37.040 | gradient that we get so the direction that tells us
00:49:39.680 | what parameter we should change and in which direction in order to increase the
00:49:43.920 | expected reward we can introduce this advantage term
00:49:47.680 | here this advantage term basically for each token tells the language model
00:49:52.480 | how much better choosing this token is over all the other tokens that it can choose
00:49:57.520 | in this position for example in the example that we saw
00:50:01.680 | before so where is shanghai should the language
00:50:06.000 | model choose the word shanghai or should it choose the word coffee or should
00:50:09.680 | it choose the word pizza well it's much more advantageous
00:50:14.000 | to choose the word shanghai because it's very likely that the
00:50:18.160 | language model will then complete it as shanghai is in china or
00:50:21.440 | shanghai is a city in china or shanghai is located in china etc etc
00:50:25.840 | so choosing the word shanghai
00:50:30.400 | results in the long term in a much better
00:50:34.640 | reward so the advantage of choosing shanghai
00:50:38.160 | is higher compared to all the other tokens in that
00:50:41.280 | position in the case of the cat it means that when the cat is
00:50:46.640 | here it's very advantageous to choose go down because
00:50:51.840 | it will result in going to the meat over all the other
00:50:57.040 | actions this doesn't mean that
00:51:00.080 | choosing up will lead you to die or to get no reward because you can always go
00:51:04.240 | up and then change direction and go down but it's
00:51:07.200 | much more advantageous to just go down this is
00:51:13.440 | the meaning of the advantage term
00:51:16.480 | and this is the same advantage term that you see in the
00:51:20.880 | grpo loss now the difference between the
00:51:24.800 | advantage term that you see in ppo and in grpo is that
00:51:28.240 | the advantage term in ppo requires
00:51:31.600 | what is known as the value function while in grpo they
00:51:36.560 | compute this advantage in a different way
00:51:40.560 | which still results in variance reduction
00:51:45.120 | but without having this value function estimation
00:51:48.240 | so grpo is computationally more efficient i would say in this sense
00:52:01.360 | okay let me see what else we need we've also skipped the part on
00:52:08.320 | off-policy learning right so off-policy learning
00:52:13.760 | so what is off-policy learning imagine we have the cat
00:52:21.760 | let's go to the cat actually okay to compute this loss
00:52:25.840 | okay for policy gradient optimization we saw that we need to
00:52:28.480 | sample some trajectories right and we don't have to sample
00:52:33.280 | all the possible trajectories because we are trying to approximate it
00:52:33.280 | um so when we sample these trajectories we are sampling from
00:52:37.440 | a policy which is the current brain of the cat
00:52:40.800 | so we ask the current brain of the cat to choose some actions and
00:52:44.880 | generate some trajectories it means that we ask the cat to just
00:52:48.160 | navigate the house and let's see what it does
00:52:52.720 | okay so we ask the the cat to just navigate the house and
00:52:59.920 | see what it does and then after the cat has navigated the house
00:53:04.160 | we look at what they are the trajectory that the cat has taken and then we give
00:53:08.400 | reward to the cat based on the trajectory it has taken
00:53:12.160 | and then we optimize the policy which means that we optimize the
00:53:15.600 | brain of the cat uh with the whatever it has learned with the with
00:53:22.000 | the direction of the gradient based on the reward it has received
00:53:25.200 | but now the brain of the cat is a new brain because it has
00:53:29.120 | changed compared to the past which means that for the next iteration of
00:53:34.000 | optimizing the brain of the cat or its decision making skills
00:53:38.320 | we need to sample new trajectories so we need to ask the cat again
00:53:42.160 | to go through the house and make a few choices
00:53:45.520 | and then we check these choices and again we tell the cat hey
00:53:49.120 | here you did well here you didn't do well so the cat will learn
00:53:52.880 | to optimize its policy which will result in a new policy
00:53:59.360 | and then again we need to sample from this policy but as you can see every
00:54:02.400 | time we do an optimization step we need to sample these trajectories again
00:54:06.240 | and in the case of language models this means that first you need to sample some
00:54:10.720 | responses then you reward these responses based on your reward model
00:54:16.240 | and then you need to sample new responses because now the policy
00:54:21.360 | has changed because we updated the language model in order to avoid
00:54:26.560 | this sampling process which is expensive we introduce
00:54:30.560 | off-policy learning in which let me show you here where is it
00:54:36.880 | okay in which basically we take the language model
00:54:41.200 | we ask it to generate a lot of trajectories
00:54:44.480 | and we do it once then we sample some of these trajectories which means
00:54:48.720 | basically we ask it to generate some responses then we sample
00:54:51.760 | some of these responses and we fine-tune the language model
00:54:55.040 | based on the reward we got on these responses then we don't sample
00:54:58.800 | new trajectories or
00:55:01.680 | responses we just take another mini batch of the
00:55:04.720 | trajectories that we sampled initially and again we do another step of
00:55:08.320 | optimization and we keep doing it for n steps only then we sample new
00:55:14.080 | trajectories this results in a much more efficient
00:55:18.320 | training so it's not like we
00:55:22.960 | change the policy a little bit and then sample new trajectories
00:55:26.960 | from this policy no we sample a lot of trajectories
00:55:29.600 | initially we keep them in some database in memory or whatever you want
00:55:34.080 | then we keep optimizing the policy
00:55:37.200 | using the trajectories that we have sampled initially
00:55:41.040 | and this basically results in much more efficient training
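Here is a minimal sketch of that off-policy reuse pattern, under my own toy assumptions (a single-step "policy" over a small vocabulary instead of a real language model): trajectories and their log-probabilities are sampled once with the old policy, then reused for several mini-batch optimization steps before re-sampling.

```python
import torch
import torch.nn.functional as F

# Toy sketch of off-policy reuse: sample once with the old policy, store log-probs,
# then run several optimization steps over mini-batches before sampling again.
# The "policy" here is a single softmax over 8 tokens, purely for illustration.

vocab_size, num_samples = 8, 64
logits = torch.nn.Parameter(torch.zeros(vocab_size))         # toy policy parameters
optimizer = torch.optim.Adam([logits], lr=1e-2)

for outer in range(10):                                       # expensive sampling happens only here
    with torch.no_grad():
        old_log_probs = F.log_softmax(logits, dim=-1)
        actions = torch.multinomial(old_log_probs.exp(), num_samples, replacement=True)
        old_logp_a = old_log_probs[actions]                   # log-prob of each sampled action
        rewards = (actions == 3).float()                      # pretend token 3 is the "good" answer
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    for epoch in range(4):                                    # reuse the same samples several times
        for idx in torch.randperm(num_samples).split(16):     # mini-batches
            new_logp_a = F.log_softmax(logits, dim=-1)[actions[idx]]
            ratio = (new_logp_a - old_logp_a[idx]).exp()      # importance ratio pi_new / pi_old
            loss = -(ratio * advantages[idx]).mean()          # maximize ratio-weighted advantage
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```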
00:55:47.920 | and now we can go back to the DPO sorry the DeepSeek R1 paper here
00:55:54.640 | to understand finally the loss in its entirety
00:55:58.480 | so what we are doing here is we have a policy
00:56:02.560 | at the current step of optimization and then we have the policy from which
00:56:07.680 | we sampled the trajectories so
00:56:11.840 | as you can see it's written here so we sample
00:56:14.960 | first a question from our database of questions
00:56:18.160 | then for each question we generate a list of outputs
00:56:21.680 | of responses so we prompt the model basically with the question and then we
00:56:26.320 | ask it to generate multiple responses and you can generate
00:56:29.200 | multiple responses like we saw here like we ask the language model where is Shanghai
00:56:32.240 | and the language model will generate one response then
00:56:34.560 | we ask it again where is Shanghai and maybe this time the language model will say
00:56:37.440 | something else and then we ask it again etc etc so we
00:56:40.000 | generate a list of outputs for the same question then we
00:56:44.480 | compute the following loss which is basically
00:56:47.440 | the ratio of the probabilities or equivalently the
00:56:52.240 | difference of the log probabilities so the ratio of the probabilities
00:56:57.360 | assigned to the output by the current
00:57:02.400 | iteration so the iteration at which we are currently
00:57:07.120 | optimizing the language model with respect to
00:57:11.120 | the language model from which we sampled so
00:57:14.400 | basically when we initially sample the trajectories we already pre-compute the
00:57:17.920 | probabilities we can compute the probabilities
00:57:22.080 | while sampling and save them so we always have available this
00:57:26.640 | pi old now what does this ratio mean imagine that at the current
00:57:33.040 | iteration this ratio is more
00:57:35.360 | than one it means that at the current iteration
00:57:37.920 | the language model is telling me i want to choose this
00:57:41.680 | output more because i am more likely to choose this one now what we want is
00:57:48.240 | if the advantage of choosing this action is good
00:57:52.960 | a positive advantage means that it's
00:57:56.160 | advantageous to choose this
00:57:59.120 | action because it results in a good reward so
00:58:02.400 | if the language model is more likely to choose something
00:58:05.840 | and at the same time this something also results in a good advantage
00:58:10.640 | then this term here will be positive and it will be
00:58:14.800 | big and because we are maximizing this will contribute positively to our
00:58:19.840 | objective because we are trying to maximize it so we want to do
00:58:22.720 | more of this however if the language model
00:58:26.160 | on the other hand is trying to do less of something meaning that this ratio
00:58:30.960 | will be less than one and at the same time this
00:58:34.640 | results in something that is disadvantageous it
00:58:39.200 | means that it's not advantageous to do this
00:58:42.640 | then the language model should do even less of it okay let me restate it so
00:58:51.120 | i don't get lost if this ratio is high
00:58:55.680 | and i get a good advantage it means that the language model will
00:59:01.680 | be more likely to do it if something results in a bad
00:59:06.240 | advantage and the language model is still doing it then it will be a big
00:59:10.880 | negative term which will contribute negatively to our
00:59:14.960 | objective so the language model will be less likely to do it
00:59:18.640 | at the same time we don't want the language model to make big changes
00:59:22.560 | at every step we want to clip to limit the decision making of the language
00:59:28.880 | model at each step so it means that at each step of
00:59:32.480 | optimization even if the language model is very
00:59:35.200 | confident that something is good we don't
00:59:38.800 | care how confident the language model is we want to
00:59:40.880 | limit its confidence by clipping this ratio here between one minus epsilon and
00:59:46.880 | one plus epsilon
00:59:49.920 | at the same time we also don't want the language model to change too much so we
00:59:54.320 | have an initial frozen model here which is pi ref
00:59:57.680 | pi ref basically means the original model so in the case of r1
01:00:02.320 | it is the DeepSeek-V3 base which is the
01:00:07.760 | language model that has never been trained with reinforcement learning
01:00:11.920 | so with this reinforcement learning framework we want to
01:00:15.920 | change the language model a little bit to learn
01:00:19.120 | reasoning but we don't want the language model to forget
01:00:22.320 | everything or to just not behave like a language model anymore
01:00:26.560 | otherwise the language model could just do reward hacking
01:00:29.680 | so pi ref is basically
01:00:34.400 | the DeepSeek-V3 base so the current policy so the current
01:00:41.360 | language model should try to stay as close as possible to the
01:00:44.640 | original model but at the same time it should try to change according to the
01:00:48.640 | reward it gets the advantages of the reward it gets
01:00:53.440 | now how is the advantage term computed here basically it is
01:00:58.160 | each reward normalized so it's basically like
01:01:03.200 | each reward coming out from a distribution
01:01:05.360 | centered on a mean of zero with a standard deviation of one
01:01:10.880 | why do we want this because it means first of all
01:01:14.800 | we don't want the numeric value of the reward to affect the training process
01:01:19.760 | but rather how much better it is compared to the other
01:01:23.600 | rewards to affect the training process not the magnitude of its value
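Putting the pieces together, here is a hedged sketch of the GRPO objective for one question with a group of sampled outputs. The names, shapes, and random log-probabilities are my own illustrative assumptions (one log-probability per output for brevity, while the paper averages over tokens), and the clip range epsilon and KL weight beta are typical values rather than necessarily the ones used for R1.

```python
import torch

# Hedged sketch of the GRPO objective for one question with G sampled outputs.
# Per-output log-probs for brevity (the paper averages over tokens); values are toy.

G = 4                                                   # group size: outputs sampled per question
logp_new = torch.randn(G, requires_grad=True)           # log-prob of each output under the current policy
logp_old = logp_new.detach() + 0.1 * torch.randn(G)     # log-prob under the sampling (old) policy
logp_ref = logp_new.detach() + 0.1 * torch.randn(G)     # log-prob under the frozen reference model
rewards  = torch.tensor([1.0, 0.0, 1.0, 0.0])           # rule-based outcome rewards for the outputs
eps, beta = 0.2, 0.04                                   # clip range and KL weight (illustrative values)

# Group-relative advantage: each reward normalized by the group mean and standard deviation.
advantage = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Clipped surrogate: ratio of new to old probability, limited to [1 - eps, 1 + eps].
ratio = (logp_new - logp_old).exp()
surrogate = torch.min(ratio * advantage, torch.clamp(ratio, 1 - eps, 1 + eps) * advantage)

# KL penalty that keeps the current policy close to the frozen reference model
# (the unbiased estimator used in the GRPO formulation: r - log r - 1 with r = pi_ref / pi_new).
log_r = logp_ref - logp_new
kl = log_r.exp() - log_r - 1

objective = (surrogate - beta * kl).mean()              # we maximize this, so the loss is its negative
loss = -objective
loss.backward()
```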
01:01:30.160 | okay now that we have seen this i believe you should have
01:01:37.120 | most of the knowledge to understand
01:01:40.320 | all of the paper because the other part that they do is okay instead of
01:01:47.040 | just taking the language model and just training it with reinforcement learning
01:01:50.880 | let's first of all
01:01:55.120 | try to introduce some very high quality supervised
01:01:59.280 | fine-tuning data and then we do another step of reinforcement learning and then
01:02:02.640 | we do another step of fine-tuning and then another
01:02:06.000 | step of reinforcement learning and this actually leads to a better
01:02:10.720 | model another actually interesting part of
01:02:14.240 | the paper is the distillation so i don't know if
01:02:17.600 | most people are familiar with what distillation is and how it works so if you
01:02:21.440 | want i can talk about it a little bit otherwise i think let's go to sleep guys
01:02:25.840 | let's see
01:02:28.240 | so yes means let's talk about it
01:02:33.280 | okay okay okay okay no problem please give big picture overview i mean
01:02:39.760 | bro you can just read the paper yourself
01:02:42.720 | okay yes okay more on distillation
01:02:48.000 | okay so distillation basically means this
01:02:53.680 | imagine you are trying to
01:02:56.720 | teach yourself
01:03:03.440 | graduate maths
01:03:10.720 | that's one thing right imagine you don't have any background on
01:03:15.600 | math and now imagine instead you learn it from a
01:03:18.000 | university teacher those are two different things right because if
01:03:21.520 | you try to learn it by yourself you need to come up with all the strategies to
01:03:24.400 | learn math but if you learn it from a professor then the professor can also
01:03:27.520 | give you hints on how to learn it faster so in the case of models we have usually
01:03:33.360 | a big model so let's call it big brother
01:03:37.760 | and then we have a smaller model
01:03:41.520 | and then we have a data set let's call it a data set
01:03:49.360 | what we do is we prompt the big brother so the
01:03:52.800 | big model let's call it the big model actually i don't want to confuse people
01:03:58.080 | so this is the big model
01:04:01.920 | and this is the small model so
01:04:06.480 | so now how do we train language models
01:04:10.800 | first of all
01:04:13.200 | when we train language models we do that in a self-supervised way
01:04:17.520 | what does it mean it means that
01:04:21.040 | the language model is trained
01:04:25.200 | without annotating the data it means that we sample a lot of text
01:04:29.680 | and we force the language model to learn to predict the next token given
01:04:33.760 | the context that comes before it imagine you want to train
01:04:37.360 | the language model on the following text for example
01:04:42.720 | the following sentence for distilled models we
01:04:47.120 | report representative results blah blah imagine we want to train the language
01:04:51.600 | model only on this green text here what we will do is we will give it the
01:04:56.720 | word for and we ask it to learn to predict the word distilled when it sees
01:05:00.880 | for then we give it the words for distilled and we force it to learn to
01:05:05.520 | predict the word models when it sees for distilled the beauty of the
01:05:09.440 | transformer is that all this process can be done in parallel so we don't have to
01:05:12.640 | do it one word at a time
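As a small illustration of how those training pairs are formed (a toy sketch assuming each word is one token, as in the explanation above; real tokenizers split differently), the inputs are just the text and the targets are the same text shifted by one position, which is what lets the transformer learn every position in parallel:

```python
# Toy sketch of next-token training pairs, assuming each word is one token.
sentence = ["for", "distilled", "models", "we", "report", "representative", "results"]

inputs = sentence[:-1]   # what the model sees
targets = sentence[1:]   # what it must predict at each position, all learned in parallel

for i, target in enumerate(targets):
    print(inputs[: i + 1], "->", target)
# ['for'] -> distilled
# ['for', 'distilled'] -> models
# ...
```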
01:05:15.920 | if you do the job of training a small model
01:05:20.240 | on reasoning just by itself it will have much more difficulty however if you
01:05:25.360 | have a big model that has already been trained on a specific task
01:05:29.360 | then you can use the big model to help the small model learn faster
01:05:33.680 | how well when we train a language model on raw data so in this case for example we
01:05:39.680 | said to the language model when you see for you should choose distilled
01:05:44.480 | when you see for distilled you should choose models
01:05:48.080 | this basically is called the next token prediction task
01:05:52.320 | and the way we do it is we force the distribution that
01:05:56.000 | the language model outputs and this distribution is the output of the language
01:05:59.520 | model so as we saw before the language model
01:06:01.360 | generates a distribution over all the possible next words
01:06:04.880 | so it will assign a probability to the first possible
01:06:08.320 | next word and to the second next word and the third next word and the fourth
01:06:12.400 | next word and we ask it okay when you see for
01:06:16.400 | distilled you should choose exactly this particular word
01:06:19.840 | which is the word models so you should choose the word
01:06:25.200 | models and all the other words should not be chosen
01:06:28.880 | so zero zero zero zero zero and this one should be chosen with 100 percent
01:06:35.120 | score this is how we force the language model
01:06:39.520 | by doing it on many many many texts the language model
01:06:42.720 | learns to generate a distribution that very likely will generate the word
01:06:47.840 | models when it sees for distilled and less likely to generate the others
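Concretely, this "put 100 percent on the right word, zero on everything else" target is just the cross-entropy loss against a one-hot distribution; here is a minimal sketch with a made-up five-word vocabulary:

```python
import torch
import torch.nn.functional as F

# Toy sketch: next-token prediction as cross-entropy against a one-hot target.
# Hypothetical 5-word vocabulary; index 2 plays the role of the word "models".
vocab = ["pizza", "cat", "models", "china", "hello"]

logits = torch.randn(1, len(vocab), requires_grad=True)   # model scores for the position after "for distilled"
target = torch.tensor([2])                                # the correct next word: "models"

# cross_entropy compares the model's softmax distribution with a distribution
# that puts probability 1 on "models" and 0 on every other word.
loss = F.cross_entropy(logits, target)
loss.backward()                                           # pushes probability mass toward "models"
```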
01:06:53.600 | if we do the same job with the small model we will see that the small model
01:06:56.880 | will have a lot of difficulty learning at the
01:07:00.800 | same pace as the big model why because the big model has more parameters so it
01:07:04.320 | has more flexibility in learning complex tasks
01:07:07.760 | while the small model does not have this flexibility so it will be a much slower
01:07:11.600 | learner so how can we distill
01:07:16.000 | the knowledge of the big model into the small model
01:07:19.760 | we do it as follows we take a data set of prompts so a list of prompts
01:07:26.240 | we feed it to the big model and then we ask the big model to generate
01:07:30.080 | an answer not only the answer we also ask it to generate the log probabilities
01:07:35.440 | at each step of the generation process so
01:07:38.720 | before we used the sentence where is shanghai shanghai is in china right
01:07:43.200 | so imagine we are dealing with the big model where is shanghai
01:07:47.280 | the model will generate shanghai is in china but
01:07:50.720 | along with the word shanghai we also have the log probability
01:07:55.280 | of not only the word shanghai but also all the other words that
01:07:59.360 | we could have chosen in that position
01:08:02.640 | we force the small model to learn not only to generate shanghai but we force
01:08:07.520 | it to learn the same distribution that the big model generated for the
01:08:12.800 | same position so let me do it again so imagine now we
01:08:17.040 | do a concrete example imagine we have a sentence where is shanghai
01:08:24.160 | we give it to the big model
01:08:30.000 | the big model will generate an answer what does it mean to generate an answer
01:08:32.960 | it means that it will generate first of all a distribution over what is the next
01:08:36.400 | likely next token which will be a list of
01:08:39.440 | probabilities where the word shanghai will have a very high score so maybe 0.6
01:08:45.280 | and maybe the word pizza will be 0.1 and the word the cat will be 0.05 etc
01:08:52.400 | 0.05 etc etc then we choose the word shanghai
01:08:56.480 | and we ask the language model again where is shanghai
01:08:59.600 | question mark shanghai then it will choose the word is but it will not just
01:09:03.760 | choose the word is it will actually give us a distribution
01:09:07.520 | that looks like this for example it will say
01:09:11.360 | okay the word is is very likely so it's maybe
01:09:15.440 | 70 percent probability the word i don't know road is unlikely because it
01:09:22.400 | doesn't make sense and the word i don't know hello
01:09:26.720 | also doesn't make sense so it is even less
01:09:30.160 | likely so for each position we can generate the
01:09:33.600 | log probabilities from the big model and we force the small model not to
01:09:38.240 | learn to just generate shanghai but to learn this entire distribution
01:09:42.160 | why because this gives a much stronger signal to the small model on what it
01:09:48.080 | should do what it can do and what it must never do
01:09:52.000 | instead of just telling it you should do this it's a bigger signal that helps the small
01:09:56.240 | model learn faster and what they say in
01:10:01.120 | the DeepSeek R1 paper is that by distilling
01:10:04.880 | you can create a much stronger small model than by just
01:10:08.640 | training it from scratch
01:10:12.960 | a bit more context on the probabilities for each forward pass well
01:10:28.320 | no distillation is used on all the tokens so you're telling the small model for
01:10:34.960 | each position you learn the entire distribution
01:10:53.040 | not only the last one
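Here is a minimal sketch of that kind of distribution-level distillation as described above (soft targets from a teacher, a KL loss on every position); the tensor shapes and temperature are my own assumptions, and this illustrates the classic technique being explained rather than the exact recipe used in the paper.

```python
import torch
import torch.nn.functional as F

# Sketch of distribution-level ("soft label") distillation over every position.
# Shapes: batch of 2 sequences, 6 positions, vocabulary of 100 tokens (all made up).
batch, seq_len, vocab = 2, 6, 100
temperature = 2.0                                  # softens both distributions

teacher_logits = torch.randn(batch, seq_len, vocab)                       # from the frozen big model
student_logits = torch.randn(batch, seq_len, vocab, requires_grad=True)   # from the small model

teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)

# KL divergence between the teacher's and the student's next-token distributions,
# averaged over every position, not just the last one.
loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
loss.backward()
```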
01:10:56.800 | so how do we assign a reward for each action okay in the case
01:11:03.840 | of ppo by the way if you watch my video on
01:11:09.120 | ppo i explain all of that
01:11:13.200 | i don't know why nobody ever watched that video it's one of my masterpieces
01:11:15.520 | and nobody ever cared about it but all these
01:11:19.040 | slides come from my video that i made in 2023
01:11:26.320 | so the way we generate the rewards is basically the rewards are usually
01:11:30.080 | only generated for the final outcome okay there are two kinds of rewards that you can
01:11:30.080 | generate one is called the outcome based reward
01:11:33.680 | and one is called the process based reward
01:11:41.760 | in the case of ppo we generate the outcome based reward and then through
01:11:46.800 | the advantage term it is kind of distributed over all the
01:11:51.600 | previous tokens because we sample a response and then we judge this response
01:11:56.160 | according to our reward model then with the advantage terms that you see in the
01:12:00.880 | loss this reward is distributed over all the
01:12:05.280 | previous tokens so each token carries information on how
01:12:18.080 | likely it is to lead to the final reward
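A tiny sketch of what "the outcome reward gets distributed over the previous tokens" can look like in the simplest case (my own toy setup, using the GRPO-style group normalization): one scalar reward per sampled response becomes the advantage applied to every token of that response.

```python
import torch

# Toy sketch: one outcome reward per response, broadcast to every token as its advantage.
response_lengths = [5, 3, 4]                      # number of tokens in each sampled response
rewards = torch.tensor([1.0, 0.0, 1.0])           # one outcome-based reward per response

# Normalize across the group (GRPO-style), then give every token of a response
# the same advantage as the response it belongs to.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
per_token_advantages = [a.expand(n) for a, n in zip(advantages, response_lengths)]

for adv in per_token_advantages:
    print(adv)
```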
01:12:24.800 | okay i think we forgot one part guys
01:12:31.280 | the unsuccessful attempts this is interesting also so they
01:12:33.680 | tried what is known as the process reward
01:12:37.360 | model basically okay the reward model that we saw before
01:12:41.600 | is a rule-based model that is outcome driven meaning that the model
01:12:44.960 | has to generate the entire process for example in
01:12:48.320 | the case of LeetCode problems the model has to generate the final code
01:12:53.600 | and make it runnable then we run the code we see how it performs and then we have
01:12:53.600 | signal on the reward however there are other ways of
01:12:57.520 | generating reward and one of them is called the process reward model where
01:13:01.040 | you divide the problem in sub-problems and then you have a model which is
01:13:08.080 | called the process reward model that assigns a reward to each single step
01:13:16.480 | however in the paper they say that first of all
01:13:20.880 | dividing a problem into sub-steps is difficult because every problem is
01:13:23.920 | different and we don't want to tell the
01:13:26.640 | model you should follow this pattern
01:13:29.440 | the model should try to come up with this pattern by itself and
01:13:32.560 | actually it does and secondly they say that it's
01:13:36.880 | difficult to judge each single step sometimes we don't even know if
01:13:40.480 | that step is good or not like when you're trying to solve a math proof
01:13:46.080 | sometimes you do some intermediate steps and
01:13:52.160 | you may not immediately see that they will lead to the
01:13:55.520 | final conclusion but they are necessary right
01:14:00.720 | so it's difficult to give them a reward until you arrive at the end so it's
01:14:04.640 | difficult to give a reward to each single step and
01:14:07.440 | there is another technique called Monte Carlo Tree Search
01:14:10.640 | which basically is a tree search in which
01:14:16.480 | each intermediate node is also a step
01:14:20.720 | and each of these steps each of these nodes actually has a kind of score
01:14:24.720 | associated with it which increases the more often that
01:14:29.360 | particular step leads to a successful solution so imagine for
01:14:33.360 | example you are doing a LeetCode problem so if you
01:14:37.200 | start your code for example with a typo
01:14:41.280 | of course the code will not compile so anything that starts with that
01:14:43.520 | typo will never lead to a successful
01:14:46.480 | solution so that node will never be explored further
01:14:50.480 | so Monte Carlo Tree Search basically forces the language model to explore
01:14:54.080 | more of the solutions that lead to successful
01:14:58.400 | attempts and less of the ones that don't succeed
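To make that "score that increases the more often a step leads to success" concrete, here is a tiny illustrative sketch of the bookkeeping a tree-search node might keep (names and structure are my own, not the paper's experiment):

```python
from dataclasses import dataclass, field

# Tiny sketch of the per-node statistics behind a Monte Carlo style tree search:
# each intermediate step tracks how often passing through it ended in success,
# so the search can favor branches with higher scores.

@dataclass
class StepNode:
    visits: int = 0
    successes: int = 0
    children: dict = field(default_factory=dict)   # next step text -> StepNode

    def update(self, succeeded: bool) -> None:
        # called once per rollout that passed through this step
        self.visits += 1
        self.successes += int(succeeded)

    def score(self) -> float:
        # fraction of rollouts through this step that reached a correct solution
        return self.successes / self.visits if self.visits else 0.0

root = StepNode()
bad_start = root.children.setdefault("pritn(solve())", StepNode())
bad_start.update(succeeded=False)    # code starting with a typo never compiles
print(bad_start.score())             # 0.0 -> this branch will not be explored further
```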
01:15:01.760 | however even by doing Monte Carlo Tree Search they get
01:15:06.400 | results that are not as good as the
01:15:09.520 | reinforcement learning driven one and
01:15:12.880 | the beauty of this actually is also in this sentence which is
01:15:16.720 | rather than explicitly teaching the model how to divide the
01:15:20.880 | problem into sub problems and how to solve the problem or telling the
01:15:24.080 | model what format it should follow to solve the problem
01:15:29.520 | the model just learns to come up with the right format
01:15:33.280 | with the right chain of thought to solve the problem of course they
01:15:35.920 | play with the reward because they
01:15:39.920 | tell the reward that the language model should
01:15:43.200 | follow a particular format and the language model should also
01:15:47.680 | be accurate etc etc so you can always shape the reward a little bit
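For a sense of what such a rule-based reward can look like, here is a hedged sketch of a format-plus-accuracy check (the tag names, weights, and scoring are illustrative assumptions, not the paper's exact implementation):

```python
import re

# Hedged sketch of a rule-based, outcome-driven reward: one part for following
# an expected <think>...</think><answer>...</answer> format, one part for a
# correct final answer. Weights and tag handling are illustrative assumptions.

def rule_based_reward(completion: str, ground_truth: str) -> float:
    reward = 0.0

    # Format reward: did the model wrap its reasoning and answer in the expected tags?
    if re.fullmatch(r"(?s)\s*<think>.*</think>\s*<answer>.*</answer>\s*", completion):
        reward += 0.5

    # Accuracy reward: does the extracted answer match the known ground truth?
    match = re.search(r"(?s)<answer>(.*?)</answer>", completion)
    if match and match.group(1).strip() == ground_truth.strip():
        reward += 1.0

    return reward

print(rule_based_reward("<think>2+2 is 4</think><answer>4</answer>", "4"))  # 1.5
print(rule_based_reward("the answer is 4", "4"))                            # 0.0
```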
01:15:49.920 | and another thing that they notice is
01:15:52.560 | actually if you train the language model only with
01:15:57.120 | reinforcement learning then it leads to a language model
01:16:01.760 | with some side effects for example the fact that the language
01:16:05.440 | model mixes languages because nobody told the language model
01:16:09.040 | that it has to only speak english to solve a problem that is
01:16:09.040 | stated in english so imagine i ask you to solve a problem like
01:16:12.160 | a math problem and i say it in english
01:16:15.920 | and imagine you are like a mixed kid and you can speak chinese and english or you
01:16:21.040 | are just a chinese person who also knows english etc
01:16:25.520 | maybe in your head
01:16:29.040 | you think in english and in chinese and then you come up with the solution and
01:16:32.400 | that's totally correct right so unless i tell you
01:16:35.280 | explicitly that you should never think in chinese
01:16:38.720 | then why should you not take the freedom of doing it right
01:16:42.480 | so if you never tell the model not to do something
01:16:46.160 | the model probably will do it and because in the reward
01:16:49.520 | here they never told the language model to never think in other languages the
01:16:52.880 | language model actually started thinking in other languages
01:16:56.000 | anything to get the job done that's the beauty of reinforcement learning
01:17:00.640 | so you give the right incentive and the language model will come up with the
01:17:04.000 | right way of reaching that goal as long as the
01:17:07.840 | signal is strong enough
01:17:10.880 | yes you need a massive cluster of gpus i believe so because
01:17:19.600 | you're still running gradient descent on the language
01:17:23.120 | model itself so every time you are
01:17:26.240 | fine-tuning the language model you are changing its weights
01:17:30.640 | okay guys before we talk about other questions all this lecture was
01:17:45.200 | totally improvised so i didn't
01:17:48.320 | prepare it so i hope i didn't
01:17:53.440 | make big mistakes because the problem is
01:17:57.280 | reinforcement learning from human feedback is quite a complicated topic
01:17:59.840 | and you need to kind of derive everything step by step this is what i
01:18:02.400 | do in my video on reinforcement learning from human feedback here i kind of
01:18:05.600 | sometimes skipped some steps and went back etc so i hope i didn't create
01:18:09.600 | confusion so if there are some parts about
01:18:14.480 | the rl that you didn't understand then i
01:18:17.440 | think we can talk about it otherwise let's call it a day guys
01:18:22.960 | maybe you could record for youtube more prepared i could but
01:18:26.800 | i have a daughter guys now i'm super busy and i also have a full-time job and
01:18:30.400 | i also have a wife so have some mercy on me
01:18:45.040 | thank you guys
01:18:51.680 | yeah i think everyone now should have at least the basic understanding on how to
01:18:55.760 | read the paper like when you read the paper you should
01:18:58.960 | have a clear idea of what's happening and if you kind of need more
01:19:09.440 | background you can check my previous works
01:19:09.440 | i think let's stop the recording and have a good night guys
01:19:15.520 | and now if you want you can also
01:19:20.800 | how to stop the recording