
Reinforcement Learning from Human Feedback, explained with math derivations and PyTorch code.


Chapters

0:00 Introduction
3:52 Intro to Language Models
5:53 AI Alignment
6:48 Intro to RL
9:44 RL for Language Models
11:01 Reward model
20:39 Trajectories (RL)
29:33 Trajectories (Language Models)
31:29 Policy Gradient Optimization
41:36 REINFORCE algorithm
44:08 REINFORCE algorithm (Language Models)
45:15 Calculating the log probabilities
49:15 Calculating the rewards
50:42 Problems with Gradient Policy Optimization: variance
56:00 Rewards to go
59:19 Baseline
62:49 Value function estimation
64:30 Advantage function
70:54 Generalized Advantage Estimation
79:50 Advantage function (Language Models)
81:59 Problems with Gradient Policy Optimization: sampling
84:08 Importance Sampling
87:56 Off-Policy Learning
93:02 Proximal Policy Optimization (loss)
100:59 Reward hacking (KL divergence)
103:56 Code walkthrough
133:26 Conclusion

Transcript

Hello guys, welcome back to my channel. Today we are going to talk about Reinforcement Learning from Human Feedback (RLHF) and PPO. RLHF is a technique used to align the behavior of a language model with the behavior we want it to have: for example, we don't want the language model to use curse words or to behave impolitely toward the user. To achieve this we need some kind of alignment, and RLHF is one of the most famous techniques for it, even if there are now newer techniques like DPO, which I will talk about in another video. RLHF is also how ChatGPT was aligned to the behavior its creators wanted. The topics for today are: first I will briefly introduce language models, how they are used and how they work; then we will talk about AI alignment and why it is important; and later we will do a deep dive into reinforcement learning from human feedback. In particular, I will first introduce what reinforcement learning is, then describe the whole setup: the reward model, what trajectories are, and especially policy gradient optimization, for which we will derive the algorithm and also see its problems.

We will see how to reduce the variance, advantage estimation, importance sampling, off-policy learning, and so on. The goal of today's video is to derive the loss of PPO: I don't want to just throw the formula at you, I want to derive the PPO algorithm step by step and also show you the history that led to it, that is, which problems PPO was trying to solve from a mathematical point of view.

In the final part of the video we will go through the code of an actual implementation of reinforcement learning from human feedback with PPO. I will not write the code from scratch; I will explain the code line by line, and in particular I will show the implementation as done by the HuggingFace team. So I will not show you how to use the HuggingFace library to run RLHF; instead we will go inside the code of the library and see how it was implemented. This way we can combine the theory we have learned with practice. Now, the code written by the HuggingFace team is somewhat obscure and complex to understand, so I deleted some parts and added my own comments to other parts that were not easy to follow; this way I hope to make it easier for everyone to follow the code. There are some prerequisites before watching this video: first of all, I hope you have some notions of probability and statistics.

Not much: at least you should know what an expectation is. We also need some knowledge of deep learning, for example gradient descent, what a loss function is, and the fact that in gradient descent we compute a gradient. We need some basic knowledge of reinforcement learning, even if I will review most of it, so at least you should know what an agent, a state, an environment and a reward are. One important aspect of this video is that we will be using the transformer model a lot, so I recommend you watch my previous video on the transformer if you are not familiar with the concepts of self-attention or the causal mask, which will be key to understanding this video. The goal of this video is to combine theory with practice, so I will always try to give an intuition for the formulas that are complex. And don't worry if you don't understand everything at the beginning. Why?

Because I will be giving a lot of theory at the beginning, and later I will be showing the code; I cannot show the code without first giving the theoretical knowledge. So don't be scared if you don't understand everything: when we look at the code, I will go back to the theory line by line, so that we can combine the practical and the theoretical aspects of this knowledge.

So let's start our journey. What is a language model? A language model is a probabilistic model that assigns probabilities to sequences of words; in particular, it allows us to compute the probability of the next token given an input sequence. For example, if we have a prompt that says "Shanghai is a city in", what is the probability that the next word is "China"?

Or what is the probability that the next word is "Beijing", or "cat", or "pizza"? This is the kind of probability that a language model is modeling. Now, in my treatment of language models I always make a simplification: that each word is a token and each token is a word. This is not always the case, because it depends on the tokenizer we are using, and in most cases it is actually not like this.

But for simplicity, for the rest of the video we will assume that each word is a token and each token is a word. Now you may be wondering: how can we use a language model to generate text? We do it iteratively. If we have a prompt, for example a question like "Where is Shanghai?", we ask the language model what the next token is and, for example greedily, we select the token with the highest probability; suppose it is the word "Shanghai".

We take this word "Shanghai" and put it back into the input, and we ask the language model again what the next token is; the model gives us the probabilities of the next token and we select the most probable one, say the word "is". We take it, put it back into the input, and ask the language model again; suppose the next token is "in". We take it, put it back into the input, and ask again, and so on, until we have generated a maximum number of tokens or we believe the answer is complete.
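To make this iterative process concrete, here is a minimal sketch of a greedy decoding loop in PyTorch, using a HuggingFace causal language model. The model name and the 20-token limit are just illustrative assumptions, not something from the video:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Where is Shanghai?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

for _ in range(20):                                   # generate at most 20 new tokens
    with torch.no_grad():
        logits = model(input_ids).logits              # (1, seq_len, vocab_size)
    next_token = logits[:, -1, :].argmax(dim=-1)      # greedy: pick the most probable next token
    input_ids = torch.cat([input_ids, next_token.unsqueeze(-1)], dim=-1)
    if next_token.item() == tokenizer.eos_token_id:   # stop when the model considers the answer complete
        break

print(tokenizer.decode(input_ids[0]))
```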

At that point we can stop, for example because we can see that the answer is complete: "Shanghai is in China" is the answer generated by the language model. This is the iterative process of generating text with a language model, and all language models work like this. Now, what is the topic of AI alignment? A language model is usually pre-trained on a vast amount of data: billions of web pages, the whole of Wikipedia, thousands of books. This gives the language model a lot of knowledge from which it can retrieve, and it learns to complete a prompt in a reasonable way. However, this does not teach the language model to behave in a particular way.

For example, pre-training alone does not teach the language model to avoid offensive language, racist expressions or curse words. To do this, and to create, for example, a chat assistant that is friendly to the user, we need to do some kind of alignment. So the topic of AI alignment is to align the model's behavior with some desired behavior. Now let's talk about reinforcement learning.

Reinforcement learning is an area of artificial intelligence concerned with training an intelligent agent to take actions in an environment in order to maximize the reward it receives from the environment. Let me give you a concrete example. Imagine we have a cat that lives in a very simple world: a room made up of a grid of cells, and the cat can move from one cell to another. In this case our agent is the cat, and this agent has a state, which describes, for example, its position. Here the state of the cat can be described by two variables: the x coordinate and the y coordinate of its position. Based on the state, the cat can choose to perform some actions, which could be: move down, move left, move right or move up. Every time the cat takes an action, it receives some reward from the environment.

It will move to a new position and at the same time receive a reward, according to the following reward model: if the cat moves to an empty cell, it receives a reward of 0; if it moves to the broom, it receives a reward of -1, because my cat is scared of the broom; if, after some series of states and actions, the cat ends up in the bathtub, it receives a reward of -10, because my cat is terrified of water; however, if the cat manages to reach the meat, it receives a big reward of +100. How should the cat move?

There is a policy that tells us the probability of the next action given the current state. So the policy describes, for each position (that is, for each state of the cat), with what probability the cat should move up, down, left or right. The agent can then either sample an action at random according to these probabilities, or select the action with the highest probability, which is the greedy strategy. The goal of reinforcement learning is to optimize the policy so as to maximize the expected return when the agent acts according to it, which in this case means ending up with a policy that, with very high probability, takes us to the meat, because that is one way to maximize the expected return. Now you may be wondering: I can see the cat as a reinforcement learning agent, and the reinforcement learning setup makes sense for the cat, the meat and all these rewards, but what is the connection between reinforcement learning and language models?
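Before connecting this to language models, here is a tiny illustration of the two action-selection strategies just mentioned, greedy versus sampling; the probabilities are made up for illustration:

```python
import torch

# The policy outputs one probability per action, given the current state.
action_probs = torch.tensor([0.7, 0.1, 0.1, 0.1])    # e.g. down, left, right, up

greedy_action = torch.argmax(action_probs)             # greedy: always the most probable action
sampled_action = torch.multinomial(action_probs, 1)    # stochastic: sample according to the policy
```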

Let's try to clarify this. You can think of the language model as a policy itself: as we saw before, a policy is something that, given the state, tells you the probability of each action you could take in that state.

In the case of the language model, we know that it tells us, given a prompt, the probability of the next token. So we can think of the prompt as the state and of the next token as the action the language model chooses to perform, which leads to a new state, because every time we sample a next token we append it to the prompt, and then we can ask the language model again

what the next token is, and so on. So we can think of the language model as the reinforcement learning agent itself, and also as the policy itself, where the state is the prompt and the action is the next token, chosen according to some strategy: greedy, top-k, top-p, and so on. The only thing we are missing is the reward model. How can we reward the language model for good responses, and how can we

penalize it for bad responses? This is done through a reward model that we have to build. Let's see how. Imagine we want to create a reward model for our language model, which will become our reinforcement learning agent. To reward the model for generating a particular answer to a question, we could create a dataset of questions and answers generated by the model. For example, imagine we ask the model "Where is Shanghai?", and the language model says:

"Shanghai is a city in China." We should assign some reward to this answer: how good is it? Personally, I would give it a high reward, because I think the answer is short and to the point, but other people may think this answer is too short.

They might prefer an answer that is a little longer. Or take this other example: "What is two plus two?", and suppose our language model only says the word "four". In my opinion this answer is too short and could be a little more elaborate, but other people may think it is good enough. So what reward should we give to each of these answers? As you can see, it is not easy to come up with a number everyone can agree on: we humans are not very good at finding a common ground for agreement. But we are very good at comparing, and we will exploit this fact to create the dataset for training our reward model. What if, instead of generating one answer, we generate multiple answers with the same language model (for example by using a high temperature), and then ask a group of expert labelers to choose which answer they prefer? Having this dataset of preferences, we can then create a model that generates a numeric reward for each question and answer. So first we create a dataset of questions; then we ask the language model to generate multiple answers for the same question, for example by using a high temperature; and then we ask people to choose which answer they prefer. Our goal is to build a neural network that acts as a reward model: a model that, given a question and an answer, generates a numeric value, in such a way that the chosen answer gets a high reward and the answer that was not chosen, which represents something we don't like, gets a low reward.
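For illustration, such a preference dataset could look like the following toy sketch; the field names and the exact strings are my own, not taken from the video:

```python
preference_dataset = [
    {
        "question": "Where is Shanghai?",
        "chosen":   "Shanghai is a city in China.",
        "rejected": "Shanghai.",
    },
    {
        "question": "What is two plus two?",
        "chosen":   "Two plus two equals four.",
        "rejected": "four",
    },
]
```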

Let's see how it is done. In practice, we take a pre-trained language model, for example a pre-trained LLaMA, and we feed it the question and the answer concatenated together as input tokens.

The language model is a transformer, so it will produce a sequence of output embeddings, called hidden states: as you know, the input tokens are converted into embeddings, the positional encoding is added, and everything is fed to the transformer layers, which output one hidden state per token. Usually, for text generation, we take the last hidden state, send it to a linear layer that projects it into the vocabulary, apply the softmax and select the next token. But here we do not want to generate text; we just want to produce a numeric reward. So we replace the linear layer that projects the last hidden state into the vocabulary with another linear layer that has only one output feature: it takes an embedding as input and produces a single value as output, which is the reward assigned to the answer for the given question. Of course, this is only the architecture of the model.
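Here is a minimal sketch of that architecture in PyTorch with a HuggingFace backbone. The base model name and the choice of pooling the last token are illustrative assumptions, not the exact HuggingFace implementation:

```python
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    def __init__(self, base_name: str = "gpt2"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_name)               # transformer without the LM head
        self.reward_head = nn.Linear(self.backbone.config.hidden_size, 1)  # single output feature

    def forward(self, input_ids, attention_mask=None):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        last_hidden = hidden[:, -1, :]                    # hidden state of the last token of (question + answer)
        return self.reward_head(last_hidden).squeeze(-1)  # one scalar reward per sequence
```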

We also need to train it: we need to teach this model to generate a high reward for answers that were chosen and a low reward for answers that were not chosen. The loss function we will use is L = -log σ(r_chosen - r_rejected), that is, minus the log of the sigmoid of the reward assigned to the good answer minus the reward assigned to the bad answer. A small code sketch of this loss follows; after that, let's analyze how it behaves.
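As a minimal sketch (my own simplified version, not the HuggingFace code), the pairwise loss can be written like this:

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # loss = -log(sigmoid(r_chosen - r_rejected)), averaged over the batch;
    # logsigmoid is numerically more stable than log(sigmoid(...)).
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy example: the reward model gave 2.3 to the chosen answer and 0.1 to the rejected one.
loss = pairwise_reward_loss(torch.tensor([2.3]), torch.tensor([0.1]))
print(loss.item())  # small loss, because the chosen answer already has the higher reward
```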

There are two possibilities: the difference r_chosen - r_rejected is either negative or positive. How do we train the model? Our dataset is made up of questions, each with possible answers; suppose there are only two answers per question,

a good one and a bad one. We take each question, feed it to the model with the good answer concatenated to it, and the model generates a reward; we do the same with the bad answer. So we get two rewards.

Call them the reward of the good answer and the reward of the bad answer. If the model assigned a higher reward to the good answer than to the bad answer, the difference r_good - r_bad is positive, and this is the behavior we want.

Let's see how the sigmoid behaves when its input is positive: it outputs a value between 0.5 and 1. When the log receives an input between 0.5 and 1, it returns a negative number between -log 2 ≈ -0.69 and 0, so with the minus sign in front the loss becomes a positive number between 0 and roughly 0.69. In this case the loss is small.

I don't remember. What is the exact value for the 0.5 here However, let's see if the model Gave a high score to the bad response and the low score to the good response. So let's start again Okay And Okay, here is the bad Response and here is the good response Now what happens if this value here is smaller than this value here So this difference will be negative when the sigmoid receives as input something that is negative It will return an output that is between 0 and 0.5 The log when it sees an input that is between 0 and 0.5 So more or less here it will return a number negative number that is between minus infinity and more or less one so It will return because there is a minus sign here.

It will become a very big number in the negative range So the loss in this case will be big so big loss Here it was a small loss small Loss Okay Now as you can see when the reward model is real is giving a high reward to the good answer and a bad A low score to the bad answer.

The loss is small. However, when the reward model gives a high reward to the bad answer and a low score to the good answer the loss is very big What does that what does it mean this for the model that it will force the model to always give High rewards to the winning response and low reward to the losing response So it because that's the only way for the model to minimize the loss because the goal of the model always during training is to minimize The loss so the model will be forced to give High reward to the chosen answer and the low reward to the not chosen answer or the bad answer In hugging face you we can this reward model is implemented in the reward trainer class So if you want to train your own reward model, you need to use this reward trainer class and it will take as input A auto model for sequence classification, which is exactly this architecture here So it's a transformer model with instead of having the linear layer that projects into the vocabulary It has a linear layer with only one output feature that gives the reward And if you look at the code on how this is implemented in the hugging face library You will see that they first generate the reward for the chosen answer So for the good answer, then they generate the reward for the bad answer.

So for the rejected response here, it's called And then they calculated the loss exactly using the formula that we saw So the log sigmoid of the rewards given to the chosen one minus the rewards given to the rejected one Let's talk about trajectories now Now as I said previously in reinforcement learning the goal is to select a policy or to optimize a policy That maximizes the expected return of the agent when the agent acts according to this policy More formally we can write it as follows that we want to select a policy pi That gives us the maximum expected reward when the agent acts according to this policy pi Now, what is the expected return?

The expected return of the policy is the expected return over all possible trajectories the agent can take when using this policy. As you know, an expectation can also be written as an integral: it is the integral, over trajectories, of the probability of a particular trajectory under this policy multiplied by the return of that trajectory, i.e. J(π) = E_{τ∼π}[R(τ)] = ∫ P(τ|π) R(τ) dτ. So first of all, what is a trajectory (and later we will see what the probability of a trajectory is)? A trajectory is a sequence of states and actions, which, in the case of the cat, you can think of as a path the cat can take. Suppose each trajectory has a maximum length:

we don't want the agent to perform more than, say, 10 steps to reach its goal. The cat can go to the meat using one path, or another path, or it can go forward and then backward and then stop because it has already used its 10 steps, and so on: there are many, many possible paths. What we want is to find a policy that maximizes the expected return, i.e. the return we collect along these paths. We will also model the next state of the cat as being stochastic. But first, let's introduce these states and actions.

So first of all, let's introduce what is the these states and actions So let me give you an example Suppose that our cat is starting from some state s0 which is the initial state The policy tells us what is the next action that we should take given the state So the cat will ask the policy What is the next action that it should take?

Because the policy is stochastic, it tells us the probability of each possible next action, just as a language model, given a prompt, gives us the probability of each possible next token. Imagine the policy says the cat should move down with very high probability, move right with lower probability, move left with even lower probability, and move up with an even lower probability. Suppose we select the action "down": this results in a new state, which may not be exactly the cell below.

Why? Because we model the cat as being drunk, which means that the cat wants to move down but may not always actually move down; we will see later why this is helpful. Another example: imagine a robot that wants to move down, but its wheels are broken, so it does not actually move and remains in the same state.

It will remain in the same state So we always model the next state not as being deterministically determined But as being stochastic given the current state and the action that we choose to perform So imagine that we choose to perform the action down The cat may arrive to a new state s1 which will be according to some probability distribution Then we can ask again the policy.

what is the next action I should take? The policy might say: move right with very high probability, move down with a lower probability, move left with an even lower probability, and so on. As you can see, we are building a trajectory: a sequence of states and actions that defines how our cat moves. Now, what is the probability of a trajectory?

Look at the structure of a trajectory: the action we chose at each step depends only on the state we were in, the state we arrived at depends only on the previous state and the action we chose, and the next action again depends only on the new state, because the policy takes only the state as input and outputs the probability of the action. Thanks to this structure, the probability of the trajectory factorizes into a product of these conditional probabilities: it is the probability of starting from the initial state s0,

multiplied, for each step of the trajectory, by the probability of choosing action a_t given state s_t (according to the policy) and by the probability of ending up in state s_{t+1} given that we were in s_t and chose a_t; the formula is written out below.
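In symbols (using ρ0 for the distribution of the initial state, as is standard; the notation is my rendering of what is described above):

```latex
P(\tau \mid \theta) \;=\; \rho_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t)\,\pi_\theta(a_t \mid s_t)
```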

Another thing to consider is how to calculate the reward of a trajectory. A very simple way is to just sum all the rewards we collect along it. For example, if the cat reaches the meat along some path, you could say the rewards are 0, 0, 0, ... and then suddenly +100 when it reaches the meat; if the cat passes over the broom, it gets -1 there, then zeros, then +100. However, this is not how we will compute the reward of a trajectory: we will use a discounted reward, which means we prefer immediate rewards over future rewards. To give you an intuition for why,

let me first talk about money. If I give you ten thousand dollars, you prefer to receive it today rather than in one year: you could put it in the bank, it would generate interest, and at the end of the year you would have more than ten thousand dollars. In reinforcement learning, discounting is also helpful for another reason. For example, imagine the cat can only take 10 or 20 steps to reach the meat.

One way for the cat to reach the meat is to go directly to it; that is one trajectory. Another way is to wander around and take a much longer route. We prefer the cat to go directly to the meat instead of taking the longer route.

Why? Because we modeled the next state as stochastic, and the longer the route, the higher the probability of ending up in one of the obstacles; so we prefer shorter routes. Discounted rewards are also convenient from a mathematical point of view, because the (possibly infinite) series of rewards can converge if its terms become smaller and smaller. Let me give you a practical example of how to calculate the discounted reward: imagine the cat starts from its cell and follows a path that passes over the broom and then reaches the meat. To calculate the reward of this trajectory, we do the following.

The cat reaches the broom at time step 1, so that reward contributes gamma^1 × (-1); all the intermediate rewards are 0, so they add nothing; and finally the meat is reached at time step 8, so it contributes gamma^8 × 100.
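As a quick numerical check of this example (the value gamma = 0.9 is just an assumption for illustration):

```python
gamma = 0.9
# Reward received at each time step of the example path:
# 0 at the start, -1 at step 1 (the broom), zeros in between, +100 at step 8 (the meat).
rewards = [0, -1, 0, 0, 0, 0, 0, 0, 100]

discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
print(discounted_return)   # 0.9 * (-1) + 0.9**8 * 100 ≈ 42.15
```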

Gamma is always a number between 0 and 1, so we are decaying the +100 reward by gamma^8: the longer it takes to reach a reward, the smaller its contribution. This is the intuition behind discounted rewards. Now you may be wondering: trajectories make sense for the cat, since the cat follows some path to reach the meat and there are many possible paths; but what are trajectories in the case of a language model?

As we saw before, we have a policy, which is the language model itself: a policy tells us, given the state, the probability of the next action, and that is exactly what a language model does for the next token. We want to optimize this policy so that it selects next tokens in a way that maximizes the cumulative reward according to the reward model we built earlier with the dataset of preferences. Also in the case of the language model, a trajectory is a sequence of states and actions. What are the states in the case of the language model?

They are the prompts; and the actions are the next tokens. Imagine we give the language model the question "Where is Shanghai?": this initial prompt is the initial state. We ask the language model for the next token, and the token we choose becomes our action. Then we feed it back to the language model,

so it becomes part of the new state, and we ask the language model again for the next token; suppose it is the word "is", which again is appended to the input, giving the next state. Then we ask the language model again:

what is the next token? For example, we choose the token "in", and the concatenation of all these tokens becomes the new state; we keep asking for the next token until we have generated a complete answer. So, also in the case of the language model, we have trajectories, which are sequences of prompts (states) and chosen tokens (actions), as in the sketch below. Remember that our goal is to optimize our language model, which is a policy, so as to maximize a cumulative reward according to the reward model we built earlier. More formally, our goal is to maximize the objective J(θ), the expected return over all possible trajectories that our language model can generate, and we just saw that a trajectory is a sequence of prompts and next tokens.
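Writing the example above as explicit (state, action) pairs, purely for illustration:

```python
# The trajectory generated for the question "Where is Shanghai?",
# written as (state = prompt so far, action = chosen next token):
trajectory = [
    ("Where is Shanghai?",                "Shanghai"),
    ("Where is Shanghai? Shanghai",       "is"),
    ("Where is Shanghai? Shanghai is",    "in"),
    ("Where is Shanghai? Shanghai is in", "China"),
]
```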

Now, when we optimize a neural network we normally use stochastic gradient descent: we have a loss function, we compute its gradient with respect to the parameters of the model, and we update the parameters by taking small steps against the direction of the gradient, in order to minimize the loss. In our case, however, we do not want to minimize a loss function:

we want to maximize a function, our objective J(θ). So instead of gradient descent we use gradient ascent; the only difference between the two is that, instead of a minus sign in the update rule,

we have a plus sign: θ ← θ + α ∇_θ J(θ), with learning rate α. This algorithm is called policy gradient optimization, and the point is that we need to calculate the gradient of our objective function with respect to the parameters of our model (the language model): the gradient of the expected return over all possible trajectories. We need an expression for this gradient so that we can evaluate it and use it to optimize the parameters with gradient ascent. Let's derive it.

The gradient of the objective is the gradient of the expectation, over all possible trajectories, of the return of each trajectory. As we know, the expectation can be written as an integral, so it is the gradient of the integral of the probability of a trajectory multiplied by the return of that trajectory. As you may recall from high school, the derivative of a sum is the sum of the derivatives, so we can bring the gradient inside the integral: ∇_θ J(θ) = ∫ ∇_θ P(τ|θ) R(τ) dτ.

Now we use a trick called the log-derivative trick to expand ∇_θ P(τ|θ). You may recall from calculus that the gradient with respect to θ of log P(τ|θ) is 1/P(τ|θ) multiplied by ∇_θ P(τ|θ). Moving P(τ|θ) to the left-hand side, we get ∇_θ P(τ|θ) = P(τ|θ) ∇_θ log P(τ|θ), and we can substitute this into the integral. Since the probability P(τ|θ) is now a factor inside the integral, we can write the integral back as an expectation over trajectories: ∇_θ J(θ) = E_{τ∼π_θ}[ ∇_θ log P(τ|θ) · R(τ) ].

Next we need to expand ∇_θ log P(τ|θ). We saw before that the probability of a trajectory is the product of the probability of the initial state, the probabilities of the actions given the states (according to the policy π_θ, parameterized by θ), and the probabilities of the state transitions. Applying the log turns this product into a sum: log P(τ|θ) = log ρ0(s0) + Σ_t [ log P(s_{t+1}|s_t, a_t) + log π_θ(a_t|s_t) ].

Now take the gradient with respect to θ. The term log ρ0(s0) does not depend on θ, so it can be deleted; the transition term log P(s_{t+1}|s_t, a_t) also contains nothing that depends on θ, so it can be deleted as well, because the derivative of something that does not contain the variable is zero. The only surviving term in the summation is log π_θ(a_t|s_t), the only one that contains θ. So the final expression is:

∇_θ J(θ) = E_{τ∼π_θ} [ Σ_t ∇_θ log π_θ(a_t|s_t) · R(τ) ]

We have derived an expression that allows us to calculate the gradient of the objective function, which is exactly what we need in order to run gradient ascent. One thing to notice, however, is that we still have an expectation over all possible trajectories.

We still have this expectation over all possible trajectories now To calculate over all possible trajectories in the case of the cat It means that we need to calculate this gradient over all the possible paths that the cat can take of Length, for example, 10 steps. So if we want to model trajectories of only length 10 It means that we need to calculate all the possible paths that the cat can take of length 10 And it could be a huge number in the case of language model It's even bigger because usually imagine we want to generate trajectories of size 100.

It means that what are the possible All the possible texts that we can generate of size 100 tokens using our language model And for each of them we need to calculate the reward and the log action probabilities, which I will show later how to calculate Now as you can see the problem is this expectation is over a lot of terms So it's intractable computationally to calculate them to calculate this expression because we would generate Need to generate a lot a lot a lot of text for the language model.

So one way to To calculate this expectation is to approximate it with the sample mean so we can always approximate An expectation with the sample mean so instead of calculating it over all the possible trajectories We can calculate it over some trajectories. So in the case of the cat it means that we Take the cat and we ask it to move using the policy for some number of steps and each And we will generate one trajectory We do it many times and it will generate some trajectories in the case of the language model We have some prompt we ask the language model to generate some text Then we do it many times using different temperatures and different sampling strategies For example by sampling randomly instead of using the greedy strategy.

We can use the top p so it will generate many texts Each text will represent a trajectory. We do not have to do it over all the possible text that the language model can generate But only some so it means that we will generate some trajectories So we can calculate this expression here only on some trajectory that our language model will generate And this will give us an approximation of this gradient here Once we have this gradient here, we can evaluate it over the trajectories that we have sampled And then run gradient ascent on it So practically it works like this in the case of the cat We have some kind of neural network that defines the policy which is taking the state of the cat Which is the position of the cat tells us what is the probability of the next action that the cat should take We can use this policy, which is not optimized to generate some trajectories So for example, we start from here.

We ask the policy Where should I go and we for example, we use the greedy strategy and we move down then Or we use the top p for example Also in this case, we can use top p to sample randomly the action given the probabilities generated by the network So imagine the cat goes down and then we ask again the policy.

Where should I go? Policy may say okay move right move down move right move right etc. So we will generate one trajectory We do it many times by sampling always randomly according to the probabilities generated by the policy For each state actions, we will generate many trajectories in this case Then we can evaluate because we also know the rewards that we accumulate over each state actions.

We calculate the reward We also know the log probabilities of the each action because for each state we have The log what is what was the probability of taking that action and we choose it And we need to calculate also the gradient of this log probabilities This is done by automatically by pytorch when you run lost dot backwards.

So pytorch actually will calculate the gradient for you We do it for all the other possible trajectories. This will give us the approximated Gradient of over the trajectories that we have collected We run gradient ascent and we optimize the parameters of the model using a step towards the gradient Now then we need to go We do we need to do it again.

So we need to collect more trajectories We evaluate them. We evaluate the gradient of the log probabilities. We run a gradient ascent So we take one little step towards the direction of the gradient And then we do it again. We go again collect some trajectories. We evaluate this expression here to Calculate the gradient of the policy with respect to the parameters And we run again gradient ascent so a little step towards the direction of the gradient This is known as the reinforcement learning algorithm in literature And we can use it also to optimize our language model.

So in the case of the language model We we have to also generate some trajectories So one way to generate the trajectories would be to for example use the database of Questions and answers that we have built before for the reward model Which means that we have some questions So we ask the language model to generate some answer for each question Using for example the top piece strategy.

So it will generate according to the temperature many different answers for the same given question This will be a series of trajectories because the language model generation process is an iterative process made up of states So prompts and actions and which are the next tokens And this will result in a list of trajectories for which we have The log probabilities because the language model generates a list of probabilities over the next token And we can also calculate the gradient of this state Log probabilities using PyTorch because when we run loss.backward it will calculate the gradient But how do we do it in practice?

Let's see Now we want to calculate this term here So the log probabilities of the action given the state for language models Which means what is the probability of the next token given a particular prompt? Imagine that our language model has generated the following response So we asked the language model where is Shanghai and the language model said Shanghai is in China Our language model is a transformer model.

So it is a transformer layer And it will generate a given an input sequence of embeddings. It will generate An output sequence of embeddings which are called hidden states one for each input token As you know the language model when we use it for text generation It has a linear layer that allow us to calculate the logits for each position So usually we calculate the logits only of the last token because we want to understand what is the next token But actually we can calculate the logits for each position So for example, we can also calculate the logits for this position and the logits for this position will indicate What is the most likely next token?

Given this input. So where is Shanghai? question mark Shanghai is So this is because of the causal mask that we apply during the self-attention mechanism So each hidden state actually encapsulates information about the current token. So in this case of the token is And also all the previous tokens.

This is a property of the transformer model that is used during training So during training as you know, we do not calculate The output of the language model step by step We just give it the input sentence the output sentence, which is the shifted version of the input sentence we calculate the For we do the forward pass and then we calculate the log using only one forward pass We can use the same mechanism to calculate the log probabilities for each States and actions in this trajectory, which as I showed you is a series of prompts and next tokens Now we can calculate the logits for this position for this position for this position and for this position then we usually we apply the softmax to understand what is the Probability of the next token, but in this case, we want the log probabilities So we can apply the log softmax for each position.

This will give us What is the log probability of the next token given only the previous tokens? Compared to the current one So for this position it will give us the log probability of the next token given that the input is only where is shanghai? question mark shanghai Of course, we do not want all the log probabilities We only want the log probability of the token that actually has been chosen in this trajectory What is the actual token that has been chosen for this particular?

Well, we know it: it is the word "is", so we select only the log-probability corresponding to "is". Doing this at every position gives us the log-probabilities for the entire trajectory: the log-probability of selecting "Shanghai" given the state "Where is Shanghai?", the log-probability of selecting "is" given "Where is Shanghai? Shanghai", the log-probability of selecting "in" given "Where is Shanghai? Shanghai is", and so on. So now we have the log-probability of each action in the trajectory.
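Here is a minimal sketch of this computation with PyTorch (my own illustration of the idea, not the HuggingFace implementation): we take the log-softmax of the logits at every position and then gather the log-probability of the token that was actually generated.

```python
import torch
import torch.nn.functional as F

def token_log_probs(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq_len, vocab); input_ids: (batch, seq_len).
    Returns the log-probability of each actually-chosen token, shape (batch, seq_len - 1)."""
    log_probs = F.log_softmax(logits, dim=-1)       # log-probabilities at every position
    # The logits at position t predict the token at position t + 1,
    # so we align them with the tokens that were actually chosen next.
    chosen = input_ids[:, 1:].unsqueeze(-1)         # the "actions" of the trajectory
    return torch.gather(log_probs[:, :-1, :], dim=-1, index=chosen).squeeze(-1)
```

In a real setup one would additionally keep only the positions that correspond to the generated answer (not the prompt), typically with a mask.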

We have the log probability of selecting the word in given the input where is shanghai question mark shanghai is etc, etc So now we have the log probabilities of each Of each position of each state action in this trajectory When we have this stuff here, we can always ask PyTorch to run The backward step to calculate the gradients and then we multiply each gradient by the reward that we receive From our reward model we can then calculate this expression and then we can run Gradient ascent to optimize our policy based on this approximated gradient Let's see how to calculate the reward now for the trajectory So calculating the reward is a similar process as you saw before we have a reward model That is a transformer model with a linear layer on top that it has only one output feature So imagine our sentence is the same.

So Where is shanghai shanghai is in china. This is the trajectory that has been generated by our language model Now we give it to the reward model. The reward model will generate some hidden states because it's a transformer model And we apply the linear layer to all the positions that are corresponding to the action that are in this trajectory So first action is the selection of this word.

The second action is this one the third and the fourth So we can generate the reward for each time step We can just sum these rewards to generate the total reward of the trajectory or we can sum the discounted reward Which means that we will calculate something like this.

For example We will calculate Let's write it. So it will be the reward at time step zero plus gamma multiplied by the reward at time step one plus gamma multiplied by the reward at time gamma to the power of two multiplied at By the reward at time step two plus gamma to the power of three multiplied by the reward at time step three, etc Etc.

So now we also know how to calculate the reward for each trajectory So now we know how to evaluate This expression you can see here. So now we know also how to run gradient ascent to optimize our language model The algorithm that I have described before is called the gradient policy optimization and it works fine for very small problems But it exhibits problems.

It is not perfect for bigger problems. So for example language modeling And the problem is very simple. The problem is that we are approximating. So let's write here something so We as you saw before our objective function, which is j of theta, which is an expectation Over all possible trajectories that are sampled according to our policy And expectation each one with its reward along the trajectory So we are approximating the expectation with a sample mean so we do not Calculate this expression over all possible trajectories.

We calculate it only on some trajectories now this is Fair, it means that the result that we will get will be an approximation that on average will converge to the true expectation So it means that on the long term it will converge to the true expectation, but it exhibits high variance So to give you an intuition into what this means, let's talk about something more simple For example, imagine I ask you to calculate the average age of the American population.

Now the American population is made up of 330 million people To calculate the average age means that you need to go to every person ask what is their birthday calculate the Age and then sum all these ages that you collect divide by the number of people And this will give you the true average age of the American population But of course as you can see, this is not easy to compute because you would need to interview 330 million people Another idea would be say okay.

I don't go to every American person I only go to some Americans and I calculate their average age which could give me a good indication of what is the average age of the American population But the result of this approximation depends on how many people you interview because if you only interview one person It may not be representative of the whole population.

Even if you interview 10 people, it may not be representative of the whole population So the more people you interview the better and this is actually a result that is statistically proven by the central limit theorem So let's talk about the variance Of this estimator. So we want to calculate the average age of the American population Suppose that the average age of the American population is 40 years or 45 years or whatever If we approximate it using a sample mean which means that we do not ask every American but some Americans what is their average age We need to sample randomly some people and ask what their age.

Suppose that we only interview one person because we are We do not have time Suppose that we are unlucky and this person happens to be a kindergarten student and this person will probably say The age is a six. So we will get a result that is very far from the true mean of the population On the other hand, we may ask again some random people and these people happen to be for example All people from retirement homes.

So we will get some number that is very high which is for example 80 years Which is also not representative of the true population So the smaller the sample the more unlucky we are in getting these values that are very far from the true mean So one way is to increase the sample size So if we ask 1000 people what is their average age very probably we'll get something that is closer to this 40 years old because we cannot be so unlucky to get six or That all of them happen to be in the kindergarten or in the retirement age in the retirement home This happens also when we approximate an estimation with a sample mean here The quality of this approximation depends on how many trajectories we choose and as you saw before Choosing too many trajectories from language models is not easy because it means that you need to run Inference on the language model many times to calculate these trajectories Now So the problem is we cannot easily Increase the number of trajectories, but we need to find a way to reduce this value so we do not because this is the This tells us what is the direction of the gradient that we will use to run a gradient ascent We want to find the true direction of the gradient so imagine the true direction of the gradient is this one if we have high variance it means that sometimes the This approximation may tell us that the gradient is actually pointing in this direction or it's pointing in this direction Or it's pointing in this direction But if we increase the reduce the variance It will probably tell us something that is more closer to the true direction of the gradient So we will move our weights in a way that is moving To maximize the objective function because we are moving according to the true direction of the gradient So this is why we want to reduce the variance of this estimator Now, let's see what are the techniques that we can use to reduce the variance of this estimator without increasing the sample size The first thing that we should notice is that okay First of all, we had this expectation that we approximate using the sample mean you can see here Now each of these log probabilities.

So this log probabilities here are multiplied by the reward over the entire trajectory Now the first thing that we should notice is that each action cannot alter the reward that it That we received in previous steps. So imagine We have a series of states and actions. So for example, we started from state zero And then we take action one Which led us to action zero and then this led us to state one In which we we took action one which led us to state two in which we took action two, etc, etc, etc For each state action we receive a reward because when we take an action it will For example in the cat it will move to a new cell or remain in the same cell and it will receive some reward And also for this one we will have some reward.

So reward one and for this one we will have reward two Now when we take this action here, for example action number two, it cannot alter the reward that we already received in the past So when we multiply by this term reward of tau We do not consider all the rewards that came before the action that we are considering in this summation So instead of calculating the reward For the trajectory starting from zero We can calculate the reward starting from the time step of the action that we are considering for the log probabilities of the action This term here is known as the rewards to go which means what is the total reward if I start from this state And take this action and then act according to the policy for the rest of the trajectories Why do we want to do this?

Because as you can see This expression here is an approximation of the true expectation here The less terms we have the better because we will have less noise Why? Because first of all As we know each action cannot alter the rewards that we received in the past Which means that on average all these past terms will cancel out with each other But so we if we do not consider them we avoid adding some noise in this approximation that will send our gradient in Directions that are further from the true gradient So if we can remove some terms from this expression It is better because we have less chance of introducing noise that sends our gradient in two directions that are far from The one that is the true gradient that would be given by this expectation So the first thing we do is we instead of calculating the reward over all the trajectory.

We only calculate the reward For each state action of the reward starting from that state action onwards Until we reach the end of the trajectory So this T big T here you can see here capital T Indicates from the time of the current state action that we are considering here until the end of the trajectory Now this is one way to reduce the variance of the estimator Another way is to introduce a baseline.

So You can introduce it has been proven in the research of reinforcement learning that introducing a constant here Reduces the variance and it doesn't have to be a constant but it can also be something that depends on the state So it could be also a function of the state For which we are calculating the reward of the trajectory.

So for each log probability we multiply by a term here That indicates the rewards to go so the reward from this state action until the end of the trajectory minus a baseline that does not have to be Constant, but it can also be a function of the state And the function that we will choose is called the value function So this baseline we will there are many baselines, but with the one we will choose is the value function the value function tells us Of S according to some policy pi tells us what is the expected reward if you start from S And then act according to the policy for the rest of the trajectory.

This is the value function Let me show you some examples So The value function of this particular cell Of this cell here. We expect it to be high why because It's very probable that the cat will take the action move down And go directly to the meat in the case of language model this is a prompt because it's a series of tokens that we will feed to the language model to generate the Probabilities of the next token and it's very good to be in this state Why because it's very probable that the next token will be generated in such a way that it will actually answer the question Of where is shanghai?

So if the model has already generated these two tokens, for example Shanghai is it's very probable that the next token will be the word in and the next next token will be the word china Which answers our question which will result in a good response by the language model Which in turn will give us a good reward according to our reward model on the other hand If we are here, for example with the cat This is a state that can lead us to move to the bathtub So we expect the value of this state to be lower than that of this state because it's less probable that from here We end up on the bathtub.

Maybe we get closer to the bathtub, but we do not end up directly on the bathtub But from here we can end up there so it will reduce the value of this state So what is a bad value for a language model? For example in the case for this prompt here So we started with a prompt and the language model somehow generated these two words chocolate muffins for the question Where is shanghai?

Now if we ask the language model to generate the next tokens for given this prompt It will probably move far from the actual response of where is shanghai It will not tell us that shanghai is in china So the value that we can get starting from this state is not so high because we will probably end up generating a Bad response which will give us a low reward according to our reward model So this is the meaning of a value function The value function tells us if I start from this state and then act according to the policy What is the expected return I can get?

Now, how do we estimate this value function? Well, just like we did for the reward model, we can use a neural network to which we add a linear layer on top that estimates the value function. What is usually done in practice is to take the same language model that we are trying to optimize and add another linear layer on top: apart from the one that projects into the vocabulary, we add another one that estimates the value, so that the parameters of the transformer layers are shared between the language modeling and the estimation of the value.

The only two differences are the linear layers One is used for projecting the tokens into the vocabulary and one is used to estimate the value of the state Which is the prompt basically So suppose our language model has generated this response for our Prompt, so where is Shanghai and the language model has said Shanghai is in China We send it to the policy model.

So the language model that we're trying to optimize — this is called the policy — will generate some hidden states, one corresponding to each token. Then, instead of using the linear layer of the vocabulary, which projects each hidden state into the vocabulary, we use another linear layer with only one output feature, which is used to estimate the value of each state. So we can estimate the value of this state, of this state, of this state, and also of the entire sequence, by using the values generated by this linear layer for each hidden state that we want.
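As a rough sketch of this architecture (hypothetical class and parameter names; the backbone is assumed to return the final hidden states), the two heads could look like this:

```python
import torch
import torch.nn as nn

class PolicyWithValueHead(nn.Module):
    # Sketch of a policy network with a shared backbone and two heads:
    # one projects hidden states to the vocabulary (the language-modeling head),
    # the other projects each hidden state to a single scalar (the value estimate).
    def __init__(self, backbone: nn.Module, hidden_size: int, vocab_size: int):
        super().__init__()
        self.backbone = backbone                      # assumed to return (batch, seq_len, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids: torch.Tensor):
        hidden = self.backbone(input_ids)             # (batch, seq_len, hidden_size)
        logits = self.lm_head(hidden)                 # (batch, seq_len, vocab_size)
        values = self.value_head(hidden).squeeze(-1)  # (batch, seq_len): one value per state
        return logits, values
```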

Okay, now, to recap what we have done so far to reduce the variance: first of all, we transformed the reward of the entire trajectory into rewards to go, so something that starts not from time step zero but from the time step of the state-action that we are considering, and we also saw that we can introduce a baseline that depends on the state, and this does not change the approximation. So this estimator is still unbiased, which means that on average it will converge to the true gradient, but it will have lower variance. In the example of calculating the average age of the American population, this means that we are reducing the chance of getting very low or very high values for the age, and we will get something that is closer to the true average age. Now, this rewards-to-go function has a name in the reinforcement learning literature.

It's also called the Q function. The Q function tells us: if I start from this state and take this action, what is the expected future reward if I then act according to the policy for the rest of the trajectory? So we get some immediate reward for taking the action and then we act according to the policy. We can therefore simplify the expression that we saw before as the Q function of the state and action at time step t

minus the value of the state at time step t. The difference between the two is known as the advantage function. Now, I know that I am introducing a lot of terms and terminology; bear with me, because it will make sense later. You don't have to remember all the terms, and I will repeat these concepts multiple times.
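Written compactly, the relationship between the three quantities just introduced is:

```latex
A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)
```

where Q is the expected return when starting from the state, taking that action, and then following the policy, and V is the expected return when starting from the state and following the policy from the beginning.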

I will repeat multiple times these concepts So what we were trying to do we are trying to reduce the variance of this estimator and we saw that we can instead of calculating the reward for all the trajectories only for the rewards for the Starting from the time step in which we are considering the action values Then we saw that we can introduce this baseline called the value function that will reduce further the variance of this estimator The difference between these two is called advantage function in the literature of reinforcement learning And the advantage function if you look at the expression here tells us.

Okay. First of all, let's analyze these two terms pen Now the Q function tells us what is the expected return if I start from state s at time step t Take action a so here. I forgot the t's t Action t and t and also here t and t.

Okay so the Q function tells us if I start from state t take action a And then act according to the policy What is the expected return the value function on the other hand tells us if I start from state s And I act according to the policy. What is the expected return?

Now, in this case here — let's use the pen — in this state, choosing the action "go down" is better than going left, because by going down I move towards the meat. The advantage term, which is the difference between these two terms, tells us how much better this particular action is compared to the average action that we can take in the state s. This means that the advantage of the action "go down" in this state will be higher than the advantage of another action. So the advantage function tells us how much better than average the action we are considering is, compared to the other actions available in this state. And if we want to give an interpretation to this whole expression,

it tells our model that for each action in a particular state, so for each log probability, we want to multiply by its advantage. Because this is the gradient, it indicates the direction in which we need to optimize our parameters: by running gradient ascent, we are forcing our policy to push up the log probabilities of the actions that have high advantage, which means they result in better-than-average returns, and to push down the log probabilities of the actions that, in each state, result in lower-than-average returns. Let's talk about language modeling: if someone asks "where is Shanghai?",

what is a good next token to select? Well, we know that starting the answer with "chocolate" is going to produce a reward that is worse than average, because it will very probably lead to a bad answer. However, starting the answer with the word "Shanghai" will probably result in the correct answer, because the next tokens will likely be "Shanghai is in China", which will be rewarded well by our reward model. So our model will become more likely to select the word "Shanghai" when it sees this prompt. This is how to interpret the advantage term: we are trying to push up the log probabilities of those actions, for a given state, that result in a better-than-average reward according to our reward model, and push down the probabilities of those actions that result in a lower-than-average reward. Let's see how to estimate this advantage term now. First of all, let me write again the expression of the advantage term — let's use the pen.

So, as we saw before, the advantage term at time step t, so starting from state s at time step t and taking action a at time step t, is equal to the Q function at time step t minus the value function at time step t. What is the Q function? The Q function tells us: if we start from state s, take action a, and then act according to the policy

for the rest of the trajectory, what is the expected return? The value function, instead, tells us: what is the expected return if we start from state s and then act according to the policy? Now, imagine we have a trajectory. A trajectory is a list of states and actions. So we have a state 0 in which we take action 0, with some associated reward, reward 0; this leads us to state 1, in which we take action 1, with its associated reward 1; this takes us to state 2, in which we take action 2, with reward 2; and then state 3, in which we take action 3, with reward 3, and so on for the rest of the trajectory. Let's try to understand how we can estimate this advantage term.

We also saw before that, to estimate the value function, we can build a linear head on top of our policy network, which is the language model that we are trying to optimize: instead of the linear layer that projects the hidden state into the vocabulary, we use another special linear layer with only one output feature that estimates the value function of that particular state. Later we will also see which loss function we need to use to train this value head. For now, let's concentrate on estimating the advantage term. Imagine we have a trajectory; the advantage term can be estimated as follows.

As we know, the advantage term contains the Q function, so the Q function of a given state S and action A at time step T, and for a sampled trajectory we can estimate it in several ways. If I start from state 0 and take action 0, I get some immediate reward, and then I can either sum all the actual rewards of the trajectory, or I can stop early and approximate the rest of the summation with the value function: for example, the immediate reward plus V(S1), because I end up in state S1. Or I can take the immediate reward plus the reward I get in the next state, and approximate the rest of the trajectory with the value function at time step t plus 2, so V(S2), discounting future rewards with the gamma parameter that we saw before. The minus V term is simply there because the formula of the advantage subtracts the value function of the starting state. We can also use three terms, or four, or five, or however many we want, and then cut off the rest with the value function. Now, why would we want to do this?

If we stop too early, for example with just the first approximation, we are approximating most of the trajectory with the value function, which is itself an approximation, so the estimate of the advantage will exhibit high bias. To improve it, we can introduce more rewards from the actual trajectory and approximate only a small part with the value function, or we can use only the actual rewards with no value-function approximation at all. But the more terms we use, the higher the variance; the fewer terms we use, the higher the bias, because we are approximating more. To solve this bias-variance trade-off, we can use Generalized Advantage Estimation, which basically takes the

weighted sum of all these terms — this one, this one, this one — each multiplied by a decay parameter lambda, as you can see here. This results in a recursive formula in which we can calculate the advantage at each time step t given the advantage at time step t plus one. Let's try to use this formula.

For example, imagine we have a trajectory, which is a series of states and actions: we have a state 0 with action 0, which results in a reward 0; this leads to a state S1 in which we take action 1, with its reward 1; this results in a new state S2 in which we take action 2, with reward 2; and that leads us to state S3, in which we take action 3, with reward 3, and so on. Let's try to calculate the advantage.

For example, the advantage at time step three, because it's the last step in our trajectory, is equal to delta at time step three plus gamma multiplied by lambda multiplied by the advantage at time step four — but we do not have any time step four, so this term does not exist. And delta three is equal to the reward at time step three, plus gamma multiplied by the value function at time step four — but we do not have this term either, because there is no state four — minus the value of the state S3. This gives us the advantage estimate at time step three. Then we can use it to calculate the advantage estimate at time step two: A2 is equal to delta two plus gamma multiplied by lambda multiplied by A3. But what is delta two?

Delta two is equal to the reward at time step two, plus gamma multiplied by the value of state three, minus the value of state two, and so on. So we can recursively calculate the advantage estimate at each time step. Why do we need to calculate the advantage estimate? Because the advantage appears in the formula of the gradient that we need in order to run gradient ascent. I know that I have introduced a lot of concepts and terminology here.
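To make the recursion concrete, here is a minimal sketch (a hypothetical PyTorch-style function) of Generalized Advantage Estimation over one trajectory:

```python
import torch

def gae(rewards: torch.Tensor, values: torch.Tensor, gamma: float = 0.99, lam: float = 0.95) -> torch.Tensor:
    # rewards: (T,) reward at each time step of one trajectory
    # values:  (T,) value estimate V(s_t) for each state in the trajectory
    # Implements the recursion:
    #   delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    #   A_t     = delta_t + gamma * lam * A_{t+1}
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    next_advantage = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0  # there is no V(s_T) beyond the last step
        delta = rewards[t] + gamma * next_value - values[t]
        next_advantage = delta + gamma * lam * next_advantage
        advantages[t] = next_advantage
    return advantages
```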

I have introduced the value function I have introduced the Q function and the advantage function I also know that it may not be very clear to you Why we are calculating all this stuff because we have not seen the code and how it will be used So please bear with me now.

I know that there is a lot of stuff that you need to remember, but when we see the code I will go back to all these slides. I made all these formulas now because later, when we go back to them, they will make more sense to you, and also so that if you review this video in the future you can just rewatch the parts you are interested in, without having to watch the code to understand the formulas. Okay, now let's see what the advantage term is for a language model. Just like in the example I made before, we have this expression for our gradient, in which we multiply each log probability by the advantage function (also here I forgot the t; I will fix the slides later).

As we saw before, if our question is "where is Shanghai?" and our language model selects the word "Shanghai", this becomes a new state that will be fed to the language model for generating the next tokens. This first choice of "Shanghai" will lead to a good answer, because very probably the next tokens will be selected in such a way that the result is, for example, the phrase "Shanghai is in China", which is a good response because it matches the chosen answers in the dataset of our reward model. So our reward model will give a good reward to this kind of

answer, and we can say that this is a good state to be in, because it leads to future states that will be rewarded well by the reward model. However, if our language model happens to choose the word "chocolate" as the next token after this question, this new state will lead to new tokens that are not close to the answer we are trying to find, and this will result in a bad response,

which will receive a low reward from our reward model. So, in the case of language models, we are trying to push up the log probability of the word "Shanghai" when the model sees the state "where is Shanghai?", and push down the log probability of the word "chocolate" for that same state, because the advantage of choosing "Shanghai" is higher than the advantage of choosing "chocolate" given this prompt. This is how we

interpret the advantage estimation for language models. Another problem that we have with policy gradient optimization comes from the sampling that we are doing. As you know, in policy gradient optimization the algorithm is like this: we have a language model.

We sample some trajectories from this language model. We calculate the Rewards associated with these trajectories. We calculate the advantages associated with these trajectories We calculate the log probabilities associated with these trajectories Then we can use all this information to calculate this big expression here, which is the direction of the gradient so which is the gradient of the Expected reward with respect to the parameters of the model and then we can run gradient ascent to optimize the parameters of the model according to the direction of the gradient and This is a process that is also used in gradient descent So using gradient descent we have a loss function We calculate the gradient of the loss function with respect to the parameter of the model And then we optimize the parameters of the model according to the direction of the gradient We do this process many many many times.

Why? Because we do little steps With respect to the direction of the gradient according to a learning rate alpha Now the problem is that we are sampling trajectories from the language model For each step that you are making in this gradient ascent So for each step of this optimization process, we need to sample many trajectories.

We need to calculate many advantages We need to calculate many rewards. We need to calculate many log probabilities So this can be very very inefficient because we will when doing gradient ascent, we are taking only small steps so for each of those small steps, we need to do a lot of calculation which makes the Which is makes the computation nearly impossible because we cannot run all these forward steps on many different Models to calculate the values the advantages and the rewards etc.

We need to find a better way so as you remember this This formula for the gradient that we have found is an approximation of an expectation And in probability we have this thing called important sampling So when evaluating the expectation with respect to one distribution we can calculate the expectation with respect to another distribution different from the The previous one as long as we modify the we multiply the function Inside the expectation by an additional term here.

In the case of our policy gradient optimization, I want to remind you that we are calculating the gradient of an expectation over all possible trajectories sampled from the policy pi theta, of the reward of each trajectory. So in this case x is the trajectory sampled from pi theta, and f(x) is the reward of that trajectory.

Now, as you know, the expectation can be written as an integral of the probability of each item multiplied by the function f(x) that appears inside the expectation. We can multiply and divide by this quantity q(x), which is the same as multiplying by the number one, so it does not change the result. Then we can rearrange the terms so that we divide the p(x) by this q(x), where q(x) is the probability density function of another distribution, and we can turn the integral back into an expectation: now we can write it as an expectation over samples from the distribution q, of f(x) multiplied by this additional term p(x) / q(x). This means that, in order to calculate the initial expectation, instead of sampling from the distribution for which we want the expectation, we can sample from another distribution, as long as each item is multiplied by this additional factor. And we can do the same for our policy gradient expression, in which we were sampling from the policy that we are trying to optimize: by using importance sampling we can sample from another policy, which could be a different neural network (we will see that it's actually the same one, but for now suppose it's different), because sampling trajectories just means generating some text given some prompts.

So each of these advantage terms, instead of being weighted only by the probability according to the network that we are trying to optimize, is also divided by this q(x), the probability under the distribution from which we are actually sampling. We will call the distribution from which we sample the offline policy, and the distribution that we are trying to optimize the online policy. For now, just remember that with importance sampling we can calculate this expectation by sampling from one network while optimizing a different one.

This is called off policy learning in reinforcement learning literature. So imagine we have a language model and we will call it Parameterized by some parameters called theta offline and we will call it the offline policy We will sample some trajectories. What does it mean? We give some questions according to our reward model data set, for example So we ask it where is shanghai and we ask the language model to generate many answers giving using a high temperature For example, then we calculate the rewards for these trajectories that are generated We calculate the advantages for all the state action pairs.

We calculate the log probabilities for these state-action pairs, and then we optimize another model, called the online policy. So we take all the trajectories that we have sampled from the offline policy and save them in some memory, and we keep them there. Then we take mini-batches of trajectories from this memory, and we can calculate this expression here, because we can compute the log probabilities of those trajectories according to the online model. As I will show later in the code, we can also calculate the advantage terms

and the rewards, and so on. Then we run gradient ascent based on this expression, optimizing only the online policy, and we do it for a few epochs, which means for a few mini-batches sampled from this big memory of trajectories. After a while, we set the parameters of the offline policy equal to the parameters of the online policy and restart the loop: we sample some new trajectories, keep them in memory, and again, for a few epochs, we sample mini-batches from this memory,

we calculate the log probabilities with respect to the online policy, we calculate this expression, which is what we need for gradient ascent, and after a while we set the offline policy equal to the online policy. They look like two different neural networks, but actually it's the same neural network: we first sample from it, we keep the sampled trajectories in memory, then we optimize the network using those trajectories, and after a while we repeat the process.

I know that this is not easy to visualize So later we will see this in the code, but the important thing is that now we have found a way to Run gradient ascent multiple times without having to sample each time from the policy that we are optimizing from the network that we are trying To optimize.

We can sample once, keep these trajectories in memory Optimize the network for some steps and then after we have optimized for some steps, we can sample new trajectories We do not have to do it for every step of gradient ascent So this makes the computation of this policy gradient algorithm tractable because otherwise it was too slow to run it And this is how we do it in the code, so I also created some pseudocode in how to do this offline policy.

So imagine we have a model that we want to train Okay, let's use This one here, okay For now, just ignore the frozen model. We're not using it So we have a neural network that we want to train with gradient ascent So we have a policy that we want to optimize with gradient ascent We sample some trajectories from this policy and we keep them in memory.

For each trajectory, we calculate the log probabilities, the rewards, the advantages, the KL divergence, etc. (later we will see why we need the KL divergence; for now, just ignore it). Then we sample some mini-batches from these trajectories that we have saved, we run the PPO algorithm — that is, we calculate the loss, basically the expression that we saw before — we calculate the gradient using loss.backward, and we run an optimizer step. But we do not need to sample again: we just take another mini-batch from the trajectories we have already saved and do another step of gradient ascent, and so on, until we reach a specified number of steps. After we have optimized the model for some number of steps, we sample new trajectories and run this optimization loop again for many steps. So it's not for every step of gradient ascent that we have to sample new trajectories: we sample once, we do many steps of gradient ascent, and then we sample again; we do many steps of gradient ascent, and then we sample again.
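In condensed pseudocode (all helper names here, like sample_trajectories or ppo_loss, are hypothetical placeholders for the steps just listed, so this is a schematic sketch rather than runnable code):

```python
for iteration in range(num_iterations):
    # Phase 1: sample with the current (offline) policy and cache everything we need later.
    trajectories = sample_trajectories(policy, prompts)               # prompts + generated responses
    with torch.no_grad():
        old_logprobs, values = policy.logprobs_and_values(trajectories)
        rewards = compute_rewards(trajectories, reward_model, old_logprobs, ref_logprobs)
        advantages = compute_advantages(rewards, values)

    # Phase 2: several gradient steps on mini-batches of the *same* cached trajectories.
    for epoch in range(ppo_epochs):
        for batch in minibatches(trajectories, old_logprobs, advantages):
            new_logprobs, new_values = policy.logprobs_and_values(batch)   # online policy
            loss = ppo_loss(new_logprobs, batch.old_logprobs, batch.advantages, new_values)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    # After the inner loop, the "offline" policy is simply the updated network,
    # and we go back to phase 1 to sample fresh trajectories.
```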

This makes the training much faster. Okay, I promise this is the last group of formulas that we are going to see: this is finally the PPO loss. Let's try to understand it. Based on what we have seen before, the first thing to notice is that this term here is exactly the one that we saw before: the probability of the action according to the policy that we are trying to optimize, divided by its probability under the policy that we sampled from, so the offline policy.

So this is the policy that we are trying to optimize — let's call it the online policy — and this is the policy that we sampled the trajectories from, the offline policy. Then we have this advantage term, which multiplies each of these ratios for the state-action pairs, and we take the minimum between this expression and this other expression here, which is the clipped ratio.

So this clipped log probabilities Why? Well, first of all, what is the clip function? The clip function says that if this expression we can see here is bigger than 1 plus epsilon, then it will be clipped to 1 plus epsilon If this expression is smaller than 1 minus epsilon, then it will be clipped to 1 minus epsilon Why do we want this?

Well, it means the following. First of all, let's try to interpret this term here, the ratio of the two probabilities: we have a policy that we sample from and a policy that we are optimizing. If the probability of a specific action under the policy that we are optimizing is much higher than under the policy that we sampled from, it means that we are trying to increase the likelihood of selecting that action in the future, and we don't want this increase to go too far.

So we want to clip it to maximum at this value On the other hand, if we are trying to decrease the likelihood of an action compared to what it was before We don't want it to decrease by too much, but at most by this quantity here This means that in our optimization step We are moving the action probabilities So the probabilities of selecting a particular token given a particular prompt we are changing them continuously But we don't want them to change too much.

We want to make little steps. Why? Because if we move them too much, the model may not explore the other options enough: it may over-commit to a particular action, so that it always uses it, or always avoids it. We want to do it little by little, so that the model makes a small step in increasing, or a small step in decreasing, the probability of that particular action. Why are we talking about actions?

Because we are talking about language models and so we want to Increase or decrease the probability of selecting a particular token given a prompt But we don't want this probability to change too much. This is why we have the minimum here. So we want to make the most Pessimistic update we can we don't want to be too optimistic.
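Putting the ratio, the clipping, and the minimum together, a minimal sketch of this clipped policy term (hypothetical helper working on per-token tensors) could be:

```python
import torch

def clipped_policy_loss(logprobs_new: torch.Tensor,
                        logprobs_old: torch.Tensor,
                        advantages: torch.Tensor,
                        clip_eps: float = 0.2) -> torch.Tensor:
    # ratio pi_new(a|s) / pi_old(a|s), computed from log probabilities
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # take the pessimistic (minimum) of the two surrogates; negate because optimizers
    # minimize, while we want to maximize this objective
    return -torch.min(unclipped, clipped).mean()
```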

We don't want the model to make the most optimistic steps So if the model is very sure that it can always select this token, we don't want the model to be sure We want the model to make a little step towards what the model thinks is better choice The other head that we introduced before was the head for calculating the value function So as you remember, we also Introduced this value function and we say that this value function which is a function of the state indicates what is the Expected reward that we can receive from start by starting from that particular state And the example that I gave you was for example, imagine we are our question is where is Shanghai?

So, "where is Shanghai?": if the model has selected, for example, the word "Shanghai" as the next token, we expect the value of this state — which becomes a new input for the language model — to be high. Why? Because it will probably result in a good answer that will be rewarded well by our reward model. But of course we also need to train our neural network to approximate this value function well, so the second term of the PPO loss is there for training the value function estimator: it compares the value predicted by the value head for a particular state (the output of the model) with the actual value of that state computed from the trajectories that we have sampled, because we have trajectories, each trajectory is made up of state-actions,

and each state-action has some reward, so we can actually calculate the value of this state according to the trajectory that we sampled. In other words, we want to fit the value function estimator to the trajectories that we have actually sampled from our policy. Then there is one last term of the PPO loss.
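Before looking at that last term, here is a minimal sketch of the value-head regression just described (names are hypothetical; "returns" stands for the per-state values computed from the sampled trajectories):

```python
import torch

def value_loss(values_pred: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    # values_pred: V(s_t) predicted by the value head for each sampled state
    # returns:     the value of that state observed in the sampled trajectory
    return 0.5 * ((values_pred - returns) ** 2).mean()
```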

So we have, first of all, the policy optimization term, which is the one I described before; then we have the loss for the value function estimator; and then we have another term here, the entropy loss. This one is there to force our model to explore more options. Imagine we don't have this term: the model may simply learn to select the actions that resulted in a very good advantage more often, and the actions that resulted in a lower-than-average advantage less often. This makes the model very rigid in selecting tokens: it will always choose the tokens that resulted in a good advantage and never the tokens that resulted in a bad advantage, so it will not explore other options. For example, imagine we sample some trajectories, and for the question "where is Shanghai?" the model always selects the word "Shanghai" because it results in a good answer; we still want the model to explore other options.

Maybe there is another word: for example, for "where is Shanghai?" the next word could be "it", because that can lead to "it is in China". So we want to give the model the possibility to explore more of these options, and this is why we introduce the entropy term: we want the model, for each state, to also explore other actions. We want to force the model to explore.
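A small sketch of how an entropy bonus is typically computed from the logits and combined with the other two terms (the coefficients c1 and c2 are hypothetical hyperparameters):

```python
import torch

def entropy_bonus(logits: torch.Tensor) -> torch.Tensor:
    # logits over the vocabulary for each generated position;
    # higher entropy means a more uncertain model, hence more exploration
    probs = torch.softmax(logits, dim=-1)
    logprobs = torch.log_softmax(logits, dim=-1)
    return -(probs * logprobs).sum(dim=-1).mean()

# total loss to minimize (sketch):
#   loss = policy_loss + c1 * value_loss - c2 * entropy_bonus(logits)
```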

So, since we are maximizing this objective function, we also want to minimize this loss here (we will see later how) and to maximize the entropy, so that the model can explore more options. Why entropy? Because the entropy tells us how much uncertainty there is in the prediction, and a model that is a little more uncertain will explore more possible next tokens for a given prompt. The last thing that we need to consider is that, if we optimize the policy using only the PPO loss described so far, the model may learn some tokens or sequences of tokens that always result in a good reward and may always choose those tokens to keep getting good rewards, even if they don't make sense for us humans.

For example, imagine our dataset for the reward model forces the model to be polite. The model may just output the words "Thank you. Thank you. Thank you." continuously, because this is very polite and results in a good reward, but it is not a good answer to a question: if I ask "where is Shanghai?" and the model just keeps telling me "Thank you."

Thank you. Thank you Then for sure the reward model will give a good reward to this answer because it's a polite answer But it does not make sense to humans So we want the model to actually generate output that makes sense that are very similar to the data It has seen during the training.

That's why we want to constrain the model Not only to get good rewards, but at the same time to generate answers that are very similar to the one It would generate by just looking at the untrained model. So at the unaligned model This is why we make another copy of the model that we want to optimize and we freeze its weights So this is the frozen model.

We generate the rewards for each step in the trajectory, but we penalize them by how much the log probabilities at each step differ from the frozen model. For each hidden state we can generate the reward using the linear layer with only one output feature that we saw before, and at the same time, for each hidden state, we also calculate the log probabilities using the other linear layer, the one that generates the logits: we send the hidden state to that linear layer to compute the logits and then the log probabilities. We do the same with the frozen model, and then we penalize the reward: the reward at this time step becomes the reward at that time step minus the KL divergence between the log probabilities of the frozen model and the log probabilities of the policy that we are optimizing. We want to penalize the model for generating answers that are too different from the frozen model.

So we want the Reward to be maximized but at the same time we don't want the model to cheat In just getting reward by generating any kind of output But we want the model to actually get rewards for good answer that are very similar to the one that it would generate If it was not optimized Okay, I know that you are tired of looking at all this explanation and all this theory.

So let's jump into the code now. The code that we are going to see is code that I took from the HuggingFace website, which basically allows us to set up a reinforcement learning training in which we want a language model to generate positive reviews. So we have a language model that generates text, but we want to push it to generate positive reviews of, for example, a restaurant or a movie. We want the language model to still generate something that is comprehensible to humans, but at the same time we force it to be positive, to say things like "I really like this movie" or "I really like this restaurant". We will be using the IMDB dataset.

As you can see from the HuggingFace website, the IMDB dataset is made up of review texts, and for each review it indicates whether the review is positive or negative. We will use this dataset to define the score that we want to give to a generated review: if the generated text looks like a positive review according to this dataset, it will be given a high reward, and if it looks like a negative review, it will be given a low reward. So the first thing we do is create the model that we want to optimize, which is this language model here — I think it's GPT-2 already fine-tuned on the IMDB dataset. Then we create a reference model, because we need a model with frozen weights to compare how different the response of the model we are optimizing is from the frozen model, since we don't want the output to be too different.

We just want it to be a little positive, but we don't want the model to just Output garbage just to get high reward. We want to actually Get actual text that makes sense. This is why we keep also a frozen model And then we load this PPO trainer. The PPO trainer in HuggingFace is the class that is used to train To run reinforcement learning from human feedback using the PPO algorithm So, let's see.

First of all, what is the reward model? The reward model here is basically just a sentiment analysis pipeline using this model here: for each text that we feed to it, it gives us a number that indicates how positive the text is according to this IMDB dataset, so it tells us whether the text is a positive review or a negative review. For example, if we give it this text here, it will probably tell us that it's a bad review, so low reward, and if we give it this text here, "this movie was really good",

it will give us a positive score, and we will use this number as the reward — the score corresponding to the positive class. Okay, the first step in PPO is to generate the trajectories. So we have some model, the policy, which acts as the offline policy, and we need to sample some trajectories from it.

What do I mean by sampling some trajectories? It means that we give the model some text and it generates some responses, some output text. As prompts for generating the text, we will just use some initial pieces of text sampled from this IMDB dataset, as you can see here.

This dataset is composed of many reviews, some positive and some negative. We just randomly take the initial part of a review and use it as a prompt to generate the rest of the review, and then we ask the reward model to judge the review that was generated: if it's positive, it gets a high reward; if it's negative, it gets a low reward. So we generate some lengths, that is, we randomly select how many tokens to take from each review.

We get these prompts from our dataset and ask the PPO model to generate answers for these prompts, so it generates the rest of the text up to a maximum length that is also sampled randomly. These are our trajectories for now: just the combination of a prompt and the generated text. We did not calculate the log probabilities,

the advantages, or the rewards yet. For now we only have the query and the response generated by our offline policy. What is the offline policy? It is the model that we are trying to train, this variable "model" here. Now that we have some responses,

So this variable here model and Now that we have some responses We can ask our reward model to judge these responses and we use basically just do a sentiment classification in which we give the response that was given by the Policy and we ask the sentiment pipe So the sentiment analysis pipes which will act as our reward model to judge this text So how positive is this review that was generated and we will take the Score associated with the positive class that will be generated as you can see here.

So as a reward we take the We assign the reward to the full response. So for each response, we will get one number And this number is actually the Logits, so the score corresponding to the positive class according to this sentiment analysis pipeline Now that we have some trajectories, which are some questions So some prompts along with the text that was generated along with the reward for each of this text that was generated We can run the PPO training setup.

So let's now go inside the code of the library So the first thing we do is we call this function here step in which we give out the prompt that we gave to the language Model the responses that were generated and the rewards associated with each response And then we run this step function here Now the step function here.

Okay. First it checks if the tensors that you pass it are correct So the data types and the shapes of the tensors, etc, etc Then it converts the scores into a tensor because the scores are at least one score for each response So it converts it into a tensor.

I commented the code that I don't find Useful for my explanation. So there are many functions in a hugging phase, but we will not be using all of them I will just concentrate on explaining the vanilla PPO like it was described in my slides Okay The first thing that we need to do is to calculate all the log probabilities of the actions that we that we need to calculate the gradient so we do it here in this function here, so given the answers the text generated by our model And the queries that were used so here they are called queries and responses But they are actually the prompts and they generated the text The hugging phase they calculate the log probabilities for each step.

How do they calculate it? Well, they calculate the call this function batched forward pass in which they pass the model from which the Answers were generated. So the text was generated the prompt that were used to generate this text And they divide each of these Questions and responses into mini batches and then they run it through the model the model as we saw in the slides So let's go back here, I think So Here we know that we can calculate the log probabilities corresponding to each position based on the Text and the question that was asked so we can create a concatenation Of the question and the text that was generated.

We pass it to the model. The model will generate some logits one for each position Of the token. We only take the log probability of the next token because we already know which next token was generated So we know that for this particular prompt made up of these four tokens.

The next token is shanghai So we only take the log probability corresponding to the word shanghai and this is what is done in this line here so we ask the language model to generate the logits corresponding to all the Positions then we calculate the log probabilities from this logits.

How? Here we calculate the log softmax, exactly like in my slides: for each position's logits we compute the log softmax, which gives the log probabilities over the vocabulary for that position. But we are only interested in the position corresponding to the next token, and this is done here with the gather function, as you can see here.
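In code, that log-softmax plus gather step looks roughly like this (a sketch assuming a HuggingFace-style causal LM whose output exposes .logits; in the real library a mask is then applied to keep only the response positions):

```python
import torch
import torch.nn.functional as F

def logprobs_of_generated_tokens(model, input_ids: torch.Tensor) -> torch.Tensor:
    # input_ids: (batch, seq_len) = prompt tokens followed by the generated tokens
    logits = model(input_ids).logits                    # (batch, seq_len, vocab)
    logprobs = F.log_softmax(logits, dim=-1)            # log-probabilities over the vocabulary
    # the logits at position t predict the token at position t+1,
    # so shift and gather the log-prob of the token that was actually generated
    next_tokens = input_ids[:, 1:].unsqueeze(-1)        # (batch, seq_len-1, 1)
    token_logprobs = torch.gather(logprobs[:, :-1, :], dim=2, index=next_tokens).squeeze(-1)
    return token_logprobs                               # (batch, seq_len-1)
```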

So, from all the log probabilities, it only selects the one corresponding to the next token, because we already know which token was generated. Now we have the log probabilities and we can save them — we don't want the log probabilities for all the tokens of the vocabulary. We also need to keep track of where the log probabilities that we care about start and where they end. Why? Because, as you can see from my slide, in our trajectory here

the question was "where is Shanghai?" and the model generated four tokens, "Shanghai is in China". We are only interested in this trajectory, which has only four steps, so we are only interested in the log probabilities of these four tokens. And this is exactly what we do here: we record the starting point and the ending point of the log probabilities that we consider, because the model generates log probabilities for all the positions but we only want some of them.

So we create a mask that says we will be considering only these four or five probabilities, according to which tokens were actually generated by the model. Now we have the log probabilities of each action, so let's go back to the step function. Okay, so we calculated the log probabilities according to our offline policy. Why do we do it here, inside the step method, and not outside?

Well, because HuggingFace is a user-friendly library, so they don't want to give the user the burden of calculating the log probabilities of each action: they do it inside the library, and they only ask the user to generate the responses for each prompt; then they take care of calculating the rest of the information. Now we also need to calculate the log probabilities with respect to the reference model, so the frozen model. Why? Because we also need the KL divergence, which will be used to penalize the reward at each position: we want to penalize the model for producing log probabilities that are very different from the frozen model, otherwise the model will just do what is known as reward hacking, which is generating tokens that get a good reward but do not make sense for the user. So we run the same method, this batched forward pass, on the frozen model to generate its log probabilities, which will be used to calculate the KL divergence that penalizes the reward. The next step is to actually compute these rewards.

So how do we compute the rewards? Well using the log probabilities of the model that we are trying to optimize and the frozen model because we need to calculate the KL divergence We have this mask which indicate which log probabilities we need to take into consideration because we have the log probabilities of all the response But only some of them are interesting for us because they belong to the trajectory And let's see how to compute the rewards So the rewards are computed as follows.

So we calculate the KL penalty, which is the difference in log probabilities So if you go to here, you can see that the KL divergence is just a difference in log probabilities as you can see here and we penalize as you can see here The reward is basically just the KL divergence penalization, which is the KL divergence multiplied by some factor Which is the penalty factor and then we sum the score so We saw before that the score is what is just the score associated to each response By our reward model.

Our reward model is just a sentiment classification pipeline that will generate one reward one single number for each response so indicating how Positive is the response that was generated or how negative it is Because we only have one generated response We and this response this reward is associated with the last token.

So let me show you in the slides Here we were computing the reward for each step But actually the sentiment classification model will compute the reward only for the last token for the full answer for the full generated text So we basically we create but we need of course to calculate the reward of the trajectory.

We need the reward for each state actions so we compute the KL penalty for each position because we know the log probabilities of the frozen model and of the Model that we are trying to optimize. So we have the KL penalty for each position, but we have the reward only for the last one So this is exactly what we are doing here.

We calculate the log probabilities For the KL penalty for each position, but the score is only added to the last token, so here in this position here And then when we compute the advantage because we compute the advantage starting from the last to the first we will kind of Take this reward and put it in the previous steps and we will see this later so now we have found a way to calculate the The rewards associated with each position in which each position is given some score by the sentiment classification But this is only given to the last token while the KL penalty is given to each position So, let's go back Okay, so we have computed the rewards now we can compute the advantages Let's see how we compute the advantages to compute the advantages.

We need the values What are the values? Well, the value is the estimation of the value value function As we saw before the value is computed by using the same model So the policy network with an additional head Which is a linear layer that gives us the value estimation for that particular State so let me show you in the slides Here we saw before that of the policy network So the model that we are trying to optimize also has an additional linear layer that gives us a value estimation for each step of the trajectory And this is actually already when we calculated the log probabilities the this function also returns the value head the value estimation for each Step of the trajectory then we can use the values estimated plus the rewards that we calculated Plus the mask because we need to know which value we have and which value we don't have to compute the advantage using the same formula that we saw before so we start from the The formula of the which is a this one here.

So let's go back to the formula Okay here We calculate the delta t to compute the advantage estimation at time step t so Here we are computing the first delta t which is the reward at time step t plus gamma as you can see here Multiplied by the value at time step t plus one and this is here.

So it's zero if we do not have any future Values, otherwise, it's the value at time step t plus one Minus the value at time step t exactly according to this formula here. You can see here and then we use this delta value to compute the Ge estimation, which is the delta plus gamma multiplied by lambda multiplied by the Ge at the next time step which is exactly what we do here.

So delta at time step t plus gamma multiplied by lambda multiplied by the Advantage estimation at time step t plus one And we do it from the last From the last item in the trajectory to the first item in the trajectory That's why we do this for loop in reverse And then we reverse it back because we computed the advantage Reversed and then we reverse the computed advantages to have them from zero to time step t to Capital t instead of capital t to zero Then we compute the q values that will be used to Optimize the value function.

So as you can see here To optimize the value head. So the value estimation we need to have the estimation of the value function But according to the trajectory that we have sampled, but what is the estimation of the value function according to the trajectory? It is actually the q function because For the value function tells us.

Okay. Let me use some kind of let me write here Otherwise, it's not easy to understand. So The value function here is tells us what is the value Of a particular state. So what is the expected return that we can get? By starting from a particular state and we can We can approximate it also actually with the q function from the sample trajectories.

Why? Because the value function is At time step t is the expected return Over all possible actions that we can take starting from the State s And taking action a so the value function here can be actually calculated from the q function But it's an estimated An expectation over all the possible actions that we can take which means that The q function tells us what is the expected return if we start from state s and take action a the value function tells us What is the expected return that we can get if we only start from?

State s and react according to the policy Which is also the which basically can also be Calculated as the expected return over the q function, but Expected expectation over all the possible actions that we can take which kind of can be thought of as what is the average return that we can get by starting from the State s and taking some actions over all the possible actions that we can take But we do not have all the possible actions So we can approximate this expectation with a sample mean according to the one that we have in our trajectory So we have some actions state actions in our trajectory So we can actually approximate it this using the q S a that we have in our trajectory And how do we compute this q?

S a As you remember the formula for the advantage is advantage of s a at particular time step is equal to the q Of s a minus v of s so we can get q S a is equal to advantage S a plus the value s And this is exactly what we are doing Here, so we are saying to get the q function We are calculating the advantages plus values and this term here will be used to Calculate the loss for the value head.
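In code form, this identity gives the regression targets for the value head (a sketch with hypothetical tensor names):

```python
# Q(s_t, a_t) = A(s_t, a_t) + V(s_t): per-position target ("return") for the value head
returns = advantages + values
# the value head is then regressed towards these returns, e.g. with a squared error
vf_loss = 0.5 * ((values_pred - returns) ** 2).mean()
```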

We will see later. So remember these returns we are doing here Okay. So now we have computed the advantages and the values Now we still are in the first phase. So we have sampled some trajectories from our model that we're trying to optimize We computed the rewards using the rewards.

We We also computed the log probabilities for each time step We also computed the advantages for each time step and we also computed the q values for each time step Which are used for the value head Now let's go to the second phase of the ppo algorithm, which is the phase two Which means that we take some mini-batch from these trajectories We optimize the model based on the estimated gradient We do it with many steps and then again, we sample new trajectories We sample some mini-batches.

We optimize the model according to the loss We do it many times and then again, we sample new trajectories So let's go back to our step function So We are here Okay, so we computed the the advantages Now we can use the sampled trajectories to optimize the model. So what do we do?

We sample some mini-batches; this is the mini-batch that we are sampling, as you can see here. Then, as we saw in the formula of PPO, we need the log probabilities according to the model we sampled from, which is this pi old, and also according to the model we are trying to optimize on these sampled mini-batches. This is exactly the off-policy setup we saw before: we sample from some policy, the offline policy, and we keep its trajectories and log probabilities; then we take a mini-batch from these trajectories and run gradient ascent on the online policy, the one we are trying to optimize, for which we also need the log probabilities. And this is exactly what we do here: we run again the method we ran before, the forward pass that calculates the log probabilities, the logits and the value head prediction for the mini-batch we are considering, and then we train the model on this mini-batch.
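As a rough sketch of what this forward pass computes (my own simplified version, not the exact HuggingFace method; `model` and `input_ids` are assumed names), the per-token log probabilities can be recovered from the logits like this:

```python
import torch
import torch.nn.functional as F

def logprobs_from_logits(logits, labels):
    # logits: (batch, seq_len, vocab_size), labels: (batch, seq_len)
    logp = F.log_softmax(logits, dim=-1)
    # pick the log probability of the token that was actually sampled
    return torch.gather(logp, dim=-1, index=labels.unsqueeze(-1)).squeeze(-1)

# the logits at position t predict token t+1, so we shift by one to align them
# outputs = model(input_ids)                      # assumed causal LM call
# logprobs = logprobs_from_logits(outputs.logits[:, :-1, :], input_ids[:, 1:])
```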

Let's see how it's done. The first thing we need to do is calculate the loss of PPO according to the formula that we saw on the slides, so let's go into the loss. In the loss we have to calculate three losses. The first is the loss for the value head, which is this loss here. HuggingFace is actually also calculating a clipped version of it, but let's not consider the clipped loss for now; it's just an optimization and it doesn't have to be there in vanilla PPO.

So we take the values that were predicted by the model and the returns that we calculated before as the sum of the advantages plus the values, and this gives us the loss for the value head, according to this formula here.
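Ignoring the extra clipping, a minimal sketch of this value loss (with `vpreds` the value-head predictions on the mini-batch and `returns` the advantages-plus-values targets, illustrative names) is just a scaled squared error:

```python
# simple (unclipped) value-head loss against the Q-value targets
vf_loss = 0.5 * ((vpreds - returns) ** 2).mean()
```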

As you can see, these are basically the estimated Q values according to our trajectories, and this is the loss of the value head. Then we have the PPO policy loss, which is the advantage term multiplied by the ratio of the probabilities. What is this ratio of probabilities?

Let's go to the formula first. As you can see, the formula wants the ratio of the two probabilities, but what we have are the log probabilities. So we can compute the log of a minus the log of b and then take the exponential: exp(log a - log b) = exp(log(a / b)) = a / b, which is exactly the ratio of the two probabilities.

So because we do not have the probabilities themselves but only their logs, we calculate it like this: we take the log probabilities of the online model minus the log probabilities of the offline model and then apply the exponential, which gives us a divided by b, which is exactly what we want here.
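In code this is a single line; assuming `logprobs` come from the online model and `old_logprobs` from the offline model (illustrative names):

```python
# pi_new(a|s) / pi_old(a|s), recovered from the log probabilities
ratio = torch.exp(logprobs - old_logprobs)
```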

So let's check: in the code we calculate the difference of the log probabilities and apply the exponential, which gives us this ratio. This ratio is then multiplied by the advantage term, as you can see here. We also need the other part of the expression, the clipped term: again the ratio multiplied by the advantage, but with the ratio clipped between one minus epsilon and one plus epsilon, and that is done here.

So the advantage is multiplied by the ratio clipped between one minus epsilon and one plus epsilon. Why do we have this minus sign here? Because the goal in PPO is to maximize this term, but we are using PyTorch, and the PyTorch optimizers always run gradient descent, which is the opposite of gradient ascent.

So instead of maximizing this objective we minimize its negative, and this is exactly why we have this minus sign: because PyTorch always minimizes, multiplying by minus one is like maximizing the original term. The entropy is calculated here, as you can see, and there are also other terms that we do not use; they are extra optimizations that are not present in the vanilla PPO loss. So the PPO loss is calculated as the policy loss plus the value head loss multiplied by its coefficient, which you can see here.
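Putting the pieces together, here is a minimal sketch of the negated clipped objective and the combined loss, with `cliprange` and `vf_coef` as illustrative hyperparameters and `advantages`, `ratio`, `vf_loss` as in the sketches above:

```python
# negate because PyTorch optimizers minimize; the max of the two negated terms
# corresponds to the min in the original PPO objective
pg_losses = -advantages * ratio
pg_losses_clipped = -advantages * torch.clamp(ratio, 1.0 - cliprange, 1.0 + cliprange)
pg_loss = torch.max(pg_losses, pg_losses_clipped).mean()

# total loss: policy loss plus the value-head loss scaled by its coefficient
loss = pg_loss + vf_coef * vf_loss
```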

So, loss policy. They also calculate the entropy, but they do not use it, and to be honest I don't know why. They calculate the entropy from the logits, as you can see, and they do it not with the formula that I show in the slides, which is the textbook formula of the entropy, but with an optimized version based on log-sum-exp; I am putting here some information for those who want the derivation of how it's done.
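For reference, here is a small sketch of that log-sum-exp form of the entropy computed from the logits; it is equivalent to -sum(p * log p), just more numerically stable:

```python
# H = logsumexp(logits) - sum(softmax(logits) * logits)
pd = torch.softmax(logits, dim=-1)
entropy = (torch.logsumexp(logits, dim=-1) - (pd * logits).sum(dim=-1)).mean()
```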

Basically, Wikipedia says that the convex conjugate of the log-sum-exp is the negative entropy. So we also have this entropy term here, and then we return our loss. Let's go back to the optimization step. Now we are optimizing over a mini-batch, which means that the first thing we do is calculate the loss, then we run back-propagation on this loss, then we take an optimizer step, and we do it for many mini-batches, as you can see here. After a while we return here and we run the whole procedure again.
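Schematically, this inner optimization phase boils down to the following loop (a sketch with assumed names: `ppo_epochs`, `minibatches`, `compute_loss`, `optimizer`):

```python
for epoch in range(ppo_epochs):
    for minibatch in minibatches:          # mini-batches drawn from the sampled trajectories
        loss = compute_loss(minibatch)     # the PPO loss built above
        optimizer.zero_grad()
        loss.backward()                    # back-propagation
        optimizer.step()                   # one gradient step
# afterwards, new trajectories are sampled and the whole step repeats
```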

So we generate new trajectories again, we compute the rewards, and the HuggingFace library calculates the log probabilities, the advantage estimates and the value estimates according to these trajectories. Then we iteratively sample mini-batches from these trajectories and run gradient ascent on the PPO loss many times, and then we restart the loop. This is how we run the PPO algorithm for reinforcement learning from human feedback. Let's go back to the slides. Thank you guys for watching this video; I know it has been very demanding. It has been one of my most difficult videos, also for me, to describe all these parts without getting lost myself. I know that I gave you a lot of material, because PPO and reinforcement learning are quite big topics.

There are entire university courses on this stuff, so it's not easy to give a complete understanding in just a few hours. This is also one of the reasons I decided not to code it from scratch, because it would make the video something like 10 hours long, but at least I hope that you now have a deep understanding of how each step of reinforcement learning from human feedback is done.

I will share with you the code commented by me, with the unnecessary parts removed or with comments telling you explicitly which parts are not necessary for the PPO algorithm. It took me more than one month of research to prepare this video, and I had to record it multiple times because I made some mistakes, realized I had forgotten something in the slides, had to fix them, etc. So the best way to help me, guys, is to share this video with others if you found it useful. I know that it's very difficult, so I suggest watching it multiple times: the first time you will gain some understanding, but not a very deep one; the second time you will have a better understanding, and maybe you will need to review some concepts from reinforcement learning or from the transformer to understand it fully. So I recommend watching it multiple times, and please leave a comment if some part was not clear; I will always try to help you. And yeah, have a nice day.