
Reinforcement Learning from Human Feedback explained with math derivations and the PyTorch code.


Chapters

0:00 Introduction
3:52 Intro to Language Models
5:53 AI Alignment
6:48 Intro to RL
9:44 RL for Language Models
11:01 Reward model
20:39 Trajectories (RL)
29:33 Trajectories (Language Models)
31:29 Policy Gradient Optimization
41:36 REINFORCE algorithm
44:08 REINFORCE algorithm (Language Models)
45:15 Calculating the log probabilities
49:15 Calculating the rewards
50:42 Problems with Policy Gradient Optimization: variance
56:00 Rewards to go
59:19 Baseline
62:49 Value function estimation
64:30 Advantage function
70:54 Generalized Advantage Estimation
79:50 Advantage function (Language Models)
81:59 Problems with Policy Gradient Optimization: sampling
84:08 Importance Sampling
87:56 Off-Policy Learning
93:02 Proximal Policy Optimization (loss)
100:59 Reward hacking (KL divergence)
103:56 Code walkthrough
133:26 Conclusion


00:00:00.000 | Hello guys, welcome back to my channel. Today we are going to talk about reinforcement learning from human feedback and PPO
00:00:04.800 | So reinforcement learning from human feedback is a technique that is used to align the behavior of a language model to what we want
00:00:11.440 | The language model to output for example
00:00:13.440 | We don't want the language model to use curse words or we don't want the language model to behave in an impolite way to the user
00:00:19.440 | So we need to do some kind of alignment and reinforcement learning from human feedback is one of the most famous techniques
00:00:24.960 | Even if there are now newer techniques like DPO, which I will talk about in another video
00:00:30.400 | Now reinforcement learning from human feedback is also how they created ChatGPT
00:00:34.560 | So how they aligned ChatGPT to the behavior they wanted
00:00:38.320 | The topics of today are: first, I will introduce a little bit the language models, how they are used and how they work
00:00:45.280 | Then we will talk about the topic of ai alignment why it's important
00:00:49.780 | And later we will do a deep dive into reinforcement learning from human feedback in particular
00:00:54.960 | I will introduce first of all, what is reinforcement learning then I will describe all the setup of the reinforcement learning
00:01:01.200 | So the reward model, what trajectories are; in particular, we will see the policy gradient optimization and we will derive the algorithm
00:01:08.260 | We will see also the problems with it: how to reduce the variance, the advantage estimation, importance sampling, off-policy learning, etc, etc
00:01:15.280 | The goal for today's video is actually to derive the loss of PPO
00:01:20.400 | So I don't want to just throw the formula at you. I want to actually derive step by step the whole
00:01:25.680 | PPO algorithm and also show you all the history that led to it
00:01:30.640 | So what were the problems that PPO was trying to solve, from a mathematical point of view?
00:01:37.280 | In the final part of the video, we will go through the code of an actual implementation of reinforcement learning from human feedback with PPO
00:01:45.760 | I will actually not write the code from scratch
00:01:49.200 | I will instead explain the code line by line and in particular I will show the
00:01:52.800 | Implementation as done by the HuggingFace team
00:01:55.680 | So I will not show you how to use the HuggingFace library to use reinforcement learning from human feedback
00:02:01.680 | But we will go inside the code of the HuggingFace library and see how it was implemented by the HuggingFace team
00:02:07.840 | This way we can combine the theory that we have learned with practice
00:02:12.400 | Now the code written by the HuggingFace team is kind of obscure and complex to understand
00:02:17.360 | So I deleted some parts and I also added my own comments to some other parts that were not easy to understand. This way
00:02:24.080 | I hope to make it easier for everyone to follow the code
00:02:27.760 | Now there are some prerequisites before watching this video
00:02:30.880 | First of all, I hope that you have some notions of probability and statistics. Not much: at least you know what an expectation is
00:02:37.060 | We also need some
00:02:41.440 | Knowledge from deep learning, for example gradient descent, what a loss function is
00:02:44.800 | And the fact that in gradient descent we calculate some kind of gradient, etc
00:02:49.200 | We need to have some basic knowledge of reinforcement learning even if I will review most of it
00:02:54.880 | So at least you know, what is an agent, the state, the environment and the reward
00:02:58.640 | One important aspect of this video is that we will be using the transformer model a lot
00:03:04.160 | So I recommend you watch my previous video on the transformer
00:03:06.960 | If you're not familiar with the concept of self-attention or the causal mask, which will be key to understanding this video
00:03:14.000 | So the goal of this video is actually to combine theory with practice
00:03:18.640 | So I will make sure that I will always kind of give an intuition to formulas that are complex
00:03:24.240 | And don't worry if you don't understand everything at the beginning
00:03:28.240 | Why? Because I will be giving a lot of theory at the beginning because later I will be showing the code
00:03:34.240 | I cannot show the code without giving the theoretical knowledge
00:03:37.600 | So don't be scared if you don't understand everything because when we will look at the code
00:03:41.760 | I will go back to the theory line by line so that we can combine
00:03:45.680 | You know the practical and the theoretical aspect of this knowledge. So let's start our journey
00:03:53.360 | What is a language model?
00:03:54.560 | First of all a language model is a probabilistic model that assigns probabilities to sequence of words in particular
00:04:01.120 | A language model allows us to compute the probability of the next token given the input sequence
00:04:06.960 | In particular, for example, if we have a prompt that says shanghai is a city in
00:04:12.320 | What is the probability that the next word is china?
00:04:15.680 | Or what is the probability that the next word is beijing or cat or pizza?
00:04:19.360 | This is the kind of probability that the language model is modeling
00:04:23.040 | Now in my treatment of language models
00:04:26.400 | I always make a simplification which is that each word is a token and each token is a word
00:04:31.520 | This is not always the case because it depends on the tokenizer that we are using, and actually in most cases it's not
00:04:37.520 | like this
00:04:39.360 | But for simplicity, we will always consider for the rest of the video that each word is a token and each token is a word
00:04:44.960 | Now you may be wondering how can we use the language models to generate text?
00:04:50.480 | well, we do it iteratively which means that if we have a
00:04:53.600 | Prompt for example a question like where is shanghai then we ask the language model
00:04:58.640 | What is the next token and for example greedily we select the token with the most probability
00:05:03.540 | So we select for example the word shanghai then we take this word shanghai. Let me use the laser
00:05:09.200 | We put it back into the input and we ask again the language model
00:05:12.640 | What is the next token and the language model will tell us what are the probability of the next token and we select the one
00:05:18.000 | that is more probable suppose it's the word is
00:05:20.240 | We take it and we put it back in the input and again we ask the language model
00:05:24.480 | What is the next token suppose the next token is in
00:05:27.360 | We take it we put it back in the input and we ask again the language model
00:05:31.440 | What is the next token, etc, until we reach a certain number of generated tokens
00:05:36.100 | Or we believe that the answer is complete
00:05:39.040 | So in this case we can stop for example
00:05:41.360 | Because we can see that "Shanghai is in China" is the answer generated by the language model
00:05:46.560 | So this is an iterative process of generating text with the language model and all language models actually work like this
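A minimal sketch of this iterative, greedy decoding loop, assuming a Hugging Face-style causal language model (the "gpt2" checkpoint and the 10-token limit are just illustrative choices):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; any causal LM works the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Where is Shanghai?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

for _ in range(10):  # generate at most 10 new tokens
    logits = model(input_ids).logits                            # (1, seq_len, vocab_size)
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy choice: most probable token
    input_ids = torch.cat([input_ids, next_token], dim=-1)      # feed it back into the prompt
    if next_token.item() == tokenizer.eos_token_id:
        break

print(tokenizer.decode(input_ids[0]))
```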
00:05:52.800 | Now, what is the topic of AI alignment?
00:05:57.280 | A language model is usually pre-trained on a vast amount of data
00:06:02.320 | which means that it has been pre-trained on billions of web pages or the entire of wikipedia or
00:06:08.000 | thousands of books
00:06:10.160 | This gives the language model a lot of knowledge from which it can retrieve
00:06:15.920 | And it can learn to complete a prompt in a reasonable way
00:06:19.360 | However, this does not teach the language model to behave in a particular way. For example
00:06:24.640 | Just by pre-training we do not teach the language model to not use offensive language or to not use
00:06:30.560 | racist expressions or to not use curse words
00:06:33.520 | To do this and to create for example a chat assistant that is friendly to the user
00:06:40.400 | We need to do some kind of alignment
00:06:42.720 | So the topic of ai alignment is to align the model's behavior with some desired behavior
00:06:48.340 | Let's talk about reinforcement learning. So reinforcement learning is an area of artificial intelligence that is concerned with training an
00:06:55.920 | Intelligent agent to take actions in an environment in order to maximize some reward that it receives from the environment
00:07:04.080 | Let me give you a concrete example
00:07:06.480 | So imagine we have a cat that lives in a very simple world
00:07:10.080 | Suppose it's a room made up of many grids and this cat can move from one cell to another
00:07:16.720 | now in this case our agent is the cat and this agent has a state and
00:07:22.720 | Which describes for example the position of this agent
00:07:26.880 | In this case the state of the cat can be described by two variables
00:07:31.680 | One is the x coordinate and one is the y coordinate of the position of this cat
00:07:37.360 | Based on the state the cat can choose to do some actions
00:07:40.640 | Which could be for example to move down, move left, move right or move up
00:07:45.680 | Based on the state the cat can take some actions and every time the cat takes some action
00:07:52.000 | It will receive some reward from the environment. It will for sure move to a new position
00:07:57.140 | And at the same time will receive some reward from the environment
00:08:01.140 | And the reward is according to this reward model
00:08:04.800 | So if the cat moves to an empty cell it will receive a reward of zero
00:08:08.800 | If it moves to the broom, for example
00:08:10.800 | It will receive a reward of -1 because my cat is scared of the broom
00:08:14.800 | if somehow after a series of states and actions the cat arrives to the
00:08:19.040 | Bathtub it will receive a reward of -10 because my cat is super scared of water
00:08:25.760 | However, if the cat somehow manages to arrive to the meat it will receive a big reward of +100
00:08:33.120 | How should the cat move?
00:08:34.800 | Well, there is a policy that tells what is the probability of the next action given the current state
00:08:42.160 | So the policy describes for each position. So for each state of the cat
00:08:46.880 | With what probability the cat should move up or down or left or right?
00:08:52.160 | And then the agent can choose to either choose an action randomly
00:08:56.320 | Or it can choose to select the action with the highest probability, for example, which is a greedy strategy, etc etc
00:09:03.040 | now the goal of reinforcement learning is to
00:09:05.680 | Learn, that is to optimize, a policy
00:09:09.060 | Such that we maximize the expected return when the agent acts according to this policy
00:09:16.640 | which means that we should have a policy that with very high probability takes us to the meat because that's one way to
00:09:23.600 | Maximize the expected return in this case
00:09:28.080 | Now you may be wondering okay the cat I can see it as a reinforcement learning agent and the reinforcement learning setup
00:09:34.560 | Makes sense for the cat and the meat and all these rewards
00:09:37.840 | But what is the connection between reinforcement learning and language models? Let's try to clarify this
00:09:45.680 | You can think of the language model as a policy itself
00:09:49.520 | So as we saw before the policy is something that given the state
00:09:53.440 | Tells you what is the probability of the action that you should take in that state
00:09:58.480 | In the case of the language model. We know that the language model tells you given a prompt
00:10:02.880 | What is the probability of the next token?
00:10:05.680 | So we can think of the prompt as the state and the next token as the action that the language model can choose to perform
00:10:12.640 | Which will lead to a new state because every time we sample a next token
00:10:17.040 | We put it back into the prompt then we can ask the language model again. What is the next next token etc
00:10:22.800 | So as you can see we can think of the language model as the reinforcement learning agent itself and also as the policy itself
00:10:29.040 | in which the state is the prompt and the action is the
00:10:32.880 | Next token that the language model will choose according to some strategy which could be the greedy one
00:10:38.080 | Which could be the top p or the top k or etc, etc
00:10:41.120 | The only thing that we are missing here is the reward model
00:10:44.800 | How can we reward the language model for good responses and how can we kind of
00:10:50.860 | Penalize the language model for bad responses? This is
00:10:55.340 | Done through a reward model that we have to build. Let's see how
00:10:59.980 | Okay, imagine we want to create a reward model for our language model, which will become our
00:11:05.820 | Reinforcement learning agent. Now to reward the model for generating a particular answer for questions
00:11:13.580 | We could create a dataset like this of questions and answers generated by the model
00:11:19.660 | For example, imagine we ask the model where is Shanghai: the language model could say, okay, Shanghai is a city in China. We should
00:11:26.140 | Give some reward to this answer. So how good this answer is?
00:11:30.540 | Now in my case, I would give it a high reward because I believe that the answer is short and to the point
00:11:37.340 | But some other people may think that this answer is too short. So they maybe want
00:11:41.420 | They prefer an answer that is a little longer or in this case, for example
00:11:46.380 | What is two plus two suppose that our language model only says the word four
00:11:49.980 | Now, in my case, I believe this answer is too short
00:11:53.820 | So it could be a little more elaborate, but some other people may think that this answer
00:11:57.980 | Is good enough. Now, what kind of reward should we give to this answer or to that answer? As you can see
00:12:05.020 | It's not easy to come up with a number that can be accepted by everyone
00:12:09.660 | So us humans are not very good at finding a common ground for agreement
00:12:14.380 | But unfortunately, we are very good at comparing so we will exploit this fact to create our data set for training our reward model
00:12:22.060 | So what if instead of generating one answer we could generate multiple answers using the same language model
00:12:28.940 | This can be done for example by using a high temperature and then we can ask a group of people
00:12:34.940 | So expert labelers experts in this field to choose which answer they prefer
00:12:41.900 | and having this data set of
00:12:44.460 | Preferences we can then create a model that will generate a numeric reward for each question and answer
00:12:52.860 | So first we create a data set of questions
00:12:57.340 | Then we ask the language model to generate multiple answers for the same question
00:13:00.940 | For example by using a high temperature and then we ask people to choose which answer they prefer
00:13:06.140 | Now our goal is to create a neural network, which will act as a reward model
00:13:11.180 | so a model that given a question and an answer will generate a numeric value
00:13:16.780 | in such a way that the
00:13:19.740 | Answer that has been chosen should have a high reward and the answer that has not been chosen
00:13:25.100 | Which is something that we don't like should have a low reward. Let's see how it is done
00:13:29.820 | What we do in practice is that we take a pre-trained language model
00:13:34.620 | For example, we can take the pre-trained llama and we feed the language model the question and answer
00:13:41.180 | So the input tokens here you can see are the questions and the answer concatenated together
00:13:46.700 | We give it to the language model as input the language model. It's a transformer model
00:13:51.820 | So it will generate some output embeddings. These are called hidden states
00:13:56.140 | So as you know, the input are the tokens which are converted into embeddings
00:14:00.940 | Then the positional encoding then we feed it to the transformer layer
00:14:03.980 | The transformer layer will actually output some embeddings which are called hidden states
00:14:09.020 | And usually for text generation, we take the last hidden state
00:14:14.060 | We send it to some linear layer which will project it into the vocabulary
00:14:18.300 | Then we use the softmax and then we select the next token
00:14:21.180 | But instead of selecting because here we are we do not want to generate a text
00:14:25.820 | We just want to generate a numeric reward
00:14:28.540 | We can substitute the linear layer that is projecting the last hidden state into the vocabulary
00:14:34.940 | But instead we replace it with another linear layer with only one output feature
00:14:39.180 | So that it will take an input embedding as input and generate only one value as output
00:14:45.020 | Which will be the reward assigned to the answer for the particular given question
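A minimal sketch of this reward-model architecture in PyTorch, assuming a Hugging Face-style backbone; the class and variable names are illustrative, not the actual HuggingFace implementation:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    def __init__(self, base_name: str = "gpt2"):  # illustrative backbone
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_name)   # transformer without the LM head
        hidden_size = self.backbone.config.hidden_size
        self.value_head = nn.Linear(hidden_size, 1)            # single output feature: the reward

    def forward(self, input_ids, attention_mask=None):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        last_hidden = hidden[:, -1, :]                          # hidden state of the last token
        return self.value_head(last_hidden).squeeze(-1)         # one scalar reward per sequence
```

The input is the question and the answer concatenated into one token sequence, and the output is a single number.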
00:14:54.620 | Of course this is the architecture of the model. We also need to train it
00:14:58.700 | So we also need to tell this model that it has to generate a high reward for answers that are chosen
00:15:04.380 | And low reward for answers that are not chosen
00:15:07.420 | Let's see what is the loss function that we will use to train this model
00:15:11.100 | The loss function that we will be using is this one here
00:15:14.780 | So you can see it's minus the log of the sigmoid of the reward assigned to the good answer
00:15:21.740 | Minus the reward assigned to the bad answer
00:15:26.860 | Let's analyze this loss function here. So
00:15:29.580 | Pen, okay
00:15:32.300 | So there are two possibilities: either this difference here
00:15:35.420 | Is negative or it is positive
00:15:40.140 | So how do we train it?
00:15:44.140 | First of all basically because our data set is made up of questions and possible answers
00:15:49.100 | I suppose there are only two possible answers. One is a good one. And one is the bad one
00:15:53.100 | We take each question and its answers
00:15:56.140 | We feed the question to the model along with the answer concatenated to it and the model will generate some reward
00:16:03.500 | We do it for the good answer and also for the bad answer
00:16:09.740 | And it will generate two rewards. Suppose this is the reward for the good one, so let's write "good one"
00:16:17.180 | And this is the reward associated with the bad one
00:16:19.900 | Now either the model assigned a high reward to the good one and a low reward to the bad one
00:16:27.900 | So this difference will be positive and this is good
00:16:31.020 | So in this case the loss will be like this
00:16:33.500 | So if the reward given to the good answer is higher than the reward given to the bad answer
00:16:41.100 | This difference will be positive. So let's see the sigmoid function. How does it behave when the input is positive?
00:16:47.100 | So when the input is positive the sigmoid gives an output value that is between 0.5
00:16:52.000 | And one, so this stuff here will be between 0.5
00:16:57.440 | And one (you can think of the difference as being inside a parenthesis)
00:17:05.820 | When the log sees an input that is between 0.5 and 1, it will generate a negative number
00:17:11.820 | That is between roughly -0.69 (the log of 0.5) and 0
00:17:17.260 | With the minus sign here, it will become a positive number between 0 and roughly 0.69
00:17:21.900 | So the loss in this case will be small
00:17:40.220 | However, let's see if the model
00:17:42.540 | Gave a high score to the bad response and the low score to the good response. So let's start again
00:17:54.380 | Okay, here is the bad
00:17:57.740 | Response and here is the good response
00:18:00.080 | Now what happens if this value here is smaller than this value here
00:18:06.620 | So this difference will be negative when the sigmoid receives as input something that is negative
00:18:12.880 | It will return an output that is between 0 and 0.5
00:18:17.440 | The log when it sees an input that is between 0 and 0.5
00:18:23.660 | So more or less here it will return a negative number that is between minus infinity and roughly -0.69
00:18:32.140 | And because there is a minus sign here, the loss will become a very big positive number
00:18:37.420 | So the loss in this case will be big so big loss
00:18:41.260 | Here it was a small loss
00:18:45.340 | small
00:18:50.620 | Now as you can see, when the reward model is giving a high reward to the good answer and
00:18:57.660 | A low score to the bad answer, the loss is small. However, when the reward
00:19:04.140 | Model gives a high reward to the bad answer and a low score to the good answer, the loss is very big
00:19:11.980 | What does this mean for the model? It will force the model to always give
00:19:19.260 | High rewards to the winning response and low reward to the losing response
00:19:24.140 | So it because that's the only way for the model to minimize the loss because the goal of the model always during training is to minimize
00:19:31.180 | The loss so the model will be forced to give
00:19:33.420 | High reward to the chosen answer and the low reward to the not chosen answer or the bad answer
00:19:39.420 | In Hugging Face, this reward model is implemented in the RewardTrainer class
00:19:48.380 | So if you want to train your own reward model, you need to use this RewardTrainer class and it will take as input
00:19:55.020 | An AutoModelForSequenceClassification, which is exactly this architecture here
00:20:00.300 | So it's a transformer model with instead of having the linear layer that projects into the vocabulary
00:20:05.420 | It has a linear layer with only one output feature that gives the reward
00:20:09.660 | And if you look at the code on how this is implemented in the hugging face library
00:20:14.300 | You will see that they first generate the reward for the chosen answer
00:20:19.500 | So for the good answer, then they generate the reward for the bad answer. So for the rejected response here, it's called
00:20:27.020 | And then they calculated the loss exactly using the formula that we saw
00:20:31.020 | So the log sigmoid of the rewards given to the chosen one minus the rewards given to the rejected one
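A sketch of that pairwise loss, assuming r_chosen and r_rejected are the scalar rewards produced by a reward model like the one sketched earlier (this mirrors the formula from the video, not the HuggingFace code verbatim):

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # loss = -log(sigmoid(r_chosen - r_rejected)), averaged over the batch
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```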
00:20:38.460 | Let's talk about trajectories now
00:20:41.900 | Now as I said previously in reinforcement learning the goal is to select a policy or to optimize a policy
00:20:48.060 | That maximizes the expected return of the agent when the agent acts according to this policy
00:20:54.700 | More formally we can write it as follows that we want to select a policy pi
00:20:59.740 | That gives us the maximum expected reward when the agent acts according to this policy pi
00:21:07.980 | Now, what is the expected return? The expected return of the policy is the
00:21:12.620 | Expected return over all possible trajectories that the agent can have when using this policy
00:21:18.860 | So it's the expected return over all possible trajectories as you know
00:21:24.220 | The expectation can also be written as an integral. So it is the probability of the
00:21:28.940 | Particular trajectory using this policy multiplied by the return over that particular trajectory
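In symbols (a standard way of writing what is described here), the objective is:

$$\max_{\theta}\; J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big] = \int P(\tau \mid \theta)\, R(\tau)\, d\tau$$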
00:21:35.980 | Now, what is a trajectory first of all and later we will see what is the probability of a trajectory
00:21:40.860 | So the trajectory is a series of states and actions
00:21:44.620 | Which means that a trajectory you can think of in the case of the cat as a path that the cat can take
00:21:51.980 | Suppose that each of the trajectories has a maximum length. So we don't want the
00:21:56.460 | agent to perform more than 10 steps
00:22:01.340 | To arrive at its goal. Now the cat can go to the meat for example using this path here or it can choose this path here
00:22:08.860 | Or it can use this path here or this one here or for example
00:22:12.700 | It can go forward and then go backward and then stop because it has already used the 10 steps
00:22:17.580 | Or it can go like this etc etc. So there are many, many paths. What we want is
00:22:24.300 | to find a policy that
00:22:28.100 | Maximizes the expected return so the return that we get along each of these paths
00:22:33.880 | Now, we will also model
00:22:39.800 | The next state of the cat as being stochastic. So first of all, let's introduce what these states and actions are
00:22:46.840 | So let me give you an example
00:22:48.920 | Suppose that our cat is starting from some state s0 which is the initial state
00:22:54.680 | The policy tells us what is the next action that we should take given the state
00:22:59.080 | So the cat will ask the policy
00:23:00.920 | What is the next action that it should take?
00:23:03.080 | And because the policy is stochastic, it will tell us what is the probability of the next action. So
00:23:09.800 | Just like in the case of the language model, given a prompt we get the probability of the next token
00:23:16.680 | So imagine that the policy tells us that the cat should move down
00:23:21.240 | So action down for example with very high probability or it should move right with lower probability
00:23:28.760 | It should move left with even lower probability or it should move up with an even lower probability
00:23:34.380 | Suppose that we select to move down it will result in a new state
00:23:38.920 | That may not be exactly this one. Why? Because we model the
00:23:44.760 | Cat as being drunk, which means that the cat
00:23:47.640 | Wants to move down but may not always move down and we will see later why this is helpful
00:23:54.600 | But another case could be for example
00:23:56.680 | Imagine we have a robot and the robot wants to move down but the wheels of the robot are broken
00:24:03.080 | So the robot will not actually move down. It will remain in the same state
00:24:06.760 | So we always model the next state not as being deterministically determined
00:24:12.280 | But as being stochastic given the current state and the action that we choose to perform
00:24:17.240 | So imagine that we choose to perform the action down
00:24:20.600 | The cat may arrive to a new state s1 which will be according to some probability distribution
00:24:26.380 | Then we can ask again the policy. What is the next action I should do?
00:24:29.800 | Policy might say okay
00:24:31.320 | You should move right with very high probability and you should move down with a lower probability or you should move left with an even lower
00:24:38.920 | Probability etc etc. So as you can see, we are creating a trajectory which is a series of states and actions
00:24:44.840 | Which define how our cat will move in a particular trajectory
00:24:54.120 | Let's see
00:24:55.080 | Now, what is the probability of a trajectory? The probability of a trajectory as you can see here
00:24:59.960 | The fact that we chose a particular action depends only on the state we were in
00:25:05.960 | And the fact that we arrived to this state here depended on the state we were in and the action that we have chosen
00:25:12.040 | and then
00:25:13.880 | The fact that we have chosen this action here depended only on this state
00:25:17.640 | We were in because the policy only takes as input the state and gives us what is the probability of the action that we should take
00:25:23.720 | So we can because they are independent from each other these events
00:25:28.600 | We can multiply them together to get the probability of the trajectory
00:25:32.840 | So the probability of the trajectory is the probability of starting from a particular starting point. So from this state zero here
00:25:39.400 | Then for each step that we have taken, so for each action-state pair of this particular trajectory
00:25:45.500 | We have the probability of choosing
00:25:47.880 | The action given the state, and then of arriving at a new state
00:25:52.840 | Given that we were at state s_t at time step t and we chose action a_t at time step t
00:26:00.760 | And we multiply all these probabilities together because they are independent from each other
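Written out (with ρ₀ denoting the distribution of the starting state), the probability of a trajectory τ = (s₀, a₀, s₁, a₁, ...) is:

$$P(\tau \mid \theta) = \rho_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t)\,\pi_\theta(a_t \mid s_t)$$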
00:26:05.160 | Another thing that we will consider is
00:26:09.400 | How do we calculate the reward of a trajectory?
00:26:13.100 | A very simple way to calculate the reward of a trajectory is to just sum all the rewards that we get along this trajectory
00:26:19.800 | For example, imagine the cat to arrive to the meat follows this trajectory. You could say that the reward is zero here
00:26:26.360 | So it's zero zero zero zero zero zero zero and then suddenly it becomes plus 100 when we reach the meat
00:26:33.880 | If the cat for example follows this path here
00:26:36.760 | We could say okay, it will receive minus one because the cat is scared of the broom then zero zero zero zero zero one hundred
00:26:43.880 | Actually, this is not how we will calculate the reward of a trajectory
00:26:48.280 | We will actually calculate the reward as a discounted which means that we prefer immediate rewards instead of future rewards
00:26:55.560 | To give you an intuition in why this happens. First. Let me talk about money
00:26:59.560 | So if I give you ten thousand dollars today
00:27:02.120 | You prefer receiving it today instead of receiving it in one year
00:27:05.800 | Why because you could put the ten thousand dollars in the bank. It will generate some interest
00:27:10.360 | So at the end of the year, you will have more than ten thousand dollars
00:27:13.560 | And in the case of reinforcement learning, this is helpful also for another case
00:27:17.640 | For example, imagine the cat can only take 10 steps to arrive to the meat
00:27:22.920 | Or 20 steps. So one way for the cat to arrive to the meat is to just go directly to the meat like this
00:27:28.840 | And this is one trajectory
00:27:30.840 | But another way for the cat is to go like this
00:27:33.320 | For example, go here then go here then go here then go here and then go here
00:27:37.480 | So in this case, we prefer the cat to go directly to the meat instead of
00:27:42.520 | Taking this longer route. Why? Because we modeled the next state as being stochastic
00:27:48.520 | And if we take a longer route the probability of ending up in one of these obstacles is higher the longer the route is
00:27:56.040 | So we prefer having shorter routes in this case
00:28:01.080 | And this is also convenient from a mathematical point of view to have this discounted rewards
00:28:06.780 | Because this series which is infinite in some cases, okay, we will not work with infinite series
00:28:14.200 | but it's helpful because this series can converge if this
00:28:18.280 | Element of the series is becoming smaller and smaller and smaller
00:28:24.520 | So let me give you a practical example of how to calculate the reward in a discounted case
00:28:29.880 | so imagine the cat starts from here and it goes to the
00:28:34.440 | follows this path so
00:28:37.640 | to calculate the reward of this trajectory
00:28:40.680 | We will do like this. So it is the reward received at the first step, which is
00:28:44.920 | Arriving at the broom, multiplied by gamma. So it will be gamma multiplied by minus one
00:28:54.360 | All these rewards are 0 0 0 so they will not be summed up
00:28:57.880 | And finally we arrive here at where the reward is plus 100 at time step 1 2 3 4 5 6 7 8
00:29:06.920 | So it will be gamma to the power of 8 multiplied by 100
00:29:10.780 | So gamma is usually chosen. Not usually. It's always something that is between 0 and 1
00:29:17.240 | So it's a number smaller than 1. So it means that we are decaying
00:29:21.100 | this reward by gamma to the power of 8 so it will be
00:29:25.640 | Smaller the longer we take to reach it. This is the intuition behind discounted rewards
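A tiny sketch of this discounted return for the example path, with an assumed gamma of 0.99; the reward indices follow the example above (the broom at the first step, the meat at the eighth):

```python
gamma = 0.99
# Reward received at each time step along the example trajectory:
# nothing at t=0, the broom (-1) at t=1, empty cells afterwards, the meat (+100) at t=8.
rewards = [0, -1, 0, 0, 0, 0, 0, 0, 100]

discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
print(discounted_return)  # 0.99 * (-1) + 0.99**8 * 100 ≈ 91.3
```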
00:29:31.960 | Now you may be wondering
00:29:36.120 | The trajectories make sense in the case of the cat
00:29:38.440 | So I can see that the cat will follow some path to arrive at the meat and it can take many paths to arrive at the
00:29:44.040 | Meat, so we know what a trajectory is in the case of the cat
00:29:47.720 | But what are the trajectories in case of language model?
00:29:50.520 | Well, as we saw before, we have a policy which is the language model itself
00:29:56.520 | So because the policy tells us given the state what is the next action and in the case of language model
00:30:01.800 | We can see that the language model itself is a policy and we want to optimize this policy such that it selects
00:30:09.320 | The next token in such a way as to maximize a cumulative reward
00:30:14.360 | According to the reward model that we have built before using the data set of preferences that I saw before
00:30:20.360 | Also in the case of the language model the trajectory is a series of states and actions
00:30:25.480 | What are the states in the case of the language model? They are the prompts. What are the actions? The next tokens
00:30:31.880 | So imagine we have a question like this for the language model: where is Shanghai?
00:30:36.280 | Of course, we will ask the language model. What is the next token which will this will become the initial prompt?
00:30:41.960 | So the initial state of the language model we will ask the language model
00:30:45.320 | What is the next token and that will become our action the token that we choose
00:30:49.960 | But then we feed it back to the language model. So it will become the new state of the language model
00:30:55.400 | And then we ask the language model again. What is the next token?
00:30:59.080 | It will be for example, the word is and this will become again the input of the language model
00:31:04.280 | So the next state and then we ask the language model again. What is the next token?
00:31:08.840 | For example, we choose the token in and then the concatenation of all these tokens will become the new state of the language model
00:31:15.880 | So we ask the language model again. What is the next token, etc, etc until we generate an answer
00:31:21.000 | So as you can see also in the case of the language model
00:31:23.720 | We have trajectories which are the series of prompts and the tokens that we have chosen
00:31:29.720 | Now imagine that we have a policy because we our goal is to optimize our language model
00:31:34.520 | Which is a policy such that we maximize a cumulative reward according to some reward model that we have built in the past
00:31:43.480 | Our more formally our goal is this so we want to maximize this function here
00:31:48.600 | Which is the expected return over all possible trajectories that our language model can generate
00:31:54.120 | And we also saw that before the trajectory is a series of prompts and next tokens
00:31:58.840 | Now when we
00:32:01.960 | Use stochastic gradient descent. So for example when we try to optimize the neural network
00:32:05.880 | We use stochastic gradient descent, which means that we have some kind of loss function
00:32:09.800 | We calculate the gradient of the loss function with respect to the parameters of the model
00:32:15.000 | And we change the parameters of the model such that we move against the direction of this gradient
00:32:21.800 | So we take little steps against the direction of the gradient to optimize the parameters of the model to minimize this loss function
00:32:29.400 | In our case, we do not want to minimize a loss function. We want to maximize a function which is here
00:32:35.720 | And this is can also be thought of as an objective function that we want to maximize
00:32:40.860 | So instead of using a gradient descent, we will use a gradient ascent
00:32:44.760 | The only difference between the two is that instead of having a minus sign here. We have a plus sign
00:32:50.760 | Now, this algorithm is called the policy gradient optimization
00:32:55.740 | And the point is we need to calculate the gradient of this
00:32:59.720 | Function here of our objective function. So what is the gradient with respect to the parameters of our model?
00:33:06.760 | So our language model
00:33:08.680 | what is the gradient of the
00:33:10.680 | Expected return over all possible trajectories with respect to the parameters of the model
00:33:16.920 | We need to find an expression of this gradient so that we can calculate it
00:33:21.560 | And use it to optimize the parameters of the model using gradient ascent
00:33:25.800 | Using also a learning rate alpha you can see here
00:33:29.000 | Now, let's see how to derive the expression of the gradient of this objective function that we have
00:33:35.640 | Now the gradient of the objective function is the gradient of this expectation
00:33:41.880 | so it's the expectation over all possible trajectory of multiplied by the
00:33:46.280 | The return over the particular trajectory
00:33:50.700 | As we know the expectation is also an integral
00:33:53.720 | So it can be written as the gradient of the integral of the probability of following a particular trajectory
00:33:59.900 | Multiplied by the return over this trajectory
00:34:03.020 | as you know from high school the gradient of a sum is equal to the sum of the gradients or the
00:34:11.720 | You may recall it as the derivative. So the derivative of a sum of a function is equal to the
00:34:17.480 | sum of the derivatives
00:34:20.040 | So we can bring this gradient sign inside and it can it can be written like this
00:34:25.400 | Now we will use a trick called the log derivative trick to expand this expression
00:34:31.080 | So p of tau given theta into this expression here. Let me show you how it works
00:34:37.400 | Let's use the pen
00:34:43.000 | You may recall also from calculus that the gradient
00:34:46.300 | with respect to theta of the log function of the log of a function in this case of
00:34:53.720 | p of tau given theta
00:34:57.640 | Is equal to so the gradient of the derivative of the log function is one over the function
00:35:06.200 | p of tau given theta multiplied by the gradient with respect to theta
00:35:12.840 | of the function that is inside the log so p of
00:35:16.040 | tau given theta
00:35:19.240 | We can take this term to the left side multiply it here and this expression here
00:35:25.800 | Will become equal to the this expression multiplied by this expression and this is exactly what you see here
00:35:32.760 | So we can replace this expression that we see in the equation above. So this expression
00:35:37.900 | with this expression we can see here in the
00:35:41.320 | equation below
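That is, the log-derivative trick being used is:

$$\nabla_\theta P(\tau \mid \theta) = P(\tau \mid \theta)\, \nabla_\theta \log P(\tau \mid \theta)$$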
00:35:45.720 | We can write this integral back as an expectation over all possible trajectories of this quantity here. Now
00:35:53.480 | Because the probability is only this term here
00:35:56.520 | So we can write it back as an expectation
00:36:01.000 | Now we need to expand this term here. So what is the gradient of the log?
00:36:05.560 | So this this expression here
00:36:08.440 | So what is the gradient of the log of probability of a particular trajectory given the parameters of the model?
00:36:15.320 | Let's expand it
00:36:17.000 | we saw before that the
00:36:19.000 | probability of a trajectory is just the product of all the
00:36:22.280 | Probabilities of the state actions that are in this trajectory. So the probability of starting from a particular state
00:36:29.540 | Multiplied by the probability of taking a particular action given the state we are in
00:36:34.740 | multiplied by the probability of ending up in a new state given that we started from
00:36:39.780 | The state at time step t and we took action at time step t
00:36:44.340 | And we do it for all the state actions that we have in this trajectory
00:36:48.840 | If we apply a log to this expression here, the product here will become a sum
00:36:57.060 | And let's do it actually. So we take the log of p of tau given theta (because we model our policy pi as parameterized by the parameter theta)
00:37:17.540 | It's equal to the log of all this expression. Since it's the log of a product of terms, it can be written as a sum of logs:
00:37:29.220 | $\log P(\tau \mid \theta) = \log \rho_0(s_0) + \sum_{t=0}^{T-1} \big[ \log P(s_{t+1} \mid s_t, a_t) + \log \pi_\theta(a_t \mid s_t) \big]$
00:37:48.180 | So the log of the probability of starting in s_0, plus, for each time step, the log of the probability of ending up in s_{t+1} given that we were in s_t and took action a_t, plus the log of the probability of the action a_t chosen by our policy given that we were in s_t
00:38:11.860 | Okay. Now we are also taking the gradient of this expression and as you can see here there is no term that depends on
00:38:20.020 | Theta so it can be deleted
00:38:23.220 | Also in this case, we do not have any
00:38:25.460 | Expression that in this expression here. We do not have anything that depends on theta. So this can be deleted
00:38:32.420 | Because the derivative of something that does not contain the variable we are
00:38:38.500 | Differentiating with respect to is zero, so it can be deleted
00:38:42.980 | So the only term surviving in the summation is only these terms here because it's the only one that contains the theta
00:38:50.260 | As you can see here. So in the final expression is this one here. So this summation now, let me delete
00:38:56.660 | So we have derived
00:39:00.820 | An expression that allows us to calculate the gradient of the objective function. Why do we need the gradient of the objective function?
00:39:07.940 | Because we want to run gradient ascent
00:39:09.940 | Now one thing that we can see here. We still have this expectation over all possible trajectories
00:39:18.260 | To calculate over all possible trajectories in the case of the cat
00:39:21.940 | It means that we need to calculate this gradient over all the possible paths that the cat can take of
00:39:27.460 | Length, for example, 10 steps. So if we want to model trajectories of only length 10
00:39:33.460 | It means that we need to calculate all the possible paths that the cat can take of length 10
00:39:38.340 | And it could be a huge number in the case of language model
00:39:41.540 | It's even bigger, because imagine we want to generate trajectories of size 100. It means considering
00:39:48.100 | All the possible texts that we can generate of size 100 tokens using our language model
00:39:54.660 | And for each of them we need to calculate the reward and the log action probabilities, which I will show later how to calculate
00:40:00.820 | Now as you can see the problem is this expectation is over a lot of terms
00:40:06.420 | So it's computationally intractable to calculate this expression, because we would
00:40:12.020 | Need to generate a lot of text with the language model. So one way to
00:40:17.540 | To calculate this expectation is to approximate it with the sample mean so we can always approximate
00:40:26.280 | An expectation with the sample mean so instead of calculating it over all the possible trajectories
00:40:33.300 | We can calculate it over some trajectories. So in the case of the cat it means that we
00:40:37.380 | Take the cat and we ask it to move using the policy for some number of steps
00:40:44.020 | And this will generate one trajectory
00:40:46.900 | We do it many times and it will generate some trajectories in the case of the language model
00:40:51.620 | We have some prompt we ask the language model to generate some text
00:40:55.140 | Then we do it many times using different temperatures and different sampling strategies
00:40:59.860 | For example by sampling randomly instead of using the greedy strategy. We can use the top p so it will generate many texts
00:41:06.100 | Each text will represent a trajectory. We do not have to do it over all the possible text that the language model can generate
00:41:13.140 | But only some so it means that we will generate some trajectories
00:41:16.740 | So we can calculate this expression here only on some trajectory that our language model will generate
00:41:22.120 | And this will give us an approximation
00:41:24.520 | of this gradient here
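The resulting estimator, averaged over a set $\mathcal{D}$ of sampled trajectories, is:

$$\nabla_\theta J(\theta) \approx \frac{1}{|\mathcal{D}|}\sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)$$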
00:41:27.620 | Once we have this gradient here, we can evaluate it over the trajectories that we have sampled
00:41:33.160 | And then run gradient ascent on it
00:41:35.700 | So practically it works like this in the case of the cat
00:41:40.020 | We have some kind of neural network that defines the policy which is taking the state of the cat
00:41:46.260 | Which is the position of the cat tells us what is the probability of the next action that the cat should take
00:41:53.220 | We can use this policy, which is not optimized to generate some trajectories
00:41:58.180 | So for example, we start from here. We ask the policy
00:42:01.220 | Where should I go and we for example, we use the greedy strategy and we move down
00:42:07.060 | Or we use the top p for example
00:42:08.820 | Also in this case, we can use top p to sample randomly the action given the probabilities generated by the network
00:42:16.340 | So imagine the cat goes down and then we ask again the policy. Where should I go?
00:42:21.620 | Policy may say okay move right move down move right move right etc. So we will generate one trajectory
00:42:26.760 | We do it many times by sampling always randomly according to the probabilities generated by the policy
00:42:32.760 | For each state actions, we will generate many trajectories in this case
00:42:37.700 | Then we can evaluate because we also know the rewards that we accumulate over each state actions. We calculate the reward
00:42:45.060 | We also know the log probabilities of each action, because for each state we have
00:42:50.420 | The log of the probability of taking the action that we chose
00:42:55.540 | And we need to calculate also the gradient of this log probabilities
00:43:01.140 | This is done automatically by PyTorch when you run loss.backward(). So PyTorch actually will calculate the gradient for you
00:43:08.020 | We do it for all the other possible trajectories. This will give us the approximated
00:43:15.200 | Gradient of over the trajectories that we have collected
00:43:18.340 | We run gradient ascent and we optimize the parameters of the model using a step towards the gradient
00:43:25.540 | Now then we need to go
00:43:28.480 | We do we need to do it again. So we need to collect more trajectories
00:43:32.180 | We evaluate them. We evaluate the gradient of the log probabilities. We run a gradient ascent
00:43:38.880 | So we take one little step towards the direction of the gradient
00:43:42.740 | And then we do it again. We go again collect some trajectories. We evaluate this expression here to
00:43:49.760 | Calculate the gradient of the policy with respect to the parameters
00:43:54.180 | And we run again gradient ascent so a little step towards the direction of the gradient
00:44:00.880 | This is known as the REINFORCE algorithm in the literature
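A compact sketch of one REINFORCE update in PyTorch, under the assumption that policy is an nn.Module returning log-probabilities over actions for a batch of states and that trajectories have already been sampled and scored (all names here are placeholders):

```python
import torch

def reinforce_step(policy, optimizer, trajectories):
    """One policy-gradient (REINFORCE) update over a batch of sampled trajectories.

    trajectories: list of (states, actions, ret) tuples, where
      states:  (T, state_dim) tensor of visited states
      actions: (T,) tensor of the actions taken in those states
      ret:     float, the return R(tau) of the trajectory
    """
    loss = 0.0
    for states, actions, ret in trajectories:
        log_probs = policy(states)                                     # (T, num_actions) log-probabilities
        chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # log pi(a_t | s_t)
        # Gradient ascent on sum_t log pi(a_t|s_t) * R(tau) == descent on its negation.
        loss = loss - chosen.sum() * ret
    loss = loss / len(trajectories)

    optimizer.zero_grad()
    loss.backward()   # PyTorch computes the gradients of the log-probabilities for us
    optimizer.step()  # one small step in the direction of the (approximated) gradient
```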
00:44:05.840 | And we can use it also to optimize our language model. So in the case of the language model
00:44:11.040 | We also have to generate some trajectories
00:44:13.780 | So one way to generate the trajectories would be, for example, to use the dataset of
00:44:19.040 | Questions and answers that we have built before for the reward model
00:44:22.800 | Which means that we have some questions
00:44:25.680 | So we ask the language model to generate some answer for each question
00:44:31.280 | Using for example the top-p strategy. So it will generate, according to the temperature, many different
00:44:37.200 | answers for the same given question
00:44:39.840 | This will be a series of trajectories because the language model generation process is an iterative process made up of states
00:44:48.080 | So prompts and actions and which are the next tokens
00:44:51.520 | And this will result in a list of trajectories for which we have
00:44:56.320 | The log probabilities because the language model generates a list of probabilities over the next token
00:45:01.840 | And we can also calculate the gradient of these
00:45:05.440 | Log probabilities using PyTorch, because when we run loss.backward() it will calculate the gradient
00:45:11.620 | But how do we do it in practice? Let's see
00:45:15.360 | Now we want to calculate this term here
00:45:18.960 | So the log probabilities of the action given the state for language models
00:45:24.400 | Which means what is the probability of the next token given a particular prompt?
00:45:28.480 | Imagine that our language model has generated the following response
00:45:32.560 | So we asked the language model where is Shanghai and the language model said Shanghai is in China
00:45:38.160 | Our language model is a transformer model. So it is a
00:45:42.240 | transformer layer
00:45:45.280 | And it will generate a given an input sequence of embeddings. It will generate
00:45:49.300 | An output sequence of embeddings which are called hidden states one for each input token
00:45:55.440 | As you know the language model when we use it for text generation
00:46:00.080 | It has a linear layer that allow us to calculate the logits for each position
00:46:05.440 | So usually we calculate the logits only of the last token because we want to understand what is the next token
00:46:11.520 | But actually we can calculate the logits for each position
00:46:14.560 | So for example, we can also calculate the logits for this position and the logits for this position will indicate
00:46:19.600 | What is the most likely next token?
00:46:22.240 | Given this input. So where is Shanghai?
00:46:26.000 | question mark Shanghai is
00:46:28.540 | So this is because of the causal mask that we apply during the self-attention mechanism
00:46:34.240 | So each hidden state actually encapsulates information about the current token. So in this case of the token is
00:46:41.840 | And also all the previous tokens. This is a property of the transformer model that is used during training
00:46:48.640 | So during training as you know, we do not calculate
00:46:51.860 | The output of the language model step by step
00:46:54.560 | We just give it the input sentence the output sentence, which is the shifted version of the input sentence
00:47:00.080 | we calculate the
00:47:02.880 | For we do the forward pass and then we calculate the log using only one forward pass
00:47:07.600 | We can use the same mechanism to calculate the log probabilities for each
00:47:11.760 | States and actions in this trajectory, which as I showed you is a series of prompts and next tokens
00:47:19.040 | Now we can calculate the logits for this position for this position for this position and for this position
00:47:24.800 | then we usually we apply the softmax to understand what is the
00:47:29.280 | Probability of the next token, but in this case, we want the log probabilities
00:47:34.080 | So we can apply the log softmax for each position. This will give us
00:47:38.160 | What is the log probability of the next token given the current token
00:47:42.720 | And all the previous ones?
00:47:44.640 | So for this position it will give us the log probability of the next token given that the input is only where is shanghai?
00:47:50.880 | question mark shanghai
00:47:53.360 | Of course, we do not want all the log probabilities
00:47:56.240 | We only want the log probability of the token that actually has been chosen in this trajectory
00:48:01.440 | What is the actual token that has been chosen for this particular
00:48:05.220 | Position? Well, we know it: it's the word "is", so we only select the log probability corresponding to the word "is"
00:48:13.360 | This will return us the log probability for the entire trajectory because now we have the log probability of selecting
00:48:20.020 | The word shanghai given the state where is shanghai?
00:48:24.580 | We have the log probability of selecting the word is given the input
00:48:29.120 | Where is shanghai question mark shanghai?
00:48:31.360 | We have the log probability of selecting the word in given the input where is shanghai question mark shanghai is etc, etc
00:48:38.080 | So now we have the log probabilities
00:48:41.440 | Of each position, of each state-action pair in this trajectory
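A sketch of this computation with a causal LM, assuming full_ids contains the prompt followed by the generated answer and prompt_len is the number of prompt tokens (function and variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def response_log_probs(model, full_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Log-probability of each generated token given everything that precedes it.

    full_ids: (1, seq_len) prompt + generated answer.
    Returns a (1, num_generated) tensor of log-probabilities.
    """
    logits = model(full_ids).logits                      # (1, seq_len, vocab_size)
    log_probs = F.log_softmax(logits, dim=-1)
    # The logits at position i predict the token at position i + 1, so shift by one.
    targets = full_ids[:, 1:]                            # the tokens that were actually chosen
    token_log_probs = log_probs[:, :-1, :].gather(2, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the generated part of the sequence (the "actions" of the trajectory).
    return token_log_probs[:, prompt_len - 1:]
```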
00:48:47.140 | When we have this stuff here, we can always ask
00:48:50.720 | PyTorch to run
00:48:54.160 | The backward step to calculate the gradients and then we multiply each gradient by the reward that we receive
00:49:01.440 | From our reward model we can then calculate this expression and then we can run
00:49:06.400 | Gradient ascent to optimize our policy based on this approximated gradient
00:49:12.100 | Let's see how to calculate the reward now for the trajectory
00:49:16.020 | So calculating the reward is a similar process as you saw before we have a reward model
00:49:21.440 | That is a transformer model with a linear layer on top that it has only one output feature
00:49:26.960 | So imagine our sentence is the same. So
00:49:29.280 | Where is shanghai shanghai is in china. This is the trajectory that has been generated by our language model
00:49:35.040 | Now we give it to the reward model. The reward model will generate some hidden states because it's a transformer model
00:49:41.840 | And we apply the linear layer to all the positions that are corresponding to the action that are in this trajectory
00:49:49.840 | So first action is the selection of this word. The second action is this one the third and the fourth
00:49:54.400 | So we can generate the reward for each time step
00:49:57.600 | We can just sum these rewards to generate the total reward of the trajectory or we can sum the discounted reward
00:50:04.960 | Which means that we will calculate something like this. For example
00:50:08.080 | We will calculate
00:50:10.400 | Let's write it: r_0 + gamma * r_1 + gamma^2 * r_2 + gamma^3 * r_3, and so on
00:50:15.120 | So the reward at time step zero, plus gamma multiplied by the reward at time step one, plus gamma to the power of two multiplied
00:50:21.760 | By the reward at time step two, plus gamma to the power of three multiplied by the reward at time step three, etc
00:50:27.840 | Etc. So now we also know how to calculate the reward for each trajectory
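A sketch of this reward computation, reusing a backbone-plus-linear-head reward model like the one sketched earlier; here the head is applied to every generated position instead of only the last one (an illustrative sketch, not the HuggingFace code):

```python
import torch

@torch.no_grad()  # rewards are targets for the policy; no gradient is needed through the reward model
def trajectory_reward(reward_backbone, value_head, full_ids, prompt_len, gamma=1.0):
    """Discounted sum of per-token rewards over the generated part of the sequence."""
    hidden = reward_backbone(full_ids).last_hidden_state        # (1, seq_len, hidden_size)
    per_token_rewards = value_head(hidden).squeeze(-1)           # (1, seq_len), one reward per position
    rewards = per_token_rewards[0, prompt_len:]                  # positions of the generated tokens
    discounts = gamma ** torch.arange(rewards.shape[0], dtype=rewards.dtype)
    return (discounts * rewards).sum()                           # r_0 + gamma*r_1 + gamma^2*r_2 + ...
```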
00:50:31.620 | So now we know how to evaluate
00:50:34.480 | This expression you can see here. So now we know also how to run gradient ascent to optimize our language model
00:50:42.560 | The algorithm that I have described before is called policy gradient optimization and it works fine for very small problems
00:50:49.600 | But it exhibits problems. It is not perfect for bigger problems. So for example language modeling
00:50:55.060 | And the problem is very simple. The problem is that we are approximating. So let's write here something so
00:51:02.400 | As you saw before, our objective function, which is J of theta, is an expectation
00:51:11.520 | Over all possible trajectories that are sampled according to our policy,
00:51:16.580 | Each one weighted by the reward accumulated along the trajectory
00:51:22.740 | So we are approximating the expectation with a sample mean so we do not
00:51:28.320 | Calculate this expression over all possible trajectories. We calculate it only
00:51:33.040 | on some trajectories
00:51:35.440 | now this is
00:51:37.120 | Fair, it means that the result that we will get will be an approximation that on average will converge to the true expectation
00:51:44.500 | So it means that on the long term it will converge to the true expectation, but it exhibits high variance
00:51:50.660 | So to give you an intuition into what this means, let's talk about something more simple
00:51:55.760 | For example, imagine I ask you to calculate the average
00:51:59.040 | age of the American population. Now the American population is made up of 330 million people
00:52:06.240 | To calculate the average age means that you need to go to every person ask what is their birthday calculate the
00:52:12.080 | Age and then sum all these ages that you collect divide by the number of people
00:52:17.600 | And this will give you the true average age of the American population
00:52:21.120 | But of course as you can see, this is not easy to compute because you would need to interview 330 million people
00:52:26.880 | Another idea would be say okay. I don't go to every American person
00:52:32.640 | I only go to some Americans and I calculate their average age which could give me a good
00:52:38.720 | indication of what is the average age of the American population
00:52:42.020 | But the result of this approximation depends on how many people you interview because if you only interview one person
00:52:49.440 | It may not be representative of the whole population. Even if you interview 10 people, it may not be representative of the whole population
00:52:56.640 | So the more people you interview the better and this is actually a result that is statistically proven by the central limit theorem
00:53:03.840 | So let's talk about the variance
00:53:07.280 | Of this estimator. So we want to calculate the average age of the American population
00:53:12.340 | Suppose that the average age of the American population is 40 years or 45 years or whatever
00:53:20.560 | If we approximate it using a sample mean which means that we do not ask every American but some Americans what is their average age
00:53:28.240 | We need to randomly sample some people and ask their age. Suppose that we only interview one person because
00:53:34.640 | We do not have time
00:53:37.040 | Suppose that we are unlucky and this person happens to be a kindergarten student, so this person will probably say
00:53:43.040 | Their age is six. So we will get a result that is very far from the true mean of the population
00:53:50.500 | On the other hand, we may ask again some random people and these people happen to be for example
00:53:55.700 | All people from retirement homes. So we will get some number that is very high which is for example 80 years
00:54:01.540 | Which is also not representative of the true population
00:54:04.200 | So the smaller the sample the more unlucky we are in getting these values that are very far from the true mean
00:54:11.700 | So one way is to increase the sample size
00:54:14.900 | So if we ask 1000 people their age, very probably we'll get something that is closer to this 40 years old
00:54:21.940 | because we cannot be
00:54:24.020 | so unlucky that all of them happen to be in kindergarten
00:54:26.020 | or in a retirement home
00:54:32.500 | This happens also when we approximate an expectation with a sample mean here
00:54:38.580 | The quality of this approximation depends on how many trajectories we choose
00:54:44.820 | and as you saw before
00:54:47.060 | Choosing too many trajectories from language models is not easy because it means that you need to run
00:54:51.860 | Inference on the language model many times to calculate these trajectories
00:54:59.300 | So the problem is we cannot easily
00:55:02.100 | Increase the number of trajectories, but we still need to find a way to reduce this variance
00:55:06.660 | because this estimator
00:55:08.660 | Tells us the direction of the gradient that we will use to run gradient ascent
00:55:14.420 | We want to find the true direction of the gradient
00:55:17.380 | so imagine the true direction of the gradient is this one if we have high variance it means that sometimes the
00:55:22.740 | This approximation may tell us that the gradient is actually pointing in this direction or it's pointing in this direction
00:55:28.180 | Or it's pointing in this direction
00:55:30.020 | But if we reduce the variance
00:55:32.740 | It will probably tell us something that is closer to the true direction of the gradient
00:55:36.580 | So we will move our weights in a way that actually tends
00:55:40.020 | To maximize the objective function, because we are moving according to the true direction of the gradient
00:55:45.700 | So this is why we want to reduce the variance of this estimator
00:55:49.080 | Now, let's see what are the techniques that we can use to reduce the variance of this estimator without increasing the sample size
00:56:01.460 | The first thing that we should notice is that okay
00:56:04.580 | First of all, we had this expectation that we approximate using the sample mean you can see here
00:56:10.740 | Now each of these log probabilities, so these log probabilities here, are multiplied by the reward over the entire trajectory
00:56:18.680 | Now the first thing that we should notice is that each action cannot alter the reward
00:56:25.700 | That we received in previous steps. So imagine
00:56:30.020 | We have a series of states and actions. So for example, we started from state zero
00:56:34.100 | And then we took action zero
00:56:36.740 | Which led us to state one
00:56:41.860 | In which we took action one, which led us to state two, in which we took action two, etc, etc, etc
00:56:50.180 | For each state-action pair we receive a reward, because when we take an action,
00:56:55.780 | For example in the case of the cat, it will move to a new cell or remain in the same cell and it will receive some reward
00:57:00.420 | And also for this one we will have some reward, so reward one, and for this one we will have reward two
00:57:06.500 | Now when we take this action here, for example action number two, it cannot alter the reward that we already received in the past
00:57:14.340 | So when we multiply by this term reward of tau
00:57:18.100 | We do not consider all the rewards that came before the action that we are considering in this summation
00:57:24.580 | So instead of calculating the reward
00:57:26.580 | For the trajectory starting from zero
00:57:29.540 | We can calculate the reward starting from the time step of the action that we are considering for the log probabilities of the action
00:57:37.060 | This term here is known as the rewards to go which means what is the total reward if I start from this state
00:57:45.700 | And take this action and then act according to the policy for the rest of the trajectories
00:57:51.240 | Why do we want to do this?
00:57:54.100 | Because as you can see
00:57:56.100 | This expression here is an approximation of the true expectation here
00:58:03.780 | The fewer terms we have the better, because we will have less noise
00:58:09.140 | Why? Because first of all
00:58:12.260 | As we know, each action cannot alter the rewards that we received in the past
00:58:20.260 | Which means that on average all these past terms will cancel out with each other
00:58:25.460 | So if we do not consider them, we avoid adding noise to this approximation that would send our gradient in
00:58:33.380 | Directions that are further from the true gradient
00:58:36.520 | So if we can remove some terms from this expression
00:58:40.340 | It is better because we have less chance of introducing noise that sends our gradient in two directions that are far from
00:58:46.980 | The one that is the true gradient that would be given by this expectation
00:58:50.840 | So the first thing we do is, instead of calculating the reward over the whole trajectory, we only calculate,
00:58:58.660 | For each state-action pair, the reward starting from that state-action pair onwards
00:59:04.980 | Until we reach the end of the trajectory
00:59:07.780 | So this big T here, you can see here, capital T,
00:59:11.380 | Indicates that we sum from the time step of the current state-action pair that we are considering until the end of the trajectory
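A small sketch of the rewards-to-go computation in plain Python (the reward list is hypothetical):

    def rewards_to_go(rewards, gamma=1.0):
        """For every time step t, the (optionally discounted) sum of rewards from t to the end."""
        rtg = [0.0] * len(rewards)
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            rtg[t] = running
        return rtg

    print(rewards_to_go([1.0, 0.5, 2.0]))   # [3.5, 2.5, 2.0]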
00:59:17.480 | Now this is one way to reduce the
00:59:21.060 | variance of the estimator
00:59:24.740 | Another way is to introduce a baseline. So
00:59:28.100 | It has been proven in the reinforcement learning literature that subtracting a constant here
00:59:37.140 | Reduces the variance, and it doesn't have to be a constant: it can also be something that depends on the state
00:59:43.860 | So it could be also a function of the state
00:59:46.260 | For which we are calculating the reward of the trajectory. So for each log probability we multiply by a term here
00:59:54.020 | That indicates the rewards to go so the reward from this state action until the end of the trajectory
01:00:00.200 | minus a baseline that does not have to be
01:00:03.220 | Constant, but it can also be a function of the state
01:00:07.300 | And the function that we will choose is called the value function
01:00:11.860 | There are many possible baselines, but the one we will choose is the value function. The value function
01:00:18.180 | tells us,
01:00:20.740 | For a state S and according to some policy pi, what is the expected reward if you start from S
01:00:27.540 | And then act according to the policy for the rest of the trajectory. This is the value function
01:00:34.900 | Let me show you some examples
01:00:38.900 | The value function of this particular cell
01:00:41.140 | Of this cell here. We expect it to be high why because
01:00:45.860 | It's very probable that the cat will take the action move down
01:00:49.780 | And go directly to the meat in the case of language model
01:00:53.780 | this is a prompt because it's a series of tokens that we will feed to the language model to generate the
01:01:01.040 | Probabilities of the next token and it's very good to be in this state
01:01:05.360 | Why because it's very probable that the next token will be generated in such a way that it will actually answer the question
01:01:12.000 | Of where is shanghai?
01:01:14.400 | So if the model has already generated these two tokens, for example
01:01:17.360 | Shanghai is it's very probable that the next token will be the word in and the next next token will be the word china
01:01:23.840 | Which answers our question which will result in a good response by the language model
01:01:29.680 | Which in turn will give us a good reward according to our reward model on the other hand
01:01:35.600 | If we are here, for example with the cat
01:01:38.080 | This is a state that can lead us to move to the bathtub
01:01:42.580 | So we expect the value of this state to be lower than that of this state because it's less probable that from here
01:01:49.680 | We end up on the bathtub. Maybe we get closer to the bathtub, but we do not end up directly on the bathtub
01:01:55.040 | But from here we can end up there so it will reduce the value of this state
01:01:59.520 | So what is a bad value for a language model? For example in the case for this prompt here
01:02:05.520 | So we started with a prompt and the language model somehow generated these two words chocolate muffins for the question
01:02:11.600 | Where is shanghai?
01:02:13.040 | Now if we ask the language model to generate the next tokens for given this prompt
01:02:17.280 | It will probably move far from the actual response of where is shanghai
01:02:22.240 | It will not tell us that shanghai is in china
01:02:24.880 | So the value that we can get starting from this state is not so high because we will probably end up generating a
01:02:31.680 | Bad response which will give us a low reward according to our reward model
01:02:38.080 | So this is the meaning of a value function
01:02:40.880 | The value function tells us if I start from this state and then act according to the policy
01:02:46.160 | What is the expected return I can get?
01:02:48.960 | Now, how do we estimate this value function?
01:02:54.240 | Well, just like we did for the reward model we can generate a neural network
01:03:00.480 | To which we add a linear layer on top that can estimate this value function, and what is usually done
01:03:09.280 | In practice is we take the same language model that we are trying to optimize and add another linear layer on top
01:03:15.440 | So apart from the one that projects into the vocabulary
01:03:18.480 | We add another one that can also estimate the value so that the parameters of the transformer layer are shared
01:03:25.700 | For the language modeling and the estimation of the value. The only two differences are the linear layers
01:03:31.200 | One is used for projecting the tokens into the vocabulary and one is used to estimate the value of the state
01:03:37.600 | Which is the prompt basically
01:03:39.760 | So suppose our language model has generated this response for our
01:03:45.200 | Prompt, so where is Shanghai and the language model has said Shanghai is in China
01:03:49.120 | We send it to the policy model. So the language model that we're trying to optimize this is called the policy
01:03:57.460 | It will generate some hidden states one corresponding to each token
01:04:02.800 | and then instead of using the linear layer of the
01:04:06.400 | Vocabulary, so that will project each hidden state into the vocabulary
01:04:10.640 | We use another linear layer that with only one output feature that will be used to estimate the value of each state
01:04:17.280 | So we can estimate the value of this state of this state of this state and also of the entire sequence
01:04:24.640 | By using the values generated by this linear layer for each hidden state that we want
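In code, this shared-backbone setup could look roughly like this (a minimal sketch; the backbone here is assumed to return hidden states directly, which is not exactly how the HuggingFace classes we will see later are called):

    import torch
    import torch.nn as nn

    class PolicyWithValueHead(nn.Module):
        """A causal LM backbone shared by two heads: one projects hidden states into the
        vocabulary (the policy), the other, with a single output feature, estimates the value of each state."""
        def __init__(self, backbone: nn.Module, hidden_size: int, vocab_size: int):
            super().__init__()
            self.backbone = backbone                       # transformer returning hidden states (assumption)
            self.lm_head = nn.Linear(hidden_size, vocab_size)
            self.value_head = nn.Linear(hidden_size, 1)

        def forward(self, input_ids):
            hidden = self.backbone(input_ids)              # assumed shape: (batch, seq_len, hidden_size)
            logits = self.lm_head(hidden)                  # used to sample / score the next token
            values = self.value_head(hidden).squeeze(-1)   # one scalar value per position (state)
            return logits, values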
01:04:29.680 | Okay, now we have seen that, to reduce the variance, first of all we transformed
01:04:38.560 | The reward of the entire trajectory into rewards to go
01:04:42.240 | So a sum that starts not from t equal to zero, but from the time step of the state-action pair that we are considering here
01:04:49.040 | And we also saw that we can introduce a baseline that depends on the state
01:04:54.240 | And this will not change the approximation. So this approximator is still unbiased
01:05:00.980 | Which means that it will on average converge to the true gradient, but will have lower variance
01:05:07.360 | Which means that, in the example where we are calculating the average age of the American population,
01:05:12.100 | we are reducing the chance of
01:05:16.000 | Getting very low values for the age or very high values for the age
01:05:21.360 | And we will get something that is closer to the true average age of the American population
01:05:25.780 | Now this function here this rewards to go is in reinforcement literature. It's also called the Q function
01:05:32.880 | So the Q function tells us if I start from this state and take this action. What is the future?
01:05:38.240 | Expected reward if I act according to the policy for the rest of the trajectory
01:05:43.060 | So the Q function tells us the expected reward if I start from this state and take this action
01:05:50.560 | So we get some immediate reward and then act according to the policy for the rest of the trajectory
01:05:58.560 | So we can simplify the expression that we have seen before as Q of state
01:06:03.600 | And action at time step t here. I forgot the t minus the value of the state at time step t
01:06:10.160 | The difference between the two is known as advantage function
01:06:14.320 | Now, I know that I am introducing a lot of terms and terminology bear with me because it will make sense
01:06:20.960 | later now just
01:06:23.680 | Don't you don't have to remember all the terms. I will repeat multiple times these concepts
01:06:28.580 | So what we were trying to do we are trying to reduce the variance of this estimator
01:06:32.680 | and we saw that, instead of calculating the reward over the whole trajectory, we can calculate only the rewards
01:06:39.300 | Starting from the time step of the action that we are considering
01:06:43.940 | Then we saw that we can introduce this baseline called the value function that will reduce further the
01:06:50.180 | variance of this estimator
01:06:54.100 | The difference between these two is called advantage function in the literature of reinforcement learning
01:06:59.620 | And the advantage function if you look at the expression here tells us. Okay. First of all, let's analyze these two terms
01:07:09.140 | Now the Q function tells us what is the expected return if I start from state s at time step t
01:07:15.940 | Take action a so here. I forgot the t's
01:07:21.060 | Action t and t and also here t and t. Okay
01:07:26.340 | so the
01:07:28.500 | Q function tells us if I start from state s, take action a
01:07:32.740 | And then act according to the policy
01:07:35.620 | What is the expected return the value function on the other hand tells us if I start from state s
01:07:41.700 | And I act according to the policy. What is the expected return?
01:07:48.260 | in this case
01:07:49.620 | For example, let's use the pen in this case here in this state
01:07:53.540 | If I choose the action go down
01:07:56.660 | It is better than going left because by going down I will move towards the meat
01:08:02.660 | So it is better to use the action go down
01:08:05.140 | The advantage term that is the difference between these two terms tells us
01:08:10.100 | How much better this particular action is compared to the average action that we can take in the state s
01:08:18.260 | Which means that the advantage function for the state for the action go down in this state here
01:08:24.180 | So in this state here will be higher than the advantage function of another action
01:08:29.940 | so the advantage function tells us how much
01:08:32.820 | Better than average the action that we are considering is, compared to the other actions that we have in this state
01:08:39.300 | And if we want to give an interpretation to this whole expression
01:08:44.180 | It tells our model that for each log probability. So for each action in a particular state
01:08:50.340 | We want to multiply it by its
01:08:52.420 | advantage
01:08:55.040 | Because this is the gradient it will indicate a direction in which we need to optimize our parameters
01:09:01.320 | By using gradient ascent basically what we are doing is we are forcing our policy to push up
01:09:09.300 | So to increase the likelihood or the log probabilities
01:09:12.840 | Of the actions that have high advantage, which means that they result in a better than average
01:09:19.880 | Returns and push down the log probabilities of those actions in each state
01:09:26.260 | That result in lower than average returns
01:09:30.100 | Which means that for example, let's talk about language modeling if someone asks
01:09:35.220 | Where is Shanghai? So, given the prompt "where is Shanghai"
01:09:41.680 | And the question mark,
01:09:44.260 | What's a good action to take? What's a good next token to select?
01:09:51.460 | Well, we know that starting with the word "chocolate" is
01:09:57.940 | Going to produce a reward that is worse than average, because very probably it will lead to a bad answer
01:10:04.900 | However, starting
01:10:06.900 | the answer with the word Shanghai will probably result
01:10:11.460 | In the correct answer, because the next tokens will be "is in China"
01:10:16.660 | So it will actually result in a good answer which will be rewarded well by our reward model
01:10:22.420 | So our model will be more likely to select the word Shanghai when it will see this prompt
01:10:28.740 | So this is how to interpret this advantage term
01:10:32.400 | Basically, what we are trying to do is we are trying to push up the log probabilities of those actions for a given state
01:10:38.640 | That result in better than average reward according to our reward model and push down the probabilities of those actions
01:10:46.160 | Given the state that result in lower-than-average reward according to our reward model
01:10:53.280 | Let's see how to estimate this advantage term now
01:10:56.720 | So first of all, let me write again the expression of the advantage term
01:11:01.040 | So let's use the pen. So as we saw before the advantage term
01:11:04.560 | at time step t, so
01:11:07.600 | Starting from state s_t and taking action a_t, is equal to the Q function
01:11:14.080 | of state s_t
01:11:17.440 | and action a_t,
01:11:20.480 | Minus the value of state s_t
01:11:28.240 | What is the Q function? The Q function tells us
01:11:30.720 | The expected return if we start from state s, take action a, and then act according to the policy
01:11:44.640 | For the rest of the trajectory, while the value function tells us what is the expected return
01:11:49.840 | If we start from state s and then act according to the policy
01:11:54.880 | Which means that imagine we have a trajectory a trajectory is what it's a list of state ended actions. So we have a state 0
01:12:02.000 | action 0
01:12:04.800 | And this will
01:12:06.160 | Have some reward associated maybe reward 0 this will lead us to
01:12:10.000 | State 1 in which we will take maybe action 1 this will have some reward associated with it, which is reward 1
01:12:17.120 | This will take us to another state for example state 2
01:12:21.360 | Action in which we will take action 2 and this will have some reward associated with it, which is reward 2
01:12:27.040 | and then state 3
01:12:29.760 | In which we will take action 3 it will have some reward associated which is reward 3 etc
01:12:35.440 | Etc, etc, etc for the rest of the trajectory
01:12:37.700 | Let's try to understand how can we estimate this advantage term?
01:12:41.920 | We saw also before that for the estimating the value function
01:12:45.680 | We can build a neural network, which is a linear head on top of our policy network
01:12:50.640 | Which is the language model that we are trying to optimize
01:12:52.980 | So instead of using the linear network that linear layer that projects
01:12:57.700 | the hidden state into the vocabulary
01:13:00.080 | We can use another special linear layer with only one output feature that can estimate the value function of that particular state
01:13:06.320 | later, we will see also how to
01:13:08.880 | Which loss function we need to use to train this value head
01:13:14.400 | So now let's concentrate on estimating this advantage term
01:13:17.200 | Now imagine we have a trajectory this advantage term can be estimated like follows. So as we know the advantage term tells us
01:13:24.560 | The Q function, so this is the Q function
01:13:28.720 | at given state S and
01:13:32.400 | action A at time step T
01:13:35.440 | Can be calculated as follows. So if we start from state S, we will receive some reward
01:13:42.560 | and then, for each trajectory, we can calculate
01:13:49.440 | The Q function: if we start from state S at time step T and take action A_T in this state
01:13:57.440 | And then act according to the policy,
01:14:00.160 | We can either sum all of the reward terms that we have for the trajectory, or we can just say: okay,
01:14:05.600 | If I start from state 0 and take action 0 I will have some immediate reward, which is this one
01:14:11.600 | Plus I approximate the rest of the rewards with the value function because I will end up in some state S1
01:14:17.200 | And I just approximate all this rest of the summation as the V of S1
01:14:24.320 | Let me delete some stuff now.
01:14:31.120 | Or we can say: okay, the advantage term,
01:14:34.640 | Which means if I start from state S at time step T and take action A_T, can also be approximated as follows
01:14:41.200 | So I have some immediate reward
01:14:43.200 | Plus the reward that I get in the next state plus the rest of the trajectory
01:14:48.880 | I approximate it with the value function at time step T
01:14:52.160 | plus 2 so S2
01:14:54.720 | And this is exactly what we are doing here
01:14:57.360 | And we are also discounting it with the gamma parameter that we saw here
01:15:01.680 | So we want to discount future rewards
01:15:04.180 | And this minus V is there just because the formula of the advantage term has this minus
01:15:10.480 | value function term
01:15:12.080 | We can also do it with three terms or four terms or five terms or whatever we want
01:15:17.040 | And then we can cut the rest just with the value function
01:15:19.840 | Now, why do we want to do this? Let me delete some stuff
01:15:23.680 | Okay, if we stop too early,
01:15:27.040 | So we use, for example,
01:15:28.960 | Just the first approximation, we are approximating most of the trajectory with the value function
01:15:34.640 | It will exhibit high bias, which means that the estimate of this advantage
01:15:40.180 | Will not be very accurate, because we are approximating most of the trajectory with the value function, which is itself an approximation
01:15:48.260 | Or to improve this approximation we can introduce more rewards from the actual trajectory that we got
01:15:55.520 | And only approximate a little bit
01:15:59.040 | Of the trajectory with the value function or we can approximate all of the trajectory with the rewards that we get and
01:16:05.760 | Use no approximation with the value head
01:16:10.400 | But if we use more terms, it will result in higher variance
01:16:15.780 | If we use fewer terms, it will result in higher bias because we are approximating more
01:16:22.480 | So in order to solve this bias variance problem, we can use a generalized advantage estimation, which basically takes the
01:16:29.920 | Weighted sum of all these terms. So of this one, this one, this one each multiplied by a decay parameter
01:16:37.280 | lambda
01:16:39.680 | We can see here
01:16:41.200 | So basically this results in a recursive formula in which we can calculate the advantage
01:16:45.860 | At each time step t given the future advantage at time step t plus one
01:16:52.240 | Let's try to use this formula. For example, imagine we have a trajectory which is a series of states and actions
01:16:57.360 | So we have a state zero with action zero which will result in a reward zero
01:17:02.560 | Then we have this will result in another state s1 in which we take action one and it will have some reward one
01:17:09.680 | This will result in a new state s2 in which we take action two
01:17:13.360 | Which will lead us to state three in which we take action three, etc, etc
01:17:17.600 | This one will have reward three and this one reward two and this one will have reward three
01:17:23.600 | Let's try to calculate the advantage. For example, the advantage at time step three, because it's the last term in our trajectory,
01:17:31.780 | Is equal to delta at time step three plus
01:17:37.840 | Gamma multiplied by lambda multiplied by the advantage at time step four, but we do not have any time step four
01:17:42.400 | So this term does not exist. So delta three
01:17:46.400 | Is equal to the reward that we have at time step three plus
01:17:50.640 | Gamma multiplied by the value function at time step four,
01:17:55.600 | But we do not have this term because there is no state four,
01:17:59.120 | Minus the value of state three
01:18:05.200 | this tells us
01:18:07.200 | The advantage estimation at time step three; then we can use it to calculate the advantage estimation at time step two,
01:18:16.480 | which is:
01:18:17.760 | A2 is equal to delta two plus
01:18:23.200 | Gamma multiplied by lambda multiplied by A3
01:18:30.320 | But what is delta two? Delta two is equal to the reward that we have at time step two plus
01:18:36.800 | Gamma
01:18:39.120 | multiplied by the value of state three
01:18:43.360 | Minus the value of state two, etc, etc
01:18:46.400 | So we can recursively calculate the advantage estimation of each term
01:18:50.320 | Why do we need to calculate the advantage estimation? Because the advantage appears in the formula of the
01:18:54.960 | gradient that we need to calculate
01:18:58.400 | In order to run gradient ascent
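Putting the recursion into code, a minimal sketch of Generalized Advantage Estimation could look like this (the reward and value lists are hypothetical inputs):

    def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
        """Generalized Advantage Estimation computed backwards:
        delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        A_t     = delta_t + gamma * lam * A_{t+1}
        `values` holds V(s_t) for every time step of the trajectory."""
        T = len(rewards)
        advantages = [0.0] * T
        next_advantage, next_value = 0.0, 0.0     # beyond the last step there is nothing
        for t in reversed(range(T)):
            delta = rewards[t] + gamma * next_value - values[t]
            advantages[t] = delta + gamma * lam * next_advantage
            next_advantage, next_value = advantages[t], values[t]
        return advantages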
01:19:01.360 | I know that I have introduced a lot of concepts. I have introduced the value function
01:19:06.320 | I have introduced the Q function and the advantage function
01:19:09.600 | I also know that it may not be very clear to you
01:19:12.560 | Why we are calculating all this stuff because we have not seen the code and how it will be used
01:19:17.600 | So please bear with me now. I know that there is a lot of stuff that you need to remember
01:19:21.760 | But when we will see the code, I will go back to all these slides for now. I just made this
01:19:27.600 | I just made all these formulas because later when we go back it they will make more sense to you
01:19:34.320 | And also if you want to in the future review this video, you don't have to kind of watch the code to understand
01:19:40.000 | the formulas because once you understand the
01:19:43.040 | This video once you can just review the parts that you're interested and they will be more clarified to you
01:19:49.440 | Okay, now let's see what is the advantage term for language model so just like the example I made before I said, okay we have this
01:19:57.840 | expression for our gradient
01:20:00.960 | In which we are multiplying each log probability by the advantage function also here. I forgot the t
01:20:06.880 | And here I forgot the t later. I will fix the slides
01:20:10.560 | Now, as we saw before, if our question is "where is shanghai" and our language model selects the
01:20:20.080 | word shanghai
01:20:21.600 | Very probably this will be a new state that will be fed to the language model for generating the next tokens
01:20:29.920 | This first choice of shanghai will lead to a good answer, because very probably the next tokens will be selected in such a way
01:20:37.120 | that they will result in, for example, the
01:20:39.200 | Phrase "shanghai is in china", which is a good response because it matches
01:20:44.800 | The chosen answers in the data set of the reward model. So our reward model will give a good
01:20:52.560 | reward to this kind of
01:20:56.000 | Answer so we can say that this is a good state to be in because it will lead to future states that will be rewarded
01:21:02.640 | Well by the reward model
01:21:04.240 | However, if our language model happens to choose the word chocolate as the next token after this question
01:21:10.720 | This new state will lead to new tokens being selected that are not very close to the answer that we are trying to find
01:21:20.000 | This will result in a bad response. So it will result in a low reward from our reward model
01:21:26.240 | So in the case of language models, we are trying to push up
01:21:29.920 | The log probabilities of the word shanghai when it sees the state
01:21:35.840 | Where is shanghai and push down the log probability of the word chocolate
01:21:42.560 | When the state is where is shanghai because the advantage for choosing shanghai
01:21:47.540 | Is higher than the advantage for choosing the word chocolate given this prompt. This is how we
01:21:54.240 | Interpret the advantage estimation for language models
01:21:59.040 | Another problem that we have with policy gradient optimization is caused by the sampling that we are doing
01:22:05.120 | So as you know in the policy gradient optimization, the algorithm is like this
01:22:09.280 | So we have a language model. We sample some trajectories from this language model. We calculate the
01:22:14.560 | Rewards associated with these trajectories. We calculate the advantages associated with these trajectories
01:22:21.220 | We calculate the log probabilities associated with these trajectories
01:22:25.140 | Then we can use all this information to calculate this big expression here, which is the direction of the gradient
01:22:31.840 | so which is the gradient of the
01:22:35.820 | Expected reward with respect to the parameters of the model and then we can run gradient ascent to optimize the parameters of the model
01:22:43.820 | according to the direction of the gradient
01:22:47.820 | This is a process that is also used in gradient descent
01:22:51.020 | So using gradient descent we have a loss function
01:22:53.500 | We calculate the gradient of the loss function with respect to the parameter of the model
01:22:57.420 | And then we optimize the parameters of the model according to the direction of the gradient
01:23:02.300 | We do this process many many many times. Why? Because we do little steps
01:23:06.460 | With respect to the direction of the gradient according to a learning rate alpha
01:23:12.620 | Now the problem is that we are sampling trajectories from the language model
01:23:19.100 | For each step that you are making in this gradient ascent
01:23:22.700 | So for each step of this optimization process, we need to sample many trajectories. We need to calculate many advantages
01:23:29.500 | We need to calculate many rewards. We need to calculate many log probabilities
01:23:33.040 | So this can be very inefficient, because when doing gradient ascent we are taking only small steps
01:23:40.380 | so for each of those small steps, we need to do a lot of calculation, which
01:23:44.460 | Makes the computation nearly impossible because we cannot run all these forward steps
01:23:51.020 | on many different
01:23:53.020 | Models to calculate the values the advantages and the rewards etc. We need to find a better way
01:23:58.780 | so as you remember this
01:24:00.780 | This formula for the gradient that we have found is an approximation of an expectation
01:24:06.240 | And in probability we have this thing called importance sampling
01:24:11.580 | So when evaluating an expectation with respect to one distribution,
01:24:16.160 | we can calculate the expectation with respect to another distribution, different from
01:24:22.940 | The previous one, as long as we multiply the function
01:24:27.760 | Inside the expectation by an additional term here. So let's try to understand what this means
01:24:33.980 | Imagine we are trying to calculate this expectation, and I want to remind you that in the case of language model optimization,
01:24:40.400 | Or policy gradient optimization, we are calculating the gradient
01:24:44.480 | of an expectation over all the possible trajectories sampled according to the
01:24:49.340 | policy
01:24:51.960 | Parameterized by theta, of the reward
01:24:56.380 | of each trajectory
01:24:59.100 | So in this case we can consider x to be a trajectory sampled from the
01:25:06.780 | policy
01:25:09.500 | Pi theta, and f of x to be the reward of that trajectory. Now, as you know, the expectation can be written as an
01:25:17.560 | integral of the probability of each item in the expectation multiplied by the function f of x which is the inside here in the
01:25:25.160 | parentheses of the expectation
01:25:27.240 | We can multiply
01:25:30.680 | By the this constant here, which is basically the number one
01:25:33.800 | So we can always multiply by the number one in a multiplication without changing the result of this multiplication
01:25:39.020 | So we are multiplying up and down in this fraction by the same quantity, which is the number one so we can do it
01:25:46.200 | then we can rearrange the terms such that
01:25:48.760 | We divide
01:25:51.800 | The p of x by this q of x, where q of x, this term here,
01:25:56.680 | Is the probability density function of another distribution
01:26:03.960 | Then we can return back this integral to the expectation form
01:26:08.360 | So now we can write the expectation as sampling from the distribution q
01:26:14.520 | And calculating it with respect to a function that is f of x multiplied by this additional term
01:26:21.640 | So this means that in order to calculate the initial expectation here instead of sampling from the distribution
01:26:28.460 | For which we want to calculate the expectation
01:26:30.680 | We can sample from another distribution as long as each item is multiplied by this additional factor here
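A tiny numerical sketch of importance sampling (toy distributions chosen just for illustration):

    import random
    random.seed(0)

    # E_{x~p}[f(x)] estimated by sampling from q and re-weighting each sample by p(x)/q(x).
    # Toy example: x in {0, 1}, f(x) = x, p = (0.2, 0.8), q = (0.5, 0.5). True value: 0.8.
    p = {0: 0.2, 1: 0.8}
    q = {0: 0.5, 1: 0.5}
    samples = [random.choice([0, 1]) for _ in range(100_000)]   # sampled from q, not from p
    estimate = sum((p[x] / q[x]) * x for x in samples) / len(samples)
    print(estimate)   # close to 0.8, even though we never sampled from p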
01:26:38.760 | And we can do the same for our policy gradient expression, in which we were sampling from some policy
01:26:45.480 | Here, which is the policy that we are trying to optimize
01:26:49.420 | But we can modify it by using importance sampling to sample from another policy, which could be a different
01:26:57.000 | Neural network, but we will see that actually it's the same
01:27:00.680 | But okay, suppose that it's a different neural network
01:27:04.920 | Because sampling trajectory means that we generate some text given some questions. So it's actually we are sampling from our neural network
01:27:12.200 | And each of the items, so each of these advantage terms, instead of being multiplied only by the probability according
01:27:19.800 | To the network that we're trying to optimize, is also divided by this q of x, so by the probabilities of the
01:27:27.800 | Distribution from which we are sampling
01:27:32.200 | We will call the distribution from which we are sampling pi offline and the distribution that we are trying to optimize
01:27:39.660 | pi online
01:27:41.720 | Let me give you an example a graphical example on how it works
01:27:45.480 | So for now, just remember that with importance sampling we can calculate this expectation
01:27:51.500 | By sampling from one network while optimizing another, different one
01:27:57.240 | It works like this. This is called off policy learning in reinforcement learning literature. So imagine we have a language model and we will call it
01:28:04.760 | Parameterized by some parameters called theta offline and we will call it the offline policy
01:28:11.880 | We will sample some trajectories. What does it mean? We give some questions according to our reward model data set, for example
01:28:18.520 | So we ask it where is shanghai and we ask the language model to generate many answers giving using a high temperature
01:28:25.560 | For example, then we calculate the rewards for these trajectories that are generated
01:28:30.440 | We calculate the advantages for all the state action pairs. We calculate the log probabilities for this state action pairs
01:28:37.000 | and then
01:28:39.240 | We optimize another model, called the
01:28:42.520 | online policy
01:28:44.600 | So we take all these trajectories that we have sampled from the offline policy and we save it in some database or in some memory
01:28:52.360 | And we keep it there. Then we take some mini batch of trajectories from this database or from this memory
01:28:59.000 | And then we calculate this expression here, because we can calculate it
01:29:04.440 | So we can calculate the log probabilities according to the online model
01:29:09.000 | For the trajectories that we have sampled from this memory
01:29:13.320 | We can also calculate again the advantage term according to the online policy, which is another neural network
01:29:20.360 | And, as I will show later in the code,
01:29:23.400 | We can also calculate the rewards according to the online policy, etc
01:29:30.440 | And then we run gradient ascent
01:29:33.960 | Based on this expression only optimizing this online policy here
01:29:39.240 | And we do it for a few epochs, which means for a few mini batches that we sample from this big memory of trajectories
01:29:47.800 | And after a while, we just set
01:29:51.400 | The parameters of the offline policy equal to the parameters of the online policy and restart the loop
01:29:56.760 | So we start again by sampling some trajectories, which we keep them in the memory
01:30:02.760 | For a few epochs, we sample some trajectories from here. We calculate the log probabilities with respect to the online policy
01:30:09.260 | We calculate this expression here, which is needed to optimize
01:30:14.600 | With the gradient ascent and then after a while we set the offline policy equal to the online policy
01:30:21.880 | They look like two different neural networks
01:30:24.520 | But actually it's the same neural network: we first sample from the neural network,
01:30:29.160 | We keep the trajectories that we sample in memory, and then we optimize this neural network using these trajectories
01:30:36.620 | After a while, we do this process again. I know that this is not easy to visualize
01:30:43.400 | So later we will see this in the code, but the important thing is that now we have found a way to
01:30:48.920 | Run gradient ascent multiple times without having to sample each time from the policy that we are optimizing from the network that we are trying
01:30:57.480 | To optimize. We can sample once, keep these trajectories in memory
01:31:02.040 | Optimize the network for some steps and then after we have optimized for some steps, we can sample new trajectories
01:31:09.100 | We do not have to do it for every step of gradient ascent
01:31:12.840 | So this makes the computation of this policy gradient algorithm tractable because otherwise it was too slow to run it
01:31:21.080 | And this is how we do it in the code, so
01:31:25.720 | I also created some pseudocode in how to do this offline policy. So imagine we have a model that we want to train
01:31:33.800 | Okay, let's use
01:31:39.000 | This one here, okay
01:31:41.000 | For now, just ignore the frozen model. We're not using it
01:31:43.720 | So we have a neural network that we want to train with gradient ascent
01:31:48.200 | So we have a policy that we want to optimize with gradient ascent
01:31:51.580 | We sample some trajectories from this policy and we keep them in memory. For each trajectory
01:31:57.560 | We calculate the log probabilities, the rewards, the advantages, the KL divergence, etc, etc
01:32:04.760 | Later, we will see why we need the KL divergence for now. Just ignore it
01:32:09.640 | This part
01:32:11.320 | Then we sample some mini-batch from these trajectories that we have saved. We run the PPO algorithm: we
01:32:17.800 | calculate the loss, basically the expression that we saw before
01:32:21.720 | We calculate the gradient using loss.backward and we run optimizer step, but we do not need to sample again
01:32:28.840 | We just take another sample from the trajectories that we have already saved
01:32:32.840 | We do again another step of gradient ascent and then etc, etc until we reach a specified number of steps
01:32:39.240 | And then after we have optimized the model for some number of steps
01:32:43.240 | We can sample new trajectories and then run again this loop of optimization for many steps
01:32:48.760 | So not for every step of gradient ascent, we have to sample new trajectories
01:32:53.100 | We sample once, we do many steps of gradient ascent and then we sample again
01:32:57.320 | We do many steps of gradient ascent and then we sample again. This makes the training much faster
01:33:02.520 | Okay, I promise this is the last group of formulas that we are going to see. So this is finally the PPO loss
01:33:07.800 | Let's try to understand it
01:33:11.400 | So based on what we have seen before, the first thing that we should see is that
01:33:15.480 | This term here is exactly the one that we saw before
01:33:18.920 | So we have the probability of the action according to the policy that we are trying to optimize
01:33:23.420 | Divided by the probability according to the policy that we sample from, so the offline policy. Yeah, I don't know why it's so ugly
01:33:32.280 | So in the numerator we have the probability according to what is called the
01:33:35.480 | Online policy, so the policy that we are trying to optimize. So let's call it online
01:33:40.600 | In the denominator we have the probability according to the policy that we sample from, so we sample some trajectories from this policy
01:33:48.360 | This is the offline policy
01:33:50.680 | And then we have this advantage term which is multiplied by each of the action state pairs
01:33:58.440 | We are calculating the minimum value of this expression and this other expression here. So this clipped
01:34:04.520 | ratio
01:34:07.400 | Why? Well, first of all, what is the clip function? The clip function says that if this
01:34:12.200 | expression we can see here is bigger than 1 plus epsilon, then it will be clipped to 1 plus epsilon
01:34:20.680 | If this expression is smaller than 1 minus epsilon, then it will be clipped to 1 minus epsilon
01:34:27.900 | Why do we want this? Well, it means that
01:34:30.700 | First of all, let's try to interpret this
01:34:33.740 | This term here
01:34:37.660 | The ratio of the two probabilities
01:34:41.360 | so we have some log, we have some policy that we sample from and then we have a
01:34:46.860 | policy that we are optimizing
01:34:49.740 | This means that if the probability in the policy that we are optimizing
01:34:55.020 | Is much higher for a specific action compared to the one in the policy that we sampled from,
01:35:00.300 | Which means that we are trying to increase the likelihood of selecting that action in the future,
01:35:05.580 | We don't want
01:35:08.700 | This increase to go too far. So we clip it at this maximum value
01:35:15.340 | On the other hand, if we are trying to decrease the likelihood of an action compared to what it was before
01:35:23.740 | We don't want it to decrease by too much, but at most by this quantity here
01:35:28.780 | This means that in our optimization step
01:35:31.580 | We are moving the action probabilities
01:35:34.700 | So the probabilities of selecting a particular token given a particular prompt we are changing them continuously
01:35:41.040 | But we don't want them to change too much. We want to make little steps
01:35:46.220 | Why? Because
01:35:48.860 | If we move them too much, maybe the model will
01:35:57.100 | Not explore enough
01:36:00.220 | Of the other options, so the model may actually optimize for that particular action too much
01:36:05.900 | So it may always avoid that action or it will always use that action in this case
01:36:11.020 | We want to do it little by little
01:36:13.100 | So we want the model to make little steps
01:36:16.060 | In increasing or decreasing the probability of a particular action
01:36:22.620 | Why are we talking about actions? Because we are talking about language models and so we want to
01:36:26.620 | Increase or decrease the probability of selecting a particular token given a prompt
01:36:32.380 | But we don't want this probability to change too much. This is why we have the minimum here. So we want to make the most
01:36:40.380 | Pessimistic update we can; we don't want the model to make the most optimistic steps
01:36:47.900 | So even if the model is very sure that it should always select this token, we don't want the model to be sure
01:36:53.340 | We want the model to make a little step towards what it thinks is the better choice
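A minimal sketch of this clipped part of the objective, written as a loss to minimize (the tensor inputs are hypothetical):

    import torch

    def clipped_policy_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
        """PPO clipped surrogate objective. The ratio of probabilities is exp(new_log_prob - old_log_prob)."""
        ratio = torch.exp(new_log_probs - old_log_probs)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
        # take the minimum (most pessimistic) of the two, average over tokens, negate for gradient descent
        return -torch.min(unclipped, clipped).mean()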
01:36:57.820 | The other head that we introduced before was the head for calculating the value function
01:37:04.940 | So as you remember, we also
01:37:07.020 | Introduced this value function and we say that this value function which is a function of the state
01:37:12.380 | indicates what is the
01:37:15.160 | Expected reward that we can receive from start by starting from that particular state
01:37:20.120 | And the example that I gave you was for example, imagine we are our question is where is Shanghai?
01:37:26.520 | So where is Shanghai?
01:37:29.080 | If the model has selected for example the word Shanghai as the next token
01:37:36.600 | We expect the value of this state
01:37:40.360 | So because this will become a new input for the language model to be high
01:37:43.720 | Why? because it will probably result in a good answer that will be rewarded well by our model
01:37:49.720 | But of course, we also need to train our neural network to approximate this value function
01:37:55.240 | Well, so what we do is we use this other term of the PPO loss for training the value function estimator
01:38:02.940 | and basically it takes
01:38:06.040 | The value function estimate for a particular state,
01:38:10.280 | So the output of the value head, and compares it with the actual value of this state based on the
01:38:16.760 | trajectories that we have sampled, because we have trajectories
01:38:20.220 | Each trajectory is made up of state actions. Each state action has some reward
01:38:25.320 | So we actually can calculate the value of this state
01:38:30.200 | According to the trajectory that we have sampled. So we want to optimize the value function estimator
01:38:35.580 | According to the trajectories that we have actually sampled from our policy
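A minimal sketch of this value-head loss, assuming we have already computed the returns from the sampled trajectories (for example with the rewards-to-go function above):

    import torch.nn.functional as F

    def value_loss(predicted_values, returns):
        """Train the value head so that its estimate matches the returns actually observed
        in the sampled trajectories (a simple mean-squared-error regression)."""
        return F.mse_loss(predicted_values, returns)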
01:38:39.160 | Now the last term in the PPO loss. So we have, first of all, the policy optimization term, which is this:
01:38:45.880 | This is the one I described here
01:38:48.200 | Then we have the loss for the value function estimator, and then we have another term here, the entropy loss
01:38:56.360 | This is to force our model to explore more options
01:39:02.360 | So imagine we don't have this term here: the model may just
01:39:07.000 | adjust the actions
01:39:12.760 | In such a way as to select the actions that resulted in a very good advantage
01:39:16.780 | more often, and the actions that resulted in lower-than-average advantage less often
01:39:25.480 | So this will kind of make the model very rigid in selecting tokens
01:39:30.200 | The model will always choose the tokens that resulted in good advantage and never select the tokens that resulted in bad advantage
01:39:37.660 | But this will make also the model not explore other options
01:39:42.200 | Which means that for example, imagine we sample some trajectories
01:39:45.180 | And for the question, where is shanghai the model always selects the word shanghai because it results in a good answer
01:39:51.480 | But we also want the model to explore other options. Maybe there is another word:
01:39:56.440 | For example, for "where is shanghai?"
01:39:59.000 | Maybe the next word can be the word "it", because it would result in "it is in china"
01:40:04.520 | So we also want to give the model the possibility to explore more of these options
01:40:12.760 | And this is why we introduce this entropy term, because we want the model, for each state, to also explore other actions
01:40:19.640 | So we want to force the model to explore other options. Since we are maximizing
01:40:25.480 | This objective function here, we also want to
01:40:31.640 | Minimize this value loss here (we will see later how to do it) and we want to maximize the entropy
01:40:37.260 | So that the model can also explore more options. Why do we use the entropy? Because the entropy tells us
01:40:45.320 | How much disorder, how much uncertainty, there is in the prediction
01:40:50.280 | So we want the model to stay somewhat uncertain, because it will help the model to explore more next tokens for a given prompt
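A minimal sketch of this entropy term, computed from the logits of the policy (the way the three terms are combined at the end is only indicative of the sign conventions):

    import torch
    import torch.nn.functional as F

    def entropy_bonus(logits):
        """Average entropy of the next-token distribution: higher entropy = more exploration.
        It is *maximized*, so it enters the total loss with a negative coefficient."""
        log_probs = F.log_softmax(logits, dim=-1)
        probs = log_probs.exp()
        return -(probs * log_probs).sum(dim=-1).mean()

    # total loss to minimize (c1, c2 are weighting coefficients):
    # loss = policy_loss + c1 * value_loss - c2 * entropy_bonus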
01:40:57.640 | The last thing that we need to consider is that if we kind of optimize the policy
01:41:04.040 | Using the ppo loss that we have described before
01:41:06.760 | the model may learn
01:41:09.480 | Some tokens or some sequence of tokens that always result in a good reward
01:41:14.360 | And the model may always choose these tokens to always get good rewards
01:41:19.080 | But these tokens may not make sense for us humans. So for example, imagine our model
01:41:24.840 | our data set
01:41:27.480 | Forces our data set for the reward model forces the model to be polite
01:41:31.640 | The model may just use the word
01:41:34.520 | Thank you. Thank you. Thank you continuously because we know that it is very polite and it results in a good reward
01:41:40.360 | but this is not a good answer for a
01:41:43.320 | Question because if I ask you where is shanghai and you are just keep if the model just keeps telling me
01:41:48.600 | Thank you. Thank you. Thank you
01:41:49.400 | Then for sure the reward model will give a good reward to this answer because it's a polite answer
01:41:54.040 | But it does not make sense to humans
01:41:56.200 | So we want the model to actually generate output that makes sense that are very similar to the data
01:42:01.640 | It has seen during the training. That's why we want to constrain the model
01:42:06.360 | Not only to get good rewards, but at the same time to generate answers that are very similar to the ones
01:42:12.760 | It would generate if we just used the unoptimized model, so the unaligned model
01:42:19.400 | This is why we make another copy of the model that we want to optimize and we freeze its weights
01:42:25.080 | So this is the frozen model. We generate the rewards for each step in the trajectory
01:42:31.960 | But we penalize by how much the log probabilities at each step change from the frozen model
01:42:38.360 | So for each hidden state we can generate the reward by using the linear layer that we saw before with only one output feature
01:42:45.240 | But at the same time for each hidden state, we will also calculate the log probabilities using the other linear layer for generating the logits
01:42:52.440 | So we also send each hidden state to the linear layer that generates the logits
01:42:56.520 | This will calculate the logits
01:43:01.160 | And, from the logits, the log probabilities
01:43:06.040 | We do the same for the frozen model and then we penalize the reward
01:43:11.560 | So for this reward here at this time step, we say the reward is equal to the reward at that time step
01:43:18.120 | minus the KL divergence between the log probabilities of the frozen model,
01:43:23.960 | so the log probabilities of the frozen model, and
01:43:28.760 | The log probabilities of the policy that we are optimizing
01:43:32.300 | We want to penalize the model for generating answers that are too different from the frozen model. So we want the
01:43:39.720 | Reward to be maximized but at the same time we don't want the model to cheat
01:43:44.120 | In just getting reward by generating any kind of output
01:43:48.040 | But we want the model to actually get rewards for good answer that are very similar to the one that it would generate
01:43:53.640 | If it was not optimized
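A minimal sketch of this per-token KL penalty on the rewards (the per-token KL here is approximated by the difference of the log probabilities of the chosen token, an estimate commonly used in RLHF implementations; all names are hypothetical):

    def kl_penalized_rewards(rewards, policy_log_probs, frozen_log_probs, kl_coef=0.1):
        """Penalize each per-token reward by how far the policy drifts from the frozen model."""
        penalized = []
        for r, lp, ref_lp in zip(rewards, policy_log_probs, frozen_log_probs):
            kl = lp - ref_lp                      # approximate per-token KL contribution
            penalized.append(r - kl_coef * kl)    # high drift from the frozen model lowers the reward
        return penalized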
01:43:56.440 | Okay, I know that you are tired of looking at all this explanation and all this theory. So let's jump into the code now
01:44:02.360 | Okay. So the code that we are going to see is code that I took from the HuggingFace
01:44:10.040 | Website, which basically allows us to train a reinforcement learning
01:44:14.680 | Setup in which we want to train a language model to generate positive reviews
01:44:19.480 | So if we have a language model that is generating text
01:44:22.520 | But we want to force the language model to generate positive reviews of a particular
01:44:27.320 | For example a restaurant or a movie or something like this
01:44:33.800 | We want the language model to still generate something that is
01:44:38.440 | Comprehensible to humans, but at the same time we force the language model to generate positive stuff
01:44:44.840 | So say stuff like for example, I really like this movie or I really like this restaurant
01:44:50.200 | We will be using the IMDB dataset. As you can see from the HuggingFace website, the IMDB dataset
01:44:56.120 | is a dataset made up of review texts, and for each review
01:45:00.120 | it indicates whether the review is positive or negative.
01:45:04.140 | And we will use this IMDB dataset
01:45:08.440 | to decide what score we want to give to a generated review:
01:45:13.160 | if the generated text looks like a positive review according to this dataset,
01:45:17.080 | it will be given a high reward, and if the generated text looks like a negative review,
01:45:22.360 | then it will be given a low reward.
01:45:24.440 | So the first thing that we do is we create the model that we want to optimize, which is
01:45:29.400 | this
01:45:32.520 | language model here:
01:45:34.520 | GPT-2, already fine-tuned on the IMDB dataset.
01:45:38.840 | And then we create a reference model. Why? Because we need a model with frozen weights,
01:45:45.640 | a model that we
01:45:47.400 | keep frozen, to compare how
01:45:49.720 | different the response of the model that we are trying to optimize is from the frozen model, because we don't want the
01:45:55.640 | output to be much different. We just want it to be a little more positive, but we don't want the model to just
01:46:01.240 | output garbage to get a high reward. We want to actually
01:46:04.840 | get text that makes sense. This is why we also keep a frozen model.
01:46:12.520 | And then we load this PPOTrainer. The PPOTrainer in HuggingFace is the class that is used
01:46:19.560 | to run reinforcement learning from human feedback using the PPO algorithm.
01:46:24.860 | So, let's see. First of all, what is the reward model?
01:46:28.200 | The reward model is basically just a sentiment analysis pipeline using this model here.
01:46:32.920 | It will give us, for each text that we feed to this reward model,
01:46:38.840 | a number that indicates how positive it is according to this IMDB dataset you can see here.
01:46:44.280 | So it will tell us whether
01:46:46.760 | the text it receives is a positive review or a negative review.
01:46:50.600 | For example, if we give it this text here, it will probably tell us that it's a bad review,
01:46:55.000 | so a low reward, and if we give it this text here, "this movie was really good", it will give us a positive
01:47:04.200 | reward, and we will use this number here as the reward: the score corresponding to the positive class.
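A minimal sketch of using such a sentiment classifier as a reward model through the transformers pipeline API; the checkpoint name and the NEGATIVE/POSITIVE labels are assumptions on my part, not a verbatim copy of the example's code.

```python
from transformers import pipeline

# Assumed checkpoint: a sentiment classifier fine-tuned on IMDB with NEGATIVE/POSITIVE labels.
sentiment_pipe = pipeline("sentiment-analysis", model="lvwerra/distilbert-imdb")

# function_to_apply="none" asks for raw logits instead of softmax probabilities,
# so the reward is the unnormalized score of the POSITIVE class.
outputs = sentiment_pipe(
    ["This movie was really good", "This movie was a complete waste of time"],
    top_k=None,
    function_to_apply="none",
)
rewards = [
    next(s["score"] for s in scores if s["label"] == "POSITIVE")  # keep only the positive-class logit
    for scores in outputs
]
print(rewards)  # expect a high value for the first text and a low one for the second
```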
01:47:10.120 | Okay, the first step in PPO
01:47:14.520 | is to generate the trajectories. So we have some
01:47:21.400 | policy, the offline policy, and we need to sample some trajectories from it. What do I mean by
01:47:28.680 | sampling some trajectories? It means that we give it some text and it will generate some responses, some output text.
01:47:35.800 | And what we will be using as questions, or prompts, for generating the text is
01:47:42.680 | just some initial sampled
01:47:45.240 | text from this IMDB dataset you can see here. So, for example, this dataset is composed of many
01:47:52.600 | reviews, some positive, some negative. We just randomly take the initial part of a
01:47:58.600 | review and we use it as a prompt to generate the rest of the review.
01:48:02.760 | And then we ask the reward model to judge the review that was generated: if it's positive or negative.
01:48:07.800 | If it's positive, it will receive a high reward; if it's negative, it will receive a low reward.
01:48:12.920 | So first we
01:48:16.680 | generate some
01:48:19.080 | lengths, that is, we randomly select how many tokens we take from each review as the prompt.
01:48:25.480 | We get these prompts from our dataset and we ask the PPO model to generate some
01:48:31.560 | answers for these prompts, so to generate the rest of the text up to a maximum length
01:48:39.080 | that is also sampled randomly.
01:48:42.440 | These are our trajectories for now:
01:48:45.720 | they are just the combination of prompt and generated text. We did not calculate the log probabilities,
01:48:52.680 | we did not calculate the advantages, we did not calculate the rewards, etc.
01:48:57.320 | Okay. So for now, we only have the query and the response generated by our
01:49:02.920 | offline policy.
01:49:05.560 | The offline policy is the model that we are trying to train, so this variable here, model.
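A simplified sketch of this rollout step, assuming a causal language model from transformers; the checkpoint, prompt lengths and generation lengths here are illustrative, not the exact values used in the HuggingFace example.

```python
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")      # illustrative checkpoint
policy = AutoModelForCausalLM.from_pretrained("gpt2")

# Stand-ins for IMDB reviews.
reviews = ["I watched this film last night and", "The plot of this movie was"]

queries, responses = [], []
for review in reviews:
    ids = tokenizer(review, return_tensors="pt").input_ids[0]
    # Randomly truncate the review to a short prompt.
    prompt_len = random.randint(2, max(2, min(8, len(ids))))
    query = ids[:prompt_len]
    # Sample a continuation of random length: these (query, response) pairs are the trajectories.
    gen_len = random.randint(4, 16)
    with torch.no_grad():
        out = policy.generate(
            query.unsqueeze(0),
            do_sample=True,
            max_new_tokens=gen_len,
            pad_token_id=tokenizer.eos_token_id,
        )
    queries.append(query)
    responses.append(out[0, prompt_len:])  # keep only the newly generated tokens
```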
01:49:12.440 | Now that we have some responses
01:49:13.960 | we can ask our reward model to judge them: we basically just do a sentiment classification
01:49:21.160 | in which we give the response that was generated by the
01:49:24.520 | policy to the sentiment analysis
01:49:28.440 | pipeline, which will act as our reward model to judge this text,
01:49:33.640 | that is, how positive is this review that was generated, and we will take the
01:49:38.600 | score associated with the positive class, as you can see here.
01:49:44.840 | We assign the reward to the full response, so for each response we will get one number,
01:49:51.480 | and this number is actually the
01:49:53.480 | logit, so the score corresponding to the positive class according to this sentiment analysis pipeline.
01:50:00.540 | Now that we have some trajectories, which are some questions
01:50:04.360 | So some prompts along with the text that was generated along with the reward for each of this text that was generated
01:50:11.900 | We can run the PPO training setup. So let's now go inside the code of the library
01:50:18.600 | So the first thing we do is we call this function here, step, to which we pass the prompts that we gave to the language
01:50:24.600 | model, the responses that were generated, and the reward associated with each
01:50:28.520 | response,
01:50:30.360 | and then we run this step function here.
01:50:33.000 | Now, the step function first checks that the tensors you pass it are correct,
01:50:40.040 | so the data types and the shapes of the tensors, etc.
01:50:44.360 | Then it converts the scores into tensors, because the scores are one score for each response.
01:50:50.360 | I commented out the code that I don't find
01:50:54.920 | useful for my explanation. There are many features in the HuggingFace implementation, but we will not be using all of them;
01:51:00.920 | I will just concentrate on explaining the vanilla PPO like it was described in my slides
01:51:08.200 | The first thing that we need to do is to calculate the log probabilities of the actions,
01:51:13.560 | which we need in order to calculate the gradient.
01:51:17.560 | We do it here in this function: given the
01:51:20.280 | answers, so the text generated by our
01:51:24.040 | model,
01:51:25.720 | and the queries that were used (here they are called queries and responses,
01:51:29.480 | but they are actually the prompts and the generated text),
01:51:32.520 | HuggingFace calculates the log probabilities for each step. How do they calculate them?
01:51:37.720 | Well, they call this function, batched_forward_pass,
01:51:41.080 | in which they pass the model from which the
01:51:44.200 | answers were generated, so the model that generated the text, and the prompts that were used to generate this text,
01:51:51.080 | and they divide these
01:51:54.680 | queries and responses into mini-batches and then they run them through the model, as we saw in the slides.
01:52:02.440 | So let's go back here, I think
01:52:07.480 | Here we know that we can calculate the log probabilities corresponding to each position
01:52:13.320 | based on the
01:52:15.720 | Text and the question that was asked so we can create a concatenation
01:52:19.820 | Of the question and the text that was generated. We pass it to the model. The model will generate some logits one for each position
01:52:27.180 | Of the token. We only take the log probability of the next token because we already know which next token was generated
01:52:34.200 | So we know that for this particular prompt made up of these four tokens, the next token is "Shanghai",
01:52:39.400 | so we only take the log probability corresponding to the word "Shanghai", and this is what is done in this line here.
01:52:45.240 | So we ask the language model to generate the logits corresponding to all the
01:52:49.400 | positions, then we calculate the log probabilities from these logits. How?
01:52:58.520 | We calculate the log softmax, exactly like in my slides.
01:53:01.960 | So we calculate the log softmax here, as you can see: for each position's logits we calculate the
01:53:07.320 | log softmax, which gives the
01:53:10.520 | log probabilities for each position.
01:53:13.720 | But we are only interested in the position corresponding to the next token and this is done here with the gather function
01:53:19.880 | You can see here. So
01:53:21.480 | From all the log probabilities
01:53:22.920 | It only selects the one corresponding to the next token because we already know which token was generated
01:53:28.120 | So now we have the log probabilities
01:53:30.120 | and we can
01:53:32.680 | save them, but we don't want the log probabilities for all the tokens.
01:53:38.280 | We also need to keep track of where the log probabilities that we want to consider start
01:53:41.960 | and where they end. Why? Because, as you can see from my slide,
01:53:47.480 | in our trajectory here the question was "where is Shanghai" and the model generated four tokens,
01:53:53.320 | "Shanghai is in China". So in this trajectory we only have four steps,
01:53:58.760 | so we are only interested in the log probabilities of those four tokens.
01:54:02.520 | And this is exactly what we do here. So we keep track of
01:54:05.720 | the starting position from which we consider the log probabilities and the ending position up to which we consider them, because
01:54:16.600 | the model will generate the log probabilities for all the positions
01:54:19.560 | but we only want some of them. And here is what we do: we create a mask in which we say that we
01:54:25.160 | will be considering only these four or five log probabilities, according to which tokens were actually generated by the model.
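A minimal sketch of this log-softmax plus gather step; the helper name and the way the response tokens are sliced out are mine, but the core computation (shift by one position, gather only the log probability of the token that was actually generated) is the one just described.

```python
import torch
import torch.nn.functional as F

def response_log_probs(model, input_ids: torch.Tensor, response_len: int) -> torch.Tensor:
    """Log probabilities of the response tokens, given prompt + response in input_ids (shape [1, seq_len])."""
    logits = model(input_ids).logits                     # [1, seq_len, vocab_size]
    log_probs = F.log_softmax(logits, dim=-1)            # log-softmax over the vocabulary
    # The logits at position t predict the token at position t + 1, so shift by one
    # and gather only the log probability of the token that was actually generated.
    targets = input_ids[:, 1:].unsqueeze(-1)             # [1, seq_len - 1, 1]
    token_log_probs = torch.gather(log_probs[:, :-1], dim=2, index=targets).squeeze(-1)
    # Keep only the response part, i.e. the last `response_len` positions (this plays the role of the mask).
    return token_log_probs[:, -response_len:]
```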
01:54:35.880 | So now we have the log probabilities of each action. So let's go back
01:54:41.720 | To the step function
01:54:49.320 | Okay, so we calculated the log probabilities
01:54:51.420 | according to our
01:54:54.180 | offline policy
01:54:55.880 | Why do we do it here inside the step method and not outside?
01:54:59.480 | Well, because HuggingFace is a user-friendly library:
01:55:02.840 | they don't want to put on the user the burden of calculating the log probabilities of each action.
01:55:09.240 | They do it inside the library
01:55:11.160 | So they only ask the user to generate the responses for each prompt and then they take care of calculating the rest of the information
01:55:19.420 | Now we also need to calculate the log probabilities with respect to the reference model,
01:55:23.900 | so the frozen model. Why? Because we also need to calculate the KL divergence that will be used to
01:55:29.660 | penalize the reward for each position, because we want to penalize the model for producing
01:55:35.680 | log probabilities that are too different from the frozen model.
01:55:40.540 | Otherwise the model will just do what is known as reward hacking, which is generating
01:55:46.220 | random tokens that give a good reward but do not make sense to the user.
01:55:51.660 | So we also need to use the same method,
01:55:54.700 | this batched_forward_pass, with the frozen model to generate its log probabilities,
01:56:00.160 | which will be used to calculate the KL divergence and penalize the reward.
01:56:04.300 | The next step we do is we actually compute these rewards. So how do we compute the rewards?
01:56:10.140 | Well using the log probabilities of the model that we are trying to optimize and the frozen model because we need to calculate the KL divergence
01:56:17.280 | We have this mask which indicates which log probabilities we need to take into consideration, because we have the log probabilities for the whole response
01:56:25.100 | but only some of them are interesting for us, because they belong to the trajectory.
01:56:29.440 | And let's see how to compute the rewards
01:56:34.620 | So the rewards are computed as follows. We calculate the KL penalty, which is the difference in log probabilities:
01:56:39.980 | if you go here, you can see that the KL divergence is estimated as just a difference of log probabilities, as you can see here,
01:56:46.940 | and we
01:56:49.400 | penalize, as you can see here:
01:56:51.260 | the reward is basically just the KL penalization, which is the KL divergence multiplied by some factor,
01:56:57.820 | the penalty coefficient,
01:56:59.900 | and then we add the score.
01:57:04.380 | We saw before that the score is just the number associated with each response
01:57:10.140 | by our reward model. Our reward model is just a sentiment classification pipeline that generates one single number
01:57:17.980 | for each response, indicating how
01:57:20.940 | positive or how negative the generated response is.
01:57:25.420 | Because we only have one score per generated response,
01:57:30.620 | this reward is associated with the last token. So let me show you in the slides:
01:57:37.340 | here we were computing the reward for each step,
01:57:41.660 | but actually the sentiment classification model will compute the reward only once, for the full generated text,
01:57:50.380 | while to calculate the reward of the trajectory we of course need a reward for each
01:57:57.900 | state-action pair.
01:58:00.140 | So we compute the KL penalty for each position, because we know the log probabilities of the frozen model and of the
01:58:07.660 | model that we are trying to optimize. So we have the KL penalty for each position, but we have the score only for the last one.
01:58:13.660 | This is exactly what we are doing here: we calculate the
01:58:17.280 | KL penalty for each position, but the score is only added to the last
01:58:23.340 | token, so here, in this position here.
01:58:27.580 | Then, when we compute the advantage, because we compute the advantage starting from the last step to the first,
01:58:32.780 | we will kind of
01:58:35.340 | propagate this reward to the previous steps, and we will see this later.
01:58:39.180 | So now we have found a way to calculate the
01:58:41.820 | rewards associated with each position, in which the score given by the sentiment classification
01:58:48.720 | is only given to the last token, while the KL penalty is given to each position.
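A minimal sketch of this reward construction, assuming we already have the per-token log probabilities of the policy and of the frozen model for the response; the function name and the KL coefficient value are illustrative, a simplified version of what the library does.

```python
import torch

def per_token_rewards(
    logprobs: torch.Tensor,      # [response_len] log probs of the generated tokens under the policy
    ref_logprobs: torch.Tensor,  # [response_len] log probs of the same tokens under the frozen model
    score: float,                # single scalar from the sentiment reward model
    kl_coef: float = 0.2,        # illustrative KL penalty coefficient
) -> torch.Tensor:
    kl = logprobs - ref_logprobs   # per-token KL estimate: difference of log probabilities
    rewards = -kl_coef * kl        # every position is penalized for drifting from the frozen model
    rewards[-1] += score           # the reward model's score is added only to the last token
    return rewards
```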
01:58:56.540 | So, let's go back
01:58:58.540 | Okay, so we have computed the rewards; now we can compute the advantages.
01:59:04.720 | Let's see how we compute the advantages. To compute the advantages, we need the values.
01:59:08.860 | What are the values? Well, a value is the estimation of the
01:59:13.020 | value function.
01:59:15.740 | As we saw before, the value is computed by using the same model,
01:59:20.140 | so the policy network, with an additional head,
01:59:23.500 | which is a linear layer that gives us the value estimation for that particular
01:59:28.160 | state. So let me show you in the slides:
01:59:31.260 | here we saw before that the policy network,
01:59:36.700 | so the model that we are trying to optimize, also has an additional linear layer that gives us a value
01:59:41.900 | estimation for each
01:59:44.220 | step of the trajectory.
01:59:46.540 | And actually, when we calculated the log probabilities, this function also returned the value head's output, the value
01:59:53.340 | estimation for each
01:59:55.900 | step of the trajectory. Then we can use the estimated values, plus the rewards that we calculated,
02:00:01.520 | plus the mask, because we need to know which values we have and which we don't have,
02:00:06.220 | to compute the advantages using the same formula that we saw before. So we start from
02:00:12.620 | the formula, which is this one here. So let's go back to the formula.
02:00:24.700 | We calculate the delta t
02:00:26.700 | to compute the advantage estimation at time step t so
02:00:30.940 | Here we are computing the first delta t which is the reward at time step t plus gamma as you can see here
02:00:39.180 | Multiplied by the value at time step t plus one and this is here. So it's zero if we do not have any future
02:00:45.420 | Values, otherwise, it's the value at time step t plus one
02:00:49.020 | Minus the value at time step t exactly according to this formula here. You can see here
02:00:54.860 | And then we use this delta value to compute the
02:00:58.060 | GAE estimate, which is the delta plus gamma multiplied by lambda multiplied by the
02:01:05.260 | GAE at the next time step, which is exactly what we do here: delta at time step t plus gamma multiplied by lambda multiplied by the
02:01:13.420 | advantage estimate at time step t plus one.
02:01:16.460 | And we do it from the last
02:01:19.340 | item in the trajectory to the first item in the trajectory;
02:01:24.240 | that's why we do this for loop in reverse.
02:01:27.820 | And then we reverse the result, because we computed the advantages
02:01:32.060 | in reverse order, so we flip them back to have them ordered from time step 0 to
02:01:38.460 | capital T instead of from capital T to 0.
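A minimal sketch of this reversed GAE loop for a single trajectory; the gamma and lambda defaults are illustrative values, not necessarily the ones used by the library.

```python
import torch

def gae_advantages(rewards: torch.Tensor, values: torch.Tensor,
                   gamma: float = 1.0, lam: float = 0.95) -> torch.Tensor:
    """Generalized Advantage Estimation over one trajectory (1-D tensors of equal length)."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    last_gae = 0.0
    for t in reversed(range(T)):                              # from the last step back to the first
        next_value = values[t + 1] if t + 1 < T else 0.0      # zero if there is no future value
        delta = rewards[t] + gamma * next_value - values[t]   # delta_t = r_t + gamma * V_{t+1} - V_t
        last_gae = delta + gamma * lam * last_gae             # A_t = delta_t + gamma * lambda * A_{t+1}
        advantages[t] = last_gae
    return advantages
```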
02:01:40.940 | Then we compute the Q values that will be used to
02:01:45.900 | optimize the value function, as you can see here.
02:01:49.740 | To optimize the value head, so the value estimation,
02:01:53.280 | we need to have an
02:01:56.060 | estimate of the value function,
02:01:59.260 | but according to the trajectory that we have sampled. And what is the estimate of the value function according to the trajectory?
02:02:05.840 | It is actually the Q function.
02:02:09.180 | Why? Okay, let me write it here,
02:02:15.580 | otherwise it's not easy to understand.
02:02:17.980 | The value function tells us what is the value
02:02:23.260 | of a particular state, so what is the expected return that we can get
02:02:29.500 | by starting from that particular state,
02:02:31.500 | and we can
02:02:34.300 | actually approximate it with the Q function from the sampled trajectories. Why?
02:02:40.060 | Because the value function
02:02:44.140 | at a state s is the expected return,
02:02:48.480 | taken over all possible actions a, of starting from the
02:02:56.540 | state s
02:02:57.980 | and taking action a. So the value function can actually be calculated from the Q function,
02:03:03.900 | but as
02:03:06.540 | an expectation over all the possible actions that we can take,
02:03:09.500 | which means that
02:03:11.900 | the Q function tells us the expected return if we start from state s and take action a, while the value function tells us
02:03:18.460 | the expected return that we can get if we only start from
02:03:21.260 | state s and act according to the policy,
02:03:25.980 | which can also be
02:03:28.380 | calculated as the expected
02:03:31.400 | return of the Q function, that is, an
02:03:34.860 | expectation over all the possible actions that we can take,
02:03:42.140 | which can be thought of as the
02:03:46.220 | average return that we can get by starting from the
02:03:51.580 | state s, averaged over all the possible actions that we can take.
02:03:55.260 | But we do not have all the possible actions,
02:04:02.000 | so we can approximate this expectation with a sample mean over the actions that we have in our trajectory.
02:04:05.280 | So we have some state-action pairs in our trajectory,
02:04:09.180 | so we can actually approximate the value using the Q of the
02:04:11.920 | (s, a) pairs that we have in our trajectory.
02:04:16.940 | As you remember, the formula for the advantage is: the advantage A(s, a) at a particular time step is equal to Q(s, a)
02:04:24.700 | minus V(s), so we can get that
02:04:29.740 | Q(s, a) is equal to the advantage
02:04:33.520 | A(s, a) plus the value V(s).
02:04:36.700 | And this is exactly what we are doing
02:04:39.660 | here: to get the Q values
02:04:43.420 | we are calculating the advantages plus the values, and this term will be used to
02:04:47.660 | calculate the loss for the value head, as we will see later. So remember these returns we are computing here.
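In code, this step is just the sum described above; a tiny sketch, with a helper name of my choosing:

```python
import torch

def compute_returns(advantages: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
    # Q(s, a) = A(s, a) + V(s): per-step return estimates,
    # later used as regression targets for the value head.
    return advantages + values
```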
02:04:54.460 | Okay. So now we have computed the advantages and the values
02:04:59.100 | Now we still are in the first phase. So we have sampled some trajectories from our model that we're trying to optimize
02:05:07.200 | We computed the rewards.
02:05:11.100 | We also computed the log probabilities for each time step
02:05:14.620 | We also computed the advantages for each time step and we also computed the q values for each time step
02:05:20.380 | Which are used for the value head
02:05:23.260 | Now let's go to the second phase of the PPO algorithm, phase two,
02:05:27.340 | Which means that we take some mini-batch from these trajectories
02:05:31.040 | We optimize the model based on the estimated gradient
02:05:35.680 | We do it with many steps and then again, we sample new trajectories
02:05:39.440 | We sample some mini-batches. We optimize the model according to the loss
02:05:45.100 | We do it many times and then again, we sample new trajectories
02:05:48.000 | So let's go back to our step function
02:05:53.340 | We are here
02:05:56.860 | Okay, so we computed
02:05:59.020 | the advantages.
02:06:01.420 | Now we can use the sampled trajectories to optimize the model. So what do we do?
02:06:07.020 | We sample some mini-batches. This is the mini-batch that we are sampling
02:06:10.540 | So we sample a mini-batch as you can see here
02:06:14.060 | And then, what do we need to do?
02:06:16.700 | First of all, as we saw in the formula of the PPO loss, we need to have the log probabilities according to the model
02:06:23.900 | we sampled from, which is this pi old, and also according to the model that we are trying to optimize, using these
02:06:30.540 | sampled mini-batches. The first part is exactly what we did here with the offline policy:
02:06:34.380 | we sample from some policy and we need to have the trajectories from this policy and also the log probabilities from this policy,
02:06:41.420 | which is the offline policy. Then we use these sampled trajectories:
02:06:44.780 | we take a mini-batch and we run gradient ascent on the online policy,
02:06:49.740 | but we also need to have the log probabilities according to this online policy,
02:06:53.260 | the one that we are trying to optimize, and this is exactly what we do here.
02:06:57.580 | So we run again the method that we ran before, batched_forward_pass, to calculate the log probabilities, the logits and the value
02:07:05.420 | head prediction according to the mini-batch that we are considering, and then we train the model on this mini-batch.
02:07:12.780 | Let's see how it's done
02:07:14.860 | The first thing that we need to do is to calculate the PPO loss according to the formula that we saw in the slides,
02:07:21.900 | so let's go into the loss.
02:07:23.900 | In the loss we have to calculate three terms. The first is the loss for the value head, which is this loss here.
02:07:30.700 | HuggingFace is actually also calculating a clipped version of this loss,
02:07:36.060 | but let's not consider the clipped version for now;
02:07:39.660 | it's just an optimization, and it doesn't have to be in vanilla PPO, so we don't have to do it.
02:07:46.300 | So we are taking the
02:07:48.700 | values that were predicted by the model and the returns that we calculated as the sum of the advantages plus the values, as we saw before.
02:07:55.660 | This is the loss for the value head,
02:08:00.620 | according to this formula here. As you can see, the returns are basically the
02:08:05.660 | estimated Q functions according to our trajectories.
02:08:10.880 | So this is the loss of the value head; then we have the policy loss of PPO,
02:08:16.720 | which is just the advantage term multiplied by the ratio of the probabilities. What is this ratio? Let's
02:08:27.260 | look at the formula first. Okay, as you can see here we have the ratio of the two probabilities,
02:08:34.080 | but we have the log probabilities. So what we can do is calculate it like this.
02:08:38.880 | Let's write it here.
02:08:42.620 | Okay, we have the log probabilities, so
02:08:46.220 | the log of a minus the log of b,
02:08:55.740 | and then we take the exponential of this.
02:08:57.740 | This is equivalent to taking the exponential of the log of a divided by b,
02:09:07.660 | which is equal to a divided by b,
02:09:11.340 | the ratio of the two probabilities. So because we do not have the probabilities themselves, but we have the log probabilities,
02:09:16.880 | we calculate it like this: we first take the log probabilities of the
02:09:21.180 | online model minus the log probabilities of the offline model, and then we apply the exponential, which results in a divided by b,
02:09:28.540 | which is exactly what we want here. So let's check
02:09:31.260 | In the code
02:09:35.180 | We are calculating the difference in the log probabilities and applying the exponential which will result in this ratio here being calculated
02:09:45.020 | Then,
02:09:51.680 | this ratio is multiplied by the advantage term, as you can see here. So we need to multiply it by this advantage term,
02:09:55.360 | and then we also need to calculate the other part of this
02:09:59.600 | expression, which is the clipped term, as you can see here:
02:10:08.000 | again the ratio, but clipped between one minus epsilon and one plus epsilon.
02:10:11.840 | We are doing it here: the advantage multiplied by the ratio clipped between
02:10:15.220 | one minus epsilon and one plus epsilon.
02:10:20.720 | Why do we have this minus sign here? Because
02:10:23.780 | the goal in PPO is to maximize
02:10:33.140 | this term here, but we are using PyTorch, and the optimizer in PyTorch always runs gradient descent,
02:10:38.800 | which is the opposite of gradient ascent.
02:10:44.500 | So instead of maximizing this term, we can minimize its negative,
02:10:50.160 | and this is exactly why we have this minus sign:
02:10:59.520 | because PyTorch always minimizes, we multiply this by minus one, so it's like we are maximizing this term.
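Putting the pieces just described together, here is a minimal sketch of the clipped PPO objective plus an unclipped value-head loss; the signature, the epsilon and the value coefficient are illustrative, a simplification of what the library computes rather than its exact code.

```python
import torch

def ppo_losses(
    old_logprobs: torch.Tensor,  # log probs of the taken actions under the rollout (offline) policy
    new_logprobs: torch.Tensor,  # log probs of the same actions under the current (online) policy
    advantages: torch.Tensor,
    new_values: torch.Tensor,    # value head predictions recomputed on the mini-batch
    returns: torch.Tensor,       # advantages + values from the rollout (the Q estimates)
    clip_eps: float = 0.2,       # illustrative clipping range
    vf_coef: float = 0.1,        # illustrative weight of the value loss
) -> torch.Tensor:
    ratio = torch.exp(new_logprobs - old_logprobs)  # pi_new / pi_old from the log probabilities
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Minus sign: PyTorch minimizes, so minimizing the negative surrogate maximizes the PPO objective.
    policy_loss = -torch.mean(torch.minimum(unclipped, clipped))
    value_loss = torch.mean((new_values - returns) ** 2)  # simple, unclipped value-head loss
    return policy_loss + vf_coef * value_loss
```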
02:10:59.520 | The entropy is calculated here as you can see
02:11:05.360 | Then they also have
02:11:12.880 | other terms that we do not use, because they are some optimizations that are not present in the vanilla PPO loss.
02:11:16.960 | So the PPO loss is calculated as the loss of the policy
02:11:22.240 | plus the value head loss multiplied by its coefficient, which you can see here.
02:11:26.880 | They also calculate the entropy, but they do not use it; I don't know why, to be honest.
02:11:31.840 | They calculate the entropy here, using the logits, as you can see,
02:11:37.600 | and they do it not with the formula that I show in the slides, which is the actual formula of the entropy,
02:11:45.840 | but with a version based on logsumexp; I am putting here some information for those who want
02:11:56.000 | the derivation of how it's done. Basically, Wikipedia says that the convex conjugate of logsumexp is the negative entropy.
02:11:56.000 | Yeah, so we have also this entropy term here and we return our loss
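For reference, here is one way to get the entropy from the logits using logsumexp, a sketch of the identity mentioned above; it follows from log p_i = logits_i - logsumexp(logits).

```python
import torch

def entropy_from_logits(logits: torch.Tensor) -> torch.Tensor:
    # H(p) = -sum_i p_i * log p_i with p = softmax(logits).
    # Since log p_i = logits_i - logsumexp(logits), this simplifies to
    #   H = logsumexp(logits) - sum_i softmax(logits)_i * logits_i
    probs = torch.softmax(logits, dim=-1)
    return torch.logsumexp(logits, dim=-1) - torch.sum(probs * logits, dim=-1)
```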
02:12:01.600 | So let's go back to the optimization step
02:12:04.400 | So we go here
02:12:07.520 | So now we are optimizing over a mini batch
02:12:10.960 | Which means that the first thing that we do is we calculate the loss and then we run a back propagation on this loss
02:12:16.640 | and then we optimize
02:12:18.720 | and we do it for
02:12:20.560 | many
02:12:22.560 | mini-batches. So let me go back again:
02:12:26.720 | we train on one mini-batch, then we do it again for many mini-batches, as you can see here,
02:12:35.360 | and after a while we return
02:12:37.600 | here and we do the procedure again. So we again generate new trajectories,
02:12:44.100 | then we calculate the rewards, and the HuggingFace library will calculate the log probabilities
02:12:53.460 | according to these trajectories, the advantage estimates according to these trajectories, and the value estimates according to the trajectories.
02:13:01.140 | Then we'll iteratively
02:13:04.080 | sample some mini-batches from these trajectories, and then we run gradient ascent
02:13:09.600 | according to the PPO loss on these mini-batches many times, and then again we restart the loop. And this is how we
02:13:16.560 | run the PPO algorithm for reinforcement learning from human feedback.
02:13:22.180 | Let's go back to the slides and thank you guys for watching this video, I know it has been very very demanding
02:13:31.760 | It has been one of my most difficult videos, also for me, to describe all these parts without
02:13:37.360 | getting lost myself.
02:13:40.240 | I know that I gave you a lot of knowledge, because
02:13:42.560 | actually PPO and reinforcement learning are quite big topics; there are entire university courses on this stuff, so it's not easy to give
02:13:50.160 | a complete understanding in just a few hours. This is also one of the reasons I decided not to code it from scratch, because
02:14:00.000 | it would make the video like 10 hours long,
02:14:02.160 | but at least I hope that now you have a deep understanding of how each step of
02:14:08.320 | reinforcement learning from human feedback is done. I will share with you the code commented by me, with the unnecessary parts
02:14:15.540 | removed, or at least with comments telling you explicitly which parts are not necessary for the PPO algorithm.
02:14:22.160 | It took me more than one month of research to prepare this video and I had to record it
02:14:28.560 | multiple times because
02:14:30.560 | I made some mistakes, and then I realized that I forgot something in the slides,
02:14:34.880 | Then I had to fix them etc, etc
02:14:37.520 | So the best way to help me guys is to share this video with others if you found it useful
02:14:42.320 | I know that it's very difficult
02:14:43.600 | So I suggest watching it multiple times because the first time you watch this video you will have some understanding but not very deep
02:14:51.360 | The second time you will realize that you will have a better understanding
02:14:55.520 | And maybe you will need to review some concepts from reinforcement learning or from the transformer to better understand it fully
02:15:02.000 | So I recommend watching it multiple times and please leave in the comments if some part was not clear
02:15:08.240 | I will always try to help you and yeah, have a nice day