Direct Preference Optimization (DPO) explained: Bradley-Terry model, log probabilities, math
Chapters
0:00 Introduction
2:10 Intro to Language Models
4:08 AI Alignment
5:11 Intro to RL
8:19 RL for Language Models
10:44 Reward model
13:07 The Bradley-Terry model
21:34 Optimization Objective
29:52 DPO: deriving its loss
41:05 Computing the log probabilities
47:27 Conclusion
00:00:00.000 |
Hello guys, welcome back to my channel. Today we are gonna talk about DPO, which stands 00:00:04.480 |
for Direct Preference Optimization. It's a new technique that came out in the middle 00:00:08.760 |
of last year, in 2023, to align language models. Let's review the topics of today. I will start 00:00:17.320 |
with a short introduction to language models, as usual, so we can review how language models 00:00:21.700 |
work. Then we will introduce the topic of AI alignment, so what we mean by AI alignment. 00:00:27.660 |
And then we will review reinforcement learning. Now, you may be wondering why are we reviewing 00:00:31.780 |
reinforcement learning if the whole point of DPO is to remove reinforcement learning 00:00:36.340 |
from language models. Well, the reason is that, actually, even if DPO does not use reinforcement 00:00:42.220 |
learning algorithms, they are still interconnected, especially when we talk about the reward model 00:00:47.040 |
and the preference model. So in order to understand the reward model and the preference model, 00:00:51.840 |
we need to review reinforcement learning and how the reward model affected the process 00:00:56.560 |
in reinforcement learning from human feedback. In the last part of the video, we will derive 00:01:02.760 |
the DPO loss, so we will understand where it comes from. I will also show you the 00:01:11.240 |
code on how to compute the log probability, so how we actually can use this loss in practice. 00:01:18.440 |
Now what are the prerequisites for watching this video? Well, for sure that you are familiar 00:01:21.920 |
with a little bit of probability and statistics, not much, for example, conditional probability. 00:01:28.420 |
That you are familiar with deep learning, so what we mean by gradient descent and loss functions. 00:01:33.900 |
It's really great if you have watched my previous video on reinforcement learning from human 00:01:37.400 |
feedback in which I explain all the aspects of the reward model and the reinforcement 00:01:42.760 |
learning framework and the PPO algorithm. But it's not necessary for this video because I will review 00:01:49.040 |
most of the parts that are needed to understand DPO, but it's really great if you have 00:01:53.000 |
already watched that video so you can compare the two methods. And also that you're familiar 00:01:58.960 |
with the transformer model because we will be using it in practice when we want to compute 00:02:03.360 |
the log probabilities. Otherwise, we don't know how to use the loss of the DPO. Let's 00:02:09.440 |
start our journey. So what is a language model? Well, a language model is a probabilistic 00:02:15.240 |
model that assigns the probabilities to a sequence of words. In practice, given a prompt, 00:02:21.440 |
for example, a language model allows us, let me use the laser. So given a prompt, for example, 00:02:27.080 |
"Shanghai is a city in", a language model tells us the probability of what might be the next 00:02:34.320 |
token or word. Now, in my videos, I always make the simplification that a token is a 00:02:39.760 |
word and the word is a token. This is actually not the case in most language models, but 00:02:44.520 |
it's useful for explanation purposes. So what is the probability that the next token 00:02:50.640 |
is China or the next token is Beijing or the next token is cat or pizza given a particular 00:02:56.120 |
prompt? This is the only thing that a language model does. And the language model gives us 00:03:02.200 |
this probability. Now, you may be wondering, how can we use this language model to generate 00:03:06.840 |
a text? Well, we do it with an iterative process. So we take a prompt, for example, where is 00:03:13.080 |
Shanghai? We give it to the language model. The language model will give us a list of 00:03:16.920 |
probabilities or what is the possible next word or token. Suppose that we choose the 00:03:22.000 |
token with the most, with the highest probability score. So suppose it's Shanghai. We take this 00:03:27.880 |
token, we select it and we put it back into the prompt and we ask again, the language 00:03:32.560 |
model, what is the next token? Then the language model again will give us a list of probabilities 00:03:37.180 |
over what is the possible next token. We select the one that we think is the most relevant. 00:03:42.960 |
Usually we select the one that is most probable and we put it back into the prompt and we 00:03:49.720 |
ask again, the language model, et cetera, et cetera, until we reach a specified number 00:03:54.200 |
of generated tokens or we reach the end of sentence token, which is a special token. 00:03:59.440 |
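To make this iterative process concrete, here is a minimal sketch of greedy decoding with a Hugging Face causal language model. The "gpt2" checkpoint and the 20-token limit are just placeholders, and real pipelines usually call model.generate or use fancier sampling strategies; this is only meant to show the loop described above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint: any causal language model from the Hugging Face Hub works here.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Where is Shanghai?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

for _ in range(20):  # stop after a fixed number of generated tokens
    with torch.no_grad():
        logits = model(input_ids).logits  # shape: (batch, seq_len, vocab_size)
    # Greedy strategy: pick the most probable next token given the prompt so far.
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    # Put the chosen token back into the prompt and ask the model again.
    input_ids = torch.cat([input_ids, next_token], dim=-1)
    if next_token.item() == tokenizer.eos_token_id:  # end-of-sentence token
        break

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))
```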
In this case, after four tokens generated, the language model will probably say Shanghai 00:04:05.240 |
is in China, which is the answer to our question. What do we mean by AI alignment? Now when 00:04:13.120 |
we train language model, we train it on a massive amount of data. For example, thousands 00:04:18.680 |
of books, billions of web pages and the entire Wikipedia, et cetera. This gives the language 00:04:24.740 |
model a vast knowledge to complete any prompt in a reasonable way. However, this does not 00:04:33.160 |
teach the language model to behave in a particular way. For example, the pre-training does not 00:04:39.240 |
teach the language model to be polite or to not use any offensive language or to not use 00:04:44.120 |
any racist expressions, et cetera, because the language model will just behave based 00:04:48.200 |
on the data that it has seen. If you feed the internet data, the language model will 00:04:52.720 |
behave very, very, very badly actually. We need to kind of align the language model to 00:04:57.680 |
a desired behavior. We don't want the language model to use any offensive language. We don't 00:05:02.040 |
want it to be racist. We want the language model to be helpful to the user, so to answer 00:05:06.840 |
questions like an assistant, et cetera, et cetera. And this is the goal of AI alignment. 00:05:12.400 |
Now let's talk about reinforcement learning. So reinforcement learning is an area of AI 00:05:17.760 |
that is concerned with training intelligent agents to perform actions in an environment 00:05:23.400 |
in order to maximize a reward that they receive from this environment. Let me show you with 00:05:29.040 |
a very concrete example. I usually always use my cat, Oleo, for examples, so let's talk 00:05:34.440 |
about Oleo. Oleo is the agent in this case, in this reinforcement learning scenario, and 00:05:39.860 |
he lives in a very simple world. Let's call it a grid world that is made of cells in which 00:05:45.980 |
the cat's position is indicated by two coordinates, the X position and the Y position. This can 00:05:52.220 |
also be treated as the state of the agent, because every position corresponds to a 00:05:57.800 |
particular state. The agent, when it is in a particular state, it can take some actions. 00:06:04.360 |
In the case of the cat, it can go right, left, up or down. For every action that the agent 00:06:11.680 |
takes, it will receive some reward from the environment. It will for sure change its state 00:06:17.360 |
to a new one. So for example, when the cat moves down, it will change to a new state, 00:06:22.200 |
to a new position, and it will receive some reward according to a reward model that we 00:06:26.800 |
specified. In my case, I have specified the following reward model. So when the cat moves 00:06:31.980 |
to an empty cell, it receives a reward of zero. If it moves towards the broom, it receives 00:06:37.400 |
a reward of minus one. If somehow it arrives to the bathtub, it will receive a reward of 00:06:44.240 |
minus 10 because my cat is very scared of water. And if it arrives to the meat, which 00:06:48.720 |
is the cat's dream, it will receive a reward of plus 100. Now, what dictates what action 00:06:57.400 |
the agent will take given a particular state or position? Well, it is the policy. The policy 00:07:03.480 |
indicates what is the probability of the next action among all the actions that are available 00:07:09.400 |
that the agent can take given a particular state. And we usually write it like this, 00:07:15.220 |
so that the next action at time step t is distributed like the distribution induced 00:07:21.360 |
by the policy according to the state the agent is in. 00:07:28.760 |
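In symbols (a small sketch of the notation being described, since the slide itself is not in the transcript):

```latex
a_t \sim \pi(\cdot \mid s_t)
```

That is, the action at time step t is sampled from the distribution that the policy assigns to the actions available in the current state s_t.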
Now, what is the goal in reinforcement learning? The goal in reinforcement learning is to select a policy or to optimize a policy 00:07:33.900 |
in order for the agent to maximize the expected return when it acts according to this policy. 00:07:41.840 |
So imagine we have such a policy that is optimized. Well, our policy, for sure, if the goal is 00:07:50.020 |
to maximize the expected reward when using this policy, for sure, we will have a policy 00:07:54.780 |
that will take us on average to the meat because that's one way to maximize the expected reward. 00:08:00.240 |
And for sure, it will be a policy that will allow us to minimize the chance of ending 00:08:04.880 |
up in the water here or to the broom here. Now, you may be wondering, okay, the cat can 00:08:11.340 |
be seen as a reinforcement learning agent, as a physical agent that takes some actions. 00:08:15.380 |
But what is the connection between reinforcement learning and language models? Well, as we 00:08:19.900 |
saw before, in reinforcement learning, we have this thing called policy, in which given 00:08:24.860 |
a state, the policy tells us what is the probability over the action space of the next action. 00:08:31.780 |
So what possible next action we can take and the probability of each action. This is also 00:08:36.700 |
something similar to what we do with language models, because in language models, we also 00:08:40.420 |
have some kind of state, which is our prompt. And we ask the language model to give us the 00:08:44.740 |
probability of the next token, or we can consider it the next action that the language model 00:08:50.180 |
can take. And we want to reward this language model for selecting tokens in such a way that 00:08:58.260 |
they end up generating good responses. And we don't want to reward the language model 00:09:03.220 |
for selecting sequence of tokens that end up giving us bad responses. Now, imagine we 00:09:09.140 |
are trying to train a language model that needs to act like an AI assistant. So for 00:09:15.620 |
sure we want the language model to be helpful, to answer questions in a meaningful way, so 00:09:21.220 |
not just output garbage. We want the language model to not be racist or not use any offensive 00:09:26.420 |
language. So this is all good behaviors that we want from this language model. So we may 00:09:30.900 |
want to build a reward model that will treat good responses, for example, responses that 00:09:36.620 |
actually answer the question asked by the user. And we will reward them with a high 00:09:42.060 |
reward. And we give maybe zero reward or negative reward to all those answers that are not coherent 00:09:48.860 |
with what we want. So for example, if the language model generates dirty jokes or racist 00:09:57.660 |
jokes, for example, we can give zero reward to those responses. So the language model 00:10:06.340 |
acts also as a policy because the policy is something that, given a prompt, tells you 00:10:11.540 |
what is the probability over the action space, or in this case, the probability over the 00:10:15.780 |
token space. So we want to optimize this policy. So we want to optimize the language model 00:10:21.780 |
to maximize the probability, to maximize the expected return or the expected reward that 00:10:28.260 |
it receives from our reward model. So we want to optimize our language model to generate 00:10:33.300 |
good response, because that's one way to obtain a high reward from the reward model. Now you 00:10:39.580 |
may be wondering, OK, but how to define the reward model for a language model? Well, one 00:10:44.780 |
way would be, OK, we can have a list of questions and answers generated by the language model, 00:10:50.380 |
and then we can give a numerical reward to each one of them. And then we can use some 00:10:54.460 |
reinforcement learning algorithm to feed this reward model to the language model to optimize 00:10:59.420 |
it. But the problem is, what kind of reward can we give to each of these pairs of questions 00:11:05.260 |
and answers? Because, for example, let's look at the first question. Where is Shanghai? 00:11:09.860 |
The answer, suppose it is generated by the language model, is that Shanghai is a city 00:11:13.700 |
in China. Now, in my opinion, this is a good response because it's short and up to the 00:11:19.260 |
point. But some other people maybe think that only the word China is enough because the 00:11:25.660 |
user just asked, where is Shanghai? So there is no need to repeat the word Shanghai. But 00:11:29.860 |
someone else maybe think that this response is too short and the assistant should say, 00:11:34.940 |
hello, I think that the answer to your question is Shanghai is a city in China. So different 00:11:40.940 |
people will have different opinions on what reward to assign to this particular pair of 00:11:46.260 |
question and answer. Because we humans are not very good at finding a common ground for 00:11:51.660 |
agreement. But unfortunately, we are very good at comparing. And we will exploit this 00:11:56.620 |
fact. So instead of building a data set that is made of questions and answers and the rewards, 00:12:03.980 |
because we do not know what kind of reward to assign, we will build a data set of questions 00:12:09.100 |
and multiple answers. And then we ask people to choose an answer that they like, according 00:12:15.420 |
to some preference that we have. So we want to generate, for sure, a language model that 00:12:19.860 |
is helpful. So we want the general language model to give responses that are correct. 00:12:24.920 |
And we want the language model to be polite, for example. So for example, imagine we have 00:12:29.860 |
a list of questions. And then we ask the language model by using, for example, a high temperature 00:12:35.300 |
to generate multiple answers. And then we ask people to choose which one they like. 00:12:40.020 |
In this case, for example, where is Shanghai? For sure, most people will choose the answer 00:12:44.960 |
number one, because Shanghai is a city in China, it's the correct one. For this question 00:12:49.860 |
here, for example, most people will probably choose this answer here, even if it's very 00:12:54.660 |
short because the other one is probably wrong. So using a data set like this, we can actually 00:13:00.660 |
train a model to transform a pair of questions and answers into a numeric reward. Let's see 00:13:06.780 |
how it is done. If you have a pet, you probably know that to teach a particular behavior to 00:13:12.740 |
your cat or to your dog, you need to use biscuits or some treats. So you ask the cat to do something. 00:13:20.340 |
And if the cat does it, then you give it a treat. So it will reinforce this memory in 00:13:25.140 |
your cat. And then the next time the cat is more likely to do it because it will remember 00:13:29.620 |
that it received some treat. And so it will again perform that action again. So it can 00:13:34.740 |
probably receive another treat. This is exactly what we do in reinforcement learning. We want 00:13:39.780 |
to give some digital biscuits to our reinforcement learning agent so that it is more likely to 00:13:46.380 |
perform that action or that series of actions again in order to receive more reward. However, 00:13:53.140 |
the data set that we have built so far is made up of preferences. So we have a question, 00:13:58.580 |
multiple answers, and then we ask people to choose which answer they like. We need to 00:14:03.940 |
convert this data set of preferences into a numeric score that we can give as a reward 00:14:10.620 |
to our language model, to make it more likely to choose the answer that was chosen by the people and 00:14:18.620 |
to make it less likely to choose the answer that was not liked by the people, by our annotators. 00:14:25.180 |
And this can be done through a preference model. In DPO and also in reinforcement learning 00:14:30.820 |
from human feedback, we make use of the Bradley-Terry model. So the Bradley-Terry model is a way 00:14:36.300 |
of converting a data set of preferences into a numeric score called reward that is given 00:14:42.460 |
for each pair of questions and answers. Our goal is to train a model that given a question 00:14:49.500 |
and answer, or a prompt and the generated text, gives a score that resembles the preferences 00:14:57.100 |
that have been chosen by our annotators. This is the expression of the Bradley-Terry model. 00:15:03.940 |
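For reference, written out, the Bradley-Terry model says that the probability that the winning answer y_w is preferred over the losing answer y_l, given the prompt x, is:

```latex
P(y_w \succ y_l \mid x) = \frac{\exp\big(r(x, y_w)\big)}{\exp\big(r(x, y_w)\big) + \exp\big(r(x, y_l)\big)}
```

where r(x, y) is the reward assigned to answer y for prompt x.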
So it is a model meaning that we choose to model our preferences like this. And actually 00:15:09.460 |
it makes sense because it is a probability. So that's why, for example, we use exponentials, 00:15:14.980 |
because the probability has to be non-negative. And also, the probability that 00:15:22.140 |
is assigned to the correct preference, so the probability of choosing the correct answer 00:15:28.780 |
over the wrong answer. So the one that is chosen by the annotators over the one that 00:15:33.900 |
was not chosen by the annotators. Here, I call it the winner and loser because also 00:15:39.300 |
in the DPO paper, they call it winner and loser. It is modeled like this. So it is proportional 00:15:44.420 |
to the exponential of the reward that was assigned to the winning answer. Now, 00:15:52.020 |
how to train a model to convert a dataset of preferences into a numeric reward? We take 00:15:58.580 |
this expression and we can use a maximum likelihood estimation. Now, it doesn't matter if you 00:16:04.560 |
don't know what is maximum likelihood estimation. The point is we want to maximize the probability 00:16:10.260 |
of assigning the correct ordering in our preferences. So we want to maximize the probability of 00:16:16.580 |
choosing the correct answer over the wrong answer. And suppose that we are maximizing 00:16:24.220 |
this expression here. Let's see how we can derive the loss to maximize this expression 00:16:29.580 |
here. If you look at the DPO paper, you will see that they go from the Bradley Terry model, 00:16:35.980 |
which is this one, directly to the loss here, but they don't show you the derivation. So 00:16:40.940 |
I will show you how to derive the loss that maximizes this probability here. The derivation 00:16:50.020 |
is very simple actually. So first of all, as you can see in the loss, you can see this 00:16:55.260 |
function here. It's a sigmoid function. The expression of the sigmoid function is this 00:17:00.180 |
one and this is the graph of the sigmoid. So the expression of the sigmoid function 00:17:04.540 |
is one over one plus e to the power of minus x. The first step of the derivation is to 00:17:11.980 |
prove that two exponentials, so a fraction of this expression here, so exponential divided 00:17:20.180 |
by the sum of two exponentials can be written as a sigmoid of a minus b. So here I call 00:17:26.260 |
all this part here. So let me use the pen. I think it's easier. So this part here, so 00:17:34.340 |
the reward assigned to the winning answer, let's say we call it a, and the reward 00:17:41.820 |
assigned to the losing answer, we call it b. So this expression can be written as e 00:17:48.780 |
to the power of a divided by e to the power of a plus e to the power of b. And we will 00:17:54.860 |
prove that it can be written as the sigmoid of a minus b through the following step. So 00:18:03.280 |
first we can divide, we take this expression, which is basically this one. We just replace 00:18:09.940 |
the rewards with a and b because it makes it simpler to visualize. We divide the numerator 00:18:16.580 |
and denominator by the same quantity, e to the power of a. We can do it. Then, 00:18:25.360 |
at the numerator, e to the power of a cancels out with e to the power of a and it becomes 00:18:29.860 |
a one. Then in the denominator, we add and subtract one. We can do it because it's like 00:18:36.460 |
adding zero. And then we collect the minus one. So we don't change anything. We just 00:18:41.460 |
put the parentheses. This is possible through the associative property. Then we do the common 00:18:48.860 |
denominator for these two expressions. And we arrive to this one. We can simplify e to 00:18:55.260 |
the power of a with minus e to the power of a. So it becomes e to the power of b divided 00:18:59.580 |
by e to the power of a, which thanks to the property of the exponentials can be written 00:19:04.580 |
as e to the power of b minus a. Then we can take a minus sign outside. And this expression 00:19:09.980 |
here is exactly the expression of the sigmoid function you can see here. So it's one over 00:19:15.020 |
one plus e to the power of minus something. So it becomes the sigmoid of that something 00:19:21.220 |
here a minus b. And this is exactly the loss that you see here. So it is the sigmoid of 00:19:27.380 |
the reward assigned to the winning answer minus the reward assigned to the losing answer. 00:19:35.460 |
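Compactly, the chain of algebraic steps just described is the following identity, with a = r(x, y_w) and b = r(x, y_l):

```latex
\frac{e^{a}}{e^{a} + e^{b}}
= \frac{1}{1 + e^{b - a}}
= \frac{1}{1 + e^{-(a - b)}}
= \sigma(a - b)
```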
Here we also see a log because usually we do not model the probability directly but 00:19:41.220 |
we model the log probability. So we have also this log because we want to model the log 00:19:46.740 |
probabilities. It is something that we can do because the logarithm is a monotonic function. 00:19:52.980 |
And also you may be wondering why do we have this minus sign here? This is because we want 00:19:58.300 |
to maximize this expression. But as you know in deep learning frameworks like PyTorch, 00:20:04.780 |
we have an optimizer that is always minimizing a loss. So instead of maximizing something 00:20:09.460 |
we can minimize the negative expression of the objective function which is the same thing. 00:20:14.900 |
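Putting these pieces together, the loss used to train the reward model is the negative log-likelihood of the Bradley-Terry model:

```latex
\mathcal{L}_R(r_\phi, \mathcal{D}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]
```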
So basically we take this loss function and if we apply it to a reward model which is 00:20:20.300 |
a neural network, it will be trained to maximize the probability of giving the correct ordering 00:20:27.540 |
to our preferences which can only happen when it assigns a high reward to the winning answer 00:20:33.740 |
and a low reward to the losing answer. Because if you look at this expression here, as you 00:20:38.900 |
can see the probability is maximized when in the numerator we have the reward assigned 00:20:44.500 |
to the winning answer. So the reward assigned to the winning answer is higher than the one 00:20:48.660 |
assigned to the losing answer. And if you are wondering how to read an expression like 00:20:55.100 |
this, so let me cancel because we will use it a lot, this kind of convention. This one. 00:21:02.060 |
This basically means that we have a data set of preferences where we have a prompt, a winning 00:21:08.020 |
answer and a losing answer and they belong to our data set of preferences and we train 00:21:13.700 |
a model with a gradient descent for each of these preferences we calculate this loss here, 00:21:21.060 |
this expression here. And if we minimize this loss with the gradient descent we will have 00:21:26.340 |
a neural network that is trained for the following: the Bradley-Terry model, basically. 00:21:33.340 |
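As a rough sketch of what one training step of such a reward model could look like in PyTorch (the reward scores below are dummy tensors, and reward_model_loss is a hypothetical helper, not code from a specific library):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the Bradley-Terry model.

    reward_chosen / reward_rejected are the scalar rewards the network assigns
    to the winning and losing answers of each preference pair in the batch.
    """
    # Maximizing log sigmoid(r_w - r_l) is the same as minimizing its negative.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Example with dummy scores for a batch of 3 preference pairs:
loss = reward_model_loss(torch.tensor([1.2, 0.3, 2.0]), torch.tensor([0.1, 0.5, -1.0]))
# In practice this loss would be backpropagated through the reward network.
```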
Okay, now that we have built a reward model, which means that we have a model that given 00:21:40.100 |
a question and answer can assign a numeric reward to the language model if the response 00:21:45.660 |
is correct or looks good according to the behavior that we want from our language model 00:21:50.300 |
or looks bad according to the behavior that we want from our language model. Now we can 00:21:55.900 |
train our language model. So as you recall, what is the goal in reinforcement learning? 00:22:02.900 |
In reinforcement learning, the goal is to optimize a language model, which is also the 00:22:07.660 |
policy of our reinforcement learning agent, in order to maximize the cumulative reward 00:22:14.340 |
when the agent acts according to this policy. In other words, if we, let me use the pen, 00:22:20.760 |
so let's ignore for now this green part here. Suppose there is no green part here. So this 00:22:27.460 |
doesn't exist. Imagine we have a language model, let's call it Pi Theta because it's 00:22:32.940 |
a policy and we want to optimize this policy. So we want to optimize this language model 00:22:39.780 |
in order to maximize the reward that it receives from the reward model. It means that the language 00:22:45.660 |
model will generate answers that give good reward. And how do they get good reward? If the 00:22:51.700 |
answers look good: they, for example, are not racist, they are not using any sexual 00:22:57.100 |
jokes and they are actually answering the question that was asked. However, and this 00:23:04.020 |
is the goal in reinforcement learning from human feedback, for example. That's why it's 00:23:09.860 |
called the reinforcement learning from human feedback. Now, if we use a model, if we use 00:23:18.280 |
an objective like this, that is, we only want to maximize the reward, then the language 00:23:26.340 |
model may become greedy and just output garbage that gives it good reward. So imagine we have 00:23:33.860 |
a reward model that rewards the language model for being polite. The language model may just 00:23:39.420 |
start saying a list of "thank you, thank you, thank you" or "please, please, please" and 00:23:44.660 |
a lot of "please" or a lot of "thank you's" just to get high reward. Because probably 00:23:49.700 |
the word "thank you" and "please" are highly rewarded by the reward model. But we don't 00:23:57.100 |
want the language model to just output garbage to get reward. We want the language model 00:24:02.940 |
to also output something that was according to its training data. So it's a pre-training, 00:24:12.780 |
but we want to change it a little bit so that it also acts according to our reward model, 00:24:18.380 |
so to our data set of preferences. So it is more polite, but without forgetting what it 00:24:23.900 |
has learned from the pre-training. And this is why we add this KL divergence in the objective. 00:24:32.260 |
So let me use the pen again. So we change the objective a little bit. So we want the 00:24:36.820 |
language model to maximize the reward it gets from the reward model. But at the same time, 00:24:42.500 |
we add a constraint to the language model through a KL divergence. Now the KL divergence 00:24:48.300 |
can be thought of as a distance metric. It is not a distance metric, but can be thought 00:24:52.500 |
of as a distance metric between two distributions in which we have a pre-trained model. So a 00:25:00.740 |
language model that was not fine-tuned through reinforcement learning from human feedback 00:25:05.140 |
or DPO. So it's just a language model that has been pre-trained on the Wikipedia, on 00:25:10.580 |
the books, and on the internet web pages. And then we have the language model that we 00:25:15.220 |
are optimizing. So this pi theta, and we want them to be very similar. So we want the language 00:25:20.700 |
model to not change much compared to what it was before the reinforcement learning from 00:25:26.220 |
human feedback or before the DPO training. And this is why we add this KL divergence. 00:25:32.100 |
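In symbols, the constrained objective being described is:

```latex
\max_{\pi_\theta} \;\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big]
\;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x)\big]
```

where beta controls how far the policy pi theta is allowed to drift from the frozen reference model pi ref.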
So we want the language model to maximize its reward, but at the same time, not forget 00:25:37.900 |
or not change too much its output in getting this reward. Now that we know the reinforcement 00:25:46.740 |
learning objective, which is basically also the same objective that we have in DPO, because 00:25:51.860 |
also in DPO we want to train a language model that maximizes a reward, but at the same time 00:25:57.500 |
does not forget its training data. Let's look at what does it mean to actually maximize 00:26:03.100 |
an objective function, because this is an objective function that we have and we want 00:26:06.780 |
to maximize it. But what does it mean to maximize an objective? Let's see. Maximizing a function 00:26:14.820 |
means to find the values of some variable such that the value of the function is maximized. 00:26:21.540 |
For example, if I give you the following function, f(x) = -(x - 3)² + 4, 00:26:27.500 |
whose graph is very simple. It's just a parabola facing down. To maximize this 00:26:34.260 |
function means to find the value of the x variable such that the function, the y basically, 00:26:41.820 |
the y of this function is maximized. How to do that analytically? Well, we calculate the 00:26:48.380 |
derivative of this function here. We set the derivative equal to zero and we find the values 00:26:54.740 |
of x for which this derivative is zero. And that is also the value for which the function 00:27:00.860 |
will be maximized. So the derivative of this simple function is minus two x plus six. And 00:27:06.620 |
the value of x that makes this derivative zero is the value three, x equal to three, 00:27:11.980 |
which is also, as you can see in the graph, the value that maximizes the function. 00:27:17.780 |
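As a one-line recap of that worked example:

```latex
f(x) = -(x - 3)^2 + 4, \qquad f'(x) = -2(x - 3) = -2x + 6, \qquad f'(x) = 0 \;\Rightarrow\; x = 3, \quad f(3) = 4
```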
Now the objective function that we saw before, so this one, so in which we want to maximize 00:27:23.820 |
a reward, but at the same time, we want the language model to not be too much different 00:27:31.420 |
from the unaligned language model. So the language model that is not aligned through 00:27:37.100 |
reinforcement learning from human feedback or DPO, it is called a constrained optimization 00:27:42.980 |
problem because we want to maximize the reward, but at the same time, we want to put some 00:27:48.460 |
constraint on this objective function. We don't want the KL divergence to be too big. 00:27:55.620 |
We want it to be constrained in some limit. Now, the point is, okay, there are many techniques 00:28:04.580 |
for constrained optimization and we will not see them because there are university PhDs 00:28:09.260 |
on optimization. But one thing you may notice is that, okay, this one here, the 00:28:16.220 |
objective function, looks like a loss function. So why can't we just use, for example, gradient 00:28:21.900 |
descent to optimize this objective function here such that we can train our language model 00:28:28.940 |
to behave in a particular way to maximize this reward? Well, we could, but as you know, 00:28:35.460 |
in deep learning and especially with backpropagation, we need an objective function or a loss function 00:28:41.380 |
that is differentiable. The following objective function, however, is not differentiable. 00:28:47.340 |
Why? Because as you can see from the expression here, this is an expectation over all the prompts 00:28:53.900 |
in our dataset and then an output that is generated by the language model. Now, to generate the 00:29:01.380 |
output of the language model, as we saw before, we need to use an iterative process in which 00:29:05.700 |
we feed one token at a time into the prompt. We sample one token at a time from the language 00:29:12.460 |
model. We take this token and we put it back into the prompt, feed it again to the language 00:29:15.940 |
model, et cetera, and we use many strategies for selecting the next token. Sometimes we 00:29:20.580 |
use the greedy strategy. Sometimes we use the beam search. Sometimes we use the top 00:29:24.180 |
K, top P, et cetera, et cetera. Now, this sampling operation that we do on the language 00:29:29.340 |
model to sample the answer of the language model is not differentiable. That's why we 00:29:34.500 |
cannot run gradient descent to maximize this objective or to minimize the negative 00:29:40.140 |
objective in case we treat it as a loss. And that's why in reinforcement learning, we were 00:29:44.940 |
forced to use algorithms like PPO. Now, let's see how DPO handles this. In the DPO paper, 00:29:55.700 |
they start with a very simple introduction to the reinforcement learning objective. As 00:30:01.060 |
we saw before, the reinforcement learning objective is to select a policy that maximizes 00:30:09.860 |
the expected reward when using this policy, so the policy is the language model, and at 00:30:14.300 |
the same time puts a constraint on how much this policy can change during this training, 00:30:19.560 |
this optimization. And in the DPO paper, they say, okay, there is an exact solution to this 00:30:24.860 |
optimization problem, and it is the following. It is the equation four in the DPO paper. 00:30:30.660 |
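For reference, that exact solution (equation four in the DPO paper) is:

```latex
\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big),
\qquad
Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big)
```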
And by exact solution, I mean that there is an analytical solution to this constrained optimization 00:30:37.020 |
problem. Just like we had an analytical solution for the maximization problem of this parabola, 00:30:44.820 |
so we could find through the derivative and setting the derivative equal to zero, we could 00:30:49.140 |
find the value of x such that this function here is maximized. And for using the same 00:30:55.820 |
reasoning but different techniques, we also have an analytical solution for the constrained 00:31:03.140 |
optimization problem that we saw before, and this is the solution. Now, you may be wondering, 00:31:08.500 |
okay, great, we have an exact solution just like the parabola, so now we are all set, 00:31:13.780 |
right? Yes. The problem is we have an exact solution, but it's not easily computable, 00:31:21.060 |
so it's not easy to compute. So mathematically, it exists, it makes sense, but it's not easy 00:31:27.440 |
to compute. Why? Because we have this z of x term here. Now, this z of x term here, if 00:31:34.460 |
you look at how it's defined, it's the summation of all possible y's that are generated by 00:31:41.420 |
the reference model. So as you know, when we do reinforcement learning from human feedback 00:31:45.760 |
or DPO, we have two models. One is the language model that we are trying to optimize, and 00:31:51.380 |
one is the frozen model that we don't optimize, but we use it as a reference for the KL divergence. 00:31:56.740 |
So this is called the pi ref. So all the outputs generated by pi ref multiplied by the exponential 00:32:03.420 |
of the reward. Now, the problem is this summation is done over all possible y's. It means that 00:32:09.460 |
we need to sample all possible outputs from our language model, given all the prompts 00:32:18.700 |
that we have in our data set of preferences. Now, to generate all possible outputs is very, 00:32:24.340 |
very, very expensive. Imagine you need to generate, your language model can generate 00:32:29.300 |
2,000 tokens for each prompt. It means that, and you have a vocabulary size of 30,000, 00:32:36.500 |
it means that for the first position, you have 30,000 possibilities, for the second 00:32:41.420 |
position, you have 30,000 possibilities, for the third position, you have 30,000 possibilities, 00:32:45.580 |
and then you multiply all these possibilities. So it becomes a lot, a lot, a lot of outputs 00:32:50.900 |
that you need to generate to evaluate this Z of X term. So the analytical solution to 00:32:55.780 |
the constraint optimization problem that we saw before exists, but it's not easy to compute. 00:33:01.820 |
However, one thing is interesting from this expression. Imagine that somehow, magically, 00:33:08.340 |
we have access to an optimal policy. So this solution to the optimization problem allows 00:33:14.140 |
us to compute what is the optimal policy, given the optimal reward model and the reference 00:33:20.140 |
policy, so the reference language model. But imagine that for some reason, somehow magically, 00:33:25.180 |
we have access to this term here. So if we have this term here, we can compute 00:33:31.260 |
the optimal reward model with respect to the optimal policy. How? Well, we can just isolate 00:33:39.660 |
this R of X and Y term from this expression here. And it's very easy to compute because 00:33:44.820 |
we can apply the logarithm on the left and the right side of this expression. So let's 00:33:50.260 |
do it step by step. We can apply the log on the left side and on the right side of this 00:33:58.940 |
expression here. So this expression here, and we will get that the log of a product, 00:34:05.860 |
as you know, is the sum of the logs and the log of the ratio is the difference of the 00:34:11.260 |
logs. So this Z term is in the denominator. So it becomes a minus log of Z of X. This 00:34:17.700 |
one is in the numerator and this one is in the numerator. So they become sums of logs. 00:34:22.180 |
So this one plus this log here, then the log and the exponential can cancel out because 00:34:29.100 |
they're inverse functions. So this allows us to isolate this R of X, Y term with respect 00:34:35.300 |
to all the other terms, and we can write it like this. 00:34:41.120 |
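Written out, the inverted expression from the DPO paper is:

```latex
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
```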
So we can calculate R of X and Y with respect to an optimal policy that we think we have access to. We do not have access 00:34:47.060 |
to it, but we pretend we have access to it. Okay. So there are two things that we do not 00:34:52.500 |
have in this expression. We do not have the reward model, the optimal reward model, and 00:34:57.420 |
we do not have the optimal policy, but we pretend that we have the optimal policy. Why? 00:35:04.180 |
Let's see the next step that they do in the DPO paper. The next step is, they say, okay, 00:35:10.740 |
do you remember the Bradley Terry model? As you remember, the Bradley Terry model is our 00:35:15.420 |
reward model, right? Is the model that given a data set of preferences, allow us to compute 00:35:21.580 |
a numeric score, a numeric reward. Well, this Bradley Terry model is based on a reward that 00:35:30.820 |
we assign, right? So what if we plug the reward that we have computed from the constraint 00:35:38.860 |
optimization problem into the Bradley Terry model? Well, we can do it. So we have this 00:35:44.580 |
reward that we obtained from the constraint optimization problem by inverting the formula 00:35:49.660 |
and we plug it inside the Bradley Terry model. So if you remember, the Bradley Terry model 00:35:54.500 |
can also be written as a sigmoid and we prove it before in the previous slide. So what we 00:36:01.660 |
do is, okay, the Bradley Terry model can be written as a difference of rewards in the 00:36:08.140 |
sigmoid function. So if we plug the reward here, so the reward obtained by the constraint 00:36:17.300 |
optimization problem solution, we will see that the two Z of X terms, because this is 00:36:25.420 |
a difference of rewards, as you can see, if we plug here for the reward assigned to the 00:36:30.820 |
winning response, and here the reward assigned to the losing response, we have these two 00:36:37.700 |
Z of X terms, so plus beta log of Z of X and minus beta log of Z of X that will cancel 00:36:45.220 |
out because they are one the opposite of the other. This way we can obtain a formula that 00:36:53.340 |
does not contain the Z of X term and it's now computable. So basically, if we use the 00:37:00.740 |
loss of the Bradley Terry model, so as you remember, the Bradley Terry model is a model 00:37:07.260 |
that allow us to train a language model to model the reward, right? And if we use the 00:37:15.460 |
loss of the Bradley Terry model in which the reward is coming with respect to the optimal 00:37:23.900 |
policy, we can use it to optimize the policy to adhere implicitly to the reward model according 00:37:32.460 |
to the Bradley Terry model. And this is the whole idea of the DPO paper. So we can plug 00:37:39.660 |
the exact solution of the constraint optimization of the reinforcement learning objective, we 00:37:46.060 |
can invert it to get the reward, we plug it into the Bradley Terry model, because the 00:37:50.700 |
Bradley Terry model only depends on the difference of rewards assigned to the winning answer 00:37:57.940 |
and to the losing answer, the uncomputable term Z of X cancels out and then it becomes 00:38:04.960 |
computable. And now we can use it to train a language model that will act according to 00:38:11.140 |
the reward model of the Bradley Terry model, so to the preference model given by the Bradley 00:38:17.620 |
Terry model. So it will favor good responses and at the same time it will be less likely 00:38:27.140 |
to output the preferences that were not chosen. And at the same time it will put a constraint 00:38:32.900 |
onto the KL divergence, so at the same time it will put a restriction on how much the 00:38:38.020 |
language model can change with respect to the reference model, so the language model 00:38:43.320 |
that was not optimized with reinforcement learning from human feedback or DPO. So 00:38:49.460 |
basically with the DPO we are doing kind of the same thing that we are doing in reinforcement 00:38:56.300 |
learning but without using the reinforcement learning algorithms. So the goal in both of 00:39:02.220 |
them is the same, so we want to optimize a policy, we want to optimize a language model 00:39:07.340 |
to maximize a cumulative reward but at the same time we want to put a constraint on how 00:39:12.600 |
much it can change using the KL divergence. In the case of reinforcement learning from 00:39:16.940 |
human feedback we are using the PPO algorithm to optimize this objective, to optimize this 00:39:23.340 |
policy, but in the case of DPO we do not have to use any reinforcement learning 00:39:28.380 |
algorithm, because we found a loss that already implicitly maps this reward objective into 00:39:37.100 |
this loss. Let's see how to actually now compute the log probability, so how to actually use 00:39:47.500 |
this loss. First of all, let's look at the expression of this loss. This loss 00:39:51.900 |
says that if you have a data set of preferences, in which X is the prompt, YW is the chosen 00:40:00.740 |
answer and YL is the rejected answer (because, as you remember, this data set is made up of 00:40:06.240 |
preferences: a question, two answers, and then we asked some annotators to tell us which 00:40:13.700 |
answer they prefer), then we can run a gradient descent using this 00:40:19.860 |
data set over this loss. 00:40:26.620 |
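For reference, written out, the DPO loss being described is:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
```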
Now, to calculate this loss: the logarithm and the sigmoid are just functions that we can always calculate, 00:40:31.460 |
while the beta is a hyperparameter that indicates how much 00:40:36.040 |
we want the language model to change with respect to the reference language model or 00:40:40.140 |
how much we want to constrain it and then we have to compute these log probabilities, 00:40:45.180 |
so the log of the probability of generating this YW when the language model is prompted 00:40:53.300 |
with the prompt X, and also for the pi ref, so also for the language model that is not 00:41:00.140 |
being optimized by DPO. Let's see how to practically compute these log probabilities. So when you 00:41:09.080 |
run DPO it's very simple. Imagine, for example, you are using Hugging Face: it's just a 00:41:14.820 |
matter of using this class, so DPO trainer, in which you pass the language model that 00:41:19.780 |
you're optimizing, the frozen version of the language model that you don't want to optimize 00:41:23.940 |
but it's the reference language model that is used to compute the log probabilities to 00:41:29.500 |
calculate the KL divergence, then you can give some other training arguments, you can 00:41:35.340 |
find the list of them on the website of Hugging Face, and then this beta parameter, which indicates 00:41:39.900 |
the strength on how much you want the language model to change and also in the website of 00:41:46.540 |
Hugging Face they also give you the typical range for this hyperparameter. 00:41:53.820 |
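A minimal sketch of what that setup could look like with the TRL library is below. The exact argument names and defaults vary between TRL versions, and the "gpt2" checkpoint and the dataset path are placeholders, so check the Hugging Face documentation for the version you are using.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

# Placeholder model: the policy we optimize and a frozen copy used as the reference.
model = AutoModelForCausalLM.from_pretrained("gpt2")
model_ref = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# The preference dataset is expected to have "prompt", "chosen" and "rejected" columns.
train_dataset = load_dataset("json", data_files="preferences.json", split="train")  # placeholder path

training_args = TrainingArguments(output_dir="dpo-model", per_device_train_batch_size=2)

trainer = DPOTrainer(
    model,                   # the language model being optimized (pi theta)
    model_ref,               # the frozen reference model (pi ref)
    args=training_args,
    beta=0.1,                # strength of the KL constraint
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```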
Now, what will happen inside the DPO trainer, so inside the Hugging Face library, 00:42:00.220 |
when you use the DPO trainer, they compute this loss, so they calculate these log probabilities 00:42:05.580 |
you can see here, so for example this log probability you can see here, but how do they 00:42:10.100 |
actually compute it? Well, as you know, a language model, in most cases, is a transformer 00:42:21.020 |
model and to compute these log probabilities as you know they use a prompt, a question, 00:42:29.380 |
the answer that is chosen and the answer that is not chosen, so the winning and the losing 00:42:34.140 |
answer, suppose that we want to generate the log probabilities for the winning answer, 00:42:39.420 |
so this one, so we have a language model, we give it a prompt and the answer that was 00:42:44.220 |
generated and we want to calculate this log probabilities, what we can do is we can combine 00:42:49.940 |
the question and the answer in the same string, in the same input for the language model, 00:42:56.740 |
so imagine the question is where is Shanghai, question mark, and the answer is Shanghai 00:43:07.100 |
is in China, now let me use the laser, ok, we can feed all of this to our language model, 00:43:14.460 |
so the pi theta, language model is a transformer model, most of the cases, and it will generate 00:43:21.420 |
as you know the transformer model generates some hidden states, so it takes some input 00:43:26.900 |
which are embeddings, and it outputs some embeddings that are contextualized also according 00:43:34.140 |
to the self-attention mask, now if you don't know how this works, I highly recommend you 00:43:38.940 |
watch my previous video on the transformer in which I show the self-attention mechanism, 00:43:43.500 |
but basically the transformer model is a model that takes some embeddings and through the 00:43:47.780 |
self-attention mechanism output embeddings, then we take these embeddings and we can project 00:43:53.300 |
them into logits using a linear layer, and we can do that for all the tokens that we 00:43:58.580 |
give to the input, usually we only apply the linear layer to the last token when generating 00:44:03.940 |
the tokens because we are interested in generating the next token, but we can do it for all the 00:44:10.980 |
hidden states, we are not forced to only use the last one because each hidden state encapsulates 00:44:17.540 |
information about itself and all the tokens that come before it, so we can take this logits 00:44:25.580 |
and then we can also convert them into probabilities, but we do not want probabilities, we want 00:44:31.300 |
log probabilities because as you can see here we have this log function here, so instead 00:44:35.540 |
of applying the softmax we can apply the log softmax to each of these logits, now when 00:44:41.180 |
we apply the softmax, it will become a distribution over the entire vocabulary, one for each token 00:44:49.140 |
in the vocabulary, but we want only the probability corresponding to the token that was actually 00:44:55.620 |
chosen to generate this particular answer, and we also know which token it was because 00:45:00.740 |
we have the answer, so to the question where is Shanghai, we know what is the answer because 00:45:05.340 |
it's in our data set of preferences, so we know that the answer is Shanghai is in China, 00:45:10.740 |
so how to compute these log probabilities, so we can compute the log probabilities over 00:45:15.340 |
the entire dictionary, over the entire vocabulary, and then we only select the log probability 00:45:21.300 |
corresponding to the token that was actually selected in the answer, so for this question 00:45:27.940 |
for example, we select for example the last hidden state for the question, which corresponds 00:45:33.620 |
to what should be the next token, and we know what is the next token, the next token is 00:45:37.540 |
Shanghai, so we take the log probability only corresponding to Shanghai, for this prompt 00:45:43.100 |
here, so where is Shanghai, question mark Shanghai, we know that the next token should 00:45:47.580 |
be is, because it is already present, so we take the log probability only corresponding 00:45:53.140 |
to the token is, etc, etc, and we do it for all the tokens that are in the answer, and 00:45:59.140 |
this gives us all the log probabilities of the tokens of the answer for this given question, 00:46:06.660 |
and then we can sum them up, why we need to sum them up, because it's a log probabilities, 00:46:12.060 |
usually if they are probabilities we multiply them, but because they are log probabilities 00:46:17.260 |
we sum them up, because the logarithm transforms products into summations. 00:46:25.420 |
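Here is a simplified sketch of that computation in PyTorch. This is not the actual Hugging Face implementation, just the idea: concatenate prompt and answer, get per-token log probabilities, keep only those of the answer tokens, and sum them. The helper name is hypothetical, and it assumes the tokenization of the prompt is a prefix of the tokenization of prompt + answer (true for most tokenizers, but worth checking).

```python
import torch
import torch.nn.functional as F

def answer_log_prob(model, tokenizer, prompt: str, answer: str) -> torch.Tensor:
    """Sum of the log probabilities that `model` assigns to the answer tokens, given the prompt.

    Wrap the call in torch.no_grad() when evaluating the frozen reference model.
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids          # (1, P)
    full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids   # (1, L)

    logits = model(full_ids).logits                                        # (1, L, vocab_size)

    # The logits at position i predict the token at position i + 1, so shift by one.
    log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)                   # (1, L - 1, vocab_size)
    labels = full_ids[:, 1:]                                               # the next tokens, which we already know
    token_log_probs = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # (1, L - 1)

    # Mask out the positions that belong to the question: keep only the answer tokens.
    mask = torch.zeros_like(token_log_probs)
    mask[:, prompt_ids.shape[1] - 1:] = 1.0

    # Sum of log probabilities = log of the product of the probabilities.
    return (token_log_probs * mask).sum(dim=-1)
```

The DPO loss then combines four such sums per preference pair: the log probabilities of the chosen and the rejected answer, both under the policy being optimized and under the frozen reference model.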
And this is exactly what happens inside the HuggingFace library: to compute 00:46:30.860 |
the log probabilities, to calculate this loss, they actually do what I described, so they 00:46:35.460 |
take the logits, and they use the labels, what are the labels, just the tokens corresponding 00:46:41.380 |
to the answer, for example to the winning answer or to the losing answer, depending 00:46:46.000 |
on which term you are computing, this one or this one, and the model that we are using, 00:46:50.340 |
so the reference model or the model that we are trying to optimize, and then they select 00:46:55.820 |
here, in this line here, they check the log probabilities only corresponding to the labels, 00:47:03.120 |
so to the next token that we already know what is it, and then they sum them up, as 00:47:09.420 |
you can see here, and here they also apply a mask, because we don't want all the log 00:47:15.900 |
probabilities, but only the ones corresponding to the tokens that belong to the answer, 00:47:21.620 |
not to the ones that belong to the question. And this is how DPO works. Thank you guys 00:47:28.140 |
for watching my video, I hope you learned a lot, I tried to simplify as much as possible 00:47:33.340 |
the math of DPO, but the basic idea is that we want to remove reinforcement learning to 00:47:39.460 |
align language models, and this makes the treatment much simpler, because it just becomes a simple 00:47:45.020 |
loss in which you can run a gradient descent, and you don't have to worry about training 00:47:48.900 |
a separate reward model, which is something that we did in reinforcement learning from 00:47:53.040 |
human feedback, so if you watched my previous video, as you remember, the math is much harder 00:47:59.420 |
and there are many more topics to introduce on how to optimize the objective that we saw, and please 00:48:06.820 |
come back to my channel for more videos like this, I usually try to make videos that are 00:48:10.980 |
very deep, very in depth for every topic, sometimes they can be a little hard, but I 00:48:17.820 |
try to simplify as much as possible, also depending on my knowledge, and also depending 00:48:22.260 |
on how much it is possible to simplify a difficult topic, and if you have any questions, please 00:48:29.060 |
leave it in the comments, and I will probably keep publishing more videos like this, but 00:48:36.020 |
if you want videos that are more simpler, please let me know, and also let me know in 00:48:39.620 |
the comments what kind of topics you would like me to explore next, thank you guys and