Direct Preference Optimization (DPO) explained: Bradley-Terry model, log probabilities, math


Chapters

0:00 Introduction
2:10 Intro to Language Models
4:08 AI Alignment
5:11 Intro to RL
8:19 RL for Language Models
10:44 Reward model
13:07 The Bradley-Terry model
21:34 Optimization Objective
29:52 DPO: deriving its loss
41:05 Computing the log probabilities
47:27 Conclusion

Transcript

Hello guys, welcome back to my channel. Today we are gonna talk about DPO, which stands for Direct Preference Optimization. It's a new technique that came out in the middle of last year, in 2023, to align language models. Let's review the topics of today. I will start with a short introduction to language models, as usual, so we can review how language models work.

Then we will introduce the topic of AI alignment, so what we mean by AI alignment. And then we will review reinforcement learning. Now, you may be wondering why we are reviewing reinforcement learning if the whole point of DPO is to remove reinforcement learning from language models. Well, the reason is that, actually, even if DPO does not use reinforcement learning algorithms, they are still interconnected, especially when we talk about the reward model and the preference model.

So in order to understand the reward model and the preference model, we need to review reinforcement learning and how the reward model affected the process in reinforcement learning from human feedback. In the last part of the video, we will derive the DPO loss, so we will understand where it comes from.

I will also show you the code on how to compute the log probabilities, so how we can actually use this loss in practice. Now, what are the prerequisites for watching this video? Well, for sure you should be familiar with a little bit of probability and statistics, not much, for example, conditional probability.

You should also be familiar with deep learning, so what we mean by gradient descent and loss functions. It's really great if you have watched my previous video on reinforcement learning from human feedback, in which I explain all the aspects of the reward model, the reinforcement learning framework and PPO.

But it's not necessary for this video, because I will review most of the parts that are needed to understand DPO; still, it's really great if you have already watched that video so you can compare the two methods. You should also be familiar with the transformer model, because we will be using it in practice when we want to compute the log probabilities.

Otherwise, we won't know how to use the DPO loss in practice. Let's start our journey. So what is a language model? Well, a language model is a probabilistic model that assigns probabilities to sequences of words. In practice, given a prompt, a language model allows us, let me use the laser.

So given a prompt, for example, "Shanghai is a city in", a language model tells us the probability of what the next token or word may be. Now, in my videos, I always make the simplification that a token is a word and a word is a token. This is actually not the case in most language models, but it's useful for explanation purposes.

So what is the probability that the next token is China or the next token is Beijing or the next token is cat or pizza given a particular prompt? This is the only thing that a language model does. And the language model gives us this probability. Now, you may be wondering, how can we use this language model to generate a text?

Well, we do it with an iterative process. So we take a prompt, for example, "Where is Shanghai?". We give it to the language model. The language model will give us a list of probabilities over what the possible next word or token is. Suppose that we choose the token with the highest probability score.

So suppose it's "Shanghai". We take this token, we select it and we put it back into the prompt, and we ask the language model again: what is the next token? Then the language model again will give us a list of probabilities over what the possible next token is. We select the one that we think is the most relevant.

Usually we select the one that is most probable, and we put it back into the prompt and we ask the language model again, et cetera, et cetera, until we reach a specified number of generated tokens or we reach the end-of-sentence token, which is a special token. In this case, after generating four tokens, the language model will probably say "Shanghai is in China", which is the answer to our question.
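
To make this iterative process concrete, here is a minimal greedy-decoding sketch in PyTorch with the Hugging Face transformers library. The model name "gpt2" and the 4-token limit are just illustrative choices, not something from the video:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")   # illustrative model choice
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "Where is Shanghai?"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    for _ in range(4):                                   # stop after 4 generated tokens
        logits = model(input_ids).logits                 # (1, seq_len, vocab_size)
        next_token = logits[:, -1, :].argmax(dim=-1)     # greedy: most probable next token
        input_ids = torch.cat([input_ids, next_token.unsqueeze(-1)], dim=-1)
        if next_token.item() == tokenizer.eos_token_id:  # or stop at the end-of-sentence token
            break

    print(tokenizer.decode(input_ids[0]))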

What do we mean by AI alignment? Now, when we train a language model, we train it on a massive amount of data: for example, thousands of books, billions of web pages, the entire Wikipedia, et cetera. This gives the language model a vast knowledge to complete any prompt in a reasonable way.

However, this does not teach the language model to behave in a particular way. For example, the pre-training does not teach the language model to be polite or to not use any offensive language or to not use any racist expressions, et cetera, because the language model will just behave based on the data that it has seen.

If you feed it internet data, the language model will actually behave very, very badly. We need to kind of align the language model to a desired behavior. We don't want the language model to use any offensive language. We don't want it to be racist. We want the language model to be helpful to the user, so to answer questions like an assistant, et cetera, et cetera.

And this is the goal of AI alignment. Now let's talk about reinforcement learning. So reinforcement learning is an area of AI that is concerned with training intelligent agents to perform actions in an environment in order to maximize a reward that they receive from this environment. Let me show you with a very concrete example.

I usually always use my cat, Oleo, for examples, so let's talk about Oleo. Oleo is the agent in this case, in this reinforcement learning scenario, and he lives in a very simple world. Let's call it a grid world that is made of cells in which the cat's position is indicated by two coordinates, the X position and the Y position.

This can also be treated as the state of the agent, because every position corresponds to a particular state. When the agent is in a particular state, it can take some actions. In the case of the cat, it can go right, left, up or down. For every action that the agent takes, it will receive some reward from the environment.

It will for sure change its state to a new one. So for example, when the cat moves down, it will change to a new state, to a new position, and it will receive some reward according to a reward model that we specified. In my case, I have specified the following reward model.

So when the cat moves to an empty cell, it receives a reward of zero. If it moves towards the broom, it receives a reward of minus one. If somehow it arrives to the bathtub, it will receive a reward of minus 10 because my cat is very scared of water.

And if it arrives to the meat, which is the cat's dream, it will receive a reward of plus 100. Now, what dictates what action the agent will take given a particular state or position? Well, it is the policy. The policy indicates what is the probability of the next action among all the actions that are available that the agent can take given a particular state.
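
In symbols, that is:

    a_t \sim \pi(\cdot \mid s_t)

so the action at time step t is sampled from the distribution that the policy induces over the available actions, given the current state s_t.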

And we usually write it like this, so that the next action at time step t is distributed like the distribution induced by the policy according to the state the agent is in. Now, what is the goal in reinforcement learning? The goal in reinforcement learning is to select a policy or to optimize a policy in order for the agent to maximize the expected return when it acts according to this policy.

So imagine we have such an optimized policy. If the goal is to maximize the expected reward when acting according to this policy, then for sure we will have a policy that, on average, takes the cat to the meat, because that's one way to maximize the expected reward.

And for sure, it will be a policy that will allow us to minimize the chance of ending up in the water here or to the broom here. Now, you may be wondering, okay, the cat can be seen as a reinforcement learning agent, as a physical agent that takes some actions.

But what is the connection between reinforcement learning and language models? Well, as we saw before, in reinforcement learning, we have this thing called policy, in which given a state, the policy tells us what is the probability over the action space of the next action. So what possible next action we can take and the probability of each action.

This is also something similar to what we do with language models, because in language models, we also have some kind of state, which is our prompt. And we ask the language model to give us the probability of the next token, or we can consider it the next action that the language model can take.

And we want to reward this language model for selecting tokens in such a way that they end up generating good responses. And we don't want to reward the language model for selecting sequences of tokens that end up giving us bad responses. Now, imagine we are trying to train a language model that needs to act like an AI assistant.

So for sure we want the language model to be helpful, to answer questions in a meaningful way, so not just output garbage. We want the language model to not be racist or not use any offensive language. These are all good behaviors that we want from this language model.

So we may want to build a reward model that will favor good responses, for example, responses that actually answer the question asked by the user, and we will reward them with a high reward. And we give maybe zero reward or negative reward to all those answers that are not coherent with what we want.

So for example, if the language model generates dirty jokes or racist jokes, for example, we can give zero reward to those responses. So the language model acts also as a policy because the policy is something that, given a prompt, tells you what is the probability over the action space, or in this case, the probability over the token space.

So we want to optimize this policy. That is, we want to optimize the language model to maximize the expected return, or the expected reward, that it receives from our reward model. So we want to optimize our language model to generate good responses, because that's one way to obtain a high reward from the reward model.

Now you may be wondering, OK, but how to define the reward model for a language model? Well, one way would be, OK, we can have a list of questions and answers generated by the language model, and then we can give a numerical reward to each one of them. And then we can use some reinforcement learning algorithm to feed this reward model to the language model to optimize it.

But the problem is, what kind of reward can we give to each of these pairs of questions and answers? Because, for example, let's look at the first question. Where is Shanghai? The answer, suppose it is generated by the language model, is that Shanghai is a city in China. Now, in my opinion, this is a good response because it's short and up to the point.

But some other people may think that only the word China is enough, because the user just asked, where is Shanghai? So there is no need to repeat the word Shanghai. But someone else may think that this response is too short and the assistant should say, hello, I think that the answer to your question is: Shanghai is a city in China.

So different people will have different opinions on what reward to assign to this particular pair of question and answer, because we humans are not very good at finding a common ground for agreement. Fortunately, however, we are very good at comparing, and we will exploit this fact. So instead of building a data set made of questions, answers and rewards, because we do not know what kind of reward to assign, we will build a data set of questions and multiple answers.

And then we ask people to choose the answer that they like, according to some preferences that we have. For sure, we want to end up with a language model that is helpful, so we want the language model to give responses that are correct. And we want the language model to be polite, for example.

So for example, imagine we have a list of questions. And then we ask the language model by using, for example, a high temperature to generate multiple answers. And then we ask people to choose which one they like. In this case, for example, where is Shanghai? For sure, most people will choose the answer number one, because Shanghai is a city in China, it's the correct one.

For this question here, for example, most people will probably choose this answer here, even if it's very short, because the other one is probably wrong. So using a data set like this, we can actually train a model to transform a pair of question and answer into a numeric reward.

Let's see how it is done. If you have a pet, you probably know that to teach a particular behavior to your cat or to your dog, you need to use biscuits or some treats. So you ask the cat to do something. And if the cat does it, then you give it a treat.

It will reinforce this memory in your cat, and the next time the cat is more likely to do it, because it will remember that it received some treat. And so it will perform that action again, so it can probably receive another treat. This is exactly what we do in reinforcement learning.

We want to give some digital biscuits to our reinforcement learning agent so that it is more likely to perform that action or that series of actions again in order to receive more reward. However, the data set that we have built so far is made up of preferences. So we have a question, multiple answers, and then we ask people to choose which answer they like.

We need to convert this data set of preferences into a numeric score that we can give as a reward to our language model, so that it is more likely to choose the answer that was preferred by people and less likely to choose the answer that was not liked by the people, by our annotators.

And this can be done through a preference model. In DPO and also in reinforcement learning from human feedback, we make use of the Bradley-Terry model. So the Bradley-Terry model is a way of converting a data set of preferences into a numeric score called reward that is given for each pair of questions and answers.

Our goal is to train a model that, given a question and an answer, or a prompt and the generated text, gives a score that resembles the preferences that have been chosen by our annotators. This is the expression of the Bradley-Terry model. It is a model, meaning that we choose to model our preferences like this.

And actually it makes sense, because it is a probability. That's why, for example, we use exponentials: we want the probability to be non-negative. And it is the probability assigned to the correct preference, so the probability of choosing the correct answer over the wrong answer, the one that is chosen by the annotators over the one that was not chosen by the annotators.

Here, I call them the winner and the loser because in the DPO paper they also call them winner and loser. It is modeled like this: it is proportional to the exponential of the reward that was assigned to the winning answer. Now, how do we train a model to convert a dataset of preferences into a numeric reward?
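
For reference, this is the Bradley-Terry model written out:

    P(y_w \succ y_l \mid x) = \frac{\exp\big(r(x, y_w)\big)}{\exp\big(r(x, y_w)\big) + \exp\big(r(x, y_l)\big)}

where x is the prompt, y_w is the winning (chosen) answer, y_l is the losing answer and r(x, y) is the reward.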

We take this expression and we use maximum likelihood estimation. Now, it doesn't matter if you don't know what maximum likelihood estimation is. The point is we want to maximize the probability of assigning the correct ordering to our preferences. So we want to maximize the probability of choosing the correct answer over the wrong answer.

And suppose that we are maximizing this expression here. Let's see how we can derive the loss to maximize this expression here. If you look at the DPO paper, you will see that they go from the Bradley-Terry model, which is this one, directly to the loss here, but they don't show you the derivation.

So I will show you how to derive the loss that maximizes this probability here. The derivation is very simple actually. So first of all, as you can see in the loss, you can see this function here. It's a sigmoid function. The expression of the sigmoid function is this one and this is the graph of the sigmoid.

So the expression of the sigmoid function is one over one plus e to the power of minus x. The first step of the derivation is to prove that this fraction of exponentials, so an exponential divided by the sum of two exponentials, can be written as the sigmoid of a minus b.

So let me use the pen, I think it's easier. This part here, the reward assigned to the winning answer, we call it a, and the reward assigned to the losing answer, we call it b. So this expression can be written as e to the power of a divided by e to the power of a plus e to the power of b.

And we will prove that it can be written as the sigmoid of a minus b through the following step. So first we can divide, we take this expression, which is basically this one. We just replace the rewards with a and b because it makes it simpler to visualize. We divide the numerator and denominator by the same quantity, e to the power of a.

We can do it. Then, in the numerator, e to the power of a cancels out with e to the power of a and becomes a one. Then, in the denominator, we add and subtract one. We can do it because it's like adding zero. And then we collect the minus one.

So we don't change anything, we just put the parentheses; this is possible through the associative property. Then we take the common denominator for these two expressions and we arrive at this one. We can cancel e to the power of a with minus e to the power of a, so it becomes e to the power of b divided by e to the power of a, which, thanks to the properties of exponentials, can be written as e to the power of b minus a.

Then we can take a minus sign outside. And this expression here is exactly the expression of the sigmoid function you can see here. So it's one over one plus e to the power of minus something. So it becomes the sigmoid of that something here a minus b. And this is exactly the loss that you see here.
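
The slides follow a slightly longer route, but the key identity being proved is simply:

    \frac{e^{a}}{e^{a} + e^{b}}
    = \frac{1}{1 + e^{b}/e^{a}}
    = \frac{1}{1 + e^{b - a}}
    = \frac{1}{1 + e^{-(a - b)}}
    = \sigma(a - b)

with a = r(x, y_w) and b = r(x, y_l).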

So it is the sigmoid of the reward assigned to the winning answer minus the reward assigned to the losing answer. Here we also see a log because usually we do not model the probability directly but we model the log probability. So we have also this log because we want to model the log probabilities.

It is something that we can do because the logarithm is a monotonic function. And also you may be wondering why do we have this minus sign here? This is because we want to maximize this expression. But as you know in deep learning frameworks like PyTorch, we have an optimizer that is always minimizing a loss.

So instead of maximizing something we can minimize the negative expression of the objective function which is the same thing. So basically we take this loss function and if we apply it to a reward model which is a neural network, it will be trained to maximize the probability of giving the correct ordering to our preferences which can only happen when it assigns a high reward to the winning answer and a low reward to the losing answer.

Because if you look at this expression here, as you can see, the probability is maximized when the reward in the numerator, so the reward assigned to the winning answer, is higher than the one assigned to the losing answer. And if you are wondering how to read an expression like this, let me cancel the drawing, because we will use this kind of convention a lot.

This one. It basically means that we have a data set of preferences, where each sample is a prompt, a winning answer and a losing answer, and we train a model with gradient descent: for each of these preferences we calculate this loss here, this expression here.

And if we minimize this loss with gradient descent, we will have a neural network that is trained according to the Bradley-Terry model, basically. Okay, now we have built a reward model, which means that we have a model that, given a question and an answer, can assign a numeric reward: a high reward if the response looks good according to the behavior that we want from our language model, and a low reward if it looks bad.
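
As a minimal sketch (my own illustration, not code from the video), this is what that loss looks like in PyTorch, assuming the reward model has already produced a scalar reward for the winning and the losing answer of each preference pair (the names reward_winner and reward_loser are hypothetical):

    import torch
    import torch.nn.functional as F

    def reward_model_loss(reward_winner: torch.Tensor, reward_loser: torch.Tensor) -> torch.Tensor:
        # Negative log-likelihood of the Bradley-Terry model:
        # maximizing log sigmoid(r_w - r_l) is the same as minimizing -log sigmoid(r_w - r_l).
        return -F.logsigmoid(reward_winner - reward_loser).mean()

Minimizing this pushes the reward of the winning answer above the reward of the losing answer, which is exactly the ordering we want the reward model to learn.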

Now we can train our language model. So as you recall, what is the goal in reinforcement learning? In reinforcement learning, the goal is to optimize a language model, which is also the policy of our reinforcement learning agent, in order to maximize the cumulative reward when the agent acts according to this policy.

In other words, if we, let me use the pen, so let's ignore for now this green part here. Suppose there is no green part here. So this doesn't exist. Imagine we have a language model, let's call it Pi Theta because it's a policy and we want to optimize this policy.

So we want to optimize this language model in order to maximize the reward that it receives from the reward model. It means that the language model will generate answers that get a good reward, and how do they get a good reward? By looking good: for example, they are not racist.

They do not use any sexual jokes and they actually answer the question that was asked. This is the goal in reinforcement learning from human feedback, for example, and that's why it's called reinforcement learning from human feedback. Now, if we use an objective like this, where we only want to maximize the reward, then the language model may become greedy and just output garbage that gives it a good reward.

So imagine we have a reward model that rewards the language model for being polite. The language model may just start saying a list of "thank you, thank you, thank you" or "please, please, please" and a lot of "please" or a lot of "thank you's" just to get high reward.

Because probably the words "thank you" and "please" are highly rewarded by the reward model. But we don't want the language model to just output garbage to get reward. We want the language model to also output something that is in line with its training data, so with its pre-training, but we want to change it a little bit so that it also acts according to our reward model, so to our data set of preferences.

So it is more polite, but without forgetting what it has learned from the pre-training. And this is why we add this KL divergence in the objective. So let me use the pen again. So we change the objective a little bit. So we want the language model to maximize the reward it gets from the reward model.

But at the same time, we add a constraint to the language model through a KL divergence. Now the KL divergence can be thought of as a distance metric. It is not a distance metric, but can be thought of as a distance metric between two distributions in which we have a pre-trained model.

So a language model that was not fine-tuned through reinforcement learning from human feedback or DPO. So it's just a language model that has been pre-trained on the Wikipedia, on the books, and on the internet web pages. And then we have the language model that we are optimizing. So this pi theta, and we want them to be very similar.

So we want the language model to not change much compared to what it was before the reinforcement learning from human feedback or before the DPO training. And this is why we add this KL divergence. So we want the language model to maximize its reward, but at the same time, not forget or not change too much its output in getting this reward.
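
Written out, this is the objective (the same form that appears in the DPO paper):

    \max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)} \big[ r(x, y) \big]
    \;-\; \beta \, \mathbb{D}_{\mathrm{KL}}\!\big[ \pi_\theta(y \mid x) \,\big\|\, \pi_{\mathrm{ref}}(y \mid x) \big]

where \pi_\theta is the language model being optimized, \pi_{\mathrm{ref}} is the frozen pre-trained (reference) model, and \beta controls how strong the KL constraint is.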

Now we know the reinforcement learning objective, which is basically also the same objective that we have in DPO, because in DPO we also want to train a language model that maximizes a reward but at the same time does not forget its training data. Let's look at what it means to actually maximize an objective function, because this is an objective function and we want to maximize it.

But what does it mean to maximize an objective? Let's see. Maximizing a function means finding the values of some variable such that the value of the function is maximized. For example, if I give you the following function, f(x) = -(x - 3)^2 + 4, whose graph is very simple.

It's just a parabola facing down. To maximize this function means to find the value of the x variable such that the function, the y of this function basically, is maximized. How do we do that analytically? Well, we calculate the derivative of this function, we set the derivative equal to zero and we find the values of x for which this derivative is zero.

And that is also the value for which the function will be maximized. So the derivative of this simple function is minus two x plus six, and the value of x that makes this derivative zero is x equal to three, which, as you can see in the graph, is also the value that maximizes the function.
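
Step by step:

    f(x) = -(x - 3)^2 + 4 = -x^2 + 6x - 5
    f'(x) = -2x + 6
    f'(x) = 0 \;\Rightarrow\; x = 3, \qquad f(3) = 4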

Now, the objective function that we saw before, the one in which we want to maximize a reward but at the same time we want the language model to not be too different from the unaligned language model, so the language model that has not been aligned through reinforcement learning from human feedback or DPO, defines what is called a constrained optimization problem, because we want to maximize the reward but at the same time we want to put some constraint on this objective function.

We don't want the KL divergence to be too big, we want it to be constrained within some limit. Now, the point is, there are many techniques for constrained optimization and we will not see them, because there are entire PhD programs on optimization, but one thing you may notice is that this objective function here looks like a loss function.

So why can't we just use, for example, gradient descent to optimize this objective function, so that we can train our language model to behave in a particular way and maximize this reward? Well, we could, but as you know, in deep learning, and especially with backpropagation, we need an objective function or a loss function that is differentiable.

This objective function is not differentiable. Why? Because, as you can see from the expression here, this is an expectation over all the prompts in our dataset and over the outputs that are generated by the language model. Now, to generate the output of the language model, as we saw before, we need to use an iterative process in which we feed one token at a time into the prompt.

We sample one token at a time from the language model. We take this token and we put it back into the prompt, feed it again to the language model, et cetera, and we use many strategies for selecting the next token. Sometimes we use the greedy strategy. Sometimes we use the beam search.

Sometimes we use top-k, top-p, et cetera, et cetera. Now, this sampling operation that we do on the language model to sample its answer is not differentiable. That's why we cannot just run gradient descent to maximize this objective, or to minimize the negative objective in case we treat it as a loss.

And that's why in reinforcement learning, we were forced to use algorithms like PPO. Now, let's see how DPO handles this. In the DPO paper, they start with a very simple introduction to the reinforcement learning objective. As we saw before, the reinforcement learning objective is to select a policy that maximizes the expected reward when using this policy, so the policy is the language model, and at the same time puts a constraint on how much this policy can change during this training, this optimization.

And in the DPO paper, they say: okay, there is an exact solution to this optimization problem, and it is the following. It is equation 4 in the DPO paper. By exact solution, I mean that there is an analytical solution to this constrained optimization problem, just like we had an analytical solution for the maximization problem of the parabola, where, by taking the derivative and setting it equal to zero, we could find the value of x such that the function is maximized.

Using the same reasoning but different techniques, we also have an analytical solution for the constrained optimization problem that we saw before, and this is the solution. Now, you may be wondering: okay, great, we have an exact solution just like for the parabola, so now we are all set, right?
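
This is the exact solution (equation 4 in the DPO paper):

    \pi^*(y \mid x) = \frac{1}{Z(x)} \, \pi_{\mathrm{ref}}(y \mid x) \exp\!\Big( \frac{1}{\beta} r(x, y) \Big),
    \qquad
    Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x) \exp\!\Big( \frac{1}{\beta} r(x, y) \Big)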

Yes. The problem is that we have an exact solution, but it's not easily computable. Mathematically it exists, it makes sense, but it's not easy to compute. Why? Because we have this Z of x term here. Now, this Z of x term, if you look at how it's defined, is a summation over all possible y's that can be generated by the reference model.

So as you know, when we do reinforcement learning from human feedback or DPO, we have two models: one is the language model that we are trying to optimize, and one is the frozen model that we don't optimize but use as a reference for the KL divergence. This is called pi ref.

So it sums, over all the outputs generated by pi ref, the probability of each output multiplied by the exponential of the reward. Now, the problem is that this summation is done over all possible y's. It means that we would need to sample all possible outputs from our language model, given all the prompts that we have in our data set of preferences.

Now, generating all possible outputs is very, very expensive. Imagine your language model can generate 2,000 tokens for each prompt and you have a vocabulary size of 30,000. It means that for the first position you have 30,000 possibilities, for the second position you have 30,000 possibilities, for the third position you have 30,000 possibilities, and then you multiply all these possibilities together.
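
To get a sense of the scale (my own back-of-the-envelope calculation with those illustrative numbers):

    30{,}000^{2000} = 10^{2000 \log_{10} 30{,}000} \approx 10^{8954}

possible outputs, which is obviously impossible to enumerate.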

So it becomes a lot, a lot of outputs that you need to generate to evaluate this Z of x term. So the analytical solution to the constrained optimization problem that we saw before exists, but it's not easy to compute. However, one thing is interesting about this expression.

Imagine that somehow, magically, we have access to an optimal policy. This solution to the optimization problem allows us to compute the optimal policy, given the optimal reward model and the reference policy, so the reference language model. But imagine that, for some reason, somehow magically, we have access to this term here.

So if we have this term here, we can compute the optimal reward model with respect to the optimal policy. How? Well, we can just isolate this R of X and Y term from this expression here. And it's very easy to compute because we can apply the logarithm on the left and the right side of this expression.

So let's do it step by step. We can apply the log on the left side and on the right side of this expression here. So this expression here, and we will get that the log of a product, as you know, is the sum of the logs and the log of the ratio is the difference of the logs.

So this Z term is in the denominator. So it becomes a minus log of Z of X. This one is in the numerator and this one is in the numerator. So they become sums of logs. So this one plus this log here, then the log and the exponential can cancel out because they're inverse functions.

So this allows us to isolate this R of X and Y term with respect to all the other terms, and we can write it like this. So we can calculate R of X and Y with respect to an optimal policy that we pretend we have access to. We do not actually have access to it, but we pretend that we do.
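
Taking the log of both sides of the solution and isolating the reward gives (equation 5 in the DPO paper):

    r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)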

Okay. So there are two things that we do not have in this expression. We do not have the reward model, the optimal reward model, and we do not have the optimal policy, but we pretend that we have the optimal policy. Why? Let's see the next step that they do in the DPO paper.

The next step is, they say: okay, do you remember the Bradley-Terry model? As you remember, the Bradley-Terry model is our reward model, right? It is the model that, given a data set of preferences, allows us to compute a numeric score, a numeric reward. Well, this Bradley-Terry model is based on a reward that we assign, right?

So what if we plug the reward that we have computed from the constrained optimization problem into the Bradley-Terry model? Well, we can do it. We take this reward that we obtained from the constrained optimization problem by inverting the formula, and we plug it inside the Bradley-Terry model.

If you remember, the Bradley-Terry model can also be written as a sigmoid, and we proved it before in a previous slide. So what we do is this: the Bradley-Terry model can be written as a sigmoid of a difference of rewards. If we plug in the reward obtained from the solution of the constrained optimization problem, here for the reward assigned to the winning response and here for the reward assigned to the losing response, then, because this is a difference of rewards, we get two Z of x terms, a plus beta log of Z of x and a minus beta log of Z of x, that cancel out because they are one the opposite of the other.

This way we obtain a formula that does not contain the Z of x term, and it is now computable. So basically, we use the loss of the Bradley-Terry model, which, as you remember, is the model that allows us to train a model to learn the reward, right?

And if we use the loss of the Bradley-Terry model in which the reward is expressed with respect to the optimal policy, we can use it to optimize the policy so that it implicitly adheres to the reward model defined by the Bradley-Terry model. And this is the whole idea of the DPO paper.

So we take the exact solution of the constrained optimization of the reinforcement learning objective, we invert it to get the reward, and we plug it into the Bradley-Terry model. Because the Bradley-Terry model only depends on the difference between the rewards assigned to the winning answer and to the losing answer, the uncomputable term Z of x cancels out, and then the loss becomes computable.
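
The result is the DPO loss (equation 7 in the DPO paper):

    \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
    = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]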

And now we can use it to train a language model that will act according to the reward model of the Bradley-Terry model, so to the preference model given by the Bradley-Terry model. It will favor the chosen responses and at the same time it will be less likely to output the responses that were not chosen.

And at the same time it will put a constraint on the KL divergence, so it will put a restriction on how much the language model can change with respect to the reference model, so the language model that was not optimized with reinforcement learning from human feedback or DPO.

So basically, with DPO we are doing kind of the same thing that we do in reinforcement learning from human feedback, but without using reinforcement learning algorithms. The goal in both of them is the same: we want to optimize a policy, so we want to optimize a language model to maximize a cumulative reward, but at the same time we want to put a constraint on how much it can change, using the KL divergence.

In the case of reinforcement learning from human feedback we use the PPO algorithm to optimize this objective, to optimize this policy, but in the case of DPO we do not have to use reinforcement learning algorithms, because we have found a loss that already implicitly captures this reward objective.

Let's now see how to actually compute the log probabilities, so how to actually use this loss. First of all, let's look at the expression of this loss. This loss says that we have a data set of preferences in which x is the prompt, y_w is the chosen answer and y_l is the answer that was not chosen, because, as you remember, this data set is made up of preferences: a question and two answers, where we asked some annotators to tell us which answer they prefer.

So if we have this data set, we can run gradient descent using this data set over this loss here. Now, to calculate this loss: the logarithm we can always calculate, it's just a function; the sigmoid is a function we can calculate; the beta is a hyperparameter that indicates how much we want the language model to change with respect to the reference language model, or how much we want to constrain it. Then we have to compute these log probabilities, so the log of the probability of generating this y_w when the language model is prompted with the prompt x, and also the same for pi ref, so for the language model that is not being optimized by DPO.

Let's see how to practically compute these log probabilities. When you run DPO it's very simple: imagine, for example, that you are using Hugging Face. It's just a matter of using this class, DPOTrainer, in which you pass the language model that you're optimizing, the frozen version of the language model that you don't want to optimize (the reference language model that is used to compute the log probabilities for the KL divergence), then some other training arguments (you can find the list of them on the Hugging Face website), and then this beta parameter, which indicates how strongly you want to constrain how much the language model can change. On the Hugging Face website they also give you the typical range for this hyperparameter.
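
As a rough sketch of what that setup looks like with the Hugging Face TRL library (the model name, the toy dataset and the argument names are illustrative, and the exact API may differ between TRL versions, for example newer versions move beta into a DPOConfig):

    from datasets import Dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
    from trl import DPOTrainer

    model = AutoModelForCausalLM.from_pretrained("gpt2")       # pi theta, the policy being optimized
    model_ref = AutoModelForCausalLM.from_pretrained("gpt2")   # pi ref, the frozen reference model
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token                  # gpt2 has no pad token by default

    # A toy preference dataset with the columns DPOTrainer expects: prompt, chosen, rejected.
    train_dataset = Dataset.from_dict({
        "prompt":   ["Where is Shanghai?"],
        "chosen":   ["Shanghai is a city in China."],
        "rejected": ["Shanghai is a city in Japan."],
    })

    training_args = TrainingArguments(output_dir="dpo-output", per_device_train_batch_size=1)

    trainer = DPOTrainer(
        model,
        model_ref,
        args=training_args,
        beta=0.1,                  # strength of the KL constraint
        train_dataset=train_dataset,
        tokenizer=tokenizer,
    )
    trainer.train()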

Now, what happens inside the Hugging Face library when you use the DPOTrainer? They compute this loss, so they calculate these log probabilities you can see here, for example this log probability here. But how do they actually compute it? As you know, a language model in most cases is a transformer model, and to compute these log probabilities they use a prompt, so a question, the answer that was chosen and the answer that was not chosen, so the winning and the losing answer. Suppose that we want to compute the log probabilities for the winning answer, so this one. We have a language model, we give it a prompt and the answer that was generated, and we want to calculate these log probabilities. What we can do is combine the question and the answer in the same string, in the same input for the language model. So imagine the question is "Where is Shanghai?" and the answer is "Shanghai is in China".

Now let me use the laser. We can feed all of this to our language model, the pi theta. The language model is a transformer model in most cases, and it will generate, as you know, some hidden states: it takes some inputs, which are embeddings, and it outputs some embeddings that are contextualized according to the self-attention mechanism. If you don't know how this works, I highly recommend you watch my previous video on the transformer, in which I show the self-attention mechanism. But basically, the transformer is a model that takes some embeddings and, through the self-attention mechanism, outputs embeddings.

Then we take these output embeddings and we can project them into logits using a linear layer, and we can do that for all the tokens that we gave as input. Usually we only apply the linear layer to the last token when generating tokens, because we are interested in generating the next token, but we can do it for all the hidden states. We are not forced to use only the last one, because each hidden state encapsulates information about itself and all the tokens that come before it.

So we can take these logits and we can convert them into probabilities, but we do not want probabilities, we want log probabilities, because, as you can see here, we have this log function here. So instead of applying the softmax, we can apply the log softmax to each of these logits. When we apply it, we get one value for each token in the entire vocabulary, but we want only the log probability corresponding to the token that was actually chosen to generate this particular answer. And we also know which token it was, because we have the answer: to the question "Where is Shanghai?", we know what the answer is because it's in our data set of preferences, so we know that the answer is "Shanghai is in China".

So how do we compute these log probabilities? We compute the log probabilities over the entire vocabulary, and then we select only the log probability corresponding to the token that was actually selected in the answer. For this question, for example, we select the last hidden state of the question, which corresponds to what should be the next token, and we know what the next token is: the next token is "Shanghai", so we take only the log probability corresponding to "Shanghai". For this prompt here, so "Where is Shanghai? Shanghai", we know that the next token should be "is", because it is already present, so we take only the log probability corresponding to the token "is", et cetera, et cetera, and we do it for all the tokens that are in the answer.

This gives us all the log probabilities of the tokens of the answer for this given question, and then we sum them up. Why do we need to sum them up? Because they are log probabilities: if they were probabilities we would multiply them, but because they are log probabilities we sum them, since the logarithm transforms products into summations.

And this is exactly what happens inside the Hugging Face library. To compute the log probabilities and calculate this loss, they do what I just described: they take the logits and they use the labels. What are the labels? Just the tokens corresponding to the answer, for example to the winning answer or to the losing answer, depending on which term you are computing, this one or this one, and on which model you are using, the reference model or the model that we are trying to optimize. Then, in this line here, they select the log probabilities corresponding only to the labels, so to the next tokens that we already know, and then they sum them up, as you can see here. And here they also apply a mask, because we don't want all the log probabilities, but only the ones corresponding to the tokens that belong to the answer, not to the ones that belong to the question.

And this is how DPO works. Thank you guys for watching my video. I hope you learned a lot. I tried to simplify the math of DPO as much as possible, but the basic idea is that we want to remove reinforcement learning from the alignment of language models. This makes the treatment much simpler, because it just becomes a simple loss on which you can run gradient descent, and you don't have to worry about training a separate reward model, which is something that we did in reinforcement learning from human feedback. If you watched my previous video, as you remember, the math there is much harder and there are many more topics to introduce in order to optimize the objective that we saw.

Please come back to my channel for more videos like this. I usually try to make videos that are very in-depth for every topic. Sometimes they can be a little hard, but I try to simplify as much as possible, also depending on my knowledge and on how much it is possible to simplify a difficult topic. If you have any questions, please leave them in the comments. I will probably keep publishing more videos like this, but if you want simpler videos, please let me know, and also let me know in the comments what kind of topics you would like me to explore next. Thank you guys and have a nice day!
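
Here is a simplified sketch of that computation (my own illustration, not the exact Hugging Face code). It assumes input_ids contains the prompt concatenated with the answer, labels is a copy of input_ids, and answer_mask is a hypothetical mask that is 1 on the answer tokens and 0 on the prompt and padding tokens:

    import torch
    import torch.nn.functional as F

    def sequence_log_prob(model, input_ids, labels, answer_mask):
        # Logits at position t predict the token at position t + 1, so we shift by one.
        logits = model(input_ids).logits[:, :-1, :]        # (batch, seq_len - 1, vocab_size)
        targets = labels[:, 1:]                            # the tokens we already know
        mask = answer_mask[:, 1:].float()                  # keep only the answer positions

        log_probs = F.log_softmax(logits, dim=-1)          # log probabilities over the vocabulary
        token_log_probs = torch.gather(                    # pick the log prob of the actual next token
            log_probs, dim=2, index=targets.unsqueeze(-1)
        ).squeeze(-1)

        return (token_log_probs * mask).sum(dim=-1)        # sum of log probs = log prob of the answer

    def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
        # Each argument is a sum of log probabilities computed with sequence_log_prob
        # on the chosen / rejected answers, under the policy or the reference model.
        chosen_logratio = policy_chosen - ref_chosen
        rejected_logratio = policy_rejected - ref_rejected
        return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()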