ML Interpretability: feature visualization, adversarial examples, interpretability for language models
00:00:00.000 |
Hello guys, welcome back to my channel. Today we are going to talk about machine learning 00:00:03.920 |
interpretability. Let's review the topics of today. I will be starting by introducing 00:00:09.840 |
what is machine learning interpretability. Then we will review deep learning and back 00:00:14.320 |
propagation because they are needed for us to understand the rest of the videos and the 00:00:18.080 |
topics. Then we will see a nice little trick, so how to trick a classifier. So imagine that 00:00:23.560 |
you have a classification neural network, like for example a convolutional neural network 00:00:28.400 |
that can classify pictures into classes. For example it will tell you that the picture 00:00:32.600 |
of a dog is a dog, the picture of a person is a person, etc. Our goal is, without touching 00:00:38.560 |
anything of this model, so without touching its weights, without touching its parameters, 00:00:42.760 |
its structure or anything related to the model, we want any classifier to be tricked into 00:00:48.040 |
believing that for example the picture of a dog is actually a person or the picture 00:00:51.620 |
of a person is actually a dog. And we will see how we can trick any classifier of our 00:00:56.800 |
choice. Later I will be introducing an interpretability engine, so a library that allows us to make 00:01:03.240 |
vision models more interpretable. We will explore the topic of feature visualization 00:01:08.920 |
which is very important for interpretability. And finally we will apply the techniques that 00:01:13.480 |
we have learned to language models, so how to make language models more interpretable. 00:01:18.680 |
What are the prerequisites for watching this video? Well for sure that we have a little 00:01:22.160 |
background in calculus, I think knowing what are derivatives and how to calculate them 00:01:26.600 |
is enough. And also of course that you have a background in deep learning, so you know 00:01:32.420 |
what is a loss function or what is the softmax function for example. So let's start our journey. 00:01:39.760 |
What is machine learning interpretability? In 2016 there was a fatal accident between 00:01:46.240 |
a Tesla car driver and a truck. And as reported by the Guardian, we can see that the car sensor 00:01:54.920 |
system against a bright spring sky failed to distinguish a large white 18 wheel truck 00:02:01.280 |
and trailer crossing the highway. So basically the car was going and it failed to recognize 00:02:06.800 |
this obstacle which was the truck. And the car just attempted to drive full speed under 00:02:12.360 |
this truck. Of course resulting in the crash and it was unfortunately fatal. Now I don't 00:02:18.480 |
want to say that it was Tesla's fault or it was the software's fault, I don't have enough 00:02:22.320 |
information for that. So let's make a hypothetical case like you are creating a self-driving 00:02:27.080 |
company and you want to deploy your car, self-driving car. How do you make sure that the car can 00:02:35.620 |
recognize any obstacle? How do you know what your model has learned? 00:02:50.160 |
For example imagine that you have a model that allows you to segment the obstacles on 00:02:54.520 |
the road. The first question that you want to answer is what did my model learn? So how 00:02:59.920 |
does my model recognize a person? Does it recognize a person by its shape or does it 00:03:08.360 |
recognize a person by its shoes or by the color of the clothes etc? Knowing this is 00:03:16.320 |
important. Why? Because this allows us to understand what could be a failure mode of 00:03:21.600 |
our model. Because if our model is only looking at the color of the clothes for example to 00:03:26.760 |
recognize a person, so only looking at the clothes and not at the face for example, then 00:03:32.500 |
if one day the model sees a person that is wearing strange clothes, something that 00:03:37.800 |
the model has never seen, the model may fail to recognize that person as an obstacle. So 00:03:43.960 |
this is very important. So the second question that we want to answer is what features or 00:03:49.240 |
patterns from the input make the model generate certain outputs? And this is very important 00:03:55.200 |
for example for language models. So imagine our language model is cursing, and we want 00:03:59.640 |
to understand which tokens in the input are being used by the model to generate that kind 00:04:06.120 |
of output. Knowing how a model thinks, and pardon me for this word because it's not quite right, 00:04:12.320 |
so let's rather say knowing how a model makes its predictions, allows us to debug 00:04:18.840 |
and fine tune the model, which means that during training we can understand why our 00:04:23.360 |
model is not learning something that we want it to learn, or how should we change our hyperparameters 00:04:29.880 |
that will affect the learning of our model. We can identify failure modes before deployment, 00:04:36.240 |
which means that we can understand what are the things that are more likely to make my 00:04:42.280 |
model fail when deployed in production. It can increase trust because we can demonstrate 00:04:48.520 |
that our model is well trained and so people will trust it and this is especially for some 00:04:54.640 |
scenarios like self-driving cars. And also we can discover novel insights from the data 00:04:59.320 |
because sometimes models learn something that we as humans did not see, and this is very 00:05:03.960 |
important when the models learn patterns, for example in image models, that humans did 00:05:09.720 |
not see. This is for example in the healthcare sector, imagine that you are training a model 00:05:14.320 |
that can recognize cancer cells from non-cancer cells, and we realize that the model is performing 00:05:20.560 |
well and we realize that the model is looking at some parts of the cell that we as humans 00:05:24.840 |
didn't think of checking before, that is actually a good predictor for the cell being cancerous 00:05:29.880 |
or not cancerous, so even the model can teach us something that we did not know before. When 00:05:36.360 |
we define a linear layer in PyTorch it gets converted into a computation graph, for example 00:05:42.280 |
in this case we have an input that is made up of two features, which means that the input 00:05:46.400 |
is a vector made up of X1 and X2. For example in this case we have two linear layers, one 00:05:52.260 |
that converts two features into two features, so it's taking two features as input and it's 00:05:56.960 |
converting it into two features as output, and we can see it as a layer made up of two 00:06:03.960 |
neurons, each neuron doing a weighted sum of the input features, each 00:06:11.900 |
input feature multiplied by its own weight, so X1 is multiplied in this case by W11, and 00:06:17.520 |
X2 is multiplied by W12, and then it performs the sum, plus a bias, and then we have the 00:06:23.920 |
application of a non-linear activation, usually the ReLU function. Then we have another linear 00:06:31.280 |
layer that is going from two features to one feature that will produce our output. In this 00:06:36.520 |
case we are trying to model a very simple neural network that takes as input two features 00:06:41.700 |
that represent features of a house, for example the number of bedrooms and the number of bathrooms, 00:06:47.640 |
and wants to predict a price for this house, so only one output. This is a very simple 00:06:53.520 |
regression task and we can train it by having a training data with two input features and 00:06:58.880 |
one target. PyTorch will convert this neural network into a computation graph. What does 00:07:05.920 |
it mean? It means that each node will become an operation that is performed on the input 00:07:12.600 |
subsequently to arrive to the final output, which is the price of the house. In this case 00:07:17.880 |
this is actually a simplified version of the computation graph. The computation graph usually 00:07:21.600 |
is made up of more nodes than the one you see here, because each kind of single operation 00:07:26.680 |
is a node. In this case we can see that X1 and X2 are multiplied by some weights in the 00:07:33.520 |
first neuron and then we sum up a bias, we apply the ReLU and this becomes the input 00:07:38.400 |
for the neuron at the next layer. So we take the two outputs at the previous layer, we 00:07:44.400 |
multiply them by some weights, as you can see here W31 and W32, we add a bias and this 00:07:50.360 |
becomes our output. How do we train such a network? Well, usually we have a dataset made 00:07:58.620 |
up of inputs and output pairs, or input and label pairs. The input represents the features 00:08:05.880 |
of a house and the output the corresponding price, the price of this house. And our goal 00:08:10.960 |
is to train the neural network to minimize a certain loss function that we can choose. 00:08:16.800 |
For this regression task an ideal loss function could be the mean squared error, because we 00:08:22.160 |
want to minimize the error that the model makes on the final price. Our hope is that 00:08:27.880 |
the neural network not only learns the data that it has seen during training, so not only 00:08:32.840 |
it can predict correctly the price of the houses that it has seen during training, but 00:08:36.920 |
it also learns some kind of pattern that can generalize to unseen inputs. So how do we 00:08:43.200 |
proceed practically? We choose a loss function and we choose, for example, the mean squared 00:08:48.320 |
error in this case. We run an input through the neural network. So we take the input, 00:08:53.640 |
we run it through this neural network, which is a feedforward neural network, which means 00:08:57.440 |
that each output becomes the input of the next layer. So we run our input, it will produce 00:09:04.920 |
an output here. We have a loss function that will compare the output of the network with 00:09:10.040 |
the label that we have assigned to this input label pair. It will compute a loss. Then we 00:09:17.600 |
calculate the gradient of the loss function with respect to the weight of the network 00:09:22.560 |
and the weights of this network are also the parameters of this network. And in this case 00:09:27.560 |
are W_11, W_12, the bias B_1, then W_21, W_22, B_2, which are the weights and the biases 00:09:36.620 |
for the first linear layer, plus the weights and the bias for the second linear layer. 00:09:42.680 |
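A minimal PyTorch sketch of this setup (the layer sizes match the drawing, but the data, the price units, and all hyperparameters are hypothetical, chosen just to make the snippet runnable):

```python
import torch
import torch.nn as nn

# A tiny regression network: 2 input features -> 2 hidden neurons -> 1 output (the house price)
model = nn.Sequential(
    nn.Linear(2, 2),   # first layer: W_11, W_12, W_21, W_22 and biases B_1, B_2
    nn.ReLU(),
    nn.Linear(2, 1),   # second layer: W_31, W_32 and its bias
)

loss_fn = nn.MSELoss()                                     # mean squared error for regression
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# One hypothetical training pair: [number of bedrooms, number of bathrooms] -> price (in 100k units)
x = torch.tensor([[2.0, 4.0]])
y = torch.tensor([[3.0]])

for step in range(100):
    optimizer.zero_grad()
    y_pred = model(x)           # forward pass through the computation graph
    loss = loss_fn(y_pred, y)   # compare the output with the target label
    loss.backward()             # backpropagation: gradients of the loss w.r.t. every weight and bias
    optimizer.step()            # update the parameters against the gradient direction
```

The loss.backward() call computes exactly the gradients we are about to discuss, and optimizer.step() applies the update rule we will see next.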
We calculated this gradient because the gradient indicates kind of a direction. So if you remember 00:09:48.360 |
from high school, the gradient is basically a derivative. So imagine we do it for a single 00:09:54.280 |
variable. So imagine that we want to calculate the derivative of the loss function 00:09:59.000 |
with respect to W_11. So we write it here. This is the loss function and this is W_11. 00:10:12.000 |
Imagine that the loss function looks something like this. We have a kind of a local minimum 00:10:19.440 |
here and then we have a global minimum here. Imagine that we are currently here. Our W_11 00:10:25.640 |
initially is here. When we calculate the derivative of the loss function with respect to W_11, 00:10:31.360 |
we will get the slope of the tangent line at this point, which is this one here. 00:10:37.400 |
And this indicates the direction in which the function is growing. So the function is 00:10:42.040 |
growing in this direction. We usually update our weights to move to the opposite direction 00:10:47.620 |
of the gradient. So we update our weights to move right so that the loss will diminish. 00:10:53.000 |
So for example, we take a little step in this direction so that the loss, as you can see, 00:10:56.940 |
will decrease because this will be the new loss 2. We started from loss 1 here. And this 00:11:03.960 |
is why we do backpropagation. So we calculate this gradient, so the gradient of the loss 00:11:08.280 |
function with respect to the parameters of the model. And then we update the parameters to 00:11:12.440 |
move against the direction of the gradient. The first thing that we do during our training 00:11:18.160 |
is the forward pass, which means that we have an input. We run it through our computation 00:11:22.800 |
graph to calculate an output. So let's do it here. So we have an input that is X_1=2 00:11:28.960 |
and X_2=4. We multiply, for example here in this node, each input by the weights of this 00:11:36.120 |
network and the weights are initialized as follows. So W_11=0.24, W_12=0.29, the bias 00:11:43.360 |
of the first neuron B_1=-0.70. This will result in some activations being calculated, so the values of each node 00:11:51.600 |
are called activations. We use the previous activation to calculate the next one, etc. 00:11:57.560 |
etc. until we arrive to the output of this neural network. We have a target because we 00:12:02.720 |
are training and we can calculate a loss. What do we do with this loss? We run backpropagation, 00:12:08.080 |
which means that we calculate the gradient of the loss function with respect to each 00:12:12.600 |
of these weights. For example, to calculate the gradient of the loss function with respect 00:12:16.960 |
to W_11, which is this parameter here, we can use the chain rule, which means that the 00:12:23.000 |
derivative of the loss function with respect to A_6, because we need to watch what are 00:12:30.040 |
the nodes that connect this parameter to the loss. So the nodes that connect this parameter 00:12:35.240 |
W_11 to the loss function are this node here, this node here, this one, and this one. What 00:12:41.480 |
we do in the chain rule is we just go from the loss to the parameter, backwards. So we 00:12:46.600 |
do the loss function with respect to the previous node, then this node, the derivative of this 00:12:52.520 |
node with respect to the previous node, then the derivative of this node with respect to 00:12:56.240 |
the previous node, the derivative of this node with respect to this node, and the derivative 00:13:00.320 |
of this node with respect to W_11, because this node contains the expression W_11. This 00:13:07.000 |
will result in a series of numbers that, when multiplied together, will 00:13:12.200 |
give us the derivative of the loss function with respect to W_11. What do we do with this 00:13:17.440 |
derivative, which is a number, because we evaluated this derivative on the input points 00:13:22.480 |
that we have chosen. It will give us a number that we will call gradient, and we use it 00:13:27.800 |
to update the value of our parameter. So the new value of the parameter W_11 is equal to 00:13:33.040 |
the old value of this parameter minus alpha, which is our learning rate, multiplied by 00:13:38.600 |
the value of this gradient. Now, why do we have a minus sign here? Because as we saw before, 00:13:46.240 |
the gradient indicates the direction in which the function is growing, the loss function 00:13:52.440 |
is growing with respect to the parameter, and we don't want to make the loss function 00:13:57.320 |
grow, we want the loss function to decrease, so we move in the opposite direction of the 00:14:01.640 |
gradient. So that's why we have a minus sign here. 00:14:09.040 |
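Here is a tiny sketch of that update rule for a single parameter (the toy loss and all values are made up just to illustrate new value = old value - alpha * gradient):

```python
import torch

# A single parameter and a toy differentiable loss, just to illustrate the update formula
w = torch.tensor(0.24, requires_grad=True)
alpha = 0.01  # learning rate

loss = (w * 2.0 - 1.0) ** 2  # some differentiable function of w
loss.backward()              # computes d(loss)/dw and stores it in w.grad

with torch.no_grad():
    w -= alpha * w.grad      # new value = old value - learning_rate * gradient
    w.grad.zero_()           # reset the gradient before the next step
```

Now, let's trick a classifier. So I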
introduced the gradient descent because it was needed for us to understand how to do 00:14:14.240 |
this trick. So imagine, first of all, what is a classifier? A classifier is a neural 00:14:18.960 |
network that can classify the input into one of the defined classes that we have. For example, 00:14:24.680 |
in this case, we may have a classifier that can take an input picture, and then can classify 00:14:30.560 |
it as a fish, or as a dog, as a volcano, as a car, or a pencil. For example, the ResNet 00:14:35.840 |
network can classify the input picture into one of the thousand classes it has in its 00:14:41.680 |
output logits. The output of a neural network of this kind is called logits, because it 00:14:48.800 |
indicates what is the score that the network assigns to each of these classes. We don't 00:14:54.960 |
usually work directly with the logits, we apply a softmax. A softmax is a function that 00:14:59.640 |
turns the logits into probability scores, because after the softmax they will sum up to one. And 00:15:05.920 |
then we take the class with the highest value of the softmax as the prediction of the model. 00:15:13.840 |
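A small illustration, assuming hypothetical logits for the five classes mentioned above:

```python
import torch

classes = ["fish", "dog", "volcano", "car", "pencil"]
logits = torch.tensor([4.1, 1.2, 0.3, -0.5, 0.9])  # hypothetical raw scores from the network

probs = torch.softmax(logits, dim=0)   # probability scores that sum up to one
pred = classes[probs.argmax().item()]  # class with the highest probability
print(pred, probs.max().item())        # e.g. "fish" with a very high probability
```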
So if after applying softmax, we see that our network indicated that this node here 00:15:21.040 |
is 95%, then it means that the network is telling us that this is a fish. And that also 00:15:27.840 |
applies to other cases, of course. So what do I mean by tricking a classifier? I mean 00:15:32.480 |
that I give you a classifier, so a neural network like this one, and you are not allowed 00:15:38.080 |
to change anything of the network. So you're not allowed to change the weights of this 00:15:43.120 |
network, you're not allowed to change the architecture of this network, you're not allowed 00:15:46.840 |
to change anything. So the weights are frozen, and the architecture and the hyperparameter, 00:15:52.040 |
everything is frozen. When we run a picture of a fish in this network, it will probably 00:15:57.560 |
classify it as a fish. But our goal is to give a picture of a fish as input, and 00:16:03.800 |
we want the network to classify it as something else that we can choose, for example, as a 00:16:07.880 |
volcano with very high probability. So of course, if you think about it, the only place 00:16:13.920 |
where we can work to trick this network is actually in the input. And this is what we 00:16:19.160 |
will do. We will change the input in such a way that the network will not see a fish 00:16:25.080 |
anymore, but it will see, for example, a volcano. So this fish and the previous fish are the 00:16:32.280 |
same for a human because you can see a fish and I can see a fish. But what we did was 00:16:37.760 |
to add a little bit of noise in this picture so that when the network sees this picture, 00:16:43.640 |
so this one here, it will not see a fish anymore, it will see a volcano. How is that even possible? 00:16:50.280 |
Let's see. So our goal is that we want to have a picture as input and we want to change 00:16:58.320 |
this picture in such a way that the network sees something else. How can we proceed? Well, 00:17:05.480 |
let's recall what we usually do when we train a network like this. We have a series 00:17:12.200 |
of pictures of fish, of trees, of people, etc. and the corresponding labels. 00:17:19.080 |
So for example, we have a thousand pictures of dogs and the label saying that they are 00:17:24.400 |
dogs. And then we have a thousand pictures of cats and saying that the corresponding 00:17:28.160 |
label is cat, etc. So what we do is we feed the input picture to the neural network. The 00:17:35.120 |
neural network will calculate some output, which is here, to which we apply the Softmax. 00:17:41.480 |
Then we have the corresponding label because we know what is this picture. This is coming 00:17:44.800 |
from our training data. So we know that is a fish, for example, in this case. We can 00:17:49.480 |
calculate the loss. Then we can run backpropagation, which means that we calculate the gradient 00:17:55.080 |
of the loss function with respect to the weights of this network, so the parameters of this 00:17:59.840 |
network. And then we update the parameters to reduce this loss. And this is how we train 00:18:04.320 |
this network. So let's try to see how we can trick the network into believing that, for 00:18:09.720 |
example, this fish here is actually a volcano. Now, when we do the training, as you saw before, 00:18:17.600 |
we calculated the gradient of the loss function with respect to the parameters, but we can 00:18:21.760 |
actually calculate also the gradient of the loss function with respect to the input picture. 00:18:27.360 |
So what we can do is as follows. We can create a new loss function. So imagine we have a 00:18:33.200 |
picture of a fish. We know it's a fish, but we want to trick the model into believing 00:18:36.800 |
it's a volcano. We can create a loss function with respect to the target that we want the 00:18:41.600 |
network to have. So we want the network to believe it's a volcano, so we can create a 00:18:45.720 |
new loss function with respect to the target volcano. And then we run this picture in the 00:18:51.080 |
network and we calculate the gradient of the loss function with respect to the input. And 00:18:55.640 |
later we will see in the code how to do that. But let's try to analyze what does it mean 00:19:01.340 |
to calculate the gradient of the loss function with respect to the input. Because it is a gradient, 00:19:06.400 |
it indicates the direction in which we should change the input to make the loss grow. 00:19:17.760 |
So we can run backpropagation and optimization on the input to decrease this loss. So because 00:19:26.480 |
the gradient tells us how we should change the input image to make the loss grow, we 00:19:31.360 |
can also change the input image in the opposite direction to make the loss decrease. So that's 00:19:38.180 |
what we will do. We calculate the gradient of the loss function with respect to the input 00:19:44.520 |
image, which will indicate a direction. We update the image with some noise in the opposite 00:19:50.280 |
direction, so we add a little bit of noise in the opposite direction indicated by this 00:19:54.120 |
gradient, and we keep updating it until the network predicts correctly the output as Volcano. 00:20:02.040 |
So in the code it is done as follows. Imagine we have a model, we have an input image, and 00:20:08.160 |
we have a target class, for example, Volcano. What we can do is we take our input image 00:20:13.200 |
and we create a tensor of it by asking PyTorch to also calculate the gradient with respect 00:20:19.320 |
to this tensor, because by default PyTorch will only calculate the gradient with respect 00:20:23.360 |
to the weights. So the gradient of the loss function with respect to the weights. But 00:20:26.720 |
we also want the gradient with respect to this input image. Then we run for a few steps 00:20:32.360 |
the following. We calculate the output of the model, so we are calculating this output 00:20:37.680 |
here. We are creating a special loss function with respect to this target that we want the 00:20:43.320 |
network to have. So we want the network to output Volcano, so we create a loss function 00:20:48.600 |
with respect to this target class. We run backward, which means that we calculate the 00:20:53.800 |
gradient of the loss function with respect to the input. And then we update the image, 00:21:01.320 |
so the image is updated just like the update formula for the parameters, so it's equal 00:21:06.520 |
to the old image minus some learning rate, here I call it alpha, multiplied by the direction 00:21:13.560 |
of the gradient of the loss function with respect to the input. And we are moving against 00:21:20.360 |
the direction of the gradient because we want the loss to decrease. If we update the image 00:21:24.840 |
continuously in this way, we will see that the network will predict it as Volcano. And 00:21:31.040 |
this is how we can trick a classifier. 00:21:35.960 |
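A minimal sketch of that loop in PyTorch (the model, the image tensor, and the target class index are placeholders; the actual notebook shown in the video may differ in details):

```python
import torch
import torch.nn.functional as F

def trick_classifier(model, image, target_class, alpha=0.01, steps=100):
    """Optimize the input image so that a frozen classifier predicts `target_class`."""
    model.eval()
    for p in model.parameters():
        p.requires_grad_(False)                 # the weights stay frozen, we never touch them

    image = image.clone().requires_grad_(True)  # ask PyTorch for gradients w.r.t. the input
    for _ in range(steps):
        logits = model(image.unsqueeze(0))      # forward pass, image is [C, H, W]
        loss = F.cross_entropy(logits, torch.tensor([target_class]))  # loss toward the desired target
        loss.backward()                         # gradient of the loss w.r.t. the input image
        with torch.no_grad():
            image -= alpha * image.grad         # move against the gradient to decrease the loss
            image.grad.zero_()
    return image.detach()
```

After enough steps, the classifier assigns the target class a very high probability even though the image still looks like a fish to us. I made this example because I wanted to show you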
that models may look at patterns that are completely different from us humans. For example, 00:21:43.000 |
in this case, the model is predicting this picture as a Volcano. So the model somehow 00:21:49.240 |
is seeing a Volcano here, even if to us humans, we will never be able to see a Volcano in 00:21:54.960 |
this picture, it's a fish. So understanding how our model makes its prediction can help 00:22:02.480 |
us improve our models. And thanks to the sponsor of today's video, LeapLabs, we can get insights 00:22:09.040 |
into how our model makes its predictions. LeapLabs is a research lab that is focusing 00:22:15.400 |
on machine learning interpretability. And they have developed this library, the LeapLabs 00:22:20.680 |
interpretability engine, that allows us to understand what our model has learned and 00:22:26.000 |
how we can get insights from our model to improve it. For example, this library allows 00:22:33.160 |
us to generate prototypes. Prototypes, what are prototypes? Well, imagine that you have 00:22:38.360 |
a classification model, a computer vision classification model, which means that you 00:22:42.040 |
have a model that takes as input a picture and will classify it as one of the classes. 00:22:48.240 |
In this case, it's a food classifier that will classify an input picture as one of the 00:22:52.960 |
following classes, for example, ice cream or hamburger or pancakes or waffles. In this 00:22:58.160 |
case, it looks like the model is well trained, because by generating prototypes, 00:23:03.880 |
we can get the kind of input that the model wants to see to classify it as a target class. 00:23:10.960 |
So this is the kind of picture that the model wants to see to classify the input picture 00:23:16.680 |
as a hamburger. This is the kind of picture that the model wants to see to classify the 00:23:21.480 |
input picture as a pancake. And it actually resembles a pancake. And this one actually 00:23:25.840 |
resembles a hamburger, which means that the model has learned the correct features from 00:23:31.200 |
the food to classify it as a given class. But we will see later a case in which this 00:23:36.700 |
is not true. Another feature of the Leap Labs interpretability 00:23:41.480 |
engine is entanglement. Entanglement allows us to understand how different classes share 00:23:47.240 |
features. For example, for a food classifier like the one we saw before, we expect high 00:23:53.320 |
entanglement between the ice cream class and frozen yogurt class. Because at least for 00:23:58.000 |
me as a human, they look very similar. They both look like ice cream. So it is expected 00:24:05.160 |
that these two classes share features. But imagine that you have a more broad classifier 00:24:10.920 |
like the one we saw before that can classify fish and volcano, etc. I would not expect, 00:24:16.960 |
for example, cheesecake and dog to have high entanglement. Because at least for me, they 00:24:23.360 |
shouldn't share features. I mean, they are totally different objects. So if they do have 00:24:29.520 |
high entanglement in the model, it means that the model is looking at the wrong features. 00:24:39.440 |
And it also may indicate a higher chance of misclassification between these two classes. 00:24:46.540 |
Another feature that is very important is feature isolation. So in this case, feature 00:24:52.040 |
isolation allows us, for example, to understand which parts of the input are being used by 00:24:57.320 |
the model to make a certain prediction. For example, for a food classifier, imagine we 00:25:02.640 |
have the following picture. The food classifier will classify it as a frozen yogurt with 98% 00:25:11.160 |
probability. But by generating feature isolation, we can understand which part of the input 00:25:17.880 |
is being used to classify the input as a frozen yogurt. And it's actually the part that looks 00:25:23.560 |
like frozen yogurt. But also because there is entanglement between frozen yogurt and 00:25:29.040 |
ice cream, the model, as you can see, is using the similar features to also classify it as 00:25:34.320 |
ice cream with low probability because the model is well trained. But still, they have 00:25:38.240 |
some shared features, as you can see. And there is something that you may not have noticed, 00:25:43.840 |
which is the waffles class. With very low probability, the model may also classify it as a waffle. 00:25:48.640 |
Why? Because the model is seeing some features, which are the berries that are on this frozen 00:25:55.560 |
yogurt, to classify it as a waffle. This can happen because in the original picture, in 00:26:01.840 |
the training data, the waffles probably had the berries on top. So the model learned to 00:26:06.960 |
look at the berries to recognize a waffle. So the LeapLabs Interpretability Engine can 00:26:11.880 |
understand this and will show you this. This helps you understand what your model has learned. 00:26:18.680 |
Now let's look at a case where things can go wrong in our model and how the LeapLabs Interpretability 00:26:25.920 |
Engine can help us improve it. If you look at the tutorial link that I have shared in 00:26:33.440 |
the description, if you go to this link here, to the tutorials at the LeapLabs website, 00:26:41.720 |
you will see a tank detection case study. And if you open it, it will open a Colab notebook. 00:26:48.000 |
Now let's run it, actually. So let me change the runtime type. We choose the T4 GPU, and we can 00:26:55.480 |
run it. It will do some imports. Now, what is the tank detection case study? Well, we 00:27:02.120 |
are talking about a classification model that can detect tanks or no tanks. So it has only 00:27:08.120 |
two classes that indicate if the picture contains a tank or it does not contain a tank. Suppose 00:27:14.080 |
that this is a model that is very important for us, and we want to deploy it in the battlefield 00:27:18.760 |
because it can help protect our soldiers. But before deploying it, of course, we want 00:27:25.000 |
to understand what our model has learned. So by understanding what our model has learned, 00:27:29.920 |
we can predict failure modes. So if we run, for example, a picture of a tank into our 00:27:37.760 |
classification model, we will see that it is classified as having a tank with a very high 00:27:43.640 |
probability. So in this case, the model is predicting that there is a tank in this picture 00:27:47.640 |
with 98% probability. So it looks like the model is performing very well. But let's try 00:27:53.360 |
to use the LeapLabs interpretability engine to understand what our model has learned. 00:27:58.200 |
So we install the library. Then we can use the library to generate prototypes. As we 00:28:03.520 |
saw before, the prototype tells us what kind of input the model wants to see to give 00:28:09.520 |
a certain output. In this case, we need the LeapLabs API key, 00:28:15.680 |
which we can generate from the LeapLabs website. So we go to the dashboard, we go settings, 00:28:22.400 |
and it will generate a key here. We put our key in the API key and we can generate a prototype. 00:28:41.680 |
On my computer, it takes around 25 seconds, I think, or one minute to generate it. Okay, 00:28:47.640 |
the library has generated two prototypes, one for the tank class. So 00:28:55.240 |
what kind of input the model wants to see to tell us that there is a tank and what kind 00:29:02.620 |
of input the model wants to see to tell us that there is no tank. And let's look at this 00:29:08.640 |
picture, which corresponds to the output indicating that there is a tank. If we look at this picture, 00:29:14.020 |
you see that actually there is no tank. So it means that the model is looking at some 00:29:19.840 |
stuff that is gray, which probably looks like a cloud. But there is no tank here. I mean, 00:29:26.080 |
I expected to see a cannon, I expected to see some wheels or maybe the gun or a soldier 00:29:30.920 |
with a gun on top of the tank or something like this, but actually there is none of these 00:29:34.120 |
features. So is our model looking at the correct features to actually predict a tank, the presence 00:29:41.440 |
of a tank? And let's look at the other class, no tank. As we can see, we have these green 00:29:48.620 |
lines here, which probably indicates grass. So probably the model is looking at the grass 00:29:54.560 |
to indicate that there is no tank. So if it sees an open field with only grass, it will 00:30:01.080 |
say that there is no tank, which could make sense. But the problem is, why is our model 00:30:05.920 |
not looking at the tank to indicate that there is a tank? So let's try to make a prediction 00:30:12.200 |
before looking further at what the LeapLabs Interpretability Engine can tell us. What could 00:30:19.720 |
happen in this case is that imagine that in our training data, we have a lot of pictures 00:30:24.800 |
of tanks and all of them that have tanks happen to have cloudy sky. So what our model may 00:30:32.480 |
have learned is that if there is a cloudy sky, then there is a tank, not that if there 00:30:37.280 |
is a tank, there is a tank. So let's validate our hypothesis. We can use a feature isolation 00:30:43.560 |
to understand what kind of features from an input picture the model is looking at to make 00:30:48.700 |
a certain prediction. So in this case, for example, we can feed the picture that we saw 00:30:55.160 |
before. So this picture as input to see what kind of features the model is looking at to 00:31:01.040 |
predict a tank. Let's see. At first glance, the model seems to be using the entire picture to 00:31:19.760 |
predict a tank. But the white areas indicate features that are not being 00:31:25.560 |
used, and the other areas indicate features that are being used. So as you can see, 00:31:29.840 |
the tank here is white, which means that the model is not using the tank to predict 00:31:35.240 |
the tank, but it's using the sky and maybe the ground to predict that there is a tank. 00:31:41.200 |
So as suspected, the model doesn't seem to use the actual tank for classification much 00:31:46.880 |
at all, right? It's using the sky, the background, and maybe the saturation of the picture. So 00:31:51.800 |
how can we fix this model? Well, one way to fix it is to further train the model by using 00:31:58.000 |
more diverse images of tanks that have maybe some sunny sky, maybe some cloudy sky, maybe 00:32:04.500 |
maybe a snowy environment and some environments maybe in the forest, etc. 00:32:11.100 |
So that the model cannot find any other correlation between pictures of tanks except for the tank 00:32:18.200 |
itself. So that the model will be forced to learn the presence of the tank itself as a 00:32:25.360 |
predictor for tanks. We can run this training and it will for sure improve our model. And 00:32:31.280 |
there is code here showing how to train it again. And after training, we can run feature isolation 00:32:36.560 |
again. And we can see here at the end that after retraining the model on more diverse 00:32:42.440 |
pictures, the model is actually putting all its attention on the tank itself to predict 00:32:48.360 |
the tank and not on the surrounding area. So all of this, thanks to the Leap Labs interpretability 00:32:54.520 |
engine. Now let's talk about feature visualization. So what we saw before with Leap Labs interpretability 00:33:02.240 |
engine is that we can get insights into how our model is making its prediction or what 00:33:07.400 |
kind of feature our model has learned. And in particular, especially for convolutional 00:33:12.480 |
neural networks for computer vision, we have, of course, a subsequent application of layers 00:33:18.400 |
of convolutions. And our goal with feature visualization is to understand, for each of 00:33:23.520 |
these layers or each of the neurons that make up these layers, what kind of features 00:33:29.680 |
from the input they learned that contribute to the final prediction. So we want to understand, 00:33:34.920 |
for example, imagine that you have a food classifier and you have many layers in your 00:33:38.840 |
convolutional neural network. Each layer will be looking at a particular kind of feature 00:33:44.080 |
in the input that will contribute to the final output for the final classification. Some 00:33:49.160 |
layers may look at, for example, lines. Some layers may be looking at edges. Some layers 00:33:56.640 |
may be looking at certain patterns, etc. So we want to understand what features each of 00:34:03.200 |
these layers or each of these neurons have learned. And we can do feature visualization 00:34:08.080 |
at many levels. We can do it at the neuron level. So what features is this neuron looking 00:34:13.480 |
at? Or we can do it at the layer level. So what kind of features is the particular layer 00:34:19.600 |
looking at? And also at the logit level, in this case, we have a classification network. 00:34:26.800 |
So we want to understand what kind of features the model wants to see in order to predict 00:34:33.440 |
that particular class. So we will model the feature visualization problem as an optimization 00:34:41.880 |
problem. And it's actually how it's done in practice. And it's actually also how more 00:34:46.320 |
or less the Leap Labs Interpretability Engine works. Of course, it's much more sophisticated. 00:34:51.800 |
So this is a simplified explanation. But I wanted you to understand how such an engine 00:34:56.120 |
works so that when you use it, you also know what's happening inside. So what we do, imagine 00:35:03.640 |
that you have a classification network, a convolutional network that is used for classification. 00:35:07.880 |
So as you can see here at the end, we have a Softmax and we have subsequent layers of 00:35:11.920 |
convolutions. We want to understand what this layer of convolution has learned. So in order 00:35:20.400 |
to understand what kind of features this layer has learned, we will treat it as an optimization 00:35:28.040 |
problem, which means that we will create an input that is a complete noise. We run it 00:35:33.960 |
through our network. We take the activations of this layer. So all the outputs of this 00:35:41.000 |
layer and we use it as an objective function. Or you can also call it a loss function. So 00:35:46.760 |
it's the same thing. So you take the output of this layer as the objective and then you optimize the 00:35:52.720 |
input to maximize it. So that's why I'm calling it objective. Whenever 00:35:57.260 |
you are maximizing something, we call it objective function. Whenever you are minimizing something, 00:36:02.120 |
we call it a loss function. But it's the same thing. The only thing that changes is that in one 00:36:06.000 |
case you are doing gradient ascent and in the other case you are doing gradient descent. 00:36:11.020 |
In this case, we want to maximize the activations of this layer. So 00:36:16.680 |
we treat the output of this layer as the objective function and we run backpropagation to maximize 00:36:24.720 |
these activations. And this will modify the input in such a way that it maximizes these 00:36:30.080 |
activations. This will get us insights into what kind of features this layer wants to 00:36:39.360 |
see to contribute to the final prediction. 00:36:46.560 |
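A simplified sketch of this activation-maximization idea in PyTorch (the layer choice, image size, and step count are arbitrary; real tools like the LeapLabs engine or Lucid add the regularizers discussed later):

```python
import torch

def visualize_layer(model, layer, steps=200, lr=0.05):
    """Optimize a noise image so that it maximizes the mean activation of `layer`."""
    for p in model.parameters():
        p.requires_grad_(False)                    # we only optimize the input, not the weights

    activations = {}
    def hook(module, inputs, output):
        activations["value"] = output              # capture the layer's output on each forward pass
    handle = layer.register_forward_hook(hook)

    image = torch.randn(1, 3, 224, 224, requires_grad=True)   # start from pure noise
    optimizer = torch.optim.Adam([image], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        model(image)
        objective = activations["value"].mean()    # the objective we want to maximize
        (-objective).backward()                    # gradient ascent = descent on the negative objective
        optimizer.step()

    handle.remove()
    return image.detach()
```

We can also optimize for logits. So for example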
if instead of using a particular layer, we use the logits of a particular class, for 00:36:52.360 |
example the class associated with dogs, because we want to see what kind of dogs our model 00:36:58.520 |
wants to see to predict it as a dog, we can use the logits corresponding to the dog class. 00:37:05.060 |
We feed as input complete noise. We use the logit corresponding to the dog class as an 00:37:13.320 |
objective and we run backpropagation to optimize this input to maximize this logit. This is 00:37:23.520 |
actually how you can generate kind of a prototype for the class dog. 00:37:31.600 |
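The logit variant looks almost the same; a minimal sketch (class index and hyperparameters are placeholders, and there is no regularization yet):

```python
import torch

def naive_prototype(model, class_idx, steps=200, lr=0.05):
    """Optimize a noise image to maximize the logit of `class_idx` (no regularization yet)."""
    for p in model.parameters():
        p.requires_grad_(False)
    image = torch.randn(1, 3, 224, 224, requires_grad=True)  # start from pure noise
    optimizer = torch.optim.Adam([image], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        logits = model(image)
        (-logits[0, class_idx]).backward()   # maximize the target logit via gradient ascent
        optimizer.step()
    return image.detach()
```

Of course, you may be wondering,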
is it that simple? Well, not really, because if you do this procedure, so if you start 00:37:37.000 |
from complete noise and you try to maximize a certain logit, it will for sure give you 00:37:42.800 |
insights into what the model has learned, so what kind of input the model wants to see 00:37:48.240 |
to have that logit as output, so that class as output, but it will not look very natural. 00:37:57.840 |
So for example this image here, I believe it's taken from ResNet, in which we can see 00:38:05.760 |
that for example if we optimize for the class Flamingo, we see that the input needs to have 00:38:11.960 |
something like this long necks here, which are typical of Flamingo, which means that 00:38:17.280 |
the model will actually look at these long necks of Flamingos to actually predict the 00:38:21.920 |
Flamingo class. If we look at for example Goldfish, we can see that we have these eyes 00:38:28.160 |
of this Goldfish here, and for example this one looks like the shape of a fish, so the 00:38:33.480 |
model will actually look at the fish to predict the Goldfish. And if we look at for example 00:38:40.480 |
Tarantula, we will see these long black legs here, like this one, like this one, which 00:38:45.880 |
means that the model actually will look at the legs of the Tarantula to predict it as 00:38:50.640 |
Tarantula. But of course you can see that these pictures don't look really natural, 00:38:56.520 |
whereas if you look at the Leap Labs interpretability engine's outputs, they look quite natural. So for example 00:39:00.600 |
if we go back, and we look at the prototypes generated for Pancakes, it actually looks 00:39:05.480 |
like a Pancake. And if we look at Hamburger, it actually looks like Hamburger. So how can 00:39:11.120 |
we make our inputs look more natural? Well for one, you could use the Leap Labs interpretability 00:39:17.760 |
engine which can do it out of the box, but to understand how Leap Labs does it: they use 00:39:23.240 |
what is known as regularization. Let's talk about regularization. So first of all, what 00:39:29.440 |
is regularization? When we train a model, our goal is to run some input through this 00:39:37.880 |
model, calculate an output, compare it with the target so that we can calculate the loss 00:39:42.160 |
and then update the parameters of the model such that we reduce this loss. When we introduce 00:39:50.040 |
regularization, we want this optimization to happen in a particular way, so we want 00:39:55.160 |
to put some constraints in our optimization process. For example, when we train a model, 00:40:01.320 |
we can do what is known as L1 regularization. With L1 regularization, what we do basically 00:40:07.780 |
is we have our loss function, which is our, let's say, cross-entropy loss, because we 00:40:15.960 |
are doing, for example, classification tasks. Then we can add some regularizer, which is 00:40:21.200 |
a constraint that we add to our loss function to make this optimization process happen in 00:40:27.360 |
particular ways. For example, with L1 regularization, we want our model to use as few 00:40:35.800 |
input features as possible. So as a regularizer, we use the L1 penalty, 00:40:42.080 |
which is basically just the sum of the absolute values of all the weights. What happens in this case? 00:40:50.940 |
What will happen is that because we calculate always the gradient of the loss function with 00:40:56.500 |
respect to the weights of the model, the presence of this absolute value on the weights or the 00:41:03.620 |
parameters will push these weights towards zero. And because many weights become 00:41:09.180 |
zero, the model will use fewer features from the input. And this helps to make the model more 00:41:15.620 |
sparse, which also helps us to then reduce the size of the model. So regularizers are 00:41:22.580 |
particular constraints that we add to the loss function to make the optimization process 00:41:27.720 |
happen in particular ways. 00:41:35.060 |
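A small illustrative sketch of L1 regularization (the lambda value is arbitrary):

```python
import torch

def loss_with_l1(model, base_loss, lam=1e-4):
    """Add an L1 penalty on the weights to the task loss (e.g. cross-entropy)."""
    l1_penalty = sum(p.abs().sum() for p in model.parameters())
    return base_loss + lam * l1_penalty   # the optimizer now also pushes weights towards zero
```

And this is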
what we can also do in our optimization problem. So what are we optimizing? We are starting 00:41:40.360 |
from pure noise. For example, this is our pure noise, and we want to transform it into 00:41:47.640 |
some kind of input that maximizes a particular output logit in our classification network. 00:41:56.180 |
Of course, when we train a neural network, the data set that the network was trained 00:42:02.540 |
upon, let's say that in the space is here, but this does not mean that the model will 00:42:08.400 |
not activate the output logit, for example, corresponding to the class dog, for something 00:42:14.520 |
that is out of distribution. So what we want to do is we want our neural network to optimize 00:42:21.320 |
our input noise. Sorry, we want our optimization problem to optimize our input noise in such 00:42:28.160 |
a way that we remain close to the distribution of the data that the network has seen. So 00:42:34.400 |
the natural input that the network has seen. How to do that? Well, first of all, look at 00:42:40.320 |
my picture. Do you think it's a noisy picture? No, because if you look at my t-shirt, you 00:42:46.800 |
can see that adjacent pixels are similar, and there is not much variance between 00:42:54.000 |
neighboring pixels. So we could ask our optimization problem to optimize the input 00:43:00.720 |
in such a way that it penalizes high variance for neighboring pixels. And this is known 00:43:08.840 |
as a frequency penalization. So we take our loss function, which is basically just the 00:43:14.720 |
logit that we want to maximize, and we add a penalty to this loss every time we see 00:43:20.200 |
a very high variance for neighboring pixels. 00:43:26.560 |
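One common way to express this penalty is a total variation term over neighboring pixels; a minimal sketch (how strongly to weight it against the logit objective is up to you):

```python
import torch

def total_variation(image):
    """Penalize large differences between neighboring pixels (image shape: [1, C, H, W])."""
    dh = (image[:, :, 1:, :] - image[:, :, :-1, :]).abs().mean()  # vertical neighbors
    dw = (image[:, :, :, 1:] - image[:, :, :, :-1]).abs().mean()  # horizontal neighbors
    return dh + dw

# Hypothetical combined objective to maximize: target_logit - tv_weight * total_variation(image)
```

Another regularizer that we can use is the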
transformation robustness. This is not applied to the loss, actually. This basically means 00:43:32.280 |
that we take our input, the one that we are optimizing, we transform it some way so we 00:43:38.040 |
can rotate it, we can scale it, we can translate it. In this case, this code that I took from 00:43:44.520 |
the Lucid library, which is a very famous library for feature visualization, they applied 00:43:52.320 |
random scaling and random rotation, which means that they will rotate and randomly scale 00:43:58.440 |
the input and then pass it through the network. And because it's an optimization 00:44:03.320 |
problem, the optimization process will have to modify the input in such a way 00:44:10.600 |
that even when it's translated, even when it's rotated, even when it's scaled, it will 00:44:16.360 |
still activate that output. So it will only affect the pixels, the input features that 00:44:22.480 |
are needed to actually activate that logit. Which also means, in other words, that 00:44:29.440 |
in case we are trying to, for example, maximize the logit corresponding 00:44:35.360 |
to the class dog, it will actually try to create a dog, because it does not matter if 00:44:39.960 |
the dog is rotated, it does not matter if the dog is scaled, it does not matter if the 00:44:44.640 |
dog is translated here or there. So it will try to create 00:44:51.600 |
a natural dog as much as possible. 00:44:57.400 |
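A simplified sketch of transformation robustness (Lucid applies its own set of random scalings and rotations; here just random translation and scaling, implemented with differentiable torch ops for illustration):

```python
import random
import torch
import torch.nn.functional as F

def jitter(image):
    """Randomly translate and scale the image a little before each forward pass."""
    dx, dy = random.randint(-8, 8), random.randint(-8, 8)
    image = torch.roll(image, shifts=(dx, dy), dims=(-2, -1))                  # random translation
    scale = random.uniform(0.9, 1.1)
    size = int(image.shape[-1] * scale)
    image = F.interpolate(image, size=(size, size), mode="bilinear", align_corners=False)
    return F.interpolate(image, size=(224, 224), mode="bilinear", align_corners=False)

def robust_step(model, image, class_idx, optimizer):
    """One optimization step where the image must survive the random transform."""
    optimizer.zero_grad()
    logits = model(jitter(image))         # the transformed image must still activate the class
    (-logits[0, class_idx]).backward()    # keep maximizing the target logit
    optimizer.step()
```

Of course, there are many more regularizers that we need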
to add to make this optimization problem more robust so that 00:45:05.160 |
we don't get some out of distribution data, but we want to try to generate data that is 00:45:10.880 |
as in distribution as possible. And this is also how LeapLabs works. So the LeapLabs interpretability 00:45:18.400 |
engine can generate prototypes that look natural. And the way they do it is described in this 00:45:25.320 |
paper called "Prototype Generation - Robust Feature Visualization for Data-Independent 00:45:30.080 |
Interpretability" in which they describe the process of generating these prototypes. And 00:45:36.400 |
the way they do it is basically they apply all these regularization techniques. So for 00:45:40.840 |
example, you can see here, random transformation, so that the optimization process produces 00:45:47.840 |
an input that is as natural as possible without ever actually seeing an input. So as you remember, 00:45:57.000 |
when we do a prototype generation with the LeapLabs interpretability engine, we never 00:46:01.160 |
feed an input picture. We just give the model and the algorithm will generate a prototype 00:46:07.240 |
without ever seeing what a natural picture looks like. But it 00:46:13.280 |
can still generate very natural inputs. Why? Because they make this optimization process very robust. 00:46:21.100 |
So they penalize, for example, the high frequency or the high variance in the neighboring pixels. 00:46:27.880 |
They also apply transformation, etc. So that the resulting input is as close as possible 00:46:33.440 |
to the natural inputs that the model is trained upon. Now let's try to use the knowledge that 00:46:39.800 |
we have acquired and apply it to language models. So as we saw before, with computer 00:46:45.120 |
vision models, we can do prototype generation, which is based on feature visualization, which 00:46:49.960 |
means that given a particular, for example, output logit, we want to understand what kind 00:46:54.440 |
of input the model wants to see to have that particular logit as output. Can we apply the 00:47:01.200 |
same techniques also to language models? So given a desired output, what kind of prompt 00:47:06.600 |
the model wants to see to generate that output? Well, let's try to answer that question. First, 00:47:12.840 |
let's review how language models work. So a language model, as you know, is a probabilistic 00:47:17.120 |
model that assigns probabilities to sequences of tokens. For example, imagine that the input 00:47:23.440 |
to the language model is that Shanghai is a city in China. The model will tell us what 00:47:29.000 |
is the probability of the next token being China or being Beijing or being cat or being 00:47:34.400 |
pizza or being whatever token is present in our vocabulary. One simplification I always 00:47:39.960 |
do in my video is to associate a token with a word and the word with a token. But this 00:47:44.960 |
is not usually the case. So usually a word may not be a token and the token may not be 00:47:50.960 |
a word. And actually, in most cases, a word is made up of multiple tokens. 00:47:56.560 |
But for our case, we will simplify it and see that every token is a word and every word 00:48:00.520 |
is a token. So the language model just tells us what is the probability of the next token 00:48:05.480 |
given an input prompt. 00:48:13.280 |
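A small sketch of reading next-token probabilities with the Hugging Face transformers library (GPT-2 is just an example model; the model analyzed later in the video may be different):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Shanghai is a city in"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits           # shape: [batch, sequence, vocabulary]

next_token_probs = torch.softmax(logits[0, -1], dim=-1)  # distribution over the next token
top = next_token_probs.topk(5)
for p, idx in zip(top.values, top.indices):
    print(tokenizer.decode(int(idx)), float(p))          # e.g. " China" with the highest probability
```

Imagine we want to understand what our model thinks of the word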
girl. So what kind of input, what kind of prompt the model wants to see as input to 00:48:22.400 |
predict the word girl as next token? Well, let's try to use the techniques that we have 00:48:29.720 |
seen before. So first, let's see the results of such an analysis. And in particular, Jessica, 00:48:36.640 |
the founder of LeapLabs, did this study. So she took some tokens, for example, 00:48:42.800 |
the word girl, and then she optimized the input prompt in such a way that the output 00:48:48.320 |
girl is maximized. So the next predicted token is a girl given this input. And she did it 00:48:56.080 |
also for the word woman. She did it for the word good and for the word doctor. This gives 00:49:02.120 |
us insight into what our model has learned because our model, our language model is just 00:49:07.440 |
a model that models the statistical distributions of tokens based on the training data it has 00:49:13.760 |
seen. So in this case, for example, the prompt that maximizes the word girl as being the 00:49:21.920 |
next token is this input here. And as you can see, it tells us that our model has seen 00:49:28.240 |
a lot of bad data that is making our model have bias against girls, for example, because 00:49:37.400 |
we see sexual words, we see other words that are not quite polite. And the same happens 00:49:43.560 |
for, well, in the case of the woman, the word woman, it's a little better, but still it 00:49:49.240 |
tells you what are the biases of your model against this particular concept. And we can 00:49:54.960 |
see it also for the word good, for example, we see that shooting is good or somehow, and 00:50:00.080 |
Jesus and beautiful or basketball, et cetera. So optimizing the prompt to generate a particular 00:50:09.400 |
output tells you what our model wants to see as input to generate that particular output, 00:50:15.240 |
which gives us insights into the distribution that our model has learned. And Jessica, she 00:50:21.840 |
ran another experiment. Because when we optimize the prompt, and we will see later 00:50:26.800 |
how it's actually done in practice, we start from complete noise and we optimize this prompt 00:50:32.560 |
to become tokens that are more likely to predict a particular output. And of course, 00:50:42.140 |
you can restart this optimization problem from multiple starting points, because you 00:50:46.440 |
can start from complete noise, so there are many possible starting points. 00:50:52.240 |
And she did it many times and she got some input tokens 00:50:57.280 |
that were more likely to predict the word girl as next token. This gives us a map on 00:51:03.900 |
what kind of inputs the model wants to see, with each token and its frequency, to 00:51:11.120 |
predict girl as next token or boy as next token 00:51:16.480 |
or science or art as next token. And this also gives us insight into the statistical 00:51:21.440 |
distribution that our model has learned. For example, the word girl, to get the word girl 00:51:26.800 |
as output, the model wants to see some sexual words and also some curse 00:51:34.760 |
words, but also for example the word dresses or the word boys. And in the case 00:51:39.900 |
of the word boy, we can see that it wants to see rebellious, monkey, girl, et cetera, 00:51:45.600 |
et cetera, but this gives us insight into what our model has seen during its training. 00:51:51.760 |
So now let's try to analyze how to actually generate this kind of map and how this optimization 00:51:57.840 |
problem works. What we did before with the computer vision models, that is we have some 00:52:04.080 |
output logits for which we want to find an input that maximizes that logit is exactly 00:52:09.560 |
what we want to do here, except that here we have a language model and we have another 00:52:15.800 |
complexity. So let's do it step-by-step, how we can generate this kind of map. Imagine 00:52:22.500 |
that we want to find input embeddings that maximize the probability of the next token 00:52:29.200 |
being girl. Now, the first complexity is that girl may not be one token, but it could be 00:52:35.240 |
multiple tokens. So let's suppose that it's actually multiple tokens because this is a 00:52:40.080 |
real scenario. So we have the output that we want to optimize an input for, and suppose 00:52:46.480 |
that we want to optimize three input embeddings. So let's draw three input embeddings to maximize 00:52:54.800 |
the probability of the next token being girl, but we know that girl may not be a single 00:53:00.320 |
token. So let's suppose that it's actually two tokens, one token being GI and the 00:53:06.780 |
other token being RL. Now the job that we did before, that is calculating the gradient of 00:53:16.760 |
the loss with respect to the input, is something that we cannot do anymore. Why? 00:53:22.440 |
Because the input in the language model is tokens and tokens are numbers that represent 00:53:28.640 |
the position of this token in the vocabulary. So for example, imagine the input could be 00:53:34.760 |
for example, zero, five, and nine, and these are positions that represent each token in 00:53:40.600 |
the vocabulary. And we cannot optimize for something that is discrete because there is 00:53:45.680 |
no token 0.5, there is no token 3.2. We cannot change these tokens a little bit, hoping that 00:53:52.620 |
they move towards something that will generate that kind of output. The 00:53:57.920 |
only thing that we can optimize are embeddings. So we will not be optimizing input tokens, 00:54:03.360 |
we will be optimizing input embeddings. So in this case, we suppose that we have three 00:54:08.680 |
embeddings. So let me delete this part. So we suppose that we have three input embeddings. 00:54:29.520 |
Now which three input embeddings should we choose? Well, in the case of the computer 00:54:33.420 |
vision model, we started from pure noise. In this case, we can also start from pure 00:54:37.180 |
noise. So we can start from three random embeddings. One, two, and three. What we can do, we can 00:54:47.000 |
run these three embeddings in our language model. And as you know, the language model 00:54:51.400 |
is a transformer model in most of the cases, and it's a sequence-to-sequence model, so 00:54:55.600 |
if the input is three embeddings, it will generate three embeddings as output. 00:55:00.960 |
So here, we will have three embeddings. Our goal is to make sure to select three embeddings 00:55:12.600 |
that make the likelihood of the next token being GI and the next next token being RL 00:55:20.480 |
maximized. So how to proceed? We take these three embeddings, we run it through our model, 00:55:26.500 |
it will produce three embeddings as output. Usually when sampling from a language model, 00:55:31.660 |
we take the last embedding, so the last hidden state (the outputs of the language model 00:55:38.220 |
at this point are called hidden states). We take the last hidden state, we send it to 00:55:44.060 |
the linear layer, and it will generate what are known as logits. Logits indicate what 00:55:50.100 |
is the probability score, it's not a probability actually, but what is the score that the model 00:55:54.900 |
assigns to each token in the vocabulary, and then by applying the softmax, they become 00:56:00.360 |
probability scores, and then usually we choose the token with the maximum probability score 00:56:05.940 |
as the next token. So in this case, we can take the last embedding, we can run it through 00:56:15.780 |
the linear layer, and it will generate logits. So we will have the logits associated with 00:56:21.100 |
the position, let's say zero. In these logits, which are a list of numbers, 00:56:28.860 |
one for each position in the vocabulary, we are interested in two logits in particular. 00:56:34.220 |
One is the one with the highest probability score, and that will be used to sample the 00:56:40.140 |
next token, and one is the logit corresponding to the token GI. So we save two logits, one 00:56:47.180 |
is corresponding to the GI, and one is for the next token. We use the logit corresponding 00:56:55.620 |
to the next token to understand what is the next token, we put it back in the input of 00:56:59.980 |
the model, so we put back the embedding corresponding to this next token in the input of the model, 00:57:07.020 |
along with the three input tokens that we saw before. This will result in four output 00:57:11.860 |
embeddings being generated. We take the last embedding, and this will generate logits corresponding 00:57:17.700 |
to the next position. And also in these logits, actually in this case, 00:57:23.660 |
we are interested only in the logit corresponding to the token RL. Then we take these two logits, 00:57:31.820 |
we know their probability score, because we can run the softmax, and we use them as objective 00:57:38.020 |
for our optimization process, because we want to maximize these probabilities. So the probability 00:57:44.380 |
of selecting this token and this token. We can take their negative log probabilities, so 00:57:52.220 |
once we have run the softmax they become probabilities, we take the negative log of each and sum them up, and 00:57:59.980 |
this sum will become our objective to minimize. Now, if we do it like this, there is no guarantee that the 00:58:07.060 |
inputs that we are optimizing here will actually be embeddings that correspond to some token. 00:58:13.940 |
They may not correspond to any token, because as we saw before for computer vision models, 00:58:18.700 |
our model has some natural inputs that may be here, and maybe we are optimizing something 00:58:24.540 |
that is here, that is out of distribution. So we need to find a way to push these embeddings 00:58:30.500 |
to go in distribution, and one way is to find a regularizer. So something that puts a constraint 00:58:38.540 |
in our optimization problem to push the embeddings in certain directions, and one way to do that 00:58:44.860 |
is whenever we feed these three embeddings that we are trying to optimize to the language 00:58:50.100 |
model, we can calculate their distance from the closest embedding in the vocabulary, and 00:58:57.460 |
use this as a regularizer. So we add this distance to our loss function, and we ask the optimization 00:59:05.060 |
problem to minimize also this distance. This will force our optimization problem to optimize 00:59:11.700 |
the embeddings in such a way that they maximize the likelihood of the next token being GI, 00:59:18.620 |
and the next next token being RL, but at the same time, to produce embeddings that are 00:59:25.500 |
closer to our vocabulary, so that they will actually map to some token in our 00:59:31.700 |
vocabulary. So the optimization will not just produce any embeddings that result in that activation, 00:59:37.380 |
but embeddings that are close to ones that are actually present 00:59:41.820 |
in our vocabulary. And this is how we can generate that kind of map. 00:59:48.820 |
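To make the whole procedure concrete, here is a heavily simplified sketch (it assumes a Hugging Face-style causal LM that accepts inputs_embeds, plus its embedding matrix of shape [vocab_size, d_model]; the actual implementation behind the experiments above differs and uses more tricks):

```python
import torch
import torch.nn.functional as F

def optimize_prompt(model, embedding_matrix, target_ids, n_prompt=3, steps=300, lr=0.1, reg=0.1):
    """Optimize `n_prompt` input embeddings so the model predicts the tokens in `target_ids` next.

    `target_ids` is a list of token ids, e.g. the two tokens making up "girl".
    """
    for p in model.parameters():
        p.requires_grad_(False)                                       # the language model stays frozen

    d_model = embedding_matrix.shape[1]
    prompt = torch.randn(1, n_prompt, d_model, requires_grad=True)    # start from pure noise
    optimizer = torch.optim.Adam([prompt], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        # Append the target token embeddings so we can score them teacher-forcing style
        target_emb = embedding_matrix[target_ids].unsqueeze(0)
        inputs = torch.cat([prompt, target_emb], dim=1)
        logits = model(inputs_embeds=inputs).logits                   # [1, seq, vocab]

        # Negative log-probability of each target token at the position that predicts it
        loss = 0.0
        for i, tok in enumerate(target_ids):
            log_probs = F.log_softmax(logits[0, n_prompt - 1 + i], dim=-1)
            loss = loss - log_probs[tok]

        # Regularizer: distance of each optimized embedding to its nearest vocabulary embedding
        dists = torch.cdist(prompt[0], embedding_matrix)              # [n_prompt, vocab_size]
        loss = loss + reg * dists.min(dim=-1).values.sum()

        loss.backward()
        optimizer.step()

    # Project each optimized embedding onto the closest token in the vocabulary
    return torch.cdist(prompt[0], embedding_matrix).argmin(dim=-1)
```

Restarting this optimization from many random initializations, as in the experiment above, is what produces the frequency map of prompts for each target word. So thank you guys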
for watching my video. I know it has been very demanding, especially for the last part, 00:59:54.180 |
but I will share a notebook in the description of the video that you can use to generate 01:00:01.140 |
the map that we saw before, so you can play with it by yourself. If you like this video, 01:00:07.220 |
please share it with your friends, with your colleagues, and I hope you come back to my channel.