
ML Interpretability: feature visualization, adversarial examples, interpretability for language models


Transcript

Hello guys, welcome back to my channel. Today we are going to talk about machine learning interpretability. Let's review today's topics. I will start by introducing what machine learning interpretability is. Then we will review deep learning and backpropagation, because we need them to understand the rest of the video.

Then we will see a nice little trick, so how to trick a classifier. So imagine that you have a classification neural network, like for example a convolutional neural network that can classify pictures into classes. For example it will tell you that the picture of a dog is a dog, the picture of a person is a person, etc.

Our goal is, without touching anything of this model, so without touching its weights, its parameters, its structure or anything related to the model, to trick any classifier into believing that, for example, the picture of a dog is actually a person, or the picture of a person is actually a dog.

And we will see how we can trick any classifier of our choice. Later I will be introducing an interpretability engine, so a library that allows us to make vision models more interpretable. We will explore the topic of feature visualization which is very important for interpretability. And finally we will apply the techniques that we have learned to language models, so how to make language models more interpretable.

What are the prerequisites for watching this video? Well, for sure you should have a little background in calculus; knowing what derivatives are and how to calculate them is enough. And of course you should have a background in deep learning, so you know what a loss function is or what the softmax function is, for example.

So let's start our journey. What is machine learning interpretability? In 2016 there was a fatal accident involving a Tesla driver and a truck. As reported by the Guardian, the car's sensor system, against a bright spring sky, failed to distinguish a large white 18-wheel truck and trailer crossing the highway.

So basically the car failed to recognize this obstacle, the truck, and attempted to drive at full speed under it, resulting in a crash that was unfortunately fatal. Now, I don't want to say that it was Tesla's fault or the software's fault; I don't have enough information for that.

So let's consider a hypothetical case: you are creating a self-driving car company and you want to deploy your self-driving car. How do you make sure that the car can recognize any obstacle? How do you know what your model has learned?

For example imagine that you have a model that allows you to segment the obstacles on the road. The first question that you want to answer is what did my model learn? So how does my model recognize a person? Does it recognize a person by its shape or does it recognize a person by its shoes or by the color of the clothes etc?

Knowing this is important. Why? Because it allows us to understand what the failure modes of our model could be. If our model is only looking at the color of the clothes to recognize a person, so only at the clothes and not at the face for example, then if one day the model sees a person wearing strange clothes, something the model has never seen, it may fail to recognize that person as an obstacle.

So this is very important. The second question that we want to answer is: what features or patterns from the input make the model generate certain outputs? This is very important, for example, for language models. Imagine our language model is cursing, and we want to understand which tokens in the input are being used by the model to generate that kind of output.

Knowing how a model "thinks" (and pardon me for that word, it's not quite right, so let's say knowing how a model makes its predictions) allows us to debug and fine-tune the model, which means that during training we can understand why our model is not learning something we want it to learn, or how changing our hyperparameters will affect the learning of our model.

We can identify failure modes before deployment, which means that we can understand what is most likely to make our model fail when deployed in production. It can increase trust, because we can demonstrate that our model is well trained, and so people will trust it, which is especially important in scenarios like self-driving cars.

And also we can discover novel insights from the data, because sometimes models learn patterns that we as humans did not see, and this is very important, for example in image models. Take the healthcare sector: imagine that you are training a model to distinguish cancer cells from non-cancer cells, the model is performing well, and we realize that it is looking at some parts of the cell that we as humans didn't think of checking before, and which are actually good predictors of whether the cell is cancerous. So the model can even teach us something that we did not know before.

When we define a linear layer in PyTorch, it gets converted into a computation graph. For example, in this case we have an input made up of two features, which means the input is a vector made up of X1 and X2, and we have two linear layers. The first one converts two features into two features, so it takes two features as input and produces two features as output. We can see it as a layer made up of two neurons, each neuron doing a weighted sum of the input features, each input feature multiplied by its own weight, so X1 is multiplied in this case by W11 and X2 by W12; then we take the sum, add a bias, and apply a non-linear activation, usually the ReLU function.

Then we have another linear layer that goes from two features to one feature, which produces our output. In this case we are modeling a very simple neural network that takes as input two features representing features of a house, for example the number of bedrooms and the number of bathrooms, and wants to predict a price for this house, so only one output.
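To make this concrete, here is a minimal sketch of such a network in PyTorch. The class name and the example input are just for illustration, and I'm assuming ReLU as the activation, as described above.

```python
import torch
import torch.nn as nn

# A minimal sketch of the network described above: 2 input features
# (e.g. bedrooms, bathrooms), a hidden linear layer with 2 neurons
# followed by ReLU, and a final linear layer producing 1 output (the price).
class HousePriceNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(2, 2)   # weights W11, W12, W21, W22 and their biases
        self.layer2 = nn.Linear(2, 1)   # weights W31, W32 and its bias
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.layer1(x))   # weighted sum + bias, then non-linearity
        return self.layer2(x)           # final weighted sum + bias -> predicted price

model = HousePriceNet()
print(model(torch.tensor([[2.0, 4.0]])))  # forward pass on an example input X1=2, X2=4
```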

This is a very simple regression task, and we can train it with training data consisting of two input features and one target. PyTorch will convert this neural network into a computation graph. What does that mean? It means that each node becomes an operation that is performed on the input, one after another, to arrive at the final output, which is the price of the house.

In this case this is actually a simplified version of the computation graph. The computation graph is usually made up of more nodes than the ones you see here, because each single operation is a node. In this case we can see that X1 and X2 are multiplied by some weights in the first neuron, then we add a bias, we apply the ReLU, and this becomes the input for the neuron at the next layer.

So we take the two outputs of the previous layer, we multiply them by some weights, as you can see here W31 and W32, we add a bias, and this becomes our output. How do we train such a network? Well, usually we have a dataset made up of input and output pairs, or input and label pairs.

The input represents the features of a house and the output the corresponding price, the price of this house. And our goal is to train the neural network to minimize a certain loss function that we can choose. For this regression task an ideal loss function could be the mean squared error, because we want to minimize the error that the model makes on the final price.

Our hope is that the neural network not only learns the data that it has seen during training, so not only it can predict correctly the price of the houses that it has seen during training, but it also learns some kind of pattern that can generalize to unseen inputs. So how do we proceed practically?

We choose a loss function, for example the mean squared error in this case. We run an input through the neural network. So we take the input and run it through this neural network, which is a feedforward neural network, which means that each output becomes the input of the next layer.

So we run our input, and it produces an output here. We have a loss function that compares the output of the network with the label that we have assigned to this input-label pair, and it computes a loss. Then we calculate the gradient of the loss function with respect to the weights of the network, and the weights of this network are also its parameters.

In this case they are W_11, W_12, the bias B_1, W_21, W_22, B_2, so the weights and the biases of the first linear layer, plus the weights and the bias of the second linear layer. We calculate this gradient because the gradient indicates a direction.

If you remember from high school, the gradient is basically a derivative. So imagine we do it for a single variable: we want to calculate the derivative of the loss function with respect to W_11. We write it here: this is the loss function and this is W_11.

Imagine that the loss function looks something like this. We have a kind of local minimum here and then a global minimum here. Imagine that we are currently here, so our W_11 is initially here. When we calculate the derivative of the loss function with respect to W_11, we get the slope of the tangent line at this point, which is this one here.

And this indicates the direction in which the function is growing. So the function is growing in this direction. We usually update our weights to move to the opposite direction of the gradient. So we update our weights to move right so that the loss will diminish. So for example, we take a little step in this direction so that the loss, as you can see, will decrease because this will be the new loss 2.

We started from loss 1 here. And this is why we do backpropagation: we calculate this gradient, the gradient of the loss function with respect to the parameters of the model, and then we update the model's parameters to move against the direction of the gradient. The first thing that we do during our training is the forward pass, which means that we have an input.

We run it through our computation graph to calculate an output. So let's do it here. We have an input that is X_1=2 and X_2=4. We multiply, for example here in this node, each input by the weights of this network, and the weights are initialized as follows: W_11=0.24, W_12=0.29, and the bias of the first neuron B_1=-0.70.

This will result in some activations being calculated; the values of each node are called activations. We use the previous activation to calculate the next one, and so on, until we arrive at the output of this neural network. We have a target because we are training, and so we can calculate a loss.

What do we do with this loss? We run backpropagation, which means that we calculate the gradient of the loss function with respect to each of these weights. For example, to calculate the gradient of the loss function with respect to W_11, which is this parameter here, we can use the chain rule, starting from the derivative of the loss function with respect to A_6, because we need to look at the nodes that connect this parameter to the loss.

So the nodes that connect this parameter W_11 to the loss function are this node here, this node here, this one, and this one. With the chain rule we simply go backwards from the loss to the parameter: the derivative of the loss with respect to the previous node, then the derivative of that node with respect to the node before it, and so on, until the derivative of the last node with respect to W_11, because that node contains the expression with W_11.

This results in a series of numbers that, when multiplied together, give us the derivative of the loss function with respect to W_11, evaluated at the input point we have chosen.

This derivative is a number that we call the gradient, and we use it to update the value of our parameter. So the new value of the parameter W_11 is equal to the old value of this parameter minus alpha, which is our learning rate, multiplied by the value of this gradient.

Now, why do we have a minus sign here? Because, as we saw before, the gradient indicates the direction in which the loss function grows with respect to the parameter, and we don't want the loss to grow, we want it to decrease, so we move in the opposite direction of the gradient. That's why we have the minus sign.
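Putting the whole procedure together, here is a minimal sketch of such a training loop, assuming mean squared error and plain stochastic gradient descent; the dataset values below are made up purely for illustration.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(2, 2), nn.ReLU(), nn.Linear(2, 1))
loss_fn = nn.MSELoss()                                     # mean squared error for the regression task
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # alpha = 0.01

# Made-up training pairs: (bedrooms, bathrooms) -> price (illustration only)
inputs = torch.tensor([[2.0, 1.0], [3.0, 2.0], [4.0, 3.0]])
targets = torch.tensor([[150.0], [220.0], [310.0]])

for epoch in range(100):
    optimizer.zero_grad()
    predictions = model(inputs)          # forward pass through the computation graph
    loss = loss_fn(predictions, targets) # compare output with the labels
    loss.backward()                      # backpropagation: d(loss)/d(weights) via the chain rule
    optimizer.step()                     # W_new = W_old - alpha * gradient
```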

Now, let's trick a classifier. I introduced gradient descent because we need it to understand how this trick works. First of all, what is a classifier? A classifier is a neural network that can classify the input into one of the classes that we have defined.

For example, in this case, we may have a classifier that can take an input picture, and then can classify it as a fish, or as a dog, as a volcano, as a car, or a pencil. For example, the ResNet network can classify the input picture into one of the thousand classes it has in its output logits.

The output of a neural network of this kind is called logits, because it indicates the score that the network assigns to each of these classes. We don't usually work directly with the logits; we apply a softmax. A softmax is a function that turns the logits into probability scores, because they will sum up to one.

And then we take the class with the highest value of the softmax as the prediction of the model. So if after applying the softmax we see that our network assigns 95% to this node here, it means that the network is telling us that this is a fish. And that also applies to the other classes, of course.
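Just to make this last step concrete, here is a tiny sketch with made-up logits for the five example classes:

```python
import torch

# Made-up logits for the five example classes (fish, dog, volcano, car, pencil)
logits = torch.tensor([4.1, 0.3, -1.2, 0.5, -0.7])
probs = torch.softmax(logits, dim=0)    # turn scores into probabilities that sum to 1
predicted_class = torch.argmax(probs)   # index 0 -> "fish" in this made-up example
print(probs, predicted_class)
```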

So what do I mean by tricking a classifier? I mean that I give you a classifier, so a neural network like this one, and you are not allowed to change anything of the network. So you're not allowed to change the weights of this network, you're not allowed to change the architecture of this network, you're not allowed to change anything.

So the weights are frozen, the architecture and the hyperparameters, everything is frozen. When we run a picture of a fish through this network, it will probably classify it as a fish. But our goal is to give a picture of a fish as input and have the network classify it as something else of our choice, for example as a volcano, with very high probability.

So of course, if you think about it, the only place where we can work to trick this network is actually in the input. And this is what we will do. We will change the input in such a way that the network will not see a fish anymore, but it will see, for example, a volcano.

So this fish and the previous fish are the same for a human because you can see a fish and I can see a fish. But what we did was to add a little bit of noise in this picture so that when the network sees this picture, so this one here, it will not see a fish anymore, it will see a volcano.

How is that even possible? Let's see. Our goal is to take a picture as input and change it in such a way that the network sees something else. How can we proceed? Well, let's recall what we usually do when we train a network like this.

We have a series of pictures of fish, of trees, of people, etc., and the corresponding labels. So for example, we have a thousand pictures of dogs with the label saying that they are dogs, and then a thousand pictures of cats with the corresponding label cat, etc.

So what we do is we feed the input picture to the neural network. The neural network will calculate some output, which is here, to which we apply the Softmax. Then we have the corresponding label because we know what is this picture. This is coming from our training data. So we know that is a fish, for example, in this case.

We can calculate the loss. Then we can run backpropagation, which means that we calculate the gradient of the loss function with respect to the weights of this network, so the parameters of this network. And then we update the parameters to reduce this loss. And this is how we train this network.

So let's try to see how we can trick the network into believing that, for example, this fish here is actually a volcano. Now, when we do the training, as we saw before, we calculate the gradient of the loss function with respect to the parameters, but we can actually also calculate the gradient of the loss function with respect to the input picture.

So what we can do is as follows. We can create a new loss function. So imagine we have a picture of a fish. We know it's a fish, but we want to trick the model into believing it's a volcano. We can create a loss function with respect to the target that we want the network to have.

So we want the network to believe it's a volcano, so we can create a new loss function with respect to the target volcano. And then we run this picture in the network and we calculate the gradient of the loss function with respect to the input. And later we will see in the code how to do that.

But let's try to analyze what it means to calculate the gradient of the loss function with respect to the input. Because it's a gradient, it indicates the direction in which we should change the input to make the loss grow.

So we can run backpropagation and optimization on the input to decrease this loss. So because the gradient tells us how we should change the input image to make the loss grow, we can also change the input image in the opposite direction to make the loss decrease. So that's what we will do.

We calculate the gradient of the loss function with respect to the input image, which will indicate a direction. We update the image with some noise in the opposite direction, so we add a little bit of noise in the opposite direction indicated by this gradient, and we keep updating it until the network predicts correctly the output as Volcano.

So in the code it is done as follows. Imagine we have a model, we have an input image, and we have a target class, for example, Volcano. What we can do is we take our input image and we create a tensor of it by asking PyTorch to also calculate the gradient with respect to this tensor, because by default PyTorch will only calculate the gradient with respect to the weights.

So the gradient of the loss function with respect to the weights. But we also want the gradient with respect to this input image. Then we run for a few steps the following. We calculate the output of the model, so we are calculating this output here. We are creating a special loss function with respect to this target that we want the network to have.

So we want the network to output Volcano, so we create a loss function with respect to this target class. We run backward, which means that we calculate the gradient of the loss function with respect to the input. And then we update the image, so the image is updated just like the update formula for the parameters, so it's equal to the old image minus some learning rate, here I call it alpha, multiplied by the direction of the gradient of the loss function with respect to the input.
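Here is a minimal sketch of that loop in PyTorch. It assumes a frozen, torchvision-style image classifier; the target class index, step count and learning rate are placeholders chosen for illustration, not the exact values used in the video.

```python
import torch
import torch.nn.functional as F

def trick_classifier(model, image, target_class, steps=100, alpha=0.01):
    """Optimize the input image (not the weights) so that the frozen model
    predicts `target_class`. `image` is a (1, 3, H, W) tensor."""
    model.eval()
    for p in model.parameters():
        p.requires_grad_(False)                    # the weights stay frozen

    image = image.clone().detach().requires_grad_(True)   # we need d(loss)/d(input)
    target = torch.tensor([target_class])                  # e.g. the index of "volcano"

    for _ in range(steps):
        logits = model(image)
        loss = F.cross_entropy(logits, target)     # loss w.r.t. the class we want to force
        loss.backward()                            # gradient of the loss w.r.t. the input image
        with torch.no_grad():
            image -= alpha * image.grad            # move against the gradient to decrease the loss
        image.grad.zero_()
    return image.detach()
```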

We move against the direction of the gradient because we want the loss to decrease. If we keep updating the image in this way, eventually the network will predict it as Volcano. And this is how we can trick a classifier. I made this example because I wanted to show you that models may look at patterns that are completely different from the ones we humans use.

For example, in this case, the model is predicting this picture as a volcano. So the model is somehow seeing a volcano here, even though we humans will never be able to see a volcano in this picture: it's a fish. So understanding how our model makes its predictions can help us improve our models.

And thanks to the sponsor of today's video, LeapLabs, we can get insights into how our models make their predictions. LeapLabs is a research lab that focuses on machine learning interpretability. They have developed a library, the LeapLabs interpretability engine, that allows us to understand what our model has learned and how we can get insights from our model to improve it.

For example, this library allows us to generate prototypes. What are prototypes? Well, imagine that you have a classification model, a computer vision classification model, which means that it takes a picture as input and classifies it as one of the classes. In this case, it's a food classifier that will classify an input picture as one of the following classes, for example ice cream, hamburger, pancakes or waffles.

In this case, it looks like the model is well trained, because by generating prototypes we can see the kind of input that the model wants to see to classify it as a target class. So this is the kind of picture that the model wants to see to classify the input picture as a hamburger.

This is the kind of picture that the model wants to see to classify the input picture as a pancake. And it actually resembles a pancake. And this one actually resembles a hamburger, which means that the model has learned the correct features from the food to classify it as a given class.

But we will see later a case in which this is not true. Another feature of the Leap Labs interpretability engine is entanglement. Entanglement allows us to understand how different classes share features. For example, for a food classifier like the one we saw before, we expect high entanglement between the ice cream class and frozen yogurt class.

Because at least for me as a human, they look very similar. They both look like ice cream. So it is expected that these two classes share features. But imagine that you have a more broad classifier like the one we saw before that can classify fish and volcano, etc. I would not expect, for example, cheesecake and dog to have high entanglement.

Because at least for me, they shouldn't share features. I mean, they are totally different objects. So if they do have high entanglement in the model, it means that the model is looking at the wrong features. And it also may indicate a higher chance of misclassification between these two classes.

Another feature that is very important is feature isolation. Feature isolation allows us, for example, to understand which parts of the input are being used by the model to make a certain prediction. For example, for a food classifier, imagine we have the following picture. The food classifier will classify it as frozen yogurt with 98% probability.

But by generating feature isolation, we can understand which part of the input is being used to classify the input as a frozen yogurt. And it's actually the part that looks like frozen yogurt. But also because there is entanglement between frozen yogurt and ice cream, the model, as you can see, is using the similar features to also classify it as ice cream with low probability because the model is well trained.

But still, they have some shared features, as you can see. And there is something that you may not have noticed: the waffles. With very low probability, the model may also classify it as a waffle. Why? Because the model is seeing some features, which are the berries that are on this frozen yogurt, that push it to classify the picture as a waffle.

This can happen because in the original training data, the pictures of waffles probably had berries on top. So the model learned to look at the berries to recognize a waffle. The LeapLabs interpretability engine can detect this and will show it to you, which helps you understand what your model has learned.

Now let's look at a case where things go wrong in our model and how the LeapLabs interpretability engine can help us improve it. If you look at the tutorial link that I have shared in the description, and go to the tutorials on the LeapLabs website, you will see a tank detection case study.

And if you open it, it will open a Colab notebook. Now let's run it, actually. So let me change the runtime type: we choose the T4 GPU, and we can run it. It will do some imports. Now, what is the tank detection case study? Well, we are talking about a classification model that can detect tanks or no tanks.

So it has only two classes that indicate if the picture contains a tank or it does not contain a tank. Suppose that this is a model that is very important for us, and we want to deploy it in the battlefield because it can help protect our soldiers. But before deploying it, of course, we want to understand what our model has learned.

So by understanding what our model has learned, we can predict failure modes. If we run, for example, a picture of a tank through our classification model, we will see that it classifies it as containing a tank with very high probability. In this case, the model is predicting that there is a tank in this picture with 98% probability.

So it looks like the model is performing very well. But let's try to use the LeapLabs interpretability engine to understand what our model has learned. So we install the library. Then we can use the library to generate prototypes. As we saw before, a prototype tells us what kind of input the model wants to see to give a certain output.

In this case, we need the LeapLabs API key, which we can generate from the LeapLabs website. So we go to the dashboard, we go to settings, and it will generate a key here. We put our key in the API key field and we can generate a prototype. On my computer, it takes around 25 seconds, I think, or one minute to generate it.

Okay, the library has generated two prototypes, one for each class: what kind of input the model wants to see to tell us that there is a tank, and what kind of input the model wants to see to tell us that there is no tank.

Let's look at the picture for the tank class, the one the model associates with the presence of a tank. If we look at this picture, you see that actually there is no tank. It means that the model is looking at some gray stuff, which probably looks like a cloud.

But there is no tank here. I mean, I expected to see a cannon, I expected to see some wheels or maybe the gun or a soldier with a gun on top of the tank or something like this, but actually there is none of these features. So is our model looking at the correct features to actually predict a tank, the presence of a tank?

And let's look at the other class, no tank. As we can see, we have these green lines here, which probably indicates grass. So probably the model is looking at the grass to indicate that there is no tank. So if it sees an open field with only grass, it will say that there is no tank, which could make sense.

But the problem is, why is our model not looking at the tank to tell us that there is a tank? Let's try to make a prediction before looking further at what the LeapLabs interpretability engine can tell us. What could have happened in this case: imagine that in our training data we have a lot of pictures of tanks, and all the pictures that contain tanks happen to have a cloudy sky.

So what our model may have learned is that if there is a cloudy sky, then there is a tank, not that if there is a tank, there is a tank. So let's validate our hypothesis. We can use a feature isolation to understand what kind of features from an input picture the model is looking at to make a certain prediction.

So in this case, for example, we can feed the picture that we saw before. So this picture as input to see what kind of features the model is looking at to predict a tank. Let's see. As you can see, the model is using the entire picture to actually predict a tank.

But as you can see, the white areas indicate that a feature is not being used, and the other areas indicate that the feature is being used. The tank here is white, which means that the model is not using the tank to predict the tank; it's using the sky and maybe the ground to predict that there is a tank.

So as suspected, the model doesn't seem to use the actual tank for classification much at all, right? It's using the sky, the background, and maybe the saturation of the picture. So how can we fix this model? Well, one way to fix it is to further train the model by using more diverse images of tanks that have maybe some sunny sky, maybe some cloudy sky, maybe some snowing environment with snow and some environment maybe in the forest, etc, etc.

So that the model cannot find any other correlation between pictures of tanks except for the tank itself. So that the model will be forced to learn the presence of the tank itself as a predictor for tanks. We can run this training and it will for sure improve our model.

And there is code here showing how to train it again. After training, we can run feature isolation again. And we can see here at the end that after retraining the model on more diverse pictures, the model is actually putting all its attention on the tank itself to predict the tank, and not on the surrounding area.

So all of this, thanks to the Leap Labs interpretability engine. Now let's talk about feature visualization. So what we saw before with Leap Labs interpretability engine is that we can get insights into how our model is making its prediction or what kind of feature our model has learned. And in particular, especially for convolutional neural networks for computer vision, we have, of course, a subsequent application of layers of convolutions.

And our goal with feature visualization is to understand what each of these layers or what each of the neurons that are making up these layers, what kind of features from the input did they learn that contribute to the final prediction. So we want to understand, for example, imagine that you have a food classifier and you have many layers in your convolutional neural network.

Each layer will be looking at a particular kind of feature in the input that will contribute to the final output for the final classification. Some layers may look at, for example, lines. Some layers may be looking at edges. Some layers may be looking at certain patterns, etc. So we want to understand what features each of these layers or each of these neurons have learned.

And we can do feature visualization at many levels. We can do it at the neuron level: what features is this neuron looking at? We can do it at the layer level: what kind of features is this particular layer looking at? And also at the logit level; in this case, we have a classification network.

So we want to understand what kind of features the model wants to see in order to predict that particular class. We will model the feature visualization problem as an optimization problem, and this is actually how it's done in practice. It's also, more or less, how the LeapLabs interpretability engine works.

Of course, it's much more sophisticated. So this is a simplified explanation. But I wanted you to understand how such an engine works so that when you use it, you also know what's happening inside. So what we do, imagine that you have a classification network, a convolutional network that is used for classification.

So as you can see here at the end, we have a Softmax and we have subsequent layers of convolutions. We want to understand what this layer of convolution has learned. So in order to understand what kind of features this layer has learned, we will treat it as an optimization problem, which means that we will create an input that is a complete noise.

We run it through our network. We take the activations of this layer, so all the outputs of this layer, and use them as an objective function (you can also call it a loss function, it's the same thing). We take this output as the objective and then we optimize the input to maximize it.

So that's why I'm calling it an objective. Whenever you are maximizing something, we call it an objective function; whenever you are minimizing something, we call it a loss function. But it's the same thing. The only difference is that in one case you are doing gradient ascent and in the other case you are doing gradient descent.

In this case, we want to maximize the activations of this layer. So we treat the activations of this layer as the objective function and we run backpropagation to maximize them. This modifies the input in such a way that it maximizes these activations, and it gives us insight into what kind of features this layer wants to see to contribute to the final prediction.
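As a simplified sketch of this idea (not the LeapLabs implementation), we can capture a layer's activations with a forward hook and run gradient ascent on a noise input. The choice of ResNet-18 and of layer3 here are arbitrary placeholders, and preprocessing/normalization is omitted.

```python
import torch
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
for p in model.parameters():
    p.requires_grad_(False)                    # we only optimize the input, not the weights

activations = {}
def hook(module, inputs, output):
    activations["value"] = output              # capture the layer's output on each forward pass

layer = model.layer3                           # placeholder: any intermediate layer works
handle = layer.register_forward_hook(hook)

image = torch.rand(1, 3, 224, 224, requires_grad=True)   # start from pure noise
optimizer = torch.optim.Adam([image], lr=0.05)

for _ in range(200):
    optimizer.zero_grad()
    model(image)
    objective = activations["value"].mean()    # mean activation of the layer we inspect
    (-objective).backward()                    # gradient ascent = minimize the negative objective
    optimizer.step()

handle.remove()
```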

We can also optimize for logits. For example, instead of using a particular layer, we can use the logit of a particular class, for example the class associated with dogs, because we want to see what kind of input our model wants to see to predict the dog class.

We feed as input complete noise. We use the logit corresponding to the dog class as an objective and we run backpropagation to optimize this input to maximize this logit. This is actually how you can generate kind of a prototype for the class dog. Of course, you may be wondering, is it that simple?

Well, not really, because if you do this procedure, so if you start from complete noise and you try to maximize a certain logit, it will for sure give you insights into what the model has learned, so what kind of input the model wants to see to have that logit as output, so that class as output, but it will not look very natural.

So for example this image here, I believe it's taken from ResNet, in which we can see that for example if we optimize for the class Flamingo, we see that the input needs to have something like this long necks here, which are typical of Flamingo, which means that the model will actually look at these long necks of Flamingos to actually predict the Flamingo class.

If we look at for example Goldfish, we can see that we have these eyes of this Goldfish here, and for example this one looks like the shape of a fish, so the model will actually look at the fish to predict the Goldfish. And if we look at for example Tarantula, we will see these long black legs here, like this one, like this one, which means that the model actually will look at the legs of the Tarantula to predict it as Tarantula.

But of course you can see that these pictures don't look really natural, whereas if you look at the ones from the LeapLabs interpretability engine, they look quite natural. For example, if we go back and look at the prototype generated for pancakes, it actually looks like a pancake, and if we look at the hamburger one, it actually looks like a hamburger.

So how can we make our inputs look more natural? Well, for one, you could use the LeapLabs interpretability engine, which does it out of the box; but to understand how LeapLabs does it, we need what is known as regularization. So first of all, what is regularization?

When we train a model, our goal is to run some input through this model, calculate an output, compare it with the target so that we can calculate the loss and then update the parameters of the model such that we reduce this loss. When we introduce regularization, we want this optimization to happen in a particular way, so we want to put some constraints in our optimization process.

For example, when we train a model, we can do what is known as L1 regularization. With L1 regularization, what we do basically is we have our loss function, which is our, let's say, cross-entropy loss, because we are doing, for example, classification tasks. Then we can add some regularizer, which is a constraint that we add to our loss function to make this optimization process happen in particular ways.

For example, with L1 regularization, we want our model to use as few input features as possible. So as a regularizer we add the L1 term, which is basically just the sum of the absolute values of all the weights. What happens in this case? Because we always calculate the gradient of the loss function with respect to the weights of the model, the presence of this absolute-value term on the weights will push many of them towards zero.

And because many weights become zero, the model uses fewer features from the input. This makes the model more sparse, which also helps us reduce its size. So regularizers are constraints that we add to the loss function to make the optimization process happen in a particular way.
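A minimal sketch of what adding an L1 penalty to a classification loss can look like; the model, the data and the coefficient lam are made up for illustration (and in practice you would often penalize only the weights, not the biases).

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 3)                 # stand-in classifier, made up for illustration
loss_fn = nn.CrossEntropyLoss()
lam = 1e-4                               # strength of the L1 penalty (made-up value)

inputs = torch.randn(8, 10)
labels = torch.randint(0, 3, (8,))

logits = model(inputs)
l1_penalty = sum(p.abs().sum() for p in model.parameters())   # sum of |w| over the parameters
loss = loss_fn(logits, labels) + lam * l1_penalty             # regularized loss
loss.backward()
```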

And this is what we can also do in our optimization problem. So what are we optimizing? We are starting from pure noise. For example, this is our pure noise, and we want to transform it into some kind of input that maximizes a particular output logit in our classification network. Of course, when we train a neural network, the dataset that the network was trained on occupies some region of the input space, let's say here, but this does not mean that the model will not activate the output logit corresponding to, for example, the class dog for something that is out of distribution.

So what we want is for our optimization problem to optimize our input noise in such a way that we remain close to the distribution of the data that the network has seen, so close to the natural inputs that the network has seen.

How to do that? Well, first of all, look at my picture. Do you think it's a noisy picture? No, because if you look at my t-shirt, you can see that adjacent pixels are similar; there is not much variance between neighboring pixels. So we could ask our optimization problem to optimize the input in such a way that it penalizes high variance between neighboring pixels.

And this is known as frequency penalization. So we take our loss function, which is basically just the logit that we want to maximize, and we add a penalty to it every time we see very high variance between neighboring pixels. Another regularizer that we can use is transformation robustness.

This is not applied to the loss, actually. It basically means that we take our input, the one we are optimizing, and transform it in some way: we can rotate it, scale it, or translate it. In this case, this code that I took from the Lucid library, a very famous library for feature visualization, applies random scaling and random rotation, which means the input is randomly rotated and scaled and then passed through the network.

And because it's an optimization problem, the optimization will have to modify the input in such a way that even when it's translated, rotated, or scaled, it still activates that output. So it will only affect the pixels, the input features, that are actually needed to activate that logit.

Which means, in other words, that if we are trying to maximize the logit corresponding to the class dog, it will actually try to create a dog, because it does not matter whether the dog is rotated, scaled, or translated, whether it's here or it's here.

So it will try to create a dog that is as natural as possible. Of course, there are many more regularizers that we can add to make this optimization problem more robust, so that we don't generate out-of-distribution data but instead generate data that is as in-distribution as possible.
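Here is a simplified sketch of these two regularizers: a penalty on the variance of neighboring pixels, plus random rotation and scaling applied before each forward pass. This is not the Lucid or LeapLabs code, just the idea; the transform ranges and the tv_weight coefficient are made-up values.

```python
import torch
import torchvision.transforms as T

def neighbor_variance_penalty(img):
    """Penalize large differences between neighboring pixels (frequency penalization).
    `img` has shape (1, 3, H, W)."""
    dh = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean()
    dw = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean()
    return dh + dw

# Transformation robustness: random rotation and scaling before each forward pass
random_jitter = T.RandomAffine(degrees=10, scale=(0.9, 1.1))

def robust_objective(model, image, class_idx, tv_weight=0.1):
    """Objective to maximize: the target class logit of a randomly transformed image,
    minus a penalty for high variance between neighboring pixels."""
    transformed = random_jitter(image)
    logits = model(transformed)
    return logits[0, class_idx] - tv_weight * neighbor_variance_penalty(image)

# In the optimization loop you would run gradient ascent on this objective,
# updating the image just like in the earlier feature visualization sketch.
```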

And this is also how LeapLabs works. So the LeapLabs interpretability engine can generate prototypes that look natural. And the way they do it is described in this paper called "Prototype Generation - Robust Feature Visualization for Data-Independent Interpretability" in which they describe the process of generating these prototypes. And the way they do it is basically they apply all these regularization techniques.

So for example, you can see here, random transformation, so that the optimization process produces an input that is as natural as possible without ever actually seeing an input. So as you remember, when we do a prototype generation with the LeapLabs interpretability engine, we never feed an input picture. We just give the model and the algorithm will generate a prototype without ever seeing what a natural picture looks like.

But it can actually generate very natural inputs. Why? Because they make this optimization process very robust: they penalize, for example, high frequencies, so high variance between neighboring pixels, and they also apply transformations, etc., so that the resulting input is as close as possible to the natural inputs the model was trained on.

Now let's try to use the knowledge that we have acquired and apply it to language models. So as we saw before, with computer vision models, we can do prototype generation, which is based on feature visualization, which means that given a particular, for example, output logit, we want to understand what kind of input the models want to see to have that particular logit as output.

Can we apply the same techniques also to language models? So given a desired output, what kind of prompt the model wants to see to generate that output? Well, let's try to answer that question. First, let's review how language models work. So a language model, as you know, is a probabilistic model that assigns probabilities to sequence of tokens.

For example, imagine that the input to the language model is "Shanghai is a city in". The model will tell us the probability of the next token being China, or Beijing, or cat, or pizza, or whatever token is present in our vocabulary.

One simplification I always make in my videos is to associate a token with a word and a word with a token. But this is not usually the case: a word may not be a single token, and a token may not be a whole word. Actually, in many cases a word is made up of multiple tokens.

But for our case, we will simplify it and see that every token is a word and every word is a token. So the language model just tell us what is the probability of the next token given an input prompt. Imagine we want to understand what our model thinks of the word girl.

So what kind of input, what kind of prompt the model wants to see as input to predict the word girl as next token? Well, let's try to use the techniques that we have seen before. So first, let's see the results of such an analysis. And in particular, Jessica, who is the founder of LeapLabs, she did this study.

So she took some tokens, for example the word girl, and then she optimized the input prompt in such a way that the probability of the output girl is maximized, so that the next predicted token is girl given this input. And she did the same for the word woman, for the word good, and for the word doctor.

This gives us insight into what our model has learned because our model, our language model is just a model that models the statistical distributions of tokens based on the training data it has seen. So in this case, for example, the prompt that maximizes the word girl as being the next token is this input here.

And as you can see, it tells us that our model has seen a lot of bad data that gives it a bias against girls, for example, because we see sexual words and other words that are not quite polite. The same happens for the word woman; in that case it's a little better, but it still tells you what the biases of your model are with respect to this particular concept.

And we can see it also for the word good, for example, where we see tokens like shooting, Jesus, beautiful, basketball, et cetera. So optimizing the prompt to generate a particular output tells us what our model wants to see as input to generate that output, which gives us insights into the distribution that our model has learned.

Jessica also ran another experiment. When we optimize the prompt (we will see later how it's actually done in practice), we start from complete noise and optimize this prompt to become tokens that are more likely to produce a particular output. And of course, you can restart this optimization from multiple starting points, because you start from complete noise and there are many possible starting points.

So she ran it many times and collected the input tokens that were most likely to make the model predict the word girl as the next token. This gives us a map, with each token and its frequency, of what kind of inputs the model wants to see to predict girl as the next token, or boy, or science, or art as the next token.

And this also gives us insight into the statistical distribution that our model has learned. For example, to get the word girl as output, the model wants to see some sexual words and some curse words, but also, for example, the word dresses or the word boys.

And in the case of the word boy, we can see that it wants to see rebellious, monkey, girl, et cetera, et cetera, but this gives us insight into what our model has seen during its training. So now let's try to analyze how to actually generate this kind of map and how this optimization problem works.

What we did before with the computer vision models, that is, finding an input that maximizes a given output logit, is exactly what we want to do here, except that here we have a language model and an extra complexity. So let's go step by step through how we can generate this kind of map.

Imagine that we want to find input embeddings that maximize the probability of the next token being girl. Now, the first complexity is that girl may not be one token, but it could be multiple tokens. So let's suppose that it's actually multiple tokens because this is a real scenario. So we have the output that we want to optimize an input for, and suppose that we want to optimize three input embeddings.

So let's draw three input embeddings that we want to optimize to maximize the probability of the next token being girl, but we know that girl may not be a single token. So let's suppose that it's actually two tokens: one token is GI and the other is RL. Now, the job we did before, that is, calculating the gradient of the loss with respect to the input, is something we cannot do directly anymore.

Why? Because the input to a language model is tokens, and tokens are numbers that represent the position of each token in the vocabulary. So for example, the input could be zero, five, and nine, and these are positions that represent each token in the vocabulary.

And we cannot optimize something that is discrete, because there is no token 0.5 and no token 3.2; we cannot nudge these tokens a little bit hoping they move towards something that will generate that kind of output. The only thing we can optimize are embeddings.

So we will not be optimizing input tokens, we will be optimizing input embeddings. So let me delete this part: we suppose that we have three input embeddings. Now, which three input embeddings should we choose? Well, in the case of the computer vision model, we started from pure noise.

In this case, we can also start from pure noise: three random embeddings, one, two, and three. We can run these three embeddings through our language model. As you know, the language model is a transformer model in most cases, and it's a sequence-to-sequence model, so if the input is three embeddings, it will generate three embeddings as output.

So here, we will have three embeddings. Our goal is to make sure to select three embeddings that make the likelihood of the next token being GI and the next next token being RL maximized. So how to proceed? We take these three embeddings, we run it through our model, it will produce three embeddings as output.

Usually when sampling from a language model, we take the last embedding, the last hidden state (the outputs of the language model at this point are called hidden states). We take the last hidden state, send it to the linear layer, and it generates what are known as logits.

Logits indicate the score (not actually a probability) that the model assigns to each token in the vocabulary; by applying the softmax, they become probability scores, and usually we choose the token with the maximum probability score as the next token.

So in this case, we can take the last embedding, run it through the linear layer, and it will generate logits, the logits associated with position zero, let's say. These logits are a list of numbers, one for each position in the vocabulary, and we are interested in two logits in particular.

One is the one with the highest probability score, which will be used to sample the next token, and one is the logit corresponding to the token GI. So we save two logits: one corresponding to GI and one for the next token. We use the logit with the highest score to determine the next token and feed it back to the model, so we put the embedding corresponding to this next token into the input of the model, along with the three input embeddings we saw before.

This will result in four output embeddings being generated. We take the last one, and it gives us the logits for the next position. In these logits we are only interested in the logit corresponding to the token RL. Then we take these two logits (for GI and RL), we know their probability scores because we can run the softmax, and we use them as the objective for our optimization process, because we want to maximize these probabilities.

So the probability of selecting this token and this token. We can take the negative log probabilities (once we have run the softmax they become probabilities), sum them up, and this becomes our objective. Now, if we do it like this, there is no guarantee that the inputs we are optimizing will actually be embeddings that correspond to some token.

They may not correspond to any token, because as we saw before for computer vision models, our model has some natural inputs that may be here, and maybe we are optimizing something that is here, that is out of distribution. So we need to find a way to push these embeddings to go in distribution, and one way is to find a regularizer.

So something that puts a constraint in our optimization problem to push the embeddings in certain directions, and one way to do that is whenever we feed these three embeddings that we are trying to optimize to the language model, we can calculate their distance from the closest embedding in the vocabulary, and use this as regularizer.

So we add this distance to our loss function and ask the optimization problem to also minimize it. This forces the optimization to produce embeddings that maximize the likelihood of the next token being GI and the token after that being RL, but that at the same time stay close to our vocabulary, so that they actually map to some token in our vocabulary. In other words, the optimization will not just generate any embedding that produces that activation, but embeddings that correspond to embeddings present in our vocabulary.
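Here is a rough, simplified sketch of this procedure using GPT-2 from Hugging Face as a stand-in language model. It takes a teacher-forcing shortcut (the target token embeddings are fed directly instead of being sampled back autoregressively as described above), and the number of soft prompt embeddings, the learning rate and the regularization weight are made up for illustration; the actual implementation behind the maps shown in the video will differ.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
for p in model.parameters():
    p.requires_grad_(False)                                    # the language model stays frozen

embedding_matrix = model.get_input_embeddings().weight         # (vocab_size, hidden_dim)
target_ids = tokenizer(" girl", return_tensors="pt").input_ids[0]   # may be one or more tokens

n_prompt = 3                                                   # three soft prompt embeddings
prompt_emb = torch.randn(1, n_prompt, embedding_matrix.shape[1], requires_grad=True)  # pure noise
optimizer = torch.optim.Adam([prompt_emb], lr=0.1)

for _ in range(300):
    optimizer.zero_grad()
    # Feed the soft prompt embeddings followed by the target token embeddings.
    target_emb = embedding_matrix[target_ids].unsqueeze(0)
    inputs_embeds = torch.cat([prompt_emb, target_emb], dim=1)
    logits = model(inputs_embeds=inputs_embeds).logits

    # Negative log-probability of each target token at the position that predicts it.
    loss = 0.0
    for i, tok in enumerate(target_ids):
        log_probs = torch.log_softmax(logits[0, n_prompt - 1 + i], dim=-1)
        loss = loss - log_probs[tok]

    # Regularizer: keep each soft embedding close to its nearest real vocabulary embedding.
    dists = torch.cdist(prompt_emb[0], embedding_matrix)       # (n_prompt, vocab_size)
    loss = loss + 0.1 * dists.min(dim=-1).values.mean()

    loss.backward()
    optimizer.step()

# Read off the closest real tokens to the optimized embeddings.
closest = torch.cdist(prompt_emb[0], embedding_matrix).argmin(dim=-1)
print(tokenizer.decode(closest))
```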

And this is how we can generate that kind of map. So thank you guys for watching my video. I know it has been very demanding, especially the last part, but I will share a notebook in the description of the video that you can use to generate the map that we saw before, so you can play with it yourself.

If you like this video, please share it with your friends, with your colleagues, and I hope you come back to my channel for more videos. Have a nice day!