ML Interpretability: feature visualization, adversarial examples, interpretability for language models
00:00:00.000 |
Hello guys, welcome back to my channel. Today we are going to talk about machine learning 00:00:03.920 |
interpretability. Let's review the topics of today. I will be starting by introducing 00:00:09.840 |
what is machine learning interpretability. Then we will review deep learning and back 00:00:14.320 |
propagation because they are needed for us to understand the rest of the videos and the 00:00:18.080 |
topics. Then we will see a nice little trick, so how to trick a classifier. So imagine that 00:00:23.560 |
you have a classification neural network, like for example a convolutional neural network 00:00:28.400 |
that can classify pictures into classes. For example it will tell you that the picture 00:00:32.600 |
of a dog is a dog, the picture of a person is a person, etc. Our goal is, without touching 00:00:38.560 |
anything of this model, so without touching its weights, without touching its parameters, 00:00:42.760 |
its structure or anything related to the model, we want any classifier to be tricked into 00:00:48.040 |
believing that for example the picture of a dog is actually a person or the picture 00:00:51.620 |
of a person is actually a dog. And we will see how we can trick any classifier of our 00:00:56.800 |
choice. Later I will be introducing an interpretability engine, so a library that allows us to make 00:01:03.240 |
vision models more interpretable. We will explore the topic of feature visualization 00:01:08.920 |
which is very important for interpretability. And finally we will apply the techniques that 00:01:13.480 |
we have learned to language models, so how to make language models more interpretable. 00:01:18.680 |
What are the prerequisites for watching this video? Well for sure that we have a little 00:01:22.160 |
background in calculus, I think knowing what are derivatives and how to calculate them 00:01:26.600 |
is enough. And also of course that you have a background in deep learning, so you know 00:01:32.420 |
what is a loss function or what is the softmax function for example. So let's start our journey. 00:01:39.760 |
What is machine learning interpretability? In 2016 there was a fatal accident between 00:01:46.240 |
a Tesla car driver and a truck. And as reported by the Guardian, we can see that the car sensor 00:01:54.920 |
system against a bright spring sky failed to distinguish a large white 18 wheel truck 00:02:01.280 |
and trailer crossing the highway. So basically the car was going and it failed to recognize 00:02:06.800 |
this obstacle which was the truck. And the car just attempted to drive full speed under 00:02:12.360 |
this truck. Of course resulting in the crash and it was unfortunately fatal. Now I don't 00:02:18.480 |
want to say that it was Tesla's fault or it was the software's fault, I don't have enough 00:02:22.320 |
information for that. So let's make a hypothetical case like you are creating a self-driving 00:02:27.080 |
company and you want to deploy your car, self-driving car. How do you make sure that the car can 00:02:35.620 |
recognize any obstacle? How do you know what your model has learned? 00:02:50.160 |
For example imagine that you have a model that allows you to segment the obstacles on 00:02:54.520 |
the road. The first question that you want to answer is what did my model learn? So how 00:02:59.920 |
does my model recognize a person? Does it recognize a person by its shape or does it 00:03:08.360 |
recognize a person by its shoes or by the color of the clothes etc? Knowing this is 00:03:16.320 |
important. Why? Because this allows us to understand what could be a failure mode of 00:03:21.600 |
our model. Because if our model is only looking at the color of the clothes for example to 00:03:26.760 |
recognize a person, so only looking at the clothes and not at the face for example, then 00:03:32.500 |
if one day the model sees a person that is wearing strange clothes, something that 00:03:37.800 |
the model has never seen, the model may fail to recognize that person as an obstacle. So 00:03:43.960 |
this is very important. So the second question that we want to answer is what features or 00:03:49.240 |
patterns from the input make the model generate certain outputs? And this is very important 00:03:55.200 |
for example for language models. So imagine our language model is cursing, and we want 00:03:59.640 |
to understand which tokens in the input are being used by the model to generate that kind 00:04:06.120 |
of output. Knowing how a model thinks, and pardon me for this word because it's not quite right, 00:04:12.320 |
so let's rather say knowing how a model makes its predictions, allows us to debug 00:04:18.840 |
and fine tune the model, which means that during training we can understand why our 00:04:23.360 |
model is not learning something that we want it to learn, or how should we change our hyperparameters 00:04:29.880 |
that will affect the learning of our model. We can identify failure modes before deployment, 00:04:36.240 |
which means that we can understand what are the things that are more likely to make my 00:04:42.280 |
model fail when deployed in production. It can increase trust because we can demonstrate 00:04:48.520 |
that our model is well trained and so people will trust it and this is especially for some 00:04:54.640 |
scenarios like self-driving cars. And also we can discover novel insights from the data 00:04:59.320 |
because sometimes models learn something that we as humans did not see, and this is very 00:05:03.960 |
important when the models learn patterns, for example in image models, that humans did 00:05:09.720 |
not see. This is for example in the healthcare sector, imagine that you are training a model 00:05:14.320 |
that can recognize cancer cells from non-cancer cells, and we realize that the model is performing 00:05:20.560 |
well and we realize that the model is looking at some parts of the cell that we as humans 00:05:24.840 |
didn't think of checking before, that is actually a good predictor for the cell being cancerous 00:05:29.880 |
or not cancerous, so even the model can teach us something that we did not know before. When 00:05:36.360 |
we define a linear layer in PyTorch it gets converted into a computation graph, for example 00:05:42.280 |
in this case we have an input that is made up of two features, which means that the input 00:05:46.400 |
is a vector made up of X1 and X2. For example in this case we have two linear layers, one 00:05:52.260 |
that converts two features into two features, so it's taking two features as input and it's 00:05:56.960 |
converting it into two features as output, and we can see it as a layer made up of two 00:06:03.960 |
neurons, each neuron doing a weighted sum of the input features, each 00:06:11.900 |
input feature multiplied by its own weight, so X1 is multiplied in this case by W11, and 00:06:17.520 |
X2 is multiplied by W12, and then it performs the sum, plus a bias, and then we have the 00:06:23.920 |
application of a non-linear activation, usually the ReLU function. Then we have another linear 00:06:31.280 |
layer that is going from two features to one feature that will produce our output. In this 00:06:36.520 |
case we are trying to model a very simple neural network that takes as input two features 00:06:41.700 |
that represent features of a house, for example the number of bedrooms and the number of bathrooms, 00:06:47.640 |
and wants to predict a price for this house, so only one output. This is a very simple 00:06:53.520 |
regression task and we can train it by having a training data with two input features and 00:06:58.880 |
one target. PyTorch will convert this neural network into a computation graph. What does 00:07:05.920 |
it mean? It means that each node will become an operation that is performed on the input 00:07:12.600 |
subsequently to arrive to the final output, which is the price of the house. In this case 00:07:17.880 |
this is actually a simplified version of the computation graph. The computation graph usually 00:07:21.600 |
is made up of more nodes than the one you see here, because each kind of single operation 00:07:26.680 |
is a node. In this case we can see that X1 and X2 are multiplied by some weights in the 00:07:33.520 |
first neuron and then we sum up a bias, we apply the ReLU and this becomes the input 00:07:38.400 |
for the neuron at the next layer. So we take the two outputs at the previous layer, we 00:07:44.400 |
multiply them by some weights, as you can see here W31 and W32, we add a bias and this 00:07:50.360 |
becomes our output. How do we train such a network? Well, usually we have a dataset made 00:07:58.620 |
up of inputs and output pairs, or input and label pairs. The input represents the features 00:08:05.880 |
of a house and the output the corresponding price, the price of this house. And our goal 00:08:10.960 |
is to train the neural network to minimize a certain loss function that we can choose. 00:08:16.800 |
For this regression task an ideal loss function could be the mean squared error, because we 00:08:22.160 |
want to minimize the error that the model makes on the final price. Our hope is that 00:08:27.880 |
the neural network not only learns the data that it has seen during training, so not only 00:08:32.840 |
it can predict correctly the price of the houses that it has seen during training, but 00:08:36.920 |
it also learns some kind of pattern that can generalize to unseen inputs. So how do we 00:08:43.200 |
proceed practically? We choose a loss function and we choose, for example, the mean squared 00:08:48.320 |
error in this case. We run an input through the neural network. So we take the input, 00:08:53.640 |
we run it through this neural network, which is a feedforward neural network, which means 00:08:57.440 |
that each output becomes the input of the next layer. So we run our input, it will produce 00:09:04.920 |
an output here. We have a loss function that will compare the output of the network with 00:09:10.040 |
the label that we have assigned to this input label pair. It will compute a loss. Then we 00:09:17.600 |
calculate the gradient of the loss function with respect to the weight of the network 00:09:22.560 |
and the weights of this network are also the parameters of this network. And in this case 00:09:27.560 |
are W_11, W_12, the bias B_1, then W_21, W_22, B_2, which are the weights and the biases 00:09:36.620 |
for the first linear layer, plus the weights and the bias for the second linear layer. 00:09:42.680 |
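A minimal PyTorch sketch of this setup (the layer sizes match the drawing, but the data, the price units, and all hyperparameters are hypothetical, chosen just to make the snippet runnable):

```python
import torch
import torch.nn as nn

# A tiny regression network: 2 input features -> 2 hidden neurons -> 1 output (the house price)
model = nn.Sequential(
    nn.Linear(2, 2),   # first layer: W_11, W_12, W_21, W_22 and biases B_1, B_2
    nn.ReLU(),
    nn.Linear(2, 1),   # second layer: W_31, W_32 and its bias
)

loss_fn = nn.MSELoss()                                     # mean squared error for regression
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# One hypothetical training pair: [number of bedrooms, number of bathrooms] -> price (in 100k units)
x = torch.tensor([[2.0, 4.0]])
y = torch.tensor([[3.0]])

for step in range(100):
    optimizer.zero_grad()
    y_pred = model(x)           # forward pass through the computation graph
    loss = loss_fn(y_pred, y)   # compare the output with the target label
    loss.backward()             # backpropagation: gradients of the loss w.r.t. every weight and bias
    optimizer.step()            # update the parameters against the gradient direction
```

The loss.backward() call computes exactly the gradients we are about to discuss, and optimizer.step() applies the update rule we will see next.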
We calculated this gradient because the gradient indicates kind of a direction. So if you remember 00:09:48.360 |
from high school, the gradient is basically a derivative. So imagine we do it for a single 00:09:54.280 |
variable. So imagine that we want to calculate the derivative of the loss function 00:09:59.000 |
with respect to W_11. So we write it here. This is the loss function and this is W_11. 00:10:12.000 |
Imagine that the loss function looks something like this. We have a kind of a local minimum 00:10:19.440 |
here and then we have a global minimum here. Imagine that we are currently here. Our W_11 00:10:25.640 |
initially is here. When we calculate the derivative of the loss function with respect to W_11, 00:10:31.360 |
we will get the slope of the tangent line at this point, which is this one here. 00:10:37.400 |
And this indicates the direction in which the function is growing. So the function is 00:10:42.040 |
growing in this direction. We usually update our weights to move to the opposite direction 00:10:47.620 |
of the gradient. So we update our weights to move right so that the loss will diminish. 00:10:53.000 |
So for example, we take a little step in this direction so that the loss, as you can see, 00:10:56.940 |
will decrease because this will be the new loss 2. We started from loss 1 here. And this 00:11:03.960 |
is why we do backpropagation. So we calculate this gradient, so the gradient of the loss 00:11:08.280 |
function with respect to the parameters of the model. And then we update the parameters to 00:11:12.440 |
move against the direction of the gradient. The first thing that we do during our training 00:11:18.160 |
is the forward pass, which means that we have an input. We run it through our computation 00:11:22.800 |
graph to calculate an output. So let's do it here. So we have an input that is X_1=2 00:11:28.960 |
and X_2=4. We multiply, for example here in this node, each input by the weights of this 00:11:36.120 |
network and the weights are initialized as follows. So W_11=0.24, W_12=0.29, the bias 00:11:43.360 |
of the first neuron B_1=-0.70. This will result in some activations being calculated, so the values of each node 00:11:51.600 |
are called activations. We use the previous activation to calculate the next one, etc. 00:11:57.560 |
etc. until we arrive to the output of this neural network. We have a target because we 00:12:02.720 |
are training and we can calculate a loss. What do we do with this loss? We run backpropagation, 00:12:08.080 |
which means that we calculate the gradient of the loss function with respect to each 00:12:12.600 |
of these weights. For example, to calculate the gradient of the loss function with respect 00:12:16.960 |
to W_11, which is this parameter here, we can use the chain rule, which means that the 00:12:23.000 |
derivative of the loss function with respect to A_6, because we need to watch what are 00:12:30.040 |
the nodes that connect this parameter to the loss. So the nodes that connect this parameter 00:12:35.240 |
W_11 to the loss function are this node here, this node here, this one, and this one. What 00:12:41.480 |
we do in the chain rule is we just go from the loss to the parameter, backwards. So we 00:12:46.600 |
do the loss function with respect to the previous node, then this node, the derivative of this 00:12:52.520 |
node with respect to the previous node, then the derivative of this node with respect to 00:12:56.240 |
the previous node, the derivative of this node with respect to this node, and the derivative 00:13:00.320 |
of this node with respect to W_11, because this node contains the expression W_11. This 00:13:07.000 |
will result in a series of numbers that, when multiplied together, will 00:13:12.200 |
give us the derivative of the loss function with respect to W_11. What do we do with this 00:13:17.440 |
derivative, which is a number, because we evaluated this derivative on the input points 00:13:22.480 |
that we have chosen. It will give us a number that we will call gradient, and we use it 00:13:27.800 |
to update the value of our parameter. So the new value of the parameter W_11 is equal to 00:13:33.040 |
the old value of this parameter minus alpha, which is our learning rate, multiplied by 00:13:38.600 |
the value of this gradient. Now, why do we have a minus sign here? Because as we saw before, 00:13:46.240 |
the gradient indicates the direction in which the function is growing, the loss function 00:13:52.440 |
is growing with respect to the parameter, and we don't want to make the loss function 00:13:57.320 |
grow, we want the loss function to decrease, so we move in the opposite direction of the 00:14:01.640 |
gradient. So that's why we have a minus sign here. 00:14:09.040 |
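Here is a tiny sketch of that update rule for a single parameter (the toy loss and all values are made up just to illustrate new value = old value - alpha * gradient):

```python
import torch

# A single parameter and a toy differentiable loss, just to illustrate the update formula
w = torch.tensor(0.24, requires_grad=True)
alpha = 0.01  # learning rate

loss = (w * 2.0 - 1.0) ** 2  # some differentiable function of w
loss.backward()              # computes d(loss)/dw and stores it in w.grad

with torch.no_grad():
    w -= alpha * w.grad      # new value = old value - learning_rate * gradient
    w.grad.zero_()           # reset the gradient before the next step
```

Now, let's trick a classifier. So I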
introduced the gradient descent because it was needed for us to understand how to do 00:14:14.240 |
this trick. So imagine, first of all, what is a classifier? A classifier is a neural 00:14:18.960 |
network that can classify the input into one of the defined classes that we have. For example, 00:14:24.680 |
in this case, we may have a classifier that can take an input picture, and then can classify 00:14:30.560 |
it as a fish, or as a dog, as a volcano, as a car, or a pencil. For example, the ResNet 00:14:35.840 |
network can classify the input picture into one of the thousand classes it has in its 00:14:41.680 |
output logits. The output of a neural network of this kind is called logits, because it 00:14:48.800 |
indicates what is the score that the network assigns to each of these classes. We don't 00:14:54.960 |
usually work directly with the logits, we apply a softmax. A softmax is a function that 00:14:59.640 |
turns the logits into probability scores, because after the softmax they will sum up to one. And 00:15:05.920 |
then we take the class with the highest value of the softmax as the prediction of the model. 00:15:13.840 |
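A small illustration, assuming hypothetical logits for the five classes mentioned above:

```python
import torch

classes = ["fish", "dog", "volcano", "car", "pencil"]
logits = torch.tensor([4.1, 1.2, 0.3, -0.5, 0.9])  # hypothetical raw scores from the network

probs = torch.softmax(logits, dim=0)   # probability scores that sum up to one
pred = classes[probs.argmax().item()]  # class with the highest probability
print(pred, probs.max().item())        # e.g. "fish" with a very high probability
```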
So if after applying softmax, we see that our network indicated that this node here 00:15:21.040 |
is 95%, then it means that the network is telling us that this is a fish. And that also 00:15:27.840 |
applies to other cases, of course. So what do I mean by tricking a classifier? I mean 00:15:32.480 |
that I give you a classifier, so a neural network like this one, and you are not allowed 00:15:38.080 |
to change anything of the network. So you're not allowed to change the weights of this 00:15:43.120 |
network, you're not allowed to change the architecture of this network, you're not allowed 00:15:46.840 |
to change anything. So the weights are frozen, and the architecture and the hyperparameter, 00:15:52.040 |
everything is frozen. When we run a picture of a fish in this network, it will probably 00:15:57.560 |
classify it as a fish. But our goal is to give a picture of a fish as input, and 00:16:03.800 |
we want the network to classify it as something else that we can choose, for example, as a 00:16:07.880 |
volcano with very high probability. So of course, if you think about it, the only place 00:16:13.920 |
where we can work to trick this network is actually in the input. And this is what we 00:16:19.160 |
will do. We will change the input in such a way that the network will not see a fish 00:16:25.080 |
anymore, but it will see, for example, a volcano. So this fish and the previous fish are the 00:16:32.280 |
same for a human because you can see a fish and I can see a fish. But what we did was 00:16:37.760 |
to add a little bit of noise in this picture so that when the network sees this picture, 00:16:43.640 |
so this one here, it will not see a fish anymore, it will see a volcano. How is that even possible? 00:16:50.280 |
Let's see. So our goal is that we want to have a picture as input and we want to change 00:16:58.320 |
this picture in such a way that the network sees something else. How can we proceed? Well, 00:17:05.480 |
let's recall what we usually do when we train a network like this. We have a series 00:17:12.200 |
of pictures of fish, of trees, of people, etc. and the corresponding labels. 00:17:19.080 |
So for example, we have a thousand pictures of dogs and the label saying that they are 00:17:24.400 |
dogs. And then we have a thousand pictures of cats and saying that the corresponding 00:17:28.160 |
label is cat, etc. So what we do is we feed the input picture to the neural network. The 00:17:35.120 |
neural network will calculate some output, which is here, to which we apply the Softmax. 00:17:41.480 |
Then we have the corresponding label because we know what is this picture. This is coming 00:17:44.800 |
from our training data. So we know that is a fish, for example, in this case. We can 00:17:49.480 |
calculate the loss. Then we can run backpropagation, which means that we calculate the gradient 00:17:55.080 |
of the loss function with respect to the weights of this network, so the parameters of this 00:17:59.840 |
network. And then we update the parameters to reduce this loss. And this is how we train 00:18:04.320 |
this network. So let's try to see how we can trick the network into believing that, for 00:18:09.720 |
example, this fish here is actually a volcano. Now, when we do the training, as you saw before, 00:18:17.600 |
we calculated the gradient of the loss function with respect to the parameters, but we can 00:18:21.760 |
actually calculate also the gradient of the loss function with respect to the input picture. 00:18:27.360 |
So what we can do is as follows. We can create a new loss function. So imagine we have a 00:18:33.200 |
picture of a fish. We know it's a fish, but we want to trick the model into believing 00:18:36.800 |
it's a volcano. We can create a loss function with respect to the target that we want the 00:18:41.600 |
network to have. So we want the network to believe it's a volcano, so we can create a 00:18:45.720 |
new loss function with respect to the target volcano. And then we run this picture in the 00:18:51.080 |
network and we calculate the gradient of the loss function with respect to the input. And 00:18:55.640 |
later we will see in the code how to do that. But let's try to analyze what does it mean 00:19:01.340 |
to calculate the gradient of the loss function with respect to the input. Because it is a gradient, 00:19:06.400 |
it indicates the direction in which we should change the input to make the loss grow. 00:19:17.760 |
So we can run backpropagation and optimization on the input to decrease this loss. So because 00:19:26.480 |
the gradient tells us how we should change the input image to make the loss grow, we 00:19:31.360 |
can also change the input image in the opposite direction to make the loss decrease. So that's 00:19:38.180 |
what we will do. We calculate the gradient of the loss function with respect to the input 00:19:44.520 |
image, which will indicate a direction. We update the image with some noise in the opposite 00:19:50.280 |
direction, so we add a little bit of noise in the opposite direction indicated by this 00:19:54.120 |
gradient, and we keep updating it until the network predicts correctly the output as Volcano. 00:20:02.040 |
So in the code it is done as follows. Imagine we have a model, we have an input image, and 00:20:08.160 |
we have a target class, for example, Volcano. What we can do is we take our input image 00:20:13.200 |
and we create a tensor of it by asking PyTorch to also calculate the gradient with respect 00:20:19.320 |
to this tensor, because by default PyTorch will only calculate the gradient with respect 00:20:23.360 |
to the weights. So the gradient of the loss function with respect to the weights. But 00:20:26.720 |
we also want the gradient with respect to this input image. Then we run for a few steps 00:20:32.360 |
the following. We calculate the output of the model, so we are calculating this output 00:20:37.680 |
here. We are creating a special loss function with respect to this target that we want the 00:20:43.320 |
network to have. So we want the network to output Volcano, so we create a loss function 00:20:48.600 |
with respect to this target class. We run backward, which means that we calculate the 00:20:53.800 |
gradient of the loss function with respect to the input. And then we update the image, 00:21:01.320 |
so the image is updated just like the update formula for the parameters, so it's equal 00:21:06.520 |
to the old image minus some learning rate, here I call it alpha, multiplied by the direction 00:21:13.560 |
of the gradient of the loss function with respect to the input. And we are moving against 00:21:20.360 |
the direction of the gradient because we want the loss to decrease. If we update the image 00:21:24.840 |
continuously in this way, we will see that the network will predict it as Volcano. And 00:21:31.040 |
this is how we can trick a classifier. 00:21:35.960 |
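A minimal sketch of that loop in PyTorch (the model, the image tensor, and the target class index are placeholders; the actual notebook shown in the video may differ in details):

```python
import torch
import torch.nn.functional as F

def trick_classifier(model, image, target_class, alpha=0.01, steps=100):
    """Optimize the input image so that a frozen classifier predicts `target_class`."""
    model.eval()
    for p in model.parameters():
        p.requires_grad_(False)                 # the weights stay frozen, we never touch them

    image = image.clone().requires_grad_(True)  # ask PyTorch for gradients w.r.t. the input
    for _ in range(steps):
        logits = model(image.unsqueeze(0))      # forward pass, image is [C, H, W]
        loss = F.cross_entropy(logits, torch.tensor([target_class]))  # loss toward the desired target
        loss.backward()                         # gradient of the loss w.r.t. the input image
        with torch.no_grad():
            image -= alpha * image.grad         # move against the gradient to decrease the loss
            image.grad.zero_()
    return image.detach()
```

After enough steps, the classifier assigns the target class a very high probability even though the image still looks like a fish to us. I made this example because I wanted to show you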
that models may look at patterns that are completely different from us humans. For example, 00:21:43.000 |
in this case, the model is predicting this picture as a Volcano. So the model somehow 00:21:49.240 |
is seeing a Volcano here, even if to us humans, we will never be able to see a Volcano in 00:21:54.960 |
this picture, it's a fish. So understanding how our model makes its prediction can help 00:22:02.480 |
us improve our models. And thanks to the sponsor of today's video, LeapLabs, we can get insights 00:22:09.040 |
into how our model makes its predictions. LeapLabs is a research lab that is focusing 00:22:15.400 |
on machine learning interpretability. And they have developed this library, the LeapLabs 00:22:20.680 |
interpretability engine, that allows us to understand what our model has learned and 00:22:26.000 |
how we can get insights from our model to improve it. For example, this library allows 00:22:33.160 |
us to generate prototypes. Prototypes, what are prototypes? Well, imagine that you have 00:22:38.360 |
a classification model, a computer vision classification model, which means that you 00:22:42.040 |
have a model that takes as input a picture and will classify it as one of the classes. 00:22:48.240 |
In this case, it's a food classifier that will classify an input picture as one of the 00:22:52.960 |
following classes, for example, ice cream or hamburger or pancakes or waffles. In this 00:22:58.160 |
case, it looks like the model is well trained, because by generating prototypes, 00:23:03.880 |
we can get the kind of input that the model wants to see to classify it as a target class. 00:23:10.960 |
So this is the kind of picture that the model wants to see to classify the input picture 00:23:16.680 |
as a hamburger. This is the kind of picture that the model wants to see to classify the 00:23:21.480 |
input picture as a pancake. And it actually resembles a pancake. And this one actually 00:23:25.840 |
resembles a hamburger, which means that the model has learned the correct features from 00:23:31.200 |
the food to classify it as a given class. But we will see later a case in which this 00:23:36.700 |
is not true. Another feature of the Leap Labs interpretability 00:23:41.480 |
engine is entanglement. Entanglement allows us to understand how different classes share 00:23:47.240 |
features. For example, for a food classifier like the one we saw before, we expect high 00:23:53.320 |
entanglement between the ice cream class and frozen yogurt class. Because at least for 00:23:58.000 |
me as a human, they look very similar. They both look like ice cream. So it is expected 00:24:05.160 |
that these two classes share features. But imagine that you have a more broad classifier 00:24:10.920 |
like the one we saw before that can classify fish and volcano, etc. I would not expect, 00:24:16.960 |
for example, cheesecake and dog to have high entanglement. Because at least for me, they 00:24:23.360 |
shouldn't share features. I mean, they are totally different objects. So if they do have 00:24:29.520 |
high entanglement in the model, it means that the model is looking at the wrong features. 00:24:39.440 |
And it also may indicate a higher chance of misclassification between these two classes. 00:24:46.540 |
Another feature that is very important is feature isolation. So in this case, feature 00:24:52.040 |
isolation allows us, for example, to understand which parts of the input are being used by 00:24:57.320 |
the model to make a certain prediction. For example, for a food classifier, imagine we 00:25:02.640 |
have the following picture. The food classifier will classify it as a frozen yogurt with 98% 00:25:11.160 |
probability. But by generating feature isolation, we can understand which part of the input 00:25:17.880 |
is being used to classify the input as a frozen yogurt. And it's actually the part that looks 00:25:23.560 |
like frozen yogurt. But also because there is entanglement between frozen yogurt and 00:25:29.040 |
ice cream, the model, as you can see, is using the similar features to also classify it as 00:25:34.320 |
ice cream with low probability because the model is well trained. But still, they have 00:25:38.240 |
some shared features, as you can see. And there is something that you may not have noticed, 00:25:43.840 |
which is the waffles class. With very low probability, the model may also classify it as a waffle. 00:25:48.640 |
Why? Because the model is seeing some features, which are the berries that are on this frozen 00:25:55.560 |
yogurt, to classify it as a waffle. This can happen because in the original picture, in 00:26:01.840 |
the training data, the waffles probably had the berries on top. So the model learned to 00:26:06.960 |
look at the berries to recognize a waffle. So the LeapLabs Interpretability Engine can 00:26:11.880 |
understand this and will show you this. This helps you understand what your model has learned. 00:26:18.680 |
Now let's look at a case where things can go wrong in our model and how the LeapLabs Interpretability 00:26:25.920 |
Engine can help us improve it. If you look at the tutorial link that I have shared in 00:26:33.440 |
the description, if you go to this link here, to the tutorials at the LeapLabs website, 00:26:41.720 |
you will see a tank detection case study. And if you open it, it will open a Colab notebook. 00:26:48.000 |
Now let's run it, actually. So let me change the runtime type. We choose the T4 GPU, and we can 00:26:55.480 |
run it. It will do some imports. Now, what is the tank detection case study? Well, we 00:27:02.120 |
are talking about a classification model that can detect tanks or no tanks. So it has only 00:27:08.120 |
two classes that indicate if the picture contains a tank or it does not contain a tank. Suppose 00:27:14.080 |
that this is a model that is very important for us, and we want to deploy it in the battlefield 00:27:18.760 |
because it can help protect our soldiers. But before deploying it, of course, we want 00:27:25.000 |
to understand what our model has learned. So by understanding what our model has learned, 00:27:29.920 |
we can predict failure modes. So if we run, for example, a picture of a tank into our 00:27:37.760 |
classification model, we will see that it is classified as having a tank with a very high 00:27:43.640 |
probability. So in this case, the model is predicting that there is a tank in this picture 00:27:47.640 |
with 98% probability. So it looks like the model is performing very well. But let's try 00:27:53.360 |
to use the LeapLabs interpretability engine to understand what our model has learned. 00:27:58.200 |
So we install the library. Then we can use the library to generate prototypes. As we 00:28:03.520 |
saw before, the prototype tells us what kind of input the model wants to see to give 00:28:09.520 |
a certain output. In this case, we need the LeapLabs API key, 00:28:15.680 |
which we can generate from the LeapLabs website. So we go to the dashboard, we go settings, 00:28:22.400 |
and it will generate a key here. We put our key in the API key and we can generate a prototype. 00:28:41.680 |
On my computer, it takes around 25 seconds, I think, or one minute to generate it. Okay, 00:28:47.640 |
the library has generated two prototypes, one for the tank class. So 00:28:55.240 |
what kind of input the model wants to see to tell us that there is a tank and what kind 00:29:02.620 |
of input the model wants to see to tell us that there is no tank. And let's look at this 00:29:08.640 |
picture, which corresponds to the output indicating that there is a tank. If we look at this picture, 00:29:14.020 |
you see that actually there is no tank. So it means that the model is looking at some 00:29:19.840 |
stuff that is gray, which probably looks like a cloud. But there is no tank here. I mean, 00:29:26.080 |
I expected to see a cannon, I expected to see some wheels or maybe the gun or a soldier 00:29:30.920 |
with a gun on top of the tank or something like this, but actually there is none of these 00:29:34.120 |
features. So is our model looking at the correct features to actually predict a tank, the presence 00:29:41.440 |
of a tank? And let's look at the other class, no tank. As we can see, we have these green 00:29:48.620 |
lines here, which probably indicates grass. So probably the model is looking at the grass 00:29:54.560 |
to indicate that there is no tank. So if it sees an open field with only grass, it will 00:30:01.080 |
say that there is no tank, which could make sense. But the problem is, why is our model 00:30:05.920 |
not looking at the tank to indicate that there is a tank? So let's try to make a prediction 00:30:12.200 |
before looking further at what the LeapLabs Interpretability Engine can tell us. What could 00:30:19.720 |
happen in this case is that imagine that in our training data, we have a lot of pictures 00:30:24.800 |
of tanks and all of them that have tanks happen to have cloudy sky. So what our model may 00:30:32.480 |
have learned is that if there is a cloudy sky, then there is a tank, not that if there 00:30:37.280 |
is a tank, there is a tank. So let's validate our hypothesis. We can use a feature isolation 00:30:43.560 |
to understand what kind of features from an input picture the model is looking at to make 00:30:48.700 |
a certain prediction. So in this case, for example, we can feed the picture that we saw 00:30:55.160 |
before. So this picture as input to see what kind of features the model is looking at to 00:31:01.040 |
predict a tank. Let's see. At first glance, the model seems to be using the entire picture to 00:31:19.760 |
predict a tank. But the white areas indicate features that are not being 00:31:25.560 |
used, and the other areas indicate features that are being used. So as you can see, 00:31:29.840 |
the tank here is white, which means that the model is not using the tank to predict 00:31:35.240 |
the tank, but it's using the sky and maybe the ground to predict that there is a tank. 00:31:41.200 |
So as suspected, the model doesn't seem to use the actual tank for classification much 00:31:46.880 |
at all, right? It's using the sky, the background, and maybe the saturation of the picture. So 00:31:51.800 |
how can we fix this model? Well, one way to fix it is to further train the model by using 00:31:58.000 |
more diverse images of tanks that have maybe some sunny sky, maybe some cloudy sky, maybe 00:32:04.500 |
maybe a snowy environment and some environments maybe in the forest, etc. 00:32:11.100 |
So that the model cannot find any other correlation between pictures of tanks except for the tank 00:32:18.200 |
itself. So that the model will be forced to learn the presence of the tank itself as a 00:32:25.360 |
predictor for tanks. We can run this training and it will for sure improve our model. And 00:32:31.280 |
there is code here showing how to train it again. And after training, we can run feature isolation 00:32:36.560 |
again. And we can see here at the end that after retraining the model on more diverse 00:32:42.440 |
pictures, the model is actually putting all its attention on the tank itself to predict 00:32:48.360 |
the tank and not on the surrounding area. So all of this, thanks to the Leap Labs interpretability 00:32:54.520 |
engine. Now let's talk about feature visualization. So what we saw before with Leap Labs interpretability 00:33:02.240 |
engine is that we can get insights into how our model is making its prediction or what 00:33:07.400 |
kind of feature our model has learned. And in particular, especially for convolutional 00:33:12.480 |
neural networks for computer vision, we have, of course, a subsequent application of layers 00:33:18.400 |
of convolutions. And our goal with feature visualization is to understand, for each of 00:33:23.520 |
these layers or each of the neurons that make up these layers, what kind of features 00:33:29.680 |
from the input they learned that contribute to the final prediction. So we want to understand, 00:33:34.920 |
for example, imagine that you have a food classifier and you have many layers in your 00:33:38.840 |
convolutional neural network. Each layer will be looking at a particular kind of feature 00:33:44.080 |
in the input that will contribute to the final output for the final classification. Some 00:33:49.160 |
layers may look at, for example, lines. Some layers may be looking at edges. Some layers 00:33:56.640 |
may be looking at certain patterns, etc. So we want to understand what features each of 00:34:03.200 |
these layers or each of these neurons have learned. And we can do feature visualization 00:34:08.080 |
at many levels. We can do it at the neuron level. So what features is this neuron looking 00:34:13.480 |
at? Or we can do it at the layer level. So what kind of features is the particular layer 00:34:19.600 |
looking at? And also at the logit level, in this case, we have a classification network. 00:34:26.800 |
So we want to understand what kind of features the model wants to see in order to predict 00:34:33.440 |
that particular class. So we will model the feature visualization problem as an optimization 00:34:41.880 |
problem. And it's actually how it's done in practice. And it's actually also how more 00:34:46.320 |
or less the Leap Labs Interpretability Engine works. Of course, it's much more sophisticated. 00:34:51.800 |
So this is a simplified explanation. But I wanted you to understand how such an engine 00:34:56.120 |
works so that when you use it, you also know what's happening inside. So what we do, imagine 00:35:03.640 |
that you have a classification network, a convolutional network that is used for classification. 00:35:07.880 |
So as you can see here at the end, we have a Softmax and we have subsequent layers of 00:35:11.920 |
convolutions. We want to understand what this layer of convolution has learned. So in order 00:35:20.400 |
to understand what kind of features this layer has learned, we will treat it as an optimization 00:35:28.040 |
problem, which means that we will create an input that is a complete noise. We run it 00:35:33.960 |
through our network. We take the activations of this layer. So all the outputs of this 00:35:41.000 |
layer and we use it as an objective function. Or you can also call it a loss function. So 00:35:46.760 |
it's the same thing. So you take the output of this layer as the objective and then you optimize the 00:35:52.720 |
input to maximize it. So that's why I'm calling it objective. Whenever 00:35:57.260 |
you are maximizing something, we call it objective function. Whenever you are minimizing something, 00:36:02.120 |
we call it a loss function. But it's the same thing. The only thing that changes is that in one 00:36:06.000 |
case you are doing gradient ascent and in the other case you are doing gradient descent. 00:36:11.020 |
In this case, we want to maximize the activations of this layer. So 00:36:16.680 |
we treat the output of this layer as the objective function and we run backpropagation to maximize 00:36:24.720 |
these activations. And this will modify the input in such a way that it maximizes these 00:36:30.080 |
activations. This will get us insights into what kind of features this layer wants to 00:36:39.360 |
see to contribute to the final prediction. 00:36:46.560 |
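A simplified sketch of this activation-maximization idea in PyTorch (the layer choice, image size, and step count are arbitrary; real tools like the LeapLabs engine or Lucid add the regularizers discussed later):

```python
import torch

def visualize_layer(model, layer, steps=200, lr=0.05):
    """Optimize a noise image so that it maximizes the mean activation of `layer`."""
    for p in model.parameters():
        p.requires_grad_(False)                    # we only optimize the input, not the weights

    activations = {}
    def hook(module, inputs, output):
        activations["value"] = output              # capture the layer's output on each forward pass
    handle = layer.register_forward_hook(hook)

    image = torch.randn(1, 3, 224, 224, requires_grad=True)   # start from pure noise
    optimizer = torch.optim.Adam([image], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        model(image)
        objective = activations["value"].mean()    # the objective we want to maximize
        (-objective).backward()                    # gradient ascent = descent on the negative objective
        optimizer.step()

    handle.remove()
    return image.detach()
```

We can also optimize for logits. So for example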
if instead of using a particular layer, we use the logits of a particular class, for 00:36:52.360 |
example the class associated with dogs, because we want to see what kind of dogs our model 00:36:58.520 |
wants to see to predict it as a dog, we can use the logits corresponding to the dog class. 00:37:05.060 |
We feed as input complete noise. We use the logit corresponding to the dog class as an 00:37:13.320 |
objective and we run backpropagation to optimize this input to maximize this logit. This is 00:37:23.520 |
actually how you can generate kind of a prototype for the class dog. 00:37:31.600 |
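The logit variant looks almost the same; a minimal sketch (class index and hyperparameters are placeholders, and there is no regularization yet):

```python
import torch

def naive_prototype(model, class_idx, steps=200, lr=0.05):
    """Optimize a noise image to maximize the logit of `class_idx` (no regularization yet)."""
    for p in model.parameters():
        p.requires_grad_(False)
    image = torch.randn(1, 3, 224, 224, requires_grad=True)  # start from pure noise
    optimizer = torch.optim.Adam([image], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        logits = model(image)
        (-logits[0, class_idx]).backward()   # maximize the target logit via gradient ascent
        optimizer.step()
    return image.detach()
```

Of course, you may be wondering,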
is it that simple? Well, not really, because if you do this procedure, so if you start 00:37:37.000 |
from complete noise and you try to maximize a certain logit, it will for sure give you 00:37:42.800 |
insights into what the model has learned, so what kind of input the model wants to see 00:37:48.240 |
to have that logit as output, so that class as output, but it will not look very natural. 00:37:57.840 |
So for example this image here, I believe it's taken from ResNet, in which we can see 00:38:05.760 |
that for example if we optimize for the class Flamingo, we see that the input needs to have 00:38:11.960 |
something like this long necks here, which are typical of Flamingo, which means that 00:38:17.280 |
the model will actually look at these long necks of Flamingos to actually predict the 00:38:21.920 |
Flamingo class. If we look at for example Goldfish, we can see that we have these eyes 00:38:28.160 |
of this Goldfish here, and for example this one looks like the shape of a fish, so the 00:38:33.480 |
model will actually look at the fish to predict the Goldfish. And if we look at for example 00:38:40.480 |
Tarantula, we will see these long black legs here, like this one, like this one, which 00:38:45.880 |
means that the model actually will look at the legs of the Tarantula to predict it as 00:38:50.640 |
Tarantula. But of course you can see that these pictures don't look really natural, 00:38:56.520 |
whereas if you look at the Leap Labs interpretability engine's outputs, they look quite natural. So for example 00:39:00.600 |
if we go back, and we look at the prototypes generated for Pancakes, it actually looks 00:39:05.480 |
like a Pancake. And if we look at Hamburger, it actually looks like Hamburger. So how can 00:39:11.120 |
we make our inputs look more natural? Well for one, you could use the Leap Labs interpretability 00:39:17.760 |
engine which can do it out of the box, but to understand how Leap Labs does it: they use 00:39:23.240 |
what is known as regularization. Let's talk about regularization. So first of all, what 00:39:29.440 |
is regularization? When we train a model, our goal is to run some input through this 00:39:37.880 |
model, calculate an output, compare it with the target so that we can calculate the loss 00:39:42.160 |
and then update the parameters of the model such that we reduce this loss. When we introduce 00:39:50.040 |
regularization, we want this optimization to happen in a particular way, so we want 00:39:55.160 |
to put some constraints in our optimization process. For example, when we train a model, 00:40:01.320 |
we can do what is known as L1 regularization. With L1 regularization, what we do basically 00:40:07.780 |
is we have our loss function, which is our, let's say, cross-entropy loss, because we 00:40:15.960 |
are doing, for example, classification tasks. Then we can add some regularizer, which is 00:40:21.200 |
a constraint that we add to our loss function to make this optimization process happen in 00:40:27.360 |
particular ways. For example, with L1 regularization, we want our model to use as few 00:40:35.800 |
input features as possible. So as a regularizer, we use the L1 penalty, 00:40:42.080 |
which is basically just the sum of the absolute values of all the weights. What happens in this case? 00:40:50.940 |
What will happen is that because we calculate always the gradient of the loss function with 00:40:56.500 |
respect to the weights of the model, the presence of this absolute value on the weights or the 00:41:03.620 |
parameters will push these weights towards zero. And because many weights become 00:41:09.180 |
zero, the model will use fewer features from the input. And this helps to make the model more 00:41:15.620 |
sparse, which also helps us to then reduce the size of the model. So regularizers are 00:41:22.580 |
particular constraints that we add to the loss function to make the optimization process 00:41:27.720 |
happen in particular ways. 00:41:35.060 |
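A small illustrative sketch of L1 regularization (the lambda value is arbitrary):

```python
import torch

def loss_with_l1(model, base_loss, lam=1e-4):
    """Add an L1 penalty on the weights to the task loss (e.g. cross-entropy)."""
    l1_penalty = sum(p.abs().sum() for p in model.parameters())
    return base_loss + lam * l1_penalty   # the optimizer now also pushes weights towards zero
```

And this is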
what we can also do in our optimization problem. So what are we optimizing? We are starting 00:41:40.360 |
from pure noise. For example, this is our pure noise, and we want to transform it into 00:41:47.640 |
some kind of input that maximizes a particular output logit in our classification network. 00:41:56.180 |
Of course, when we train a neural network, the data set that the network was trained 00:42:02.540 |
upon, let's say that in the space is here, but this does not mean that the model will 00:42:08.400 |
not activate the output logit, for example, corresponding to the class dog, for something 00:42:14.520 |
that is out of distribution. So what we want to do is we want our neural network to optimize 00:42:21.320 |
our input noise. Sorry, we want our optimization problem to optimize our input noise in such 00:42:28.160 |
a way that we remain close to the distribution of the data that the network has seen. So 00:42:34.400 |
the natural input that the network has seen. How to do that? Well, first of all, look at 00:42:40.320 |
my picture. Do you think it's a noisy picture? No, because if you look at my t-shirt, you 00:42:46.800 |
can see that adjacent pixels are similar, and there is not much variance between 00:42:54.000 |
neighboring pixels. So we could ask our optimization problem to optimize the input 00:43:00.720 |
in such a way that it penalizes high variance for neighboring pixels. And this is known 00:43:08.840 |
as a frequency penalization. So we take our loss function, which is basically just the 00:43:14.720 |
logit that we want to maximize, and we add a penalty to this loss every time we see 00:43:20.200 |
a very high variance for neighboring pixels. 00:43:26.560 |
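One common way to express this penalty is a total variation term over neighboring pixels; a minimal sketch (how strongly to weight it against the logit objective is up to you):

```python
import torch

def total_variation(image):
    """Penalize large differences between neighboring pixels (image shape: [1, C, H, W])."""
    dh = (image[:, :, 1:, :] - image[:, :, :-1, :]).abs().mean()  # vertical neighbors
    dw = (image[:, :, :, 1:] - image[:, :, :, :-1]).abs().mean()  # horizontal neighbors
    return dh + dw

# Hypothetical combined objective to maximize: target_logit - tv_weight * total_variation(image)
```

Another regularizer that we can use is the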
transformation robustness. This is not applied to the loss, actually. This basically means 00:43:32.280 |
that we take our input, the one that we are optimizing, we transform it some way so we 00:43:38.040 |
can rotate it, we can scale it, we can translate it. In this case, this code that I took from 00:43:44.520 |
the Lucid library, which is a very famous library for feature visualization, they applied 00:43:52.320 |
random scaling and random rotation, which means that they will rotate and randomly scale 00:43:58.440 |
the input and then pass it through the network. And because it's an optimization 00:44:03.320 |
problem, the optimization process will have to modify the input in such a way 00:44:10.600 |
that even when it's translated, even when it's rotated, even when it's scaled, it will 00:44:16.360 |
still activate that output. So it will only affect the pixels, the input features that 00:44:22.480 |
are needed to actually activate that logit. Which also means, in other words, that 00:44:29.440 |
in case we are trying to, for example, maximize the logit corresponding 00:44:35.360 |
to the class dog, it will actually try to create a dog, because it does not matter if 00:44:39.960 |
the dog is rotated, it does not matter if the dog is scaled, it does not matter if the 00:44:44.640 |
dog is translated here or there. So it will try to create 00:44:51.600 |
a natural dog as much as possible. 00:44:57.400 |
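A simplified sketch of transformation robustness (Lucid applies its own set of random scalings and rotations; here just random translation and scaling, implemented with differentiable torch ops for illustration):

```python
import random
import torch
import torch.nn.functional as F

def jitter(image):
    """Randomly translate and scale the image a little before each forward pass."""
    dx, dy = random.randint(-8, 8), random.randint(-8, 8)
    image = torch.roll(image, shifts=(dx, dy), dims=(-2, -1))                  # random translation
    scale = random.uniform(0.9, 1.1)
    size = int(image.shape[-1] * scale)
    image = F.interpolate(image, size=(size, size), mode="bilinear", align_corners=False)
    return F.interpolate(image, size=(224, 224), mode="bilinear", align_corners=False)

def robust_step(model, image, class_idx, optimizer):
    """One optimization step where the image must survive the random transform."""
    optimizer.zero_grad()
    logits = model(jitter(image))         # the transformed image must still activate the class
    (-logits[0, class_idx]).backward()    # keep maximizing the target logit
    optimizer.step()
```

Of course, there are many more regularizers that we need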
to add to make this optimization problem more robust so that 00:45:05.160 |
we don't get some out of distribution data, but we want to try to generate data that is 00:45:10.880 |
as in distribution as possible. And this is also how LeapLabs works. So the LeapLabs interpretability 00:45:18.400 |
engine can generate prototypes that look natural. And the way they do it is described in this 00:45:25.320 |
paper called "Prototype Generation - Robust Feature Visualization for Data-Independent 00:45:30.080 |
Interpretability" in which they describe the process of generating these prototypes. And 00:45:36.400 |
the way they do it is basically they apply all these regularization techniques. So for 00:45:40.840 |
example, you can see here, random transformation, so that the optimization process produces 00:45:47.840 |
an input that is as natural as possible without ever actually seeing an input. So as you remember, 00:45:57.000 |
when we do a prototype generation with the LeapLabs interpretability engine, we never 00:46:01.160 |
feed an input picture. We just give the model and the algorithm will generate a prototype 00:46:07.240 |
without ever seeing what a natural picture looks like. But it 00:46:13.280 |
can still generate very natural inputs. Why? Because they make this optimization process very robust. 00:46:21.100 |
So they penalize, for example, the high frequency or the high variance in the neighboring pixels. 00:46:27.880 |
They also apply transformation, etc. So that the resulting input is as close as possible 00:46:33.440 |
to the natural inputs that the model is trained upon. Now let's try to use the knowledge that 00:46:39.800 |
we have acquired and apply it to language models. So as we saw before, with computer 00:46:45.120 |
vision models, we can do prototype generation, which is based on feature visualization, which 00:46:49.960 |
means that given a particular, for example, output logit, we want to understand what kind 00:46:54.440 |
of input the model wants to see to have that particular logit as output. Can we apply the 00:47:01.200 |
same techniques also to language models? So given a desired output, what kind of prompt 00:47:06.600 |
the model wants to see to generate that output? Well, let's try to answer that question. First, 00:47:12.840 |
let's review how language models work. So a language model, as you know, is a probabilistic 00:47:17.120 |
model that assigns probabilities to sequences of tokens. For example, imagine that the input 00:47:23.440 |
to the language model is that Shanghai is a city in China. The model will tell us what 00:47:29.000 |
is the probability of the next token being China or being Beijing or being cat or being 00:47:34.400 |
pizza or being whatever token is present in our vocabulary. One simplification I always 00:47:39.960 |
do in my video is to associate a token with a word and the word with a token. But this 00:47:44.960 |
is not usually the case. So usually a word may not be a token and the token may not be 00:47:50.960 |
a word. And actually, in most cases, a word is made up of multiple tokens. 00:47:56.560 |
But for our case, we will simplify it and see that every token is a word and every word 00:48:00.520 |
is a token. So the language model just tells us what is the probability of the next token 00:48:05.480 |
given an input prompt. 00:48:13.280 |
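A small sketch of reading next-token probabilities with the Hugging Face transformers library (GPT-2 is just an example model; the model analyzed later in the video may be different):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Shanghai is a city in"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits           # shape: [batch, sequence, vocabulary]

next_token_probs = torch.softmax(logits[0, -1], dim=-1)  # distribution over the next token
top = next_token_probs.topk(5)
for p, idx in zip(top.values, top.indices):
    print(tokenizer.decode(int(idx)), float(p))          # e.g. " China" with the highest probability
```

Imagine we want to understand what our model thinks of the word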
girl. So what kind of input, what kind of prompt the model wants to see as input to 00:48:22.400 |
predict the word girl as next token? Well, let's try to use the techniques that we have 00:48:29.720 |
seen before. So first, let's see the results of such an analysis. And in particular, Jessica, 00:48:36.640 |
the founder of LeapLabs, did this study. So she took some tokens, for example, 00:48:42.800 |
the word girl, and then she optimized the input prompt in such a way that the output 00:48:48.320 |
girl is maximized. So the next predicted token is a girl given this input. And she did it 00:48:56.080 |
also for the word woman. She did it for the word good and for the word doctor. This gives 00:49:02.120 |
us insight into what our model has learned because our model, our language model is just 00:49:07.440 |
a model that models the statistical distributions of tokens based on the training data it has 00:49:13.760 |
seen. So in this case, for example, the prompt that maximizes the word girl as being the 00:49:21.920 |
next token is this input here. And as you can see, it tells us that our model has seen 00:49:28.240 |
a lot of bad data that is making our model have bias against girls, for example, because 00:49:37.400 |
we see sexual words, we see other words that are not quite polite. And the same happens 00:49:43.560 |
for, well, in the case of the woman, the word woman, it's a little better, but still it 00:49:49.240 |
tells you what are the biases of your model against this particular concept. And we can 00:49:54.960 |
see it also for the word good, for example, we see that shooting is good or somehow, and 00:50:00.080 |
Jesus and beautiful or basketball, et cetera. So optimizing the prompt to generate a particular 00:50:09.400 |
output tells you what our model wants to see as input to generate that particular output, 00:50:15.240 |
which gives us insights into the distribution that our model has learned. And Jessica, she 00:50:21.840 |
ran another experiment. Because when we optimize the prompt, and we will see later 00:50:26.800 |
how it's actually done in practice, we start from complete noise and we optimize this prompt 00:50:32.560 |
to become tokens that are more likely to predict a particular output. And of course, 00:50:42.140 |
you can restart this optimization problem from multiple starting points, because you 00:50:46.440 |
can start from complete noise, so there are many possible starting points. 00:50:52.240 |
And she did it many times and she got some input tokens 00:50:57.280 |
that were more likely to predict the word girl as next token. This gives us a map on 00:51:03.900 |
what kind of inputs the model wants to see, with each token and its frequency, to 00:51:11.120 |
predict girl as next token or boy as next token 00:51:16.480 |
or science or art as next token. And this also gives us insight into the statistical 00:51:21.440 |
distribution that our model has learned. For example, the word girl, to get the word girl 00:51:26.800 |
as output, the model wants to see some sexual words and also some curse 00:51:34.760 |
words, but also for example the word dresses or the word boys. And in the case 00:51:39.900 |
of the word boy, we can see that it wants to see rebellious, monkey, girl, et cetera, 00:51:45.600 |
et cetera, but this gives us insight into what our model has seen during its training. 00:51:51.760 |
So now let's try to analyze how to actually generate this kind of map and how this optimization 00:51:57.840 |
problem works. What we did before with the computer vision models, that is we have some 00:52:04.080 |
output logits for which we want to find an input that maximizes that logit is exactly 00:52:09.560 |
what we want to do here, except that here we have a language model and we have another 00:52:15.800 |
complexity. So let's do it step-by-step, how we can generate this kind of map. Imagine 00:52:22.500 |
that we want to find input embeddings that maximize the probability of the next token 00:52:29.200 |
being girl. Now, the first complexity is that girl may not be one token, but it could be 00:52:35.240 |
multiple tokens. So let's suppose that it's actually multiple tokens because this is a 00:52:40.080 |
real scenario. So we have the output that we want to optimize an input for, and suppose 00:52:46.480 |
that we want to optimize three input embeddings. So let's draw three input embeddings to maximize 00:52:54.800 |
the probability of the next token being girl, but we know that girl may not be a single 00:53:00.320 |
token. So let's suppose that it's actually two tokens, one token being GI and the 00:53:06.780 |
other token being RL. Now the job that we did before, that is calculating the gradient of 00:53:16.760 |
the loss with respect to the input, is something that we cannot do anymore. Why? 00:53:22.440 |
Because the input in the language model is tokens and tokens are numbers that represent 00:53:28.640 |
the position of this token in the vocabulary. So for example, imagine the input could be 00:53:34.760 |
for example, zero, five, and nine, and these are positions that represent each token in 00:53:40.600 |
the vocabulary. And we cannot optimize for something that is discrete because there is 00:53:45.680 |
no token 0.5, there is no token 3.2. We cannot change these tokens a little bit, hoping that 00:53:52.620 |
they move towards something that will generate that kind of output. The 00:53:57.920 |
only thing that we can optimize are embeddings. So we will not be optimizing input tokens, 00:54:03.360 |
we will be optimizing input embeddings. So in this case, we suppose that we have three 00:54:08.680 |
embeddings. So let me delete this part. So we suppose that we have three input embeddings. 00:54:29.520 |
Now which three input embeddings should we choose? Well, in the case of the computer 00:54:33.420 |
vision model, we started from pure noise. In this case, we can also start from pure 00:54:37.180 |
noise. So we can start from three random embeddings. One, two, and three. What we can do, we can 00:54:47.000 |
run these three embeddings in our language model. And as you know, the language model 00:54:51.400 |
is a transformer model in most of the cases, and it's a sequence-to-sequence model, so 00:54:55.600 |
if the input is three embeddings, it will generate three embeddings as output. 00:55:00.960 |
So here, we will have three embeddings. Our goal is to make sure to select three embeddings 00:55:12.600 |
that make the likelihood of the next token being GI and the next next token being RL 00:55:20.480 |
maximized. So how to proceed? We take these three embeddings, we run it through our model, 00:55:26.500 |
it will produce three embeddings as output. Usually when sampling from a language model, 00:55:31.660 |
we take the last embedding, so the last hidden state (the outputs of the language model 00:55:38.220 |
at this point are called hidden states). We take the last hidden state, we send it to 00:55:44.060 |
the linear layer, and it will generate what are known as logits. Logits indicate what 00:55:50.100 |
is the probability score, it's not a probability actually, but what is the score that the model 00:55:54.900 |
assigns to each token in the vocabulary, and then by applying the softmax, they become 00:56:00.360 |
probability scores, and then usually we choose the token with the maximum probability score 00:56:05.940 |
as the next token. So in this case, we can take the last embedding, we can run it through 00:56:15.780 |
the linear layer, and it will generate logits. So we will have the logits associated with 00:56:21.100 |
the position, let's say zero. In these logits, which are a list of numbers, 00:56:28.860 |
one for each position in the vocabulary, we are interested in two logits in particular. 00:56:34.220 |
One is the one with the highest probability score, and that will be used to sample the 00:56:40.140 |
next token, and one is the logit corresponding to the token GI. So we save two logits, one 00:56:47.180 |
is corresponding to the GI, and one is for the next token. We use the logit corresponding 00:56:55.620 |
to the next token to understand what is the next token, we put it back in the input of 00:56:59.980 |
the model, so we put back the embedding corresponding to this next token in the input of the model, 00:57:07.020 |
along with the three input tokens that we saw before. This will result in four output 00:57:11.860 |
embeddings being generated. We take the last embedding, and this will generate logits corresponding 00:57:17.700 |
to the next position. And also in these logits, actually in this case, 00:57:23.660 |
we are interested only in the logit corresponding to the token RL. Then we take these two logits, 00:57:31.820 |
we know their probability score, because we can run the softmax, and we use them as objective 00:57:38.020 |
for our optimization process, because we want to maximize these probabilities. So the probability 00:57:44.380 |
of selecting this token and this token. We can take their negative log probabilities, so 00:57:52.220 |
once we have run the softmax they become probabilities, we take the negative log of each and sum them up, and 00:57:59.980 |
this sum will become our objective to minimize. Now, if we do it like this, there is no guarantee that the 00:58:07.060 |
inputs that we are optimizing here will actually be embeddings that correspond to some token. 00:58:13.940 |
They may not correspond to any token, because as we saw before for computer vision models, 00:58:18.700 |
our model has some natural inputs that may be here, and maybe we are optimizing something 00:58:24.540 |
that is here, that is out of distribution. So we need to find a way to push these embeddings 00:58:30.500 |
to go in distribution, and one way is to find a regularizer. So something that puts a constraint 00:58:38.540 |
in our optimization problem to push the embeddings in certain directions, and one way to do that 00:58:44.860 |
is whenever we feed these three embeddings that we are trying to optimize to the language 00:58:50.100 |
model, we can calculate their distance from the closest embedding in the vocabulary, and 00:58:57.460 |
use this as a regularizer. So we add this distance to our loss function, and we ask the optimization 00:59:05.060 |
problem to minimize also this distance. This will force our optimization problem to optimize 00:59:11.700 |
the embeddings in such a way that they maximize the likelihood of the next token being GI, 00:59:18.620 |
and the next next token being RL, but at the same time, to produce embeddings that are 00:59:25.500 |
closer to our vocabulary, so that they will actually map to some token in our 00:59:31.700 |
vocabulary. So the optimization will not just produce any embeddings that result in that activation, 00:59:37.380 |
but embeddings that are close to ones that are actually present 00:59:41.820 |
in our vocabulary. And this is how we can generate that kind of map. 00:59:48.820 |
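To make the whole procedure concrete, here is a heavily simplified sketch (it assumes a Hugging Face-style causal LM that accepts inputs_embeds, plus its embedding matrix of shape [vocab_size, d_model]; the actual implementation behind the experiments above differs and uses more tricks):

```python
import torch
import torch.nn.functional as F

def optimize_prompt(model, embedding_matrix, target_ids, n_prompt=3, steps=300, lr=0.1, reg=0.1):
    """Optimize `n_prompt` input embeddings so the model predicts the tokens in `target_ids` next.

    `target_ids` is a list of token ids, e.g. the two tokens making up "girl".
    """
    for p in model.parameters():
        p.requires_grad_(False)                                       # the language model stays frozen

    d_model = embedding_matrix.shape[1]
    prompt = torch.randn(1, n_prompt, d_model, requires_grad=True)    # start from pure noise
    optimizer = torch.optim.Adam([prompt], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        # Append the target token embeddings so we can score them teacher-forcing style
        target_emb = embedding_matrix[target_ids].unsqueeze(0)
        inputs = torch.cat([prompt, target_emb], dim=1)
        logits = model(inputs_embeds=inputs).logits                   # [1, seq, vocab]

        # Negative log-probability of each target token at the position that predicts it
        loss = 0.0
        for i, tok in enumerate(target_ids):
            log_probs = F.log_softmax(logits[0, n_prompt - 1 + i], dim=-1)
            loss = loss - log_probs[tok]

        # Regularizer: distance of each optimized embedding to its nearest vocabulary embedding
        dists = torch.cdist(prompt[0], embedding_matrix)              # [n_prompt, vocab_size]
        loss = loss + reg * dists.min(dim=-1).values.sum()

        loss.backward()
        optimizer.step()

    # Project each optimized embedding onto the closest token in the vocabulary
    return torch.cdist(prompt[0], embedding_matrix).argmin(dim=-1)
```

Restarting this optimization from many random initializations, as in the experiment above, is what produces the frequency map of prompts for each target word. So thank you guys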
for watching my video. I know it has been very demanding, especially for the last part, 00:59:54.180 |
but I will share a notebook in the description of the video that you can use to generate 01:00:01.140 |
the map that we saw before, so you can play with it by yourself. If you like this video, 01:00:07.220 |
please share it with your friends, with your colleagues, and I hope you come back to my channel.