CLIP - Paper explanation (training and inference)
Hello guys, welcome to this new video about CLIP.
CLIP has become really popular recently because of Stable Diffusion, and it was quite revolutionary for its time, especially because it used a novel way of connecting text and images.
First of all, we will explore what CLIP is and what it means to connect text and images.
And secondly, or actually before that, we will also explore why we needed CLIP in the first place.
First of all, the task we are concerned with is image classification.
Before we had CLIP, we had convolutional neural networks that were trained to classify images into a fixed set of classes.
For example, let's look at this picture from Google's website.
Before, if we had pictures of cats and dogs and we wanted to classify them into two classes, we had to create a convolutional neural network with a lot of convolutions, max pooling, and finally a fully connected layer that gives the highest score to the class that best represents the input image.
For example, the activation of the cat output would be the highest in this case, because the input picture is a cat.
And if it were a dog, then the output for the dog would have the highest value.
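To make this concrete, here is a minimal PyTorch sketch of the kind of cat/dog classifier being described: a few convolution and max-pooling blocks followed by a fully connected layer that outputs one score per class. The layer sizes and the 224x224 input resolution are my own assumptions for illustration, not details from the video.

```python
import torch.nn as nn

# a minimal sketch of the classifier described above: convolutions and max pooling,
# then a fully connected layer that gives one score per class (cat, dog)
cat_dog_classifier = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 56 * 56, 2),  # assumes 224x224 RGB inputs; 2 outputs: cat, dog
)
```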
This was working well, actually. The problem with this way of proceeding is that we need a lot of pictures: we need a big dataset, we need a lot of labeled data.
So we need a lot of pictures of cats and a lot of pictures of dogs, and someone has to spend time to build this dataset, to label it, and to verify that the labels are correct.
This is okay if we have a small number of classes and they are quite different from each other.
However, in some domains it's not easy to build such a dataset, and it's actually quite costly to build.
Think of medical research, in which the pictures have to be labeled by, for example, a doctor, or in any case someone who has knowledge of the domain.
So you cannot just ask a random person to classify cancer and non-cancer images from medical scans.
So building this dataset was really expensive.
Plus, what they saw is that a model trained on such a dataset could not generalize to other tasks.
So, for example, a classifier that was trained on dogs and cats could not generalize easily to new classes, and it would perform really badly on other kinds of classification tasks.
So let's explore how CLIP solves this problem.
CLIP, just like the name says, is about connecting images and text: it stands for Contrastive Language-Image Pre-training.
And the way it works is shown here.
So basically, CLIP is made of two encoders: one text encoder and one image encoder.
What do we feed to CLIP? First of all, we give it a batch of texts and the corresponding images, which means that the first item in the text batch corresponds to the first image in the image batch.
So "Pepper the Aussie pup" actually corresponds to this picture.
The authors of CLIP got these pictures and texts from the internet: they created a dataset of 400 million image-text pairs collected from the internet, in which each image is supposedly well described by its accompanying text.
Usually when you find a picture on the internet, you don't find just the picture, you also find some description of the picture alongside it.
Especially on social networks: for example, people going on a trip somewhere will write something about the content of the picture.
So for example here, we don't just write "dog", we actually describe the picture.
This is why they call it natural language supervision.
So the way it works is: they take the texts in the batch and pass them through the text encoder, which gives us some features for each text.
These features are then multiplied by another matrix, a learned projection, so that the text features end up with the dimension of the shared embedding space.
They do the same with the images: they pass the images through the image encoder, and then multiply the resulting features by another matrix so that the image features have the same dimension as the text features.
Then they build this dot product, this cosine similarity matrix we can see here, in which they calculate the cosine similarity between each possible combination of text and image.
What do we expect? Since we know that the ground truth is that this picture matches the first text, the second text matches the second picture, and the third text matches the third picture, we want all the items on the diagonal, the pairs we know match each other, to have the highest similarity, while we want all the other pairs to have a lower similarity.
In other words, the entries on the diagonal should be the highest ones.
And actually, this code is also written in the paper, which we can see here.
So here we have a batch of images; we pass it through the image encoder to get some embeddings of dimension d_i.
Then we do the same with the text: we have n texts, we pass them through the text encoder, and we get some features of dimension d_t.
Then we multiply the image features and the text features by the two projection matrices, so that each of them ends up with a feature size of d_e.
And then we calculate the logits. Next, how should we calculate the loss?
Well, basically, what we expect is that in this row, for example, the item at position one has the highest cosine similarity; in this row, we expect the second one to have the highest similarity; and in the third row, the third one.
The same goes for the columns: in the first column we expect this one to have the highest value, in the second column this one, and in the third column this item should have the highest cosine similarity.
And this explains the choice of the loss function here.
So basically, we just generate a range between zero and n, and these are our expected labels: we want that particular position in each row, and in each column, to have the highest value.
Then we compare these labels with the logits along the first axis and along the second axis, which basically means by rows and by columns; we sum the two losses and divide by two.
And this is how the training works for this contrastive training.
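To make this concrete, here is a minimal PyTorch sketch of the symmetric contrastive loss we just walked through, following the pseudocode in the paper. The function name is mine, and it assumes the raw image and text features have already been produced by the two encoders.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, W_i, W_t, logit_scale):
    # project both modalities into the shared embedding space and L2-normalize,
    # so that the dot products below are cosine similarities
    I_e = F.normalize(image_features @ W_i, dim=-1)   # [n, d_e]
    T_e = F.normalize(text_features @ W_t, dim=-1)    # [n, d_e]

    # scaled pairwise cosine similarities [n, n]
    # (in CLIP, logit_scale is the exponential of a learned temperature)
    logits = logit_scale * (I_e @ T_e.T)

    # the i-th image matches the i-th text, so the target for row/column i is i
    labels = torch.arange(logits.shape[0], device=logits.device)

    # cross-entropy by rows (images -> texts) and by columns (texts -> images)
    loss_i = F.cross_entropy(logits, labels)
    loss_t = F.cross_entropy(logits.T, labels)
    return (loss_i + loss_t) / 2
```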
Inference is quite easy, and also quite efficient, I have to say.
First of all, we don't need to recompute anything from the text encoder, because we can calculate it once and reuse it.
What we do is: we create a prompt, so "a photo of a {something}", and we create a list of the classes we expect to work with.
In this case, we can work with plane, car, dog, bird, etc.
We plug each of these possible classes into the prompt and generate the corresponding text features.
So for example, we take "a photo of a plane" and generate its features into T1, then "a photo of a car" into T2, "a photo of a dog" into T3, and so on.
We compute all of these features once, keep them aside and save them, so we can reuse them for the next classification: we don't have to compute them every time we want to classify an image.
Then what we do is take the picture of the dog, pass it through the image encoder, and calculate its features.
We then multiply the text features we computed before with the image features, and the pair with the highest value gives us the chosen label, the text corresponding to the image.
As we can see, it's quite efficient, because we only have to compute the features of the image once and then, of course, do the multiplication.
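As a concrete illustration of this zero-shot procedure, here is a sketch using OpenAI's open-source clip package; the model name, the class list and the dog.jpg file are just placeholders for this example.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# build the class prompts once; their text features can be cached and reused
classes = ["plane", "car", "dog", "bird"]
prompts = clip.tokenize([f"a photo of a {c}" for c in classes]).to(device)

with torch.no_grad():
    text_features = model.encode_text(prompts)                  # [num_classes, d_e]
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # encode the image we want to classify
    image = preprocess(Image.open("dog.jpg")).unsqueeze(0).to(device)
    image_features = model.encode_image(image)                  # [1, d_e]
    image_features /= image_features.norm(dim=-1, keepdim=True)

    # cosine similarity between the image and every class prompt
    similarity = (image_features @ text_features.T).squeeze(0)
    print(classes[similarity.argmax().item()])                  # hopefully "dog"
```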
On the website, the CLIP authors also talk about the problems they had with previous models: for example, ImageNet was built from millions of images and required over 25,000 workers to annotate 14 million images.
CLIP, on the other hand, is doing it nearly for free, so to speak, because we are learning directly from the internet, and there is a lot of material available on the internet.
This model will also be used by Stable Diffusion and all these generative systems that basically download data from the internet and train models on it.
And the same is done for GPT and all the other language models.
And here we can see some examples of classification.
CLIP is also highly efficient compared to the other models.
And the best aspect of CLIP is that it works very well zero-shot.
So, for example, CLIP is able to handle tasks like action recognition without task-specific training.
But it's not efficient on every task, of course: on tasks that are difficult even for a human in a zero-shot setting, CLIP does not perform very well either.
And on tasks that are totally unrelated to its training data it is also not very good, for example counting the objects in an image.
Another note I wanted to add is: how do they extract the features?
So we have one text encoder here and one image encoder. As the image encoder, the authors use either a ResNet or a Vision Transformer, and from it we just extract the features from the last layer; that's it for the image side.
For the text encoder, the authors choose a transformer, but they only use the encoder part of it.
And what they do is take the features corresponding to the end-of-text token from the last layer.
Actually, the way the authors wrote it was not very clear to me.
So the text sequence is bracketed with start-of-sentence and end-of-sentence tokens, and the activations of the highest layer of the transformer at the end-of-sentence token are treated as the feature representation of the text.
This basically means that, if we look at the attention paper's architecture, they take the features from here, corresponding to the end-of-sentence token, which in the code is done like this.
So what they do is: they pass the text through the transformer and apply the normalization.
Then, for each text, they check where the end-of-sentence token was in the original sequence, and they take the features corresponding to that position.
And that's what they multiply with the W matrix to obtain the text features that are then compared with the image features.
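Roughly, that step looks like the following simplified sketch of the text-encoding function; the real open-source implementation also reorders the sequence dimension before and after the transformer, which I omit here for clarity.

```python
import torch

def encode_text(text_tokens, token_embedding, positional_embedding,
                transformer, ln_final, text_projection):
    # text_tokens: [batch_size, n_ctx] integer token ids
    x = token_embedding(text_tokens) + positional_embedding  # [batch, n_ctx, width]
    x = transformer(x)
    x = ln_final(x)  # final layer normalization

    # in the open-source code, the end-of-text token has the highest token id in
    # each sequence, so argmax over the ids finds its position; take the features
    # at that position and project them with the learned text projection matrix
    eot_positions = text_tokens.argmax(dim=-1)
    return x[torch.arange(x.shape[0]), eot_positions] @ text_projection
```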
I was not really concerned with the results here, which you can read in the paper and also on the website, along with the applications.
What I wanted to show in this small video is how CLIP works and how we can train a model like this.
Thank you for listening and enjoy the rest of your day.