CLIP - Paper explanation (training and inference)
Hello guys, welcome to this new video about CLIP.
CLIP has become really popular recently because of Stable Diffusion, and it was quite revolutionary for its time, especially because it used a novel way of connecting text and images.
First of all, we will explore what CLIP is and what it means to connect text and images.
And secondly, or actually before that, we will also explore why we needed CLIP in the first place.
First of all, the task we are concerned with is image classification.
Before we had CLIP, we had convolutional neural networks that were trained to classify images into a fixed set of classes.
For example, let's look at this picture from Google's website.
Before, if we had pictures of cats and dogs and we wanted to classify them into two classes, we had to create a convolutional neural network with a lot of convolutions, max pooling, and finally a fully connected layer that gives the highest score to the class that best represents the input image.
For example, the activation of the cat output would be the highest in this case, because the input picture is a cat.
And if it were a dog, then the output for the dog would have the highest value.
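To make this concrete, here is a minimal PyTorch sketch of the kind of cat/dog classifier being described: a few convolution and max-pooling blocks followed by a fully connected layer that outputs one score per class. The layer sizes and the 224x224 input resolution are my own assumptions for illustration, not details from the video.

```python
import torch.nn as nn

# a minimal sketch of the classifier described above: convolutions and max pooling,
# then a fully connected layer that gives one score per class (cat, dog)
cat_dog_classifier = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 56 * 56, 2),  # assumes 224x224 RGB inputs; 2 outputs: cat, dog
)
```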
This was working well, actually. The problem with this way of proceeding is that we need a lot of pictures: we need a big dataset, we need a lot of labeled data.
So we need a lot of pictures of cats and a lot of pictures of dogs, and someone has to spend time to build this dataset, to label it, and to verify that the labels are correct.
This is okay if we have a small number of classes and they are quite different from each other.
However, in some domains it's not easy to build such a dataset, and it's actually quite costly to build.
Think of medical research, in which the pictures have to be labeled by, for example, a doctor, or in any case someone who has knowledge of the domain.
So you cannot just ask a random person to classify cancer and non-cancer images from medical scans.
So building this dataset was really expensive.
Plus, what they saw is that a model trained on such a dataset could not generalize to other tasks.
So, for example, a classifier that was trained on dogs and cats could not generalize easily to new classes, and it would perform really badly on other kinds of classification tasks.
So let's explore how CLIP solves this problem.
CLIP, just like the name says, is about connecting images and text: it stands for Contrastive Language-Image Pre-training.
And the way it works is shown here.
So basically, CLIP is made of two encoders: one text encoder and one image encoder.
What do we feed to CLIP? First of all, we give it a batch of texts and the corresponding images, which means that the first item in the text batch corresponds to the first image in the image batch.
So "Pepper the Aussie pup" actually corresponds to this picture.
The authors of CLIP got these pictures and texts from the internet: they created a dataset of 400 million image-text pairs collected from the internet, in which each image is supposedly well described by its accompanying text.
Usually when you find a picture on the internet, you don't find just the picture, you also find some description of the picture alongside it.
Especially on social networks: for example, people going on a trip somewhere will write something about the content of the picture.
So for example here, we don't just write "dog", we actually describe the picture.
This is why they call it natural language supervision.
So the way it works is: they take the texts in the batch and pass them through the text encoder, which gives us some features for each text.
These features are then multiplied by another matrix, a learned projection, so that the text features end up with the dimension of the shared embedding space.
They do the same with the images: they pass the images through the image encoder, and then multiply the resulting features by another matrix so that the image features have the same dimension as the text features.
Then they build this dot product, this cosine similarity matrix we can see here, in which they calculate the cosine similarity between each possible combination of text and image.
What do we expect? Since we know that the ground truth is that this picture matches the first text, the second text matches the second picture, and the third text matches the third picture, we want all the items on the diagonal, the pairs we know match each other, to have the highest similarity, while we want all the other pairs to have a lower similarity.
In other words, the entries on the diagonal should be the highest ones.
And actually, this code is also written in the paper, which we can see here.
So here we have a batch of images; we pass it through the image encoder to get some embeddings of dimension d_i.
Then we do the same with the text: we have n texts, we pass them through the text encoder, and we get some features of dimension d_t.
Then we multiply the image features and the text features by the two projection matrices, so that each of them ends up with a feature size of d_e.
And then we calculate the logits. Next, how should we calculate the loss?
Well, basically, what we expect is that in this row, for example, the item at position one has the highest cosine similarity; in this row, we expect the second one to have the highest similarity; and in the third row, the third one.
The same goes for the columns: in the first column we expect this one to have the highest value, in the second column this one, and in the third column this item should have the highest cosine similarity.
And this explains the choice of the loss function here.
So basically, we just generate a range between zero and n, and these are our expected labels: we want that particular position in each row, and in each column, to have the highest value.
Then we compare these labels with the logits along the first axis and along the second axis, which basically means by rows and by columns; we sum the two losses and divide by two.
And this is how the training works for this contrastive training.
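To make this concrete, here is a minimal PyTorch sketch of the symmetric contrastive loss we just walked through, following the pseudocode in the paper. The function name is mine, and it assumes the raw image and text features have already been produced by the two encoders.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, W_i, W_t, logit_scale):
    # project both modalities into the shared embedding space and L2-normalize,
    # so that the dot products below are cosine similarities
    I_e = F.normalize(image_features @ W_i, dim=-1)   # [n, d_e]
    T_e = F.normalize(text_features @ W_t, dim=-1)    # [n, d_e]

    # scaled pairwise cosine similarities [n, n]
    # (in CLIP, logit_scale is the exponential of a learned temperature)
    logits = logit_scale * (I_e @ T_e.T)

    # the i-th image matches the i-th text, so the target for row/column i is i
    labels = torch.arange(logits.shape[0], device=logits.device)

    # cross-entropy by rows (images -> texts) and by columns (texts -> images)
    loss_i = F.cross_entropy(logits, labels)
    loss_t = F.cross_entropy(logits.T, labels)
    return (loss_i + loss_t) / 2
```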
Inference is quite easy, and also quite efficient, I have to say.
First of all, we don't need to recompute anything from the text encoder, because we can calculate it once and reuse it.
What we do is: we create a prompt, so "a photo of a {something}", and we create a list of the classes we expect to work with.
In this case, we can work with plane, car, dog, bird, etc.
We plug each of these possible classes into the prompt and generate the corresponding text features.
So for example, we take "a photo of a plane" and generate its features into T1, then "a photo of a car" into T2, "a photo of a dog" into T3, and so on.
We compute all of these features once, keep them aside and save them, so we can reuse them for the next classification: we don't have to compute them every time we want to classify an image.
Then what we do is take the picture of the dog, pass it through the image encoder, and calculate its features.
We then multiply the text features we computed before with the image features, and the pair with the highest value gives us the chosen label, the text corresponding to the image.
As we can see, it's quite efficient, because we only have to compute the features of the image once and then, of course, do the multiplication.
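As a concrete illustration of this zero-shot procedure, here is a sketch using OpenAI's open-source clip package; the model name, the class list and the dog.jpg file are just placeholders for this example.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# build the class prompts once; their text features can be cached and reused
classes = ["plane", "car", "dog", "bird"]
prompts = clip.tokenize([f"a photo of a {c}" for c in classes]).to(device)

with torch.no_grad():
    text_features = model.encode_text(prompts)                  # [num_classes, d_e]
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # encode the image we want to classify
    image = preprocess(Image.open("dog.jpg")).unsqueeze(0).to(device)
    image_features = model.encode_image(image)                  # [1, d_e]
    image_features /= image_features.norm(dim=-1, keepdim=True)

    # cosine similarity between the image and every class prompt
    similarity = (image_features @ text_features.T).squeeze(0)
    print(classes[similarity.argmax().item()])                  # hopefully "dog"
```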
On the website, the CLIP authors also talk about the problems they had with previous models: for example, ImageNet was built from millions of images and required over 25,000 workers to annotate 14 million images.
CLIP, on the other hand, is doing it nearly for free, so to speak, because we are learning directly from the internet, and there is a lot of material available on the internet.
This model will also be used by Stable Diffusion and all these generative systems that basically download data from the internet and train models on it.
And the same is done for GPT and all the other language models.
And here we can see some examples of classification.
CLIP is also highly efficient compared to the other models.
And the best aspect of CLIP is that it works very well zero-shot.
So, for example, CLIP is able to handle tasks like action recognition without task-specific training.
But it's not efficient on every task, of course: on tasks that are difficult even for a human in a zero-shot setting, CLIP does not perform very well either.
And on tasks that are totally unrelated to its training data it is also not very good, for example counting the objects in an image.
Another note I wanted to add is: how do they extract the features?
So we have one text encoder here and one image encoder. As the image encoder, the authors use either a ResNet or a Vision Transformer, and from it we just extract the features from the last layer; that's it for the image side.
For the text encoder, the authors choose a transformer, but they only use the encoder part of it.
And what they do is take the features corresponding to the end-of-text token from the last layer.
Actually, the way the authors wrote it was not very clear to me.
So the text sequence is bracketed with start-of-sentence and end-of-sentence tokens, and the activations of the highest layer of the transformer at the end-of-sentence token are treated as the feature representation of the text.
This basically means that, if we look at the attention paper's architecture, they take the features from here, corresponding to the end-of-sentence token, which in the code is done like this.
So what they do is: they pass the text through the transformer and apply the normalization.
Then, for each text, they check where the end-of-sentence token was in the original sequence, and they take the features corresponding to that position.
And that's what they multiply with the W matrix to obtain the text features that are then compared with the image features.
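Roughly, that step looks like the following simplified sketch of the text-encoding function; the real open-source implementation also reorders the sequence dimension before and after the transformer, which I omit here for clarity.

```python
import torch

def encode_text(text_tokens, token_embedding, positional_embedding,
                transformer, ln_final, text_projection):
    # text_tokens: [batch_size, n_ctx] integer token ids
    x = token_embedding(text_tokens) + positional_embedding  # [batch, n_ctx, width]
    x = transformer(x)
    x = ln_final(x)  # final layer normalization

    # in the open-source code, the end-of-text token has the highest token id in
    # each sequence, so argmax over the ids finds its position; take the features
    # at that position and project them with the learned text projection matrix
    eot_positions = text_tokens.argmax(dim=-1)
    return x[torch.arange(x.shape[0]), eot_positions] @ text_projection
```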
I was not really concerned with the results here, which you can read in the paper and also on the website, along with the applications.
What I wanted to show in this small video is how CLIP works and how we can train a model like this.
Thank you for listening and enjoy the rest of your day.