
CLIP - Paper explanation (training and inference)



00:00:00.000 | Hello guys, welcome to this new video about CLIP.
00:00:04.320 | CLIP is a model from OpenAI released in 2021.
00:00:08.560 | And it has become really popular recently because of Stable Diffusion.
00:00:13.280 | And it was quite revolutionary for its time, especially because it used a novel way of
00:00:18.000 | connecting text and images.
00:00:19.560 | First of all, we will explore what CLIP is and what it means to connect text
00:00:24.800 | and images.
00:00:25.920 | But actually, before that, we will also explore why we needed CLIP in the first
00:00:31.680 | place.
00:00:32.680 | Okay.
00:00:33.680 | First of all, the task we are concerned about is image classification.
00:00:37.120 | So we are in the domain of classification.
00:00:39.660 | And before we had CLIP, we had these convolutional neural networks that were trained to classify
00:00:45.720 | images into classes.
00:00:48.880 | For example, let's see this picture from Google's website.
00:00:53.880 | And we see that, for example, before, if we had pictures of cats or dogs, and we wanted
00:00:59.000 | to classify them into two classes, we had to create this convolutional neural network
00:01:05.800 | with a lot of convolutions, max pooling, and finally, a fully connected layer that would
00:01:11.680 | give the maximum score to the class that is most representative of the input image.
00:01:15.760 | For example, the activation for the cat output would be the highest in this case, because
00:01:22.920 | we are giving a picture of the cat.
00:01:24.740 | And if it was a dog, then the output for the dog would have the highest value.
00:01:31.140 | This was working well, actually. The problem with this way of proceeding
00:01:36.760 | is that we need a lot of pictures, a big dataset, a lot of labeled data.
00:01:43.440 | So we need a lot of pictures of cats, a lot of pictures of dogs.
00:01:47.000 | And someone has to spend time to build this dataset, to label the images, and to verify that
00:01:51.800 | these labels are actually correct.
00:01:54.480 | This is okay if we have a small number of classes and they are quite different from
00:01:59.880 | each other.
00:02:00.880 | However, in some domains, it's not easy to build such a dataset.
00:02:04.480 | And it's actually quite costly to build.
00:02:07.280 | Think of medical research, in which the pictures have to be labeled by, for example, a doctor,
00:02:13.520 | or anyway by someone who has knowledge of the domain.
00:02:16.840 | So you cannot just ask a random person to classify cancer and non-cancer images from
00:02:23.720 | medical devices.
00:02:25.520 | So the problem was that building this dataset was really expensive.
00:02:29.760 | And plus, what they saw is that the resulting classifiers could not generalize to other tasks.
00:02:35.520 | So for example, a classifier that was trained on dogs and cats could not generalize easily
00:02:42.500 | to other types of classes.
00:02:46.400 | And it would perform really badly on other kinds of classification.
00:02:52.680 | So let's explore how CLIP solves this problem.
00:02:56.220 | So CLIP, just like the name says, connects images and text.
00:03:03.080 | It is a model from OpenAI.
00:03:07.040 | And basically, the name stands for Contrastive Language-Image Pre-training.
00:03:11.280 | And the way it works is shown here.
00:03:14.880 | So basically, CLIP is made of two encoders, one text encoder and one image encoder.
00:03:23.240 | What do we feed to CLIP? First of all, we give it a batch of texts and the corresponding
00:03:27.920 | images, which means that the first item in this text batch corresponds to the first
00:03:34.360 | image in this image batch.
00:03:37.620 | So "Pepper the Aussie pup" actually corresponds to this picture.
00:03:41.640 | And where do we get all these pictures?
00:03:43.400 | The authors of CLIP got these pictures and texts from the internet: they created
00:03:48.000 | a dataset of 400 million image-text pairs collected from the internet, where the images were supposedly well
00:03:56.840 | described by their authors.
00:04:00.280 | Usually when you find a picture on the internet, actually, you don't find just the picture,
00:04:04.200 | you also find some description of the picture behind it.
00:04:07.240 | Especially on social networks, for example, people going on a trip somewhere will
00:04:11.400 | write something about the content of the picture.
00:04:14.720 | And this is not just a single word.
00:04:17.960 | So for example, here, we don't just write dog, we actually describe the picture.
00:04:21.960 | So this is why they call it natural language supervision.
00:04:26.320 | So the way it works is: they take the texts in the batch of text and pass them
00:04:31.360 | through the text encoder, which gives us some features for this text.
00:04:36.200 | And these features are then multiplied by another matrix so that
00:04:40.160 | the features are projected to a particular dimension.
00:04:43.840 | And then they do the same with the images.
00:04:46.560 | So they pass the images through the image encoder.
00:04:49.640 | And then they multiply these features by another matrix so that the image features have the same dimension
00:04:56.960 | as the text features.
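To make this two-encoder setup concrete, here is a minimal PyTorch sketch of the two towers with their projection matrices; the encoder modules, dimension names, and class name are placeholders of mine, not the exact architecture from the paper.

```python
import torch
import torch.nn as nn

class TwoTowerCLIP(nn.Module):
    """Minimal sketch: two encoders plus two linear projections into a shared space."""

    def __init__(self, image_encoder, text_encoder, d_i, d_t, d_e):
        super().__init__()
        self.image_encoder = image_encoder  # e.g. a ResNet or ViT backbone returning [n, d_i]
        self.text_encoder = text_encoder    # e.g. a transformer returning [n, d_t]
        self.W_i = nn.Parameter(torch.randn(d_i, d_e) * d_i ** -0.5)  # image projection
        self.W_t = nn.Parameter(torch.randn(d_t, d_e) * d_t ** -0.5)  # text projection

    def forward(self, images, texts):
        I_f = self.image_encoder(images)  # [n, d_i] image features
        T_f = self.text_encoder(texts)    # [n, d_t] text features
        I_e = I_f @ self.W_i              # [n, d_e] projected image embeddings
        T_e = T_f @ self.W_t              # [n, d_e] projected text embeddings
        return I_e, T_e
```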
00:04:59.240 | Then they build this dot product, this cosine similarity matrix, which we can see here,
00:05:06.280 | in which they calculate the cosine similarity between each possible combination of text
00:05:11.520 | and image.
00:05:12.520 | And what do we expect?
00:05:13.520 | Since we know that the ground truth is that this picture
00:05:18.320 | matches the first text, the second text matches the second picture, and
00:05:23.240 | the third text matches the third picture, we want all the items on the diagonal,
00:05:29.480 | so the pairs we know match each other, to be the most similar, to have
00:05:36.480 | the highest similarity, while we want the other pairs to have a lower similarity, even
00:05:43.600 | zero.
00:05:45.000 | So we want the ones on the diagonal to have the highest similarity.
00:05:50.160 | And actually, this code is also written in the paper, which we can see here. Let me check
00:05:57.200 | which page... Yeah, here.
00:06:01.380 | So here we have a batch of n images, and we pass it through the image encoder to get
00:06:07.720 | some embeddings,
00:06:08.760 | the embeddings from the image encoder, of dimension d_i.
00:06:14.520 | Then we do the same with the text: we have n texts, and we pass them to the text encoder;
00:06:19.600 | we will get some features from this text encoder of dimension d_t.
00:06:25.080 | Then we multiply the features from the image and the features from the text with the two
00:06:30.080 | matrices so that each of them will have a resulting feature size of d_e.
00:06:36.560 | We compute the cosine similarities for each pair.
00:06:39.840 | And we calculate the logits. Then what do we do? We calculate the loss. How should
00:06:48.040 | we calculate the loss?
00:06:49.040 | Well, basically, what we expect is that in the rows, in the first row, for example, we
00:06:56.840 | expect this item, so the one at position one, to have the highest cosine similarity; in the second
00:07:04.120 | row, we expect the second one to have the highest similarity;
00:07:08.680 | and in the third row, we expect the third one. And the same for the columns: in the first
00:07:12.800 | column, we expect this one to have the highest; in the second one, we expect this one to have the highest;
00:07:17.200 | and in the third one, we expect this item to have the highest cosine similarity.
00:07:22.360 | And this explains the choice of the loss function here.
00:07:25.880 | So basically, we just generate a range between zero and n.
00:07:30.720 | And this is actually our expected label:
00:07:34.660 | we want that particular position in the row or in the column to have
00:07:39.280 | the highest value.
00:07:40.280 | And we compare this with the logits computed on the first axis and on the second axis,
00:07:46.880 | which basically means by rows or by columns. Then we sum the two losses and we divide by
00:07:53.040 | two, so we take the average of the two losses.
00:07:54.040 | And this is our loss function.
00:07:56.880 | And this is how the training works for this contrastive training.
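As a companion to the numpy-style pseudocode in the paper, here is a minimal PyTorch sketch of this symmetric contrastive loss; the function name and the learnable log-temperature are my own choices, not taken from the official code.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(I_e, T_e, log_temperature):
    # I_e, T_e: [n, d_e] projected image and text embeddings for n matching pairs
    I_e = F.normalize(I_e, dim=-1)  # unit norm, so dot products become cosine similarities
    T_e = F.normalize(T_e, dim=-1)
    logits = I_e @ T_e.T * log_temperature.exp()  # [n, n] scaled pairwise similarities
    labels = torch.arange(logits.shape[0], device=logits.device)  # the diagonal holds the true pairs
    loss_images = F.cross_entropy(logits, labels)    # rows: each image against all texts
    loss_texts = F.cross_entropy(logits.T, labels)   # columns: each text against all images
    return (loss_images + loss_texts) / 2            # average of the two losses
```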
00:08:03.600 | Then how do we do inference?
00:08:07.800 | Inference is quite easy, and also quite efficient, I have to say. First of all, because,
00:08:13.320 | imagine we have a picture of a dog.
00:08:15.840 | We don't need to recalculate anything from the text encoder, which we can calculate
00:08:22.200 | only once.
00:08:23.200 | So first of all, what we do is: we create a prompt, so "a photo of a something".
00:08:29.240 | And we create a list of classes that we expect to work with.
00:08:35.880 | So in this case, we can work with planes, cars, dogs, birds, etc.
00:08:39.760 | So we pass all of these possible classes into this prompt and generate the corresponding features
00:08:46.240 | for each prompt.
00:08:47.240 | So for example, we will have "a photo of a plane" and generate its features into T1,
00:08:52.160 | then "a photo of a car", and we will have another feature vector and put it into T2, "a photo of
00:08:56.960 | a dog", and put it into T3. We compute all these features, and we keep them aside,
00:09:03.720 | we save them, and we can reuse them even for the next classification.
00:09:08.000 | We don't have to compute them every time we want to classify an image.
00:09:11.800 | So we do this job only once.
00:09:14.160 | And then what we do is we take the picture of the dog, we pass it through the image encoder,
00:09:17.960 | we calculate its features, and then we basically multiply what we have computed before with
00:09:22.680 | the features of the image.
00:09:26.120 | And the one with the highest value will be the chosen label, the chosen text corresponding
00:09:32.520 | to this picture.
00:09:34.400 | And this is how the inference works.
00:09:36.140 | As we can see, it's also quite efficient, because for each new image we only have to compute the features
00:09:40.400 | of the image once, and then of course we just have to multiply with the text features we already saved.
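Here is a minimal sketch of this zero-shot classification procedure using the openai/clip package; the class list, the prompt template, and the file name dog.jpg are placeholders I chose for this example.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Build the prompts once, encode them, and cache the text features for reuse.
classes = ["plane", "car", "dog", "bird"]
prompts = clip.tokenize([f"a photo of a {c}" for c in classes]).to(device)
with torch.no_grad():
    text_features = model.encode_text(prompts)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# For each new image, only the image features need to be computed.
image = preprocess(Image.open("dog.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)

similarity = image_features @ text_features.T   # [1, num_classes] cosine similarities
print(classes[similarity.argmax().item()])      # expected to print "dog"
```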
00:09:46.960 | And okay, on the website, we can also see that the CLIP authors were talking about
00:09:55.560 | the problems they had with the previous models. For example, ImageNet was
00:10:01.360 | built using millions of images, and, as they say, it required over 25,000 workers to annotate 14
00:10:09.040 | million images for 22,000 object categories.
00:10:12.520 | So actually, CLIP is doing it nearly for free, if we could say so, because we are
00:10:17.440 | learning from the internet, and there are a lot of resources available on the internet.
00:10:21.360 | And this approach will actually also be used by Stable Diffusion and all these generative
00:10:25.680 | systems that just download data from the internet and train models on it.
00:10:30.920 | And the same is done for GPT and all the other language models.
00:10:36.840 | And here we can see some examples of classification.
00:10:42.600 | I didn't load all of them.
00:10:45.080 | And CLIP is also highly efficient compared to the other models.
00:10:48.880 | And the best aspect of CLIP is that it can work very well zero-shot.
00:10:55.400 | So for example, CLIP is able to perform, I don't know, action recognition,
00:11:06.000 | even OCR.
00:11:07.640 | But it is not efficient in every task, of course. For example, on some tasks that
00:11:13.200 | are difficult even for a human as a zero-shot task, of course, CLIP is not performing very
00:11:20.200 | well.
00:11:21.760 | And on some tasks that are totally unrelated to its training data, it is also not
00:11:26.360 | performing very well.
00:11:29.920 | For example, counting the objects in an image, etc.
00:11:33.700 | And yeah, this is it.
00:11:37.480 | Another note I wanted to add is how they extract the features.
00:11:45.400 | So we have one text encoder here and one image encoder. As the image encoder, the authors
00:11:52.820 | use a ResNet and a Vision Transformer.
00:11:56.720 | And for them, we just extract the features from the last layer, and that's it. Now about the
00:12:01.400 | text encoder.
00:12:03.440 | What the authors do actually is they choose a transformer, but they only use the encoder
00:12:09.880 | part of the transformer, of course.
00:12:12.080 | And what they do is they take the features corresponding to the end of text token from
00:12:20.260 | the last layer.
00:12:21.920 | So basically, it's written here.
00:12:24.640 | Actually, it was not very clear to me the way the authors wrote it.
00:12:30.080 | So the text sequence is bracketed with start-of-sentence and end-of-sentence tokens, and
00:12:33.840 | the activations of the highest layer of the transformer at the end-of-sentence token are
00:12:38.720 | treated as the feature representation of the text.
00:12:42.320 | This basically means that, if we look at the transformer from the attention paper, they take the features from
00:12:49.080 | here, corresponding to the end-of-sentence token, which in the code is done like this.
00:12:58.720 | You can see here, this is the file model.py.
00:13:02.320 | It's done here.
00:13:03.720 | So what they do is: they pass the text through the transformer, then they apply the layer normalization.
00:13:12.760 | Then, for each of the texts, they check where the position of the
00:13:18.920 | end-of-sentence token was in the original text.
00:13:21.560 | This is how they do it.
00:13:23.240 | And they take the features corresponding to that position.
00:13:26.120 | And that's what they multiply with the W matrix to obtain the text features and then
00:13:31.960 | do the cosine similarity.
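The trick used in model.py is that the end-of-text token has the highest token id of any token in the sequence, so an argmax over the token ids finds its position. Here is a simplified sketch of just this pooling step, with variable names of my own choosing:

```python
import torch

def pool_eot_features(x, token_ids, text_projection):
    # x: [batch, n_ctx, d_model] transformer output after the final layer norm
    # token_ids: [batch, n_ctx] token ids of the input text; the end-of-text token
    # has the highest id, so argmax over the sequence dimension locates it
    eot_positions = token_ids.argmax(dim=-1)                    # [batch]
    eot_features = x[torch.arange(x.shape[0]), eot_positions]   # [batch, d_model]
    return eot_features @ text_projection                       # project to the joint embedding size
```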
00:13:35.040 | I hope my explanation was clear.
00:13:37.160 | I was not very concerned with the actual results, which you can read in the paper, and
00:13:42.280 | you can also read them on the website, along with the applications.
00:13:46.280 | Actually, what I wanted to show in this small video is how
00:13:54.120 | CLIP works and how we train a model similar to this.
00:13:58.120 | Thank you for listening and enjoy the rest of your day.