
CLIP - Paper explanation (training and inference)


Transcript

Hello guys, welcome to this new video about CLIP. CLIP is a model from OpenAI released in 2021. It has become really popular recently because of Stable Diffusion, and it was quite revolutionary for its time, especially because it used a novel way of connecting text and images. First of all, we will explore what CLIP is and what it means to connect text and images.

Actually, before that, we will also explore why we needed CLIP in the first place. Okay. The task we are concerned with is image classification, so we are in the domain of classification. Before CLIP, we had convolutional neural networks that were trained to classify images into classes.

For example, let's look at this picture from Google's website. If we had pictures of cats and dogs and we wanted to classify them into two classes, we had to create a convolutional neural network with a lot of convolutions, max pooling, and finally a fully connected layer that gives the highest score to the class that best matches the input image.

For example, the activation for the cat output would be the highest in this case, because we are giving it a picture of a cat. And if it was a dog, then the output for the dog would have the highest value. This was working well, actually. The problem with this way of proceeding is that we need a lot of pictures: we need a big, labeled dataset.
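To make this concrete, here is a minimal sketch of the kind of classifier described above: a couple of convolutions with max pooling and a final fully connected layer that gives one score per class. The layer sizes and the two-class cat/dog setup are placeholders for illustration, not any specific published architecture.

```python
import torch
import torch.nn as nn

class TinyCatDogClassifier(nn.Module):
    """Illustrative only: convolutions + max pooling + a fully connected head."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)  # assumes 224x224 inputs

    def forward(self, x):
        x = self.features(x)       # [batch, 32, 56, 56]
        x = x.flatten(1)           # flatten for the fully connected layer
        return self.classifier(x)  # one score per class; the highest score is the prediction

# Usage: scores = TinyCatDogClassifier()(torch.randn(1, 3, 224, 224))
```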

So we need a lot of pictures of cats and a lot of pictures of dogs, and someone has to spend time building this dataset, labeling it, and verifying that the labels are actually correct. This is okay if we have a small number of classes and they are quite different from each other.

However, in some domains it's not easy to build such a dataset, and it's actually quite costly. Think of medical research, in which the pictures have to be labeled by, for example, a doctor, or anyway someone who has knowledge of the domain. You cannot just ask a random person to classify cancer and non-cancer images from medical devices.

So the problem was that building these datasets was really expensive. Plus, what they saw is that models trained on them could not generalize to other tasks. For example, a classifier that was trained on dogs and cats could not easily generalize to other types of classes, and it would perform really badly on other kinds of classification.

So let's explore how CLIP solves this problem. CLIP, just like the name says, connects images and text. It is a model from OpenAI, and the name stands for Contrastive Language-Image Pre-training. The way it works is shown here: basically, CLIP is made of two encoders, one text encoder and one image encoder.

What do we feed to CLIP? First of all, we give it a batch of text and the corresponding images, which means that the first item in the text batch corresponds to the first image in the image batch. So "Pepper the Aussie pup" actually corresponds to this picture.

And where do we get all these pictures? The authors of CLIP got these pictures and this text from the internet: they created a dataset of 400 million image-text pairs collected from the internet, where the images were supposedly well described by the people who posted them. Usually when you find a picture on the internet, you don't find just the picture, you also find some description of the picture along with it.

Especially on social networks, for example, people going on a trip somewhere will write something about the content of the picture. And this is not just a single word: for example, here we don't just write "dog", we actually describe the picture. This is why they call it natural language supervision.

So the way it works is: they take the text in the batch of text and pass it through the text encoder, which gives us some features for this text. These features are then multiplied by another matrix so that the features end up with a particular dimension.

Then they do the same with the images: they pass the images through the image encoder, and they multiply these features by another matrix so that the image features have the same dimension as the text features. Then they build this cosine similarity matrix, which we can see here, in which they calculate the cosine similarity between each possible combination of text and image.
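Here is a small sketch of this step; the dimensions, matrix names, and random tensors are placeholders of my own, not the authors' code. The idea is just to project both sets of features into a shared embedding space with two learned matrices, normalize them, and compute the n x n cosine-similarity matrix.

```python
import torch
import torch.nn.functional as F

n, d_img, d_txt, d_emb = 8, 2048, 512, 512     # placeholder sizes

image_features = torch.randn(n, d_img)          # output of the image encoder
text_features = torch.randn(n, d_txt)           # output of the text encoder

W_i = torch.randn(d_img, d_emb)                 # learned image projection matrix
W_t = torch.randn(d_txt, d_emb)                 # learned text projection matrix

image_emb = F.normalize(image_features @ W_i, dim=-1)   # [n, d_emb], unit norm
text_emb = F.normalize(text_features @ W_t, dim=-1)     # [n, d_emb], unit norm

similarity = image_emb @ text_emb.T             # [n, n] cosine similarities
# similarity[i, j] compares image i with text j; the true pairs sit on the diagonal.
```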

And what do we expect? Since we know that the ground truth is that the first picture matches with the first text, the second text matches with the second picture, and the third text matches with the third picture, we care about the items on the diagonal.

We want the pairs we know match each other to be the most similar, to have the highest similarity, while we want the other pairs to have a lower similarity, even zero. But we want the ones on the diagonal to have the highest one. And actually, this code is also written in the paper, which we can see here.

So here we have a batch of images; we pass it through the image encoder to get the embeddings from the image encoder, of dimension d_i. Then we do the same with the text: we have n texts, we pass them to the text encoder, and we get some features from this text encoder of dimension d_t.

Then we multiply the features from the image and the features from the text with the two matrices so that each of them has a resulting feature size of d_e. We compute the cosine similarities for each pair and we calculate the logits. Then we calculate the loss. How should we calculate the loss?

Well, basically, what we expect is that, going by rows, in the first row we expect the item in position one to have the highest cosine similarity, in the second row we expect the second one to have the highest similarity, and in the third row we expect the third one. The same goes for the columns: in the first column we expect the first item to have the highest value, in the second column the second one, and in the third column the third one.

And this explains the choice of the loss function here. Basically, we just generate a range between zero and n, and this is our expected label: we want that particular position in the row or in the column to have the highest value.

We compare these labels with the logits along the first axis and along the second axis, which basically means by rows or by columns. Then we sum the two losses and divide by two, so we take the average of the two losses, and this is our loss function.
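Here is a minimal, self-contained sketch of that symmetric loss, assuming a precomputed n x n similarity matrix and a placeholder temperature value. It mirrors the pseudocode described above, but it is not the paper's exact listing.

```python
import torch
import torch.nn.functional as F

n = 8
similarity = torch.randn(n, n)           # the [n, n] cosine-similarity matrix from the previous sketch
logit_scale = torch.tensor(100.0)        # stands in for the exponentiated learned temperature
logits = logit_scale * similarity        # scaled pairwise similarities
labels = torch.arange(n)                 # 0..n-1: the i-th image matches the i-th text

loss_per_image = F.cross_entropy(logits, labels)    # softmax over each row (all texts for one image)
loss_per_text = F.cross_entropy(logits.T, labels)   # softmax over each column (all images for one text)
loss = (loss_per_image + loss_per_text) / 2         # average of the two losses, as described above
```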

And this is how the contrastive training works. Then how do we do inference? Inference is quite easy, and quite efficient too, I have to say. Imagine we have a picture of a dog. We don't need to recompute anything from the text encoder for each image, because we can calculate it only once.

First of all, what we do is create a prompt, "a photo of a {something}", and we create a list of classes that we expect to work with. In this case, we can work with planes, cars, dogs, birds, etc. We plug each of these possible classes into the prompt and generate the corresponding features for the prompt.

For example, we take "a photo of a plane" and generate its features into T1, then "a photo of a car" and put its features into T2, then "a photo of a dog" into T3. We compute all these features and keep them aside; we save them, and we can reuse them even for the next classification.

We don't have to compute them every time we want to classify an image; we do this job only once. Then we take the picture of the dog, pass it through the image encoder, calculate its features, and multiply what we computed before with the features of the image.

And the one with the highest value will be the chosen label, the chosen text corresponding to this picture. And this is how inference works. As we can see, it's quite efficient too, because we only have to compute the text features once, and then of course we compute the image features and multiply.
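Putting the inference procedure together, here is a short sketch assuming the openai/clip Python package; the class names, the prompt template, and the image path are placeholders. The point is that the prompt features are computed once and can be cached, and each new image only goes through the image encoder before being compared against them.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["plane", "car", "dog", "bird"]          # placeholder class list
prompts = [f"a photo of a {c}" for c in class_names]   # the prompt template from the talk
text_tokens = clip.tokenize(prompts).to(device)

with torch.no_grad():
    # Text features: computed once, can be cached and reused for every image.
    text_features = model.encode_text(text_tokens)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    # Image features: computed per image.
    image = preprocess(Image.open("dog.jpg")).unsqueeze(0).to(device)  # placeholder path
    image_features = model.encode_image(image)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)

    similarity = image_features @ text_features.T      # one cosine similarity per class
    predicted_class = class_names[similarity.argmax().item()]

print(predicted_class)
```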

Okay, on the website we can also see that the CLIP authors talk about the problems they had with previous models. For example, ImageNet was built using millions of images, and they say it required over 25,000 workers to annotate 14 million images for 22,000 object categories.

So CLIP does it nearly for free, we could say, because we are learning from the internet, and there are a lot of resources available on the internet. And this model will also be used by Stable Diffusion and all these generative systems that just download data from the internet and train models on it.

The same is done for GPT and all the other language models. Here we can see some examples of classification; I didn't load all of them. CLIP is also highly efficient compared to the other models, and the best aspect of CLIP is that it can work very well zero-shot.

For example, CLIP is able to handle action recognition and even OCR. But it is not efficient in every task, of course. For example, on some tasks that are difficult even for a human as a zero-shot task, CLIP does not perform very well.

And on some tasks that are totally unrelated to its training data it is also not performing very well, for example counting the objects in an image, etc. And yeah, this is it. Another note I wanted to add is: how do they extract the features?

So we have one text encoder here and one image encoder. As the image encoder, the authors use a ResNet and a Vision Transformer, and for those we just extract the features from the last layer, and that's it. About the text encoder: what the authors actually do is choose a transformer, but they only use the encoder part of the transformer, of course.

And what they do is take the features corresponding to the end-of-text token from the last layer. Basically, it's written here; actually, it was not very clear to me the way the authors wrote it: the text sequence is bracketed with start-of-sequence and end-of-sequence tokens, and the activations of the highest layer of the transformer at the end-of-sequence token are treated as the feature representation of the text.

This basically means that, if we look at the attention paper, they take the features from here, corresponding to the end-of-sequence token. In the code, you can see it here, in the file model.py: they pass the text through the transformer and then do the layer normalization.

Then, for each of the texts, they check where the position of the end-of-sequence token was in the original text. This is how they do it: they get the features corresponding to that position, and that's what they multiply with the W matrix to obtain the text features, and then they do the cosine similarity.
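Here is a hedged paraphrase of that step as a standalone function; the function name and argument names are my own, not from model.py. After the transformer and the final layer norm, the feature kept for each sequence is the one at the position of the end-of-text token, and it is then multiplied by the text projection matrix.

```python
import torch

def text_features_at_eot(token_ids, transformer_output, text_projection):
    """token_ids: [batch, seq_len] integer tokens;
    transformer_output: [batch, seq_len, width] activations after the final layer norm;
    text_projection: [width, d_emb] learned projection matrix."""
    batch_indices = torch.arange(transformer_output.shape[0])
    # In the CLIP repo the end-of-text position is found with argmax over the token ids,
    # because the end-of-text token has the highest id in the vocabulary.
    eot_positions = token_ids.argmax(dim=-1)
    eot_features = transformer_output[batch_indices, eot_positions]  # [batch, width]
    return eot_features @ text_projection                            # [batch, d_emb] text features
```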

I hope my explanation was clear. I was not very concerned with the results, which you can read in the paper and also on the website, along with the applications. What I actually wanted to show in this small video is how CLIP works and how we train a model similar to this.

Thank you for listening and enjoy the rest of your day.