
OpenAI CLIP Explained | Multi-modal ML


Chapters

0:00
9:39 Architecture of CLIP
12:12 Contrastive Training
19:24 Padding Tokens
20:10 Input IDs and the Attention Mask
24:26 Encoding Images
27:07 Cosine Similarity
31:22 Object Detection

Transcript

Today we're talking about what is quite possibly the future of machine learning in both computer vision and NLP. We're talking about a combination of both into a single multi-modal model. That model is CLIP which was built and trained by OpenAI. Now CLIP is open source and it has been around for a little while.

It's been around since the start of 2021, but in the past few months we've seen the adoption of CLIP grow at a pretty insane rate. It has found uses in a load of things, so what we will be focusing on is text or multi-modal search across text and images.

We're also going to talk a little bit about zero-shot classification and zero-shot object detection, and at some point in the future we'll also talk about the use of models like CLIP, or even CLIP itself, in the diffusion models that have become incredibly popular recently, like DALL-E 2, Imagen and Midjourney.

Now to understand why CLIP is so important we can use a paper that was released in 2020 called Experience Grounds Language. In Experience Grounds Language the authors define five different world scopes: corpus, internet, perception, embodiment and social. Now most of the models that we're aware of in NLP (this is an NLP, language-focused paper) were the state of the art until very recently, so you have BERT, T5, GPT and all these different models.

In perception we start to see not just NLP but also computer vision, and that is where we are now with CLIP. Quite interestingly, the next world scope, embodiment, is where you start to get reinforcement learning, and as you keep going you almost see a culmination of all the different disciplines in machine learning coming together into one larger discipline.

Now the difference between these world scopes is mostly in the data being used to train the models. So over in world scope one you have the first models that we saw, things like Word2Vec, which is probably one of the earliest examples of deep learning in NLP. That consisted of training a neural network on an amount of data that was small compared to future datasets.

So a single corpus, for example 100,000 sentences from Wikipedia, may be one example of that. Then going forwards we have the internet-sized corpora. These are built from huge web scrapes across the internet, from many different sources, and the models trained on this were able to pull in a more general understanding of language from purely text data.

Okay so it's a lot of data, but it's still purely text. The next one, world scope three, which is the one we're focusing on, is not just text; in our example it's text and image data. So we're training a model to understand both of these different modalities of information, and this is almost like AI moving from a purely digital, very abstract space to a more realistic, real-world space, because in the real world we don't just rely on text data, we rely on a huge number of sensory inputs.

We have visual, audio, touch and everything else. So we're moving towards that broader scope of inputs from different modalities, where a modality would be something like text, image, audio and so on. For us in the real world, that chaotic ensemble of different sensory inputs is what creates, or trains, our internal model of the outside world.

So it makes sense that this is the direction machine learning and AI may also go in. To achieve this multi-modality, CLIP actually uses two models that are trained to almost speak the same language. So we have these two models: one of them is a text encoder, and one of them is an image encoder.

Both of these models create a vector representation of whatever they are given as input, so the text encoder may get a sentence. That sentence could be "two dogs running across a frosty field", and then we have an image of two dogs running across a frosty field, and CLIP is trained so that the text encoder consumes our sentence and outputs a vector representation that is very closely aligned to what the image encoder outputs for the image of the same concept.

Now by training both of these models to encode these vectors into a similar vector space, we are teaching them to speak the same vector language. This is very abstract; this vector language is a 512-dimensional space, so it's very difficult for us to directly understand what is actually happening there.

But these two models do output patterns that are logical and make sense, and we can see some of this by comparing the similarity between the vectors they output. So we can see that the two vectors for dogs running across a frosty field, both the text vector and the image vector, are within a very similar vector space.

Whereas something else like elephants in the Serengeti, whether it's text or image, is not here with our two dogs running across a frosty field; it's somewhere over here, in a completely different space. So what we can do with that is calculate the similarity between these vectors and identify which ones are similar or not similar according to CLIP.
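As a minimal sketch of that comparison, here is cosine similarity between two vectors in NumPy; the random vectors are just stand-ins for real CLIP outputs.

```python
import numpy as np

# Random 512-dimensional stand-ins for a CLIP text vector and image vector.
text_vec = np.random.rand(512)
image_vec = np.random.rand(512)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product divided by the product of the magnitudes,
    # so only the angle between the vectors matters.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# For a real text-image pair encoded by CLIP, aligned concepts score higher.
print(cosine_similarity(text_vec, image_vec))
```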

From these meaningful vectors that CLIP outputs, we are able to create a content-based image retrieval system. Content-based image retrieval is basically where, using some text, or maybe even another image, we can search for images based on their content, and not just on some textual metadata that has been attached to them.

And unlike other content-based image retrieval systems, CLIP is incredibly good at capturing the meaning across an entire image. So, for example, with our two dogs running across a frosty field, we might describe the background of that image without mentioning that there are two dogs in it, and if our description aligns well with what is actually in that image, we might still return the image based on that.

So we're not just focusing on one thing in the image; CLIP allows us to focus on many things in the image. An example of that: within the dataset I've been using here, there is one single image of the food, a hot dog. So I tried to search for that, and the first image that is returned is a dog eating a hot dog.

Okay so it's pretty relevant, but of course there are no other images of hot dogs in this dataset. So the other images that are returned are quite interesting, because in some way or another they are kind of showing a hot dog. In the first one we have a dog looking pretty cozy in a warm room with a fire in the background.

Then we have a dog in a big wooly jumper, and another dog kind of posing for the camera. So weirdly enough we got a load of "hot dog" images. Maybe it's not exactly what we meant when we said hot dog, but a person could understand the connection.

We can see how that term and those images are related. Now we're not actually restricted to text-to-image search. When we encode our data, when we encode text and when we encode images, we are just creating vectors. So we can search across that space in any direction, with any combination of modalities.

So we could do a text-to-text search or an image-to-image search. We can also do image-to-text search, or we can search everything: we could use some text to search across both text and images. We can go in any direction and use any modality we want there.

Now let's go into a little more detail on what the architecture of CLIP actually looks like. So CLIP, as I mentioned, is these two models, and these two models are trained in parallel. One of them is the text encoder, which is just a generic text encoder of 12 layers, and then on the image encoder side there are two different options, as I've spoken about.

There is a Vision Transformer model and also a ResNet model, and they use a few different sizes for ResNet as well. Both of these encoder models output a single 512-dimensional vector, and the way these models are trained is kind of in the name of CLIP.

So CLIP stands for Contrastive Language-Image Pre-training, and the training used during pre-training is contrastive: it's contrastive pre-training. Now across both NLP and computer vision, large models dominate the state of the art, and the idea behind this is that just by giving a large model a huge amount of data, it can learn general patterns from what it sees and almost internalize a general rule set for the data it sees.

Okay so they manage to recognize general patterns in their modality. In language they may be able to internalize the grammar rules and patterns of the English language. For vision models it may be the general patterns you notice across different scenes and different objects. Now the problem with these models, and the reason they don't already fit together very well, is that they're trained separately.

So by default these state-of-the-art models have no understanding of each other, and that is where CLIP is different; that's what CLIP has brought to the table here. With CLIP, the text and image encoders are trained while considering the context of the other modality: the text encoder is trained considering the concepts learned by the image encoder, and the image encoder does the same for the text encoder. We can almost think of it as the image and text encoders sharing an indirect understanding of the other modality.

Now contrastive training works by taking an image and text pair, for example the two dogs running across a frosty field, putting those through the text encoder and image encoder, and learning to encode them both as closely as possible. For this to work well we also need negative pairs, something to compare against. This is a general rule in contrastive learning: you can't just have positive pairs, because then everything can simply be encoded into the same tiny space and you have no way to separate pairs that are dissimilar.

Okay so we need both positive and negative pairs. We have positive pairs already, and to get negative pairs we can essentially take all the positive pairs in our dataset and swap them around: for the pair t1 and i1, we can mix t1 with different images, so t1 with i2, i3, i4 and so on. We're basically just swapping the pairs, and we can assume those swapped pairs are probably not going to be similar, as long as our dataset is reasonably large.

Occasionally we will get a swapped pair that is actually similar, but as long as our dataset is large enough that this doesn't happen too frequently, it will be a negligible problem for training. So with this idea we can use a loss function that minimizes the difference between positive pairs and maximizes the difference between negative pairs. That looks something like this, where our positive pairs sit along the diagonal of the similarity matrix: the dot products on the diagonal are what we want to maximize, and everything off the diagonal is what we want to push down. The image you see here is the pre-training step for a single batch.

One interesting thing to note here is that if we have a small batch, say a batch size of two, it's going to be very easy for our model to identify which two items are similar and which are not, whereas with 64 items in the batch it is much harder, because the model has to find more nuanced differences between them, and the odds of guessing randomly and guessing correctly are much smaller.
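As a rough sketch of that loss in PyTorch (not OpenAI's exact implementation; the temperature value and batch size here are just illustrative):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(text_emb: torch.Tensor,
                          image_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # Normalize so the dot products below are cosine similarities.
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)

    # Similarity matrix: entry (i, j) compares text i with image j.
    logits = text_emb @ image_emb.T / temperature

    # True pairs sit on the diagonal, so the target "class" for row i is i.
    labels = torch.arange(len(text_emb))

    # Symmetric cross-entropy: pick the right image for each text,
    # and the right text for each image, then average the two losses.
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

# Example with a batch of 8 random 512-dimensional embeddings.
print(clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)))
```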

So a larger batch size is a good thing to aim for in this contrastive pre-training approach. With that, I think we now have a good idea of how CLIP can be used and how it has been trained. So what I really want to do now is show you how you might be able to use it as well.

Now we're going to be using the Vision Transformer version of CLIP. Remember I said there are ResNet and Vision Transformer options for the image encoder; we're going to use the Vision Transformer version. OpenAI have released this model through the Hugging Face library, so we can use it directly from there, which makes it really easy for us to get started with it.

So let's go ahead and do that now. For this we will need to install a few libraries: Transformers, PyTorch and Datasets. Datasets is needed to actually get a dataset, and I've prepared one especially for this, an image-text dataset. There's not much in it, just 21 image-text pairs, and we can see what they look like.

So we have this text: "Aerial shot of a futuristic city with a large motorway". I tried to describe the image as best I could, and that's what I got. There are, as you saw just now, 21 of these image-text pairs in there. So let's go ahead and download and initialize CLIP for our use.
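A minimal sketch of that setup; the dataset ID below is a hypothetical placeholder, so swap in whichever image-text dataset you're actually using (it just needs "text" and "image" columns):

```python
# pip install transformers torch datasets

from datasets import load_dataset

# Hypothetical dataset ID; any small dataset of image-text pairs will do.
data = load_dataset("your-username/image-text-pairs", split="train")

print(data)             # expect something like 21 rows with "text" and "image" features
print(data[0]["text"])  # e.g. "Aerial shot of a futuristic city with a large motorway"
```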

The model ID on Hugging Face is this, so if we were to go to HuggingFace.co we could type that in and find the model there. This is the model we're using from OpenAI, and with it we use two objects: a processor and a model. The model itself is CLIP, and the processor is almost like a pre-processor for both our text and our images.

One thing we would do here: if we have a CUDA device available, we can move our model to the CUDA device. At the moment, if you try to do this with MPS, so if you're on a Mac with Apple Silicon, there are some operations in CLIP that don't function on MPS yet, so I would stick with CPU.
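A sketch of that initialization, assuming the ViT-B/32 checkpoint ("openai/clip-vit-base-patch32") as the model ID:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-base-patch32"  # assumed ViT-B/32 variant of CLIP

model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

# Move the model to a CUDA GPU if one is available; otherwise stay on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
```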

We're only doing inference, so it's still pretty fast. Now, as I was mentioning, the processor is what handles both the text and image preparation that needs to happen before we feed them into the actual encoder models that make up CLIP. For text, it's just going to work like a normal text tokenizer.

A normal text tokenizer for transformer models is used to translate our human-readable text into transformer-readable IDs. So we pass the text here, and we make sure we say there are no images included, because the processor can handle both images and text at the same time.

We could do that here as well, but I want to show you them separately, just to show what they're actually doing. We need to set padding to true, and that is because different sentences can have different lengths; you have something like "hello world" and whatever I wrote before up here, the aerial shot of a futuristic city.

These two sentences have different lengths, and a transformer model needs to see the same length for all of the text within a single batch. So basically what the processor is going to do is add what are called padding tokens, up to the length of the longest sequence within that batch of text items, because in here we have those 21 sentences.

So that's all the padding is doing there. Then we return those as PyTorch tensors and finally move them to whichever device we're using. I'm using CPU here, so it's not actually necessary to do this, but I'm doing it in case you do the same on a CUDA-enabled device.
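That preprocessing step looks something like this, continuing with the processor, device and dataset objects from the sketches above:

```python
# Pull the captions out of the dataset (assumes a "text" column, as above).
text = data["text"]

tokens = processor(
    text=text,
    images=None,          # text only for now; images are handled separately later
    padding=True,         # pad every sentence to the longest one in the batch
    return_tensors="pt",  # return PyTorch tensors
).to(device)

print(tokens.keys())  # expect input_ids and attention_mask
```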

So from there we have these input IDs and an attention mask; let's have a quick look at what those are. If we go into the tokens and look at the input IDs, you see we get all these integer values, and a lot of them have this 49407 at the end. Those are the padding tokens. They are not represented as strings but as these integer numbers, and we know they're the padding tokens because they appear several times at the end of each sequence, and none of the sequences I fed in had any repeated words at the end.

So we know those are the padding tokens, and we can also see there's a start-of-sequence token there as well. Everything in between those are tokens that represent a word, or part of a word, from our original text. Next to the input IDs is the attention mask, and here you can see it's just ones and zeros. The ones represent real tokens, real words from our text inputs, and the zeros represent where the processor has added padding tokens. This is used by the internal mechanisms of the text transformer to know which tokens to pay attention to and which ones to ignore, because we don't want to focus on the padding tokens; they're meaningless, they're just there to make sure we have the same size inputs going into our transformer model.

So we can go down, and now that we have our tokens, what we do is use CLIP to encode all of them with get_text_features, passing in our tokens. There's a .to(device) here, but I already moved the tokens to the device, so I don't need to do that again and can actually remove it.
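In code, that encoding step looks something like this, continuing with the model and tokens from above:

```python
import numpy as np
import torch

with torch.no_grad():
    # Encode the tokenized sentences into 512-dimensional text embeddings.
    text_emb = model.get_text_features(**tokens)

print(text_emb.shape)                  # e.g. torch.Size([21, 512])
print(text_emb.min(), text_emb.max())  # clearly not normalized yet

# Normalize each vector by its L2 norm so that dot product similarity
# behaves like cosine similarity later on.
text_emb_np = text_emb.detach().cpu().numpy()
text_emb_np = text_emb_np / np.linalg.norm(text_emb_np, axis=1, keepdims=True)
```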

Okay, so what do we get here? We get 21, so 21 text inputs, which makes sense, and 512-dimensional vectors. These are our text embeddings, representing each of the text sentences I just gave. One other thing I wanted to point out is the min and max values: they're pretty big, so these vectors are clearly not normalized. What this means depends on what you're doing. If you want to compare these vectors, you need to make sure you're not using a similarity metric that considers the magnitude of your vectors; you need to only consider the angle, which you can do with cosine similarity. The alternative is to normalize the vectors, and then you can also use dot product similarity.

So to normalize, if you wanted to use dot product similarity, you would do this: we detach our text embeddings from the PyTorch graph, move them to CPU if needed (we don't actually need to here, but I do it anyway), and convert them into a NumPy array. Then we calculate the value to normalize each vector by, so for each vector we calculate a number, and that number is what we divide the vector by. After that you can see the minimum and maximum are about -0.15 and +0.53, so neither goes past -1 or +1.

Now when it comes to encoding images we do a very similar thing. Images are also pre-processed using the processor, as we did with our text, but with slightly different parameters. The reason we're processing these images is that CLIP expects a certain image size, and it expects the pixel values to be normalized as well. RGB images by default have pixel values ranging from 0 to 255, so we need to normalize those, and we also need to resize the images. You can see that here: the first image has this size, it's a pretty big image, and this is the width and the height of it. So here we take all the images and process them, making sure we set text to None, and that outputs only one tensor, the pixel values tensor, so we extract that straight out and also move it to the hardware device, in this case just CPU.

Now let's have a look at these images. We can see that we have this array, or tensor, with three color channels, so this is the RGB, and it has a height and width of 224, so it's been squeezed down into a smaller size, and we have 21 of them because we fed in all of our images. So this is how we use the processor, and it is just resizing and normalizing our images ready for the Vision Transformer encoder of CLIP. Then, very similar to before where we used get_text_features, now we use get_image_features and pass in those processed images. As you might expect, these image embeddings are also not normalized, you can see that here, and, as we would also expect, they have the same dimensionality as our text embeddings. That means we can compare them, but before comparing them, as before, we normalize them, with the same process again, and we can see that the values have changed.

Cool, so what we now want to do is calculate the similarity between all of our image embeddings and all of our text embeddings. We can do that in a few different ways: we have cosine similarity and dot product similarity. The reason we can use dot product similarity is because we normalized, but I'm going to show you both, so that if you don't normalize you can just use cosine similarity like we do here. Cosine similarity is actually just a dot product between the text embeddings and image embeddings as the numerator, with the norms of both of those in the denominator. That's all it is, so it's pretty simple.

If we plot those similarity scores we get this. We would expect the values along this diagonal to be the highest similarity values, because they represent the true pairs between the images and the text. Now we have some that are not quite there, like here, where there is an image-text pair that is more similar to something else. I put these together very quickly, so they're not always going to be perfect, and there is also a lot of overlap between these images: there are several images of city skylines, and a lot of the time I described those as futuristic cities with a big motorway, or something along those lines, so that is probably where we're getting that from.

Now if we were to calculate the dot product similarity between these, we would expect it to be the same. From the dot product calculation we do seem to get a very similar set of similarities, but are they the same? Well, if we go down here we can see that they pretty much are. We can't do a straight comparison, cosine similarity equals dot product similarity, because the numbers are actually slightly different, but they're only slightly different because of floating point error: these are all floating point numbers, so we get very, very small differences. You can see that here: we've calculated the difference between the numbers in the two arrays, and the minimum difference is zero, which is what we'd expect where the numbers are exactly the same, while the maximum difference is 2.98e-8, so 0.0000000298, a very small number. With that we know it's just floating point error between those two similarity arrays, which is pretty cool to see. And we can use this exact concept of comparing with cosine similarity or dot product similarity to actually search through all of our images with a text prompt, for example.
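A sketch of that image encoding and similarity comparison, continuing from the objects above (the plotting step is left out; data["image"] is assumed to hold PIL images):

```python
import numpy as np
import torch

# Resize each image to 224x224 and normalize the 0-255 pixel values.
images = processor(
    text=None,
    images=data["image"],
    return_tensors="pt",
)["pixel_values"].to(device)

print(images.shape)  # e.g. torch.Size([21, 3, 224, 224])

with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=images)

# Normalize the image embeddings just like the text embeddings.
image_emb_np = image_emb.detach().cpu().numpy()
image_emb_np = image_emb_np / np.linalg.norm(image_emb_np, axis=1, keepdims=True)

# Cosine similarity: dot products divided by the product of the vector norms.
cos_sim = (text_emb_np @ image_emb_np.T) / (
    np.linalg.norm(text_emb_np, axis=1, keepdims=True)
    * np.linalg.norm(image_emb_np, axis=1, keepdims=True).T
)

# Because both sets are already normalized, a plain dot product gives the
# same matrix up to floating point error.
dot_sim = text_emb_np @ image_emb_np.T
print(np.abs(cos_sim - dot_sim).min(), np.abs(cos_sim - dot_sim).max())
```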
But that's not all CLIP is good for. CLIP also has amazing performance as a zero-shot model, and not just for one task but across different tasks. It performs incredibly well out of the box for classification, and we'll go through this in more detail in a future video, but the idea is that, given the set of classes from an image classification dataset, you can modify the class names a little to make them read more like sentences, and then use this same idea of comparing all of your text representations of the classes against a set of images from the dataset. You just calculate the similarity between them, and the text embedding, or class embedding as you can think of it, that gets the highest similarity to your image is the predicted class. So you have zero-shot classification, just like that, super easy.

Another use case is object detection. Let's say you're looking for a cat or a butterfly in an image. When you're looking for the cat, you're going to use a chunk of text that says "a fluffy cat", you encode that with CLIP, and you get your text embedding. Then what you do is break your image up into all these little patches and slide through all of those patches; you can include an overlap so you're not missing anything between patches. For each part of the image you slide through, you extract that patch, process it through CLIP, and compare its encoding against the text embedding you created for "a fluffy cat". What we will see is that the patches of the image that contain whatever you've described will have a higher similarity score. You can then overlay those scores back onto your image, and you will find that CLIP is able to essentially identify where in your image a specific object is, with you just describing that object using a natural language prompt.

Now these are only a few use cases of CLIP, and they only really scratch the surface of what is actually possible with this model. We also see it being used in, like I said, the diffusion models like DALL-E, which is a great example of how powerful CLIP can actually be. So that's it for this introduction to CLIP. I hope it's been useful. As I said, we're going to go into more detail on the different use cases of CLIP and how to apply it for those use cases in future videos, but until then, that's it for now. Thank you very much for watching, and I will see you again in the next one. Bye.