OpenAI CLIP Explained | Multi-modal ML
Chapters
0:00
9:39 Architecture of CLIP
12:12 Contrastive Training
19:24 Padding Tokens
20:10 Input IDs and an Attention Mask
24:26 Encoding Images
27:07 Cosine Similarity
31:22 Object Detection
Today we're talking about what is quite possibly the future of machine learning in both computer vision and NLP: the combination of both into a single multi-modal model. That model is CLIP, which was built and trained by OpenAI. CLIP is open source and has been around for a little while now, since the start of 2021, but in the past few months we've seen its adoption grow at a pretty insane rate. It has found uses in a load of things, so what we will be focusing on is text, or multi-modal, search across text and images. We're also going to talk a little bit about zero-shot classification and zero-shot object detection, and at some point in the future we'll also cover the use of models like CLIP, or even CLIP itself, in the diffusion models that have become incredibly popular recently, like DALL-E 2, Imagen and Midjourney.

To understand why CLIP is so important, we can use a paper released in 2020 called Experience Grounds Language. In it, the authors define five different world scopes: corpus, internet, perception, embodiment and social.
Most of the models we're aware of in NLP (this is a language-focused paper) sit in the first two world scopes; until very recently these were the state of the art, so you have BERT, T5, GPT and so on. Perception is where we start to see not just NLP but also computer vision, and that is where we are now with CLIP. Quite interestingly, the next world scope, embodiment, is where you start to get reinforcement learning, and as you continue you almost see a culmination of all the different disciplines of machine learning coming together into one larger discipline. The distinction between these world scopes is mostly about the data used to train the models. In world scope one you have the first models we saw, things like Word2Vec, which is probably one of the earliest examples of deep learning in NLP. That consisted of training a neural network on an amount of data that was small compared to future datasets: a single corpus, for example 100,000 sentences from Wikipedia.
Going forwards, we have the internet-sized corpora. These are built from huge web scrapes across many different sources, and the models trained on them were able to pull in a more general understanding of language from purely text data. So it's a lot of data, but it's still only text. The next one, world scope three, which we're focusing on, is not just that: in our example it's text and image data. We're training a model to understand both of these modalities of information, and this is almost like AI moving from a purely digital, very abstract space to a more realistic, real-world space, because in the real world we don't just rely on text data; we rely on a huge number of sensory inputs: visual, audio, touch and everything else. So we're moving towards that broader scope of inputs from different modalities, where a modality is something like text, images or audio, and so on. For us in the real world, that chaotic ensemble of sensory inputs is what creates, or trains, our internal model of the outside world, so it makes sense that this is the direction machine learning and AI may also go in.
To achieve this multi-modality, CLIP actually uses two models that are trained to almost speak the same language. One of them is a text encoder, the other an image encoder, and both create a vector representation of whatever they are given. The text encoder might get the sentence "two dogs running across a frosty field", and we might also have an image of two dogs running across a frosty field. CLIP is trained so that when the text encoder consumes that sentence, it outputs a vector representation that is very closely aligned with what the image encoder outputs for an image of the same concept. By training both models to encode into a similar vector space, we are teaching them to speak the same vector language. It's very abstract: this vector language is a 512-dimensional space, so it's very difficult for us to directly understand what is actually happening in there. But the two models do output patterns that are logical and make sense, and we can see some of this by comparing the similarity between the vectors they output. The two vectors for dogs running across a frosty field, the text vector and the image vector, sit within a very similar part of the vector space, whereas something else like "elephants in the Serengeti", whether text or image, is not near our two dogs; it's somewhere else entirely, in a completely different region. What we can do with that is calculate the similarity between these vectors and identify which ones are similar or dissimilar according to CLIP.
From these meaningful vectors that CLIP outputs, we are able to build a content-based image retrieval system. Content-based image retrieval is where, using some text, or maybe even another image, we search for images based on their content, and not just on some textual metadata that has been attached to them. Unlike other content-based image retrieval systems, CLIP is incredibly good at capturing the meaning across an entire image. For example, with our two dogs running across a frosty field, we might describe only the background of that image without mentioning that there are two dogs in it, and if our description aligns well with what is actually in the image, we may still get that image back. We're not just focusing on one thing in the image; CLIP allows us to focus on many things in it. As an example, within the dataset I've been using here there is only one single image of the food, a hot dog. I searched for that, and the first image returned is a dog eating a hot dog, so it's pretty relevant. But of course there are no other images of hot dogs in this dataset, so the other images returned are quite interesting, because in some way or another they also show a "hot dog": a dog looking pretty cozy in a warm room with a fire in the background, then a dog in a big woolly jumper, and another dog posing for the camera. So weirdly enough we got a load of "hot dog" images; maybe it's not exactly what we meant when we said hot dog, but a person could understand how that term and those images are related.

Now, we're not actually restricted to text-to-image search. When we encode our text and when we encode our images, we are just creating vectors, so we can search across that space in any direction with any combination of modalities. We could do a text-to-text search or an image-to-image search; we can also do image-to-text search, or we could use some text to search across both text and images. We can go in any direction and use any modality we want.
Now let's go into a little more detail on what the architecture of CLIP actually looks like. As I mentioned, CLIP is these two models, trained in parallel. One of them is the text encoder, a fairly generic 12-layer text transformer. On the image encoder side there are two options: a vision transformer (ViT) model and a ResNet model, and a few different sizes of ResNet are used as well. Both encoder models output a single 512-dimensional vector, and the way they are trained is in the name: CLIP stands for Contrastive Language-Image Pre-training, so the training used during pre-training is contrastive.

Across both NLP and computer vision, large models dominate the state of the art. The idea behind this is that just by giving a large model a huge amount of data, it can learn general patterns from what it sees and almost internalize a general rule set for that data. In language, a model may internalize the grammar rules and patterns of English; for vision models, it may be the general patterns that show up across different scenes and objects. The problem with these models, and the reason they don't already fit together very well, is that they are trained separately. By default, these state-of-the-art models have no understanding of each other, and that is what CLIP has brought to the table. With CLIP, the text and image encoders are trained while considering the context of the other modality: the text encoder is trained considering the concepts learned by the image encoder, and the image encoder does the same for the text encoder. We can almost think of the image and text encoders as sharing an indirect understanding of the other modality.
Now, contrastive training works by taking an image and text pair, for example the two dogs running across a frosty field, feeding them into the text encoder and image encoder, and learning to encode them as closely as possible. For this to work well we also need negative pairs, something to compare against. This is a general rule in contrastive learning: you can't have only positive pairs, because then everything can just be encoded into the same tiny region of space and the model never learns how to separate pairs that are dissimilar. So we need both positive and negative pairs. To get negative pairs, we can essentially just take all the positive pairs in our dataset and swap them around: given the pair (t1, i1), we can mix t1 with different images, so t1 with i2, i3, i4 and so on. As long as our dataset is reasonably large, we can assume those swapped pairs are probably not going to be similar. Occasionally we might create a pair that actually is similar, but as long as that doesn't happen too frequently it won't really affect training; it's a negligible problem.

With this idea, we can use a loss function that minimizes the difference between positive pairs and maximizes the difference between negative pairs. That looks something like this: the positive pairs sit along the diagonal of the similarity matrix, and we want to maximize those dot products while pushing the off-diagonal ones down. The image you see here is actually the pre-training for a single batch. One interesting thing to note: if we have a small batch, say a batch size of two, it's going to be very easy for the model to identify which two items are similar and which are not, whereas with 64 items in a batch it's much harder, because the model has to find more nuanced differences between them, and the odds of guessing correctly at random are much smaller. So a larger batch size is a good thing to aim for in this contrastive pre-training approach.
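To make that concrete, here is a minimal sketch of the symmetric contrastive loss described above, written in PyTorch. The fixed temperature value is an assumption for illustration (the real CLIP model learns the temperature as a parameter), and `text_emb` and `image_emb` stand in for a batch of outputs from the two encoders.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, image_emb, temperature=0.07):
    # text_emb, image_emb: (batch_size, dim) outputs of the two encoders
    # L2-normalize so that dot products are cosine similarities
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    # (batch_size, batch_size) similarity matrix; the diagonal holds the positive pairs
    logits = text_emb @ image_emb.T / temperature
    targets = torch.arange(logits.shape[0], device=logits.device)
    # cross-entropy pushes each diagonal entry up and the off-diagonal entries down,
    # computed in both directions (text -> image and image -> text)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```

The larger the batch, the more off-diagonal negatives each positive pair has to beat, which is exactly why a bigger batch size makes the task harder and the training signal stronger.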
With that, I think we have a good idea of how CLIP can be used and how it has been trained. What I really want to do now is show you how you might be able to use it yourself. We're going to be using the vision transformer version of CLIP; remember I said there are ResNet and vision transformer options for the image encoder. OpenAI have released this model through the Hugging Face library, so we can use it directly from there, which makes it really easy to get started. So let's go ahead and do that now.

For this we will need to install a few libraries: Transformers, PyTorch and Datasets. We need Datasets to actually get a dataset, and I've prepared one especially for this: an image-text dataset. There isn't much in it, just 21 image-text pairs, and we can see what they look like. We have the text "Aerial shot of a futuristic city with a large motorway"; I tried to describe each image as best I could, and that's what I got. As you just saw, there are 21 of these image-text pairs in there.
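As a rough sketch, the setup looks something like this. The dataset ID and column names here are assumptions standing in for the demo dataset shown in the video, so substitute whichever image-text dataset you are using:

```python
# pip install transformers torch datasets

from datasets import load_dataset

# hypothetical dataset ID for the 21 image-text pairs used in the video
data = load_dataset("jamescalam/image-text-demo", split="train")

text = data["text"]   # assumed column name: 21 captions
imgs = data["image"]  # assumed column name: 21 PIL images
print(len(text), text[0])
```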
Now let's go ahead and download and initialize CLIP for our use. We can look the model up by its ID on huggingface.co and find the checkpoint released by OpenAI. With this model we use two objects: a processor and the model itself. The model is CLIP; the processor is almost like a pre-processor for both our text and our images. One other thing we do here: if a CUDA device is available, we can move our model to it. At the moment, if you try to do this with MPS, so if you're on a Mac with Apple silicon, there are some operations inside CLIP that don't yet work on MPS, so I would stick with CPU. We're only doing inference, so it's still pretty fast.
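A minimal initialization sketch, assuming the ViT-B/32 checkpoint that OpenAI published on Hugging Face (`openai/clip-vit-base-patch32`):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-base-patch32"

model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

# use a CUDA GPU if available, otherwise stay on CPU (MPS currently not recommended)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
```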
As I mentioned, the processor handles both the text and image preparation that needs to happen before we feed anything into the actual encoder models that make up CLIP. For text, it works like a normal text tokenizer: a tokenizer for transformer models translates our human-readable text into transformer-readable token IDs. We pass the text and set images to None; if we gave the processor both images and text it would process them at the same time, and we could do that here as well, but I want to show the two steps separately so you can see what each is doing. We also need to set padding to True, because different sentences can have different lengths: "hello world" and "aerial shot of a futuristic city" are different lengths, and a transformer model needs every sequence in a single batch to have the same length. So what the processor does is add what are called padding tokens, up to the length of the longest sequence within that batch of text items (here, our 21 sentences). Then we return everything as PyTorch tensors, and finally move them to whichever device we're using. I'm using CPU here, so that last step isn't actually necessary, but I'm doing it in case you run the same code on a CUDA-enabled device.
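A sketch of that text preprocessing step, assuming `text` is the list of captions loaded earlier:

```python
tokens = processor(
    text=text,            # list of caption strings
    images=None,          # text only for now
    padding=True,         # pad every caption to the longest one in the batch
    return_tensors="pt",  # return PyTorch tensors
).to(device)
```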
From there we have input IDs and an attention mask, so let's have a quick look at what those are. If we look at the input IDs, we get all these integer values, and you'll see that a lot of the sequences have this 49407 repeated at the end: those are the padding tokens. They're not represented as strings but as integer IDs, and we know they're padding because they appear several times at the end of each sequence, and none of the sentences I fed in ended with the same repeated words. We can also see a start-of-sequence token, and everything in between is a token representing a word, or part of a word, from our original text.

The attention mask is just ones and zeros. The ones represent real tokens, the words that were in our text inputs; the zeros mark where the processor has added padding tokens. It's used by the internal mechanisms of the text transformer to know which tokens to pay attention to and which to ignore, because we don't want the model to focus on the padding tokens: they're meaningless and are only there to make sure we have same-size inputs going into the transformer model.
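You can inspect both tensors directly; the exact padding token ID depends on the tokenizer, but with this checkpoint it should show up as the value repeated at the end of each shorter sequence (49407 in the video):

```python
print(tokens["input_ids"].shape)    # (21, length_of_longest_caption)
print(tokens["input_ids"][0])       # integer token IDs, padded at the end
print(tokens["attention_mask"][0])  # 1 = real token, 0 = padding to ignore
```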
After we have our tokens, we use CLIP to encode them all with get_text_features: we pass in our tokens (I already moved them to the device earlier, so I don't need to call .to(device) again here and can actually remove that). What do we get back? A tensor of shape 21 by 512: 21 text inputs, which makes sense, each encoded as a 512-dimensional vector. These are our text embeddings, one for each of the sentences I wrote. One other thing I want to point out is the min and max values: they're pretty big, so these vectors are clearly not normalized. Whether that matters depends on what you're doing. If you want to compare these vectors, you either need a similarity metric that ignores the magnitude of the vectors and only considers the angle, which is what cosine similarity does, or you can normalize the vectors, after which dot product similarity works too. To normalize, so we can use dot product similarity later, we do this: detach the text embeddings from the PyTorch graph, move them to CPU if needed (not strictly necessary here, but I do it anyway), and convert them into a NumPy array. Then we calculate the value to normalize each vector by, its norm, and divide each vector by that number. After that, the minimum and maximum are about -0.15 and +0.53, so nothing goes below -1 or above +1.
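Putting that together, a sketch of the text encoding and normalization steps:

```python
import numpy as np

with torch.no_grad():
    text_emb = model.get_text_features(**tokens)
print(text_emb.shape)  # torch.Size([21, 512])

# detach from the graph, move to CPU, convert to NumPy,
# then divide each vector by its L2 norm
text_emb = text_emb.detach().cpu().numpy()
text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
print(text_emb.min(), text_emb.max())  # now bounded within [-1, 1]
```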
When it comes to encoding images, we do a very similar thing. Images are also pre-processed using the processor, just with slightly different parameters. The reason we process the images is that CLIP expects a certain image size, and it expects the pixel values to be normalized. RGB images by default have pixel values ranging from 0 to 255; we need to normalize those, and we also need to resize the images. You can see that here: the first image is pretty big (that's its width and height). So we take all the images and process them, making sure text is set to None, which means the processor only outputs one tensor, the pixel values, so we extract that directly and move it to the hardware device, in this case just CPU. If we now look at the result, we have a tensor with three color channels (RGB) and a height and width of 224, so each image has been squeezed into a smaller size, and there are 21 of them because we fed in all of our images. That's all the processor is doing here: resizing and normalizing our images, ready for the vision transformer encoder of CLIP.
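A sketch of that image preprocessing step, assuming `imgs` is the list of PIL images from the dataset:

```python
pixel_values = processor(
    text=None,           # images only this time
    images=imgs,         # list of PIL images
    return_tensors="pt",
)["pixel_values"].to(device)

print(pixel_values.shape)  # torch.Size([21, 3, 224, 224])
```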
Very similar to before, where we used get_text_features, we now use get_image_features and pass in those pixel values. As you might expect, the resulting image embeddings are not normalized either, and, as we would also expect, they have the same dimensionality as our text embeddings, which means we can compare them. But before comparing them we normalize them, the same process as before, and we can see that the values have changed.
Cool. What we now want to do is calculate the similarity between all of our image embeddings and all of our text embeddings. We can do that in a couple of ways: cosine similarity or dot product similarity. The only reason we can use dot product similarity is that we normalized the vectors, but I'm going to show you both, so that if you don't normalize you can just use cosine similarity, as we do here. Cosine similarity is actually just a dot product between the text embeddings and the image embeddings in the numerator, with the product of their norms in the denominator. That's really all it is; it's pretty simple.
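In NumPy that looks something like this (the norm terms are redundant once the embeddings are normalized, but keeping them means the same line also works on unnormalized vectors):

```python
# rows = captions, columns = images
cos_sim = np.dot(text_emb, img_emb.T) / (
    np.linalg.norm(text_emb, axis=1, keepdims=True)
    * np.linalg.norm(img_emb, axis=1, keepdims=True).T
)
print(cos_sim.shape)  # (21, 21)
```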
If we plot those similarity scores, we get this. We would expect the values along the diagonal to be the highest, because they represent the true pairs of images and text. We have some that are not quite there, like here, where an image-text pair is more similar to something else. I put these captions together very quickly, so they're not always going to be perfect, and there is also a lot of overlap between these images: there are several images of city skylines, and a lot of the time I described those as futuristic cities with a big motorway or something along those lines. That is probably where the confusion is coming from.
Now, if we calculate the dot product similarity between these, we would expect it to be the same, and from this calculation we do seem to get a very similar set of similarities. But are they exactly the same? If we go down here, we can see that they pretty much are. We can't do a straight equality comparison, cosine similarity equals dot product similarity, because the numbers are very slightly different; there is a floating-point error, since these are all floating-point numbers, so we get very, very small differences. You can see that here: we calculate the difference between the two arrays and look at the minimum difference, which is zero, exactly what we'd expect where the numbers are identical, and the maximum difference, which is about 2.98e-8, i.e. 0.0000000298, a very small number. With that, we know the only difference between the two similarity arrays is floating-point error, which is pretty cool to see. We can use this exact approach of comparing with cosine or dot product similarity to search through all of our images with a text prompt, for example.
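A quick sketch of that check:

```python
# plain dot product on the normalized embeddings
dot_sim = np.dot(text_emb, img_emb.T)

diff = np.abs(cos_sim - dot_sim)
print(diff.min(), diff.max())  # ~0 and ~3e-8: nothing but floating-point error
```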
But that's not all CLIP is good for. CLIP also has amazing performance as a zero-shot model across different tasks, not just one. It performs incredibly well out of the box for classification; we'll go through this in more detail in a future video, but the idea is that, given the set of classes from an image classification dataset, you can modify the class names a little to make them more like sentences, and then use this same idea of comparing your text representations of the classes against the images from the dataset. You calculate the similarity between them, and the text embedding (you can think of it as a class embedding) with the highest similarity to your image is the predicted class. So you get zero-shot classification, just like that, super easily.
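As a rough sketch of that idea (the class names, prompt template and `query_image` are hypothetical placeholders, not from the video):

```python
class_names = ["dog", "cat", "city skyline"]          # hypothetical classes
prompts = [f"a photo of a {name}" for name in class_names]

inputs = processor(text=prompts, images=query_image,  # query_image: a PIL image
                   padding=True, return_tensors="pt").to(device)

with torch.no_grad():
    t_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                    attention_mask=inputs["attention_mask"])
    i_emb = model.get_image_features(pixel_values=inputs["pixel_values"])

# normalize, then dot product == cosine similarity
t_emb = t_emb / t_emb.norm(dim=-1, keepdim=True)
i_emb = i_emb / i_emb.norm(dim=-1, keepdim=True)
scores = (i_emb @ t_emb.T).squeeze(0)

print(class_names[scores.argmax().item()])  # predicted class
```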
Another use case is zero-shot object detection. Let's say you're looking for a cat, or a butterfly, in an image. When you're looking for the cat, you use a chunk of text like "a fluffy cat", encode it with CLIP, and get your text embedding. Then you break your image up into lots of little patches and slide through all of them; you can include an overlap so that you don't miss anything between patches. For each patch you slide over, you extract that part of the image, process it through CLIP, and compare its encoding against the text embedding you created for "a fluffy cat". What you will see is that the patches of the image containing whatever you described get a higher similarity score. You can then overlay those scores back onto the image, and you will find that CLIP is able to essentially identify where in your image a specific object is, with you just describing that object using a natural-language prompt.
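A rough sliding-window sketch of that idea; the window size, stride and `query_image` are assumptions for illustration rather than the exact setup used in the video:

```python
window, stride = 224, 112  # 50% overlap between patches

prompt = processor(text=["a fluffy cat"], images=None,
                   padding=True, return_tensors="pt").to(device)
with torch.no_grad():
    q_emb = model.get_text_features(**prompt)
q_emb = q_emb / q_emb.norm(dim=-1, keepdim=True)

width, height = query_image.size  # query_image: a PIL image
scores = []
for top in range(0, max(height - window, 0) + 1, stride):
    for left in range(0, max(width - window, 0) + 1, stride):
        patch = query_image.crop((left, top, left + window, top + window))
        pixels = processor(text=None, images=patch,
                           return_tensors="pt")["pixel_values"].to(device)
        with torch.no_grad():
            p_emb = model.get_image_features(pixel_values=pixels)
        p_emb = p_emb / p_emb.norm(dim=-1, keepdim=True)
        # cosine similarity between this patch and the text prompt
        scores.append(((left, top), (p_emb @ q_emb.T).item()))

# the highest-scoring patches show where "a fluffy cat" sits in the image
scores.sort(key=lambda s: s[1], reverse=True)
print(scores[:3])
```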
Now, these are only a few use cases of CLIP, and they really only scratch the surface of what is possible with this model. As I said, we also see it being used in the diffusion models like DALL-E, which is a great example of how powerful CLIP can be. So that's it for this introduction to CLIP; I hope it's been useful. As I said, we're going to go into more detail on the different use cases of CLIP, and how to apply it to them, in future videos. Until then, that's it for now, so thank you very much for watching, and I will see you again in the next one. Bye.