Fast intro to multi-modal ML with OpenAI's CLIP
Chapters
0:00 Intro
0:15 What is CLIP?
2:13 Getting started
5:38 Creating text embeddings
7:23 Creating image embeddings
10:26 Embedding a lot of images
15:08 Text-image similarity search
21:38 Alternative image and text search
00:00:00.000 |
In this video, we're going to have a quick introduction to OpenAI's CLIP 00:00:04.860 |
and how we can use it to almost move between the modalities of both language and images. 00:00:14.860 |
Now, before we dive in, let's just quickly understand what CLIP is. 00:00:22.860 |
In this implementation, we're going to be using a vision transformer that will embed images. 00:00:29.360 |
And we're going to use a normal text transformer that will embed text. 00:00:34.860 |
During pre-training, OpenAI trained the model on pairs of images and text. 00:00:41.360 |
And it trained them to both output embedding vectors that are as close as possible to each other. 00:00:50.360 |
So, the text transformer was trained to output a single 512-dimensional embedding 00:00:58.860 |
that was as close as possible to the vision transformer's image embedding for that image-text pair. 00:01:07.860 |
So, what that means is that CLIP is able to take both images and text and embed them both into a similar vector space. 00:01:21.360 |
You can do image and text classification, you can do image and text search, and a huge number of things. 00:01:28.360 |
Anything to do with images and text, there's a good chance we can do it with CLIP. 00:01:33.860 |
So, let's have a look at how we actually use CLIP. 00:01:37.860 |
OpenAI released a GitHub repository, OpenAI CLIP here. 00:01:42.860 |
This contains CLIP, but we're not going to use this implementation. 00:01:47.360 |
We're actually going to use this implementation of CLIP. 00:01:50.860 |
So, this is on HuggingFace, so we're going to be using HuggingFace Transformers. 00:01:55.360 |
And this is still from OpenAI, it's still CLIP. 00:01:59.360 |
It's just an easy-to-use implementation of it through the HuggingFace Transformers library, 00:02:04.360 |
which is a more standard library for actually doing anything with NLP and also now computer vision and some other things as well. 00:02:14.360 |
So, to get started, I'd recommend you install these libraries: pip install torch. 00:02:18.860 |
Although for PyTorch, you should probably go through the PyTorch.org instructions rather than following this here. 00:02:25.360 |
So, go to PyTorch.org and just install PyTorch using the specific install command they give for your platform and OS. 00:02:37.360 |
And then, pip install transformers and datasets. 00:02:40.360 |
You can still just use that command for those two, but for PyTorch I'd recommend installing it from PyTorch.org instead. 00:02:46.360 |
Now, after that, we're going to need our dataset. 00:02:56.360 |
It contains, I think, just under 10,000 images, and we only care about the images here. 00:03:07.360 |
So, we'll go to the first item, and we'll just have a look at image. 00:03:26.360 |
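(A rough sketch of this step. The dataset ID isn't spoken aloud, so the "frgfm/imagenette" ID and the `imagenette` variable name below are assumptions that happen to match the "just under 10,000 images" description.)

```python
from datasets import load_dataset

# load an ImageNet-style image dataset from the HuggingFace hub
imagenette = load_dataset(
    "frgfm/imagenette",   # assumed dataset ID
    "full_size",
    split="train"
)

imagenette[0]["image"]  # view the first item's image (a PIL image)
```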
Okay, just to point out that we have a lot of images in here in this dataset that cover a range of things. 00:03:34.360 |
There's not a huge number of different categories here, but they have dogs, they have radios, and a few other things. 00:03:40.360 |
Now, I'm just going to go ahead and initialize everything. 00:03:47.360 |
From transformers, we're importing the clip tokenizer. 00:03:51.360 |
So, the tokenizer is what's going to handle the preprocessing of our text into token ID tensors and other tensors. 00:04:06.360 |
And we're also importing the CLIP processor, which is going to resize our images into the size that CLIP expects and also modify the pixel values as well. 00:04:24.360 |
So, if you have CUDA, or MPS if you're on an M1 Mac, you just set that with this. 00:04:35.360 |
And then we're ready to actually initialize all of this. 00:04:39.360 |
So, the model ID is going to be what we saw before. 00:04:48.360 |
We have this openai/clip-vit-base-patch32. 00:04:57.360 |
And now we just need to -- look, I'm being told what to do already. 00:05:02.360 |
So, model = CLIPModel.from_pretrained(model_id). 00:05:05.360 |
And I'm going to move it to the device -- though I don't normally set the device like that. 00:05:38.360 |
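(Putting the initialization together, it looks roughly like this. The exact tokenizer class and device-selection logic aren't fully spelled out in the audio, so treat this as a sketch.)

```python
from transformers import CLIPTokenizerFast, CLIPProcessor, CLIPModel
import torch

# use CUDA, or Apple's MPS on an M1 Mac, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else \
         ("mps" if torch.backends.mps.is_available() else "cpu")

model_id = "openai/clip-vit-base-patch32"

tokenizer = CLIPTokenizerFast.from_pretrained(model_id)  # text preprocessing
processor = CLIPProcessor.from_pretrained(model_id)      # image preprocessing
model = CLIPModel.from_pretrained(model_id).to(device)   # the CLIP model itself
```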
So, now what we're going to do is take a look at how we actually create the text embeddings, using a prompt like "a dog in the snow". 00:05:50.360 |
There aren't many pictures of dogs in the snow in the dataset, but there are some. 00:05:56.360 |
And what we need to do is tokenize the prompt. 00:06:03.360 |
I'm going to go with tokenizer, prompt, and we need to return tensors using PyTorch. 00:06:11.360 |
So, we're going to be using PyTorch behind the scenes here. 00:06:18.360 |
And let's just have a look at what is actually in inputs. 00:06:26.360 |
So, you'll recognize this if you've used Hugging Face Transformers before. 00:06:32.360 |
And these are just the token IDs that represent the words from this. 00:06:45.360 |
But if we had padding in here, anything beyond the length of our prompt would get a zero in the attention mask, 00:06:50.360 |
telling the model not to pay attention to that part of the input. 00:06:56.360 |
And from there, we can process this through CLIP. 00:07:23.360 |
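(A sketch of the text-embedding step; the prompt string is the "a dog in the snow" query used throughout the video.)

```python
prompt = "a dog in the snow"

# tokenize the prompt into token ID and attention mask tensors
inputs = tokenizer(prompt, return_tensors="pt").to(device)
print(inputs)  # {'input_ids': ..., 'attention_mask': ...}

# create the 512-dimensional text embedding
text_emb = model.get_text_features(**inputs)
print(text_emb.shape)  # torch.Size([1, 512])
```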
So, that's the text embedding side of things. 00:07:27.360 |
Now, we need to go ahead and do the image embedding side of things. 00:07:32.360 |
So, we're going to resize the image first with the processor. 00:07:38.360 |
So, you can also process text through this processor. 00:07:42.360 |
I'm just keeping it separate because it makes more sense to me. 00:07:52.360 |
Again, we want to return tensors using PyTorch. 00:07:57.360 |
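(A sketch of the image preprocessing step, assuming `image` is the first image from the dataset loaded earlier.)

```python
image = imagenette[0]["image"]  # a PIL image from the dataset

# resize and normalize the image into the format CLIP's vision transformer expects
image_inputs = processor(images=image, return_tensors="pt").to(device)
print(image_inputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])
```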
And then we can have a look at the -- I'm going to show you the image. 00:08:18.360 |
And we're going to move those to the device as well. 00:08:21.360 |
I think the device I have set up right now is actually CPU. 00:08:24.360 |
It doesn't make a difference for me, but that's fine. 00:08:32.360 |
So, you see that we have this 224 by 224 image with three color channels. 00:08:39.360 |
So, this is just the expected shape that will be consumed by the vision transformer of clip. 00:09:04.360 |
And I need to -- so, I just need to resize it. 00:09:07.360 |
Let me show you what I'm actually doing here. 00:09:13.360 |
So, I'm going to remove that first dimension. 00:09:17.360 |
So, we put the three color channels at the back. 00:09:22.360 |
And this is for matplotlib to be able to actually show us this. 00:09:35.360 |
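(A sketch of that reshaping, using the variable names assumed above.)

```python
import matplotlib.pyplot as plt

# drop the batch dimension and move the colour channels to the last axis
# so matplotlib can display the processed image
pixels = image_inputs["pixel_values"].squeeze(0)   # (3, 224, 224)
pixels = pixels.permute(1, 2, 0).cpu().numpy()     # (224, 224, 3)

plt.imshow(pixels)
plt.show()
```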
And you can see -- so, the minimum and maximum colour values, or really all of the 00:09:44.360 |
pixel values, are modified when we process the image through the processor. 00:09:51.360 |
But you can see that this is like a resized -- you know, what we saw before. 00:10:04.360 |
So, with that, we can go ahead and get the image features. 00:10:21.360 |
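(That step might look like this:)

```python
# create the 512-dimensional image embedding from the preprocessed pixel values
image_emb = model.get_image_features(**image_inputs)
print(image_emb.shape)  # torch.Size([1, 512]) -- same size as the text embedding
```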
So, similar to before, we have that 512-dimensional embedding vector. 00:10:31.360 |
What I'm going to show you how to do is how to kind of search through this or at least compare a small number of images against our prompt so that we can actually see which one of those images is the most similar to a dog in the snow. 00:10:49.360 |
So, to do that, we're going to want to embed more of these images. 00:11:06.360 |
So, this is just so you can replicate what I'm doing. 00:11:09.360 |
So, this sets a random seed, so we always generate the same set of random numbers. 00:11:18.360 |
So, the reason I'm doing this is because we want to take a sample out of the dataset. 00:11:27.360 |
So, to do that, we want to go -- so, sample indices are going to be equal to NumPy random.randint from zero up to the length of ImageNet. 00:11:49.360 |
And then we're going to convert that into a list. 00:11:53.360 |
I can just have a quick look at what is in there. 00:12:05.360 |
And if we run it again, because we have that random seed set, the random set of numbers doesn't change. 00:12:13.360 |
And what I'm going to do is just create a list of images using those values. 00:12:34.360 |
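(A sketch of the sampling step. The seed value and the `imagenette` variable name are assumptions; the sample size of 100 matches the 100 embeddings mentioned later.)

```python
import numpy as np

np.random.seed(0)  # assumed seed, so the sampled indices are reproducible

# sample 100 random indices from across the dataset and pull out those images
sample_idx = np.random.randint(0, len(imagenette), 100).tolist()
images = [imagenette[i]["image"] for i in sample_idx]
```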
Now we want to just go ahead and literally take everything we've just done and put it into a for loop to create the embeddings for all of these images. 00:12:50.360 |
This is just a progress bar so we can see where we are. 00:12:53.360 |
Batch size, I'm saying how many images to perform this for in any one go. 00:12:59.360 |
You can increase this if you're using a bigger GPU or whatever else. 00:13:07.360 |
Image array, I'm setting that to none for now. 00:13:16.360 |
And then we're just doing the same thing as before. 00:13:18.360 |
So, I'm selecting a batch of images based on the batch size. 00:13:24.360 |
And then we're processing and resizing the images from that batch, and getting the image features, exactly the same as before. 00:13:33.360 |
I think before I actually didn't include pixel values. 00:13:52.360 |
It's the same thing as what I showed you up here. 00:13:54.360 |
So, we squeeze the first dimension out of that like we did here. 00:14:01.360 |
And then we are moving that batch of embeddings to the CPU. 00:14:10.360 |
We're detaching it from the gradient computation graph of the PyTorch model, i.e. CLIP. 00:14:19.360 |
And then we're converting into a NumPy array. 00:14:22.360 |
And then I'm going to add that batch of embeddings to a large array of all image embeddings. 00:14:41.360 |
So, here I'm actually pulling in the full row or record at any one time. 00:15:08.360 |
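(The whole loop might look something like this; the batch size of 16 is an assumption.)

```python
from tqdm.auto import tqdm

batch_size = 16   # raise this if you're using a bigger GPU
image_arr = None

for i in tqdm(range(0, len(images), batch_size)):
    # select a batch of images based on the batch size
    batch = images[i:i + batch_size]
    # preprocess and resize the batch
    batch = processor(images=batch, return_tensors="pt")["pixel_values"].to(device)
    # embed the batch, then detach from the graph, move to CPU, convert to NumPy
    batch_emb = model.get_image_features(pixel_values=batch)
    batch_emb = batch_emb.detach().cpu().numpy()
    # add this batch of embeddings to the large array of all image embeddings
    if image_arr is None:
        image_arr = batch_emb
    else:
        image_arr = np.concatenate((image_arr, batch_emb), axis=0)

print(image_arr.shape)  # (100, 512)
```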
And now we have 100 512-dimensional image embeddings from our dataset. 00:15:16.360 |
And we can now use them to compare to our initial text embedding and see which one of these matches most closely to that text embedding. 00:15:27.360 |
So, I'm going to be using dot product similarity. 00:15:30.360 |
So, there's just one thing to be aware of with that. 00:15:34.360 |
And that is that it considers both the magnitude of the vector and also the angle. 00:15:40.360 |
So, in this case, that will -- that can throw off our results. 00:15:46.360 |
So, we should normalize all of the image embeddings so that we are not looking at the magnitude of vectors. 00:15:54.360 |
And we're only focusing on the angular similarity between our text embedding and these image embeddings. 00:16:03.360 |
So, to do that, we need to -- I'll just show you quickly. 00:16:12.360 |
You know, they're kind of all over the place. 00:16:18.360 |
So, we do the image array divided by np.linalg.norm. 00:16:41.360 |
And these norms are basically telling us, for each one of these vectors, what we should divide it by in order to bring each of them to the same unit magnitude, pretty much. 00:17:21.360 |
And then -- so, the image array, the shape is going to be transposed now. 00:17:38.360 |
And now if we have a look at the minimum and maximum. 00:17:44.360 |
We get these values, which are more reasonable. 00:17:49.360 |
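(A sketch of the normalization. The video does it with a transpose; using keepdims is an equivalent way that keeps the (100, 512) shape.)

```python
# normalize each image embedding to unit length so that dot product similarity
# only reflects the angle between vectors, not their magnitude
norms = np.linalg.norm(image_arr, axis=1, keepdims=True)  # one norm per vector
image_arr = image_arr / norms

print(image_arr.min(), image_arr.max())  # the values now sit in a much smaller range
```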
So, now what we can do is use dot product similarity to actually compare these. 00:17:57.360 |
So, text embedding, I'm going to take the text embedding. 00:18:01.360 |
And similar to before, what we did is we need to move it to the CPU, detach it from the PyTorch graph, and then convert it to NumPy array. 00:18:16.360 |
And then for the scores, all we need to do is NumPy dot. 00:18:20.360 |
And we are going to put the text embedding followed by the image array. 00:18:26.360 |
And actually, I think I need to transpose this again. 00:18:28.360 |
So, maybe we could have avoided transposing up here. 00:18:36.360 |
So, the scores that we get here, we get a single score for every single vector, as we can see. 00:18:45.360 |
And they are the dot product similarity scores. 00:18:47.360 |
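(A sketch of the scoring step:)

```python
# move the text embedding to CPU, detach it from the PyTorch graph, convert to NumPy
text_emb = text_emb.detach().cpu().numpy()   # shape (1, 512)

# dot product similarity against every image embedding
scores = np.dot(text_emb, image_arr.T)       # shape (1, 100): one score per image
```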
So, what we can now do is sort based on this scores array. 00:18:52.360 |
And just return, say, the top five images and see what the five most similar images are for our query. 00:19:07.360 |
So, top K is going to be the five most similar, or the five items with the highest score. 00:19:13.360 |
And then we want to take the index values using np.argsort. 00:19:19.360 |
We're going to pass the negative of the scores there, so the highest scores come first. 00:19:24.360 |
And just make sure we index into the first row, because scores has that extra leading dimension. 00:19:28.360 |
So, we're actually just taking the first top_k of those sorted indices -- let me show you. 00:19:45.360 |
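(A sketch of the top-k selection:)

```python
top_k = 5
# argsort on the negated scores gives highest-similarity-first;
# scores has shape (1, 100), so we index into the first row
idx = np.argsort(-scores[0])[:top_k]
print(idx)  # index values of the five most similar image embeddings
```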
So, what we're left with is these five index values, which are essentially indexes of the image embeddings. 00:19:55.360 |
And, therefore, the images that are the most similar to our query. 00:20:01.360 |
So, we'll use matplotlib again to visualize those. 00:20:34.360 |
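(A sketch of the visualization:)

```python
# show the top five matching images side by side
fig, axes = plt.subplots(1, top_k, figsize=(16, 4))
for ax, i in zip(axes, idx):
    ax.imshow(images[i])
    ax.axis("off")
plt.show()
```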
The first item, as we would expect, is a dog in the snow. 00:20:39.360 |
So, after that, we get dogs and we get, like, these snowy areas. 00:20:45.360 |
The reason for that is that we just don't have any more images of dogs in the snow. 00:20:53.360 |
One of them is like a toy -- maybe it's a dog, maybe it's a bear. 00:20:57.360 |
But I suppose technically that's like a dog in the snow. 00:21:02.360 |
So, yeah, obviously the model is performing pretty well. 00:21:06.360 |
And I think that's really cool that we can do that so easily. 00:21:09.360 |
And, yeah, I mean, CLIP is, I think, an amazing model that we can use to do a load of cool things across both the text and image domains. 00:21:22.360 |
And if you think about it, just a couple of years ago this sort of thing wasn't possible. 00:21:28.360 |
It didn't seem like it was going to be happening anytime soon, at least not to this degree of accuracy. 00:21:40.360 |
Here I've obviously shown you how to do a text-to-image search. 00:21:46.360 |
But in reality, what we're doing is just searching through the vectors. 00:21:52.360 |
So, it doesn't matter which direction you're doing that search in. 00:21:59.360 |
So, if you want to do a text to text search with Clip, you could. 00:22:03.360 |
If you want to do image to image search, you could. 00:22:05.360 |
If you want to do image to text or all of those things all at once, you could. 00:22:11.360 |
In the end, you're just searching through vectors. 00:22:14.360 |
So, what is behind those vectors doesn't really matter so much. 00:22:21.360 |
I think CLIP is super interesting and I hope you find it interesting as well. 00:22:26.360 |
In the future, or very soon actually, I'm going to be going into a lot more detail on CLIP. 00:22:33.360 |
So, if you are interested in that, subscribe and click on the little notification button 00:22:40.360 |
and you will get a notification about that pretty soon. 00:22:48.360 |
Thank you very much for watching and I will see you again in the next one.