
Fast intro to multi-modal ML with OpenAI's CLIP


Chapters

0:00 Intro
0:15 What is CLIP?
2:13 Getting started
5:38 Creating text embeddings
7:23 Creating image embeddings
10:26 Embedding a lot of images
15:08 Text-image similarity search
21:38 Alternative image and text search

Whisper Transcript

00:00:00.000 | In this video, we're going to have a quick introduction to OpenAI's CLIP
00:00:04.860 | and how we can use it to almost move between the modalities of both language and images.
00:00:14.860 | Now, before we dive in, let's just quickly understand what CLIP is.
00:00:19.360 | So, it consists of two big models.
00:00:22.860 | In this implementation, we're going to be using a vision transformer that will embed images.
00:00:29.360 | And we're going to use a normal text transformer that will embed text.
00:00:34.860 | During pre-training, OpenAI trained the model on pairs of images and text.
00:00:41.360 | And it trained them both to output embedding vectors that are as close as possible to each other.
00:00:50.360 | So, the text transformer was trained to output a single 512-dimensional embedding
00:00:58.860 | that was as close as possible to the vision transformer's image embedding for each image-text pair.
00:01:07.860 | So, what that means is that CLIP is able to take both images and text and embed them both into a similar vector space.
00:01:19.360 | And with that, we can do a lot of things.
00:01:21.360 | You can do image and text classification, you can do image and text search, and a huge number of things.
00:01:28.360 | Anything to do with images and text, there's a good chance we can do it with CLIP.
00:01:33.860 | So, let's have a look at how we actually use CLIP.
00:01:37.860 | OpenAI released a GitHub repository, OpenAI CLIP here.
00:01:42.860 | This contains CLIP, but we're not going to use this implementation.
00:01:47.360 | We're actually going to use this implementation of CLIP.
00:01:50.860 | So, this is on HuggingFace, so we're going to be using HuggingFace Transformers.
00:01:55.360 | And this is still from OpenAI, it's still CLIP.
00:01:59.360 | It's just an easy-to-use implementation of it through the HuggingFace Transformers library,
00:02:04.360 | which is a more standard library for actually doing anything with NLP and also now computer vision and some other things as well.
00:02:14.360 | So, to get started, I'd recommend you install these libraries, pip install torch.
00:02:18.860 | You should probably go through the PyTorch.org instructions rather than following this here.
00:02:25.360 | So, go to PyTorch.org and just install PyTorch using the specific install command they give for your platform and your OS.
00:02:37.360 | And then, pip install transformers and datasets.
00:02:40.360 | You can still just use this command for transformers and datasets, but for PyTorch I'd recommend installing from PyTorch.org instead.
00:02:46.360 | Now, after that, we're going to need our dataset.
00:02:51.360 | So, this is just a very simple dataset.
00:02:56.360 | It contains, I think, just under 10,000 images, and we only care about the images here.
00:03:03.360 | So, if we have a look, we have ImageNet.
00:03:07.360 | So, we'll go to the first item, and we'll just have a look at image.
00:03:11.360 | Okay, and we have this, like, Sony radio.
00:03:14.360 | And we have other things as well.
00:03:16.360 | So, if we go to ImageNet at index 6494.
00:03:24.360 | There's another image here of a dog.
00:03:26.360 | Okay, just to point out that we have a lot of images in here in this dataset that cover a range of things.
00:03:34.360 | There's not a huge number of different categories here, but they have dogs, they have radios, and a few other things.
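The transcript doesn't name the dataset explicitly, but the description (just under 10,000 images covering a handful of categories such as dogs and radios) matches the imagenette dataset on the HuggingFace Hub, so the sketch below assumes that; swap in whichever image dataset you're using.

```python
from datasets import load_dataset

# Assumed dataset: "frgfm/imagenette" on the HuggingFace Hub (just under 10,000
# images across a handful of classes, including dogs and radios).
imagenet = load_dataset(
    "frgfm/imagenette",
    "full_size",
    split="train",
)

imagenet[0]["image"]     # a PIL image (the Sony radio shown in the video)
imagenet[6494]["image"]  # another example, this one a dog
```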
00:03:40.360 | Now, I'm just going to go ahead and initialize everything.
00:03:44.360 | So, there's a few things here.
00:03:47.360 | From transformers, we're importing the clip tokenizer.
00:03:51.360 | So, the tokenizer is what's going to handle the preprocessing of our text into token ID tensors and other tensors.
00:04:00.360 | We have the clip processor.
00:04:02.360 | That's like the tokenizer, but for images.
00:04:06.360 | So, this is actually just going to resize our images into the size that clip expects and also modify the pixel values as well.
00:04:19.360 | And then we have clip model.
00:04:20.360 | Clip model is clip itself.
00:04:23.360 | Okay.
00:04:24.360 | So, if you have CUDA, or MPS if you're on an M1 Mac, you just set that with this.
00:04:34.360 | Okay.
00:04:35.360 | And then we're ready to actually initialize all of this.
00:04:39.360 | So, the model ID is going to be what we saw before.
00:04:45.360 | So, you come over here.
00:04:48.360 | We have this openai/clip-vit-base-patch32.
00:04:51.360 | Copy that.
00:04:55.360 | And here we go.
00:04:56.360 | Okay.
00:04:57.360 | And now we just need to -- look, I'm being told what to do already.
00:05:01.360 | Okay.
00:05:02.360 | So, model, clip model from pre-trained, model ID.
00:05:05.360 | I'm going to -- I don't normally set device like that.
00:05:08.360 | I don't know if you can.
00:05:10.360 | I am going to do it like this.
00:05:13.360 | Okay.
00:05:14.360 | And tokenizer.
00:05:16.360 | Okay.
00:05:17.360 | Good job.
00:05:18.360 | And processor.
00:05:19.360 | Cool.
00:05:20.360 | Almost there.
00:05:21.360 | It's from pre-trained.
00:05:24.360 | Okay.
00:05:27.360 | And you got a little bit confused.
00:05:30.360 | So, model ID.
00:05:33.360 | Okay.
00:05:34.360 | That looks good.
00:05:35.360 | Let's run that.
00:05:36.360 | Okay.
00:05:37.360 | Cool.
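Putting the initialization steps together, a minimal sketch looks something like this (the device-selection logic and variable names are my own):

```python
import torch
from transformers import CLIPTokenizer, CLIPProcessor, CLIPModel

# Pick the best available device: CUDA GPU, Apple MPS on an M1 Mac, otherwise CPU.
device = (
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)

model_id = "openai/clip-vit-base-patch32"

model = CLIPModel.from_pretrained(model_id).to(device)   # CLIP itself
tokenizer = CLIPTokenizer.from_pretrained(model_id)      # text preprocessing
processor = CLIPProcessor.from_pretrained(model_id)      # image preprocessing
```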
00:05:38.360 | So, now what we're going to do is take a look at how we actually create the text embeddings
00:05:44.360 | through clip.
00:05:45.360 | So, we start with prompt.
00:05:47.360 | I'm going to go with a dog in the snow.
00:05:50.360 | There aren't many pictures of dogs in the snow in the dataset, but there are some.
00:05:56.360 | And what we need to do is tokenize the prompt.
00:05:59.360 | Yeah, that's true.
00:06:00.360 | Okay.
00:06:01.360 | I'm not going to do it like that.
00:06:03.360 | I'm going to go with tokenizer, prompt, and we need to return tensors using PyTorch.
00:06:11.360 | So, we're going to be using PyTorch behind the scenes here.
00:06:15.360 | So, make sure we do that.
00:06:18.360 | And let's just have a look at what is actually in inputs.
00:06:22.360 | Okay.
00:06:23.360 | So, we get this input IDs tensor.
00:06:26.360 | So, you'll recognize this if you've used Hugging Face Transformers before.
00:06:32.360 | And these are just the token IDs that represent the words from this.
00:06:38.360 | Okay.
00:06:39.360 | And this is the attention mask.
00:06:42.360 | Now, for us, it's going to all be ones.
00:06:45.360 | But if we had padding in here, anything beyond the length of our prompt would become a zero,
00:06:50.360 | telling the model to not pay attention to that part of the prompt.
00:06:56.360 | And from there, we can process this through CLIP.
00:07:00.360 | So, we do model, get text features, I think.
00:07:09.360 | And we pass in those inputs.
00:07:12.360 | Okay.
00:07:13.360 | And let's have a look at the shape of that.
00:07:17.360 | Okay.
00:07:18.360 | So, we have a 512 dimensional vector.
00:07:22.360 | Okay.
00:07:23.360 | So, that's the text embedding side of things.
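The text-embedding steps just described, as a short sketch:

```python
prompt = "a dog in the snow"

# Tokenize the prompt into input_ids and an attention_mask, as PyTorch tensors.
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Embed the prompt with CLIP's text transformer.
text_emb = model.get_text_features(**inputs)
text_emb.shape  # torch.Size([1, 512])
```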
00:07:27.360 | Now, we need to go ahead and do the image embedding side of things.
00:07:31.360 | Okay.
00:07:32.360 | So, we're going to resize the image first with the processor.
00:07:36.360 | We're not adding any text in here.
00:07:38.360 | So, you can also process text through this processor.
00:07:42.360 | I'm just keeping it separate because it makes more sense to me.
00:07:46.360 | The image should be images, actually.
00:07:52.360 | Again, we want to return tensors using PyTorch.
00:07:56.360 | Okay.
00:07:57.360 | And then we can have a look at the -- I'm going to show you the image.
00:08:03.360 | First, we just have a look at the shape.
00:08:06.360 | And as well, one thing.
00:08:08.360 | So, okay.
00:08:09.360 | I can show you.
00:08:10.360 | Okay.
00:08:12.360 | Okay.
00:08:13.360 | In here, we actually have this pixel values.
00:08:15.360 | So, we actually need to extract that.
00:08:17.360 | So, we're going to put it here.
00:08:18.360 | And we're going to move those to the device as well.
00:08:21.360 | I think the device I have set up right now is actually CPU.
00:08:24.360 | It doesn't make a difference for me, but that's fine.
00:08:28.360 | So, let's have a look at the shape.
00:08:31.360 | Okay.
00:08:32.360 | So, you see that we have this 224 by 224 image with three color channels.
00:08:39.360 | So, this is just the expected shape that will be consumed by the vision transformer of clip.
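In code, the preprocessing step looks roughly like this:

```python
# Resize and normalize a single image into the format CLIP's vision transformer expects.
image = processor(
    images=imagenet[0]["image"],
    return_tensors="pt",
)["pixel_values"].to(device)

image.shape  # torch.Size([1, 3, 224, 224])
```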
00:08:47.360 | Okay.
00:08:48.360 | And so, import matplotlib.pyplot as plt.
00:08:53.360 | And I just want to show you this image.
00:08:56.360 | So, this resize image.
00:08:58.360 | So, plt.imshow.image.
00:09:04.360 | And I need to -- so, I just need to resize it.
00:09:07.360 | Let me show you what I'm actually doing here.
00:09:09.360 | So, image.squeeze.
00:09:12.360 | Zero.
00:09:13.360 | So, I'm going to remove that first dimension.
00:09:16.360 | Now, I'm going to transpose it.
00:09:17.360 | So, we put the three color channels at the back.
00:09:22.360 | And this is for matplotlib to be able to actually show us this.
00:09:28.360 | So, I'm going to take that.
00:09:30.360 | I'm going to put it here.
00:09:34.360 | Okay.
00:09:35.360 | And you can see that all of the pixel values have been modified when we
00:09:44.360 | processed the image through the processor.
00:09:49.360 | So, the colors are kind of messed up.
00:09:51.360 | But you can see that this is just a resized version of what we saw before.
00:09:56.360 | Okay.
00:09:57.360 | You can still see it's a Sony.
00:09:58.360 | It's just kind of backwards now and flipped.
00:10:01.360 | We can still see that it is that Sony radio.
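To display the processed tensor with matplotlib, something like the following works; note that I use permute here rather than the plain transpose used in the video, which is why the image wouldn't appear flipped:

```python
import matplotlib.pyplot as plt

# Drop the batch dimension and move the color channels to the end for matplotlib.
# The processor has already normalized the pixel values, so the colors look off
# and matplotlib clips them into the displayable range.
plt.imshow(image.squeeze(0).permute(1, 2, 0).cpu().numpy())
plt.show()
```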
00:10:04.360 | So, with that, we can go ahead and get the image features.
00:10:08.360 | I think it just showed me.
00:10:11.360 | Model, get image features.
00:10:13.360 | I just want an image.
00:10:15.360 | Okay.
00:10:17.360 | And then let's have a look at the shape.
00:10:19.360 | Cool.
00:10:20.360 | Okay.
00:10:21.360 | So, similar to before, we have that 512-dimensional embedding vector.
00:10:26.360 | Okay.
00:10:27.360 | So, that's cool.
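And the image-embedding step itself:

```python
# Embed the preprocessed image with CLIP's vision transformer.
img_emb = model.get_image_features(pixel_values=image)
img_emb.shape  # torch.Size([1, 512])
```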
00:10:29.360 | And from here, we can do a lot of things.
00:10:31.360 | What I'm going to show you is how to search through this, or at least compare a small number of images against our prompt, so that we can see which one of those images is the most similar to "a dog in the snow".
00:10:48.360 | Okay.
00:10:49.360 | So, to do that, we're going to want to embed more of these images.
00:10:53.360 | I'm not going to embed loads of them.
00:10:55.360 | I'm just going to embed 100 images.
00:10:57.360 | Nothing crazy.
00:10:59.360 | So, we're going to import NumPy as NP.
00:11:05.360 | NP random seed.
00:11:06.360 | So, this is just so you can replicate what I'm doing.
00:11:09.360 | So, with the seed set, this will generate the same set of random numbers every time.
00:11:17.360 | Okay.
00:11:18.360 | So, the reason I'm doing this is because we want to take a sample out of the dataset.
00:11:22.360 | We don't want to have the whole dataset.
00:11:24.360 | I want it to be at least somewhat random.
00:11:27.360 | So, to do that, we want to go -- so, sample indices are going to be equal to NumPy random.randint from zero up to the length of ImageNet.
00:11:43.360 | It's actually plus one.
00:11:45.360 | And we need 100 of those.
00:11:48.360 | Okay.
00:11:49.360 | And then we're going to convert that into a list.
00:11:52.360 | Okay.
00:11:53.360 | I can just have a quick look at what is in there.
00:11:57.360 | Okay.
00:11:58.360 | So, just all of these numbers here.
00:12:01.360 | Okay.
00:12:02.360 | So, yeah, cool.
00:12:05.360 | And if we run it again, because we have that random seed set, the random set of numbers doesn't change.
00:12:13.360 | And what I'm going to do is just create a list of images using those values.
00:12:20.360 | So, I for I in sample IDX.
00:12:25.360 | Okay.
00:12:27.360 | And then we check length of images.
00:12:30.360 | Okay.
00:12:31.360 | So, now we have 100 images from our dataset.
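A sketch of the sampling step; the seed value here is my own choice, since the transcript doesn't state which one is used:

```python
import numpy as np

np.random.seed(0)  # assumed seed; any fixed value makes the sample reproducible

# Draw 100 random indices into the dataset. randint's upper bound is exclusive,
# so len(imagenet) keeps every index in range.
sample_idx = np.random.randint(0, len(imagenet), 100).tolist()

# Pull just the image column for those indices.
images = [imagenet[i]["image"] for i in sample_idx]
len(images)  # 100
```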
00:12:34.360 | Now we want to just go ahead and literally take everything we've just done and put it into a for loop to create the embeddings for all of these images.
00:12:45.360 | Okay.
00:12:46.360 | So, that will look something like this.
00:12:49.360 | I'm using TQDM here.
00:12:50.360 | This is just a progress bar so we can see where we are.
00:12:53.360 | Batch size, I'm saying how many images to perform this for in any one go.
00:12:59.360 | You can increase this if you're using a bigger GPU or whatever else.
00:13:07.360 | Image array, I'm setting that to none for now.
00:13:10.360 | We initialize that in the first loop.
00:13:15.360 | Okay.
00:13:16.360 | And then we're just doing the same thing as before.
00:13:18.360 | So, I'm selecting a batch of images based on the batch size.
00:13:24.360 | And then we're processing and resizing the images from that batch, and getting the image features, which looks exactly the same as before.
00:13:33.360 | I think before I actually didn't include pixel values.
00:13:36.360 | But it's the same thing.
00:13:38.360 | It's just a default argument.
00:13:40.360 | Converting into a NumPy array.
00:13:43.360 | Did I show you this before?
00:13:45.360 | I don't actually think so.
00:13:48.360 | Maybe not.
00:13:49.360 | But here, the squeeze is very similar.
00:13:52.360 | It's the same thing as what I showed you up here.
00:13:54.360 | So, we squeeze the first dimension out of that like we did here.
00:14:01.360 | And then we are moving that batch of embeddings to the CPU.
00:14:08.360 | If it's not already on the CPU.
00:14:10.360 | We're detaching it from the gradient graph of PyTorch, i.e. the computation graph of the CLIP model.
00:14:19.360 | And then we're converting into a NumPy array.
00:14:22.360 | And then I'm going to add that batch of embeddings to a large array of all image embeddings.
00:14:28.360 | And that's where the image array comes in.
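The loop described above, sketched out (the batch size and variable names are assumptions):

```python
from tqdm.auto import tqdm

batch_size = 16    # how many images to embed per forward pass; raise this on a bigger GPU
image_arr = None   # will hold all of the image embeddings once the loop finishes

for i in tqdm(range(0, len(images), batch_size)):
    # Select and preprocess a batch of images, exactly as with the single image above.
    batch = images[i:i + batch_size]
    batch = processor(images=batch, return_tensors="pt")["pixel_values"].to(device)

    # Embed the batch, then detach it from the computation graph, move it to CPU,
    # and convert to NumPy so the batches can be stacked into one array.
    batch_emb = model.get_image_features(pixel_values=batch)
    batch_emb = batch_emb.detach().cpu().numpy()

    image_arr = batch_emb if image_arr is None else np.concatenate([image_arr, batch_emb], axis=0)

image_arr.shape  # (100, 512)
```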
00:14:31.360 | So, let's run that.
00:14:35.360 | So, we come up here.
00:14:38.360 | I made a mistake in the code.
00:14:41.360 | So, here I'm actually pulling in the full row or record at any one time.
00:14:46.360 | We don't want to do that.
00:14:48.360 | We want the image itself.
00:14:50.360 | So, run that again.
00:14:53.360 | Okay.
00:14:54.360 | And now if we check the type of images[0],
00:15:00.360 | we should see it's a PIL image.
00:15:02.360 | Cool.
00:15:03.360 | We have a PIL image here.
00:15:04.360 | Now we can run this.
00:15:06.360 | Okay.
00:15:07.360 | It won't take long.
00:15:08.360 | And now we have 100 512-dimensional image embeddings from our dataset.
00:15:16.360 | And we can now use them to compare to our initial text embedding and see which one of these matches most closely to that text embedding.
00:15:26.360 | Okay.
00:15:27.360 | So, I'm going to be using dot product similarity.
00:15:30.360 | So, there's just one thing to be aware of with that.
00:15:34.360 | And that is that it considers both the magnitude of the vector and also the angle.
00:15:40.360 | So, in this case, that will -- that can throw off our results.
00:15:46.360 | So, we should normalize all of the image embeddings so that we are not looking at the magnitude of vectors.
00:15:54.360 | And we're only focusing on the angular similarity between our text embedding and these image embeddings.
00:16:03.360 | So, to do that, we need to -- I'll just show you quickly.
00:16:09.360 | So, I look at the minimum and maximum.
00:16:12.360 | You know, they're kind of all over the place.
00:16:14.360 | So, to normalize, we need to do this.
00:16:18.360 | So, we do image array divided by np.linalg.norm.
00:16:26.360 | And here we have the image array.
00:16:31.360 | Okay.
00:16:33.360 | axis equals one.
00:16:36.360 | And let me -- I can show you what that is.
00:16:40.360 | So, we have all these numbers.
00:16:41.360 | And these are basically telling us, for each one of these vectors, what we should divide it by in order to bring each of them to unit magnitude.
00:16:58.360 | So, take a look at the shape.
00:17:03.360 | It will be 100.
00:17:05.360 | Yeah, we do that.
00:17:09.360 | So, I think I need to transpose this.
00:17:17.360 | Okay.
00:17:21.360 | And then -- so, the image array, the shape is going to be transposed now.
00:17:25.360 | So, I'm going to transpose it again.
00:17:27.360 | Yeah.
00:17:31.360 | Image array equals image array transpose.
00:17:35.360 | Okay.
00:17:36.360 | Cool.
00:17:38.360 | And now if we have a look at the minimum and maximum.
00:17:42.360 | So, minimum and maximum.
00:17:44.360 | We get these values, which are more reasonable.
00:17:48.360 | Okay.
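The normalization step described above, as code:

```python
# Normalize each image embedding to unit length so the dot product reflects only
# the angle between vectors, not their magnitudes.
norms = np.linalg.norm(image_arr, axis=1)  # one norm per embedding, shape (100,)
image_arr = (image_arr.T / norms).T        # divide each row by its norm; back to (100, 512)

image_arr.min(), image_arr.max()           # values now sit in a much narrower range
```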
00:17:49.360 | So, now what we can do is use dot product similarity to actually compare these.
00:17:57.360 | So, text embedding, I'm going to take the text embedding.
00:18:01.360 | And similar to before, what we did is we need to move it to the CPU, detach it from the PyTorch graph, and then convert it to NumPy array.
00:18:12.360 | Okay.
00:18:15.360 | Yeah.
00:18:16.360 | And then for the scores, all we need to do is NumPy dot.
00:18:20.360 | And we are going to put the text embedding followed by the image array.
00:18:26.360 | And actually, I think I need to transpose this again.
00:18:28.360 | So, maybe we could have avoided transposing up here.
00:18:34.360 | Okay.
00:18:35.360 | Yeah.
00:18:36.360 | So, the scores that we get here, we get a single score for every single vector, as we can see.
00:18:43.360 | Shape 100.
00:18:45.360 | And they are the dot product similarity scores.
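And the scoring step:

```python
# Bring the text embedding into NumPy and score it against every image embedding.
text_emb = text_emb.detach().cpu().numpy()
scores = np.dot(text_emb, image_arr.T)
scores.shape  # (1, 100), one dot product similarity score per image
```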
00:18:47.360 | So, what we can now do is sort based on this scores array.
00:18:52.360 | And just return, like, the top, say, the top five images and see what the top five most similar images are for our query.
00:19:02.360 | Okay.
00:19:03.360 | So, we're going to return the top K.
00:19:07.360 | So, top K is going to be the five most similar, or the five items with the highest score.
00:19:13.360 | And then we want to take the index values using np.argsort.
00:19:19.360 | We're going to pass in the negative of the scores there.
00:19:24.360 | And just make sure we index in correctly, because scores has an extra outer dimension.
00:19:28.360 | So, we're actually just taking scores[0]. Let me show you.
00:19:32.360 | scores[0].shape.
00:19:37.360 | Okay.
00:19:38.360 | So, it's taking the 100 values there.
00:19:41.360 | And then I want to take the top K from that.
00:19:44.360 | Okay.
00:19:45.360 | So, what we're left with is these five index values, which are essentially indexes of the image embeddings.
00:19:55.360 | And, therefore, the images that are the most similar to our query.
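The top-k selection, sketched:

```python
top_k = 5

# argsort sorts ascending, so negate the scores to put the highest scores first,
# then keep the indices of the five most similar images.
idx = np.argsort(-scores[0])[:top_k]
```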
00:20:01.360 | So, we'll use matplotlib again to visualize those.
00:20:06.360 | So, we do for i in idx.
00:20:09.360 | Let's print the score first.
00:20:13.360 | So, scores[i].
00:20:19.360 | And actually that would be scores[0][i].
00:20:23.360 | And then I am going to show the image.
00:20:26.360 | And I'm just going to plt.show().
00:20:29.360 | Okay.
00:20:30.360 | Cool.
00:20:31.360 | So, yeah.
00:20:33.360 | I mean, that's it.
00:20:34.360 | The first item, as we would expect, is a dog in the snow.
00:20:39.360 | So, after that, we get dogs and we get, like, these snowy areas.
00:20:45.360 | The reason for that is that we just don't have any more images of dogs in the snow.
00:20:51.360 | This one, I don't know what this is.
00:20:53.360 | It's like a toy that maybe it's a dog, maybe it's a bear.
00:20:56.360 | I'm not sure.
00:20:57.360 | But I suppose technically that's like a dog in the snow.
00:21:00.360 | So, we have that.
00:21:02.360 | So, yeah, obviously the model is performing pretty well.
00:21:06.360 | And I think that's really cool that we can do that so easily.
00:21:09.360 | And, yeah, I mean, CLIP is, I think, an amazing model that we can use to do a load of cool things across both the text and image domains.
00:21:20.360 | Which is super interesting.
00:21:22.360 | Just a couple of years ago, this sort of thing wasn't possible,
00:21:28.360 | and it didn't seem like it was going to happen anytime soon, at least not to this degree of accuracy.
00:21:37.360 | So, this is really cool.
00:21:40.360 | Here I've obviously shown you how to do a text-to-image search.
00:21:46.360 | But in reality, what we're doing is just searching through vectors.
00:21:52.360 | So, it doesn't really matter which direction you're doing that search in.
00:21:57.360 | The vectors are all in the same space.
00:21:59.360 | So, if you want to do a text-to-text search with CLIP, you could.
00:22:03.360 | If you want to do an image-to-image search, you could.
00:22:05.360 | If you want to do image-to-text, or all of those things at once, you could.
00:22:11.360 | You're just searching through vectors,
00:22:14.360 | so what is behind those vectors doesn't really matter so much.
00:22:17.360 | Okay, so I think that's it for this video.
00:22:21.360 | I think CLIP is super interesting and I hope that you find it interesting as well.
00:22:26.360 | In the future, or very soon actually, I'm going to be going into a lot more detail on CLIP.
00:22:33.360 | So, if you are interested in that, subscribe and click on the little notification button
00:22:40.360 | and you will get a notification about that pretty soon.
00:22:44.360 | But, until then, that's it for this video.
00:22:48.360 | Thank you very much for watching and I will see you again in the next one.