Fast intro to multi-modal ML with OpenAI's CLIP
Chapters
0:00 Intro
0:15 What is CLIP?
2:13 Getting started
5:38 Creating text embeddings
7:23 Creating image embeddings
10:26 Embedding a lot of images
15:08 Text-image similarity search
21:38 Alternative image and text search
00:00:00.000 |
In this video, we're going to have a quick introduction to OpenAI's CLIP 00:00:04.860 |
and how we can use it to almost move between the modalities of both language and images. 00:00:14.860 |
Now, before we dive in, let's just quickly understand what CLIP is. 00:00:22.860 |
In this implementation, we're going to be using a vision transformer that will embed images. 00:00:29.360 |
And we're going to use a normal text transformer that will embed text. 00:00:34.860 |
During pre-training, OpenAI trained the model on pairs of images and text. 00:00:41.360 |
And it trained them to both output embedding vectors that are as close as possible to each other. 00:00:50.360 |
So, the text transformer was trained to output a single 512-dimensional embedding 00:00:58.860 |
that was as close as possible to the vision transformer's image embedding for that image-text pair. 00:01:07.860 |
So, what that means is that CLIP is able to take both images and text and embed them both into a similar vector space. 00:01:21.360 |
You can do image and text classification, you can do image and text search, and a huge number of things. 00:01:28.360 |
Anything to do with images and text, there's a good chance we can do it with CLIP. 00:01:33.860 |
So, let's have a look at how we actually use CLIP. 00:01:37.860 |
OpenAI released a GitHub repository, OpenAI CLIP here. 00:01:42.860 |
This contains CLIP, but we're not going to use this implementation. 00:01:47.360 |
We're actually going to use this implementation of CLIP. 00:01:50.860 |
So, this is on HuggingFace, so we're going to be using HuggingFace Transformers. 00:01:55.360 |
And this is still from OpenAI, it's still CLIP. 00:01:59.360 |
It's just an easy-to-use implementation of it through the HuggingFace Transformers library, 00:02:04.360 |
which is a more standard library for actually doing anything with NLP and also now computer vision and some other things as well. 00:02:14.360 |
So, to get started, I'd recommend you install these libraries: pip install torch. 00:02:18.860 |
Although for PyTorch, you should probably go through the PyTorch.org instructions rather than following this here. 00:02:25.360 |
So, go to PyTorch.org and just install PyTorch using the specific install command they give for your platform and OS. 00:02:37.360 |
And then, pip install transformers and datasets. 00:02:40.360 |
You can still just use that command for those two, but for PyTorch I'd recommend installing it from PyTorch.org instead. 00:02:46.360 |
Now, after that, we're going to need our dataset. 00:02:56.360 |
It contains, I think, just under 10,000 images, and we only care about the images here. 00:03:07.360 |
So, we'll go to the first item, and we'll just have a look at image. 00:03:26.360 |
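(A rough sketch of this step. The dataset ID isn't spoken aloud, so the "frgfm/imagenette" ID and the `imagenette` variable name below are assumptions that happen to match the "just under 10,000 images" description.)

```python
from datasets import load_dataset

# load an ImageNet-style image dataset from the HuggingFace hub
imagenette = load_dataset(
    "frgfm/imagenette",   # assumed dataset ID
    "full_size",
    split="train"
)

imagenette[0]["image"]  # view the first item's image (a PIL image)
```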
Okay, just to point out that we have a lot of images in here in this dataset that cover a range of things. 00:03:34.360 |
There's not a huge number of different categories here, but they have dogs, they have radios, and a few other things. 00:03:40.360 |
Now, I'm just going to go ahead and initialize everything. 00:03:47.360 |
From transformers, we're importing the clip tokenizer. 00:03:51.360 |
So, the tokenizer is what's going to handle the preprocessing of our text into token ID tensors and other tensors. 00:04:06.360 |
And we're also importing the CLIP processor, which is going to resize our images into the size that CLIP expects and also modify the pixel values as well. 00:04:24.360 |
So, if you have CUDA, or MPS if you're on an M1 Mac, you just set that with this. 00:04:35.360 |
And then we're ready to actually initialize all of this. 00:04:39.360 |
So, the model ID is going to be what we saw before. 00:04:48.360 |
We have this openai/clip-vit-base-patch32. 00:04:57.360 |
And now we just need to -- look, I'm being told what to do already. 00:05:02.360 |
So, model = CLIPModel.from_pretrained(model_id). 00:05:05.360 |
And I'm going to move it to the device -- though I don't normally set the device like that. 00:05:38.360 |
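(Putting the initialization together, it looks roughly like this. The exact tokenizer class and device-selection logic aren't fully spelled out in the audio, so treat this as a sketch.)

```python
from transformers import CLIPTokenizerFast, CLIPProcessor, CLIPModel
import torch

# use CUDA, or Apple's MPS on an M1 Mac, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else \
         ("mps" if torch.backends.mps.is_available() else "cpu")

model_id = "openai/clip-vit-base-patch32"

tokenizer = CLIPTokenizerFast.from_pretrained(model_id)  # text preprocessing
processor = CLIPProcessor.from_pretrained(model_id)      # image preprocessing
model = CLIPModel.from_pretrained(model_id).to(device)   # the CLIP model itself
```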
So, now what we're going to do is take a look at how we actually create the text embeddings, using a prompt like "a dog in the snow". 00:05:50.360 |
There aren't many pictures of dogs in the snow in the dataset, but there are some. 00:05:56.360 |
And what we need to do is tokenize the prompt. 00:06:03.360 |
I'm going to go with tokenizer, prompt, and we need to return tensors using PyTorch. 00:06:11.360 |
So, we're going to be using PyTorch behind the scenes here. 00:06:18.360 |
And let's just have a look at what is actually in inputs. 00:06:26.360 |
So, you'll recognize this if you've used Hugging Face Transformers before. 00:06:32.360 |
And these are just the token IDs that represent the words from this. 00:06:45.360 |
But if we had padding in here, anything beyond the length of our prompt would get a zero in the attention mask, 00:06:50.360 |
telling the model not to pay attention to that part of the input. 00:06:56.360 |
And from there, we can process this through CLIP. 00:07:23.360 |
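(A sketch of the text-embedding step; the prompt string is the "a dog in the snow" query used throughout the video.)

```python
prompt = "a dog in the snow"

# tokenize the prompt into token ID and attention mask tensors
inputs = tokenizer(prompt, return_tensors="pt").to(device)
print(inputs)  # {'input_ids': ..., 'attention_mask': ...}

# create the 512-dimensional text embedding
text_emb = model.get_text_features(**inputs)
print(text_emb.shape)  # torch.Size([1, 512])
```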
So, that's the text embedding side of things. 00:07:27.360 |
Now, we need to go ahead and do the image embedding side of things. 00:07:32.360 |
So, we're going to resize the image first with the processor. 00:07:38.360 |
So, you can also process text through this processor. 00:07:42.360 |
I'm just keeping it separate because it makes more sense to me. 00:07:52.360 |
Again, we want to return tensors using PyTorch. 00:07:57.360 |
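(A sketch of the image preprocessing step, assuming `image` is the first image from the dataset loaded earlier.)

```python
image = imagenette[0]["image"]  # a PIL image from the dataset

# resize and normalize the image into the format CLIP's vision transformer expects
image_inputs = processor(images=image, return_tensors="pt").to(device)
print(image_inputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])
```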
And then we can have a look at the -- I'm going to show you the image. 00:08:18.360 |
And we're going to move those to the device as well. 00:08:21.360 |
I think the device I have set up right now is actually CPU. 00:08:24.360 |
It doesn't make a difference for me, but that's fine. 00:08:32.360 |
So, you see that we have this 224 by 224 image with three color channels. 00:08:39.360 |
So, this is just the expected shape that will be consumed by the vision transformer of clip. 00:09:04.360 |
And I need to -- so, I just need to resize it. 00:09:07.360 |
Let me show you what I'm actually doing here. 00:09:13.360 |
So, I'm going to remove that first dimension. 00:09:17.360 |
So, we put the three color channels at the back. 00:09:22.360 |
And this is for matplotlib to be able to actually show us this. 00:09:35.360 |
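(A sketch of that reshaping, using the variable names assumed above.)

```python
import matplotlib.pyplot as plt

# drop the batch dimension and move the colour channels to the last axis
# so matplotlib can display the processed image
pixels = image_inputs["pixel_values"].squeeze(0)   # (3, 224, 224)
pixels = pixels.permute(1, 2, 0).cpu().numpy()     # (224, 224, 3)

plt.imshow(pixels)
plt.show()
```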
And you can see -- so, the minimum and maximum colour values, or really all of the 00:09:44.360 |
pixel values, are modified when we process the image through the processor. 00:09:51.360 |
But you can see that this is like a resized -- you know, what we saw before. 00:10:04.360 |
So, with that, we can go ahead and get the image features. 00:10:21.360 |
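(That step might look like this:)

```python
# create the 512-dimensional image embedding from the preprocessed pixel values
image_emb = model.get_image_features(**image_inputs)
print(image_emb.shape)  # torch.Size([1, 512]) -- same size as the text embedding
```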
So, similar to before, we have that 512-dimensional embedding vector. 00:10:31.360 |
What I'm going to show you how to do is how to kind of search through this or at least compare a small number of images against our prompt so that we can actually see which one of those images is the most similar to a dog in the snow. 00:10:49.360 |
So, to do that, we're going to want to embed more of these images. 00:11:06.360 |
So, this is just so you can replicate what I'm doing. 00:11:09.360 |
So, this sets a random seed, so we always generate the same set of random numbers. 00:11:18.360 |
So, the reason I'm doing this is because we want to take a sample out of the dataset. 00:11:27.360 |
So, to do that, we want to go -- so, sample indices are going to be equal to NumPy random.randint from zero up to the length of ImageNet. 00:11:49.360 |
And then we're going to convert that into a list. 00:11:53.360 |
I can just have a quick look at what is in there. 00:12:05.360 |
And if we run it again, because we have that random seed set, the random set of numbers doesn't change. 00:12:13.360 |
And what I'm going to do is just create a list of images using those values. 00:12:34.360 |
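(A sketch of the sampling step. The seed value and the `imagenette` variable name are assumptions; the sample size of 100 matches the 100 embeddings mentioned later.)

```python
import numpy as np

np.random.seed(0)  # assumed seed, so the sampled indices are reproducible

# sample 100 random indices from across the dataset and pull out those images
sample_idx = np.random.randint(0, len(imagenette), 100).tolist()
images = [imagenette[i]["image"] for i in sample_idx]
```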
Now we want to just go ahead and literally take everything we've just done and put it into a for loop to create the embeddings for all of these images. 00:12:50.360 |
This is just a progress bar so we can see where we are. 00:12:53.360 |
Batch size, I'm saying how many images to perform this for in any one go. 00:12:59.360 |
You can increase this if you're using a bigger GPU or whatever else. 00:13:07.360 |
Image array, I'm setting that to none for now. 00:13:16.360 |
And then we're just doing the same thing as before. 00:13:18.360 |
So, I'm selecting a batch of images based on the batch size. 00:13:24.360 |
And then we're processing and resizing the images from that batch, and getting the image features, exactly the same as before. 00:13:33.360 |
I think before I actually didn't include pixel values. 00:13:52.360 |
It's the same thing as what I showed you up here. 00:13:54.360 |
So, we squeeze the first dimension out of that like we did here. 00:14:01.360 |
And then we are moving that batch of embeddings to the CPU. 00:14:10.360 |
We're detaching it from the gradient computation graph of the PyTorch model, i.e. CLIP. 00:14:19.360 |
And then we're converting into a NumPy array. 00:14:22.360 |
And then I'm going to add that batch of embeddings to a large array of all image embeddings. 00:14:41.360 |
So, here I'm actually pulling in the full row or record at any one time. 00:15:08.360 |
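(The whole loop might look something like this; the batch size of 16 is an assumption.)

```python
from tqdm.auto import tqdm

batch_size = 16   # raise this if you're using a bigger GPU
image_arr = None

for i in tqdm(range(0, len(images), batch_size)):
    # select a batch of images based on the batch size
    batch = images[i:i + batch_size]
    # preprocess and resize the batch
    batch = processor(images=batch, return_tensors="pt")["pixel_values"].to(device)
    # embed the batch, then detach from the graph, move to CPU, convert to NumPy
    batch_emb = model.get_image_features(pixel_values=batch)
    batch_emb = batch_emb.detach().cpu().numpy()
    # add this batch of embeddings to the large array of all image embeddings
    if image_arr is None:
        image_arr = batch_emb
    else:
        image_arr = np.concatenate((image_arr, batch_emb), axis=0)

print(image_arr.shape)  # (100, 512)
```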
And now we have 100 512-dimensional image embeddings from our dataset. 00:15:16.360 |
And we can now use them to compare to our initial text embedding and see which one of these matches most closely to that text embedding. 00:15:27.360 |
So, I'm going to be using dot product similarity. 00:15:30.360 |
So, there's just one thing to be aware of with that. 00:15:34.360 |
And that is that it considers both the magnitude of the vector and also the angle. 00:15:40.360 |
So, in this case, that will -- that can throw off our results. 00:15:46.360 |
So, we should normalize all of the image embeddings so that we are not looking at the magnitude of vectors. 00:15:54.360 |
And we're only focusing on the angular similarity between our text embedding and these image embeddings. 00:16:03.360 |
So, to do that, we need to -- I'll just show you quickly. 00:16:12.360 |
You know, they're kind of all over the place. 00:16:18.360 |
So, we do the image array divided by np.linalg.norm. 00:16:41.360 |
And these norms are basically telling us, for each one of these vectors, what we should divide it by in order to bring each of them to the same unit magnitude, pretty much. 00:17:21.360 |
And then -- so, the image array, the shape is going to be transposed now. 00:17:38.360 |
And now if we have a look at the minimum and maximum. 00:17:44.360 |
We get these values, which are more reasonable. 00:17:49.360 |
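(A sketch of the normalization. The video does it with a transpose; using keepdims is an equivalent way that keeps the (100, 512) shape.)

```python
# normalize each image embedding to unit length so that dot product similarity
# only reflects the angle between vectors, not their magnitude
norms = np.linalg.norm(image_arr, axis=1, keepdims=True)  # one norm per vector
image_arr = image_arr / norms

print(image_arr.min(), image_arr.max())  # the values now sit in a much smaller range
```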
So, now what we can do is use dot product similarity to actually compare these. 00:17:57.360 |
So, text embedding, I'm going to take the text embedding. 00:18:01.360 |
And similar to before, what we did is we need to move it to the CPU, detach it from the PyTorch graph, and then convert it to NumPy array. 00:18:16.360 |
And then for the scores, all we need to do is NumPy dot. 00:18:20.360 |
And we are going to put the text embedding followed by the image array. 00:18:26.360 |
And actually, I think I need to transpose this again. 00:18:28.360 |
So, maybe we could have avoided transposing up here. 00:18:36.360 |
So, the scores that we get here, we get a single score for every single vector, as we can see. 00:18:45.360 |
And they are the dot product similarity scores. 00:18:47.360 |
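(A sketch of the scoring step:)

```python
# move the text embedding to CPU, detach it from the PyTorch graph, convert to NumPy
text_emb = text_emb.detach().cpu().numpy()   # shape (1, 512)

# dot product similarity against every image embedding
scores = np.dot(text_emb, image_arr.T)       # shape (1, 100): one score per image
```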
So, what we can now do is sort based on this scores array. 00:18:52.360 |
And just return, say, the top five images and see what the five most similar images are for our query. 00:19:07.360 |
So, top K is going to be the five most similar, or the five items with the highest score. 00:19:13.360 |
And then we want to take the index values using np.argsort. 00:19:19.360 |
We're going to pass the negative of the scores there, so the highest scores come first. 00:19:24.360 |
And just make sure we index into the first row, because scores has that extra leading dimension. 00:19:28.360 |
So, we're actually just taking the first top_k of those sorted indices -- let me show you. 00:19:45.360 |
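(A sketch of the top-k selection:)

```python
top_k = 5
# argsort on the negated scores gives highest-similarity-first;
# scores has shape (1, 100), so we index into the first row
idx = np.argsort(-scores[0])[:top_k]
print(idx)  # index values of the five most similar image embeddings
```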
So, what we're left with is these five index values, which are essentially indexes of the image embeddings. 00:19:55.360 |
And, therefore, the images that are the most similar to our query. 00:20:01.360 |
So, we'll use matplotlib again to visualize those. 00:20:34.360 |
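(A sketch of the visualization:)

```python
# show the top five matching images side by side
fig, axes = plt.subplots(1, top_k, figsize=(16, 4))
for ax, i in zip(axes, idx):
    ax.imshow(images[i])
    ax.axis("off")
plt.show()
```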
The first item, as we would expect, is a dog in the snow. 00:20:39.360 |
So, after that, we get dogs and we get, like, these snowy areas. 00:20:45.360 |
The reason for that is that we just don't have any more images of dogs in the snow. 00:20:53.360 |
One of them is like a toy -- maybe it's a dog, maybe it's a bear. 00:20:57.360 |
But I suppose technically that's like a dog in the snow. 00:21:02.360 |
So, yeah, obviously the model is performing pretty well. 00:21:06.360 |
And I think that's really cool that we can do that so easily. 00:21:09.360 |
And, yeah, I mean, CLIP is, I think, an amazing model that we can use to do a load of cool things across both the text and image domains. 00:21:22.360 |
And if you think about it, just a couple of years ago this sort of thing wasn't possible. 00:21:28.360 |
It didn't seem like it was going to be happening anytime soon, at least not to this degree of accuracy. 00:21:40.360 |
Here I've obviously shown you how to do a text-to-image search. 00:21:46.360 |
But in reality, what we're doing is just searching through the vectors. 00:21:52.360 |
So, it doesn't matter which direction you're doing that search in. 00:21:59.360 |
So, if you want to do a text to text search with Clip, you could. 00:22:03.360 |
If you want to do image to image search, you could. 00:22:05.360 |
If you want to do image to text or all of those things all at once, you could. 00:22:11.360 |
In the end, you're just searching through vectors. 00:22:14.360 |
So, what is behind those vectors doesn't really matter so much. 00:22:21.360 |
I think CLIP is super interesting and I hope you find it interesting as well. 00:22:26.360 |
In the future, or very soon actually, I'm going to be going into a lot more detail on CLIP. 00:22:33.360 |
So, if you are interested in that, subscribe and click on the little notification button 00:22:40.360 |
and you will get a notification about that pretty soon. 00:22:48.360 |
Thank you very much for watching and I will see you again in the next one.