Back to Index

Fast intro to multi-modal ML with OpenAI's CLIP


Chapters

0:00 Intro
0:15 What is CLIP?
2:13 Getting started
5:38 Creating text embeddings
7:23 Creating image embeddings
10:26 Embedding a lot of images
15:08 Text-image similarity search
21:38 Alternative image and text search

Transcript

In this video, we're going to have a quick introduction to OpenAI's CLIP and how we can use it to almost move between the modalities of both language and images. Now, before we dive in, let's just quickly understand what CLIP is. So, it consists of two big models. In this implementation, we're going to be using a vision transformer that will embed images.

And we're going to use a normal text transformer that will embed text. During pre-training, OpenAI trained the model on pairs of images and text, and it trained the two models to output embedding vectors that are as close as possible to each other. So, the text transformer was trained to output a single 512-dimensional embedding that was as close as possible to the vision transformer's image embedding for that image-text pair.

So, what that means is that CLIP is able to take both images and text and embed them both into a similar vector space. And with that, we can do a lot of things. You can do image and text classification, you can do image and text search, and a huge number of other things.

Anything to do with images and text, there's a good chance we can do it with CLIP. So, let's have a look at how we actually use CLIP. OpenAI released CLIP in the openai/CLIP repository on GitHub. That contains CLIP, but we're not going to use that implementation. We're actually going to use a different implementation of CLIP.

So, this is on HuggingFace, so we're going to be using HuggingFace Transformers. And this is still from OpenAI, it's still CLIP. It's just an easy-to-use implementation of it through the HuggingFace Transformers library, which is a more standard library for actually doing anything with NLP and also now computer vision and some other things as well.

So, to get started, I'd recommend you install these libraries: pip install torch. You should probably go through the PyTorch.org instructions rather than following this here. So, go to PyTorch.org and install PyTorch using the specific install command they give for your platform or your OS. And then, pip install transformers and datasets.

You can still just use this command, but I'd recommend installing PyTorch from PyTorch.org instead. Now, after that, we're going to need our dataset. So, this is just a very simple dataset. It contains, I think, just under 10,000 images, and we only care about the images here. So, if we have a look, we have this dataset loaded as imagenet.

So, we'll go to the first item, and we'll just have a look at the image. Okay, and we have this, like, Sony radio. And we have other things as well. So, if we go to imagenet index 6494, there's another image here of a dog. Okay, just to point out that we have a lot of images in this dataset, and they cover a range of things.
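
As a sketch, the dataset loading might look like the following. The exact dataset identifier isn't spelled out in the transcript, so the imagenette subset of ImageNet on the Hugging Face Hub (just under 10,000 training images) is assumed here as a stand-in:

```python
from datasets import load_dataset

# Assumed dataset: imagenette, a small ImageNet subset with just under
# 10,000 training images; swap in whichever image dataset you are using
imagenet = load_dataset("frgfm/imagenette", "full_size", split="train")

imagenet[0]["image"]     # a PIL image, e.g. the Sony radio mentioned above
imagenet[6494]["image"]  # another PIL image, e.g. the dog
```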

There's not a huge number of different categories here, but they have dogs, they have radios, and a few other things. Now, I'm just going to go ahead and initialize everything. So, there's a few things here. From transformers, we're importing the CLIP tokenizer. So, the tokenizer is what's going to handle the preprocessing of our text into token ID tensors and other tensors.

We have the CLIP processor. That's like the tokenizer, but for images. So, this is actually just going to resize our images into the size that CLIP expects and also modify the pixel values as well. And then we have the CLIP model, which is CLIP itself. Okay. So, if you have CUDA, or MPS if you're on an M1 Mac, you just set that with this.

Okay. And then we're ready to actually initialize all of this. So, the model ID is going to be what we saw before. So, you come over here. We have this openai/clip-vit-base-patch32. Copy that. And here we go. Okay. And now we just need to -- look, the autocomplete is already telling me what to do.

Okay. So, the model is CLIPModel.from_pretrained with the model ID. I don't normally set the device inside that call, and I'm not sure you can, so I'm going to chain it on afterwards like this. Then the tokenizer, and the processor, both also loaded with from_pretrained. Cool. Almost there.
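
A sketch of that initialization, assuming the fast tokenizer class (the plain CLIPTokenizer works the same way):

```python
import torch
from transformers import CLIPTokenizerFast, CLIPProcessor, CLIPModel

# Use CUDA (or "mps" on an M1 Mac) if available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "openai/clip-vit-base-patch32"

model = CLIPModel.from_pretrained(model_id).to(device)
tokenizer = CLIPTokenizerFast.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)
```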

So we pass in the model ID, and that all looks good. Let's run that. Okay. Cool. So, now what we're going to do is take a look at how we actually create the text embeddings through CLIP. So, we start with a prompt. I'm going to go with "a dog in the snow". There aren't many pictures of dogs in the snow in the dataset, but there are some.

And what we need to do is tokenize the prompt. I'm not going to take the suggestion exactly as given. I'm going to go with tokenizer on the prompt, and we need return_tensors set to pt, so PyTorch tensors. We're going to be using PyTorch behind the scenes here, so make sure we do that.

And let's just have a look at what is actually in inputs. Okay. So, we get this input IDs tensor. You'll recognize this if you've used Hugging Face Transformers before. These are just the token IDs that represent the words from the prompt. Okay. And this is the attention mask.

Now, for us, it's going to all be ones. But if we had padding in here, anything beyond the length of our prompt would become a zero, telling the model to not pay attention to that part of the input. And from there, we can process this through CLIP. So, we call model.get_text_features.

And we pass in those inputs. Okay. And let's have a look at the shape of that. Okay. So, we have a 512-dimensional vector. Okay. So, that's the text embedding side of things.
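
Putting the text-embedding steps together, a sketch might look like this (variable names are assumptions):

```python
prompt = "a dog in the snow"

# Tokenize the prompt into input_ids and an attention_mask, as PyTorch tensors
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Embed the prompt with CLIP's text transformer
text_emb = model.get_text_features(**inputs)
text_emb.shape  # torch.Size([1, 512])
```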

Now, we need to go ahead and do the image embedding side of things. Okay. So, we're going to resize the image first with the processor. We're not passing any text in here. You can also process text through this processor; I'm just keeping it separate because it makes more sense to me. The keyword argument should actually be images, not image. Again, we want to return PyTorch tensors. Okay. And then I'm going to show you the image.

First, we just have a look at the shape. And one thing to note, which I can show you: in here, we actually have this pixel_values tensor. So, we need to extract that, and we're going to move it to the device as well.

I think the device I have set up right now is actually CPU. It doesn't make a difference for me, but that's fine. So, let's have a look at the shape. Okay. So, you see that we have this 224 by 224 image with three color channels. This is just the expected shape that will be consumed by the vision transformer of CLIP.
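
A sketch of that preprocessing step, using the first image from the dataset:

```python
image = imagenet[0]["image"]

# Resize the image and normalize its pixel values into the format CLIP expects,
# then pull out the pixel_values tensor and move it to the device
image_inputs = processor(images=image, return_tensors="pt")
pixel_values = image_inputs["pixel_values"].to(device)

pixel_values.shape  # torch.Size([1, 3, 224, 224])
```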

Okay. And so, import matplotlib.pyplot as plt, because I just want to show you this image, the resized image, with plt.imshow. But first I need to reshape it. Let me show you what I'm actually doing here. So, image.squeeze(0), to remove that first batch dimension.

Now, I'm going to transpose it, so we put the three color channels at the back. And this is for matplotlib to be able to actually show us this. So, I'm going to take that and put it here. Okay. And you can see that all of the pixel values are modified when we process the image through the processor.

So, the colors are kind of messed up. But you can see that this is a resized version of what we saw before. It's just kind of backwards now and flipped, but we can still see that it is that Sony radio.
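
A sketch of that visualization, using the pixel_values tensor from above. Note that the plain .T used here also swaps the height and width axes, which is why the picture comes out rotated and mirrored, and the normalized pixel values are why the colours look off:

```python
import matplotlib.pyplot as plt

# Drop the batch dimension, move the tensor to the CPU, and put the colour
# channels last so matplotlib can display the array
img = pixel_values.squeeze(0).cpu().numpy().T
plt.imshow(img)
plt.show()
```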

So, with that, we can go ahead and get the image features with model.get_image_features, passing in just the image tensor. Okay. And then let's have a look at the shape. Cool. Okay. So, similar to before, we have that 512-dimensional embedding vector. Okay. So, that's cool.
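
A sketch of that step:

```python
# Embed the preprocessed image with CLIP's vision transformer
image_emb = model.get_image_features(pixel_values=pixel_values)
image_emb.shape  # torch.Size([1, 512])
```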

And from here, we can do a lot of things. What I'm going to show you how to do is how to kind of search through this or at least compare a small number of images against our prompt so that we can actually see which one of those images is the most similar to a dog in the snow.

Okay. So, to do that, we're going to want to embed more of these images. I'm not going to embed loads of them. I'm just going to embed 100 images. Nothing crazy. So, we're going to import NumPy as np and set np.random.seed. This is just so you can replicate what I'm doing.

So, this will generate the same set of random numbers every time. Okay. So, the reason I'm doing this is because we want to take a sample out of the dataset. We don't want the whole dataset, and I want it to be at least somewhat random. So, to do that, the sample indices are going to be equal to np.random.randint from zero up to the length of imagenet.

The upper bound of randint is exclusive, so the length itself keeps the indices in range. And we need 100 of those. Okay. And then we're going to convert that into a list. Okay. I can just have a quick look at what is in there. Okay. So, just all of these numbers here. Okay. So, yeah, cool. And if we run it again, because we have that random seed set, the random set of numbers doesn't change.

And what I'm going to do is just create a list of images using those values. So, imagenet i for i in sample_idx. Okay. And then we check the length of images. Okay. So, now we have 100 images from our dataset. Now we want to just go ahead and literally take everything we've just done and put it into a for loop to create the embeddings for all of these images.
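
A sketch of that sampling step. The seed value is arbitrary, and the list comprehension here already pulls out the ['image'] field for each record, which is the fix described a little further down:

```python
import numpy as np

np.random.seed(0)  # any fixed seed makes the sample reproducible

# np.random.randint's upper bound is exclusive, so len(imagenet) keeps the
# indices in range; 100 indices are drawn (possibly with repeats)
sample_idx = np.random.randint(0, len(imagenet), 100).tolist()

images = [imagenet[i]["image"] for i in sample_idx]
len(images)  # 100
```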

Okay. So, that will look something like this. I'm using TQDM here. This is just a progress bar so we can see where we are. Batch size, I'm saying how many images to perform this for in any one go. You can increase this if you're using a bigger GPU or whatever else.

Image array, I'm setting that to None for now; we initialize it in the first loop. Okay. And then we're just doing the same thing as before. So, I'm selecting a batch of images based on the batch size, then we're processing and resizing the images from that batch, and we're getting the image features, which looks exactly the same as before.

I think before I actually didn't include the pixel_values keyword, but it's the same thing; it's just the default argument. Then we're converting into a NumPy array. Did I show you this before? I don't think so. But the squeeze here is the same thing as what I showed you up above.

So, we squeeze the first dimension out of that like we did there. And then we are moving that batch of embeddings to the CPU, if it's not already on the CPU, detaching it from the gradient graph of the PyTorch model, i.e. CLIP, and then converting it into a NumPy array.

And then I'm going to add that batch of embeddings to a large array of all image embeddings. And that's where the image array comes in. So, let's run that. So, we come up here. I made a mistake in the code. So, here I'm actually pulling in the full row or record at any one time.

We don't want to do that. We want the image itself. So, run that again. Okay. And now if we check the type of images[0], we should see it's a PIL image. Cool. We have a PIL image here. Now we can run this. Okay. It won't take long. And now we have 100 512-dimensional image embeddings from our dataset.
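
Putting the loop together, a sketch might look like the following; the batch size of 16 and the variable names are assumptions:

```python
import numpy as np
from tqdm.auto import tqdm  # progress bar

batch_size = 16   # how many images to embed per forward pass (assumption)
image_arr = None  # will hold all 100 image embeddings

for i in tqdm(range(0, len(images), batch_size)):
    # Select and preprocess a batch of images
    batch = images[i:i + batch_size]
    batch = processor(images=batch, return_tensors="pt")["pixel_values"].to(device)

    # Embed the batch, then detach it from the PyTorch graph, move it to the
    # CPU, and convert it to a NumPy array
    batch_emb = model.get_image_features(pixel_values=batch)
    batch_emb = batch_emb.cpu().detach().numpy()

    # Append the batch to the running array of all image embeddings
    if image_arr is None:
        image_arr = batch_emb
    else:
        image_arr = np.concatenate((image_arr, batch_emb), axis=0)

image_arr.shape  # (100, 512)
```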

And we can now use them to compare to our initial text embedding and see which one of these matches most closely to that text embedding. Okay. So, I'm going to be using dot product similarity. So, there's just one thing to be aware of with that. And that is that it considers both the magnitude of the vector and also the angle.

So, in this case, that will -- that can throw off our results. So, we should normalize all of the image embeddings so that we are not looking at the magnitude of vectors. And we're only focusing on the angular similarity between our text embedding and these image embeddings. So, to do that, we need to -- I'll just show you quickly.

So, I look at the minimum and maximum, and, you know, they're kind of all over the place. So, to normalize, we need to do this: the image array divided by np.linalg.norm of the image array, with axis equal to one. And let me show you what that is.

So, we have all these numbers. These are the norms, basically telling us, for each one of these vectors, what we should divide it by in order to bring all of them to the same unit magnitude. So, take a look at the shape. It will be 100. Yeah, it is.

So, I think I need to transpose this. Okay. And then -- so, the image array, the shape is going to be transposed now. So, I'm going to transpose it again. Yeah. Image array equals image array transpose. Okay. Cool. And now if we have a look at the minimum and maximum.

So, minimum and maximum. We get these values, which are more reasonable. Okay. So, now what we can do is use dot product similarity to actually compare these. So, I'm going to take the text embedding, and similar to before, we need to move it to the CPU, detach it from the PyTorch graph, and then convert it to a NumPy array.
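
A sketch of the normalization and the conversion of the text embedding; the two transposes mirror what's described above, and the same thing can be written in one line with keepdims=True:

```python
# L2-normalize each image embedding so dot-product similarity reflects only
# the angle between vectors, not their magnitudes
image_arr = image_arr.T / np.linalg.norm(image_arr, axis=1)
image_arr = image_arr.T  # transpose back to shape (100, 512)

# Equivalent one-liner:
# image_arr = image_arr / np.linalg.norm(image_arr, axis=1, keepdims=True)

# Bring the text embedding over to the CPU and into NumPy as well
text_emb = text_emb.cpu().detach().numpy()

image_arr.min(), image_arr.max()  # values now sit in a much smaller range
```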

Okay. Yeah. And then for the scores, all we need to do is np.dot, and we pass the text embedding followed by the image array. And actually, I think I need to transpose this again, so maybe we could have avoided transposing up here. Okay. Yeah. So, with the scores that we get here, we get a single score for every single vector, as we can see.

Shape 100. And they are the dot product similarity scores. So, what we can now do is sort based on this scores array. And just return, like, the top, say, the top five images and see what the top five most similar images are for our query. Okay. So, we're going to return the top K.

So, top k is going to be the five most similar, or the five items with the highest score. And then we want to take the index values using np.argsort, passing in the negative of the scores so the sort comes out in descending order. And we just need to make sure we index into scores correctly, because scores has an extra leading dimension.

So, let me show you: scores[0].shape. Okay. So, it's taking the 100 values there. And then I want to take the top k from that. Okay. So, what we're left with is these five index values, which are essentially the indexes of the image embeddings.
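
As a sketch, the scoring and top-k selection look roughly like this:

```python
# Dot-product similarity between the prompt embedding and every image
# embedding; text_emb is (1, 512) and image_arr.T is (512, 100)
scores = np.dot(text_emb, image_arr.T)
scores.shape  # (1, 100) -- one score per image

top_k = 5
# argsort on the negated scores gives indices in descending score order
idx = np.argsort(-scores[0])[:top_k]
```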

And, therefore, the images that are the most similar to our query. So, we'll use matplotlib again to visualize those. So, we do for i in idx. Let's print the score first, so scores[0][i]. And then I'm going to show the image and just call plt.show.
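
A sketch of that results loop:

```python
import matplotlib.pyplot as plt

# Show the top-k images along with their similarity scores
for i in idx:
    print(scores[0][i])
    plt.imshow(images[i])
    plt.show()
```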

Okay. Cool. So, yeah. I mean, that's it. The first item, as we would expect, is a dog in the snow. So, after that, we get dogs and we get, like, these snowy areas. The reason for that is that we just don't have any more images of dogs in the snow.

This one, I don't know what this is. It's like a toy; maybe it's a dog, maybe it's a bear, I'm not sure. But I suppose technically that's like a dog in the snow. So, we have that. So, yeah, obviously the model is performing pretty well. And I think it's really cool that we can do that so easily.

And, yeah, I mean, CLIP is, I think, an amazing model that we can use to do a load of cool things across both the text and image domains, which is super interesting. Just a couple of years ago this sort of thing wasn't possible.

At least, it didn't seem like this degree of accuracy was going to be happening anytime soon. So, this is really cool. Here I've obviously shown you how to do a text-to-image search, but in reality what we're doing is just searching through vectors.

So, it doesn't matter which direction you're doing that search in; the vectors are all the same. So, if you want to do a text-to-text search with CLIP, you could. If you want to do an image-to-image search, you could. If you want to do image-to-text, or all of those things at once, you could.

You're just searching through vectors, so what is behind those vectors doesn't really matter so much. Okay, so I think that's it for this video. I think CLIP is super interesting and I hope you do too. In the future, or very soon actually, I'm going to be going into a lot more detail on CLIP.

So, if you are interested in that, subscribe and click on the little notification button and you will get a notification about that pretty soon. But, until then, that's it for this video. Thank you very much for watching and I will see you again in the next one. Bye.