
Fast Zero Shot Object Detection with OpenAI CLIP


Chapters

0:0 Early Progress in Computer Vision
2:3 Classification vs. Localization and Detection
3:55 Zero Shot with OpenAI CLIP
5:23 Zero Shot Object Localization with OpenAI CLIP
6:40 Localization with Occlusion Algorithm
7:44 Zero Shot Object Detection with OpenAI CLIP
8:34 Data Preprocessing for CLIP
13:55 Initializing OpenAI CLIP in Python
17:5 Clipping the Localization Visual
18:32 Applying Scores for Visual
20:25 Object Localization with New Prompt
20:52 Zero Shot Object Detection in Python
21:20 Creating Bounding Boxes with Matplotlib
25:15 Object Detection Code
27:11 Object Detection Results
28:29 Trends in Multi-Modal ML

Transcript

The ImageNet Large Scale Visual Recognition Challenge was a world-changing competition that ran from around 2010 to 2017. During this time, the competition acted as the place to go if you needed to find the current state of the art in image classification, object localization, and object detection. As well as that, from 2012 onwards, it really acted as the catalyst for the explosion in deep learning.

Researchers fine-tuned better-performing computer vision models year on year, but there was an unquestioned assumption causing problems. It was assumed that every new task required model fine-tuning. That required a lot of data, and a lot of data required a lot of capital and time. It wasn't until recently that this assumption was challenged and proven wrong.

The astonishing rise of what are called multi-modal models has made what was thought impossible very possible across various domains and tasks. One of those tasks is zero-shot object detection and localization. Zero-shot refers to taking a model and applying it to a new domain without ever fine-tuning it on data from that new domain.

So that means we can take a model that works in one domain, say classification in one particular area on one dataset, and use that same model, without any fine-tuning, for object detection in a completely different domain, without the model ever seeing any training data from that new domain.

So in this video, we're going to explore how to use OpenAI's CLIP for zero-shot object detection and localization. Let's begin with taking a quick look at image classification. Now, image classification can kind of be seen as one of the simplest tasks in visual recognition. And it's also the first step on the way to object detection.

At its core, it's just assigning a categorical label to an image. Now, moving on from image classification, we have object localization. Object localization is image classification followed by the identification of where in the image the specific object actually is. So we are localizing the object. To do that, we essentially identify the coordinates of the object within the image, and the typical approach is to return an image with a bounding box surrounding the object that you are looking for.

And then we take this one step further to perform object detection. With detection, we are localizing multiple objects within the image, or we have the capability to identify multiple objects within the image. So in this example, we have a cat and a dog. We would expect with object detection to identify both the cat and the dog.

In the case of us having multiple dogs or multiple cats in this image, we would also expect the object detection algorithm to identify each one of those independently. Now, in the past, if we wanted to switch a model between any one of these tasks, we'd have to fine-tune it on more data.

If we wanted to switch it to another domain, we would also have to fine-tune it on new data from that domain. But that's no longer always the case: models like OpenAI's CLIP can perform each one of these tasks in a zero-shot setting. Now, OpenAI's CLIP is a multi-modal model that has been pre-trained on a huge number of text and image pairs.

And it essentially works by identifying text and image pairs that have a similar meaning and placing them within a similar vector space. Every text and every image gets converted into a vector and they are placed in a shared vector space. And the vectors that appear close together, they have a similar meaning.

Now, CLIP's very broad pre-training means that it can perform very effectively across a lot of different domains. It's seen a lot of data, and so it has a good understanding of all these different things. And we can even adjust the task being performed with just a few code changes.

We don't actually have to adjust the model itself. We just adjust the code around the model. And that's very much thanks to CLIP's focus on sort of comparing these vectors. So for example, for classification, we give CLIP a list of our class labels, and then we pass in images, and we just identify within that vector space where those images are with respect to those class label vectors, and which class label is most similar to our particular image.

And then that is our prediction. So that most similar class label, that's our predicted class. Now, for object localization, we apply a very similar type of logic. As before, we create a class label, but unlike before, we don't feed the entire image into CLIP. To localize an object, we have to break the image into patches.

We then pass a window over all of those patches, moving across the entire image, left to right, top to bottom. And we generate an image embedding for each of those windows. And then we calculate the similarity between each one of those window embeddings produced by CLIP and the class label embedding, returning a similarity score for every single patch.

Now, after calculating the similarity score for every single patch, we use that to create almost like a map of relevance across the entire image. And then we can use that map to identify the location of the object of interest. And from that, we will get something that looks kind of like this.

So most of the image will be very dark and black, which means the object of interest is not in that space. And then using that localization map, we can create a more traditional bounding box visualization as well. Both of these visuals are capturing the same information, we're just displaying it in a different way.

Now, there are also other approaches to this. I recently hosted a talk with two sets of people, actually: Federico Bianchi from Stanford's NLP group, and also Raphael Pisoni. Both of them have worked on an Italian CLIP project, and part of that was performing object localization. Now, to do that, they used a slightly different approach to the one I'm going to demonstrate here.

And we can think of it as almost like the opposite. So whereas we slide a window over the whole image, they slide a black patch over the whole image, which hides what is behind that patch. And then they feed the image into CLIP. And essentially, as you slide the patch over the image, you are hiding a part of the image.

And therefore, if the similarity score drops when the patch is over a certain area, you know that the object you're looking for is probably within that space. And that's called the occlusion algorithm. And then moving on to object detection, which is like the last level in these three tasks, we will be identifying multiple objects.

Now, there's a very fine line between object localization and object detection, but you can simply think of it as localization for multiple classes and multiple objects. With our cat and butterfly image, we will be searching for two objects, a cat and a butterfly. And with that, we can draw a bounding box around both of those objects.

And essentially, what we're doing there is using localization for a single object, but then we're putting both of those together in a loop in our code, and we're producing this object detection process. Now, we've covered the idea behind image classification onto object localization and object detection. Now, let's have a look at how we actually implement all of this.

Now, before we move on to any classification, localization, or detection task, we need some data. We're going to use a small demo dataset called jamescalam/image-text-demo, and we can download it like this. We're using Hugging Face datasets here, which we can install with pip install datasets, and this dataset is very small, just 21 text-image pairs, okay?
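Loading the data might look something like the sketch below. The dataset ID, the split name, the image column, and the row index of the cat-and-butterfly photo are assumptions here; adjust them to whatever the dataset actually exposes.

```python
# A minimal sketch of the dataset download. Dataset ID, split, column
# name, and row index are assumptions; adjust as needed.
from datasets import load_dataset  # pip install datasets

data = load_dataset("jamescalam/image-text-demo", split="train")
print(data)               # expect a small set of 21 text-image pairs

image = data[2]["image"]  # a PIL image; index 2 chosen for illustration
```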

One of those is the image you've already seen, the cat with a butterfly landing on its nose; very curious how they got that photo. Now, after you've downloaded the dataset, you can see that we're going to be using this image here, and what we want to do is not use the image file itself, because at the moment it's a PIL Python image object, but instead convert it into a tensor.

Now, we're going to be using PyTorch later on, so what we do here is just transform the image into a tensor. We use torchvision transforms, which is a typical pipeline tool in computer vision, and we just use ToTensor, okay? And then we process our image through that pipeline, and we can see that we get this, okay?
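As a rough sketch of that step, assuming the PIL image from the dataset is stored in `image`:

```python
# Convert the PIL image into a PyTorch tensor with torchvision.
from torchvision import transforms

img = transforms.ToTensor()(image)
print(img.shape)  # [3, height, width], pixel values scaled to [0, 1]
```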

So, what are these values here? We have the three color channels, red, green, and blue, that make up the image, and then the height and width of the image in pixels. Now, we need a slightly different format when we are processing everything. One, we need to create those patches, and two, we need to process it through a PyTorch model, and we also need the batch dimension for that.

So, the first thing we're going to do is add the batch dimension. It's just a single image, so we just have a one in there, but we need it anyway. And then we come down to here. This is where we break the image into patches, okay? Each patch is going to be 256 pixels in both height and width.

So, the first thing we do here is unfold, and we get this here. We get this 256 and this 20. Now, the 20 is the height of the image in these 256-pixel patches, and we can visualize that here, all right? So, now we have all these kind of slivers of the image.

That's just the vertical component of each patch, so we use unfold again, but this time on the second dimension, targeting what was this dimension here, and we also get another 256. Now, if we visualize that, we get our full patches, okay, like this. Now, if we look at a single patch on its own, it doesn't really tell us anything about the image, right?

And even when we're over the cat, these patches are way too small to actually tell us anything. If CLIP is processing a single patch at a time, it's probably not going to tell us anything. Maybe it could tell us that there's some hair in this patch or that there's an eye in this patch, but beyond that, it's not going to be very useful.
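Before we fix that, here is a rough sketch of the patching steps just described, assuming 256-pixel patches. In this sketch the leading batch-like dimension of one comes from unfolding the channel axis first, which ends up with the same [1, 20, 13, 3, 256, 256] shape we use later on.

```python
patch = 256  # patch size in pixels

# unfold the channel axis first, which also gives us the leading
# batch-like dimension: [3, H, W] -> [1, H, W, 3]
patches = img.unfold(0, 3, 3)
# unfold the height axis into 256-pixel slices: -> [1, H//256, W, 3, 256]
patches = patches.unfold(1, patch, patch)
# unfold the width axis to complete the grid: -> [1, H//256, W//256, 3, 256, 256]
patches = patches.unfold(2, patch, patch)

print(patches.shape)  # e.g. [1, 20, 13, 3, 256, 256] for this image
```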

So, rather than feeding single patches into CLIP, what we do is feed in a window of six by six patches, or we can modify that value if we prefer, and that gives us a big patch to pass over to CLIP. Now, the reason we don't just create these bigger patches from the start is that when we're sliding through the image, we want some degree of overlap between each patch.

Okay, so we create these smaller patches, and then what we can do is slide across just one little patch at a time, and we define that using the stride variable. So, if we come down to here, we have window, we have stride, and here we go.

This is our code for going through the whole image, creating a big patch at every step, okay? So, we loop over Y, going through the whole Y-axis, and then within that, we go across left to right with each step, and we initialize an empty big patch array; this is our full window.

We get the current patch, so, okay, let's say we start at zero, zero, X zero, Y zero. We go from zero to six, and zero to six here, right? So, that gives us the very top-left corner or window of the image, and then we're literally going through and processing all of that, and you can see that happening here.

As Y and X increase, we're moving through that image, and we're seeing each big patch from our image, okay? Sliding across a single small patch at a time so that we don't miss any important information. Now, this is how we're going to run through the whole image, but before we do that, we actually need CLIP, so let's go ahead and initialize CLIP.

So, to do that, all we do is this: we're using Hugging Face Transformers, which is using PyTorch in the back there, so we need the CLIP processor, which is like a pre-processing pipeline for both text and images, and then the actual model itself. So we set the model ID, and we initialize both of those.

Then, what we want to do is move the model to a device, if possible. We can use CPU, but if you have a CUDA-enabled GPU, that will be much faster, so I'd recommend using that if you can. If not, you can use CPU; it will be a bit slower, but it will still run within a bearable timeframe. If I'm running this on my Mac, I am using CPU, though you can actually run this on MPS as well, so you could change your device to MPS if you have an MPS-enabled Apple Silicon device.
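A minimal sketch of that setup, assuming the ViT-B/32 checkpoint openai/clip-vit-base-patch32:

```python
# Initialize CLIP via Hugging Face Transformers and move it to the best
# available device.
import torch
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-base-patch32"

processor = CLIPProcessor.from_pretrained(model_id)  # text + image preprocessing
model = CLIPModel.from_pretrained(model_id)

# prefer a CUDA GPU; on Apple Silicon you could use "mps" instead
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
```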

So, now, returning to that process where we're going through each window within the image, we're just going to add a little bit more logic, so we are processing like we were before. There's nothing different here. We're creating that big patch, and then what we do is process that big patch and process a text label, okay?

So, at the moment, we're looking for a fluffy cat within this image, so that is how we do this. We're returning PyTorch tensors. We also add padding here for the text, although in this case I don't think we need it because we only have a single text item, but we include it for when we're using multiple text items later, and then we calculate and retrieve the similarity score between them, okay?

So, we pass both text and images through this processor, which puts both into our inputs here, and then we extract the logits for each image, and item just converts that single-value tensor into a plain Python number. And then here, we have those scores, so what we're doing is creating what I earlier called the relevance map or localization map across the whole image.

So, for every window that we go through, we're adding this score to every single little patch within that window, and what we're going to find when we do that is that some patches will naturally have a higher score than others, because they are viewed more times, right?

So, if you think about the top-left patch in the image, that's only going to be viewed once, whereas patches in the middle are going to be viewed many times because we'll have a sliding window going over there multiple times. So, what we also need to do is identify the number of runs that we perform or number of calculations that we perform within each one of those patches.

The reason we do that is so that we can take the average for each score, based on the number of times that score has been calculated: here we're taking the total of all those scores, and then we take the average like so. Now, the scores tensor is going to have a very smooth gradient of values from zero, completely irrelevant, to one.
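Putting that loop together, a sketch might look like the following. It builds on the patches, processor, model, and device from earlier; converting the stitched window to a uint8 array before handing it to the processor is an implementation detail assumed here to keep the preprocessing unambiguous.

```python
import numpy as np
import torch

window = 6   # window size, measured in patches
stride = 1   # how many patches the window shifts each step

scores = torch.zeros(patches.shape[1], patches.shape[2])
runs = torch.zeros(patches.shape[1], patches.shape[2])

for Y in range(0, patches.shape[1] - window + 1, stride):
    for X in range(0, patches.shape[2] - window + 1, stride):
        # stitch the window of 256x256 patches into one big patch (H x W x 3)
        big_patch = torch.zeros(patch * window, patch * window, 3)
        for y in range(window):
            for x in range(window):
                big_patch[
                    y * patch:(y + 1) * patch, x * patch:(x + 1) * patch, :
                ] = patches[0, Y + y, X + x].permute(1, 2, 0)
        # preprocess the window and the prompt, then score with CLIP
        image_arr = (big_patch.numpy() * 255).astype(np.uint8)
        inputs = processor(
            images=image_arr, text="a fluffy cat",
            return_tensors="pt", padding=True,
        ).to(device)
        score = model(**inputs).logits_per_image.item()
        # add the score to every small patch covered by this window
        scores[Y:Y + window, X:X + window] += score
        runs[Y:Y + window, X:X + window] += 1

# average each patch's accumulated score by the number of windows that saw it
scores /= runs.clamp(min=1)  # clamp guards against any uncovered patches
```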

Now, if you consider that we've been going over these scores multiple times, it means that the object of interest gradually fades out of the window over multiple steps. So the similarity score fades out quite gradually as you move away from the object, which means you don't get very good localization if you use these scores directly.

So, what we need to do is actually clip the lowest scores down to zero. To do that, we calculate the average of the scores across the whole image and subtract that average from the current scores. That will push roughly half of the scores below zero, and then we clip those scores.

So, anything below zero becomes zero, and we can do this multiple times. Okay, one time is usually enough, but you can do it multiple times to increase that effect of making the edge of this detected or localized area better defined. And then after you've done that, what we need to do is normalize those scores.

We might have to do the clipping a few times, and even after that everything is probably going to sit within a range like zero to 0.5, or zero to 0.2. So, we then normalize those scores to bring them back within the range of zero to one.
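As a rough sketch, working on the averaged scores tensor from the loop above, the clipping and normalization might look like this:

```python
# Subtract the mean (pushing roughly half the scores below zero), clip at
# zero, then min-max normalize back into the range [0, 1].
for _ in range(1):  # one pass is usually enough; more passes sharpen the edges
    scores = torch.clamp(scores - scores.mean(), min=0.0)

scores = (scores - scores.min()) / (scores.max() - scores.min())
```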

Now, to apply these scores to the patches, we need to align their tensors, because right now they are not aligned. For the scores, we have a 20 by 13 tensor, but for the patches, we still have the batch dimension, then the 20 by 13 that we do want, and then the three color channels and the 256 by 256 pixels of each patch.

So, we need to adjust that a little bit. So, we need to first remove the batch dimension. We do that by squeezing out the zero dimension, which is our batch dimension. And then we permute the different dimensions, essentially just moving them around in our patches in order to align them better with the score tensor dimensions.

And then all we do is multiply the patches by those scores. That's pretty straightforward. Then we have to permute them again, because if we want to visualize everything, it needs to be within a certain shape in order for us to visualize it in Matplotlib. So, we come down and first thing we do is just get Y and X here.

So, Y and X are the image dimensions in patches. See here, this is Y, the height of the image in patches, and then 13, which is the width of the image in patches. And we come down here and we can plot this. Okay, and we get this pretty nice visual which localizes the fluffy cat within that image.
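Concretely, a sketch of those alignment and plotting steps might look like this: squeeze out the batch dimension, permute so the patch grid broadcasts against the 20-by-13 scores, multiply, permute back, and draw each weighted patch in its grid position.

```python
import matplotlib.pyplot as plt

# [1, 20, 13, 3, 256, 256] -> [20, 13, 3, 256, 256]
adj_patches = patches.squeeze(0)
# move the grid dims to the end so they broadcast against the [20, 13] scores
adj_patches = adj_patches.permute(3, 4, 2, 0, 1) * scores
# move the grid dims back to the front for plotting
adj_patches = adj_patches.permute(3, 4, 2, 0, 1)

Y, X = adj_patches.shape[:2]  # image height and width measured in patches
fig, ax = plt.subplots(Y, X, figsize=(X * 0.5, Y * 0.5))
for y in range(Y):
    for x in range(X):
        ax[y, x].imshow(adj_patches[y, x].permute(1, 2, 0).numpy())  # CHW -> HWC
        ax[y, x].axis("off")
plt.subplots_adjust(wspace=0, hspace=0)
plt.show()
```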

Now, what's really interesting is if we just search for a cat, we actually get a slightly different localization, because here you can see it's kind of focusing a lot on the fluffy part of the cat. So, if we just search for a cat, it would actually focus more on the head.

So, we can really add nuanced information to these prompts and get a pretty nuanced response back. Now, we can do the same for the butterfly. We'll just throw all that code together; this is just what we've done before. We initialize scores and runs, and we go and process all of that.

The only thing we change here is the prompt. We change it to a butterfly. And if we go down, and we're gonna go down and down, and visualize that, we get this, okay? So, again, that's pretty cool. We can see that it is identifying where in the image that butterfly actually is.

So, that is the object localization part done. Now, I want to have a look at object detection, which is essentially just taking the object localization and wrapping some more code around it in order to look for multiple objects rather than just one. But to do that, we can't really visualize things in the same way that we've done here.

We're going to need a different type of visualization, and that's where the bounding boxes come in. So, let's take a look at how we would do that. Using the butterfly example, so the butterfly scores that we just calculated, we're going to look at where those scores are higher than 0.5.

Now, you can adjust this threshold based on what you find works best. So, we do this, and what we'll get is an array of true and false values for where the score was higher than 0.5 and where it was not. Then we detect where the non-zero values are in that array, and what we get is a load of X and Y values here.

So, at position three, two, we know there is a score higher than 0.5, and we get three and two here: three is the row of the non-zero value, and two is the column. So, at row three and column two, there is a non-zero value, in other words a score above our 0.5 threshold.
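A sketch of that thresholding step; 0.5 is just the default threshold used here.

```python
import numpy as np

detection = (scores > 0.5).numpy()   # boolean mask over the patch grid
rows, cols = np.nonzero(detection)   # row/column indices of patches above the threshold
print(rows[:5], cols[:5])
```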

Put all that together, and we get something that looks kind of like this. We can already kind of see the localization visual that we just created. What we want to do is identify the bounding box that surrounds those values, okay? So, in terms of a coordinate system, we want one and three and four and 10 to be included within that.

So, what we do is find the corners from the detection array or set of coordinates that we got before from NP non-zero. And what we do is we just take the minimum X and Y values, and maximum X and Y values, and that will give us the corners of the box.

And that's pretty simple to calculate. Now, when we get the maximum value, what we want to do is because we, basically we're getting the position of the patch and the position of each patch, we're essentially identifying the top left corner of each patch. So, when we're looking at the maximum value, we actually want not the start of the patch, but the end of the patch, okay?

So, that's why we add that plus one here in order to get that. And the same for the X max value as well. So, that gives us the corner coordinates. And then what we do is multiply those corner coordinates by the patch size, which is 256 pixels. And then we have the pixel positions of each one of those corners.

Before we had the patch coordinates, now we have the pixel coordinates, which we can map directly onto the original image. So, we can see the minimum values here: for X and Y, 256 and 768. And because we're going to be using matplotlib patches, note that matplotlib patches expect the top-left corner coordinates plus the width and height of the bounding box you want to create.

So, we calculate the width and height. And that's pretty simple. It's just Y max minus Y min and X max minus X min. And we get these. And what we can do now is take the image. We have to reshape it a little bit. So, we have to move the three color channels dimension from the zero dimension to the final dimension.

So, we just do that here, move axes. And now we can plot that image. Okay, so we show that image with matplotlib. And then we create the rectangle patch. This is our bounding box. Okay, so we pass X min and Y min. That's the top left corner. And then we also pass the width and height of what the bounding box should be.
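Pulled together, the corner calculation and plotting steps might look like this sketch, continuing from the rows and cols found above; the edge color is just a choice.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import patches as mpatches

# corners of the detected region, in patch coordinates
y_min, y_max = rows.min(), rows.max() + 1   # +1 so the box reaches the end of the last patch
x_min, x_max = cols.min(), cols.max() + 1

# convert patch coordinates to pixel coordinates
y_min, y_max = y_min * patch, y_max * patch
x_min, x_max = x_min * patch, x_max * patch

# matplotlib's Rectangle wants the top-left corner plus width and height
width, height = x_max - x_min, y_max - y_min

# move the color channels to the last dimension for display: CHW -> HWC
image_arr = np.moveaxis(img.numpy(), 0, -1)

fig, ax = plt.subplots(figsize=(9, 12))
ax.imshow(image_arr)
ax.add_patch(mpatches.Rectangle(
    (x_min, y_min), width, height,
    linewidth=3, edgecolor="#FAFF00", facecolor="none",
))
plt.show()
```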

And if we come down, we get this visual. Okay, so that's our bounding box visualization. And with that, it's not much further to create our object detection. So, let's have a look at how we do that. Now, the logic for this is pretty much just a loop over what we've already done.

So, I put together a load of functions here, which is essentially just what we've already gone through, getting patches, getting the scores, getting the box. And then the one thing that is new here is this detect function. Okay, so we have detect. That's gonna get the patches. So, it's gonna take an image and it's gonna split it into those patches that we created.

We're going to convert the image into a format for displaying with matplotlib, as we did before. We also initialize the plot and add our image to it. And then we have a for loop, and this for loop goes through the image localization steps and bounding box steps that we just went through, multiple times.

Okay, so we have multiple prompts and we want to do this once per prompt. So, we calculate our similarity scores for all of our image patches based on a specific prompt. From that, we get our scores in that patch tensor format we saw before. And then what we do is get the box based on a particular threshold.

So, 0.5, like we used before; you can see it up there. We also have our patch size, which we pass in for the conversion from patch coordinates to pixel coordinates.

And then we also have our scores. That will return the minimum X and Y coordinates and the width and height of the box. We create the bounding box, and then we add it to the axis, okay? So, now let's visualize all of this and see what we get. Here I've used a slightly smaller window size than the six we used before, just to point out that you can change this.
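Here is a sketch of that detection loop. The helper names get_patches, get_scores, and get_box are illustrative stand-ins for the code we have already written: splitting the image into patches, scoring every window against a prompt, and turning the thresholded score map into x, y, width, and height.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import patches as mpatches

def detect(prompts, img, patch=256, window=4, stride=1, threshold=0.5):
    # split the image into a grid of patches: [1, rows, cols, 3, patch, patch]
    patches = get_patches(img, patch)
    # convert the image to HWC for display and set up the plot
    image_arr = np.moveaxis(img.numpy(), 0, -1)
    fig, ax = plt.subplots(figsize=(9, 12))
    ax.imshow(image_arr)
    for prompt in prompts:
        # one localization pass per prompt, then one bounding box per prompt
        scores = get_scores(patches, prompt, window=window, stride=stride)
        x, y, width, height = get_box(scores, patch=patch, threshold=threshold)
        ax.add_patch(mpatches.Rectangle(
            (x, y), width, height,
            linewidth=3, edgecolor="#FAFF00", facecolor="none",
        ))
    plt.show()

# using a smaller window of four patches this time
detect(["a cat", "a butterfly"], img, window=4)
```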

And depending on your image, it may be better to use a smaller or larger window. You can see what we're doing here: we've got a cat and a butterfly, and we get the butterfly here and the cat here, okay?

That's pretty cool. And like I said, with CLIP, we can apply this object detection without fine-tuning. All we need to do is change these prompts here, okay? So, it's really straightforward to modify this and move it to a new domain. Okay, so that's it for this walkthrough of object localization and object detection with CLIP.

As I said, I think zero-shot object localization, detection, and even classification opens the door to a lot of projects and use cases that were just not accessible before because of time and capital constraints. Now we can just use CLIP and get pretty impressive results very quickly. All it requires is a bit of code changing here and there.

Now, I think CLIP is one part of a trend in multi-modality that is creating ML that is more accessible, less brittle than models in the past, which required a lot of fine-tuning just to adapt to a slightly different domain, and more generally applicable, which I think is really exciting.

And it's really cool to see this sort of thing actually being used, and to actually use it yourself and see how easy it is to apply CLIP to so many different use cases, and for it to work incredibly easily. So that's it for this video. I hope it has been useful.

So thank you very much for watching and I will see you again in the next one. Bye.