Fast Zero Shot Object Detection with OpenAI CLIP
Chapters
0:00 Early Progress in Computer Vision
2:03 Classification vs. Localization and Detection
3:55 Zero Shot with OpenAI CLIP
5:23 Zero Shot Object Localization with OpenAI CLIP
6:40 Localization with Occlusion Algorithm
7:44 Zero Shot Object Detection with OpenAI CLIP
8:34 Data Preprocessing for CLIP
13:55 Initializing OpenAI CLIP in Python
17:05 Clipping the Localization Visual
18:32 Applying Scores for Visual
20:25 Object Localization with New Prompt
20:52 Zero Shot Object Detection in Python
21:20 Creating Bounding Boxes with Matplotlib
25:15 Object Detection Code
27:11 Object Detection Results
28:29 Trends in Multi-Modal ML
00:00:00.000 |
The ImageNet Large Scale Visual Recognition Challenge 00:00:11.680 |
During this time, the competition acted as the place to go 00:00:16.680 |
if you needed to find what the current state of the art 00:00:40.320 |
but there was an unquestioned assumption causing problems. 00:00:52.920 |
and a lot of data required a lot of capital and time. 00:00:56.040 |
It wasn't until recently that this assumption 00:01:03.120 |
The astonishing rise of what are called multimodal models 00:01:13.160 |
very possible across various domains and tasks. 00:01:17.040 |
One of those is called zero-shot object detection 00:01:28.520 |
without ever fine-tuning it on data from that new domain. 00:01:39.600 |
a classification in one particular area on one dataset, 00:01:43.320 |
and we can take that same model without any fine-tuning, 00:01:59.120 |
for zero-shot object detection and localization. 00:02:03.480 |
Let's begin with taking a quick look at image classification. 00:02:07.480 |
Now, image classification can kind of be seen 00:02:10.400 |
as one of the simplest tasks in visual recognition. 00:02:13.240 |
And it's also the first step on the way to object detection. 00:02:17.640 |
At its core, it's just assigning a categorical label 00:02:33.920 |
of where in the image the specific object actually is. 00:02:42.080 |
Now, doing that, we're essentially just going 00:02:47.400 |
the typical approach to this is return an image 00:02:52.440 |
surrounding the object that you are looking for. 00:03:00.880 |
With detection, we are localizing multiple objects 00:03:08.800 |
to identify multiple objects within the image. 00:03:20.200 |
In the case of us having multiple dogs in this image 00:03:25.720 |
we would also expect the object detection algorithm 00:03:29.200 |
to actually identify each one of those independently. 00:03:32.680 |
Now, in the past, if we wanted to switch a model 00:03:49.520 |
like OpenAI's CLIP for performing each one of these tasks 00:04:09.520 |
text and image pairs that have a similar meaning 00:04:12.600 |
and placing them within a similar vector space. 00:04:16.040 |
Every text and every image gets converted into a vector 00:04:19.760 |
and they are placed in a shared vector space. 00:04:38.960 |
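A rough sketch of that shared-space idea with Hugging Face transformers; the model ID and the image variable here are assumptions for illustration:

    import torch
    from transformers import CLIPProcessor, CLIPModel

    model_id = "openai/clip-vit-base-patch32"  # assumed model ID
    processor = CLIPProcessor.from_pretrained(model_id)
    model = CLIPModel.from_pretrained(model_id)

    # embed one caption and one PIL image (image is assumed to exist)
    inputs = processor(text=["a photo of a cat"], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)

    # both embeddings live in the same vector space, so their
    # cosine similarity measures how well the text matches the image
    sim = torch.cosine_similarity(out.text_embeds, out.image_embeds)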
And we can even adjust the task being performed 00:04:45.120 |
We don't actually have to adjust the model itself. 00:05:03.880 |
and we just identify within that vector space 00:05:41.040 |
We then pass a window over all of those patches, 00:05:55.360 |
between each one of those windows embedded by CLIP 00:06:00.800 |
returning a similarity score for every single patch. 00:06:07.160 |
we use that to create almost like a map of relevance 00:06:14.120 |
to identify the location of the object of interest. 00:06:21.040 |
So most of the image will be very dark, almost black. 00:06:24.680 |
That means the object of interest is not in that space. 00:06:35.640 |
Both of these visuals are capturing the same information, 00:06:48.320 |
So Federico Bianchi from Stanford's NLP group, 00:06:54.520 |
and both of those have worked on an Italian CLIP project. 00:06:59.320 |
And part of that was performing object localization. 00:07:03.880 |
Now, to do that, they use a slightly different approach 00:07:09.680 |
And we can think of it as almost like the opposite. 00:07:12.320 |
So whereas we slide a window over the whole image, 00:07:15.920 |
they slide a black patch over the whole image, 00:07:25.200 |
And essentially, as you slide the patch over the image, 00:07:46.080 |
which is like the last level in these three tasks, 00:07:52.440 |
Now, there's a very fine line between object localization 00:07:57.240 |
but you can simply think of it as localization 00:08:16.720 |
but then we're putting both of those together 00:08:20.860 |
and we're producing this object detection process. 00:08:23.820 |
Now, we've covered the idea behind image classification 00:08:28.200 |
onto object localization and object detection. 00:08:34.480 |
Now, before we move on to any classification, 00:08:52.560 |
which we can pip install with pip install datasets, 00:09:07.840 |
One of those is the image you've already seen, 00:09:10.720 |
the cat with a butterfly landing on its nose, 00:09:19.340 |
it's this image here that we're gonna be using, 00:09:22.540 |
and what we want to do is not use the image file itself, 00:09:28.020 |
'cause at the moment it's a PIL Python image object, 00:09:32.680 |
but instead we need to convert it into a tensor. 00:09:38.760 |
so what I'm going to do here is we're going to just 00:09:46.320 |
which is a typical pipeline tool in computer vision, 00:09:52.120 |
And then we process our image through that pipeline, 00:10:08.880 |
red, green, and blue, that make up the image. 00:10:20.840 |
and two, we need to process it through a PyTorch model, 00:10:25.640 |
and we also need the batch dimension for that. 00:10:31.480 |
It's just a single image, so we just have one in there, 00:11:08.120 |
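A minimal sketch of that preprocessing step, assuming the PIL image sits in a variable called image (the names here are illustrative):

    import torch
    import torchvision.transforms as T

    # convert the PIL image into a (channels, height, width) float tensor
    img_tensor = T.ToTensor()(image)       # shape (3, H, W)

    # add the batch dimension PyTorch models expect: (batch, C, H, W)
    img_tensor = img_tensor.unsqueeze(0)   # shape (1, 3, H, W)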
So, now we have all these kind of like slivers of the image. 00:11:12.800 |
That's just a vertical component of each patch, 00:11:26.880 |
Now, if we visualize that, we get our full patches, 00:11:39.840 |
it doesn't tell us anything about the image, right? 00:11:48.800 |
If CLIP is processing a single patch at a time, 00:11:54.560 |
Maybe it could tell us that there's some hair in this patch 00:12:00.120 |
but beyond that, it's not going to be very useful. 00:12:02.760 |
So, rather than feeding single patches into CLIP, 00:12:06.000 |
what we do is actually feed a window of six by six patches, 00:12:13.160 |
and that just gives us a big patch to pass over to CLIP. 00:12:17.520 |
Now, the reason that we don't just do that from the start, 00:12:19.920 |
we don't just create these bigger patches to begin with, 00:12:23.000 |
is because when we're sliding through the image, 00:12:25.240 |
we want to have some degree of overlap between each patch. 00:12:31.000 |
and then what we can do is actually slide across 00:12:35.240 |
and we define that using the stride variable. 00:12:46.760 |
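Here is a sketch of that patching setup using torch.Tensor.unfold; the 256-pixel patch size matches the tensor shapes discussed later, but is an assumption:

    patch = 256   # pixel size of each small patch (assumed)
    window = 6    # window size, measured in patches
    stride = 1    # step size, measured in patches

    # cut the (1, 3, H, W) image into strips, then into a grid of square patches
    patches = img_tensor.unfold(2, patch, patch)   # (1, 3, Y, W, patch)
    patches = patches.unfold(3, patch, patch)      # (1, 3, Y, X, patch, patch)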
This is our code for going through the whole image, 00:12:52.880 |
So, we go for Y, and then we go through the whole Y-axis, 00:12:57.360 |
and then within that, we're going across left to right 00:13:00.040 |
with each step, and we initialize an empty big patch array, 00:13:10.360 |
let's say we start at zero, zero, X zero, Y zero. 00:13:14.400 |
We go from zero to six, and zero to six here, right? 00:13:30.080 |
As Y and X are increasing, we're moving through that image, 00:13:34.840 |
and we're seeing each big patch from our image, okay? 00:13:39.080 |
Sliding across with a single small little patch at a time 00:13:42.840 |
so that we don't miss any important information. 00:13:46.240 |
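As a sketch, that double loop looks something like this; scoring each big patch with CLIP is filled in a little further below:

    Y_patches, X_patches = patches.shape[2], patches.shape[3]

    for Y in range(0, Y_patches - window + 1, stride):
        for X in range(0, X_patches - window + 1, stride):
            # stitch a window-by-window block of small patches into one big patch
            big_patch = torch.zeros(patch * window, patch * window, 3)
            for y in range(window):
                for x in range(window):
                    big_patch[y * patch:(y + 1) * patch,
                              x * patch:(x + 1) * patch] = patches[0, :, Y + y, X + x].permute(1, 2, 0)
            # ...pass big_patch through CLIP here...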
Now, this is how we're gonna run through the whole image, 00:13:50.080 |
but before we do that, we actually need CLIP, 00:13:52.680 |
so let's go ahead and actually initialize CLIP. 00:14:08.360 |
for both text and images, and then the actual model itself, 00:14:12.720 |
okay, so we set model ID, and we initialize both of those. 00:14:22.800 |
So, we can use CPU, but if you have a CUDA-enabled GPU, 00:14:26.960 |
that will be much faster, so I'd recommend doing that. 00:14:35.320 |
but we'll still run within a bearable timeframe, 00:14:42.480 |
I am using CPU, you can actually run this on MPS as well, 00:14:49.720 |
if you have an MPS-enabled Apple Silicon device. 00:14:57.000 |
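A sketch of that initialization with Hugging Face transformers; the model ID is an assumed but common choice:

    import torch
    from transformers import CLIPProcessor, CLIPModel

    model_id = "openai/clip-vit-base-patch32"  # assumed model ID

    processor = CLIPProcessor.from_pretrained(model_id)
    model = CLIPModel.from_pretrained(model_id)

    # prefer a CUDA GPU if available; Apple Silicon users can try "mps"
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)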
where we're going through each window within the image, 00:15:00.400 |
we're just going to add a little bit more logic, 00:15:09.080 |
and then what we do is process that big patch 00:15:14.240 |
So, at the moment, we're looking for a fluffy cat 00:15:16.360 |
within this image, so that is how we do this. 00:15:21.240 |
It turns out we also add padding here as well for the text, 00:15:25.480 |
although, in this case, I don't think we need it 00:15:40.080 |
So, if we pass both text and images through this processor, 00:15:51.360 |
and .item() just converts that into a plain Python value 00:16:05.400 |
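A sketch of that scoring step, together with the running scores and runs tensors described next (so the averaging below has something concrete to refer to):

    # once, before the loop: one score accumulator and one run counter per patch
    scores = torch.zeros(Y_patches, X_patches)
    runs = torch.ones(Y_patches, X_patches)

    # inside the double loop, for each big_patch:
    inputs = processor(images=big_patch, text="a fluffy cat",
                       return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        score = model(**inputs).logits_per_image.item()

    # add the score to every patch this window covers, and count the visit
    scores[Y:Y + window, X:X + window] += score
    runs[Y:Y + window, X:X + window] += 1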
what I earlier called, like, the relevance map 00:16:08.320 |
or localization map throughout the whole image. 00:16:14.280 |
we're adding this score to every single patch 00:16:31.400 |
So, if you think about the top-left patch in the image, 00:16:43.600 |
is identify the number of runs that we perform 00:16:52.400 |
The reason we do that is so that we can take the average 00:16:59.480 |
because here, we're taking the total of all those scores, 00:17:27.840 |
quite gradually fades out as you go away from the object, 00:17:45.000 |
the average of scores across the whole image. 00:17:48.000 |
We subtract that average from the current scores. 00:17:51.960 |
What that will do is push roughly half of the scores below zero, 00:18:10.320 |
of this detected or localized area better defined. 00:18:15.600 |
what we need to do is normalize those scores. 00:18:19.160 |
Okay, so we might have to do this a few times, 00:18:30.160 |
to bring them back within the range of zero to one. 00:18:42.200 |
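Put together, that post-processing is roughly:

    scores = scores / runs               # average score per patch
    scores = scores - scores.mean()      # push roughly half of the scores below zero
    scores = torch.clamp(scores, min=0)  # clip the negatives to zero

    # min-max normalize back into the zero-to-one range
    scores = (scores - scores.min()) / (scores.max() - scores.min())

As noted above, the mean-subtract and clip steps can be repeated a few times to sharpen the localized region.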
Okay, for the scores, we have like a 20 by 13 tensor, 00:18:47.200 |
but for the patches, we have the batch dimension there, 00:18:54.800 |
and the 256 by 256 for each set of pixels 00:19:00.800 |
So, we need to first remove the batch dimension. 00:19:03.080 |
We do that by squeezing out the zero dimension, 00:19:08.040 |
And then we permute the different dimensions, 00:19:11.360 |
essentially just moving them around in our patches 00:19:18.520 |
And then all we do is multiply the patches by those scores. 00:19:29.960 |
in order for us to visualize it in Matplotlib. 00:19:41.880 |
See here, this is Y, so the height of the image in patches, 00:19:45.680 |
and then 13, which is the width of the image in patches. 00:19:55.080 |
which localizes the fluffy cat within that image. 00:20:03.600 |
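In code, that reshaping and multiplication looks roughly like this, given the (1, 3, Y, X, 256, 256) patches tensor from earlier:

    # squeeze out the batch dimension, then move color channels to the end
    adj_patches = patches.squeeze(0).permute(1, 2, 3, 4, 0)  # (Y, X, 256, 256, 3)

    # scale every patch by its score, broadcasting over the pixel dimensions
    adj_patches = adj_patches * scores[:, :, None, None, None]

    # show the scored patches as a grid to get the localization visual
    import matplotlib.pyplot as plt
    rows, cols = adj_patches.shape[:2]
    fig, ax = plt.subplots(rows, cols, figsize=(cols, rows))
    for y in range(rows):
        for x in range(cols):
            ax[y, x].imshow(adj_patches[y, x].numpy())
            ax[y, x].axis("off")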
we actually get a slightly different localization, 00:20:06.160 |
because here you can see it's kind of focusing a lot 00:20:14.960 |
So, we can really add nuance information to these prompts 00:20:38.440 |
And if we go down, and we're gonna go down and down, 00:20:47.000 |
We can see that it is identifying where in the image 00:20:56.560 |
Now, I want to have a look at object detection, 00:20:59.000 |
which is essentially just taking the object localization 00:21:14.720 |
We're going to need a different type of visualization, 00:21:20.280 |
So, let's take a look at how we would do that. 00:21:23.280 |
So, using the, I think the butterfly example, 00:21:26.960 |
so, the butterfly scores that we just calculated, 00:21:43.640 |
as to where the score was higher than 0.5 and not. 00:21:47.480 |
And then we detect where the non-zero values are 00:21:59.400 |
we know that there is a score that is higher than 0.5, 00:22:20.000 |
or a value or score that's higher than 0.5, our threshold. 00:22:28.080 |
So, we already, we kind of see that localization visual 00:22:35.400 |
And what we want to do is identify the bounding box 00:22:39.720 |
that's just kind of surrounding those values, okay? 00:22:43.320 |
So, we know in terms of like a coordinate system, 00:22:52.680 |
from the detection array or set of coordinates 00:23:02.240 |
And what we do is we just take the minimum X and Y values, 00:23:08.880 |
and that will give us the corners of the box. 00:23:21.520 |
basically we're getting the position of the patch 00:23:25.920 |
we're essentially identifying the top left corner 00:23:46.720 |
And then what we do is multiply those corner coordinates 00:24:02.040 |
which we can map directly onto the original image. 00:24:14.600 |
because we're going to be using matplotlib patches, 00:24:17.320 |
matplotlib patches expects the top left corner coordinates 00:24:29.520 |
It's just Y max minus Y min and X max minus X min. 00:24:44.920 |
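As a sketch, the whole box computation from the scores tensor:

    import numpy as np

    detection = scores > 0.5                      # threshold the relevance map
    y_idx, x_idx = np.nonzero(detection.numpy())  # patch coordinates above 0.5

    # box corners in patch coordinates (+1 so the max edge is inclusive)
    y_min, y_max = y_idx.min(), y_idx.max() + 1
    x_min, x_max = x_idx.min(), x_idx.max() + 1

    # convert patch coordinates into pixel coordinates
    y_min, y_max = y_min * patch, y_max * patch
    x_min, x_max = x_min * patch, x_max * patch

    # matplotlib wants the top-left corner plus a width and a height
    width = x_max - x_min
    height = y_max - y_min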
So, we have to move the three color channels dimension 00:24:48.080 |
from the zero dimension to the final dimension. 00:25:15.840 |
Okay, so that's our bounding box visualization. 00:25:27.400 |
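Concretely, that visualization is something like:

    import matplotlib.pyplot as plt
    from matplotlib import patches as mpatches  # aliased to avoid clashing with our patches tensor

    # move the three color channels from the zero dimension to the final dimension
    image_arr = img_tensor.squeeze(0).permute(1, 2, 0).numpy()

    fig, ax = plt.subplots(figsize=(10, 6))
    ax.imshow(image_arr)
    ax.add_patch(mpatches.Rectangle((x_min, y_min), width, height,
                                    linewidth=3, edgecolor="#FAFF00",
                                    facecolor="none"))
    plt.show()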
Now, the logic for this is pretty much just a loop 00:25:35.400 |
which is essentially just what we've already gone through, 00:25:37.640 |
getting patches, getting the scores, getting the box. 00:25:52.720 |
and it's gonna split it into those patches that we created. 00:26:08.600 |
And this for loop goes through the image localization steps 00:26:12.200 |
and bounding box steps that we just went through, 00:26:21.760 |
based on a specific prompt for all of our image patches. 00:26:29.160 |
in that patch tensor format that we saw before. 00:26:33.080 |
And then what we do is we want to get the box 00:26:43.840 |
for the calculation, or rather for the conversion. 00:26:59.600 |
And that will return the minimum X and Y coordinates 00:27:10.120 |
So, now let's visualize all of this, see what we get. 00:27:14.280 |
So, here I've used a slightly smaller window size 00:27:16.680 |
than the six we used before, just to point out that you can change this. 00:27:22.280 |
it may be better to use a smaller or larger window. 00:27:32.120 |
And you can see that we get a butterfly here 00:27:41.560 |
we can apply this object detection without fine tuning. 00:27:45.720 |
All we need to do is change these prompts here, okay? 00:27:49.640 |
So, it's really straightforward to modify this 00:27:59.960 |
of object localization and object detection with CLIP. 00:28:04.800 |
As I said, I think zero-shot object localization, 00:28:08.280 |
detection, and even classification opens the doors 00:28:22.640 |
and get pretty impressive results very quickly. 00:28:26.280 |
All it requires is a bit of code changing here and there. 00:28:50.040 |
And it's really cool to see this sort of thing