
Fast Zero Shot Object Detection with OpenAI CLIP


Chapters

0:00 Early Progress in Computer Vision
2:03 Classification vs. Localization and Detection
3:55 Zero Shot with OpenAI CLIP
5:23 Zero Shot Object Localization with OpenAI CLIP
6:40 Localization with Occlusion Algorithm
7:44 Zero Shot Object Detection with OpenAI CLIP
8:34 Data Preprocessing for CLIP
13:55 Initializing OpenAI CLIP in Python
17:05 Clipping the Localization Visual
18:32 Applying Scores for Visual
20:25 Object Localization with New Prompt
20:52 Zero Shot Object Detection in Python
21:20 Creating Bounding Boxes with Matplotlib
25:15 Object Detection Code
27:11 Object Detection Results
28:29 Trends in Multi-Modal ML

Whisper Transcript

00:00:00.000 | The ImageNet Large Scale Visual Recognition Challenge
00:00:03.720 | was a world-changing competition
00:00:07.560 | that ran from around 2010 to 2017.
00:00:11.680 | During this time, the competition acted as the place to go
00:00:16.680 | if you needed to find what the current state of the art
00:00:20.640 | was in image classification,
00:00:23.240 | object localization, object detection.
00:00:25.840 | As well as that, from 2012 onwards,
00:00:28.560 | it really acted as the catalyst
00:00:31.160 | of the explosion in deep learning.
00:00:34.120 | Researchers fine-tuned better performing
00:00:37.600 | computer vision models year on year,
00:00:40.320 | but there was an unquestioned assumption causing problems.
00:00:44.760 | It was assumed that every new task
00:00:48.200 | required model fine-tuning.
00:00:50.800 | This required a lot of data,
00:00:52.920 | and a lot of data required a lot of capital and time.
00:00:56.040 | It wasn't until recently that this assumption
00:00:59.800 | was challenged and proven wrong.
00:01:03.120 | The astonishing rise of what are called multimodal models
00:01:08.120 | has made what was thought impossible
00:01:13.160 | very possible across various domains and tasks.
00:01:17.040 | One of those is called zero-shot object detection
00:01:21.440 | and localization.
00:01:23.240 | Now, zero-shot refers to taking a model
00:01:26.560 | and applying it to a new domain
00:01:28.520 | without ever fine-tuning it on data from that new domain.
00:01:33.400 | So that means we can take a model,
00:01:35.880 | maybe one that works in one domain,
00:01:39.600 | say classification in one particular area on one dataset,
00:01:43.320 | and we can take that same model without any fine-tuning,
00:01:46.080 | and we can use it for object detection
00:01:49.520 | in a completely different domain.
00:01:51.320 | Without that model seeing any training data
00:01:53.560 | from that new domain.
00:01:54.760 | So in this video, we're going to explore
00:01:57.040 | how to use OpenAI's CLIP
00:01:59.120 | for zero-shot object detection and localization.
00:02:03.480 | Let's begin with taking a quick look at image classification.
00:02:07.480 | Now, image classification can kind of be seen
00:02:10.400 | as one of the simplest tasks in visual recognition.
00:02:13.240 | And it's also the first step on the way to object detection.
00:02:17.640 | At its core, it's just assigning a categorical label
00:02:21.160 | to an image.
00:02:22.360 | Now, moving on from image classification,
00:02:24.400 | we have object localization.
00:02:27.160 | Object localization is image classification
00:02:30.880 | followed by the identification
00:02:33.920 | of where in the image the specific object actually is.
00:02:38.480 | So we are localizing the object.
00:02:42.080 | Now, doing that, we're essentially just going
00:02:43.920 | to identify the coordinates on the image,
00:02:45.920 | and we're going to return,
00:02:50.440 | the typical approach to this is to return an image
00:02:50.440 | where you have like a bounding box
00:02:52.440 | surrounding the object that you are looking for.
00:02:55.880 | And then we take this one step further
00:02:57.960 | to perform object detection.
00:03:00.880 | With detection, we are localizing multiple objects
00:03:05.320 | within the image, or we have the capability
00:03:08.800 | to identify multiple objects within the image.
00:03:11.560 | So in this example, we have a cat and a dog.
00:03:14.240 | We would expect with object detection
00:03:16.920 | to identify both the cat and the dog.
00:03:20.200 | In the case of us having multiple dogs in this image
00:03:24.240 | or multiple cats in this image,
00:03:25.720 | we would also expect the object detection algorithm
00:03:29.200 | to actually identify each one of those independently.
00:03:32.680 | Now, in the past, if we wanted to switch a model
00:03:36.240 | between any one of these tasks,
00:03:37.480 | we'd have to fine tune it on more data.
00:03:39.920 | If we wanted to switch it to another domain,
00:03:41.840 | we would have to also fine tune it
00:03:43.960 | on new data from that domain.
00:03:46.280 | But that's not always the case with models
00:03:49.520 | like OpenAI's CLIP, which can perform each one of these tasks
00:03:53.480 | in a zero-shot setting.
00:03:55.440 | Now, OpenAI's CLIP is a multi-modal model
00:04:00.360 | that has been pre-trained on a huge number
00:04:02.960 | of text and image pairs.
00:04:05.120 | And it essentially works by identifying
00:04:09.520 | text and image pairs that have a similar meaning
00:04:12.600 | and placing them within a similar vector space.
00:04:16.040 | Every text and every image gets converted into a vector
00:04:19.760 | and they are placed in a shared vector space.
00:04:22.720 | And the vectors that appear close together,
00:04:25.160 | they have a similar meaning.
00:04:26.480 | Now, CLIP's very broad pre-training
00:04:29.360 | means that it can perform very effectively
00:04:32.520 | across a lot of different domains.
00:04:34.280 | It's seen a lot of data,
00:04:35.240 | and so it has a good understanding
00:04:37.160 | of all these different things.
00:04:38.960 | And we can even adjust the task being performed
00:04:43.120 | with just a few code changes.
00:04:45.120 | We don't actually have to adjust the model itself.
00:04:47.680 | We just adjust the code around the model.
00:04:50.040 | And that's very much thanks to CLIP's focus
00:04:53.680 | on sort of comparing these vectors.
00:04:56.680 | So for example, for classification,
00:04:58.920 | we give CLIP a list of our class labels,
00:05:02.320 | and then we pass in images,
00:05:03.880 | and we just identify within that vector space
00:05:06.320 | where those images are
00:05:08.280 | with respect to those class label vectors,
00:05:11.040 | and which class label is most similar
00:05:14.920 | to our particular image.
00:05:16.640 | And then that is our prediction.
00:05:18.720 | So that most similar class label,
00:05:21.920 | that's our predicted class.
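
As a rough illustration of that classification flow, here is a minimal sketch using Hugging Face Transformers. The checkpoint id, the labels, and the example image are assumptions for illustration, not taken from the video.

```python
# Minimal zero-shot classification sketch with CLIP (checkpoint, labels and
# the example image below are illustrative assumptions).
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model_id = "openai/clip-vit-base-patch32"
processor = CLIPProcessor.from_pretrained(model_id)
model = CLIPModel.from_pretrained(model_id)

image = Image.open("cat.jpg")  # any image you want to classify
labels = ["a photo of a cat", "a photo of a dog", "a photo of a butterfly"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # similarity of the image to each label
print(labels[probs.argmax().item()])              # most similar label = predicted class
```
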
00:05:23.440 | Now, for object localization,
00:05:25.760 | we apply a very similar type of logic.
00:05:29.480 | As before, we create a class label,
00:05:31.920 | but unlike before,
00:05:33.200 | we don't feed the entire image into CLIP.
00:05:36.560 | To localize an object,
00:05:37.720 | we have to break the image into patches.
00:05:41.040 | We then pass a window over all of those patches,
00:05:44.560 | moving across the entire image,
00:05:46.760 | left to right, top to bottom.
00:05:48.240 | And we generate an image embedding
00:05:50.880 | for each of those windows.
00:05:52.960 | And then we calculate the similarity
00:05:55.360 | between each one of those windows embedded by CLIP
00:05:58.680 | and the class label embedding,
00:06:00.800 | returning a similarity score for every single patch.
00:06:03.840 | Now, after calculating the similarity score
00:06:05.840 | for every single patch,
00:06:07.160 | we use that to create almost like a map of relevance
00:06:11.200 | across the entire image.
00:06:12.840 | And then we can use that map
00:06:14.120 | to identify the location of the object of interest.
00:06:18.400 | And from that, we will get something
00:06:20.200 | that looks kind of like this.
00:06:21.040 | So most of the image will be very dark and black.
00:06:24.680 | That means the object of interest is not in that space.
00:06:28.000 | And then using that localization map,
00:06:30.680 | we can create a more traditional
00:06:33.080 | bounding box visualization as well.
00:06:35.640 | Both of these visuals are capturing the same information,
00:06:38.240 | we're just displaying it in a different way.
00:06:40.120 | Now, there's also other approaches to this.
00:06:42.600 | So I recently hosted a talk
00:06:45.640 | with two sets of people, actually.
00:06:48.320 | So Federico Bianchi from Stanford's NLP group,
00:06:52.440 | and also Raphael Pisoni.
00:06:54.520 | And both of them have worked on an Italian CLIP project.
00:06:59.320 | And part of that was performing object localization.
00:07:03.880 | Now, to do that, they use a slightly different approach
00:07:07.600 | to what I'm going to demonstrate here.
00:07:09.680 | And we can think of it as almost like the opposite.
00:07:12.320 | So whereas we slide a window over the whole image,
00:07:15.920 | they slide a black patch over the whole image,
00:07:19.800 | which hides what is behind that patch.
00:07:22.960 | And then they feed the image into CLIP.
00:07:25.200 | And essentially, as you slide the patch over the image,
00:07:30.160 | you are hiding a part of the image.
00:07:32.500 | And therefore, if the similarity score drops
00:07:35.760 | when the patch is over a certain area,
00:07:38.160 | you know that the object you're looking for
00:07:40.360 | is probably within that space.
00:07:42.560 | And that's called the occlusion algorithm.
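
For reference, here is a hypothetical sketch of that occlusion idea. The `clip_score` helper stands in for a CLIP text-image similarity call and every size is illustrative, so treat this as a rough outline rather than the CLIP-Italian implementation.

```python
import torch

def occlusion_map(img, prompt, box=256, stride=128):
    """Slide a black square over the image and record how much the CLIP
    similarity score drops at each position (a bigger drop suggests the
    object sits under the occluded region)."""
    _, H, W = img.shape                        # img: (3, H, W) float tensor in [0, 1]
    ys = range(0, H - box + 1, stride)
    xs = range(0, W - box + 1, stride)
    drops = torch.zeros(len(ys), len(xs))
    base = clip_score(img, prompt)             # assumed helper: CLIP similarity score
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            occluded = img.clone()
            occluded[:, y:y+box, x:x+box] = 0.0  # hide this region
            drops[i, j] = base - clip_score(occluded, prompt)
    return drops
```
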
00:07:44.600 | And then moving on to object detection,
00:07:46.080 | which is like the last level in these three tasks,
00:07:49.680 | we will be identifying multiple objects.
00:07:52.440 | Now, there's a very fine line between object localization
00:07:56.000 | and object detection,
00:07:57.240 | but you can simply think of it as localization
00:08:00.180 | for multiple classes and multiple objects.
00:08:02.480 | With our cat and butterfly image,
00:08:04.440 | we will be searching for two objects,
00:08:06.500 | a cat and a butterfly.
00:08:08.260 | And with that, we could draw a bounding box
00:08:11.040 | around both of those objects.
00:08:12.720 | And essentially, what we're doing there
00:08:13.960 | is using localization for a single object,
00:08:16.720 | but then we're putting both of those together
00:08:18.520 | in a loop in our code,
00:08:20.860 | and we're producing this object detection process.
00:08:23.820 | Now, we've covered the idea behind image classification
00:08:28.200 | onto object localization and object detection.
00:08:31.760 | Now, let's have a look
00:08:32.600 | at how we actually implement all of this.
00:08:34.480 | Now, before we move on to any classification,
00:08:36.760 | localization, or detection task,
00:08:39.000 | we need to have some data.
00:08:41.680 | We're gonna use a small demo dataset
00:08:44.000 | called jamescalam image text demo,
00:08:46.940 | and we can download it like this.
00:08:49.760 | So using Hugging Face datasets here,
00:08:52.560 | which we can pip install with pip install datasets,
00:09:00.860 | and this is a dataset, it's very small,
00:09:03.320 | it's 21 text to image pairs, okay?
00:09:07.840 | One of those is the image you've already seen,
00:09:10.720 | the cat with a butterfly landing on its nose,
00:09:14.580 | very curious how they got that photo.
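
A short sketch of that download step; the dataset id, the column name, and the record index of the cat-and-butterfly photo are assumptions here.

```python
# pip install datasets
from datasets import load_dataset

# Dataset id assumed to be "jamescalam/image-text-demo"; adjust if it differs.
data = load_dataset("jamescalam/image-text-demo", split="train")
print(len(data))          # 21 text-image pairs
image = data[2]["image"]  # PIL image; the index of the cat photo is illustrative
```
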
00:09:16.900 | Now, after you've downloaded that dataset,
00:09:19.340 | we extract the image we're gonna be using, this image here,
00:09:22.540 | and what we want to do is not use the image file itself,
00:09:28.020 | 'cause at the moment it's a Pill Python image object,
00:09:32.680 | but instead we need to convert it into a tensor.
00:09:36.600 | Now, we're gonna be using PyTorch later on,
00:09:38.760 | so what I'm going to do here is we're going to just
00:09:42.120 | transform the image into a tensor,
00:09:44.040 | and we use TorchVision transforms,
00:09:46.320 | which is a typical pipeline tool in computer vision,
00:09:49.760 | and we just use ToTensor, okay?
00:09:52.120 | And then we process our image through that pipeline,
00:09:55.880 | and then we can see that we get this, okay?
00:09:57.880 | So, what are these values here?
00:10:00.160 | We have the height of the image in pixels,
00:10:03.400 | the width of the image in pixels,
00:10:05.840 | and then also the three color channels,
00:10:08.880 | red, green, and blue, that make up the image.
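
That conversion looks roughly like this, with the colour channels ending up in the first dimension, followed by height and width.

```python
import torchvision.transforms as T

# Convert the PIL image into a float tensor of shape (3, height, width)
img = T.ToTensor()(image)
print(img.shape)  # torch.Size([3, H, W]): colour channels, then pixel height and width
```
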
00:10:12.380 | Now, we need a slightly different format
00:10:16.280 | when we are processing everything.
00:10:18.400 | One, we need to add those patches,
00:10:20.840 | and two, we need to process it through a PyTorch model,
00:10:25.640 | and we also need the batch dimension for that.
00:10:28.640 | So, the first thing we're gonna do
00:10:30.160 | is add the batch dimension.
00:10:31.480 | It's just a single image, so we just have one in there,
00:10:34.800 | but we need that anyway.
00:10:37.200 | And then we come down to here.
00:10:39.520 | So, this is where we're gonna break
00:10:41.440 | the image into the patches, okay?
00:10:45.200 | Each patch is going to be 256 pixels
00:10:48.000 | in both height and width.
00:10:49.880 | So, the first thing we do here is unfold,
00:10:52.600 | and we get this here.
00:10:54.760 | We get this 256 and this 20.
00:10:57.320 | Now, the 20 is the height of the image
00:11:00.720 | in these 256-pixel patches,
00:11:04.400 | and we can visualize that here, all right?
00:11:08.120 | So, now we have all these kind of like slivers of the image.
00:11:12.800 | That's just a vertical component of each patch,
00:11:15.360 | and we use unfold again,
00:11:19.120 | but this time in a second dimension,
00:11:21.000 | so targeting what was this dimension here,
00:11:24.680 | and we also get another 256.
00:11:26.880 | Now, if we visualize that, we get our full patches,
00:11:30.200 | okay, like this.
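
Here is a sketch of that patching step using `Tensor.unfold`; the patch counts in the comment are from the video's image and will differ for other images.

```python
patch = 256  # patch size in pixels

img = img.unsqueeze(0)                     # add the batch dimension -> (1, 3, H, W)
patches = img.unfold(2, patch, patch)      # unfold the height dimension into 256-px slices
patches = patches.unfold(3, patch, patch)  # then unfold the width dimension
print(patches.shape)  # (1, 3, Y, X, 256, 256), e.g. Y=20 patches tall, X=13 patches wide
```
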
00:11:31.440 | Now, if you just consider this here,
00:11:36.920 | it's like, if we look at this patch here,
00:11:39.840 | it doesn't tell us anything about the image, right?
00:11:43.400 | And even when we're over the cat,
00:11:45.520 | these patches are way too small
00:11:46.920 | to actually tell us anything.
00:11:48.800 | If Clip is processing a single patch at a time,
00:11:52.560 | it's probably not going to tell us anything.
00:11:54.560 | Maybe it could tell us that there's some hair in this patch
00:11:57.880 | or that there's an eye in this patch,
00:12:00.120 | but beyond that, it's not going to be very useful.
00:12:02.760 | So, rather than feeding single patches into Clip,
00:12:06.000 | what we do is actually feed a window of six by six patches,
00:12:10.400 | or we can modify that value if we prefer,
00:12:13.160 | and that just gives us a big patch to pass over to Clip.
00:12:17.520 | Now, the reason that we don't just do that from the start,
00:12:19.920 | we don't just create these bigger patches to begin with,
00:12:23.000 | is because when we're sliding through the image,
00:12:25.240 | we want to have some degree of overlap between each patch.
00:12:29.000 | Okay, so we create these smaller patches,
00:12:31.000 | and then what we can do is actually slide across
00:12:33.120 | just one little patch at a time,
00:12:35.240 | and we define that using the stride variable.
00:12:37.640 | So, if we come down to here,
00:12:40.440 | we have window, we have stride, remove this,
00:12:44.080 | and here we go.
00:12:46.760 | This is our code for going through the whole image,
00:12:49.560 | creating a patch at every time step, okay?
00:12:52.880 | So, we go for Y, and then we go through the whole Y-axis,
00:12:57.360 | and then within that, we're going across left to right
00:13:00.040 | with each step, and we initialize an empty big patch array,
00:13:04.440 | so this is our, like, the full window.
00:13:07.240 | We get the current patch, so, okay,
00:13:10.360 | let's say we start at zero, zero, X zero, Y zero.
00:13:14.400 | We go from zero to six, and zero to six here, right?
00:13:19.280 | So, that gives us the very top left corner
00:13:23.040 | or window of the image,
00:13:24.760 | and then we're literally going through
00:13:26.240 | and just go processing all of that,
00:13:28.720 | and you can see that happening here.
00:13:30.080 | As Y and X are increasing, we're moving through that image,
00:13:34.840 | and we're seeing each big patch from our image, okay?
00:13:39.080 | Sliding across with a single small little patch at a time
00:13:42.840 | so that we don't miss any important information.
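
Put into code, the window loop described above looks something like this rough reconstruction: a 6x6 window of patches moved one small patch at a time; the variable names are illustrative.

```python
import torch

window = 6  # window size, in patches
stride = 1  # move one small patch at a time

Y, X = patches.shape[2], patches.shape[3]   # image height and width, in patches

for y in range(0, Y - window + 1, stride):
    for x in range(0, X - window + 1, stride):
        # initialize an empty "big patch" covering the full 6x6 window
        big_patch = torch.zeros(window * 256, window * 256, 3)
        # stitch the small patches of this window together into one image
        for dy in range(window):
            for dx in range(window):
                big_patch[dy*256:(dy+1)*256, dx*256:(dx+1)*256] = \
                    patches[0, :, y+dy, x+dx].permute(1, 2, 0)
        # big_patch is the window that gets shown to CLIP for this (y, x) position
```
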
00:13:46.240 | Now, this is how we're gonna run through the whole image,
00:13:50.080 | but before we do that, we actually need clip,
00:13:52.680 | so let's go ahead and actually initialize clip.
00:13:55.640 | So, to do that, all we do is this,
00:13:58.480 | so we're using Hugging Face Transformers,
00:14:00.440 | which is using PyTorch in the back there,
00:14:04.000 | so we need the clip processor,
00:14:05.800 | which is like a pre-processing pipeline
00:14:08.360 | for both text and images, and then the actual model itself,
00:14:12.720 | okay, so we set model ID, and we initialize both of those.
00:14:17.200 | Then, what we want to do is move the model
00:14:19.720 | to a device, if possible, all right?
00:14:22.800 | So, we can use CPU, but if you have a CUDA-enabled GPU,
00:14:26.960 | that will be much faster, so I'd recommend doing that.
00:14:30.960 | If you can, if not, then you can use CPU.
00:14:34.080 | It will be a bit slower,
00:14:35.320 | but we'll still run within a bearable timeframe,
00:14:39.320 | so if I'm running this on my Mac,
00:14:42.480 | I am using CPU, you can actually run this on MPS as well,
00:14:46.360 | so you could change your device to MPS
00:14:49.720 | if you have an MPS-enabled Apple Silicon device.
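
The initialization looks roughly like this; the checkpoint id is assumed to be the standard ViT-B/32 CLIP release.

```python
import torch
from transformers import CLIPProcessor, CLIPModel

model_id = "openai/clip-vit-base-patch32"            # assumed checkpoint
processor = CLIPProcessor.from_pretrained(model_id)  # text + image preprocessing pipeline
model = CLIPModel.from_pretrained(model_id)

# Prefer CUDA, fall back to MPS on Apple Silicon, otherwise use the CPU.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
model = model.to(device)
```
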
00:14:53.880 | So, now, returning to that process
00:14:57.000 | where we're going through each window within the image,
00:15:00.400 | we're just going to add a little bit more logic,
00:15:02.440 | so we are processing like we were before.
00:15:05.200 | There's nothing different here.
00:15:07.120 | We're creating that big patch,
00:15:09.080 | and then what we do is process that big patch
00:15:11.560 | and process a text label, okay?
00:15:14.240 | So, at the moment, we're looking for a fluffy cat
00:15:16.360 | within this image, so that is how we do this.
00:15:19.600 | We're returning PyTorch tensors.
00:15:21.240 | We also add padding here as well for the text,
00:15:25.480 | although, in this case, I don't think we need it
00:15:28.880 | because we only have a single text item,
00:15:31.680 | but we include that when we're using
00:15:33.280 | multiple text items later,
00:15:35.280 | and then we calculate and retrieve
00:15:36.920 | the similarity score between them, okay?
00:15:40.080 | So, if we pass both text and images through this processor,
00:15:43.320 | we'll pass both into our inputs here,
00:15:46.000 | and then we just calculate the --
00:15:47.800 | or we extract the logits per image,
00:15:51.360 | and .item() just converts that from a tensor
00:15:56.800 | into a single value.
00:15:59.000 | And then here, we have those scores,
00:16:01.840 | so what we're doing here is creating the --
00:16:05.400 | what I earlier called, like, the relevance map
00:16:08.320 | or localization map throughout the whole image.
00:16:10.840 | So, for every window that we go through,
00:16:14.280 | we're adding this score to every single patch
00:16:17.760 | or little patch within that window,
00:16:20.080 | and what we're going to do,
00:16:22.240 | or what we're going to find when we do that
00:16:23.920 | is that some patches will naturally
00:16:26.360 | have a higher score than others
00:16:28.000 | because they are viewed more times, right?
00:16:31.400 | So, if you think about the top-left patch in the image,
00:16:33.600 | that's only going to be viewed once,
00:16:35.080 | whereas patches in the middle
00:16:36.600 | are going to be viewed many times
00:16:38.200 | because we'll have a sliding window
00:16:39.680 | going over there multiple times.
00:16:41.920 | So, what we also need to do
00:16:43.600 | is identify the number of runs that we perform
00:16:47.800 | or number of calculations that we perform
00:16:50.200 | within each one of those patches.
00:16:52.400 | The reason we do that is so that we can take the average
00:16:55.120 | for each score based on the number of times
00:16:57.520 | that score has been calculated
00:16:59.480 | because here, we're taking the total of all those scores,
00:17:03.240 | and then we just take the average like so.
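
Continuing from the sliding-window sketch above, the scoring step might look like this; the prompt is the one used in the video, while the variable names and the uint8 conversion for the processor are illustrative choices.

```python
import numpy as np
import torch

prompt = "a fluffy cat"
Y, X = patches.shape[2], patches.shape[3]
scores = torch.zeros(Y, X)  # running total of similarity scores per small patch
runs = torch.ones(Y, X)     # how many windows each small patch appeared in

for y in range(0, Y - window + 1, stride):
    for x in range(0, X - window + 1, stride):
        # build the 6x6 window for this position, as shown earlier
        big_patch = torch.zeros(window * 256, window * 256, 3)
        for dy in range(window):
            for dx in range(window):
                big_patch[dy*256:(dy+1)*256, dx*256:(dx+1)*256] = \
                    patches[0, :, y+dy, x+dx].permute(1, 2, 0)
        # the processor expects an image; an HWC uint8 array works across versions
        window_img = (big_patch.numpy() * 255).astype(np.uint8)
        inputs = processor(
            text=[prompt], images=window_img, return_tensors="pt", padding=True
        ).to(device)
        with torch.no_grad():
            score = model(**inputs).logits_per_image.item()
        scores[y:y+window, x:x+window] += score  # add this window's score to its patches
        runs[y:y+window, x:x+window] += 1        # and count the visit

scores /= runs  # average score per patch
```
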
00:17:05.360 | Now, the scores tensor is going to have
00:17:09.200 | a very smooth gradient of values
00:17:11.920 | from zero, completely irrelevant, to one.
00:17:15.320 | Now, if you consider that we've been going
00:17:17.040 | over these scores multiple times,
00:17:19.400 | it means that the object of interest
00:17:21.160 | has kind of like faded out of the window,
00:17:24.560 | like over multiple steps.
00:17:25.960 | So, that means that the similarity score
00:17:27.840 | quite gradually fades out as you go away from the object,
00:17:31.640 | which means that you don't really
00:17:32.640 | get very good localization
00:17:34.040 | if you use these scores directly.
00:17:35.960 | So, what we need to do is actually clip
00:17:39.200 | the lowest scores down to zero.
00:17:42.440 | So, to do that, what we do is calculate
00:17:45.000 | the average of scores across the whole image.
00:17:48.000 | We subtract that average from the current scores.
00:17:51.960 | What that will do is push 50% of the scores below zero,
00:17:56.040 | and then we clip those scores.
00:17:58.120 | So, anything below zero becomes zero,
00:18:00.840 | and we can do this multiple times.
00:18:02.960 | Okay, one time is usually enough,
00:18:04.560 | but you can do it multiple times
00:18:05.760 | to increase that effect of making the edge
00:18:10.320 | of this detected or localized area better defined.
00:18:14.160 | And then after you've done that,
00:18:15.600 | what we need to do is normalize those scores.
00:18:19.160 | Okay, so we might have done that clipping a few times,
00:18:21.720 | so everything's probably going to be
00:18:23.040 | within the range of like zero to 0.5,
00:18:26.400 | or zero to 0.2.
00:18:28.800 | So, then we normalize those scores
00:18:30.160 | to bring them back within the range of zero to one.
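
In code, that clipping and normalization step might look something like this, continuing from the `scores` tensor above.

```python
# Sharpen the relevance map: push everything below the mean down to zero, then
# rescale what is left back into the 0-1 range. Repeat the clamp for a harder edge.
for _ in range(1):
    scores = torch.clamp(scores - scores.mean(), min=0)
scores = (scores - scores.min()) / (scores.max() - scores.min())
```
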
00:18:32.560 | Now, to apply these scores to the patches,
00:18:37.560 | we need to align their tensors,
00:18:39.600 | because right now, they are not aligned.
00:18:42.200 | Okay, for the scores, we have a 20 by 13 tensor,
00:18:47.200 | but for the patches, we have the batch dimension there,
00:18:50.920 | we have the 20 by 13, which we do want,
00:18:53.280 | but then we have the three color channels
00:18:54.800 | and the 256 for each set of pixels
00:18:58.040 | within each patch.
00:18:58.960 | So, we need to adjust that a little bit.
00:19:00.800 | So, we need to first remove the batch dimension.
00:19:03.080 | We do that by squeezing out the zero dimension,
00:19:06.280 | which is our batch dimension.
00:19:08.040 | And then we permute the different dimensions,
00:19:11.360 | essentially just moving them around in our patches
00:19:14.160 | in order to align them better
00:19:15.400 | with the score tensor dimensions.
00:19:18.520 | And then all we do is multiply the patches by those scores.
00:19:22.000 | That's pretty straightforward.
00:19:24.200 | Then we have to permute them again,
00:19:25.600 | because if we want to visualize everything,
00:19:27.400 | it needs to be within a certain shape
00:19:29.960 | in order for us to visualize it in Matplotlib.
00:19:32.320 | So, we come down and first thing we do
00:19:37.200 | is just get Y and X here.
00:19:38.680 | So, Y and X are the number of patches.
00:19:41.880 | See here, this is Y, so the height of the image in patches,
00:19:45.680 | and then 13, which is the width of the image in patches.
00:19:49.440 | And we come down here and we can plot this.
00:19:52.120 | Okay, and we get this pretty nice visual
00:19:55.080 | which localizes the fluffy cat within that image.
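
One way to do that weighting and plotting, continuing from the tensors above; the exact permute order here is a reconstruction, not copied from the notebook.

```python
import matplotlib.pyplot as plt

Y, X = scores.shape
# weight every 256x256 patch by its normalized score
adj_patches = patches.squeeze(0) * scores.view(Y, X, 1, 1)           # (3, Y, X, 256, 256)
# rearrange to (Y, 256, X, 256, 3) and stitch back into one big image
visual = adj_patches.permute(1, 3, 2, 4, 0).reshape(Y * 256, X * 256, 3)

plt.imshow(visual)
plt.axis("off")
plt.show()
```
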
00:19:59.920 | Now, what's really interesting is
00:20:01.480 | if we just search for a cat,
00:20:03.600 | we actually get a slightly different localization,
00:20:06.160 | because here you can see it's kind of focusing a lot
00:20:08.080 | on the fluffy part of the cat.
00:20:10.400 | So, if we just search for a cat,
00:20:12.560 | it would actually focus more on the head.
00:20:14.960 | So, we can really add nuance information to these prompts
00:20:19.960 | and get a pretty nuanced response back.
00:20:24.280 | Now, we can do the same for butterfly.
00:20:27.120 | So, we'll just throw all that code together.
00:20:29.280 | This is just what we've done before.
00:20:30.920 | We initialize scores and runs,
00:20:32.720 | and we go process all of that.
00:20:34.440 | The only thing we change here is the prompt.
00:20:36.720 | We change it to a butterfly.
00:20:38.440 | And if we go down, and we're gonna go down and down,
00:20:41.400 | and visualize that, we get this, okay?
00:20:44.440 | So, again, that's pretty cool.
00:20:47.000 | We can see that it is identifying where in the image
00:20:51.080 | that butterfly actually is.
00:20:52.840 | So, that is the object localization set.
00:20:56.560 | Now, I want to have a look at object detection,
00:20:59.000 | which is essentially just taking the object localization
00:21:01.720 | and wrapping some more code around it
00:21:04.760 | in order to look at these multiple objects
00:21:07.840 | rather than just one.
00:21:09.040 | But to do that, we can't really visualize
00:21:12.280 | in the same way that we've done here.
00:21:14.720 | We're going to need a different type of visualization,
00:21:17.920 | and that's where we have the bounding boxes.
00:21:20.280 | So, let's take a look at how we would do that.
00:21:23.280 | So, using the, I think the butterfly example,
00:21:26.960 | so, the butterfly scores that we just calculated,
00:21:30.600 | we're going to look at where those scores
00:21:32.440 | are higher than 0.5.
00:21:33.760 | Now, you can adjust this threshold
00:21:35.360 | based on what you find works best.
00:21:38.280 | So, we do this, and what we'll get
00:21:41.040 | is an array of true and false values
00:21:43.640 | as to where the score was higher than 0.5 and not.
00:21:47.480 | And then we detect where the non-zero values are
00:21:52.600 | in that array, and what we do
00:21:53.960 | is get a load of X and Y values here.
00:21:57.440 | So, position three, two,
00:21:59.400 | we know that there is a score that is higher than 0.5,
00:22:03.520 | and we get three and two here.
00:22:05.080 | So, three is the row of the non-zero value,
00:22:10.080 | and two is the column of the non-zero value.
00:22:12.920 | So, at row position three and column two,
00:22:17.440 | we know that there is a non-zero value,
00:22:20.000 | or a value or score that's higher than 0.5, our threshold.
00:22:24.720 | And put all that together, we get something
00:22:26.640 | that looks kind of like this.
00:22:28.080 | So, we already, we kind of see that localization visual
00:22:33.080 | that we just created.
00:22:35.400 | And what we want to do is identify the bounding box
00:22:39.720 | that's just kind of surrounding those values, okay?
00:22:43.320 | So, we know in terms of like a coordinate system,
00:22:45.760 | we want one and three and four and 10
00:22:48.320 | to be included within that.
00:22:49.680 | So, what we do is find the corners
00:22:52.680 | from the detection array or set of coordinates
00:22:57.680 | that we got before from NP non-zero.
00:23:02.240 | And what we do is we just take the minimum X and Y values,
00:23:07.120 | and maximum X and Y values,
00:23:08.880 | and that will give us the corners of the box.
00:23:11.400 | And that's pretty simple to calculate.
00:23:15.360 | Now, when we get the maximum value,
00:23:19.040 | what we need to keep in mind is that
00:23:21.520 | the position we're getting for each patch
00:23:23.880 | is essentially identifying the top left corner
00:23:25.920 | of each patch.
00:23:30.000 | So, when we're looking at the maximum value,
00:23:31.720 | we actually want not the start of the patch,
00:23:34.440 | but the end of the patch, okay?
00:23:36.520 | So, that's why we add that plus one here
00:23:39.200 | in order to get that.
00:23:41.160 | And the same for the X max value as well.
00:23:43.640 | So, that gives us the corner coordinates.
00:23:46.720 | And then what we do is multiply those corner coordinates
00:23:50.160 | by the patch size, which is 256 pixels.
00:23:52.760 | And then we have the pixel positions
00:23:55.120 | of each one of those corners.
00:23:58.000 | Because before we had the patch coordinates,
00:24:00.400 | now we have the pixel coordinates,
00:24:02.040 | which we can map directly onto the original image.
00:24:04.720 | So, we can see the minimum values here.
00:24:07.640 | So, we have for X and Y, 256,
00:24:11.040 | and 768.
00:24:13.000 | And what we want to do,
00:24:14.600 | because we're going to be using matplotlib patches,
00:24:17.320 | matplotlib patches expects the top left corner coordinates
00:24:21.640 | and the width and height of the bounding box
00:24:24.480 | that you want to create.
00:24:25.720 | So, we calculate the width and height.
00:24:28.480 | And that's pretty simple.
00:24:29.520 | It's just Y max minus Y min and X max minus X min.
00:24:34.520 | And we get these.
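
Those steps, sketched in code with numpy used for the nonzero lookup; the threshold is the one from the video and the variable names are illustrative.

```python
import numpy as np

patch = 256
threshold = 0.5

detection = (scores > threshold).numpy()   # boolean map of "hot" patches
y_idx, x_idx = np.nonzero(detection)       # rows and columns where the score passed 0.5

# box corners in patch coordinates; +1 so the max edge covers the end of that patch
y_min, y_max = y_idx.min(), y_idx.max() + 1
x_min, x_max = x_idx.min(), x_idx.max() + 1

# convert patch coordinates to pixel coordinates
y_min, y_max, x_min, x_max = [int(v) * patch for v in (y_min, y_max, x_min, x_max)]

width = x_max - x_min
height = y_max - y_min
```
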
00:24:36.240 | And what we can do now is take the image.
00:24:41.040 | We have to reshape it a little bit.
00:24:44.920 | So, we have to move the three color channels dimension
00:24:48.080 | from the zero dimension to the final dimension.
00:24:51.880 | So, we just do that here, move axes.
00:24:54.920 | And now we can plot that image.
00:24:56.960 | Okay, so we show that image with matplotlib.
00:25:00.360 | And then we create the rectangle patch.
00:25:02.280 | This is our bounding box.
00:25:03.960 | Okay, so we pass X min and Y min.
00:25:06.000 | That's the top left corner.
00:25:07.720 | And then we also pass the width and height
00:25:10.600 | of what the bounding box should be.
00:25:13.560 | And if we come down, we get this visual.
00:25:15.840 | Okay, so that's our bounding box visualization.
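
A sketch of the bounding-box plot itself; the line width and colour are arbitrary choices.

```python
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import numpy as np

# move the colour channels to the last axis so matplotlib can display the image
image_arr = np.moveaxis(img.squeeze(0).numpy(), 0, -1)   # (H, W, 3)

fig, ax = plt.subplots(figsize=(X * 0.5, Y * 0.5))
ax.imshow(image_arr)
ax.add_patch(mpatches.Rectangle(
    (x_min, y_min), width, height,          # top-left corner, then width and height
    linewidth=3, edgecolor="#FAFF00", fill=False,
))
plt.show()
```
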
00:25:19.000 | And with that, it's not much further
00:25:22.120 | to create our object detection.
00:25:24.680 | So, let's have a look at how we do that.
00:25:27.400 | Now, the logic for this is pretty much just a loop
00:25:30.520 | over what we've already done.
00:25:31.800 | So, I put together a load of functions here,
00:25:35.400 | which is essentially just what we've already gone through,
00:25:37.640 | getting patches, getting the scores, getting the box.
00:25:43.280 | And then the one thing that is new here
00:25:46.160 | is this detect function.
00:25:48.080 | Okay, so we have detect.
00:25:49.280 | That's gonna get the patches.
00:25:51.320 | So, it's gonna take an image
00:25:52.720 | and it's gonna split it into those patches that we created.
00:25:55.320 | We're gonna convert the image into format
00:25:57.040 | for displaying with matplotlib.
00:25:58.320 | We did that before.
00:25:59.680 | And we also initialize that plot
00:26:02.080 | and add our image to that plot.
00:26:05.320 | And then what we do is we have a for loop.
00:26:08.600 | And this for loop goes through the image localization steps
00:26:12.200 | and bounding box steps that we just went through,
00:26:15.360 | just multiple times.
00:26:17.040 | Okay, so we have multiple prompts
00:26:18.400 | and we want to do multiple times.
00:26:19.960 | So, we calculate our similarity scores
00:26:21.760 | based on a specific prompt for all of our image patches.
00:26:26.600 | From that, we get our scores
00:26:29.160 | in that patch tensor format that we saw before.
00:26:33.080 | And then what we do is we want to get the box
00:26:36.400 | based on a particular threshold.
00:26:37.960 | So, 0.5, like we used before.
00:26:40.360 | You can see it up there.
00:26:41.280 | We have our patch size,
00:26:42.680 | which we just need to pass that
00:26:43.840 | for the calculation of the, or for the conversion.
00:26:48.440 | And we have our patch size,
00:26:49.640 | which we pass to that for the conversion
00:26:51.680 | from patch pixel, from patch coordinates
00:26:56.400 | to pixel coordinates.
00:26:58.120 | And then we also have our scores.
00:26:59.600 | And that will return the minimum X and Y coordinates
00:27:02.320 | and also width and height of the box.
00:27:05.480 | We create the bounding box.
00:27:07.400 | And then we add that to the axis, okay?
00:27:10.120 | So, now let's visualize all of this, see what we get.
00:27:14.280 | So, here I've used a slightly smaller window size
00:27:16.680 | before using six, just to point out that you can change this.
00:27:20.320 | And depending on your image,
00:27:22.280 | it may be better to use a smaller or larger window.
00:27:26.160 | And you can see, so what we're doing here,
00:27:29.960 | we've got a cat and a butterfly.
00:27:32.120 | And you can see that we get, we get a butterfly here
00:27:35.160 | and we get the cat here, okay?
00:27:37.360 | That's pretty cool.
00:27:38.280 | And like I said, with Clip,
00:27:41.560 | we can apply this object detection without fine tuning.
00:27:45.720 | All we need to do is change these prompts here, okay?
00:27:49.640 | So, it's really straightforward to modify this
00:27:54.280 | and move it to a new domain.
00:27:56.440 | Okay, so that's it for this walkthrough
00:27:59.960 | of object localization and object detection with Clip.
00:28:04.800 | As I said, I think zero-shot object localization,
00:28:08.280 | detection, and even classification opens the doors
00:28:12.080 | to a lot of projects and use cases
00:28:14.920 | that were just not accessible before
00:28:17.480 | because of time and capital constraints.
00:28:20.880 | And now we can just use Clip
00:28:22.640 | and get pretty impressive results very quickly.
00:28:26.280 | All it requires is a bit of code changing here and there.
00:28:29.640 | Now, I think Clip is one part of a trend
00:28:32.800 | in multi-modality that is kind of creating
00:28:35.800 | a more accessible ML that is less brittle
00:28:39.960 | than models were in the past,
00:28:41.280 | which required a lot of fine tuning
00:28:42.680 | just to adapt to a slightly different domain,
00:28:45.200 | and that is more generally applicable,
00:28:48.400 | which I think is really exciting.
00:28:50.040 | And it's really cool to see this sort of thing
00:28:53.360 | actually being used, and to use it yourself
00:28:55.200 | and just see how easy it is to apply Clip
00:28:59.360 | to so many different use cases
00:29:01.720 | and have it work incredibly easily.
00:29:05.600 | So that's it for this video.
00:29:08.560 | I hope it has been useful.
00:29:12.160 | So thank you very much for watching
00:29:14.000 | and I will see you again in the next one.
00:29:16.680 | (upbeat music)
00:29:19.280 | (upbeat music fades)
00:29:22.360 | (upbeat music fades)
00:29:25.440 | (upbeat music fades)