
Fast Zero Shot Object Detection with OpenAI CLIP


Chapters

0:00 Early Progress in Computer Vision
2:03 Classification vs. Localization and Detection
3:55 Zero Shot with OpenAI CLIP
5:23 Zero Shot Object Localization with OpenAI CLIP
6:40 Localization with Occlusion Algorithm
7:44 Zero Shot Object Detection with OpenAI CLIP
8:34 Data Preprocessing for CLIP
13:55 Initializing OpenAI CLIP in Python
17:05 Clipping the Localization Visual
18:32 Applying Scores for Visual
20:25 Object Localization with New Prompt
20:52 Zero Shot Object Detection in Python
21:20 Creating Bounding Boxes with Matplotlib
25:15 Object Detection Code
27:11 Object Detection Results
28:29 Trends in Multi-Modal ML

Whisper Transcript

00:00:00.000 | The ImageNet Large Scale Visual Recognition Challenge
00:00:03.720 | was a world-changing competition
00:00:07.560 | that ran from around 2010 to 2017.
00:00:11.680 | During this time, the competition acted as the place to go
00:00:16.680 | if you needed to find what the current state of the art
00:00:20.640 | was in image classification,
00:00:23.240 | object localization, object detection.
00:00:25.840 | As well as that, from 2012 onwards,
00:00:28.560 | it really acted as the catalyst
00:00:31.160 | of the explosion in deep learning.
00:00:34.120 | Researchers fine-tuned better performing
00:00:37.600 | computer vision models year on year,
00:00:40.320 | but there was an unquestioned assumption causing problems.
00:00:44.760 | It was assumed that every new task
00:00:48.200 | required model fine-tuning.
00:00:50.800 | This required a lot of data,
00:00:52.920 | and a lot of data required a lot of capital and time.
00:00:56.040 | It wasn't until recently that this assumption
00:00:59.800 | was challenged and proven wrong.
00:01:03.120 | The astonishing rise of what are called multimodal models
00:01:08.120 | has made what was thought impossible
00:01:13.160 | very possible across various domains and tasks.
00:01:17.040 | One of those is called zero-shot object detection
00:01:21.440 | and localization.
00:01:23.240 | Now, zero-shot refers to taking a model
00:01:26.560 | and applying it to a new domain
00:01:28.520 | without ever fine-tuning it on data from that new domain.
00:01:33.400 | So that means we can take a model,
00:01:35.880 | maybe one that works in one domain,
00:01:39.600 | say classification in one particular area on one dataset,
00:01:43.320 | and we can take that same model without any fine-tuning,
00:01:46.080 | and we can use it for object detection
00:01:49.520 | in a completely different domain.
00:01:51.320 | Without that model seeing any training data
00:01:53.560 | from that new domain.
00:01:54.760 | So in this video, we're going to explore
00:01:57.040 | how to use OpenAI's CLIP
00:01:59.120 | for zero-shot object detection and localization.
00:02:03.480 | Let's begin with taking a quick look at image classification.
00:02:07.480 | Now, image classification can kind of be seen
00:02:10.400 | as one of the simplest tasks in visual recognition.
00:02:13.240 | And it's also the first step on the way to object detection.
00:02:17.640 | At its core, it's just assigning a categorical label
00:02:21.160 | to an image.
00:02:22.360 | Now, moving on from image classification,
00:02:24.400 | we have object localization.
00:02:27.160 | Object localization is image classification
00:02:30.880 | followed by the identification
00:02:33.920 | of where in the image the specific object actually is.
00:02:38.480 | So we are localizing the object.
00:02:42.080 | Now, doing that, we're essentially just going
00:02:43.920 | to identify the coordinates on the image,
00:02:45.920 | and we're going to return,
00:02:50.440 | the typical approach to this is to return an image
00:02:50.440 | where you have like a bounding box
00:02:52.440 | surrounding the object that you are looking for.
00:02:55.880 | And then we take this one step further
00:02:57.960 | to perform object detection.
00:03:00.880 | With detection, we are localizing multiple objects
00:03:05.320 | within the image, or we have the capability
00:03:08.800 | to identify multiple objects within the image.
00:03:11.560 | So in this example, we have a cat and a dog.
00:03:14.240 | We would expect with object detection
00:03:16.920 | to identify both the cat and the dog.
00:03:20.200 | In the case of us having multiple dogs in this image
00:03:24.240 | or multiple cats in this image,
00:03:25.720 | we would also expect the object detection algorithm
00:03:29.200 | to actually identify each one of those independently.
00:03:32.680 | Now, in the past, if we wanted to switch a model
00:03:36.240 | between any one of these tasks,
00:03:37.480 | we'd have to fine tune it on more data.
00:03:39.920 | If we wanted to switch it to another domain,
00:03:41.840 | we would have to also fine tune it
00:03:43.960 | on new data from that domain.
00:03:46.280 | But that's not always the case with models
00:03:49.520 | like OpenAI's CLIP, which can perform each one of these tasks
00:03:53.480 | in a zero-shot setting.
00:03:55.440 | Now, OpenAI's CLIP is a multi-modal model
00:04:00.360 | that has been pre-trained on a huge number
00:04:02.960 | of text and image pairs.
00:04:05.120 | And it essentially works by identifying
00:04:09.520 | text and image pairs that have a similar meaning
00:04:12.600 | and placing them within a similar vector space.
00:04:16.040 | Every text and every image gets converted into a vector
00:04:19.760 | and they are placed in a shared vector space.
00:04:22.720 | And the vectors that appear close together,
00:04:25.160 | they have a similar meaning.
00:04:26.480 | Now, CLIP's very broad pre-training
00:04:29.360 | means that it can perform very effectively
00:04:32.520 | across a lot of different domains.
00:04:34.280 | It's seen a lot of data,
00:04:35.240 | and so it has a good understanding
00:04:37.160 | of all these different things.
00:04:38.960 | And we can even adjust the task being performed
00:04:43.120 | with just a few code changes.
00:04:45.120 | We don't actually have to adjust the model itself.
00:04:47.680 | We just adjust the code around the model.
00:04:50.040 | And that's very much thanks to CLIP's focus
00:04:53.680 | on sort of comparing these vectors.
00:04:56.680 | So for example, for classification,
00:04:58.920 | we give CLIP a list of our class labels,
00:05:02.320 | and then we pass in images,
00:05:03.880 | and we just identify within that vector space
00:05:06.320 | where those images are
00:05:08.280 | with respect to those class label vectors,
00:05:11.040 | and which class label is most similar
00:05:14.920 | to our particular image.
00:05:16.640 | And then that is our prediction.
00:05:18.720 | So that most similar class label,
00:05:21.920 | that's our predicted class.
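
As a rough illustration of that classification flow, here is a minimal sketch using Hugging Face Transformers. The checkpoint id, the labels, and the example image are assumptions for illustration, not taken from the video.

```python
# Minimal zero-shot classification sketch with CLIP (checkpoint, labels and
# the example image below are illustrative assumptions).
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model_id = "openai/clip-vit-base-patch32"
processor = CLIPProcessor.from_pretrained(model_id)
model = CLIPModel.from_pretrained(model_id)

image = Image.open("cat.jpg")  # any image you want to classify
labels = ["a photo of a cat", "a photo of a dog", "a photo of a butterfly"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # similarity of the image to each label
print(labels[probs.argmax().item()])              # most similar label = predicted class
```
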
00:05:23.440 | Now, for object localization,
00:05:25.760 | we apply a very similar type of logic.
00:05:29.480 | As before, we create a class label,
00:05:31.920 | but unlike before,
00:05:33.200 | we don't feed the entire image into CLIP.
00:05:36.560 | To localize an object,
00:05:37.720 | we have to break the image into patches.
00:05:41.040 | We then pass a window over all of those patches,
00:05:44.560 | moving across the entire image,
00:05:46.760 | left to right, top to bottom.
00:05:48.240 | And we generate an image embedding
00:05:50.880 | for each of those windows.
00:05:52.960 | And then we calculate the similarity
00:05:55.360 | between each one of those windows embedded by CLIP
00:05:58.680 | and the class label embedding,
00:06:00.800 | returning a similarity score for every single patch.
00:06:03.840 | Now, after calculating the similarity score
00:06:05.840 | for every single patch,
00:06:07.160 | we use that to create almost like a map of relevance
00:06:11.200 | across the entire image.
00:06:12.840 | And then we can use that map
00:06:14.120 | to identify the location of the object of interest.
00:06:18.400 | And from that, we will get something
00:06:20.200 | that looks kind of like this.
00:06:21.040 | So most of the image will be very dark and black.
00:06:24.680 | That means the object of interest is not in that space.
00:06:28.000 | And then using that localization map,
00:06:30.680 | we can create a more traditional
00:06:33.080 | bounding box visualization as well.
00:06:35.640 | Both of these visuals are capturing the same information,
00:06:38.240 | we're just displaying it in a different way.
00:06:40.120 | Now, there's also other approaches to this.
00:06:42.600 | So I recently hosted a talk
00:06:45.640 | with two sets of people, actually.
00:06:48.320 | So Federico Bianchi from Stanford's NLP group,
00:06:52.440 | and also Raphael Pisoni.
00:06:54.520 | And both of them have worked on an Italian CLIP project.
00:06:59.320 | And part of that was performing object localization.
00:07:03.880 | Now, to do that, they use a slightly different approach
00:07:07.600 | to what I'm going to demonstrate here.
00:07:09.680 | And we can think of it as almost like the opposite.
00:07:12.320 | So whereas we slide a window over the whole image,
00:07:15.920 | they slide a black patch over the whole image,
00:07:19.800 | which hides what is behind that patch.
00:07:22.960 | And then they feed the image into CLIP.
00:07:25.200 | And essentially, as you slide the patch over the image,
00:07:30.160 | you are hiding a part of the image.
00:07:32.500 | And therefore, if the similarity score drops
00:07:35.760 | when the patch is over a certain area,
00:07:38.160 | you know that the object you're looking for
00:07:40.360 | is probably within that space.
00:07:42.560 | And that's called the occlusion algorithm.
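
For reference, here is a hypothetical sketch of that occlusion idea. The `clip_score` helper stands in for a CLIP text-image similarity call and every size is illustrative, so treat this as a rough outline rather than the CLIP-Italian implementation.

```python
import torch

def occlusion_map(img, prompt, box=256, stride=128):
    """Slide a black square over the image and record how much the CLIP
    similarity score drops at each position (a bigger drop suggests the
    object sits under the occluded region)."""
    _, H, W = img.shape                        # img: (3, H, W) float tensor in [0, 1]
    ys = range(0, H - box + 1, stride)
    xs = range(0, W - box + 1, stride)
    drops = torch.zeros(len(ys), len(xs))
    base = clip_score(img, prompt)             # assumed helper: CLIP similarity score
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            occluded = img.clone()
            occluded[:, y:y+box, x:x+box] = 0.0  # hide this region
            drops[i, j] = base - clip_score(occluded, prompt)
    return drops
```
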
00:07:44.600 | And then moving on to object detection,
00:07:46.080 | which is like the last level in these three tasks,
00:07:49.680 | we will be identifying multiple objects.
00:07:52.440 | Now, there's a very fine line between object localization
00:07:56.000 | and object detection,
00:07:57.240 | but you can simply think of it as localization
00:08:00.180 | for multiple classes and multiple objects.
00:08:02.480 | With our cat and butterfly image,
00:08:04.440 | we will be searching for two objects,
00:08:06.500 | a cat and a butterfly.
00:08:08.260 | And with that, we could draw a bounding box
00:08:11.040 | around both of those objects.
00:08:12.720 | And essentially, what we're doing there
00:08:13.960 | is using localization for a single object,
00:08:16.720 | but then we're putting both of those together
00:08:18.520 | in a loop in our code,
00:08:20.860 | and we're producing this object detection process.
00:08:23.820 | Now, we've covered the idea behind image classification
00:08:28.200 | onto object localization and object detection.
00:08:31.760 | Now, let's have a look
00:08:32.600 | at how we actually implement all of this.
00:08:34.480 | Now, before we move on to any classification,
00:08:36.760 | localization, or detection task,
00:08:39.000 | we need to have some data.
00:08:41.680 | We're gonna use a small demo dataset
00:08:44.000 | called jamescalam image text demo,
00:08:46.940 | and we can download it like this.
00:08:49.760 | So using Hugging Face datasets here,
00:08:52.560 | which we can pip install with pip install datasets,
00:09:00.860 | and this is a dataset, it's very small,
00:09:03.320 | it's 21 text to image pairs, okay?
00:09:07.840 | One of those is the image you've already seen,
00:09:10.720 | the cat with a butterfly landing on its nose,
00:09:14.580 | very curious how they got that photo.
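
A short sketch of that download step; the dataset id, the column name, and the record index of the cat-and-butterfly photo are assumptions here.

```python
# pip install datasets
from datasets import load_dataset

# Dataset id assumed to be "jamescalam/image-text-demo"; adjust if it differs.
data = load_dataset("jamescalam/image-text-demo", split="train")
print(len(data))          # 21 text-image pairs
image = data[2]["image"]  # PIL image; the index of the cat photo is illustrative
```
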
00:09:16.900 | Now, after you've downloaded that dataset,
00:09:19.340 | we extract the image we're gonna be using, this image here,
00:09:22.540 | and what we want to do is not use the image file itself,
00:09:28.020 | 'cause at the moment it's a Pill Python image object,
00:09:32.680 | but instead we need to convert it into a tensor.
00:09:36.600 | Now, we're gonna be using PyTorch later on,
00:09:38.760 | so what I'm going to do here is we're going to just
00:09:42.120 | transform the image into a tensor,
00:09:44.040 | and we use TorchVision transforms,
00:09:46.320 | which is a typical pipeline tool in computer vision,
00:09:49.760 | and we just use ToTensor, okay?
00:09:52.120 | And then we process our image through that pipeline,
00:09:55.880 | and then we can see that we get this, okay?
00:09:57.880 | So, what are these values here?
00:10:00.160 | We have the height of the image in pixels,
00:10:03.400 | the width of the image in pixels,
00:10:05.840 | and then also the three color channels,
00:10:08.880 | red, green, and blue, that make up the image.
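
That conversion looks roughly like this, with the colour channels ending up in the first dimension, followed by height and width.

```python
import torchvision.transforms as T

# Convert the PIL image into a float tensor of shape (3, height, width)
img = T.ToTensor()(image)
print(img.shape)  # torch.Size([3, H, W]): colour channels, then pixel height and width
```
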
00:10:12.380 | Now, we need a slightly different format
00:10:16.280 | when we are processing everything.
00:10:18.400 | One, we need to add those patches,
00:10:20.840 | and two, we need to process it through a PyTorch model,
00:10:25.640 | and we also need the batch dimension for that.
00:10:28.640 | So, the first thing we're gonna do
00:10:30.160 | is add the batch dimension.
00:10:31.480 | It's just a single image, so we just have one in there,
00:10:34.800 | but we need that anyway.
00:10:37.200 | And then we come down to here.
00:10:39.520 | So, this is where we're gonna break
00:10:41.440 | the image into the patches, okay?
00:10:45.200 | Each patch is going to be 256 pixels
00:10:48.000 | in both height and width.
00:10:49.880 | So, the first thing we do here is unfold,
00:10:52.600 | and we get this here.
00:10:54.760 | We get this 256 and this 20.
00:10:57.320 | Now, the 20 is the height of the image
00:11:00.720 | in these 256-pixel patches,
00:11:04.400 | and we can visualize that here, all right?
00:11:08.120 | So, now we have all these kind of like slivers of the image.
00:11:12.800 | That's just a vertical component of each patch,
00:11:15.360 | and we use unfold again,
00:11:19.120 | but this time in a second dimension,
00:11:21.000 | so targeting what was this dimension here,
00:11:24.680 | and we also get another 256.
00:11:26.880 | Now, if we visualize that, we get our full patches,
00:11:30.200 | okay, like this.
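
Here is a sketch of that patching step using `Tensor.unfold`; the patch counts in the comment are from the video's image and will differ for other images.

```python
patch = 256  # patch size in pixels

img = img.unsqueeze(0)                     # add the batch dimension -> (1, 3, H, W)
patches = img.unfold(2, patch, patch)      # unfold the height dimension into 256-px slices
patches = patches.unfold(3, patch, patch)  # then unfold the width dimension
print(patches.shape)  # (1, 3, Y, X, 256, 256), e.g. Y=20 patches tall, X=13 patches wide
```
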
00:11:31.440 | Now, if you just consider this here,
00:11:36.920 | it's like, if we look at this patch here,
00:11:39.840 | it doesn't tell us anything about the image, right?
00:11:43.400 | And even when we're over the cat,
00:11:45.520 | these patches are way too small
00:11:46.920 | to actually tell us anything.
00:11:48.800 | If Clip is processing a single patch at a time,
00:11:52.560 | it's probably not going to tell us anything.
00:11:54.560 | Maybe it could tell us that there's some hair in this patch
00:11:57.880 | or that there's an eye in this patch,
00:12:00.120 | but beyond that, it's not going to be very useful.
00:12:02.760 | So, rather than feeding single patches into Clip,
00:12:06.000 | what we do is actually feed a window of six by six patches,
00:12:10.400 | or we can modify that value if we prefer,
00:12:13.160 | and that just gives us a big patch to pass over to Clip.
00:12:17.520 | Now, the reason that we don't just do that from the start,
00:12:19.920 | we don't just create these bigger patches to begin with,
00:12:23.000 | is because when we're sliding through the image,
00:12:25.240 | we want to have some degree of overlap between each patch.
00:12:29.000 | Okay, so we create these smaller patches,
00:12:31.000 | and then what we can do is actually slide across
00:12:33.120 | just one little patch at a time,
00:12:35.240 | and we define that using the stride variable.
00:12:37.640 | So, if we come down to here,
00:12:40.440 | we have window, we have stride, remove this,
00:12:44.080 | and here we go.
00:12:46.760 | This is our code for going through the whole image,
00:12:49.560 | creating a patch at every time step, okay?
00:12:52.880 | So, we go for Y, and then we go through the whole Y-axis,
00:12:57.360 | and then within that, we're going across left to right
00:13:00.040 | with each step, and we initialize an empty big patch array,
00:13:04.440 | so this is our, like, the full window.
00:13:07.240 | We get the current patch, so, okay,
00:13:10.360 | let's say we start at zero, zero, X zero, Y zero.
00:13:14.400 | We go from zero to six, and zero to six here, right?
00:13:19.280 | So, that gives us the very top left corner
00:13:23.040 | or window of the image,
00:13:24.760 | and then we're literally going through
00:13:26.240 | and just go processing all of that,
00:13:28.720 | and you can see that happening here.
00:13:30.080 | As Y and X are increasing, we're moving through that image,
00:13:34.840 | and we're seeing each big patch from our image, okay?
00:13:39.080 | Sliding across with a single small little patch at a time
00:13:42.840 | so that we don't miss any important information.
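
Put into code, the window loop described above looks something like this rough reconstruction: a 6x6 window of patches moved one small patch at a time; the variable names are illustrative.

```python
import torch

window = 6  # window size, in patches
stride = 1  # move one small patch at a time

Y, X = patches.shape[2], patches.shape[3]   # image height and width, in patches

for y in range(0, Y - window + 1, stride):
    for x in range(0, X - window + 1, stride):
        # initialize an empty "big patch" covering the full 6x6 window
        big_patch = torch.zeros(window * 256, window * 256, 3)
        # stitch the small patches of this window together into one image
        for dy in range(window):
            for dx in range(window):
                big_patch[dy*256:(dy+1)*256, dx*256:(dx+1)*256] = \
                    patches[0, :, y+dy, x+dx].permute(1, 2, 0)
        # big_patch is the window that gets shown to CLIP for this (y, x) position
```
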
00:13:46.240 | Now, this is how we're gonna run through the whole image,
00:13:50.080 | but before we do that, we actually need clip,
00:13:52.680 | so let's go ahead and actually initialize clip.
00:13:55.640 | So, to do that, all we do is this,
00:13:58.480 | so we're using Hugging Face Transformers,
00:14:00.440 | which is using PyTorch in the back there,
00:14:04.000 | so we need the clip processor,
00:14:05.800 | which is like a pre-processing pipeline
00:14:08.360 | for both text and images, and then the actual model itself,
00:14:12.720 | okay, so we set model ID, and we initialize both of those.
00:14:17.200 | Then, what we want to do is move the model
00:14:19.720 | to a device, if possible, all right?
00:14:22.800 | So, we can use CPU, but if you have a CUDA-enabled GPU,
00:14:26.960 | that will be much faster, so I'd recommend doing that.
00:14:30.960 | If you can, if not, then you can use CPU.
00:14:34.080 | It will be a bit slower,
00:14:35.320 | but we'll still run within a bearable timeframe,
00:14:39.320 | so if I'm running this on my Mac,
00:14:42.480 | I am using CPU, you can actually run this on MPS as well,
00:14:46.360 | so you could change your device to MPS
00:14:49.720 | if you have an MPS-enabled Apple Silicon device.
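
The initialization looks roughly like this; the checkpoint id is assumed to be the standard ViT-B/32 CLIP release.

```python
import torch
from transformers import CLIPProcessor, CLIPModel

model_id = "openai/clip-vit-base-patch32"            # assumed checkpoint
processor = CLIPProcessor.from_pretrained(model_id)  # text + image preprocessing pipeline
model = CLIPModel.from_pretrained(model_id)

# Prefer CUDA, fall back to MPS on Apple Silicon, otherwise use the CPU.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
model = model.to(device)
```
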
00:14:53.880 | So, now, returning to that process
00:14:57.000 | where we're going through each window within the image,
00:15:00.400 | we're just going to add a little bit more logic,
00:15:02.440 | so we are processing like we were before.
00:15:05.200 | There's nothing different here.
00:15:07.120 | We're creating that big patch,
00:15:09.080 | and then what we do is process that big patch
00:15:11.560 | and process a text label, okay?
00:15:14.240 | So, at the moment, we're looking for a fluffy cat
00:15:16.360 | within this image, so that is how we do this.
00:15:19.600 | We're returning PyTorch tensors.
00:15:21.240 | We also add padding here as well for the text,
00:15:25.480 | although, in this case, I don't think we need it
00:15:28.880 | because we only have a single text item,
00:15:31.680 | but we include that when we're using
00:15:33.280 | multiple text items later,
00:15:35.280 | and then we calculate and retrieve
00:15:36.920 | the similarity score between them, okay?
00:15:40.080 | So, if we pass both text and images through this processor,
00:15:43.320 | we'll pass both into our inputs here,
00:15:46.000 | and then we just calculate the --
00:15:47.800 | or we extract the logits per image,
00:15:51.360 | and .item() just converts that from a tensor
00:15:56.800 | into a single value.
00:15:59.000 | And then here, we have those scores,
00:16:01.840 | so what we're doing here is creating the --
00:16:05.400 | what I earlier called, like, the relevance map
00:16:08.320 | or localization map throughout the whole image.
00:16:10.840 | So, for every window that we go through,
00:16:14.280 | we're adding this score to every single patch
00:16:17.760 | or little patch within that window,
00:16:20.080 | and what we're going to do,
00:16:22.240 | or what we're going to find when we do that
00:16:23.920 | is that some patches will naturally
00:16:26.360 | have a higher score than others
00:16:28.000 | because they are viewed more times, right?
00:16:31.400 | So, if you think about the top-left patch in the image,
00:16:33.600 | that's only going to be viewed once,
00:16:35.080 | whereas patches in the middle
00:16:36.600 | are going to be viewed many times
00:16:38.200 | because we'll have a sliding window
00:16:39.680 | going over there multiple times.
00:16:41.920 | So, what we also need to do
00:16:43.600 | is identify the number of runs that we perform
00:16:47.800 | or number of calculations that we perform
00:16:50.200 | within each one of those patches.
00:16:52.400 | The reason we do that is so that we can take the average
00:16:55.120 | for each score based on the number of times
00:16:57.520 | that score has been calculated
00:16:59.480 | because here, we're taking the total of all those scores,
00:17:03.240 | and then we just take the average like so.
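
Continuing from the sliding-window sketch above, the scoring step might look like this; the prompt is the one used in the video, while the variable names and the uint8 conversion for the processor are illustrative choices.

```python
import numpy as np
import torch

prompt = "a fluffy cat"
Y, X = patches.shape[2], patches.shape[3]
scores = torch.zeros(Y, X)  # running total of similarity scores per small patch
runs = torch.ones(Y, X)     # how many windows each small patch appeared in

for y in range(0, Y - window + 1, stride):
    for x in range(0, X - window + 1, stride):
        # build the 6x6 window for this position, as shown earlier
        big_patch = torch.zeros(window * 256, window * 256, 3)
        for dy in range(window):
            for dx in range(window):
                big_patch[dy*256:(dy+1)*256, dx*256:(dx+1)*256] = \
                    patches[0, :, y+dy, x+dx].permute(1, 2, 0)
        # the processor expects an image; an HWC uint8 array works across versions
        window_img = (big_patch.numpy() * 255).astype(np.uint8)
        inputs = processor(
            text=[prompt], images=window_img, return_tensors="pt", padding=True
        ).to(device)
        with torch.no_grad():
            score = model(**inputs).logits_per_image.item()
        scores[y:y+window, x:x+window] += score  # add this window's score to its patches
        runs[y:y+window, x:x+window] += 1        # and count the visit

scores /= runs  # average score per patch
```
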
00:17:05.360 | Now, the scores tensor is going to have
00:17:09.200 | a very smooth gradient of values
00:17:11.920 | from zero, completely irrelevant, to one.
00:17:15.320 | Now, if you consider that we've been going
00:17:17.040 | over these scores multiple times,
00:17:19.400 | it means that the object of interest
00:17:21.160 | has kind of like faded out of the window,
00:17:24.560 | like over multiple steps.
00:17:25.960 | So, that means that the similarity score
00:17:27.840 | quite gradually fades out as you go away from the object,
00:17:31.640 | which means that you don't really
00:17:32.640 | get very good localization
00:17:34.040 | if you use these scores directly.
00:17:35.960 | So, what we need to do is actually clip
00:17:39.200 | the lowest scores down to zero.
00:17:42.440 | So, to do that, what we do is calculate
00:17:45.000 | the average of scores across the whole image.
00:17:48.000 | We subtract that average from the current scores.
00:17:51.960 | What that will do is push 50% of the scores below zero,
00:17:56.040 | and then we clip those scores.
00:17:58.120 | So, anything below zero becomes zero,
00:18:00.840 | and we can do this multiple times.
00:18:02.960 | Okay, one time is usually enough,
00:18:04.560 | but you can do it multiple times
00:18:05.760 | to increase that effect of making the edge
00:18:10.320 | of this detected or localized area better defined.
00:18:14.160 | And then after you've done that,
00:18:15.600 | what we need to do is normalize those scores.
00:18:19.160 | Okay, so we might have done that clipping a few times,
00:18:21.720 | so everything's probably going to be
00:18:23.040 | within the range of like zero to 0.5,
00:18:26.400 | or zero to 0.2.
00:18:28.800 | So, then we normalize those scores
00:18:30.160 | to bring them back within the range of zero to one.
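
In code, that clipping and normalization step might look something like this, continuing from the `scores` tensor above.

```python
# Sharpen the relevance map: push everything below the mean down to zero, then
# rescale what is left back into the 0-1 range. Repeat the clamp for a harder edge.
for _ in range(1):
    scores = torch.clamp(scores - scores.mean(), min=0)
scores = (scores - scores.min()) / (scores.max() - scores.min())
```
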
00:18:32.560 | Now, to apply these scores to the patches,
00:18:37.560 | we need to align their tensors,
00:18:39.600 | because right now, they are not aligned.
00:18:42.200 | Okay, for the scores, we have a 20 by 13 tensor,
00:18:47.200 | but for the patches, we have the batch dimension there,
00:18:50.920 | we have the 20 by 13, which we do want,
00:18:53.280 | but then we have the three color channels
00:18:54.800 | and the 256 for each set of pixels
00:18:58.040 | within each patch.
00:18:58.960 | So, we need to adjust that a little bit.
00:19:00.800 | So, we need to first remove the batch dimension.
00:19:03.080 | We do that by squeezing out the zero dimension,
00:19:06.280 | which is our batch dimension.
00:19:08.040 | And then we permute the different dimensions,
00:19:11.360 | essentially just moving them around in our patches
00:19:14.160 | in order to align them better
00:19:15.400 | with the score tensor dimensions.
00:19:18.520 | And then all we do is multiply the patches by those scores.
00:19:22.000 | That's pretty straightforward.
00:19:24.200 | Then we have to permute them again,
00:19:25.600 | because if we want to visualize everything,
00:19:27.400 | it needs to be within a certain shape
00:19:29.960 | in order for us to visualize it in Matplotlib.
00:19:32.320 | So, we come down and first thing we do
00:19:37.200 | is just get Y and X here.
00:19:38.680 | So, Y and X are the number of patches.
00:19:41.880 | See here, this is Y, so the height of the image in patches,
00:19:45.680 | and then 13, which is the width of the image in patches.
00:19:49.440 | And we come down here and we can plot this.
00:19:52.120 | Okay, and we get this pretty nice visual
00:19:55.080 | which localizes the fluffy cat within that image.
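
One way to do that weighting and plotting, continuing from the tensors above; the exact permute order here is a reconstruction, not copied from the notebook.

```python
import matplotlib.pyplot as plt

Y, X = scores.shape
# weight every 256x256 patch by its normalized score
adj_patches = patches.squeeze(0) * scores.view(Y, X, 1, 1)           # (3, Y, X, 256, 256)
# rearrange to (Y, 256, X, 256, 3) and stitch back into one big image
visual = adj_patches.permute(1, 3, 2, 4, 0).reshape(Y * 256, X * 256, 3)

plt.imshow(visual)
plt.axis("off")
plt.show()
```
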
00:19:59.920 | Now, what's really interesting is
00:20:01.480 | if we just search for a cat,
00:20:03.600 | we actually get a slightly different localization,
00:20:06.160 | because here you can see it's kind of focusing a lot
00:20:08.080 | on the fluffy part of the cat.
00:20:10.400 | So, if we just search for a cat,
00:20:12.560 | it would actually focus more on the head.
00:20:14.960 | So, we can really add nuance information to these prompts
00:20:19.960 | and get a pretty nuanced response back.
00:20:24.280 | Now, we can do the same for butterfly.
00:20:27.120 | So, we'll just throw all that code together.
00:20:29.280 | This is just what we've done before.
00:20:30.920 | We initialize scores and runs,
00:20:32.720 | and we go process all of that.
00:20:34.440 | The only thing we change here is the prompt.
00:20:36.720 | We change it to a butterfly.
00:20:38.440 | And if we go down, and we're gonna go down and down,
00:20:41.400 | and visualize that, we get this, okay?
00:20:44.440 | So, again, that's pretty cool.
00:20:47.000 | We can see that it is identifying where in the image
00:20:51.080 | that butterfly actually is.
00:20:52.840 | So, that is the object localization set.
00:20:56.560 | Now, I want to have a look at object detection,
00:20:59.000 | which is essentially just taking the object localization
00:21:01.720 | and wrapping some more code around it
00:21:04.760 | in order to look at these multiple objects
00:21:07.840 | rather than just one.
00:21:09.040 | But to do that, we can't really visualize
00:21:12.280 | in the same way that we've done here.
00:21:14.720 | We're going to need a different type of visualization,
00:21:17.920 | and that's where we have the bounding boxes.
00:21:20.280 | So, let's take a look at how we would do that.
00:21:23.280 | So, using the, I think the butterfly example,
00:21:26.960 | so, the butterfly scores that we just calculated,
00:21:30.600 | we're going to look at where those scores
00:21:32.440 | are higher than 0.5.
00:21:33.760 | Now, you can adjust this threshold
00:21:35.360 | based on what you find works best.
00:21:38.280 | So, we do this, and what we'll get
00:21:41.040 | is an array of true and false values
00:21:43.640 | as to where the score was higher than 0.5 and not.
00:21:47.480 | And then we detect where the non-zero values are
00:21:52.600 | in that array, and what we do
00:21:53.960 | is get a load of X and Y values here.
00:21:57.440 | So, position three, two,
00:21:59.400 | we know that there is a score that is higher than 0.5,
00:22:03.520 | and we get three and two here.
00:22:05.080 | So, three is the row of the non-zero value,
00:22:10.080 | and two is the column of the non-zero value.
00:22:12.920 | So, at row position three and column two,
00:22:17.440 | we know that there is a non-zero value,
00:22:20.000 | or a value or score that's higher than 0.5, our threshold.
00:22:24.720 | And put all that together, we get something
00:22:26.640 | that looks kind of like this.
00:22:28.080 | So, we already, we kind of see that localization visual
00:22:33.080 | that we just created.
00:22:35.400 | And what we want to do is identify the bounding box
00:22:39.720 | that's just kind of surrounding those values, okay?
00:22:43.320 | So, we know in terms of like a coordinate system,
00:22:45.760 | we want one and three and four and 10
00:22:48.320 | to be included within that.
00:22:49.680 | So, what we do is find the corners
00:22:52.680 | from the detection array or set of coordinates
00:22:57.680 | that we got before from NP non-zero.
00:23:02.240 | And what we do is we just take the minimum X and Y values,
00:23:07.120 | and maximum X and Y values,
00:23:08.880 | and that will give us the corners of the box.
00:23:11.400 | And that's pretty simple to calculate.
00:23:15.360 | Now, when we get the maximum value,
00:23:19.040 | what we need to keep in mind is that
00:23:21.520 | the position we're getting for each patch
00:23:23.880 | is essentially identifying the top left corner
00:23:25.920 | of each patch.
00:23:30.000 | So, when we're looking at the maximum value,
00:23:31.720 | we actually want not the start of the patch,
00:23:34.440 | but the end of the patch, okay?
00:23:36.520 | So, that's why we add that plus one here
00:23:39.200 | in order to get that.
00:23:41.160 | And the same for the X max value as well.
00:23:43.640 | So, that gives us the corner coordinates.
00:23:46.720 | And then what we do is multiply those corner coordinates
00:23:50.160 | by the patch size, which is 256 pixels.
00:23:52.760 | And then we have the pixel positions
00:23:55.120 | of each one of those corners.
00:23:58.000 | Because before we had the patch coordinates,
00:24:00.400 | now we have the pixel coordinates,
00:24:02.040 | which we can map directly onto the original image.
00:24:04.720 | So, we can see the minimum values here.
00:24:07.640 | So, we have for X and Y, 256,
00:24:11.040 | and 768.
00:24:13.000 | And what we want to do,
00:24:14.600 | because we're going to be using matplotlib patches,
00:24:17.320 | matplotlib patches expects the top left corner coordinates
00:24:21.640 | and the width and height of the bounding box
00:24:24.480 | that you want to create.
00:24:25.720 | So, we calculate the width and height.
00:24:28.480 | And that's pretty simple.
00:24:29.520 | It's just Y max minus Y min and X max minus X min.
00:24:34.520 | And we get these.
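
Those steps, sketched in code with numpy used for the nonzero lookup; the threshold is the one from the video and the variable names are illustrative.

```python
import numpy as np

patch = 256
threshold = 0.5

detection = (scores > threshold).numpy()   # boolean map of "hot" patches
y_idx, x_idx = np.nonzero(detection)       # rows and columns where the score passed 0.5

# box corners in patch coordinates; +1 so the max edge covers the end of that patch
y_min, y_max = y_idx.min(), y_idx.max() + 1
x_min, x_max = x_idx.min(), x_idx.max() + 1

# convert patch coordinates to pixel coordinates
y_min, y_max, x_min, x_max = [int(v) * patch for v in (y_min, y_max, x_min, x_max)]

width = x_max - x_min
height = y_max - y_min
```
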
00:24:36.240 | And what we can do now is take the image.
00:24:41.040 | We have to reshape it a little bit.
00:24:44.920 | So, we have to move the three color channels dimension
00:24:48.080 | from the zero dimension to the final dimension.
00:24:51.880 | So, we just do that here, move axes.
00:24:54.920 | And now we can plot that image.
00:24:56.960 | Okay, so we show that image with matplotlib.
00:25:00.360 | And then we create the rectangle patch.
00:25:02.280 | This is our bounding box.
00:25:03.960 | Okay, so we pass X min and Y min.
00:25:06.000 | That's the top left corner.
00:25:07.720 | And then we also pass the width and height
00:25:10.600 | of what the bounding box should be.
00:25:13.560 | And if we come down, we get this visual.
00:25:15.840 | Okay, so that's our bounding box visualization.
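
A sketch of the bounding-box plot itself; the line width and colour are arbitrary choices.

```python
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import numpy as np

# move the colour channels to the last axis so matplotlib can display the image
image_arr = np.moveaxis(img.squeeze(0).numpy(), 0, -1)   # (H, W, 3)

fig, ax = plt.subplots(figsize=(X * 0.5, Y * 0.5))
ax.imshow(image_arr)
ax.add_patch(mpatches.Rectangle(
    (x_min, y_min), width, height,          # top-left corner, then width and height
    linewidth=3, edgecolor="#FAFF00", fill=False,
))
plt.show()
```
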
00:25:19.000 | And with that, it's not much further
00:25:22.120 | to create our object detection.
00:25:24.680 | So, let's have a look at how we do that.
00:25:27.400 | Now, the logic for this is pretty much just a loop
00:25:30.520 | over what we've already done.
00:25:31.800 | So, I put together a load of functions here,
00:25:35.400 | which is essentially just what we've already gone through,
00:25:37.640 | getting patches, getting the scores, getting the box.
00:25:43.280 | And then the one thing that is new here
00:25:46.160 | is this detect function.
00:25:48.080 | Okay, so we have detect.
00:25:49.280 | That's gonna get the patches.
00:25:51.320 | So, it's gonna take an image
00:25:52.720 | and it's gonna split it into those patches that we created.
00:25:55.320 | We're gonna convert the image into format
00:25:57.040 | for displaying with matplotlib.
00:25:58.320 | We did that before.
00:25:59.680 | And we also initialize that plot
00:26:02.080 | and add our image to that plot.
00:26:05.320 | And then what we do is we have a for loop.
00:26:08.600 | And this for loop goes through the image localization steps
00:26:12.200 | and bounding box steps that we just went through,
00:26:15.360 | just multiple times.
00:26:17.040 | Okay, so we have multiple prompts
00:26:18.400 | and we want to do multiple times.
00:26:19.960 | So, we calculate our similarity scores
00:26:21.760 | based on a specific prompt for all of our image patches.
00:26:26.600 | From that, we get our scores
00:26:29.160 | in that patch tensor format that we saw before.
00:26:33.080 | And then what we do is we want to get the box
00:26:36.400 | based on a particular threshold.
00:26:37.960 | So, 0.5, like we used before.
00:26:40.360 | You can see it up there.
00:26:41.280 | We have our patch size,
00:26:42.680 | which we just need to pass that
00:26:43.840 | for the calculation of the, or for the conversion.
00:26:48.440 | And we have our patch size,
00:26:49.640 | which we pass to that for the conversion
00:26:51.680 | from patch pixel, from patch coordinates
00:26:56.400 | to pixel coordinates.
00:26:58.120 | And then we also have our scores.
00:26:59.600 | And that will return the minimum X and Y coordinates
00:27:02.320 | and also width and height of the box.
00:27:05.480 | We create the bounding box.
00:27:07.400 | And then we add that to the axis, okay?
00:27:10.120 | So, now let's visualize all of this, see what we get.
00:27:14.280 | So, here I've used a slightly smaller window size
00:27:16.680 | before using six, just to point out that you can change this.
00:27:20.320 | And depending on your image,
00:27:22.280 | it may be better to use a smaller or larger window.
00:27:26.160 | And you can see, so what we're doing here,
00:27:29.960 | we've got a cat and a butterfly.
00:27:32.120 | And you can see that we get, we get a butterfly here
00:27:35.160 | and we get the cat here, okay?
00:27:37.360 | That's pretty cool.
00:27:38.280 | And like I said, with Clip,
00:27:41.560 | we can apply this object detection without fine tuning.
00:27:45.720 | All we need to do is change these prompts here, okay?
00:27:49.640 | So, it's really straightforward to modify this
00:27:54.280 | and move it to a new domain.
00:27:56.440 | Okay, so that's it for this walkthrough
00:27:59.960 | of object localization and object detection with Clip.
00:28:04.800 | As I said, I think zero-shot object localization,
00:28:08.280 | detection, and even classification opens the doors
00:28:12.080 | to a lot of projects and use cases
00:28:14.920 | that were just not accessible before
00:28:17.480 | because of time and capital constraints.
00:28:20.880 | And now we can just use Clip
00:28:22.640 | and get pretty impressive results very quickly.
00:28:26.280 | All it requires is a bit of code changing here and there.
00:28:29.640 | Now, I think Clip is one part of a trend
00:28:32.800 | in multi-modality that is kind of creating
00:28:35.800 | a more accessible ML that is less brittle
00:28:39.960 | than models were in the past,
00:28:41.280 | which required a lot of fine tuning
00:28:42.680 | just to adapt to a slightly different domain,
00:28:45.200 | and that is more generally applicable,
00:28:48.400 | which I think is really exciting.
00:28:50.040 | And it's really cool to see this sort of thing
00:28:53.360 | actually being used, and to use it yourself
00:28:55.200 | and just see how easy it is to apply Clip
00:28:59.360 | to so many different use cases
00:29:01.720 | and have it work incredibly easily.
00:29:05.600 | So that's it for this video.
00:29:08.560 | I hope it has been useful.
00:29:12.160 | So thank you very much for watching
00:29:14.000 | and I will see you again in the next one.
00:29:16.680 | (upbeat music)
00:29:19.280 | (upbeat music fades)
00:29:22.360 | (upbeat music fades)
00:29:25.440 | (upbeat music fades)