Bag of *Visual* Words for Image Classification and Retrieval
Chapters
0:27 The Theory of Bag of Visual Words
9:52 Vector Quantization
15:15 Frequency Vectors
16:14 Cosine Similarity
30:14 Data Set of Images
43:08 Frequency Count
46:43 TF-IDF Formula
48:44 Inverse Document Frequency
51:07 Minimum and Maximum Cosine Similarities
Today we are going to talk about something called Bag of Visual Words. We can use Bag of Visual Words to convert images into vector embeddings, or image embeddings, and then use those embeddings for image classification and image retrieval. This video is split into two parts. In the first part we will look at the theory of Bag of Visual Words and how it works, and in the second half we will look at an implementation and see how we can build something like Bag of Visual Words in Python. So let's get started with understanding what exactly Bag of Visual Words is.
For those of you coming from NLP, you may have heard of Bag of Words before. It's a very common technique in NLP and information retrieval, but it focuses on sentences, on language and text. We can pretty much visualize it like this: we take a sentence, we pull out all of the individual words from that sentence, and we put them into a bag. This bag is unordered, so imagine it is just an unordered list; that is what Bag of Words is. Bag of Visual Words is pretty much the same thing but applied to images, which is quite interesting.

So how does that possibly make sense? How do we extract words, or visual words, from an image? At a very high level it is pretty much what you can see here: we take a load of images and we extract what look like patches from those images. That is not exactly what happens, though; we don't just put those patches into a bag. We actually convert these patches into what are called visual words. These patches we would call visual features, and then we do some processing in the middle here, which you can't see right now but we'll fill in throughout the video, and what we get out of that processing is a load of what we call visual words. As with language, there are a limited number of visual words. There are only so many words in the English language, and there are likewise only so many visual words in our image datasets.
A visual feature, let's say this over here is a visual feature, consists of two things: a key point and a descriptor. Key points are the points in an image where the visual feature is located. They are essentially coordinates, and they do not change, or at least should not change, as an image is rotated, scaled, or otherwise transformed. Descriptors, as you might have guessed, are almost like a description of that particular visual feature. There are multiple algorithms for extracting these features and creating these descriptors; you can use algorithms like SIFT, ORB, or SURF.
We're not going to go into detail on what those algorithms are doing, at least not right now, but understand that they are essentially looking at these images, finding edges, changes in color, and changes in light, identifying those, and running them through a set algorithm to create what we call the descriptor. The most common of these, and the one we are going to use, is called SIFT. SIFT is pretty useful because it is invariant, to an extent, to scale, rotation, translation, illumination, and blur. Of course it's not perfect, but it can deal with all of these things. SIFT first looks through the image and identifies a key point, something it believes is of relevance. As one example, it will probably look up here and find there's not really anything in this part of the image: no difference, just one flat color. So it probably won't extract a feature from here. But when it comes over here it's going to see some edges and some changes in color, and it's going to decide there's something there. So it places a key point there, and then it goes through all of its algorithmic magic and, based on the edges, the colors, and whatever else is in that particular patch, it produces a 128-dimensional vector.
It doesn't stop there; it doesn't just find the first patch. It goes through the whole image and finds a load of patches, especially for this image, where there's a lot of information everywhere: there's a window, there's this edge over here, something interesting over here, all these colors and edges. It finds these key points, these visual features, everywhere throughout the image, and we'll actually see what they look like a little later on. So what we get is a load of these visual features. The order of those visual features doesn't matter; they're all just suspended in the air over here. We can almost imagine there's another bag here, our bag of visual features, before it becomes a bag of visual words. So we're at this point right now: every image just becomes a massive array of these 128-dimensional SIFT vectors. Once we have pulled all of our visual features out of an image, we move on to the next step: how do we go from visual feature to visual word?
Well, we have to use what's called a codebook. A codebook is essentially going to be like our vocabulary. If you come from NLP, you'll recognize that a vocabulary is essentially a big list of all the words in your dataset. This codebook is going to be a big list of all of the visual words from our images, but of course we don't have our visual words yet; we need to create them. So that's what we're going to do next: we're going to build our codebook, which will create all of our visual words at the same time. How do we construct this codebook? The idea is that we take a load of similar visual features, we cluster them together, and we get a centroid out of that cluster. That centroid is what we call a visual word. So if we look at this image, on the left over here these are visual features. We have multiple visual features that all look kind of like a dog's nose from slightly different positions, but they look super similar. We process all of these with k-means clustering and we get a single centroid out of this cluster of points that all look very similar. So we go from five 128-dimensional visual features to one, still 128-dimensional, centroid, or visual word. And we do that using k-means clustering.
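As a toy sketch of that clustering step, here's what collapsing a handful of similar features into a single centroid looks like; the random vectors below are just stand-ins for real SIFT descriptors:

```python
import numpy as np
from scipy.cluster.vq import kmeans

# five similar 128-dimensional "visual features" (stand-ins for SIFT descriptors)
rng = np.random.default_rng(0)
features = rng.normal(loc=0.5, scale=0.05, size=(5, 128)).astype(np.float32)

# cluster them into one centroid: our single "visual word"
centroid, distortion = kmeans(features, 1)
print(centroid.shape)  # (1, 128)
```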
Now, this does not just happen for a single image. We actually feed in all of the images from our, let's call it a training dataset. So we'll actually have loads and loads of these images, not just one. It's a very bad drawing, but okay, we have a dog's nose over here, and maybe we have another image of a dog over here, and he obviously has a nose too. The features from this other dog's nose are going to be very similar, so in reality we don't just have these five visual features creating our single cluster, our dog-nose visual word; we probably have quite a lot. That's the process of creating the codebook: we go through all of the visual features that we create, and we're going to create a lot of them, and we compress them into a set of particular points which become our visual words.

Now, this process is called vector quantization. It's where you take a whole range of vectors and you do something like clustering in order to convert that massive range of vectors into just a select few vectors. Again we can use the parallel of language. There are pretty much an infinite number of sentences out there in the world, but if we perform bag of words, sticking with English-language sentences for the moment, and we extract every single word, put them all in a bag, and remove all duplicates, we're going to end up with far fewer than an infinite number of items. That's quantization, I suppose quantization of language: we're going from an infinite number of possible sentences to a limited number of possible words that make up those sentences.
That's exactly what we're doing here. Taking this one example, we have a church building. Over on the left we have these two visual features, these two dots, and maybe another visual feature over here, and over here. All of those should be very similar: they're like this green window, they have all these similar edges and similar colors. So they probably all end up clustered together into this single visual word, which is a window. Over here we have this bell tower, and if we take different parts of this bell tower they would probably all feed into a single visual word too. One thing I should point out is that this is just one image. Imagine we took another image with other windows inside it, say the window is here. We would take the visual features from that image and they would also be translated into this same visual word over here. So what we get out of this is that rather than our image being represented by all of these 128-dimensional visual features, it is actually represented by these, technically still 128-dimensional, visual words. But because there's a limited number of these visual words, we can number them. So for example this one would be zero, this one we selected here would be one, then two, three, and four. Let's for now pretend this is our entire codebook.
That means this image would now be represented by the number of visual words we found, the frequency of those visual words. At position zero we have a dog's nose; we didn't see any of those in this image, so that would be zero. For number one, we found these two visual features over here, plus another one, and another one here, so we found four of those, and this position becomes four. Again we don't have, I think it's a dog's leg there, so that's zero. In this case we found two of the features we selected here, and then here we have zero again. So what we've just done is gone from representing this image with a very large number of visual features, each 128-dimensional, and six of them in this image (in reality there'd be a lot more), and compressed it down into this five-dimensional sparse vector. Now in reality we're not going to have just five possible visual words. In the later examples we're going to use around 200 visual words to represent just over 9,000 images. That's still a big improvement over what we had here: the massive matrix we had has now been translated into a 200-dimensional vector. I should also say that for the images we process later, for example this image, we won't get just around six visual features; we'll get more like 800. So in reality it'd be something like 800 features of 128 dimensions each, and we would compress that down into a single 200-dimensional vector. That's what we're going to do.
Now let's consider how this is useful to us. We're going to focus on the information retrieval side of things, but it works just as well for image classification. We're going to add another step in a moment, but for now we're creating what are called frequency vectors. This thing you saw here is a frequency vector, and it's basically telling us, okay, using this example, we have two windows in this image. In this image we also have these two crosses on top of a bell tower, whereas in the dog image we have a nose, a leg, and an ear. If you were to plot these in a high-dimensional space, and for the sake of simplicity we'll pretend this is actually a two-dimensional vector, our dog image could be here and our church image could be over here. They're not similar. We'll use cosine similarity, which basically means we're going to measure the angle between vectors. If we had another dog image, then because it contains many of the same visual words it will be plotted in a similar part of the vector space. That means the angle between these two will be a lot smaller compared to the angle between either of those and this one here. Then we calculate cosine similarity between them, and we'll see that the cosine similarity between the more similar images, the ones with similar visual words, is higher.
So we have our sparse vectors, but there's another step we need to take, and I'll return to the language example. In NLP we have something called stop words. Stop words are basically super common words that don't really tell you much about what you're looking at. The word "the", the word "a", the word "and": all of these words appear all over the place in language, but they don't necessarily tell you the topic of whatever you're talking about. In natural language processing these stop words can be overpowering; you can get 200 instances of "the" in an article, and if, for example, you're reading an article about Roman history, you want to be looking for things like the word "Rome", or "Julius Caesar", or the words "Julius" and "Caesar". You don't want to be looking for the word "the"; it doesn't tell you anything. So these stop words are usually either removed from your sentences, or we can use something called TF-IDF, which allows us to weight the relevance of words based on how common they are across different documents. In the example I just mentioned, let's say we have articles about a lot of different things, and one of them happens to be this Roman history article containing the word "Rome". As a user, imagine I search something like "the history of Rome". That's four words, and when we're searching we don't want to give equal relevance to the word "the" as we do to "Rome", and we don't want to give equal relevance to the word "of" as we do to "history". We basically want to ignore "of" and "the" and just focus on "Rome" and "history". TF-IDF lets us do that by looking at the frequency of words across all these different documents and scaling the frequency scores we've just retrieved based on how often those words appear throughout all of your other documents.

The same is possible with visual words. We have all of our images, which are the documents, and we have all of the visual words in those images. We use TF-IDF to look at the visual words found in each frequency vector, and we use that to lower the score of visual words that appear all the time and increase the score of visual words that are very rare and very unique to a particular image. So in the church image you see here, in a lot of cases SIFT isn't going to pick up patch number one, but let's say it did in this example: it's picked up a patch of sky, and a patch of sky is probably pretty common across a lot of pictures. We would want to bring the score of that sky patch down, whereas for patch number two we want the score to be increased, or at least maintained.
So this is how we do that; this is a typical TF-IDF formula. TF-IDF is usually described in terms of documents and terms (terms are just words), so the notation is maybe a bit odd to translate over to images, but we'll do it anyway. t refers to "term" in typical TF-IDF lingo, but we are using it to represent a visual word. In this case our documents are images, so d is an image. Then we have tf(t, d), which refers to the term frequency: the frequency of a particular visual word in a particular image. So, say we take visual word one over here; tf of visual word one in image zero would be equal to two, because there are two of these patches, or four if we want to include the other ones. N, which you can see over here, is the total number of images. df_t stands for the document frequency of the term: essentially, how many documents, or how many images, does our particular visual word appear in. We divide N by df_t and take the log of that; this part is the inverse document frequency. The other part is the term frequency. We multiply the two together, and that gives us the TF-IDF score.
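Written out, the formula being described is:

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log\left(\frac{N}{\text{df}_t}\right)$$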
One thing that's useful to understand about the IDF part: the number of images N does not change based on which image you're looking at, and the document frequency for a particular term also does not change depending on which image you're looking at. So this IDF vector is actually the same across our entire dataset; it is not going to change. The only thing that changes is the term frequency. Using this, we can go from what we have here, where all of these visual words are on the same level with no real variation, to something more like this, where maybe we've decided that a nose is a pretty good indicator of something relevant, whereas this leg thing here is not a very good indicator of anything. So the scores have been adjusted in accordance with their relevance.
So we've just been through how to create our visual words, how to create a sparse vector representation of our images based on those visual words, and then how to scale the values in those sparse vectors based on the relevance of each visual word the values represent. After we've done all of that, we're ready to move on to the next step. Maybe you want to feed these vector representations into a classifier model to perform some classification, but what we're going to go through is how to actually retrieve similar images based on these vectors. Now, there are different distance or similarity metrics we can use when comparing vectors: there's dot product similarity, there's Euclidean distance, but we're going to focus on cosine similarity. This is really useful here because the magnitudes of the TF-IDF vectors we've output can vary significantly. By using cosine similarity we don't need to normalize anything; we can just take these vectors and compare them immediately, which is really useful. Cosine similarity is calculated like this:
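$$\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$

Here a and b are the two vectors being compared, the numerator is their dot product, and the denominator is the product of their magnitudes.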
Let's draw another one of these 2D graphs. Say we have vector a here and vector b over here. The dot product on the top is going to increase as these two vectors become more similar, but by itself the dot product works poorly when we have other vectors of different magnitudes in here. So say we had all these different vectors; we'll call this one c and this one d. To me it seems like b and c are probably more similar than c and d, but because d and c have greater magnitudes, the dot product, which considers both angle and magnitude, is going to produce a higher score. That's even true if, say, we had b and another item here: even though these are basically the same vector, they would score lower than d and c purely because d and c have larger magnitudes. So what we do is add this term on the bottom, which looks at the magnitudes of a and b together, and we divide by that. We're essentially normalizing everything in this calculation, and that's how we calculate the cosine here. We're actually just calculating the cosine of the angle between the vectors, and that gives us our cosine similarity.
Let's have a look at another example. If we had two exactly matching vectors, the cosine similarity between them would be equal to one. If we had two vectors at a 90-degree angle, so vector a is perpendicular to vector b, we would get a value of zero. Now, because we're using TF-IDF, all of our vector values are positive, so in reality all of our vectors have to sit within this space here, between the positive y-axis and the positive x-axis. So this vector a down here would not happen; I'm just using it for the sake of showing that we can have a cosine similarity of zero. In this image here I'm just pointing out the same thing. Where we have two images that are exactly the same, appearing in the middle here, the cosine similarity is going to be one. And here, this image has maybe a building in the background and some trees, so it's kind of halfway in between the church image and the dog image. Because the dog image has some trees and the church image is a building, this image shares some similarities with both of them, so it would probably sit somewhere in the middle of that similarity scale, somewhere around here. And then images like the church and the dog, where there's not really any similarity, have a lower, closer-to-zero cosine similarity.
So that is the theory behind Bag of Visual Words and how we actually retrieve similar images using it. Like I said, there are two parts to this video; now on to the second part, where we'll have a look at the implementation of all of this in Python. If you want to follow along, there will be a link to a Colab notebook with all of this in the description, or if you're reading this on Pinecone, in the resources at the bottom of the article.
The first thing we need to consider is a dataset of images, and we're going to keep this light and simple. We download a Hugging Face dataset, this one, which they describe as a French ImageNet. I'm not sure what makes it French, but that's how they've described it. It's almost 10,000 pretty high-resolution images that we can download immediately; this is all we do, and we can see straight away that we have that church image. We can see all the images just by printing them like this.
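As a rough sketch of that download step; the dataset id and config name here are assumptions (the "French ImageNet" appears to be frgfm/imagenette on the Hugging Face hub, whose train split has 9,469 images, matching the numbers used later):

```python
from datasets import load_dataset

# ~9.5k high-resolution images across 10 classes
dataset = load_dataset("frgfm/imagenette", "full_size", split="train")

print(dataset[0]["image"])  # each item holds a PIL image object
```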
The first thing we want to do is convert these images, which are currently PIL image objects, into numpy arrays. We go through the entire dataset and create a list of numpy-array versions of our PIL objects. Some of these images are grayscale and some are color; we're going to drop the color and keep everything grayscale. We do that with the OpenCV library: we go through and convert each image to grayscale if it's an RGB image, and if it's already grayscale we don't need to transform it. How do we tell whether an image is grayscale or not? The shape of a grayscale image is different, so let me show you. We can see here that a grayscale image is a two-dimensional array, whereas if we look at the color images we still have in the training set, each is a three-dimensional array, because it also includes the red, green, and blue color channels. So that's how we tell: when the length of the image shape is two, we know it's grayscale; if it's greater than two, we know it's a color image.
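A sketch of both steps together; the variable names (images, images_bw) are assumptions:

```python
import numpy as np
import cv2

# convert the PIL objects into numpy arrays
images = [np.array(item["image"]) for item in dataset]

# an RGB array has shape (H, W, 3) while grayscale is just (H, W),
# so the length of the shape tells us whether to convert
images_bw = []
for img in images:
    if len(img.shape) > 2:
        img = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
    images_bw.append(img)
```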
So we can see these are the new grayscale images. Now that our images are ready and pre-processed, the next step is to go through the process of eventually creating our sparse vectors. We start by creating the visual features, then we move on to visual words, then we create the frequency vectors, and then we create the TF-IDF vectors. So, number one: visual features. We do that like so, and it's actually pretty simple. We're using SIFT from OpenCV, and here we're just initializing the lists where we're going to store all of our key points and descriptors. Remember we have both: the key point points to where the visual feature is, and the descriptor is a description of the visual feature, a 128-dimensional vector. We initialize the SIFT extractor, and then for every image in our black-and-white images we go through, extract the key points and descriptors, and add them to our big lists of key points and descriptors.
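A minimal sketch of that extraction loop, using the images_bw list assumed above:

```python
import cv2

# initialize the SIFT extractor once
extractor = cv2.SIFT_create()

keypoints, descriptors = [], []
for img in images_bw:
    # kp is a list of cv2.KeyPoint; des is a (num_features, 128) array, or None
    kp, des = extractor.detectAndCompute(img, None)
    keypoints.append(kp)
    descriptors.append(des)
```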
Notice that we're doing this image by image, so let me show you what that looks like for, say, image 100. Looking at the shape of its descriptors, we see that image 100 has 436 descriptors, and each of those descriptors has dimensionality 128, because it's a SIFT descriptor. Now, this won't happen with this sample, but be aware that for some images, imagine you have a picture of the sky and there's nothing really there, just the flat color of the sky, the SIFT algorithm might return nothing, because there are no key points for it to look at, no changes, just a flat image. In that case the image's SIFT descriptors will just be None, so we have to be careful in the next step and avoid those. That's what we're doing here: if there are no descriptors for a particular image, we just drop that image. For this sample it's not really a problem, but I'm letting you know because with other datasets you might run into it. So all we're doing is dropping any images that don't have any descriptors; as I said, not an issue for this sample.
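A sketch of that filtering step, keeping the image, keypoint, and descriptor lists aligned:

```python
# drop any images where SIFT found nothing (des is None)
keep = [i for i, des in enumerate(descriptors) if des is not None]

images_bw = [images_bw[i] for i in keep]
keypoints = [keypoints[i] for i in keep]
descriptors = [descriptors[i] for i in keep]
```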
Now let's have a look at what these key points and descriptors actually look like. These here are the key points: in the middle of each of these circles we have a descriptor, meaning the SIFT algorithm has identified a visual feature in each one of these circles. The center of each circle is the position of the key point, the line you can see shows the direction, or orientation, of the feature that was detected, and the size of the circle essentially shows the scale of the feature that was detected.
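That visualization can be reproduced with OpenCV's drawKeypoints; the DRAW_RICH_KEYPOINTS flag is what draws the circles scaled to feature size along with the orientation lines (a sketch, drawing the first image):

```python
import matplotlib.pyplot as plt

vis = cv2.drawKeypoints(
    images_bw[0], keypoints[0], None,
    flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS,
)
plt.imshow(vis)
plt.show()
```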
So now we have all of our visual features. Remember our list of steps: visual features, visual words, frequency vectors, TF-IDF vectors. We're now on to the visual words part. First, all we're doing is generating 1,000 random indices, which we'll use to select 1,000 random images; we've added a random seed so you can reproduce what we do here. The reason we take a sample in the first place is just to speed up the k-means training; you can skip this if you want and train on everything, it will just take longer. We extract the descriptors and key points for our training sample and put all the descriptors together. When training k-means we don't care which image a descriptor came from; all we care about is that the descriptors are there. So we take all of our descriptors and put them into a single numpy array. We can check the shape: it's about 1.1 million descriptors, which is quite a lot for fewer than 10,000 images. Here I'm also showing the number of descriptors contained in each image: image zero has 544 descriptors, the next has quite a lot more, the next quite a lot fewer, so each image has a different number of features in it, which is normal. Now you have to set the number of visual words you would like from your images; it's like the vocab size, in NLP lingo. For this I just set k equal to 200, so we're going to have 200 visual words. We don't have that many images; if you have more images you should probably increase this, particularly if they're very varied. If they're not varied, if they're all kind of the same thing, you can probably use fewer visual words. And then we perform k-means, for just one iteration, and that outputs our codebook, which is essentially just a big list of all of our cluster centroids.
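A sketch of the sampling and clustering with scipy's kmeans; the exact seed and variable names are assumptions:

```python
import numpy as np
from scipy.cluster.vq import kmeans

# sample 1000 random images to keep k-means training fast
np.random.seed(0)
sample_idx = np.random.randint(0, len(descriptors), 1000)

# k-means doesn't care which image a descriptor came from,
# so stack the sampled descriptors into one (n, 128) array
training_descs = np.concatenate([descriptors[i] for i in sample_idx])

k = 200  # number of visual words, i.e. our "vocab size"
codebook, variance = kmeans(training_descs, k, iter=1)
print(codebook.shape)  # (200, 128): one centroid per visual word
```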
Now, maybe you're creating your codebook on one machine and you're going to be using it later, over and over again, so I figured it's probably worth showing you how to save and load that codebook. It's pretty easy. We use joblib, and we just dump k, which is the number of visual words we have (you don't necessarily even need to include that; you could save the codebook alone), and we save it into a pickle file. Then when you're loading it at a later point, we just use joblib load. There's nothing else to it.
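A sketch; the filename is an assumption:

```python
import joblib

# persist the codebook so it can be reused without retraining
joblib.dump((k, codebook), "codebook.pkl")

# ...and in a later session, load it back
k, codebook = joblib.load("codebook.pkl")
```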
At this point we have constructed our codebook: we've created our visual features and our visual words, and they're all stored in this codebook. Now we're going to take the visual features from our images, process them through the visual words in our codebook, and use that to create our sparse frequency vectors. That's what we're doing here: we go through each set of image descriptors in descriptors, and we use the vector quantization function from scipy to extract the visual words from them. What that's going to do is not extract the actual vectors of the visual words; it's going to extract the IDs of the visual words. So we get all of these ID values, each of which represents one of our visual words. It's like a token in NLP: there you have word tokens, and in bag of visual words you have visual word tokens. So in image zero we know that the first five of our visual words have these tokens, number 84, 22, and so on, and in total there are 397 visual words in image zero. We can take a look at the centroid, the vector of a visual word, by indexing into the codebook: we go to visual word ID 84 and check the shape, which is 128. That's our centroid position.
Now we need to do a frequency count: based on the visual word IDs, how many of each visual word are in each image? We know the length of our codebook, which should be 200, so each frequency vector we create is going to have dimensionality 200, one position for every possible visual word in our codebook. We loop through all of our images' visual words, initialize the frequency vector for each image, and then loop through each word found in the image's visual words; for each of those visual word index values, we add one at that visual word's position in the image's frequency vector. After we've done all that, we have a single frequency vector for every image, each of dimensionality 200, and at the end we just reformat that big list of numpy vectors into a single numpy array. We can see that we have 9,469 of these frequency vectors, each of dimensionality 200, aligning to our codebook.
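A sketch of that loop, using scipy's vq to assign each descriptor to its nearest centroid:

```python
import numpy as np
from scipy.cluster.vq import vq

frequency_vectors = []
for img_descriptors in descriptors:
    # visual_words[i] is the id (0..k-1) of the centroid
    # closest to descriptor i in this image
    visual_words, distance = vq(img_descriptors, codebook)
    # count how often each visual word appears in this image
    img_frequency_vector = np.zeros(k)
    for word in visual_words:
        img_frequency_vector[word] += 1
    frequency_vectors.append(img_frequency_vector)

frequency_vectors = np.stack(frequency_vectors)
print(frequency_vectors.shape)  # (9469, 200)
```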
From above we saw the visual word tokens in image zero, and you can see we have 1172 twice, because that visual word appears in the image more than once. So let's go down here and have a look at each of those (actually, I don't know why I included the count here, I just copied and pasted it across; we could remove it, but I'll leave it). If we look at those positions, we'd expect all of them to have a value of at least one, because we've already seen those words are there, and for 1172 we know there are at least two. And we can see that here: looking at each of those positions in frequency vector zero, there are numbers there. Let's have a look at more of image zero's frequency vector. We can increase this and show everything: we have these 200 values, each of which is the frequency of one visual word in this image, and we can represent that in a big histogram like this.
Now we want to move on to the TF-IDF part. We have all of these frequency counts in a big vector, but some of those visual words are irrelevant; they're the visual equivalent of the word "the" or the word "a" or "and". We don't want those in there; they're not important and don't tell us anything. So we use TF-IDF to reduce the score of those items, to reduce their dominance in our frequency vectors. We're going to use the TF-IDF formula you can see here. First, N we already know: it's the number of images, i.e. the size of the dataset. Then df is the number of images that a visual word actually appears in. What we're doing here is looking at all of the frequency vectors and saying: if, within a single vector, the value is greater than zero, i.e. one or more, i.e. that visual word appears in that particular image, we assign that position a one. Then we take the sum across all of those frequency vectors along the zero axis. So, for example, say visual word zero has frequency three in image zero, frequency two in image one, and frequency zero in image three. If we perform this operation across just those three frequency vectors, the value for that first column would be equal to two, because it is non-zero in two of the three images, or frequency vectors. We can see what's happening here with the shapes: we have 9,000ish frequency vectors of dimensionality 200, and we're compressing those 9,000 into a single, let's call it a df vector. The inverse document frequency, or I suppose we could call it the inverse image frequency, is then just the log of N over df; nothing special there. Then to calculate TF-IDF we also need the term frequency part of TF-IDF, and we actually already have it: the frequency vectors we've been building are the term frequencies. So all we do is multiply the frequency vectors by the IDF. The IDF, by the way, as we can see here, is just a single vector for our entire dataset; the only part that changes is the frequency vectors.
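A sketch of those steps in numpy:

```python
import numpy as np

# N: the number of images in the dataset
N = frequency_vectors.shape[0]

# df: in how many images does each visual word appear at least once?
df = np.sum(frequency_vectors > 0, axis=0)  # shape (200,)

# idf is a single vector shared by the whole dataset
idf = np.log(N / df)

# tf is just the frequency vectors we already built
tfidf = frequency_vectors * idf  # shape (N, 200)
```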
Now if we plot that, this is the same image as before, and if we scroll up you'll see that we previously had all of these fairly high values; now there's just this one really high value here. Maybe that's a visual word for the cross on top of the church; that's probably a distinguishing feature if we're looking for religious buildings or churches or something along those lines. The rest of the visual words get pushed down, because most visual words are probably not going to be that relevant. So now we've generated our TF-IDF vectors, and we have them all stored in tfidf.
I'll just point out that we have just over 9,000 of those, and we can begin searching. To search, we're going to use cosine similarity, and we're going to start with this image of a dog. We calculate cosine similarity like this; we're going to search through the entire search space for b, so in this case b is not a single vector, and we calculate the cosine similarity between a and all of the items at once. One of those items is going to be the exact match to a, because the query, search_i, which is 6595, is still within b here, so at the top we're going to get that same image returned to us first.
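A sketch of the search; top_k and the variable names are assumptions, and 6595 is the dog image's index from the video:

```python
import numpy as np

top_k = 5
search_i = 6595          # index of the query image
a = tfidf[search_i]      # query vector
b = tfidf                # the whole dataset, query included

# cosine similarity between the query and every image at once
cosine_similarity = np.dot(b, a) / (
    np.linalg.norm(b, axis=1) * np.linalg.norm(a)
)

# sort descending and take the top_k most similar images
idx = np.argsort(-cosine_similarity)[:top_k]
```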
The minimum and maximum cosine similarities are zero to pretty much one; we're using floating-point numbers, so it's not actually exactly one, but it is basically one. I'm going to go quite quickly through this: we use numpy's argsort to get the most similar items. The first image should obviously correspond to the query image itself, with a cosine similarity of pretty much one, and that's what we get here. So we'll just display a few of those. Using this as our search term, this is the first result, with cosine similarity equal to one. Going down, we get another dog, pretty similar looking, then we get this one, so it's not always perfect, and then we get more of these very similar-looking dogs. That looks pretty good.
Let's test a few more images. We have this one here, an old record player. First we get the same record player back, and I don't think this record player actually appears again in this dataset, but I thought it was interesting that we get this printer, I assume it's a printer, which looks very similar: it has the same sort of top piece, and despite the background being very different, and this being a blurry picture, it manages to pull through an object that to me looks pretty similar. The rest of the results aren't really relevant; maybe this one is like a box, so it's kind of similar. Next, we search with this guy holding a fish. Obviously we get the exact image back first, then we get the same guy with the fish in a slightly different position, with his mouth open a little, and then to the side another guy with a fish. And then here I thought this was kind of interesting: there's no guy with a fish, but if you look at the background of these images, it's a leafy, forest-like place, and obviously here there's a leafy, forest-like place too, so it kind of makes sense that it would pull this image through. For this one, I suppose there's kind of a forest in the background too, but otherwise I don't know why it's being pulled through. This next one I thought was interesting because it didn't perform well,
or at least it didn't return what I would expect it to return. We have these two parachutists, and then we get golf balls and a dog. But if you look at this parachute here, it has this really thick black line, and the rest of the parachute is mostly white, and on this golf ball you have a very similar black line, so I assume that for these golf balls it has actually picked up on that; they're two very similar features. We have another one here, another golf ball. Obviously we get the same one returned again, and then another golf ball, the same brand as well, and then we return this truck. I thought maybe the reason it's returning this truck is that it has this white ridged pattern on it, which the golf balls also have. I wasn't too sure, but anyway, that one performed pretty well, I thought.
Finally, we have this garbage truck. We obviously get the garbage truck back, and then we return this image, and initially I thought, okay, that's not a very good first result, and then we get an actual garbage image here. But if you actually look at this image, in the background there does seem to be a garbage truck; it's either a garbage truck or some big industrial bin, I'm not really sure. I thought it was kind of cool that it picked up on this thing in the background; I missed it the first time I looked at this as well. So that's cool. Those are all the results; we went through them quickly to see how it did, and you can probably optimize this to be much better than what I have here; I used a very small dataset, for example. But we've done something pretty cool, I think. We used k-means, so it has learned a little bit there in terms of the visual words, but otherwise it's very simple. And for me, seeing this sort of performance without some big neural network or anything, I think it's pretty cool. So I hope this has been useful. Thank you very much for watching, and I will see you in the next one.