
Bag of *Visual* Words for Image Classification and Retrieval


Chapters

0:27 The Theory of Bag of Visual Words
9:52 Vector Quantization
15:15 Frequency Vectors
16:14 Cosine Similarity
30:14 Data Set of Images
43:08 Frequency Count
46:43 TF-IDF Formula
48:44 Inverse Document Frequency
51:07 Minimum and Maximum Cosine Similarities

Transcript

Today we are going to talk about something called Bag of Visual Words. Now Bag of Visual Words, we can use it for essentially converting images into vector embeddings or image embeddings and using those image embeddings in image classification and image retrieval. Now this video is going to be split into two parts.

First part we are going to look at the theory of Bag of Visual Words and how it works and then in the second half of the video we will actually look at an implementation of it and see how we can implement something like Bag of Visual Words in Python.

So let's get started with trying to understand what exactly Bag of Visual Words is. Now for those of you that are coming from NLP, you may have heard of Bag of Words before. It's a very common technique used in NLP and information retrieval, but it focuses on sentences, it focuses on language and text.

So we can pretty much visualize it as this. So we take a sentence and we pull out all of the individual words from that sentence and then we put them into a bag. This bag is unordered so imagine it's just like an unordered list and that's what Bag of Words is.

Bag of Visual Words is pretty much the same thing but applied to images, which is quite interesting. So how does that possibly make sense? How do we extract words, or visual words, from an image? Well, at a very high level it's pretty much what you can see here: we take a load of images and we extract what are almost like patches from those images, although this is not exactly what happens.

We don't just put those patches into a bag, we actually convert these patches into what are called visual words. So these patches here we would call visual features, and then we do some pre-processing in the middle here. You can't see it right now, but we'll fill all that in throughout the video, and what we get out of that processing is a load of what we call visual words.

Now, as with language, there are a limited number of visual words. There are only so many words in the English language, and likewise there are only so many visual words in our image data sets. Now a visual feature, let's say this over here is a visual feature, consists of two things.

It consists of a key point and a descriptor. Key points are the points in an image that the visual feature is located at, so the coordinates, and they do not change, or they should not change, as an image is rotated, scaled or transformed in some way. And descriptors, as you might have guessed, are almost like a description of that particular visual feature.

There are multiple algorithms for extracting these features and creating these descriptors. You can use algorithms like SIFT, ORB or SURF. Now we're not going to go into detail on what those algorithms are doing, at least not right now, but just understand that they are essentially looking at these images, finding edges, changes in color, changes in light, identifying those, and running them through a set algorithm to create what we call the descriptor.

Now the most common of these and the one that we are going to use is called SIFT. Now SIFT is pretty useful because it's invariant to scale, rotation, translation, illumination and blur to an extent. Of course it's not perfect but it can deal with all of these things. Now SIFT first looks through the image and it identifies a key point.

So something that it believes is of relevance. Just as one example, it will probably look up here, and there's not really anything in that part of the image: there's no difference, it's just one flat color, so it probably won't extract a feature from there. But when it comes over here it's going to see some edges, it's going to see some changes in color, and it's going to think, okay, there's something there.

So it puts a key point there, and then it's going to go through all of its algorithmic magic, and based on the edges, the colors, and whatever else is in that particular patch, it is going to produce a 128-dimensional vector. Now it doesn't stop there, it doesn't just find the first patch; it goes through the whole image and finds a load of patches, especially for this image.

There's a lot of information everywhere, it's going to be like okay look there's a window, there's this edge over here, something interesting over here, there's all these colors and edges. It's going to find these key points, these visual features everywhere throughout this image and we'll actually see what these look like a little bit later on.
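
To make this concrete, here is a minimal sketch of extracting SIFT features from a single image with OpenCV. The file name is just a placeholder, and this is only illustrative of the idea described above:

```python
import cv2

# load one image as grayscale (the file name here is just a placeholder)
img = cv2.imread("church.jpg", cv2.IMREAD_GRAYSCALE)

# initialize SIFT and extract keypoints plus their 128-d descriptors
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)

print(len(keypoints))     # number of visual features found in the image
print(descriptors.shape)  # (num_keypoints, 128): one 128-d vector per feature

# draw the keypoints to see where the features sit on the image
vis = cv2.drawKeypoints(img, keypoints, None,
                        flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
```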

So what we get is a load of these visual features, the order of those visual features doesn't matter, they're all just kind of placed or they're suspended in the air over here. We can almost imagine there's like another bag here which is our bag of visual features before it becomes a bag of visual words.

So we're kind of at this point right now, where every image essentially just becomes a massive array of these 128-dimensional SIFT vectors. Now once we have pulled out all of our visual features from an image, we move on to the next step, which is how do we go from visual feature to visual word.

Well we have to use what's called a codebook. So a codebook is essentially, it's going to be like our vocabulary. If you come from NLP again, you'll recognize vocabulary is essentially like a big list of all the words in your data set. This codebook is going to be a big list of all of the visual words from our images but of course we don't have our visual words yet, we need to create them.

So that's what we're going to do next, we're going to build our codebook which is basically going to create all of our visual words at the same time. So how do we construct this codebook? Well the idea is that we take a load of similar visual features and we actually cluster them together and we get a centroid out of that cluster.

Now that centroid is what we would call a visual word. So we look at this image here right, on the left over here, these are visual features right. We have multiple visual features, there's all these things that look kind of like a dog's nose from a slightly different position but you know they pretty much look like super similar.

Okay and what we do is we process all of these, we use k-means clustering and we get a single centroid out of this cluster of points that all look very similar. Okay so we go from what is this, this is 5, 128 dimensional visual features to 1, still 128 dimensional centroid or visual word.

Okay and we do that using k-means clustering. Now this does not just happen for a single image. We actually feed in all of our images from, let's call it a training data set. So we'll actually have loads and loads of these images, not just one. It's a very, very bad drawing, but okay, we have a dog's nose over here, and maybe we have another dog over here, another image of a dog, and he has a nose too, obviously, maybe.

So these centroids here are also going to be very similar to the centroids of this other dog's nose over here. So in reality we don't just have these five visual features that create our single cluster over here, our dog-nose visual word. We actually probably have quite a lot.

So that's the process of creating the codebook: we basically go through all of the visual features that we create, and we're going to create a lot of them, and we compress them into a set of particular points which become our visual words. Now this process is called vector quantization.

It's where you take a whole range of vectors and you do something like clustering in order to convert this massive range of vectors into just a select few vectors. So again we can use the example, or the parallel, of language. There are pretty much an infinite number of sentences out there in the world, but let's stick with English-language sentences for the moment and perform Bag of Words on them.

If we extract every single word and we put them all in a bag and we remove all duplicates we're going to end up with far less than an infinite number of items there. That's quantization. I suppose quantization of language. We're going from an infinite number of possible sentences to a limited number of possible words that make up those sentences.
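
As a minimal sketch of what vector quantization looks like in code, here we compress five made-up, near-identical 128-dimensional descriptors into a single centroid, our "visual word", using SciPy's k-means. The data here is random and purely illustrative:

```python
import numpy as np
from scipy.cluster.vq import kmeans

# five made-up 128-d descriptors that are tiny variations of the same thing,
# standing in for five slightly different views of a dog's nose
rng = np.random.default_rng(0)
base = rng.random(128)
features = np.stack(
    [base + rng.normal(scale=0.01, size=128) for _ in range(5)]
).astype(np.float32)

# quantize the five features down to one centroid — that centroid is one "visual word"
codebook, distortion = kmeans(features, 1)
print(codebook.shape)  # (1, 128)
```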

That's exactly what we're doing here. So just taking this one example we have like a church building. Over on the left we have these two visual features. These two dots are visual features. Maybe we can also have another visual feature over here, over here. All of those should kind of be very similar.

They're like this green window. They have all these similar edges and similar colors. So they probably all end up being clustered together into this single visual word which is a window. This one over here, we have this kind of like bell tower thing and we just take different parts of this bell tower and they would probably all feed into a single visual word.

Now one thing I should point out here is this is just one image. Imagine we took another image with other windows inside it, and let's say the window is here. We would take the visual features from that and they would also be translated into this over here.

So what we get out of that is rather than our image being represented by all of these 128 dimensional visual features, they are actually represented by these technically 128 dimensional visual words. But because there's a limited number of these visual words we can number them. So for example this one would be zero.

This one we selected here would be one, then two, three, and four. Let's for now pretend this is our entire codebook. That means that this image here would now be represented by the number of visual words that we found, so the frequency of these visual words. So at position zero we have a dog's nose.

We didn't see any of those in this image, so that would be zero. For this number one here, we found these two visual features over here, another one here, and another one here, so we found four of these, and this would become four. Again, we don't have the next one, which I think is a dog's leg, so that's zero.

For this one we found two of these features, so the one we selected here becomes two, and then here we have zero again. So what we've done just there is we've gone from representing this image with a very large number of visual features, 128-dimensional, and there were six of them in this image.

In reality there'd be a lot more. Okay so quite a few values there and then we've kind of compressed it down into this five dimensional sparse vector. Now in reality we're not going to just have five possible visual words. In the later examples we're going to use around 200 visual words to represent just over 9000 images.

Okay, but that's still a big improvement over what we were doing here. So this massive matrix that we had has now been translated into a 200-dimensional vector. And I should also say that when we actually process the images later, for example this image, we won't just get around six visual features, we'll get more like 800 visual features.

So in reality it would be something like this: 128 dimensions, and then we would compress that down into a single 200-dimensional vector. Okay, that's what we're going to do. Now, how is this useful to us? We're going to focus on the information retrieval side of things, but it works just as well for image classification. We're going to add another step in a moment, but for now we're creating what are called frequency vectors.

So this thing that you saw here is a frequency vector, and that frequency vector is basically telling us how many of each visual word we have. Just to use this example, I think it's two or something like that: we have two windows in this image.

In this image we also have these two crosses on top of a bell tower, whereas in this image of a dog we have a nose, a leg, an ear, right? So if we were to plot these in a high-dimensional space, and just for the sake of simplicity we'll pretend this is actually a two-dimensional vector,

our dog image could be here, and our church image could be over here. Now, they're not similar. We'll use cosine similarity, which basically means we're going to measure the angle between vectors. So maybe we had another dog image, and because it has many of the same visual words it will be plotted in a similar place in that vector space.

So that means that the angle between these two here will be a lot smaller when you compare it to the angle between either of those and this one here. Then we're going to calculate the cosine similarity between them, and we'll see that the cosine similarity between our more similar images, the ones with similar visual words, is higher.
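
To make that concrete, here's a minimal NumPy sketch; the frequency vectors are made-up toy values, just to show that images sharing visual words score higher:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # dot product of the two vectors, normalized by their magnitudes
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# toy 5-d frequency vectors: the two "dog" images share visual words, the "church" image doesn't
dog_a = np.array([2, 0, 1, 3, 0], dtype=float)
dog_b = np.array([3, 0, 0, 2, 1], dtype=float)
church = np.array([0, 4, 0, 0, 2], dtype=float)

print(cosine_similarity(dog_a, dog_b))   # higher — many shared visual words
print(cosine_similarity(dog_a, church))  # lower — few shared visual words
```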

Okay, so we have our sparse vectors, but there's another step that we need to take, and I'll return to the language example. In NLP we have something called stop words. Now, stop words are basically super common words that don't really tell you much about what it is you're looking at.

So the word "the", or the word "a", or the word "and": all of these words appear in language all over the place, but they don't necessarily tell you what the topic of whatever you're talking about is. And in natural language processing these stop words can be overpowering: you can get 200 instances of "the" in an article, and if, for example, you are reading an article about Roman history, you want to be looking for things like the word "Rome" or the words "Julius" and "Caesar"; you don't want to be looking for the word "the", it doesn't tell you anything.

So these stop words are usually either removed from your sentences, or we can use something called TF-IDF, which allows us to look at the relevance of words based on how common they are across different documents. So in the example I just mentioned, let's say we have articles about a lot of different things, and one of those happens to be this Roman history article, and it has the word "Rome" in it.

So as a user, when I'm searching, imagine I search something like "the history of Rome". We have four words there, and when we're searching we don't want to give equal relevance to the word "the" as we do to "Rome", and we don't want to give equal relevance to the word "of" as we do to "history".

We basically want to just ignore "of", ignore "the", and focus on "Rome" and "history". TF-IDF allows us to do that by looking at the frequency of words across all of these different documents, and scaling the frequency scores that we've just retrieved based on the frequency of those words throughout all of your other documents. And the same is possible with visual words.

So we have all of our images, which are the documents, and we have all of the visual words in all of those images. We use TF-IDF to look at the visual words found in each frequency vector, and we use that to essentially lower the score of visual words that appear all the time and increase the score of visual words that are very rare and very unique to a particular image.

So in the church image that you see here, in a lot of cases SIFT isn't going to pick up patch number one, but let's say it did in this example, so it's picked up a patch of sky. A patch of sky is probably pretty common in a lot of pictures.

So we would want to bring the score of that patch of sky down, we'd want to lower it, whereas for this patch number two we want the score to be increased, or at least maintained. So this is how we do that: this is a typical TF-IDF formula. Now, TF-IDF is typically thought of in terms of documents and terms, which are just words, so the notation is maybe a bit weird to translate over to images here, but we'll do it anyway.

T refers to "term" in typical TF-IDF lingo, but we are using it to represent a visual word. D usually means "document"; in this case our documents are images, so it's an image. And then we have tf(t, d), which refers to the term frequency, so the frequency of a particular visual word in a particular image.

So, the term frequency of a visual word in an image. Let's say we took visual word one over here: the tf of visual word one in this image, we'll call it image zero, would be equal to two, because there are two of these patches, or if we want to include the other ones it would actually be four.

N, that you can see over here, is the total number of images. df_t stands for the document frequency of the term, so essentially how many documents, or how many images, our particular visual word appears in. We take N over df_t and we take the log of that.
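
Written out, the formula being described here is the standard TF-IDF weighting, using the symbols just introduced:

$$\text{tfidf}(t, d) = \text{tf}(t, d) \times \log\left(\frac{N}{\text{df}(t)}\right)$$

where t is a visual word, d is an image, tf(t, d) is how many times t appears in d, N is the total number of images, and df(t) is the number of images that contain t at least once.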

So this log part here is the inverse document frequency, and this other part here is the term frequency; we multiply those together and that gives us the TF-IDF.

Now the IDF part, if you look at it, and this is useful to understand: the number of images does not change based on which image you're looking at, and the document frequency for a particular term also does not change depending on which image you're looking at. So this IDF vector is actually the same across our entire data set; it is not going to change. The only thing that changes is the tf part, which is the term frequency.

So using this we can go from what we have here, where all of these visual words are kind of on the same level and there's not really any variation, to something more like this. Maybe we've decided that a nose is a pretty good indicator of something relevant, whereas this leg thing here is not a very good indicator of anything, so they've been adjusted in accordance with their relevance.

Okay, so we've just been through how to create our visual words, how to create a sparse vector representation of our images based on those visual words, and then how to scale the values in those sparse vectors based on the relevance of each visual word that the values represent. After we've done all of that, we're ready to move on to the next step: maybe you want to feed all of these vector representations into a classifier model in order to perform some classification, or, as we're going to go through here, we can actually retrieve similar images based on these vectors.

Now, there are different distance or similarity metrics that we can use when comparing vectors. You have dot product similarity, you have Euclidean distance, but we're going to focus on using cosine similarity. This is really useful because the magnitudes of the TF-IDF vectors that we've output here can vary significantly, so by using cosine similarity we don't need to normalize anything; we can just take these vectors and compare them immediately, which is really useful.

So cosine similarity is calculated like this. Let's say I draw another one of these 2D graphs: we have vector a here and we have vector b over here. The dot product on the top here is going to increase as these two vectors become more similar, but by itself this dot product value works poorly when we have other vectors of different magnitude in here. So if we had all these different vectors, let's say we had one over here that we'll call c, and I'm going to call this one d. To me it seems like b and c are probably more similar than c and d, but because d and c have a greater magnitude, the dot product, which considers both angle and magnitude, is going to produce a higher score. And that's even true if, let's say, we had b and another item here: even though these are basically the same vector, they would score lower than d and c purely because d and c have a larger magnitude. So what we do is we add this term here, which looks at the magnitudes of a and b together, and we divide by that. So we're essentially normalizing everything in this calculation, and that's how we calculate the cosine here: we're really just calculating the cosine of the angle between the two vectors.
That gives us our cosine similarity.

So let's have a look at another example. Here, if we had two exactly matching vectors, the cosine similarity between those would be equal to one. If we had two vectors at a 90-degree angle, so vector a is perpendicular to vector b, we would return a value of zero. Now, because we're using TF-IDF, all of our vector values are positive, so in reality all of our vectors would have to be placed within this space here, within the positive part of the y-axis and the positive part of the x-axis. So this vector a down here would not actually happen; I'm just using it for the sake of showing you that we can have a cosine similarity of zero.

Now in this image here I'm just trying to point out the same thing again. Where we have two images that are exactly the same, which appear in the middle here, the cosine similarity for those is going to be one. And this image here, where there's maybe a building in the background and some trees, is kind of halfway in between the church image and the dog image: because the dog image has some trees and the church image is a building, both of them share some similarities with this image, so it would probably sit somewhere in the middle of that similarity scale, somewhere around here. And then images like the church and the dog, where there's not really any similarity, have a lower, closer-to-zero cosine similarity.

Okay, so that is the theory behind Bag of Visual Words and how we actually retrieve similar images using Bag of Visual Words. Now, like I said, there are two parts to this video, so now onto the second part, where we're going to have a look at the implementation of all of this in Python. If you want to follow along, there will be a link to a Colab notebook with all of this in the description, or, if you're reading this on Pinecone, in the resources at the bottom of the article.

So the first thing we need to consider is that we need a data set of images, and we're going to keep this super quick, light and simple. First we download the Hugging Face data set we're using, this one, which they describe as a French ImageNet. I'm not sure what makes it French, it's just how they have described it, but it's almost 10,000 pretty high-resolution images that we can download immediately. This is all we do, and we can see straight away that we have that church image, and we can see all the images just by printing them like this.

So the first thing we want to do is translate these images, which are PIL image objects at the moment, into NumPy arrays. To do that we just go through the entire data set and we create a list of items which is going to be a list of NumPy array versions of our PIL objects. Now, some of these images are grayscale and some of them are color; we're going to drop the color from these images and just keep grayscale. We can do that using the OpenCV library: we just go through and convert all of these images to grayscale if they are RGB images, and if they're already grayscale we don't need to transform them.
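
A minimal sketch of the preprocessing just described; the Hugging Face dataset identifier, config and field names here are assumptions based on the description, so swap in whatever the linked notebook actually uses:

```python
import cv2
import numpy as np
from datasets import load_dataset

# download the image data set from Hugging Face — the dataset ID and config
# here are assumptions; use the one given in the linked notebook
data = load_dataset("frgfm/imagenette", "full_size", split="train")

# convert the PIL image objects into NumPy arrays
images = [np.array(item["image"]) for item in data]

# convert any RGB images to grayscale; a grayscale array has 2 dims, an RGB array has 3
bw_images = []
for img in images:
    if img.ndim == 3:
        img = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
    bw_images.append(img)

print(images[0].shape)     # e.g. (height, width, 3) for an RGB image
print(bw_images[0].shape)  # (height, width) once converted to grayscale
```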
Now, how do we tell if they're grayscale or not? Well, the shape of a grayscale image is different, so let me show you. We see here that we have a two-dimensional array, whereas if we have a look at the original color images that we still have from beforehand, it's a three-dimensional array, because you also have the red, green and blue color channels included in that array. So that's how we can tell: when the length of the image shape is two we know it's grayscale, whereas if it's greater than two we know it's an RGB image. And we can see, okay, these are the new grayscale images.

Now that we have our images and they're ready and pre-processed, we need to go through the process of eventually creating our sparse vectors. We start by creating the visual features, then we move on to visual words, then we create the frequency vectors, and then we create the TF-IDF vectors.

So, number one, visual features. We do that like so, and it's actually pretty simple. We are using SIFT from OpenCV, and what we're doing here is just initializing lists where we're going to store all of our key points and descriptors. Remember we have both: the key point points to where the visual feature is, and the descriptor is a description of that visual feature, a 128-dimensional vector. We initialize this extractor, the SIFT extractor, and then for every image in our black and white images we go through, extract the key points and descriptors, and add those key points and descriptors to our big lists of key points and descriptors here.

Now, you notice that we're doing this image by image, so let me show you what we get. Let's have a look at the descriptors for, say, image 100, and look at the shape. For image 100 you see that we have 436 descriptors, and each of those descriptors is of dimensionality 128 because it's a SIFT descriptor.

Now, this won't happen with this sample, but just be aware that for some images, for example a picture of the sky where there's not really anything there, just the flat color of the sky, the SIFT algorithm might return nothing, because there are no key points for it to look at, no change, just a flat image. In that case the SIFT descriptors will just be None, so we have to be careful in the next step and avoid those. That's what we're doing here: if there are no descriptors in a particular image's descriptor list, we just drop that image. For this sample it's not really a problem, but I'm just letting you know, because with other data sets you might run into this.
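
A sketch of that extraction step, assuming the bw_images list from the previous sketch:

```python
import cv2

# initialize the SIFT feature extractor
extractor = cv2.SIFT_create()

keypoints = []
descriptors = []

for img in bw_images:
    # detect keypoints and compute a 128-d descriptor for each one
    img_keypoints, img_descriptors = extractor.detectAndCompute(img, None)
    keypoints.append(img_keypoints)
    descriptors.append(img_descriptors)

# drop any images where SIFT found nothing (their descriptors come back as None)
keep = [i for i, d in enumerate(descriptors) if d is not None]
bw_images = [bw_images[i] for i in keep]
keypoints = [keypoints[i] for i in keep]
descriptors = [descriptors[i] for i in keep]

print(descriptors[100].shape)  # e.g. (436, 128): 436 features, each a 128-d vector
```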
Now let's have a look at what these key points and descriptors actually look like. These circles here are the key points: in the middle of each circle the SIFT algorithm has identified a visual feature. The center of each circle is the position of the key point, the line that you can see is the direction of the feature that was detected, and the size of the circle is essentially the scale of the feature that was detected.

So with that we have all of our visual features. Remember, in that list we had visual features, visual words, frequency vectors and TF-IDF vectors; now we're on to the visual words part. First, all we're doing is creating 1,000 random indices, which we're going to use to select 1,000 random images, and we've added a random seed in here so you can reproduce what we do. The reason that we are taking a sample in the first place is just to speed up the k-means training part; you can leave this out if you want and just train on everything, it will just take longer.

So we extract the descriptors and the key points for our training sample and we put all of the descriptors together. When we're training k-means we don't care which image a descriptor is in, all we care about is that the descriptors are there, so we take all of our descriptors and put them into a single NumPy array. We can check the shape: we have about 1.1 million of these descriptors, which is quite a lot for less than 10,000 images; I think it's fairly large. Here I'm just showing you the number of descriptors contained in each image: image zero has 544 descriptors, the next ones have more or fewer, so each image has a different number of items in it, and that's normal.

Now you have to set the number of visual words that you would like from your images, which is like the vocab size in NLP lingo. I just set k equal to 200, so we're going to have 200 visual words. We don't have that many images; if you have more images you should probably increase this, particularly if they're very varied in what they are. If they are not varied, if they're all kind of the same thing, you can probably use fewer visual words. And so we perform k-means here, we just do one iteration, and that outputs our codebook, which is essentially just a big list of all of our clusters.

Now, maybe you're creating your codebook on one machine and then you're going to be using it later on, over and over again, so I figured it's probably pretty important to show you how to actually save and load that codebook. It's pretty easy anyway: we use joblib and we just dump k, which is the number of visual words that we have (you don't even necessarily need to put that in there, you could just save the codebook alone), and we save it into this pickle file. Then when loading it at a later point we just use joblib load; there's nothing else to it. At this point we have constructed our codebook: we've created our visual features, we've created our visual words, and they're all just stored in this codebook.
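
A sketch of building and saving the codebook along the lines just described; the sample size, k, and file name follow the video, but treat the exact calls as illustrative:

```python
import numpy as np
import joblib
from scipy.cluster.vq import kmeans

# sample 1,000 random images to keep the k-means step quick; the seed makes it reproducible
np.random.seed(0)
sample_idx = np.random.randint(0, len(descriptors), 1000).tolist()

# k-means doesn't care which image a descriptor came from, so stack the sample's
# descriptors into one big (num_descriptors, 128) array
all_descriptors = np.vstack([descriptors[i] for i in sample_idx]).astype(np.float32)
print(all_descriptors.shape)

# k is the number of visual words, i.e. the size of our "vocabulary"
k = 200
codebook, distortion = kmeans(all_descriptors, k, 1)  # a single iteration, as in the video

# save the codebook for later reuse, then load it back, with joblib
joblib.dump((k, codebook), "codebook.pkl")
k, codebook = joblib.load("codebook.pkl")
```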
Now what we're going to do is take the visual features from our images, process them through the visual words that we have in our codebook, and use that to create our sparse frequency vectors. So that's what we are doing here: we go through each image's set of descriptors and we use the vector quantization function, vq, from SciPy to extract the visual words from them. What that's going to do is not extract the actual vectors of the visual words; it's going to extract the IDs of the visual words. So we get all of these ID values, each of which represents one of our visual words. It's like a token: in NLP you have word tokens, in Bag of Visual Words you have visual word tokens, and that's what we have here. So in image zero we know that the first five of our visual words have these tokens, number 84, 22, and so on, and in total there are 397 visual words in image zero.

Now, we can take a look at the centroid, so the vector of a visual word, by looking at the codebook: we go to visual word ID, or index, 84 and look at the shape, which is 128. That's our centroid position.

Next we need to do a frequency count: based on the visual word IDs, how many of each visual word are in each image? We know that the length of our codebook should be 200, so what we're going to do here is create a frequency vector for each image, each of dimensionality 200, because that's all the possible visual words within our codebook. We loop through all of our images' visual words, initialize the frequency vector for each image, then loop through each word found in the image's visual words, and for each one of those visual word index values we add a one in the position of that visual word in the image's frequency vector. After we've done all of that we have a single frequency vector for every single image, each of dimensionality 200, and at the end here we're just reformatting that big list of NumPy vectors into a single NumPy array. At the end we can see that we have 9,469 of these frequency vectors, each of dimensionality 200, aligning to our codebook.

Now, from above we saw that we had these visual word tokens in, I think, image zero, and you can see we have 172 twice here, because that visual word appears in the image more than once. So let's go down here and have a look at each one of those. (Actually, I don't know why I needed the count here, I just copy and pasted it across, so we could even remove that if we want, but I'll just leave it.) If we have a look at those, we would expect all of these values to be at least one, because we've already seen that they are there, and for 172 we know there are at least two of them in there. And we can see that here: we look at frequency vector zero, we look at each one of these positions, and we can see that there are numbers there.

Let's have a look at more of image zero's frequency vector. We can increase this and show everything, so we have these 200 values, each of which is the frequency of that visual word in the image, and we can represent that in a big histogram like this.
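
A sketch of mapping descriptors to visual word IDs and building the frequency vectors, assuming the descriptors list and codebook from the previous sketches:

```python
import numpy as np
from scipy.cluster.vq import vq

k = len(codebook)  # 200 visual words in the codebook

frequency_vectors = []
for img_descriptors in descriptors:
    # map each 128-d descriptor in this image to the ID of its nearest visual word
    img_visual_words, distance = vq(img_descriptors, codebook)
    # count how many times each visual word ID appears in this image
    img_frequency_vector = np.zeros(k)
    for word_id in img_visual_words:
        img_frequency_vector[word_id] += 1
    frequency_vectors.append(img_frequency_vector)

# stack the per-image vectors into one (num_images, 200) array
frequency_vectors = np.stack(frequency_vectors)
print(frequency_vectors.shape)
```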
Now we want to move on to the TF-IDF part. We have all of these frequency counts in a big vector, but some of those visual words are irrelevant; they're the vision equivalent of the word "the", or the word "a", or the word "and". We don't want those in there, they're not important, they don't tell us anything, so we use TF-IDF to reduce the score of those items, to reduce their dominance in our frequency vectors.

So we're going to use the TF-IDF formula that you can see here. First I'm going to calculate N, which we already know is the number of images, i.e. the size of the data set, and then df, which is the number of images that a visual word actually appears in. Basically, what we're doing here is looking at all of the frequency vectors and saying: if the value in a single vector is greater than zero, i.e. one or more, i.e. that visual word appears in that particular image, we assign that position a one. Then we take the sum across all of those frequency vectors along the zero axis. So let's say for visual word zero the frequency is three in image zero, two in image one, and zero in image two: if we perform this operation across just those three frequency vectors, the value for that first column would be equal to two, because the visual word is non-zero in only two of the three images, or frequency vectors. So we can see what's happening here: along the visual word axis the shape is still going to be the same, we have the 200-dimensional frequency vectors, 9,000 or so of those, and we're compressing those 9,000 down into a single, I don't know what we'd call it, df vector.

Now let's have a look at this: the inverse document frequency, or I suppose we could call it the inverse image frequency, is just the log of N over df, nothing special there. Then to calculate TF-IDF we also need the term frequency part, and that we actually already have: the frequency vectors that we've been building are the term frequencies. So that's all we do: the frequency vectors multiplied by IDF. IDF, by the way, as we can see here, is just a single vector for our entire data set; the only part that changes is the frequency vectors. And we get this.

Now if we plot that (this is the same image as before), you can see that if we go up we had all of these kind of high values, and now we come here and there's just this one really high value. Maybe that's the visual word of the cross on top of the church, which is probably a distinguishing feature if we're looking for religious buildings or churches or something along those lines, whereas the rest of the visual words get pushed down, because most visual words are probably not going to be that relevant.

Now we've generated our TF-IDF vectors, and we have them all stored in tfidf here; I'll just point out that we have just over 9,000 of those, and we can begin searching. To search we're going to use cosine similarity, and we're going to search with this image of a dog to start. Cosine similarity we calculate like this. Now, we're going to search through the entire search space for b, so in this case b is not going to be a single vector; we're going to calculate the cosine similarity for all of the items when compared to a. One of those items is going to be the exact match to a, because a, our search image i (I think it's index 6595), is still within b here, so at the very top we're going to get that image returned to us first.
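
A sketch of the TF-IDF step and the cosine-similarity search that follows, assuming the frequency_vectors array from the previous sketch; the query index 6595 is the one mentioned in the video, and any index works:

```python
import numpy as np

N = frequency_vectors.shape[0]  # total number of images

# df: for each visual word, in how many images does it appear at least once?
df = np.sum(frequency_vectors > 0, axis=0)

# idf is a single vector shared by the whole data set
idf = np.log(N / df)

# tf-idf: the term frequencies (our frequency vectors) scaled by idf
tfidf = frequency_vectors * idf

# search: cosine similarity between one query image and every image in the data set
i = 6595          # query index mentioned in the video
a = tfidf[i]      # the query vector
b = tfidf         # the whole matrix, compared against in one go

cosine_similarity = np.dot(b, a) / (np.linalg.norm(b, axis=1) * np.linalg.norm(a))

# the top results; the first should be the query image itself, with a score of ~1.0
top_k = np.argsort(-cosine_similarity)[:5]
print(top_k, cosine_similarity[top_k])
```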
So the minimum and maximum cosine similarities are zero to pretty much one; we're using floating point numbers so it's not exactly one, but it is basically one. I'm going to go quite quickly through this: we use NumPy argsort to get the most similar items, and the first image should obviously correspond to the query image itself, so we'd expect the cosine similarity to be pretty much one, and that's what we get here.

So we're just going to display a few of those. Using this as our search term, this is the first result, with cosine similarity equal to one; go down and we get another, pretty similar-looking dog; we get this one, you know, it's not always perfect; and then we get more of these very similar-looking dogs. That looks pretty good, so we're going to test a few more images.

We have this one here, like some old record player. First we get the same record player back, and I don't think this record player actually appears again in this data set, but I thought it was interesting, because we then get this printer, I assume it's a printer, that looks very similar: it has this top bit on top, and despite the background being very different, and this one being just a blurry picture, it manages to pull through this object that to me looks pretty similar. The rest of them are nothing of relevance, although maybe this one is like a box, so it's kind of similar.

For this one we search this guy with a fish. Obviously we get the exact image back first, and then we get the same guy with the fish in a slightly different position, like he's opened his mouth a little bit, and as you go to the side we get another guy with a fish. And then here I thought this was kind of interesting: there's no guy with a fish, but if you look at the background of these images, it's some leafy, foresty place, and obviously here there's a leafy, foresty place too, so it kind of makes sense that it would pull through this image as well. For this one, I suppose there's kind of like a forest in the background, but otherwise I don't know why this one is being pulled through.

This one I thought was interesting because it didn't perform well, or at least it didn't return what I would expect it to return. We have these two parachutists, and then we get golf balls and a dog. But if you look at this parachute here, it has this really thick black line, and the rest of the parachute is probably white, and here on this golf ball you have a very similar black line, so I assume for these golf balls it has actually picked up on that; they are two very similar features. We have another one here, another golf ball: we obviously get the same one back first, and then another golf ball, the same brand as well, and then we return this truck. I thought maybe the reason it's returning this truck is that it has this white ridge pattern on it, which the golf balls also have; I wasn't too sure, but anyway, that one performed pretty well I thought.

For this one we have a garbage truck, and we obviously return the same garbage truck, and then we return this image.
Initially I thought, okay, that's not a very good first result back, but then if you actually look at this image and you look in the background, there actually does seem to be a garbage truck there; it's either a garbage truck or some big industrial bin, I'm not really sure. But I thought that was kind of cool that it picked up on this thing in the background, which I missed the first time I looked at it as well, so that's cool.

So those are all the results; we sort of went through them quickly to have a look at how it did, and you can probably optimize this to be much better than what I have here, as I used a very small data set, for example. But we've done something pretty cool, I suppose: we used k-means, so it has learned a little bit there in terms of visual words, but otherwise it's very simple. And for me, to see this sort of performance without some big neural network or anything, I think is pretty cool. So I hope this has been useful. Thank you very much for watching, and I will see you again in the next one. Bye.