Bag of *Visual* Words for Image Classification and Retrieval
Chapters
0:27 The Theory of Bag of Visual Words
9:52 Vector Quantization
15:15 Frequency Vectors
16:14 Cosine Similarity
30:14 Data Set of Images
43:08 Frequency Count
46:43 TF-IDF Formula
48:44 Inverse Document Frequency
51:07 Minimum and Maximum Cosine Similarities
Today we are going to talk about something called Bag of Visual Words. We can use Bag of Visual Words to convert images into vector embeddings, or image embeddings, and then use those embeddings for image classification and image retrieval. This video is split into two parts. In the first part we will look at the theory of Bag of Visual Words and how it works, and in the second half we will look at an implementation and see how we can build something like Bag of Visual Words in Python. So let's get started with understanding what exactly Bag of Visual Words is.
For those of you coming from NLP, you may have heard of Bag of Words before. It's a very common technique in NLP and information retrieval, but it focuses on sentences, on language and text. We can pretty much visualize it like this: we take a sentence, we pull out all of the individual words from that sentence, and we put them into a bag. This bag is unordered, so imagine it is just an unordered list; that is what Bag of Words is. Bag of Visual Words is pretty much the same thing but applied to images, which is quite interesting.

So how does that possibly make sense? How do we extract words, or visual words, from an image? At a very high level it is pretty much what you can see here: we take a load of images and we extract what look like patches from those images. That is not exactly what happens, though; we don't just put those patches into a bag. We actually convert these patches into what are called visual words. These patches we would call visual features, and then we do some processing in the middle here, which you can't see right now but we'll fill in throughout the video, and what we get out of that processing is a load of what we call visual words. As with language, there are a limited number of visual words. There are only so many words in the English language, and there are likewise only so many visual words in our image datasets.
A visual feature, let's say this over here is a visual feature, consists of two things: a key point and a descriptor. Key points are the points in an image where the visual feature is located. They are essentially coordinates, and they do not change, or at least should not change, as an image is rotated, scaled, or otherwise transformed. Descriptors, as you might have guessed, are almost like a description of that particular visual feature. There are multiple algorithms for extracting these features and creating these descriptors; you can use algorithms like SIFT, ORB, or SURF.
We're not going to go into detail on what those algorithms are doing, at least not right now, but understand that they are essentially looking at these images, finding edges, changes in color, and changes in light, identifying those, and running them through a set algorithm to create what we call the descriptor. The most common of these, and the one we are going to use, is called SIFT. SIFT is pretty useful because it is invariant, to an extent, to scale, rotation, translation, illumination, and blur. Of course it's not perfect, but it can deal with all of these things. SIFT first looks through the image and identifies a key point, something it believes is of relevance. As one example, it will probably look up here and find there's not really anything in this part of the image: no difference, just one flat color. So it probably won't extract a feature from here. But when it comes over here it's going to see some edges and some changes in color, and it's going to decide there's something there. So it places a key point there, and then it goes through all of its algorithmic magic and, based on the edges, the colors, and whatever else is in that particular patch, it produces a 128-dimensional vector.
It doesn't stop there; it doesn't just find the first patch. It goes through the whole image and finds a load of patches, especially for this image, where there's a lot of information everywhere: there's a window, there's this edge over here, something interesting over here, all these colors and edges. It finds these key points, these visual features, everywhere throughout the image, and we'll actually see what they look like a little later on. So what we get is a load of these visual features. The order of those visual features doesn't matter; they're all just suspended in the air over here. We can almost imagine there's another bag here, our bag of visual features, before it becomes a bag of visual words. So we're at this point right now: every image just becomes a massive array of these 128-dimensional SIFT vectors. Once we have pulled all of our visual features out of an image, we move on to the next step: how do we go from visual feature to visual word?
Well, we have to use what's called a codebook. A codebook is essentially going to be like our vocabulary. If you come from NLP, you'll recognize that a vocabulary is essentially a big list of all the words in your dataset. This codebook is going to be a big list of all of the visual words from our images, but of course we don't have our visual words yet; we need to create them. So that's what we're going to do next: we're going to build our codebook, which will create all of our visual words at the same time. How do we construct this codebook? The idea is that we take a load of similar visual features, we cluster them together, and we get a centroid out of that cluster. That centroid is what we call a visual word. So if we look at this image, on the left over here these are visual features. We have multiple visual features that all look kind of like a dog's nose from slightly different positions, but they look super similar. We process all of these with k-means clustering and we get a single centroid out of this cluster of points that all look very similar. So we go from five 128-dimensional visual features to one, still 128-dimensional, centroid, or visual word. And we do that using k-means clustering.
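As a toy sketch of that clustering step, here's what collapsing a handful of similar features into a single centroid looks like; the random vectors below are just stand-ins for real SIFT descriptors:

```python
import numpy as np
from scipy.cluster.vq import kmeans

# five similar 128-dimensional "visual features" (stand-ins for SIFT descriptors)
rng = np.random.default_rng(0)
features = rng.normal(loc=0.5, scale=0.05, size=(5, 128)).astype(np.float32)

# cluster them into one centroid: our single "visual word"
centroid, distortion = kmeans(features, 1)
print(centroid.shape)  # (1, 128)
```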
Now, this does not just happen for a single image. We actually feed in all of the images from our, let's call it a training dataset. So we'll actually have loads and loads of these images, not just one. It's a very bad drawing, but okay, we have a dog's nose over here, and maybe we have another image of a dog over here, and he obviously has a nose too. The features from this other dog's nose are going to be very similar, so in reality we don't just have these five visual features creating our single cluster, our dog-nose visual word; we probably have quite a lot. That's the process of creating the codebook: we go through all of the visual features that we create, and we're going to create a lot of them, and we compress them into a set of particular points which become our visual words.

Now, this process is called vector quantization. It's where you take a whole range of vectors and you do something like clustering in order to convert that massive range of vectors into just a select few vectors. Again we can use the parallel of language. There are pretty much an infinite number of sentences out there in the world, but if we perform bag of words, sticking with English-language sentences for the moment, and we extract every single word, put them all in a bag, and remove all duplicates, we're going to end up with far fewer than an infinite number of items. That's quantization, I suppose quantization of language: we're going from an infinite number of possible sentences to a limited number of possible words that make up those sentences.
That's exactly what we're doing here. Taking this one example, we have a church building. Over on the left we have these two visual features, these two dots, and maybe another visual feature over here, and over here. All of those should be very similar: they're like this green window, they have all these similar edges and similar colors. So they probably all end up clustered together into this single visual word, which is a window. Over here we have this bell tower, and if we take different parts of this bell tower they would probably all feed into a single visual word too. One thing I should point out is that this is just one image. Imagine we took another image with other windows inside it, say the window is here. We would take the visual features from that image and they would also be translated into this same visual word over here. So what we get out of this is that rather than our image being represented by all of these 128-dimensional visual features, it is actually represented by these, technically still 128-dimensional, visual words. But because there's a limited number of these visual words, we can number them. So for example this one would be zero, this one we selected here would be one, then two, three, and four. Let's for now pretend this is our entire codebook.
That means this image would now be represented by the number of visual words we found, the frequency of those visual words. At position zero we have a dog's nose; we didn't see any of those in this image, so that would be zero. For number one, we found these two visual features over here, plus another one, and another one here, so we found four of those, and this position becomes four. Again we don't have, I think it's a dog's leg there, so that's zero. In this case we found two of the features we selected here, and then here we have zero again. So what we've just done is gone from representing this image with a very large number of visual features, each 128-dimensional, and six of them in this image (in reality there'd be a lot more), and compressed it down into this five-dimensional sparse vector. Now in reality we're not going to have just five possible visual words. In the later examples we're going to use around 200 visual words to represent just over 9,000 images. That's still a big improvement over what we had here: the massive matrix we had has now been translated into a 200-dimensional vector. I should also say that for the images we process later, for example this image, we won't get just around six visual features; we'll get more like 800. So in reality it'd be something like 800 features of 128 dimensions each, and we would compress that down into a single 200-dimensional vector. That's what we're going to do.
Now let's consider how this is useful to us. We're going to focus on the information retrieval side of things, but it works just as well for image classification. We're going to add another step in a moment, but for now we're creating what are called frequency vectors. This thing you saw here is a frequency vector, and it's basically telling us, okay, using this example, we have two windows in this image. In this image we also have these two crosses on top of a bell tower, whereas in the dog image we have a nose, a leg, and an ear. If you were to plot these in a high-dimensional space, and for the sake of simplicity we'll pretend this is actually a two-dimensional vector, our dog image could be here and our church image could be over here. They're not similar. We'll use cosine similarity, which basically means we're going to measure the angle between vectors. If we had another dog image, then because it contains many of the same visual words it will be plotted in a similar part of the vector space. That means the angle between these two will be a lot smaller compared to the angle between either of those and this one here. Then we calculate cosine similarity between them, and we'll see that the cosine similarity between the more similar images, the ones with similar visual words, is higher.
So we have our sparse vectors, but there's another step we need to take, and I'll return to the language example. In NLP we have something called stop words. Stop words are basically super common words that don't really tell you much about what you're looking at. The word "the", the word "a", the word "and": all of these words appear all over the place in language, but they don't necessarily tell you the topic of whatever you're talking about. In natural language processing these stop words can be overpowering; you can get 200 instances of "the" in an article, and if, for example, you're reading an article about Roman history, you want to be looking for things like the word "Rome", or "Julius Caesar", or the words "Julius" and "Caesar". You don't want to be looking for the word "the"; it doesn't tell you anything. So these stop words are usually either removed from your sentences, or we can use something called TF-IDF, which allows us to weight the relevance of words based on how common they are across different documents. In the example I just mentioned, let's say we have articles about a lot of different things, and one of them happens to be this Roman history article containing the word "Rome". As a user, imagine I search something like "the history of Rome". That's four words, and when we're searching we don't want to give equal relevance to the word "the" as we do to "Rome", and we don't want to give equal relevance to the word "of" as we do to "history". We basically want to ignore "of" and "the" and just focus on "Rome" and "history". TF-IDF lets us do that by looking at the frequency of words across all these different documents and scaling the frequency scores we've just retrieved based on how often those words appear throughout all of your other documents.

The same is possible with visual words. We have all of our images, which are the documents, and we have all of the visual words in those images. We use TF-IDF to look at the visual words found in each frequency vector, and we use that to lower the score of visual words that appear all the time and increase the score of visual words that are very rare and very unique to a particular image. So in the church image you see here, in a lot of cases SIFT isn't going to pick up patch number one, but let's say it did in this example: it's picked up a patch of sky, and a patch of sky is probably pretty common across a lot of pictures. We would want to bring the score of that sky patch down, whereas for patch number two we want the score to be increased, or at least maintained.
So this is how we do that; this is a typical TF-IDF formula. TF-IDF is usually described in terms of documents and terms (terms are just words), so the notation is maybe a bit odd to translate over to images, but we'll do it anyway. t refers to "term" in typical TF-IDF lingo, but we are using it to represent a visual word. In this case our documents are images, so d is an image. Then we have tf(t, d), which refers to the term frequency: the frequency of a particular visual word in a particular image. So, say we take visual word one over here; tf of visual word one in image zero would be equal to two, because there are two of these patches, or four if we want to include the other ones. N, which you can see over here, is the total number of images. df_t stands for the document frequency of the term: essentially, how many documents, or how many images, does our particular visual word appear in. We divide N by df_t and take the log of that; this part is the inverse document frequency. The other part is the term frequency. We multiply the two together, and that gives us the TF-IDF score.
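Written out, the formula being described is:

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log\left(\frac{N}{\text{df}_t}\right)$$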
One thing that's useful to understand about the IDF part: the number of images N does not change based on which image you're looking at, and the document frequency for a particular term also does not change depending on which image you're looking at. So this IDF vector is actually the same across our entire dataset; it is not going to change. The only thing that changes is the term frequency. Using this, we can go from what we have here, where all of these visual words are on the same level with no real variation, to something more like this, where maybe we've decided that a nose is a pretty good indicator of something relevant, whereas this leg thing here is not a very good indicator of anything. So the scores have been adjusted in accordance with their relevance.
So we've just been through how to create our visual words, how to create a sparse vector representation of our images based on those visual words, and then how to scale the values in those sparse vectors based on the relevance of each visual word the values represent. After we've done all of that, we're ready to move on to the next step. Maybe you want to feed these vector representations into a classifier model to perform some classification, but what we're going to go through is how to actually retrieve similar images based on these vectors. Now, there are different distance or similarity metrics we can use when comparing vectors: there's dot product similarity, there's Euclidean distance, but we're going to focus on cosine similarity. This is really useful here because the magnitudes of the TF-IDF vectors we've output can vary significantly. By using cosine similarity we don't need to normalize anything; we can just take these vectors and compare them immediately, which is really useful. Cosine similarity is calculated like this:
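$$\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$

Here a and b are the two vectors being compared, the numerator is their dot product, and the denominator is the product of their magnitudes.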
Let's draw another one of these 2D graphs. Say we have vector a here and vector b over here. The dot product on the top is going to increase as these two vectors become more similar, but by itself the dot product works poorly when we have other vectors of different magnitudes in here. So say we had all these different vectors; we'll call this one c and this one d. To me it seems like b and c are probably more similar than c and d, but because d and c have greater magnitudes, the dot product, which considers both angle and magnitude, is going to produce a higher score. That's even true if, say, we had b and another item here: even though these are basically the same vector, they would score lower than d and c purely because d and c have larger magnitudes. So what we do is add this term on the bottom, which looks at the magnitudes of a and b together, and we divide by that. We're essentially normalizing everything in this calculation, and that's how we calculate the cosine here. We're actually just calculating the cosine of the angle between the vectors, and that gives us our cosine similarity.
Let's have a look at another example. If we had two exactly matching vectors, the cosine similarity between them would be equal to one. If we had two vectors at a 90-degree angle, so vector a is perpendicular to vector b, we would get a value of zero. Now, because we're using TF-IDF, all of our vector values are positive, so in reality all of our vectors have to sit within this space here, between the positive y-axis and the positive x-axis. So this vector a down here would not happen; I'm just using it for the sake of showing that we can have a cosine similarity of zero. In this image here I'm just pointing out the same thing. Where we have two images that are exactly the same, appearing in the middle here, the cosine similarity is going to be one. And here, this image has maybe a building in the background and some trees, so it's kind of halfway in between the church image and the dog image. Because the dog image has some trees and the church image is a building, this image shares some similarities with both of them, so it would probably sit somewhere in the middle of that similarity scale, somewhere around here. And then images like the church and the dog, where there's not really any similarity, have a lower, closer-to-zero cosine similarity.
So that is the theory behind Bag of Visual Words and how we actually retrieve similar images using it. Like I said, there are two parts to this video; now on to the second part, where we'll have a look at the implementation of all of this in Python. If you want to follow along, there will be a link to a Colab notebook with all of this in the description, or if you're reading this on Pinecone, in the resources at the bottom of the article.
The first thing we need to consider is a dataset of images, and we're going to keep this light and simple. We download a Hugging Face dataset, this one, which they describe as a French ImageNet. I'm not sure what makes it French, but that's how they've described it. It's almost 10,000 pretty high-resolution images that we can download immediately; this is all we do, and we can see straight away that we have that church image. We can see all the images just by printing them like this.
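As a rough sketch of that download step; the dataset id and config name here are assumptions (the "French ImageNet" appears to be frgfm/imagenette on the Hugging Face hub, whose train split has 9,469 images, matching the numbers used later):

```python
from datasets import load_dataset

# ~9.5k high-resolution images across 10 classes
dataset = load_dataset("frgfm/imagenette", "full_size", split="train")

print(dataset[0]["image"])  # each item holds a PIL image object
```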
The first thing we want to do is convert these images, which are currently PIL image objects, into numpy arrays. We go through the entire dataset and create a list of numpy-array versions of our PIL objects. Some of these images are grayscale and some are color; we're going to drop the color and keep everything grayscale. We do that with the OpenCV library: we go through and convert each image to grayscale if it's an RGB image, and if it's already grayscale we don't need to transform it. How do we tell whether an image is grayscale or not? The shape of a grayscale image is different, so let me show you. We can see here that a grayscale image is a two-dimensional array, whereas if we look at the color images we still have in the training set, each is a three-dimensional array, because it also includes the red, green, and blue color channels. So that's how we tell: when the length of the image shape is two, we know it's grayscale; if it's greater than two, we know it's a color image.
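A sketch of both steps together; the variable names (images, images_bw) are assumptions:

```python
import numpy as np
import cv2

# convert the PIL objects into numpy arrays
images = [np.array(item["image"]) for item in dataset]

# an RGB array has shape (H, W, 3) while grayscale is just (H, W),
# so the length of the shape tells us whether to convert
images_bw = []
for img in images:
    if len(img.shape) > 2:
        img = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
    images_bw.append(img)
```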
So we can see these are the new grayscale images. Now that our images are ready and pre-processed, the next step is to go through the process of eventually creating our sparse vectors. We start by creating the visual features, then we move on to visual words, then we create the frequency vectors, and then we create the TF-IDF vectors. So, number one: visual features. We do that like so, and it's actually pretty simple. We're using SIFT from OpenCV, and here we're just initializing the lists where we're going to store all of our key points and descriptors. Remember we have both: the key point points to where the visual feature is, and the descriptor is a description of the visual feature, a 128-dimensional vector. We initialize the SIFT extractor, and then for every image in our black-and-white images we go through, extract the key points and descriptors, and add them to our big lists of key points and descriptors.
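A minimal sketch of that extraction loop, using the images_bw list assumed above:

```python
import cv2

# initialize the SIFT extractor once
extractor = cv2.SIFT_create()

keypoints, descriptors = [], []
for img in images_bw:
    # kp is a list of cv2.KeyPoint; des is a (num_features, 128) array, or None
    kp, des = extractor.detectAndCompute(img, None)
    keypoints.append(kp)
    descriptors.append(des)
```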
Notice that we're doing this image by image, so let me show you what that looks like for, say, image 100. Looking at the shape of its descriptors, we see that image 100 has 436 descriptors, and each of those descriptors has dimensionality 128, because it's a SIFT descriptor. Now, this won't happen with this sample, but be aware that for some images, imagine you have a picture of the sky and there's nothing really there, just the flat color of the sky, the SIFT algorithm might return nothing, because there are no key points for it to look at, no changes, just a flat image. In that case the image's SIFT descriptors will just be None, so we have to be careful in the next step and avoid those. That's what we're doing here: if there are no descriptors for a particular image, we just drop that image. For this sample it's not really a problem, but I'm letting you know because with other datasets you might run into it. So all we're doing is dropping any images that don't have any descriptors; as I said, not an issue for this sample.
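A sketch of that filtering step, keeping the image, keypoint, and descriptor lists aligned:

```python
# drop any images where SIFT found nothing (des is None)
keep = [i for i, des in enumerate(descriptors) if des is not None]

images_bw = [images_bw[i] for i in keep]
keypoints = [keypoints[i] for i in keep]
descriptors = [descriptors[i] for i in keep]
```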
Now let's have a look at what these key points and descriptors actually look like. These here are the key points: in the middle of each of these circles we have a descriptor, meaning the SIFT algorithm has identified a visual feature in each one of these circles. The center of each circle is the position of the key point, the line you can see shows the direction, or orientation, of the feature that was detected, and the size of the circle essentially shows the scale of the feature that was detected.
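That visualization can be reproduced with OpenCV's drawKeypoints; the DRAW_RICH_KEYPOINTS flag is what draws the circles scaled to feature size along with the orientation lines (a sketch, drawing the first image):

```python
import matplotlib.pyplot as plt

vis = cv2.drawKeypoints(
    images_bw[0], keypoints[0], None,
    flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS,
)
plt.imshow(vis)
plt.show()
```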
So now we have all of our visual features. Remember our list of steps: visual features, visual words, frequency vectors, TF-IDF vectors. We're now on to the visual words part. First, all we're doing is generating 1,000 random indices, which we'll use to select 1,000 random images; we've added a random seed so you can reproduce what we do here. The reason we take a sample in the first place is just to speed up the k-means training; you can skip this if you want and train on everything, it will just take longer. We extract the descriptors and key points for our training sample and put all the descriptors together. When training k-means we don't care which image a descriptor came from; all we care about is that the descriptors are there. So we take all of our descriptors and put them into a single numpy array. We can check the shape: it's about 1.1 million descriptors, which is quite a lot for fewer than 10,000 images. Here I'm also showing the number of descriptors contained in each image: image zero has 544 descriptors, the next has quite a lot more, the next quite a lot fewer, so each image has a different number of features in it, which is normal. Now you have to set the number of visual words you would like from your images; it's like the vocab size, in NLP lingo. For this I just set k equal to 200, so we're going to have 200 visual words. We don't have that many images; if you have more images you should probably increase this, particularly if they're very varied. If they're not varied, if they're all kind of the same thing, you can probably use fewer visual words. And then we perform k-means, for just one iteration, and that outputs our codebook, which is essentially just a big list of all of our cluster centroids.
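A sketch of the sampling and clustering with scipy's kmeans; the exact seed and variable names are assumptions:

```python
import numpy as np
from scipy.cluster.vq import kmeans

# sample 1000 random images to keep k-means training fast
np.random.seed(0)
sample_idx = np.random.randint(0, len(descriptors), 1000)

# k-means doesn't care which image a descriptor came from,
# so stack the sampled descriptors into one (n, 128) array
training_descs = np.concatenate([descriptors[i] for i in sample_idx])

k = 200  # number of visual words, i.e. our "vocab size"
codebook, variance = kmeans(training_descs, k, iter=1)
print(codebook.shape)  # (200, 128): one centroid per visual word
```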
Now, maybe you're creating your codebook on one machine and you're going to be using it later, over and over again, so I figured it's probably worth showing you how to save and load that codebook. It's pretty easy. We use joblib, and we just dump k, which is the number of visual words we have (you don't necessarily even need to include that; you could save the codebook alone), and we save it into a pickle file. Then when you're loading it at a later point, we just use joblib load. There's nothing else to it.
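A sketch; the filename is an assumption:

```python
import joblib

# persist the codebook so it can be reused without retraining
joblib.dump((k, codebook), "codebook.pkl")

# ...and in a later session, load it back
k, codebook = joblib.load("codebook.pkl")
```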
At this point we have constructed our codebook: we've created our visual features and our visual words, and they're all stored in this codebook. Now we're going to take the visual features from our images, process them through the visual words in our codebook, and use that to create our sparse frequency vectors. That's what we're doing here: we go through each set of image descriptors in descriptors, and we use the vector quantization function from scipy to extract the visual words from them. What that's going to do is not extract the actual vectors of the visual words; it's going to extract the IDs of the visual words. So we get all of these ID values, each of which represents one of our visual words. It's like a token in NLP: there you have word tokens, and in bag of visual words you have visual word tokens. So in image zero we know that the first five of our visual words have these tokens, number 84, 22, and so on, and in total there are 397 visual words in image zero. We can take a look at the centroid, the vector of a visual word, by indexing into the codebook: we go to visual word ID 84 and check the shape, which is 128. That's our centroid position.
Now we need to do a frequency count: based on the visual word IDs, how many of each visual word are in each image? We know the length of our codebook, which should be 200, so each frequency vector we create is going to have dimensionality 200, one position for every possible visual word in our codebook. We loop through all of our images' visual words, initialize the frequency vector for each image, and then loop through each word found in the image's visual words; for each of those visual word index values, we add one at that visual word's position in the image's frequency vector. After we've done all that, we have a single frequency vector for every image, each of dimensionality 200, and at the end we just reformat that big list of numpy vectors into a single numpy array. We can see that we have 9,469 of these frequency vectors, each of dimensionality 200, aligning to our codebook.
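A sketch of that loop, using scipy's vq to assign each descriptor to its nearest centroid:

```python
import numpy as np
from scipy.cluster.vq import vq

frequency_vectors = []
for img_descriptors in descriptors:
    # visual_words[i] is the id (0..k-1) of the centroid
    # closest to descriptor i in this image
    visual_words, distance = vq(img_descriptors, codebook)
    # count how often each visual word appears in this image
    img_frequency_vector = np.zeros(k)
    for word in visual_words:
        img_frequency_vector[word] += 1
    frequency_vectors.append(img_frequency_vector)

frequency_vectors = np.stack(frequency_vectors)
print(frequency_vectors.shape)  # (9469, 200)
```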
From above we saw the visual word tokens in image zero, and you can see we have 1172 twice, because that visual word appears in the image more than once. So let's go down here and have a look at each of those (actually, I don't know why I included the count here, I just copied and pasted it across; we could remove it, but I'll leave it). If we look at those positions, we'd expect all of them to have a value of at least one, because we've already seen those words are there, and for 1172 we know there are at least two. And we can see that here: looking at each of those positions in frequency vector zero, there are numbers there. Let's have a look at more of image zero's frequency vector. We can increase this and show everything: we have these 200 values, each of which is the frequency of one visual word in this image, and we can represent that in a big histogram like this.
Now we want to move on to the TF-IDF part. We have all of these frequency counts in a big vector, but some of those visual words are irrelevant; they're the visual equivalent of the word "the" or the word "a" or "and". We don't want those in there; they're not important and don't tell us anything. So we use TF-IDF to reduce the score of those items, to reduce their dominance in our frequency vectors. We're going to use the TF-IDF formula you can see here. First, N we already know: it's the number of images, i.e. the size of the dataset. Then df is the number of images that a visual word actually appears in. What we're doing here is looking at all of the frequency vectors and saying: if, within a single vector, the value is greater than zero, i.e. one or more, i.e. that visual word appears in that particular image, we assign that position a one. Then we take the sum across all of those frequency vectors along the zero axis. So, for example, say visual word zero has frequency three in image zero, frequency two in image one, and frequency zero in image three. If we perform this operation across just those three frequency vectors, the value for that first column would be equal to two, because it is non-zero in two of the three images, or frequency vectors. We can see what's happening here with the shapes: we have 9,000ish frequency vectors of dimensionality 200, and we're compressing those 9,000 into a single, let's call it a df vector. The inverse document frequency, or I suppose we could call it the inverse image frequency, is then just the log of N over df; nothing special there. Then to calculate TF-IDF we also need the term frequency part of TF-IDF, and we actually already have it: the frequency vectors we've been building are the term frequencies. So all we do is multiply the frequency vectors by the IDF. The IDF, by the way, as we can see here, is just a single vector for our entire dataset; the only part that changes is the frequency vectors.
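A sketch of those steps in numpy:

```python
import numpy as np

# N: the number of images in the dataset
N = frequency_vectors.shape[0]

# df: in how many images does each visual word appear at least once?
df = np.sum(frequency_vectors > 0, axis=0)  # shape (200,)

# idf is a single vector shared by the whole dataset
idf = np.log(N / df)

# tf is just the frequency vectors we already built
tfidf = frequency_vectors * idf  # shape (N, 200)
```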
Now if we plot that, this is the same image as before, and if we scroll up you'll see that we previously had all of these fairly high values; now there's just this one really high value here. Maybe that's a visual word for the cross on top of the church; that's probably a distinguishing feature if we're looking for religious buildings or churches or something along those lines. The rest of the visual words get pushed down, because most visual words are probably not going to be that relevant. So now we've generated our TF-IDF vectors, and we have them all stored in tfidf.
I'll just point out that we have just over 9,000 of those, and we can begin searching. To search, we're going to use cosine similarity, and we're going to start with this image of a dog. We calculate cosine similarity like this; we're going to search through the entire search space for b, so in this case b is not a single vector, and we calculate the cosine similarity between a and all of the items at once. One of those items is going to be the exact match to a, because the query, search_i, which is 6595, is still within b here, so at the top we're going to get that same image returned to us first.
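A sketch of the search; top_k and the variable names are assumptions, and 6595 is the dog image's index from the video:

```python
import numpy as np

top_k = 5
search_i = 6595          # index of the query image
a = tfidf[search_i]      # query vector
b = tfidf                # the whole dataset, query included

# cosine similarity between the query and every image at once
cosine_similarity = np.dot(b, a) / (
    np.linalg.norm(b, axis=1) * np.linalg.norm(a)
)

# sort descending and take the top_k most similar images
idx = np.argsort(-cosine_similarity)[:top_k]
```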
The minimum and maximum cosine similarities are zero to pretty much one; we're using floating-point numbers, so it's not actually exactly one, but it is basically one. I'm going to go quite quickly through this: we use numpy's argsort to get the most similar items. The first image should obviously correspond to the query image itself, with a cosine similarity of pretty much one, and that's what we get here. So we'll just display a few of those. Using this as our search term, this is the first result, with cosine similarity equal to one. Going down, we get another dog, pretty similar looking, then we get this one, so it's not always perfect, and then we get more of these very similar-looking dogs. That looks pretty good.
Let's test a few more images. We have this one here, an old record player. First we get the same record player back, and I don't think this record player actually appears again in this dataset, but I thought it was interesting that we get this printer, I assume it's a printer, which looks very similar: it has the same sort of top piece, and despite the background being very different, and this being a blurry picture, it manages to pull through an object that to me looks pretty similar. The rest of the results aren't really relevant; maybe this one is like a box, so it's kind of similar. Next, we search with this guy holding a fish. Obviously we get the exact image back first, then we get the same guy with the fish in a slightly different position, with his mouth open a little, and then to the side another guy with a fish. And then here I thought this was kind of interesting: there's no guy with a fish, but if you look at the background of these images, it's a leafy, forest-like place, and obviously here there's a leafy, forest-like place too, so it kind of makes sense that it would pull this image through. For this one, I suppose there's kind of a forest in the background too, but otherwise I don't know why it's being pulled through. This next one I thought was interesting because it didn't perform well,
or at least it didn't return what I would expect it to return. We have these two parachutists, and then we get golf balls and a dog. But if you look at this parachute here, it has this really thick black line, and the rest of the parachute is mostly white, and on this golf ball you have a very similar black line, so I assume that for these golf balls it has actually picked up on that; they're two very similar features. We have another one here, another golf ball. Obviously we get the same one returned again, and then another golf ball, the same brand as well, and then we return this truck. I thought maybe the reason it's returning this truck is that it has this white ridged pattern on it, which the golf balls also have. I wasn't too sure, but anyway, that one performed pretty well, I thought.
Finally, we have this garbage truck. We obviously get the garbage truck back, and then we return this image, and initially I thought, okay, that's not a very good first result, and then we get an actual garbage image here. But if you actually look at this image, in the background there does seem to be a garbage truck; it's either a garbage truck or some big industrial bin, I'm not really sure. I thought it was kind of cool that it picked up on this thing in the background; I missed it the first time I looked at this as well. So that's cool. Those are all the results; we went through them quickly to see how it did, and you can probably optimize this to be much better than what I have here; I used a very small dataset, for example. But we've done something pretty cool, I think. We used k-means, so it has learned a little bit there in terms of the visual words, but otherwise it's very simple. And for me, seeing this sort of performance without some big neural network or anything, I think it's pretty cool. So I hope this has been useful. Thank you very much for watching, and I will see you in the next one.