HNSW for Vector Search Explained and Implemented with Faiss (Python)
Chapters
0:00 Intro
0:41 Foundations of HNSW
8:41 How HNSW Works
16:38 The Basics of HNSW in Faiss
21:40 How Faiss Builds an HNSW Graph
33:33 Fine-tuning HNSW
34:30 Outro
00:00:00.000 |
Hi and welcome to the video. We're going to be having a look at the Hierarchical Navigable Small 00:00:06.080 |
Worlds graph or HNSW graph and how it's used in vector search. Now HNSW is one of the best 00:00:17.520 |
performing algorithms for approximate nearest neighbors search and what I want to do with this 00:00:25.520 |
video is explore how it works and how we can implement it in Faiss. So let's go ahead and get 00:00:35.200 |
started. Now we'll start with the sort of foundations of HNSW and where it comes from. 00:00:46.800 |
So we can split approximate nearest neighbor algorithms into three broad categories. Trees, 00:00:54.960 |
hashes, and graphs. Now HNSW, as the name HNSW graph suggests, probably belongs in the 00:01:02.800 |
graph category, and more specifically it's a type of proximity graph, which simply 00:01:09.600 |
means that vertices, which are just the vectors or nodes in our graph, 00:01:16.720 |
are connected to other nodes or vertices based on their proximity, and we would typically 00:01:24.160 |
measure this proximity using Euclidean distance. We don't have to but that's the sort of standard. 00:01:32.720 |
Now going from a proximity graph which is pretty simple to a HNSW graph is a pretty big leap 00:01:42.880 |
and there's two main algorithms that have really helped me understand what HNSW actually does and 00:01:52.880 |
that is the probability skip list and the navigable small world graphs which are just 00:01:58.800 |
the predecessors of HNSW. So what I'm going to do is just quickly you know run through those 00:02:07.680 |
two algorithms. So this is a probability skip list. It's reasonably old. It was introduced in 00:02:15.840 |
1990 by a guy called William Pugh and it works by building several layers of linked lists which 00:02:23.520 |
is what you can see up here. So what we do with this is we're looking for the number 11, or the 00:02:30.400 |
key 11, and we start over here at the top layer of our start block. So we begin in our layer 00:02:40.320 |
3, which is our entry layer, and we move across. So we go over to the next block here and we say okay, 00:02:48.800 |
number 11 is not 5 and 5 is not greater than 11. So we continue and we go all the way over to the 00:02:57.360 |
end block over here. Once we reach the end block we know that we need to go to the next layer so 00:03:05.280 |
one layer down and so we do that we go to layer 2 at the start block again and we continue moving 00:03:14.480 |
across. Now at this point we get to block 19 over here and we say okay, 19 is greater than 11, so now we 00:03:23.760 |
know that we need to go on to our next layer, layer 1 down here and then we go across and we 00:03:30.640 |
find number 11. Now the whole point of this is when we have a lot of values this can reduce the 00:03:39.600 |
number of steps we need to take in order to get to our final value, rather than just going along 00:03:46.000 |
our layer 0 and stepping through every single value within our list. 00:03:57.360 |
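To make that concrete, here is a minimal sketch of that top-down skip-list search in Python. It assumes each block just stores a key, a `next` pointer along its layer, and a `down` pointer to the same block one layer below, with the head being the top-left start block (an illustration, not the exact structure from Pugh's paper):

```python
class Node:
    def __init__(self, key, nxt=None, down=None):
        self.key = key      # the value stored at this block
        self.next = nxt     # next block on the same layer
        self.down = down    # same block, one layer down

def skip_list_search(head, target):
    """Start at the top-left block and move right/down until we find target."""
    node = head
    while node is not None:
        # move right while the next key is still <= target
        while node.next is not None and node.next.key <= target:
            node = node.next
        if node.key == target:
            return node     # found it
        node = node.down    # otherwise drop one layer and continue
    return None             # reached the bottom without finding the key
```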
HNSW borrows this hierarchical layered format, or levels, as we will see pretty soon. Navigable Small World or NSW graphs were 00:04:05.600 |
developed over the course of a few papers around 2011 up to 2014. The basic idea is that we can 00:04:15.920 |
have a very large graph with many vertices and because we have both short range links so the 00:04:27.440 |
sort of links that you see on the outside of our circle here and we also have these long range 00:04:33.600 |
links which is what you see going across the circle we make our graph very navigable so it's 00:04:40.480 |
very easy to get from one point to another in very few steps and there's some terminology that we 00:04:47.600 |
should cover here, which is that, for all the vertices we have here, all of their connections, so all the 00:04:54.640 |
other vertices that they are connected to, we refer to these as that vertex's friend list, okay, and 00:05:04.160 |
when searching this NSW graph we begin at a predetermined entry point so in here we see 00:05:11.440 |
we're sort of showing a few different points, but in reality with these sort of big graphs we always 00:05:18.160 |
start with one specific entry point, or maybe sometimes we use different entry points 00:05:25.440 |
based on some sort of logic, but we have either one or only a few entry points in our graph, 00:05:34.320 |
and then what we do is we perform what is called greedy routing now routing just refers to the 00:05:43.520 |
route that we take through our graph and greedy refers to the fact that we are always going to 00:05:49.360 |
choose, out of our vertex's friend list, the friend that is the closest to our 00:05:59.040 |
target vertex. Now if we find that there are no nearer vertices in our friend list, this is a local 00:06:09.760 |
minimum and this is our stopping condition: we cannot get any closer, we're kind of stuck 00:06:16.400 |
in that local minimum. Now if our vertex has very few friends we can get stuck 00:06:24.320 |
quite easily, so we can get stuck in a local minimum if there are fewer connections, 00:06:40.080 |
fewer friends, as with a low-degree vertex; low degree just means it has a low number of links or 00:06:48.160 |
edges. And then we also have high-degree vertices; now with these, if we are currently on a high- 00:06:57.440 |
degree vertex it's harder to get stuck in a local minimum because we have more options on where we 00:07:02.480 |
can move. Now the routing through a navigable small world graph is split into two 00:07:09.520 |
phases. So we have, typically, the earlier zoom-out phase, where we are passing through vertices 00:07:18.720 |
that have a low number of links, and then we also have the zoom-in phase, where we are passing through 00:07:28.320 |
high-degree vertices which have a lot of links. And what we tend to find is that high-degree vertices 00:07:35.280 |
will usually have more long range links so we can move further across our graph with these 00:07:42.160 |
high degree vertices now to minimize the probability of stopping early so avoiding 00:07:49.120 |
those local minima we can increase the average degree of our vertices now this also increases 00:07:59.520 |
the complexity of our network and therefore slows down our search time so we have to find a balance 00:08:07.360 |
between both of those. Now another approach to avoiding hitting a local minimum too early 00:08:14.800 |
is to start search on high degree vertices so the vertices which have a lot of connections now 00:08:22.960 |
for NSW we can do this and it works, it works for low-dimensional data, but it doesn't work perfectly, 00:08:32.400 |
and it's actually one of the key factors that HNSW improves upon and uses to improve its own 00:08:40.560 |
performance. 00:08:50.480 |
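As a rough illustration of the greedy routing described above, here is a sketch in Python. It assumes the graph is stored as a dict mapping each vertex id to its friend list, with the vectors in a NumPy array and a fixed entry point (these names are just for illustration):

```python
import numpy as np

def greedy_search(vectors, friends, entry, query):
    """Greedy routing over an NSW-style graph.

    vectors: np.ndarray of shape (n, d)
    friends: dict mapping vertex id -> list of neighbor ids (its friend list)
    entry:   id of the entry point vertex
    query:   np.ndarray of shape (d,)
    """
    current = entry
    current_dist = np.linalg.norm(vectors[current] - query)
    while True:
        # look at the current vertex's friend list and pick the closest friend
        best, best_dist = current, current_dist
        for neighbor in friends[current]:
            dist = np.linalg.norm(vectors[neighbor] - query)
            if dist < best_dist:
                best, best_dist = neighbor, dist
        if best == current:
            # no friend is closer: we've hit a local minimum, so we stop
            return current, current_dist
        current, current_dist = best, best_dist
```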
So let's move on to HNSW and how both of these fit into HNSW. So HNSW is obviously an evolution of NSW, and it borrows inspiration from Pugh's probability skip list structure, with 00:08:59.280 |
the hierarchy. So you can see that here we have a few different layers, so we 00:09:04.720 |
have our entry layer just like we did with the probability skip list structure and we also have 00:09:12.160 |
an entry point, and what we've basically done here is taken a navigable small world (NSW) graph and split it 00:09:20.720 |
across multiple layers. So high-degree vertices will tend to be spread across more layers, 00:09:32.400 |
because when we're building our graph we add the number of friends based on which layer it gets 00:09:41.520 |
inserted at: the higher the layer that the vertex is inserted at, the more friends it's going to have, 00:09:48.480 |
which makes a high degree vertex and what that does is if we start at that highest layer that 00:09:55.360 |
means we're on high degree vertices so we are less likely to get stuck in the local minimum 00:10:01.520 |
and that means we're not going to stop early, so that allows us to start at the top layer, and we move 00:10:08.080 |
through our graph and what we do is we have our query vector and on each layer we 00:10:17.200 |
keep moving or traversing across different edges in that layer in essentially the same way we did 00:10:25.280 |
for NSW, so we greedily identify the friend of our current vertex that has the least distance 00:10:34.560 |
from our query vector and we move to that, and then we keep doing that, and once we hit a local 00:10:42.640 |
minimum we don't stop the whole thing, we just move down to the next layer and then we do it 00:10:50.720 |
again, and we keep doing that until we reach the local minimum of layer zero, and that is our 00:10:58.800 |
stopping condition, that's where we stop. Now that's what our graph looks 00:11:03.840 |
like and how we search it, but let's have a quick look at how we actually build that graph. So 00:11:09.360 |
what we do is we have a probability function which essentially says we're pretty much going 00:11:19.040 |
to put every vertex or every vector on layer zero not every but a pretty high number of them as we'll 00:11:26.880 |
see later and we say okay we get a uniform a random number from uniform distribution 00:11:35.360 |
and if that number is greater than our probability for that level we move on to 00:11:42.720 |
the next layer, so we increment our layer. So let's say that's the case here, and we move up a 00:11:53.280 |
layer and we try again. Okay, now this next layer is going to have a lower probability, but we're also 00:11:59.920 |
going to subtract the probability value from our lower layer from the random number that we've 00:12:08.400 |
produced and we keep doing this again and again until we get up to our top layer at that point 00:12:15.200 |
there are no other layers. This is very unlikely to happen; if we have, say, not a billion but 00:12:23.280 |
a hundred million vectors, I think it was a hundred million vectors I tried this with, 00:12:28.400 |
I got I think four vectors on the top layer. So it's very unlikely for one vector to go all the 00:12:36.640 |
way up to the top, but if it does, we insert it at that layer and it has the possibility of becoming 00:12:43.600 |
an entry point for our graph. One other thing that we also have here is the ml at the top. 00:12:50.480 |
Now this is called the level multiplier, and this value is just a normalization value 00:12:59.600 |
which allows us to modify the probability, or the probability distribution, across each one of our 00:13:06.400 |
layers. Now the creators of HNSW found that the best value for ml, the optimum value, 00:13:16.320 |
was to use one over the natural logarithm of m. And m, we'll have a look at that pretty soon, 00:13:25.520 |
is just the number of neighbors each vertex is assigned. 00:13:33.760 |
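In code form, the level multiplier and the level draw from the original HNSW paper look roughly like this (a sketch; Faiss's own sampling logic is shown further down):

```python
import numpy as np

M = 32                      # neighbors added per vertex
m_L = 1 / np.log(M)         # recommended level multiplier, 1 / ln(M)

def draw_insertion_level(rng=np.random.default_rng()) -> int:
    # level = floor(-ln(unif(0,1)) * m_L), from the original HNSW paper
    return int(np.floor(-np.log(rng.uniform()) * m_L))
```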
So here we have m: we've set m to three, and we're saying here we want to insert this blue vector. So that blue vector, you 00:13:44.720 |
see, we've decided okay, we're going to insert at layer one; that means it gets inserted at both 00:13:51.520 |
layer one and layer zero, but not at layer two. And we go through our graph just like we do when we 00:14:01.920 |
search, but this time we're trying to find neighbors of our new vertex that we can assign 00:14:11.280 |
to its friend list, and what we have here is, with an m value of three, we would add three neighbors 00:14:22.400 |
to that vertex, both on layer one and also on layer zero. Now we can also have an m max value and an 00:14:32.640 |
m max zero value which is what we have down here and basically as more vertices are added we may 00:14:46.000 |
find that this vertex ends up with more than three friends and maybe that that's fine but 00:14:55.280 |
what we essentially do is use m max to say okay if we have any vertices with more than this number 00:15:03.280 |
of friends we need to trim it and we need to just keep the closest like three in this example 00:15:10.400 |
and then m max zero is another value, and that is the same, so it's the maximum number of friends 00:15:17.760 |
a vertex can have, but for layer zero, which is I think always a higher number than the other layers. 00:15:25.600 |
Now one other thing that we should cover very quickly is that we also have two more variables: ef 00:15:33.200 |
construction and ef search. Now ef construction is used when we're building our graph, and it's basically 00:15:40.400 |
the number of friends of our inserted vertex that we will use when traversing across layers so in 00:15:50.800 |
layer one up here we we have these three friends and that means that during our search for 00:15:57.840 |
friends in the low layers we would go down like this we would start our search from each of these 00:16:06.720 |
three vertices in the lower layer okay now that would be the case for all of those 00:16:16.000 |
and ef search is the same but it's used during search time to search through our graph 00:16:21.360 |
and I mean, that's pretty much all we need to know 00:16:27.520 |
on the theory side of HNSW, so let's dive into the implementation of HNSW in Faiss. 00:16:39.680 |
Okay, so we're in Python now. All I've done is taken our data, so I'm using the SIFT1M dataset 00:16:47.440 |
as we usually do, and all I'm going to do is use this to, you know, search through our index 00:16:55.440 |
as always. Now the first thing we want to do is actually create or initialize our index, so I'm 00:17:02.400 |
going to use two parameters. So we have d, which is the dimensionality of the SIFT1M dataset, so the 00:17:09.200 |
length of our vectors in there and also m now m if you remember that's the number of neighbors 00:17:16.880 |
that we will assign to each vertex in our graph now for this i'm going to use a value of 32 00:17:24.080 |
And what we can now do is we write index equals faiss. 00:17:34.000 |
IndexFlatHNSW, and then we just pass d and m. Okay, sorry, so it's not IndexFlatHNSW, 00:17:43.360 |
it's IndexHNSWFlat. Okay, there we go, so with that we have initialized our index. 00:17:59.120 |
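In code, that initialization looks like this (assuming faiss is installed; d and M follow the values used in the video):

```python
import faiss

d = 128   # dimensionality of the SIFT1M vectors
M = 32    # number of neighbors assigned to each vertex

# initialize the HNSW index with flat (uncompressed) vector storage
index = faiss.IndexHNSWFlat(d, M)
```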
Of course there's nothing in there at the moment, and we can see that, so we can go ahead and 00:18:06.240 |
write index.hnsw.max_level to see, okay, how many levels do we have in our index, and we'll see that 00:18:16.560 |
we get this minus one, which means that the level structure has not 00:18:22.800 |
been initialized yet. And we also see that if we do this, so this is how we get the 00:18:29.840 |
number of vertices that we have within each layer, so we write faiss.vector_to_array 00:18:40.000 |
and in here we want to write index.hnsw.levels, and then let's print that out, 00:18:52.160 |
and we see that we just get this empty array. Now, what we'd also want to do when we do this: this 00:18:58.160 |
is just a massive array, so we will have a million values in here once we add our SIFT1M 00:19:09.040 |
dataset, so what we do is we just write np.bincount(levels) to count basically how 00:19:18.960 |
many vertices we have in each level. When I say level, by the way, it's the same as earlier on when 00:19:25.440 |
I was saying layers; when I say levels or layers, it's the same thing. 00:19:33.840 |
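As a small sketch, inspecting the still-empty structure looks something like this (continuing from the snippet above):

```python
import numpy as np

# before adding any vectors, the level structure is uninitialized
print(index.hnsw.max_level)   # -1

# per-vertex level assignments, exposed as a Faiss vector we convert to numpy
levels = faiss.vector_to_array(index.hnsw.levels)
print(levels)                 # empty array for now

# counts how many vertices sit on each level (empty now, useful after add)
print(np.bincount(levels))
```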
So obviously at the moment we don't have anything in our graph, so how do we add that? We already know: 00:19:41.600 |
we just do index.add(xb), okay, nothing else. So we add that; that might take some time because 00:19:50.560 |
we're building the graph we saw before it's a reasonably intensive process so it does take a 00:19:55.680 |
little bit of time i think maybe around 40 seconds for this so i will fast forward okay so it's done 00:20:06.000 |
52 seconds, not too bad. Now what we're going to do is have a look at these again, so 00:20:14.240 |
let's see if we have a structure in our graph yet. So max level, that's cool, that means 00:20:21.440 |
we have five levels zero one two up to four and then if we do this again we'll be able to see the 00:20:30.320 |
distribution of vertices across each one of those levels so i'll just write this there we go so we 00:20:38.880 |
just ignore this this first one here that doesn't mean anything so this is our layer zero like i 00:20:46.400 |
said most vectors get assigned to this layer now we have layer one all the way up to layer four here 00:20:56.320 |
or level four. Now Faiss will always add at least one vertex to your maximum level because 00:21:05.680 |
otherwise we don't have an entry point to our graph, so there's always at least one in there, 00:21:11.120 |
and we can also check which vector is being used there, so we can write index.hnsw. 00:21:20.880 |
entry_point, and there we have it, that's the index of the vector that we have added 00:21:31.120 |
as our entry point up here. You can also change that if you want, but I probably wouldn't; especially 00:21:37.840 |
with this, I definitely wouldn't recommend it. 00:21:43.920 |
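Putting those steps together, a rough sketch of the build-and-inspect flow looks like this (assuming xb holds the SIFT1M base vectors as a float32 NumPy array, and continuing from the snippets above):

```python
# build the graph; for SIFT1M this took roughly 40-50 seconds in the video
index.add(xb)

# the graph now has a level structure
print(index.hnsw.max_level)   # 4 in the video, i.e. levels 0 through 4

# distribution of vertices across levels (most sit on level 0)
levels = faiss.vector_to_array(index.hnsw.levels)
print(np.bincount(levels))

# the vertex on the top level acts as the graph's entry point
print(index.hnsw.entry_point)
```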
Now I was super curious as to how it decides where to insert everything, so I went through the code, which you can find over here. So this is the 00:21:52.240 |
Faiss repo, and we have this hnsw.cpp file, and in here we come down, and it might take me a little 00:22:02.320 |
while. So this is where we initialize our index and we have this set_default_probas, so if we go 00:22:11.680 |
into set_default_probas we will find this here, which is not really too hard to read. I don't 00:22:20.080 |
really read, I don't know, C++, but this is pretty simple, and what I translated that to in Python is this 00:22:28.720 |
here, with a few comments, because there are no comments in the other file for some reason. 00:22:35.360 |
now what we're basically doing is just building a probability distribution for each of our layers 00:22:42.560 |
and we do that using this function here, so we're taking the exponential of the level 00:22:51.280 |
divided by that ml value that I mentioned earlier, so that's the 00:22:59.920 |
level multiplier value, and what we get is a probability for the current level. Now 00:23:07.840 |
what we do is, once we reach a particularly low probability, we stop creating new levels, 00:23:15.840 |
so if we run this... oh, another actually interesting thing here is we're saying 00:23:25.440 |
on every level we will assign m neighbors as our m max, except for level zero 00:23:32.720 |
where we will assign 2m, okay, so this is m times two. So for our use case we had 32, 00:23:44.000 |
we had m set to 32, so that means in all of our layers except layer zero 00:23:50.480 |
each vertex will have 32 neighbors or friends, and on layer or level zero it will have 64 00:23:59.760 |
friends, so I thought that was also interesting. 00:24:03.440 |
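The Python translation shown on screen looks roughly like this (reconstructed here; the 1e-9 cutoff and the variable names follow the Faiss source):

```python
import numpy as np

def set_default_probas(M: int, m_L: float):
    nn = 0  # running total of neighbors per level
    cum_nneighbor_per_level = []
    level = 0  # start at level 0
    assign_probas = []
    while True:
        # probability of a vertex being assigned to this level
        proba = np.exp(-level / m_L) * (1 - np.exp(-1 / m_L))
        # once the probability gets vanishingly small, stop creating levels
        if proba < 1e-9:
            break
        assign_probas.append(proba)
        # each vertex gets M neighbors per level, except 2*M on level 0
        nn += M * 2 if level == 0 else M
        cum_nneighbor_per_level.append(nn)
        level += 1
    return assign_probas, cum_nneighbor_per_level
```

Calling set_default_probas(32, 1/np.log(32)) gives the per-level probabilities and the cumulative neighbor counts discussed next.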
So we can go ahead and run that; the 32 here is our m, we can replace it with m if we want, 00:24:13.600 |
so we run that and we get this probability distribution so this is for layer zero the 00:24:19.520 |
probability of it being added using a random number generator so we're saying if our randomly 00:24:26.160 |
generated number is less than this value, we will assign, or we will insert, our vertex at level 00:24:35.600 |
zero and then we continue that logic all the way down to here which is our level four and then here 00:24:43.600 |
is the total number so the cumulative number of neighbors per level so if we insert at level four 00:24:50.400 |
we will have this many neighbors for that vertex in total; if we insert at level zero we have 64. Again, 00:24:58.320 |
a little bit interesting. Now in the code, when we're inserting vertices we use another 00:25:07.200 |
function this is the last one we're not going to go through all of them we're just going to go 00:25:10.800 |
through this one we insert or we decide where to insert our vertex using this function here which 00:25:17.040 |
is the random_level function. This takes our assign_probas from up here, so this here, and also a 00:25:25.840 |
randomly generated number, so here I'm using NumPy's random number generator, 00:25:33.840 |
and we go through, and that's how we assign a level to each of our values. 00:25:42.000 |
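Again as a rough reconstruction of the translation shown on screen, random_level looks something like this (it reuses set_default_probas from the sketch above):

```python
import numpy as np

def random_level(assign_probas: list, rng) -> int:
    # draw a random float in [0, 1)
    f = rng.uniform()
    for level in range(len(assign_probas)):
        # if the float falls below this level's probability, insert here
        if f < assign_probas[level]:
            return level
        # otherwise subtract that probability and try the next level up
        f -= assign_probas[level]
    # reached only with very low probability: use the top level
    return len(assign_probas) - 1

# illustrative usage with the values from earlier
assign_probas, _ = set_default_probas(32, 1 / np.log(32))
rng = np.random.default_rng()
chosen_levels = [random_level(assign_probas, rng) for _ in range(1_000_000)]
print(np.bincount(chosen_levels))  # distribution should resemble Faiss's
```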
So if I run that, we'll see that we get a very similar distribution to what we get in Faiss, 00:25:52.240 |
which is, I think, pretty cool to see. So if we do the bincount again we get this; now if you take a 00:26:01.760 |
look at these values, they are so close to what we produce up here. So where is it, um, so here, look 00:26:12.720 |
at these values, almost exactly the same, so yeah it's pretty close, they're both working in the same 00:26:23.520 |
way. Now I also visualized that in this chart here, so you can see, okay, on the right 00:26:33.280 |
we have the Python version, on the left we have our Faiss version, and I've included the numbers in there; 00:26:38.720 |
basically it's pretty much exactly the same, it's off by a few here and there but not 00:26:45.280 |
much so i thought that was pretty cool to see now okay that's cool that's how everything works 00:26:53.280 |
that's interesting but what i think many of us are probably most interested in is how do i build the 00:27:02.400 |
best HNSW index for my use case, for what I want. So I ran some code to go through 00:27:12.240 |
all of that, and pretty much I did this. So this is also a good thing for me to show you, um, 00:27:22.960 |
which is how we set our ef construction and ef search values. So I think, you know, a good starting point 00:27:29.760 |
for both of these values is probably 16; 32 is reasonable too, but you can play around with them, I think 00:27:40.800 |
that's usually the best way. And for most of what I'm going to show you I was using this: I was 00:27:47.600 |
performing a thousand queries and calculating the recall@1 value for those; also I was 00:27:57.040 |
checking, you know, how long our search takes and the memory usage of each of these indexes. 00:28:02.960 |
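A minimal sketch of that benchmarking setup, assuming xq holds the query vectors and gt holds the SIFT1M ground-truth nearest-neighbor ids (the parameter values below are just the starting points mentioned above):

```python
import time

# construction- and search-time parameters
index.hnsw.efConstruction = 32   # only matters if set before index.add()
index.hnsw.efSearch = 16

start = time.time()
_, ids = index.search(xq[:1000], k=1)      # 1,000 queries, top-1 result
elapsed = time.time() - start

# recall@1: fraction of queries whose top result matches the ground truth
recall_at_1 = (ids[:, 0] == gt[:1000, 0]).mean()
print(recall_at_1, elapsed)
```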
So the first thing I wanted to look at is recall, of course. So what we have here 00:28:13.840 |
on the left is an ef construction value of two, which is very low; I personally would 00:28:21.200 |
never use that, never use an ef construction value that low. Well, you 00:28:28.000 |
can see the recall is pretty terrible, but you know, surprisingly, even though this is a very bad 00:28:33.200 |
ef construction value, it still got up to 40 percent recall, so I was impressed with that, that was cool. 00:28:42.240 |
And so the ef search values are what you can see on the bottom, so it went from, I think, two all 00:28:48.320 |
the way up to 256, yeah that's right, and we did that again for the ef construction value of 16 here, 00:28:59.200 |
and also 64. I did it for more ef construction values but these are probably the ones I think 00:29:05.440 |
showed the relationship best and then all the colors you can see here we have a few different 00:29:10.560 |
colors, we're using different m values. So what I found from this is that basically you want 00:29:19.760 |
to increase your m and your ef search values to improve your recall, and ef construction also 00:29:26.960 |
helps with that, so a higher ef construction value means that we need lower ef search and m values 00:29:33.440 |
to achieve, you know, the same recall. Now of course we can just increase all of 00:29:40.960 |
these and we get better recall, but that isn't for free, we have to sacrifice some of our search 00:29:50.560 |
time as well, as always. So this is what we've got, same graphs as last time, 00:30:01.040 |
this time we have search time on the y-axis now ef construction here 00:30:07.760 |
does actually make a bit of a difference when we start using higher m values so i found before in 00:30:18.000 |
previous videos and articles that ef construction didn't really have too much of an effect 00:30:23.200 |
on the search time which seemed fair enough because it is mostly a index construction 00:30:31.120 |
parameter but that it does seem to have some effect when you start adding or increasing 00:30:38.000 |
the number of queries that you are searching for and in particular where you have high m values 00:30:45.040 |
Now one thing to also note here is that we're using a log scale on the y-axis, so 00:30:52.000 |
the higher values are definitely a fair bit higher, but again, I mean, if we go for a sort 00:30:59.280 |
of reasonable ef search value... all of these are pretty high, I mean 256 for your ef search value 00:31:05.840 |
is high; if we go for a sort of ef search of 32, let's say around here, 00:31:18.800 |
we're still getting reasonable search times so this is on the 00:31:21.920 |
millisecond well it's a bit more than a millisecond scale 00:31:25.840 |
And if we take a look at that on, uh, recall, we're sort of getting up into the sort of 00:31:38.320 |
60-plus percent area at that point, 0.6 plus, and look, obviously it's not bad, it's reasonably good, 00:31:48.400 |
but you can definitely get better than that when you start merging different 00:31:52.720 |
indexes, which we have covered in another video and another article, so you'll be 00:32:01.120 |
able to find that either at the bottom of the article, if you're reading the article, 00:32:05.600 |
or in the video description if you're watching on YouTube. 00:32:08.960 |
Now having a look at the ef construction values again, their effect on the search time: 00:32:16.720 |
we find that ef construction, when we use those lower-end values, like I mentioned, doesn't 00:32:22.720 |
have too much of an impact on the search time. So if you are using a lower m value, I think ef 00:32:29.520 |
construction is a good value to increase, because it doesn't tend to push up your search time 00:32:37.680 |
too much, and you can improve your recall a fair bit by increasing it, so it's definitely one 00:32:46.080 |
to try increasing and see how it goes; for me, I found it's been quite useful at least. 00:32:52.000 |
And then finally, the sort of negative point of HNSW is that the memory usage is just 00:33:00.640 |
through the roof, as you can see here. So using the lowest m value, which I think was 00:33:06.160 |
two, which you would never ever use, we were getting a memory usage for the SIFT1M 00:33:13.120 |
dataset of more than half a gigabyte. If you push m up a lot, I mean 512 is 00:33:23.600 |
maybe a little bit too much, but we're sort of pushing four and a half gigabytes, five gigabytes, 00:33:29.680 |
so it's something to be aware of, I think. But like I said before, we can still improve the 00:33:39.840 |
memory usage and the search speed by simply using different indexes alongside our HNSW index. So 00:33:47.760 |
if memory usage is a concern, you can compress the vectors using product quantization; 00:33:52.960 |
if you need the speed to be even faster, you can use an IVF index on top of HNSW to improve 00:34:05.120 |
the search times; and if you want both of those, you just put them all together and you can get that as well. 00:34:11.280 |
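As a hedged illustration of what such a composite can look like in Faiss (the factory string, nlist, and PQ size here are just one example configuration, not a recommendation from the video):

```python
import faiss

d = 128  # SIFT1M dimensionality

# IVF index whose coarse quantizer is an HNSW graph, with PQ-compressed vectors
index = faiss.index_factory(d, "IVF4096_HNSW32,PQ32")

# composite indexes like this need training before vectors can be added
index.train(xb)
index.add(xb)

# number of IVF cells to probe at search time
faiss.extract_index_ivf(index).nprobe = 16
```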
Now like I said, we have covered all of that in a lot more detail in another article 00:34:18.160 |
and video, but for this video on HNSW I think we're going to leave it there, I think we've covered 00:34:28.160 |
plenty. So thank you very much for watching, and I'll see you in the next one. Bye!