Vector Search Benchmark[eting] - Philipp Krenn, Elastic

00:00:00.040 |
I'm Philipp. Let's talk a bit about benchmarking, or rather benchmarketing. Who has heard the term benchmarketing 00:00:22.960 |
before? Benchmarketing is what you get in most of the benchmarks out there, even though nobody 00:00:32.560 |
calls their own benchmarks that. Let's take a quick look. Who is using vector search 00:00:37.200 |
today? I assume many. Who likes the performance of their vector search? Who has looked at various 00:00:46.960 |
performance benchmarks around vector search? Quite a few. How did you like those benchmarks? 00:00:51.920 |
Yes. That is, you have this "X is faster than Y". The problem is that you can have pretty much 00:01:01.920 |
any product in X or any product in Y. I think at this point, we have benchmarks for every 00:01:06.700 |
single vendor being both faster and slower than all of their competitors, which is a bit 00:01:12.360 |
of a funny sign for benchmarks. Why is that? Are benchmarks just bad? Is everybody lying? 00:01:19.960 |
How does it happen that benchmarks are so different? Don't be shy. Come up here. We have some more 00:01:25.160 |
chairs up here. Also, don't trust the glossy charts. The better looking the benchmarks are, 00:01:33.960 |
sometimes, the worse the underlying material is. Let's dive into what to do and what not 00:01:41.400 |
to do with benchmarks. Come up here. We have two more chairs up here. The first thing, and that's one of 00:01:48.360 |
the biggest ones, is finding the right use case. The right use case, of course, depends on your system. 00:01:55.560 |
This is one of my favorite benchmarking comics. It's about "under similar conditions": we are comparing two 00:02:02.120 |
systems, and one is doing much better than the other. That is pretty much exactly what many 00:02:09.000 |
companies are doing with their own vendor benchmarks. They define a scenario that they like, 00:02:16.040 |
and probably their competitors don't like so much, and then run the benchmark, and then afterwards they 00:02:21.320 |
will say how much better they are. This is like a very common pattern for benchmarks. Or to take it in 00:02:27.240 |
a slightly different direction. If you try to build a benchmarking dataset, and maybe you've seen that with 00:02:34.680 |
your own marketing team, it's like you run various scenarios, and then you see like this is no good, 00:02:40.440 |
this is no good, this is definitely no good, but here you have struck gold. Because here, 00:02:46.360 |
them is down here, and us is up here. So we will take this one, and then we'll generalize, and generalize, 00:02:55.400 |
and say like for everything, we're 40% faster than the competition. And you forget about all the others. 00:03:00.760 |
That is one of the very common things that you do with benchmarks. And I will just call out a couple 00:03:08.600 |
of things, especially for vector search. One of the things, depending on the underlying data structure 00:03:13.480 |
that you have and how the system is built: most of the benchmarks, because it's much easier 00:03:19.560 |
and also more reproducible, will be read-only, oftentimes in an already optimized format. The problem is, 00:03:25.640 |
most of our workloads are not read-only, but pretty much all the benchmarks that you see out there are 00:03:32.040 |
for read-only datasets. Because it's much easier to compare, versus if you have a specific dataset, 00:03:38.440 |
and then you would need to find the right read-write ratio, and there are so many parameters that you could 00:03:43.640 |
tweak, that most people just don't, and then you have a benchmark that is not really representative of 00:03:48.440 |
what you're trying to do. 00:03:56.440 |
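As a rough sketch of the difference, here is what a mixed read/write benchmark loop could look like, with hypothetical `index_doc` and `run_query` stand-ins for whatever client calls your system exposes; most published benchmarks effectively run only the `run_query` branch against a pre-built, optimized index.

```python
import random
import time

def mixed_workload(index_doc, run_query, docs, queries, read_ratio=0.8, n_ops=10_000):
    """Interleave queries and writes at a configurable ratio and record query latencies."""
    latencies = []
    for _ in range(n_ops):
        if random.random() < read_ratio:
            start = time.perf_counter()
            run_query(random.choice(queries))      # the read path
            latencies.append(time.perf_counter() - start)
        else:
            index_doc(random.choice(docs))         # the write path most benchmarks skip
    latencies.sort()
    # p50 and p99 query latency while the system is also ingesting data
    return latencies[len(latencies) // 2], latencies[int(len(latencies) * 0.99)]
```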
Another very counterintuitive thing about vector search is that, at least for HNSW, filtering makes things slower. So if you come from a regular data store, 00:04:04.200 |
and you have a restrictive filter that reduces the dataset, let's say like only 20% 00:04:09.400 |
of the things make it through the filter, and you then try to rank those, it will be faster. 00:04:13.800 |
The way vector search works, or at least approximate nearest neighbor search, is that the filter will 00:04:20.200 |
actually make it slower, because you need to look at a lot more candidates and then find what actually 00:04:24.760 |
remains after the filter. There are various optimizations for that, but it's another trick 00:04:29.640 |
that you can do, depending on how good or how bad you are with filters, that you just tune the right 00:04:35.160 |
scenario to find your sweet spot. Or that you have just built in some optimizations, and then you're 00:04:42.600 |
making sure that you have a scenario that hits exactly these optimizations that your competition doesn't have. 00:04:47.400 |
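As a minimal sketch of why restrictive filters hurt approximate search, assume a hypothetical `ann_search(n)` that returns the top-n approximate neighbours, best first, and a `passes_filter` predicate. Real engines implement filtered HNSW search more cleverly, but the over-fetching cost is the same idea:

```python
def filtered_ann_search(ann_search, passes_filter, k, max_candidates=10_000):
    """Naive post-filtering over an approximate nearest neighbour index."""
    n = k
    while True:
        # With a filter that only lets ~20% of documents through, we need to pull
        # roughly 5x as many candidates before k of them survive the filter.
        hits = [doc for doc in ann_search(n) if passes_filter(doc)]
        if len(hits) >= k or n >= max_candidates:
            return hits[:k]
        n *= 2  # over-fetch and retry: this extra work is the slowdown
```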
Another thing that I see quite frequently, especially as people update their benchmarks, 00:04:54.520 |
they normally always update their own software, but they don't update their competitors' versions. 00:04:58.760 |
We see that in various benchmarks where we are also present as a company: our version is probably 00:05:04.840 |
18 months old, and the competitor has something that came out in the last one or two months. 00:05:09.080 |
And then obviously, they will do quite a bit better there. 00:05:13.720 |
So yeah, you only update your own version and not the others. I can understand, you don't want to keep 00:05:20.920 |
up with the changes of all your competitors. You don't want to figure out what changed and how things are 00:05:26.680 |
configured the right way in a newer version. And you're mostly focused on the interesting things that 00:05:30.760 |
you're doing yourself. It is also convenient for your own performance. 00:05:35.080 |
There's also like a lot of implicit biases. It's like, you know how your system is built and what it is 00:05:43.640 |
built for and how it works well. And then you might often pick scenarios that work well for that use case. 00:05:50.760 |
And then it might not even be intentional, but then you might pick something that is just not a very good 00:05:56.040 |
fit for your competitors. The same goes for defaults: how you split up shards, how you allocate 00:06:02.760 |
memory, what instance size or instance configuration you pick. You might not even mean it in a bad way, 00:06:10.600 |
or you haven't even benchmarked the competitor against it on that specific machine. But you might just pick 00:06:15.640 |
something that you know that works well for you. And then conveniently, it's not so great for your 00:06:20.280 |
competitors. It could be like the shard size, the total data size, what fits into memory, how you compress 00:06:26.440 |
something, the overall memory settings, the general data access patterns. There are a ton of different 00:06:32.600 |
ways you could tweak that one way or another. And then there is obviously cheating. My favorite example 00:06:43.240 |
for that is Volkswagen a while ago: they had some creative ways to look good in the 00:06:52.600 |
benchmarks for like how much exhaust they were producing. I think there was even a fun project 00:06:59.880 |
that was like a Volkswagen plugin for your CI server that would just always make your tests pass, 00:07:05.640 |
which is kind of like the same thing here. And cheating can be many things. For vector search 00:07:12.360 |
specifically, especially when you do approximate nearest neighbor search, where, depending on how you 00:07:19.480 |
pick the parameters and like how widely you search, the quality of the results will 00:07:27.640 |
be different, or at least potentially different. Precision and recall are normally what we measure for 00:07:32.680 |
information retrieval. And if you have different parameters or different implementations, and you 00:07:37.080 |
don't include those in the benchmark, you're comparing totally different things. And then it's not totally 00:07:43.480 |
uncommon to see that precision and recall are, whether intentionally or unintentionally, 00:07:49.560 |
skipped in performance benchmarks, so you see wildly different numbers, while the quality of the results is actually 00:07:57.480 |
totally different. Obviously, you can produce crap results 00:08:02.120 |
very quickly. But that is probably not the point of what you do in your benchmarks. We actually had 00:08:07.960 |
some of those ourselves: people who were less experienced with vector search benchmarks forgot 00:08:13.240 |
about that quality attribute, because for other types of searches, for example with a boolean filter 00:08:19.080 |
where something either passes or doesn't, this concept doesn't exist, and they almost published something 00:08:25.000 |
without looking at precision and recall. And it did make quite a bit of a difference, even in our own benchmarks. 00:08:29.800 |
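A common way to keep that quality attribute visible in your own benchmarks is to report recall@k against an exact brute-force ground truth next to every latency number; a minimal sketch:

```python
def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the true top-k neighbours that the system under test returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

# A configuration that is twice as fast but only finds 8 of the true top 10
# is not comparable to one that finds all 10, so report recall next to latency.
print(recall_at_k([1, 2, 3, 4, 5, 6, 7, 8, 98, 99], list(range(10))))  # 0.8
```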
I sometimes see creative statistics. So one thing that I've recently seen that was very 00:08:38.360 |
funny is: there were, I think, 20 data points or so, 20 different things measured. And in 18 or 19 of them, 00:08:44.920 |
two systems were quite similar. And then one system basically introduced some optimization for one of the use 00:08:50.840 |
cases and made it like 10 times faster than the competitor. And then they found a way to kind of 00:08:56.360 |
like even out the statistics across all of them. And then they said, like, overall, we're way faster. 00:09:01.240 |
But they found this one weird edge case, I think it was like, instead of descending sorting, 00:09:07.480 |
it was ascending sorting or something like that on one specific data type. And that one was much faster. 00:09:13.320 |
And then they evened it out enough that it looked like overall, the systems were wildly different. But it was 00:09:17.720 |
basically one specific query among the many things they looked at. 00:09:24.120 |
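To see how that kind of evening-out works, here is an illustrative calculation, not necessarily the exact statistic that benchmark used: with 19 ties and one cherry-picked 10x win, a plain average already yields an impressive-sounding overall number.

```python
import math

# 19 of 20 measured operations are a tie (speedup ratio 1.0); one edge case is 10x faster.
ratios = [1.0] * 19 + [10.0]

arithmetic_mean = sum(ratios) / len(ratios)              # 1.45, "45% faster overall!"
geometric_mean = math.prod(ratios) ** (1 / len(ratios))  # ~1.12, the gap mostly disappears
print(round(arithmetic_mean, 2), round(geometric_mean, 2))
```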
And then you have the headline, because that's mostly what you're going for with benchmarks, right? You want that 00:09:28.200 |
"five times faster than X" headline. Again, coming back to the glossy charts, those are the things that 00:09:34.360 |
you do. And it's kind of like the money shot in benchmarking. And of course, then there are the 00:09:42.840 |
problems when the results don't reproduce, or you don't publish all the pieces so that anybody can actually run it 00:09:47.720 |
to try it out. Those don't make it easy either. So how do we make benchmarks better or more meaningful? 00:09:56.280 |
They should definitely be automated and reproducible. So what we internally do, for example, we have a 00:10:03.640 |
very wide set of, we call them the nightly benchmarks. That's why up here it says nightly. I just pulled 00:10:10.680 |
this one out because it was nice. We optimized one specific thing, I think like, I don't know, 00:10:15.880 |
100x or 10x or whatever the ratio was. But we optimized this, which was nice. But the important thing here is, 00:10:22.760 |
we run this benchmark every night. And then we put all the changes that we had during that day together 00:10:28.120 |
to see if the performance changes. Why are we doing that? To avoid the slow boiling frog problem. 00:10:34.520 |
Who knows the slow boiling frog? It's a very French thing. It's like, you know, you throw the frog into 00:10:42.440 |
the boiling water and the frog will jump out again. And you put the frog into the cold water and slowly 00:10:47.240 |
heat it up and the frog will stay in the water. And the same thing kind of like applies to benchmarks. 00:10:52.600 |
If you make a change today, and it makes something 1% slower, you're probably not going to see it. And 00:10:57.160 |
next week, you make another change where it gets 2% slower. And over time, your performance just 00:11:01.560 |
decreases a lot. So you're the frog sitting in the warming up water and being boiled. So you want to 00:11:08.760 |
avoid that with your own system: you don't want to boil your own performance. So you want to have 00:11:13.720 |
nightly benchmarks where you see how things are changing and evolving over time. 00:11:21.960 |
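Here is a sketch of the kind of check a nightly run can apply, with placeholder numbers and threshold: compare tonight's result against a baseline window instead of only against yesterday, so small regressions cannot slip in one degree at a time.

```python
def check_regression(recent_p99s, tonight_p99, threshold=0.03):
    """Flag tonight's run if its p99 latency regressed more than `threshold`
    against the median of the recent nightly runs."""
    baseline = sorted(recent_p99s)[len(recent_p99s) // 2]  # median of the window
    change = (tonight_p99 - baseline) / baseline
    return change > threshold, change

# 1% tonight and 2% next week each look harmless on their own; compared against
# the baseline window, the drift shows up.
regressed, change = check_regression([0.100, 0.101, 0.099, 0.102, 0.100], 0.106)
print(regressed, f"{change:+.1%}")  # True +6.0%
```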
The second thing is a bit unfortunate: you will need to do your own benchmarks. Don't trust anybody's benchmarks, because 00:11:29.560 |
probably nobody has done exactly the scenario that you want to have. You need to know: 00:11:34.360 |
this is my data size, this is my data structure, this is the read-write ratio, this is what 00:11:40.520 |
exactly my queries look like. And this is the latency that I am willing to accept. And this is the type of 00:11:45.240 |
hardware that I have. There are so many ways that make the existing benchmarks probably not 100% 00:11:51.400 |
meaningful for you, that if you want to really be sure, there is no way around doing that yourself. 00:11:56.840 |
Otherwise, you will buy into somebody's glossy benchmarks, and you will have to trust them and hope for the best. 00:12:04.840 |
That's unfortunate, because it always means work. And you need to put in that work. But there is no easy way around that. 00:12:11.400 |
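One way to make that work explicit is to write the scenario down before you benchmark anything; here is a hypothetical sketch of the parameters just listed, with placeholder values you would replace with your own.

```python
from dataclasses import dataclass

@dataclass
class WorkloadSpec:
    """The scenario that published benchmarks almost never match exactly."""
    corpus_size: int = 10_000_000             # number of documents / vectors
    vector_dims: int = 768                    # embedding dimensionality
    read_write_ratio: float = 0.9             # fraction of operations that are queries
    query_shape: str = "kNN + tenant filter"  # what your real queries look like
    p99_latency_budget_ms: float = 50.0       # the latency you are willing to accept
    recall_floor: float = 0.95                # minimum acceptable recall@10
    hardware: str = "8 vCPU / 32 GB RAM"      # the machines you will actually run on
```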
We have a tool we call Rally. It's basically: you create a so-called track, and then 00:12:18.360 |
you run the track where you say, this is the data, this is the query. And then you can optimize that. And then 00:12:23.240 |
you can tune the settings or the hardware or whatever else. It's only against us and ourselves. This is how we 00:12:29.000 |
would do nightly benchmarks with ourselves. But that is how you can then benchmark and figure out, 00:12:34.120 |
for my hardware, what can I actually get out in terms of performance? And then you will need to take 00:12:38.360 |
whatever other tools you want to evaluate and do something similar to see, like, for exactly your 00:12:43.400 |
workload, how does that compare? One final thing, and that is also what I need to keep repeating internally 00:12:51.720 |
for ourselves. If you find the flaw in a benchmark, it's easy to call the entire benchmark crap and then 00:12:58.440 |
ignore everything that comes out of it. What is much smarter is if you look at it and still figure 00:13:03.400 |
out, can I learn something from that? Even if it's like somebody produces a bad benchmark against a 00:13:09.560 |
competitor, it can tell you, like, where they think their sweet spot is or what is the scenario that they 00:13:14.520 |
pick for themselves or what they try to highlight. Because even then you can learn, like, what are the 00:13:20.040 |
potential strengths of a system? Where do they shine? What works well and what doesn't work well? So all of that 00:13:25.640 |
makes it easier to find something useful if you want to look. But it's, of course, much easier to say, 00:13:31.560 |
like, all of this is crap and we'll ignore it and we'll call it BS. And then you move on. 00:13:35.880 |
I hope this was useful. I'm afraid you will have to do your own benchmarks to get proper results. 00:13:44.200 |
Otherwise, you will have to believe the benchmarketing. Let me take a quick picture with you so I can 00:13:50.200 |
prove to my colleagues that I've been working today. Can you all wave and say benchmarking? 00:13:59.880 |
Thanks a lot. Enjoy the heat and I'll hand over to the next speaker.