Vector Search Benchmark[eting] - Philipp Krenn, Elastic

00:00:00.040 |
I'm Philipp. Let's talk a bit about benchmarking, or rather benchmarketing. Who has heard the term benchmarketing 00:00:22.960 |
before? Benchmarketing is what you get in most of the benchmarks out there, even though nobody 00:00:32.560 |
calls their own benchmarks that. Let's take a quick look. Who is using vector search 00:00:37.200 |
today? I assume many. Who likes the performance of their vector search? Who has looked at various 00:00:46.960 |
performance benchmarks around vector search? Quite a few. How did you like those benchmarks? 00:00:51.920 |
Yes. That is, you have this "X is faster than Y". The problem is that you can have pretty much 00:01:01.920 |
any product in X or any product in Y. I think at this point, we have benchmarks for every 00:01:06.700 |
single vendor being both faster and slower than all of their competitors, which is a bit 00:01:12.360 |
of a funny sign for benchmarks. Why is that? Are benchmarks just bad? Is everybody lying? 00:01:19.960 |
How does it happen that benchmarks are so different? Don't be shy. Come up here. We have some more 00:01:25.160 |
chairs up here. Also, don't trust the glossy charts. The better looking the benchmarks are, 00:01:33.960 |
sometimes, the worse the underlying material is. Let's dive into what to do and what not 00:01:41.400 |
to do with benchmarks. Come up here. We have two more chairs up here. The first thing, and that's one of 00:01:48.360 |
the biggest ones, is finding the right use case. The right use case, of course, depends on your system. 00:01:55.560 |
This is one of my favorite benchmarking comics. It's about "under similar conditions": we are comparing two 00:02:02.120 |
systems, and one is doing much better than the other. That is pretty much exactly what many 00:02:09.000 |
companies are doing with their own vendor benchmarks. They define a scenario that they like, 00:02:16.040 |
and probably their competitors don't like so much, and then run the benchmark, and then afterwards they 00:02:21.320 |
will say how much better they are. This is like a very common pattern for benchmarks. Or to take it in 00:02:27.240 |
a slightly different direction. If you try to build a benchmarking dataset, and maybe you've seen that with 00:02:34.680 |
your own marketing team, it's like you run various scenarios, and then you see like this is no good, 00:02:40.440 |
this is no good, this is definitely no good, but here you have struck gold. Because here, 00:02:46.360 |
them is down here, and us is up here. So we will take this one, and then we'll generalize, and generalize, 00:02:55.400 |
and say like for everything, we're 40% faster than the competition. And you forget about all the others. 00:03:00.760 |
That is one of the very common things that you do with benchmarks. And I will just call out a couple 00:03:08.600 |
of things, especially for vector search. One of the things, depending on the underlying data structure 00:03:13.480 |
that you have and how the system is built: most of the benchmarks, because it's much easier 00:03:19.560 |
and also more reproducible, will be read-only, oftentimes in an already optimized format. The problem is, 00:03:25.640 |
most of our workloads are not read-only, but pretty much all the benchmarks that you see out there are 00:03:32.040 |
for read-only datasets. Because it's much easier to compare, versus if you have a specific dataset, 00:03:38.440 |
and then you would need to find the right read-write ratio, and there are so many parameters that you could 00:03:43.640 |
tweak, that most people just don't, and then you have a benchmark that is not really representative of 00:03:48.440 |
what you're trying to do. 00:03:56.440 |
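As a rough sketch of the difference, here is what a mixed read/write benchmark loop could look like, with hypothetical `index_doc` and `run_query` stand-ins for whatever client calls your system exposes; most published benchmarks effectively run only the `run_query` branch against a pre-built, optimized index.

```python
import random
import time

def mixed_workload(index_doc, run_query, docs, queries, read_ratio=0.8, n_ops=10_000):
    """Interleave queries and writes at a configurable ratio and record query latencies."""
    latencies = []
    for _ in range(n_ops):
        if random.random() < read_ratio:
            start = time.perf_counter()
            run_query(random.choice(queries))      # the read path
            latencies.append(time.perf_counter() - start)
        else:
            index_doc(random.choice(docs))         # the write path most benchmarks skip
    latencies.sort()
    # p50 and p99 query latency while the system is also ingesting data
    return latencies[len(latencies) // 2], latencies[int(len(latencies) * 0.99)]
```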
Another very counterintuitive thing about vector search is that, at least for HNSW, filtering makes things slower. So if you come from a regular data store, 00:04:04.200 |
and you have a restrictive filter that reduces the dataset, let's say like only 20% 00:04:09.400 |
of the things make it through the filter, and you then try to rank those, it will be faster. 00:04:13.800 |
The way vector search works, or at least approximate nearest neighbor search, is that the filter will 00:04:20.200 |
actually make it slower, because you need to look at a lot more candidates and then find what actually 00:04:24.760 |
remains after the filter. There are various optimizations for that, but it's another trick 00:04:29.640 |
that you can do, depending on how good or how bad you are with filters, that you just tune the right 00:04:35.160 |
scenario to find your sweet spot. Or that you have just built in some optimizations, and then you're 00:04:42.600 |
making sure that you have a scenario that hits exactly these optimizations that your competition doesn't have. 00:04:47.400 |
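As a minimal sketch of why restrictive filters hurt approximate search, assume a hypothetical `ann_search(n)` that returns the top-n approximate neighbours, best first, and a `passes_filter` predicate. Real engines implement filtered HNSW search more cleverly, but the over-fetching cost is the same idea:

```python
def filtered_ann_search(ann_search, passes_filter, k, max_candidates=10_000):
    """Naive post-filtering over an approximate nearest neighbour index."""
    n = k
    while True:
        # With a filter that only lets ~20% of documents through, we need to pull
        # roughly 5x as many candidates before k of them survive the filter.
        hits = [doc for doc in ann_search(n) if passes_filter(doc)]
        if len(hits) >= k or n >= max_candidates:
            return hits[:k]
        n *= 2  # over-fetch and retry: this extra work is the slowdown
```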
Another thing that I see quite frequently, especially as people update their benchmarks, 00:04:54.520 |
they normally always update their own software, but they don't update their competitors' versions. 00:04:58.760 |
We see that in various benchmarks where we are also present as a company: our version is probably 00:05:04.840 |
18 months old, and the competitor has something that came out in the last one or two months. 00:05:09.080 |
And then obviously, they will do quite a bit better there. 00:05:13.720 |
So yeah, you only update your own version and not the others. I can understand, you don't want to keep 00:05:20.920 |
up with the changes of all your competitors. You don't want to figure out what changed and how things are 00:05:26.680 |
configured the right way in a newer version. And you're mostly focused on the interesting things that 00:05:30.760 |
you're doing yourself. It is also convenient for your own performance. 00:05:35.080 |
There's also like a lot of implicit biases. It's like, you know how your system is built and what it is 00:05:43.640 |
built for and how it works well. And then you might often pick scenarios that work well for that use case. 00:05:50.760 |
And then it might not even be intentional, but then you might pick something that is just not a very good 00:05:56.040 |
fit for your competitors. The same goes for defaults: how you split up shards, how you allocate 00:06:02.760 |
memory, what instance size or instance configuration you pick. You might not even mean it in a bad way, 00:06:10.600 |
or you haven't even benchmarked the competitor against it on that specific machine. But you might just pick 00:06:15.640 |
something that you know that works well for you. And then conveniently, it's not so great for your 00:06:20.280 |
competitors. It could be like the shard size, the total data size, what fits into memory, how you compress 00:06:26.440 |
something, the overall memory settings, the general data access patterns. There are a ton of different 00:06:32.600 |
ways you could tweak that one way or another. And then there is obviously cheating. My favorite example 00:06:43.240 |
for that is Volkswagen a while ago: they had some creative ways to look good in the 00:06:52.600 |
benchmarks for like how much exhaust they were producing. I think there was even a fun project 00:06:59.880 |
that was like a Volkswagen plugin for your CI server that would just always make your tests pass, 00:07:05.640 |
which is kind of like the same thing here. And cheating can be many things. For vector search 00:07:12.360 |
specifically, especially when you do approximate nearest neighbor search, where, depending on how you 00:07:19.480 |
pick the parameters and like how widely you search, the quality of the results will 00:07:27.640 |
be different, or at least potentially different. Precision and recall are normally what we measure for 00:07:32.680 |
information retrieval. And if you have different parameters or different implementations, and you 00:07:37.080 |
don't include those in the benchmark, you're comparing totally different things. And then it's not totally 00:07:43.480 |
uncommon to see that precision and recall are, whether intentionally or unintentionally, 00:07:49.560 |
skipped in performance benchmarks, so you see wildly different numbers, while the quality of the results is actually 00:07:57.480 |
totally different. Obviously, you can produce crap results 00:08:02.120 |
very quickly. But that is probably not the point of what you do in your benchmarks. We actually had 00:08:07.960 |
some of those ourselves: people who were less experienced with vector search benchmarks forgot 00:08:13.240 |
about that quality attribute, because for other types of searches, for example with a boolean filter 00:08:19.080 |
where something either passes or doesn't, this concept doesn't exist, and they almost published something 00:08:25.000 |
without looking at precision and recall. And it did make quite a bit of a difference, even in our own benchmarks. 00:08:29.800 |
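A common way to keep that quality attribute visible in your own benchmarks is to report recall@k against an exact brute-force ground truth next to every latency number; a minimal sketch:

```python
def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the true top-k neighbours that the system under test returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

# A configuration that is twice as fast but only finds 8 of the true top 10
# is not comparable to one that finds all 10, so report recall next to latency.
print(recall_at_k([1, 2, 3, 4, 5, 6, 7, 8, 98, 99], list(range(10))))  # 0.8
```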
I sometimes see creative statistics. So one thing that I've recently seen that was very 00:08:38.360 |
funny is: there were, I think, 20 data points or so, 20 different things measured. And in 18 or 19 of them, 00:08:44.920 |
two systems were quite similar. And then one system basically introduced some optimization for one of the use 00:08:50.840 |
cases and made it like 10 times faster than the competitor. And then they found a way to kind of 00:08:56.360 |
like even out the statistics across all of them. And then they said, like, overall, we're way faster. 00:09:01.240 |
But they found this one weird edge case, I think it was like, instead of descending sorting, 00:09:07.480 |
it was ascending sorting or something like that on one specific data type. And that one was much faster. 00:09:13.320 |
And then they evened it out enough that it looked like overall, the systems were wildly different. But it was 00:09:17.720 |
basically one specific query among the many things they looked at. 00:09:24.120 |
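To see how that kind of evening-out works, here is an illustrative calculation, not necessarily the exact statistic that benchmark used: with 19 ties and one cherry-picked 10x win, a plain average already yields an impressive-sounding overall number.

```python
import math

# 19 of 20 measured operations are a tie (speedup ratio 1.0); one edge case is 10x faster.
ratios = [1.0] * 19 + [10.0]

arithmetic_mean = sum(ratios) / len(ratios)              # 1.45, "45% faster overall!"
geometric_mean = math.prod(ratios) ** (1 / len(ratios))  # ~1.12, the gap mostly disappears
print(round(arithmetic_mean, 2), round(geometric_mean, 2))
```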
And then you have the headline, because that's mostly what you're going for with benchmarks, right? You want that 00:09:28.200 |
"five times faster than X" headline. Again, coming back to the glossy charts, those are the things that 00:09:34.360 |
you do. And it's kind of like the money shot in benchmarking. And of course, then there are the 00:09:42.840 |
problems when the results don't reproduce, or you don't publish all the pieces so that anybody can actually run it 00:09:47.720 |
to try it out. Those don't make it easy either. So how do we make benchmarks better or more meaningful? 00:09:56.280 |
They should definitely be automated and reproducible. So what we internally do, for example, we have a 00:10:03.640 |
very wide set of, we call them the nightly benchmarks. That's why up here it says nightly. I just pulled 00:10:10.680 |
this one out because it was nice. We optimized one specific thing, I think like, I don't know, 00:10:15.880 |
100x or 10x or whatever the ratio was. But we optimized this, which was nice. But the important thing here is, 00:10:22.760 |
we run this benchmark every night. And then we put all the changes that we had during that day together 00:10:28.120 |
to see if the performance changes. Why are we doing that? To avoid the slow boiling frog problem. 00:10:34.520 |
Who knows the slow boiling frog? It's a very French thing. It's like, you know, you throw the frog into 00:10:42.440 |
the boiling water and the frog will jump out again. And you put the frog into the cold water and slowly 00:10:47.240 |
heat it up and the frog will stay in the water. And the same thing kind of like applies to benchmarks. 00:10:52.600 |
If you make a change today, and it makes something 1% slower, you're probably not going to see it. And 00:10:57.160 |
next week, you make another change where it gets 2% slower. And over time, your performance just 00:11:01.560 |
decreases a lot. So you're the frog sitting in the warming up water and being boiled. So you want to 00:11:08.760 |
avoid that with your own system: you don't want to boil your own performance. So you want to have 00:11:13.720 |
nightly benchmarks where you see how things are changing and evolving over time. 00:11:21.960 |
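Here is a sketch of the kind of check a nightly run can apply, with placeholder numbers and threshold: compare tonight's result against a baseline window instead of only against yesterday, so small regressions cannot slip in one degree at a time.

```python
def check_regression(recent_p99s, tonight_p99, threshold=0.03):
    """Flag tonight's run if its p99 latency regressed more than `threshold`
    against the median of the recent nightly runs."""
    baseline = sorted(recent_p99s)[len(recent_p99s) // 2]  # median of the window
    change = (tonight_p99 - baseline) / baseline
    return change > threshold, change

# 1% tonight and 2% next week each look harmless on their own; compared against
# the baseline window, the drift shows up.
regressed, change = check_regression([0.100, 0.101, 0.099, 0.102, 0.100], 0.106)
print(regressed, f"{change:+.1%}")  # True +6.0%
```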
The second thing is a bit unfortunate: you will need to do your own benchmarks. Don't trust anybody's benchmarks, because 00:11:29.560 |
probably nobody has done exactly the scenario that you want to have. You need to know: 00:11:34.360 |
this is my data size, this is my data structure, this is the read-write ratio, this is what 00:11:40.520 |
exactly my queries look like. And this is the latency that I am willing to accept. And this is the type of 00:11:45.240 |
hardware that I have. There are so many ways that make the existing benchmarks probably not 100% 00:11:51.400 |
meaningful for you, that if you want to really be sure, there is no way around doing that yourself. 00:11:56.840 |
Otherwise, you will buy into somebody's glossy benchmarks, and you will have to trust them and hope for the best. 00:12:04.840 |
That's unfortunate, because it always means work. And you need to put in that work. But there is no easy way around that. 00:12:11.400 |
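One way to make that work explicit is to write the scenario down before you benchmark anything; here is a hypothetical sketch of the parameters just listed, with placeholder values you would replace with your own.

```python
from dataclasses import dataclass

@dataclass
class WorkloadSpec:
    """The scenario that published benchmarks almost never match exactly."""
    corpus_size: int = 10_000_000             # number of documents / vectors
    vector_dims: int = 768                    # embedding dimensionality
    read_write_ratio: float = 0.9             # fraction of operations that are queries
    query_shape: str = "kNN + tenant filter"  # what your real queries look like
    p99_latency_budget_ms: float = 50.0       # the latency you are willing to accept
    recall_floor: float = 0.95                # minimum acceptable recall@10
    hardware: str = "8 vCPU / 32 GB RAM"      # the machines you will actually run on
```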
We have a tool we call Rally. It's basically: you create a so-called track, and then 00:12:18.360 |
you run the track where you say, this is the data, this is the query. And then you can optimize that. And then 00:12:23.240 |
you can tune the settings or the hardware or whatever else. It's only against us and ourselves. This is how we 00:12:29.000 |
would do nightly benchmarks with ourselves. But that is how you can then benchmark and figure out, 00:12:34.120 |
for my hardware, what can I actually get out in terms of performance? And then you will need to take 00:12:38.360 |
whatever other tools you want to evaluate and do something similar to see, like, for exactly your 00:12:43.400 |
workload, how does that compare? One final thing, and that is also what I need to keep repeating internally 00:12:51.720 |
for ourselves. If you find the flaw in a benchmark, it's easy to call the entire benchmark crap and then 00:12:58.440 |
ignore everything that comes out of it. What is much smarter is if you look at it and still figure 00:13:03.400 |
out, can I learn something from that? Even if it's like somebody produces a bad benchmark against a 00:13:09.560 |
competitor, it can tell you, like, where they think their sweet spot is or what is the scenario that they 00:13:14.520 |
pick for themselves or what they try to highlight. Because even then you can learn, like, what are the 00:13:20.040 |
potential strengths of a system? Where do they shine? What works well and what doesn't work well? So all of that 00:13:25.640 |
makes it easier to find something useful if you want to look. But it's, of course, much easier to say, 00:13:31.560 |
like, all of this is crap and we'll ignore it and we'll call it BS. And then you move on. 00:13:35.880 |
I hope this was useful. I'm afraid you will have to do your own benchmarks to get proper results. 00:13:44.200 |
Otherwise, you will have to believe the benchmarketing. Let me take a quick picture with you so I can 00:13:50.200 |
prove to my colleagues that I've been working today. Can you all wave and say benchmarking? 00:13:59.880 |
Thanks a lot. Enjoy the heat and I'll hand over to the next speaker.