I'm Philippe. Let's talk a bit about benchmarking, or rather benchmarketing. Who has heard the term benchmarketing before? Benchmarketing is what you get in most of the benchmarks out there, even though nobody calls their own benchmarks that. Let's take a quick look. Who is using vector search today? I assume many of you.
Who likes the performance of their vector search? Who has looked at various performance benchmarks around vector search? Quite a few. How did you like those benchmarks? Yes, it's always the same claim: X is faster than Y. The problem is that you can put pretty much any product in for X and any product in for Y.
I think at this point we have benchmarks showing every single vendor as both faster and slower than each of their competitors, which is a bit of a funny sign for benchmarks. Why is that? Are benchmarks just bad? Is everybody lying? How does it happen that benchmarks turn out so different?
Don't be shy. Come up here. We have some more chairs up here. Also, don't trust the glossy charts. The better looking the benchmarks are, sometimes, the worse the underlying material is. Let's dive into what to do and what not to do with benchmarks. Come up here. We have two more chairs up here.
The first thing, and it's one of the biggest ones, is finding the right use case. The right use case, of course, depends on your system. This is one of my favorite benchmarking comics: "under similar conditions", we are comparing two systems, and one is doing much better than the other system.
That is pretty much exactly what many companies are doing with their own vendor benchmarks: define a scenario that they like, and that their competitors probably don't like so much, then run the benchmark, and afterwards say how much better they are. This is a very common pattern for benchmarks.
Or, to take it in a slightly different direction: if you try to build a benchmarking dataset, and maybe you've seen this with your own marketing team, you run various scenarios and see that this one is no good, this one is no good, this one is definitely no good, but here you have struck gold.
Because here, "them" is down here and "us" is up here. So we take this one, and then we generalize, and generalize, and say that for everything, we're 40% faster than the competition. And we forget about all the others. That is one of the very common things people do with benchmarks.
And I will just call out a couple of things, especially for vector search. One of them depends on the underlying data structure you have and how the system is built: most benchmarks, because it's much easier and also more reproducible, will be read-only, oftentimes on data in a fully optimized format.
The problem is that most of our workloads are not read-only, but pretty much all the benchmarks you see out there are for read-only datasets. That's because it's much easier to compare; with a realistic dataset you would need to find the right read-write ratio, and there are so many parameters you could set that most people just don't, and then you have a benchmark that is not really representative of what you're trying to do.
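To make that concrete, here is a minimal Python sketch of what making the read-write ratio an explicit parameter could look like. The `client`, `index`, and `search` names are placeholders for whatever your real client exposes; the point is only that the mix is something you choose deliberately instead of defaulting to a read-only, fully optimized index.

```python
import random

def run_mixed_workload(client, docs, queries, write_ratio=0.2,
                       operations=10_000, seed=42):
    """Interleave writes and searches at a configurable ratio.

    `client.index()` and `client.search()` are hypothetical stand-ins
    for whatever your real client exposes.
    """
    rng = random.Random(seed)
    for _ in range(operations):
        if rng.random() < write_ratio:
            client.index(rng.choice(docs))      # write path: ingest one document
        else:
            client.search(rng.choice(queries))  # read path: run one query
```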
Another, very counterintuitive thing about vector search is that, at least for HNSW, filtering makes things slower. If you come from a regular data store, a restrictive filter reduces the dataset; let's say only 20% of the documents make it through the filter, you then rank only those, and the query gets faster.
The way vector search, or at least approximate nearest neighbor search, works, the filter will actually make it slower, because you need to look at a lot more candidates to find what actually remains after the filter. There are various optimizations for that, but it's another trick you can play: depending on how good or how bad you are with filters, you just tune the right scenario to hit your sweet spot.
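As a rough illustration, not any particular engine's implementation, here is a small Python sketch of the post-filtering problem: if the filter only lets about 20% of documents through and is applied after candidate retrieval, the engine has to pull roughly k divided by the selectivity candidates to still return k results, which is where the extra latency comes from.

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(100_000, 128))       # toy corpus of 100k vectors
query = rng.normal(size=128)
passes_filter = rng.random(100_000) < 0.2    # only ~20% survive the filter

def top_candidates(n):
    """Stand-in for the ANN index: top-n documents by dot-product similarity."""
    return np.argsort(-(docs @ query))[:n]

k = 10
filtered = [i for i in top_candidates(k) if passes_filter[i]]
print(f"{len(filtered)} of {k} results survive the filter")   # typically only ~2

# To still return k filtered results you need roughly k / 0.2 = 5x more
# candidates, i.e. the engine has to explore a much larger part of the graph.
oversampled = [i for i in top_candidates(5 * k) if passes_filter[i]][:k]
print(f"{len(oversampled)} results after oversampling")
```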
Or you have built in some optimizations, and then you make sure you have a scenario that hits exactly those optimizations that your competition doesn't have. Another thing I see quite frequently, especially as people update their benchmarks: they always update their own software, but they don't update their competitors'.
We see that in various benchmarks where we are also present as a company: our version is maybe 18 months old, and the competitor has something that came out in the last one or two months. Then, obviously, they will do quite a bit better there. So yes, you only update your own version and not the other ones.
I can understand, you don't want to keep up with the changes of all your competitors. You don't want to figure out like what changed and how are things configured the right way in a newer version. And you're mostly focused on the interesting things that you're doing yourself. It is also convenient for your own performance.
There are also a lot of implicit biases. You know how your system is built, what it is built for, and where it works well. So you might often pick scenarios that work well for that use case. It might not even be intentional, but you might pick something that is just not a very good fit for your competitors.
The same goes for defaults: how you split up shards, how you allocate memory, what instance size or instance configuration you pick. You might not even mean it in a bad way, or you might not even have benchmarked the competitor on that specific machine. You just pick something that you know works well for you.
And then, conveniently, it's not so great for your competitors. It could be the shard size, the total data size, what fits into memory, how you compress things, the overall memory settings, the general data access patterns. There are a ton of different ways you could tweak that one way or another.
And then there is obviously cheating. My favorite example is Volkswagen a while ago: they had some creative ways to look good in the benchmarks for how much exhaust they were producing. I think there was even a fun project, a Volkswagen plugin for your CI server, that would just always make your tests pass, which is kind of the same thing here.
And cheating can be many things. For vector search specifically, especially when you do approximate nearest neighbor search, the quality of the results will be different, or can potentially be different, depending on how you pick the parameters and how widely you search. So precision and recall are normally what we measure in information retrieval.
If you have different parameters or different implementations, and you don't include those in the benchmark, you're comparing totally different things. And it's not totally uncommon to see precision and recall, I'm not sure if intentionally or unintentionally, skipped in favor of performance, so you see wildly different performance numbers, but the results, or rather their quality, are actually totally different.
Obviously, you can produce crap results very quickly. But that is probably not the point of your benchmark. We actually had some of those ourselves: people who were less experienced with vector search benchmarks forgot about that quality attribute, because for other types of search, for example a boolean filter where a document either passes or it doesn't, this dimension doesn't exist, and they almost published something without looking at precision and recall.
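If you want to guard against that yourself, the check is small. Here is a sketch, with synthetic data and a deliberately crippled "approximate" search standing in for an aggressively tuned ANN engine, of comparing approximate results against an exhaustive search and reporting recall@k next to the latency numbers:

```python
import numpy as np

def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k that the approximate search also returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

rng = np.random.default_rng(0)
docs = rng.normal(size=(50_000, 128))
query = rng.normal(size=128)

exact = np.argsort(-(docs @ query))[:10]     # ground truth: exhaustive top-10

# Stand-in for an aggressively tuned ANN search: only half the corpus gets
# scored, which is faster but misses some of the true nearest neighbors.
sample = rng.choice(len(docs), size=len(docs) // 2, replace=False)
approx = sample[np.argsort(-(docs[sample] @ query))[:10]]

print("recall@10:", recall_at_k(approx, exact, 10))
```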
And checking that quality did make quite a bit of a difference, even in our own benchmarks. I also sometimes see creative statistics. One thing I saw recently that was very funny: there were, I think, 20 data points, 20 different things being measured, and in 18 or 19 of them the two systems were quite similar.
Then one system introduced an optimization for one of the use cases and made it something like 10 times faster than the competitor. And then they found a way to even out the statistics across all of the measurements and said: overall, we're so much faster.
But they had found this one weird edge case, I think it was ascending instead of descending sorting, or something like that, on one specific data type. That one was much faster. And they evened it out enough that it looked like, overall, the systems were wildly different.
But it was basically one specific query out of the many things they looked at. And then you have the headline, because that's mostly what you're going for with these benchmarks, right? You want the "five times faster than X" headline. Again, coming back to the glossy charts, those are the things people do.
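With made-up numbers, the trick looks roughly like this: 19 scenarios where the two systems are tied plus one 10x outlier, and the choice of average decides the headline.

```python
import statistics

speedups = [1.0] * 19 + [10.0]   # per-scenario speedup of "us" vs "them"

print("arithmetic mean:", statistics.fmean(speedups))           # 1.45 -> "45% faster overall!"
print("geometric mean: ", statistics.geometric_mean(speedups))  # ~1.12
print("median:         ", statistics.median(speedups))          # 1.0 -> tied almost everywhere
```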
That headline is kind of the money shot in benchmarking. And of course, then there are the problems when results don't reproduce, or when you don't publish all the pieces so that anybody can actually run the benchmark and try it out. Those don't make it easy either. So how do we make benchmarks better, or more meaningful?
They should definitely be automated and reproducible. What we do internally, for example, is a very wide set of what we call the nightly benchmarks. That's why up here it says nightly. I just pulled this one out because it's a nice example: we optimized one specific thing, by 100x or 10x or whatever the ratio was.
The optimization was nice, but the important thing here is that we run this benchmark every night, with all the changes from that day put together, to see if the performance changes. Why are we doing that? To avoid the slowly boiling frog problem.
Who knows the slow boiling frog? It's a very French thing. It's like, you know, you throw the frog into the boiling water and the frog will jump out again. And you put the frog into the cold water and slowly heat it up and the frog will stay in the water.
And the same thing kind of like applies to benchmarks. If you make a change today, and it makes something 1% slower, you're probably not going to see it. And next week, you make another change where it gets 2% slower. And over time, your performance just decreases a lot. So you're the frog sitting in the warming up water and being boiled.
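Here is a minimal sketch of the kind of nightly check that keeps you out of the pot, assuming you store one latency number per nightly run (the file name and format here are made up): compare tonight's result against a longer-term baseline, not just against yesterday, so the 1% steps still add up to a visible failure.

```python
import json
import statistics

def check_regression(history_ms, tonight_ms, window=30, threshold=1.10):
    """Fail if tonight is >10% slower than the median of the last `window` nights."""
    baseline = statistics.median(history_ms[-window:])
    if tonight_ms > baseline * threshold:
        raise SystemExit(f"regression: {tonight_ms:.1f} ms vs {baseline:.1f} ms baseline")

if __name__ == "__main__":
    # hypothetical results file: one p99 latency (in ms) per nightly run
    with open("nightly_p99_ms.json") as f:
        history = json.load(f)
    check_regression(history[:-1], history[-1])
```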
So you want to avoid that with your own system: you don't want to boil your own performance. That's why you want nightly benchmarks, where you see how things are changing and evolving over time. Second, and this is a bit of an unfortunate thing, you will need to do your own benchmarks.
Don't trust anybody else's benchmarks, because probably nobody has run exactly the scenario that you care about. You need to know: this is my data size, this is my data structure, this is my read-write ratio, and this is what exactly my queries look like.
And this is the latency I am willing to accept, and this is the type of hardware I have. There are so many ways in which existing benchmarks are probably not 100% meaningful for you that, if you really want to be sure, there is no way around doing it yourself.
Otherwise, you are buying into somebody's glossy benchmarks, and you will have to trust them and hope for the best that your system actually behaves the same way. That's unfortunate, because it always means work, and you need to put in that work. But there is no easy way around it. We have a tool we call Rally.
Basically, you create a so-called track, and then you run that track, where you say: this is the data, these are the queries. And then you can optimize, tune the settings or the hardware or whatever else. It only runs against our own system; this is how we do our nightly benchmarks ourselves.
But that is how you can then benchmark and figure out, for my hardware, what can I actually get out in terms of performance? And then you will need to take whatever other tools you want to evaluate and do something similar to see, like, for exactly your workload, how does that compare?
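Whatever tool you end up using, the core of "do it yourself" is small: run your own queries against your own setup and look at the percentiles you care about. A bare-bones sketch, where `run_query` stands in for whatever call your client makes:

```python
import statistics
import time

def measure(run_query, queries, warmup=100):
    """Time each query and report latency percentiles, skipping warm-up runs."""
    latencies_ms = []
    for i, q in enumerate(queries):
        start = time.perf_counter()
        run_query(q)
        elapsed = (time.perf_counter() - start) * 1000
        if i >= warmup:                      # ignore cold-cache warm-up iterations
            latencies_ms.append(elapsed)
    cuts = statistics.quantiles(latencies_ms, n=100)
    print(f"p50={cuts[49]:.1f} ms  p95={cuts[94]:.1f} ms  p99={cuts[98]:.1f} ms")
```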
One final thing, and it's something I need to keep repeating internally for ourselves as well: if you find a flaw in a benchmark, it's easy to call the entire benchmark crap and then ignore everything that comes out of it. What is much smarter is to look at it and still figure out: can I learn something from this?
Even if it's like somebody produces a bad benchmark against a competitor, it can tell you, like, where they think their sweet spot is or what is the scenario that they pick for themselves or what they try to highlight. Because even then you can learn, like, what are the potential strengths of a system?
Where do they shine? What works well and what doesn't work well? So all of that makes it easier to find something useful if you want to look. But it's, of course, much easier to say, like, all of this is crap and we'll ignore it and we'll call it BS.
And then you move on. I hope this was useful. I'm afraid you will have to do your own benchmarks to get proper results. Otherwise, you will have to believe the benchmarketing. Let me take a quick picture with you so I can prove to my colleagues that I've been working today.
Can you all wave and say benchmarking? Thanks a lot. Enjoy the heat and I'll hand over to the next speaker.