
Information Retrieval from the Ground Up - Philipp Krenn, Elastic


Transcript

Let's get going. Audio is okay for everybody? I have some slight feedback, but I'll try to manage. I hope it's okay for you. Hi, I'm Philipp. Let's talk a bit about retrieval. I'll show you some retrieval from the ground up. We'll keep it pretty hands-on. You will have a chance to follow along and do everything that I show you as well.

I have a demo instance that you can use. Or you can just watch me. If you have any questions, ask at any moment. If anything is too small, reach out and we'll try to make it larger. We'll try to adjust as we go along. So, I guess we're not over RAG yet.

But RAG is a thing. And we'll focus on the R in RAG, retrieval augmented generation. We'll just focus on the retrieval. Let's see where we are with retrieval. Quick show of hands. Who has done RAG before? Okay. That's about half or so. Who has done anything with vector search and RAG?

Do I need vector search for RAG or can I do anything else? Yeah. Yeah. So, you can do other things. Retrieval is actually a very old thing. Depending on how you define it, it might be 50, 70, whatever years old. It's just getting the right context to the generation. I'll ignore all the generation for today.

We'll keep it very simple. We'll just focus on the retrieval part of getting the right information in. Partially from the old stuff, like the classics. But we'll get to some new things as well as we go along. Who has done keyword search before? Just that is fewer than vector search, I feel like.

Which almost reminds me of like 15 years ago or so, when NoSQL came up, like more people had done MongoDB, Redis, whatever else, rather than SQL. That has changed again. I think it will be kind of similar for retrieval. The way I would always put it is that vector search is a feature of retrieval.

It's only one of multiple features or many features that you want in retrieval. And we'll see a bit why and how and we'll dive into those details. So, I work for Elastic, the company behind Elasticsearch. We're the most downloaded, deployed, whatever else, search engine. We do vector search, we do keyword search, we do hybrid search.

We'll dive into various examples. Everything that I will show you works -- well, the query language is Elasticsearch. But if you use anything built on Apache Lucene, everything behaves very similarly. If you use something that is a clone or close to Lucene, like anything built on Tantivy or anything like that, it will be very similar.

The foundation, keyword search and vector search will apply broadly everywhere. So, let's get going. We'll keep this pretty hands on. Who remembers in Star Wars when he's making that hand gesture? What is the quote? These are not the droids you're looking for. We'll keep this relatively Star Wars based.

Feel free to come in and filter in on the sides or whatever. I'm afraid we have, I think, one chair over there and one down there. Otherwise, it's getting a bit full. Okay. Let's look at what "These are not the droids you're looking for" does for search.

And I will start kind of like with the classic approach. Keyword search or lexical search is like you search for the words that you have stored and we want to find what is relevant in our examples. If you want to follow along, there is a gist which has all the code that I'm showing you.

So, let's go to the last slash AI dot engineer. There is one important thing. It's I have one shared instance basically for everybody. So, you can all just use this without signing up for any accounts or anything. So, this is just a cloud instance that you can use. There is my handle.

It's in the index name. If you don't want to fight and overwrite each other's data, replace that with your unique handle or something that is specific to you. Because otherwise, you will all work on the same index and kind of like overwrite each other's data. You can also just watch me.

If you don't have a computer handy, that's fine. But if you want to follow along, last slash AI dot engineer, there will be a gist. It will have the connection string. Like there is a URL and then the credentials are workshop, workshop. If you go to log in, it will say log in with Elasticsearch.

That's where you use workshop, workshop. Then you will be able to log in. And you can just run all the queries that I'm showing you. You can try out stuff. If you have any questions, shout. I have a couple of colleagues dispersed in the room. So, if we have too many questions, we will somehow divide and conquer.

So, let's get going and see what we have here. And I will show you most of the stuff live. I think this is large enough in the back row. If it's not large enough for anybody, shout and we will see how much larger I can make this. And let me turn off the Wi-Fi and hope that my wired connection is good enough.

Let's refresh to see. Ooh. Maybe we will use my phone after all. Okay. Let's try this again. Okay. This is no good. Out you go. Okay. Hardest problem of the day solved. We have network. Okay. So, we have the sentence. These are not the droids you are looking for.

And we will start with the classic keyword or lexical search. Like, what happens behind the scenes? So, what you generally want to do is you basically want to extract the individual words and then make them searchable. So, here, I'm not storing anything. I'm just looking at, like, how would that look like if I stored something?

I'm using this underscore analyze endpoint to see what I will actually store in the background to make them searchable. So, these are not the droids you are looking for. And you see, these are not the droids you are looking for. In Western languages, the first step that happens is the tokenization.
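
If you want to follow along at this point, the call is roughly this; the standard analyzer is just the default and no index is needed yet:

```
GET _analyze
{
  "analyzer": "standard",
  "text": "These are not the droids you are looking for"
}
```

The response lists each token with its start_offset, end_offset, type, and position, which is what we'll look at next.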

In Western languages, it's pretty simple. It's normally any white spaces and punctuation marks where you just break out the individual tokens. Especially Asian languages are a bit more complicated around that. But we will gloss over that for today. And we have a couple of interesting pieces of information here.

So, we have the token. So, this is the first token. We have the start offset and the end offset. Why would I need a start and end offset? Why would I extract and then store that potentially? Any guesses? Yeah? Yes. Especially if you have a longer text, you would want to have that highlighting feature that you want to say, this is where my hit actually was.

So, if I'm searching for these, which is maybe not a great word, but you would very easily be able to highlight where you had actually the match. And the trick that you're doing in search, generally what differentiates it from a database is a database just stores what you give it and then does basically almost everything at query or search time.

Whereas a search engine does a lot of the work at ingestion or when you store the data. So, we break out the individual tokens. We calculate these offsets and store them. So, whenever we have a match afterwards, we never need to reanalyze the actual text, which could potentially be multiple pages long.

But we could just highlight where we have that match because we have extracted those positions. We have a position. Why would I want to store the position with the text that I have? Yeah? Annotation. So, the main use case that you have is if you have these positions and later on, we'll briefly look at if you want to look for a phrase, if you want to look for this word followed by that word.

So, you could then just look for all the text that contain these words. But then you could also just compare the positions and basically look for n, n plus 1, et cetera. And you never need to look at the string again. But you can just look at the positions to figure out like this was one continuous phrase.

Even if you have broken it out into the individual tokens. Most of the token types that we see here are alphanum, for alphanumeric. An alternative type would be synonym. We'll skip over synonym definitions because it's not fun to define tons of synonyms. But this is all the things that we're storing here in the background.

You can also customize this analysis. And that is one of the features, again, of full text search and lexical searches that you preprocess a lot of the information to make that search afterwards faster. So, here you can see I'm stripping out the HTML because nobody's going to search for this emphasis tag.

I use a standard tokenizer that breaks up, for example, on dashes. You will see that. Alternatives would be white space that you only break up on white spaces. I lowercase everything, which is most of the times what you want because nobody searches in Google with proper casing or at least maybe my parents.

But nobody else searches with proper casing in Google. We remove stop words. We'll get to stop words in a moment. And we do stemming with the Snowball stemmer. What stemming does is basically reduce a word down to its root. So, you don't care about singular, plural, or the inflection of a verb anymore.

But you really care more about the concept. So, if I run through that analysis, does anybody want to guess what will remain of this phrase or which tokens will be extracted and in what form? Not a lot will remain. Two? Droid and Look? Yeah, close. So, we'll actually have three.

So, we have Droid, You and Look. And you can see all the others were stop words which were removed. The stemming reduced looking down to look because we don't care if it looks, looking, look. We just reduce it to the word stem. So, we do this when we store the data.
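
For reference, the ad-hoc version of that whole chain looks roughly like this; the filters are the built-in ones just mentioned, and the <em> tag is only there to show the HTML stripping:

```
GET _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase", "stop", "snowball"],
  "text": "These are <em>not</em> the droids you are looking for"
}
```

That should leave just droid, you, and look as tokens.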

And by the way, when you search afterwards, your text will run through the same analysis, so that you have exact matches. So, you don't need to do anything like a LIKE search anymore in the future. So, this will be much more performant than anything that you would do in a relational database because you have direct matches.

And we'll look at the data structure behind it in a moment. But what we get is Droid, You and Look with the right positions. So, for example, if we search for Droid, You, we could easily retrieve that because we have the positions, even though that is a weird phrase.

Do we start indexing at zero or one? Zero, yes. It's the only right way. That's a different discussion, though. So, the positions start at zero. And these are the tokens that are remaining. If you do this for a different language -- like, you might hear I'm a native German speaker.

This is the text in German. And you would, if you use a German analyzer, it would know the rules for German and then would analyze the text in the right way. So, then you would have remaining Droid, den, such. Anybody wants to guess what happens if I have the wrong language for a text?

It will go very poorly. Because the -- so, how this works is, basically, you have rules for every single language. It's like, what is the stop word? How does stemming work? If you apply the wrong rules, you basically just get wrong stuff out. So, it will not do what you want.

So, what you get here is, like, this is an article. But, well, in English, the rule is an S at the end just gets stemmed away, even though this doesn't make any sense. So, you apply the wrong rules and you just produce pretty much garbage. So, don't do that.

Just to give you another example, French, this is the same phrase in French. And then you see Droid, La, and Recherche are the words that are remaining in these examples. Otherwise, it works the same. But you need to have the right analysis for what you're doing. Otherwise, you'll just produce garbage.

A couple of things as we're going along. The stop word list, by default, which you could override, is relatively short. Linguists have spent many years figuring out what the right list of stop words is. And you don't want to have too many or too few. In English, I always forget, I think it's 33 or so.

This is where you can find it in the source code. It's -- I don't want to say well hidden, but it's not easy to find either. So, every language has a list of stop words that are defined and that will be automatically removed.

For "These are not the droids you are looking for", by accident, more or less, we had a lot of stop words, which is why not a lot remained of the phrase. And then for all other languages, you will have a similar list of stop words. Should you always remove stop words? Yes, no? Yes? That is, by the way, a very good comment -- I'm not sure if everybody heard that.

The comment was about "not". One important thing here: we're talking about lexical or keyword search, which is dumb, but scalable. It doesn't understand whether there is a droid or there's no droid. "Not" is just defined as a stop word. It does just keyword matching. In vector search or anything with a machine learning model behind it, it will be a bit of a different story afterwards, where these things might make a difference.

But this is very simple, because it just matches on strings, basically. It doesn't understand the context. It doesn't know what's going on. That's why the linguists decided that "not" is a good stop word. You could override that if, for your specific use case, this is not a good idea.

Always removing stop words, yes, no, maybe. So, our favorite phrase is it depends. And then you have to explain, like, what it depends on. So, what it depends on is there are scenarios where removing all stop words does not give you the desired result. And maybe you want to have, like, a text with and without stop words.

Like, sometimes stop words are just a lot of noise that blows up the index size and doesn't really add a lot of value. That's why they are defined and removed by default. But if you had, for example, "to be or not to be", these are all stop words.

It would all be gone when you run it through analysis. So, it is tricky to figure out, like, what is the right balance for stop words or what works for your use case. But you might have unexpected surprises in all of this. Okay. We have seen the German examples.

Let's do some more queries. Or let's actually store something. So far, we only pretended or we only looked at what would happen if we would store something. Now, I'm actually creating an index. Again, if you're running this yourself, please use a different name than me. Just replace all my handle instances with your handle or whatever you want.

Since this is a shared instance. If you have too many collisions, I might jump to another instance that I have as a backup in the background. But what I'm doing here is I'm creating this analysis pipeline that I have looked at before. Like, I'm throwing out the HTML. I use a standard tokenizer, lower casing, stop word removal, and stemming.

And then I call this my analyzer. And then I'm basically applying this my analyzer on a field called "quote". We call this a mapping. It's kind of like the equivalent of a schema in a relational database. But this defines how different fields behave. Okay. And somebody did not replace the query.

By the way, you need to keep user_. Let me quickly do this myself. Oops. I should have seen this one coming. We want to replace that, and we'll use... oops, oops. Please don't copy that. Let's try it again. So we're creating our own index.
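
If you're reconstructing this from the transcript, the index creation looks roughly like this; the index name below is a placeholder you should swap for your own handle, and my_analyzer is just the name given to the custom analyzer:

```
PUT /user_yourhandle
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "snowball"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "quote": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
```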

And now, just to double check, I'll again run this _analyze against the field that I've set up, to make sure I've set it up correctly. And now I'm actually starting to store documents. Bless you. So we'll store -- these are not the droids you're looking for.

I have two others that I'll index just so we have a bit more to search. No, I am your father. Any guesses what will remain here? Father. Father. Yeah. Okay. Let's try this out. Let me copy my -- this one actually has way fewer stop words than you would expect.

Let's quickly do this. Since I didn't do the HTML removal, let's take these out manually. So what you get is, no, I am your father. And this was stupid because this was not what I wanted. We need to run this against the right analysis. This happens when you copy-paste.

Okay, um, sorry. And we'll do text. Now, I think I've patched this back together. Okay: I am your father. So "no" is the only stop word in this list, actually. "No" is on the stop word list, all the others are not. Um, okay, let's try another one. Obi-Wan never told you what happened to your father.

How many tokens will Obi-Wan be? Two? One? Obi-Wan will be two. Because we use the default or standard tokenizer, that one breaks up at dashes. If you had used another tokenizer like whitespace, that would keep it together because it only breaks up on whitespace.

So there are various reasons why you want or would not want to do it. I don't want to go into all the details. But there are a lot of things to do right or wrong when you ingest the data, which will then allow you to query the data in specific ways.

So, for example, if you would have an email address, that one is also weirdly broken up. Like, you might use, like, there's a dedicated tokenizer for URL and email addresses. So, depending on what type of data you have, you will need to process the data the right way because pretty much all the smart pieces are kind of like an ingestion here to make the search afterwards easier.
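
To see the difference yourself, you can compare tokenizers directly; the email address below is just a made-up example for the dedicated URL/email tokenizer:

```
GET _analyze
{
  "tokenizer": "standard",
  "text": "Obi-Wan never told you"
}

GET _analyze
{
  "tokenizer": "whitespace",
  "text": "Obi-Wan never told you"
}

GET _analyze
{
  "tokenizer": "uax_url_email",
  "text": "Contact obi.wan@example.org for droids"
}
```

The standard tokenizer splits Obi-Wan into obi and wan, whitespace keeps it together, and uax_url_email keeps the email address as one token.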

So, you can easily do that. Let's see. Let's index all my three documents so that we can actually search for them. Now, if I start searching for droid, it should match. These are not the droids you're looking for. Yes or no. Because this one is singular and uppercase, and the droid that we stored was plural and lowercase.

Will that match, yes or no? Yes. Why? Because of the stemming. Yes, we had the stemming. We had the lower casing. And when we search, so we store the text, it runs through this pipeline or the analysis. And for the search, it does the same thing. So, it will lowercase the droid.

It has stemmed down the droids in the text to droid and then we have an exact match. So, what the data structure behind the scene actually looks like. The magic is kind of like in this so-called inverted index. What the inverted index is, is these are all the tokens that remained that I have extracted.

I have alphabetically sorted them. And they basically have a pointer and say in this document, like with the IDs 1, 2, 3 that I have stored, we have how many occurrences? Like 0, 1, yeah, nothing had 2. And then we also know at which position they appeared. So, search for droid now.

This is what I have stored. I lowercase the droid to droid. I have an exact match here. Then I go through the list and see, retrieve this document, skip this one, skip this one. And at position 4, you have that hit. And then you could easily highlight that. So, you have almost done all the hard work at ingestion and this retrieval afterwards will be very fast and efficient.

That's the classic data structure for search, the inverted index where you have this alphabetic list of all the tokens that you have extracted to do that. And this will just be built in the background for you and that's how you can retrieve all of this. Let's look at a few other queries and how they behave.
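
Before we move on: storing the three documents and the droid search from just now look roughly like this (same placeholder index name as before):

```
POST /user_yourhandle/_doc/1
{ "quote": "These are not the droids you are looking for" }

POST /user_yourhandle/_doc/2
{ "quote": "No, I am your father" }

POST /user_yourhandle/_doc/3
{ "quote": "Obi-Wan never told you what happened to your father" }

GET /user_yourhandle/_search
{
  "query": {
    "match": { "quote": "Droid" }
  }
}
```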

If I search for robot, will I find anything? No, because there was no robot. There was a droid. We could now define a synonym and say, like, all droids are robots, for example. Who likes creating synonym lists? Nobody anymore. Okay. Normally, I would have said that's the Stockholm syndrome because there is sometimes somebody who likes creating synonym lists because they have done that for so many years.
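
Defining them is the easy part; a synonym list is just another token filter in the analysis chain, roughly like this (the droid/robot mapping and the names are made up for this example):

```
PUT /user_yourhandle-synonyms
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": ["droid, robot"]
        }
      },
      "analyzer": {
        "my_synonym_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_synonyms"]
        }
      }
    }
  }
}
```

Maintaining lists like that is the painful part.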

But it got easier nowadays. Now you can use LLMs to generate the synonyms. So, it can get a bit easier to create them. But they're still limited because you have always this mapping. So, with synonyms, you can expand the right way. Where it gets trickier if you have homonyms.

If a word has multiple meanings, like a bat could be the animal or it could be the thing you hit a ball with. There it just gets trickier because there is no meaning behind the words or no context. So, you just match strings and that is inherently limited. But, like I said, it's dumb, but it scales very well.

And that's why it has been around for a long time. And it does surprisingly well for many things because there's not a lot of things that are unexpected or that can go totally wrong. Now, other things that you can do. You could do a phrase search where you say, I am your father.

Will this find anything? Yes. Because we had "No, I am your father". What happens if I say, for example, let's see, "I am not your father"? Yes, no? No. Why? So, you're right: it's looking for an exact match based on the positions. But "not" is a stop word?

"Not" is a stop word, yes. But you're right because the positions still don't match. So, the stop word "not" would be filtered out, but it still doesn't match because the positions are off. That is one of the things that sometimes can be confusing. So, even if something is a stop word and will be filtered out, it doesn't just work like that.
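
For reference, the phrase queries look roughly like this; slop, which comes up next, is just an extra parameter on the same query:

```
GET /user_yourhandle/_search
{
  "query": {
    "match_phrase": { "quote": "I am your father" }
  }
}

GET /user_yourhandle/_search
{
  "query": {
    "match_phrase": {
      "quote": { "query": "I am father", "slop": 1 }
    }
  }
}
```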

One thing that you can do, though, is a factor called slop, where you basically say: if there is something missing, it should still work. So, "I am your father" and a search for "I am father" with slop zero, which is kind of like the implicit default, will not find anything. But if I set it to one, then I basically say there can be one position off in there.

Like, one word can be missing. However, "I am his father" -- here, "his" would not match. So, this still will not work. The slop is really just to skip a word. Yeah? What about "I'm your father"? "I'm your father"? I assume that -- no, "I possibly am your father".

I assume that won't work. Ah. That will not work. How would you get that to work? There you might need to do something like a synonym where you say 'm gets replaced by am. Or you would need to have some more machine learning capabilities behind the scenes to do stuff like that.

Are there any libraries that would predefine contractions like that? So, what is built in is generally a very simple set of rules. What you will need to do for things like this is normally you need a dictionary. The problem around these is they are normally not available for free or open source.

Funnily enough, they are often coming out of university, the dictionaries, because they have a lot of free labor. The students. That's why the universities have been creating a lot of dictionaries. But they often come out under the weirdest licenses. That's why they are not very widely available. But, yes, there is a smarter or more powerful approach if you have a dictionary and you can do these things.

For example, one thing to show -- maybe that's a good thing to also mention: you don't always get real words out of the stemming. It's not a dictionary. It doesn't really get what you're doing. It just applies some rules. So, for example, blackberry. Sorry, blackberries -- I think this will be stemmed down differently.

Ah, sorry, I need the English analyzer. Without English, this will not work. So, this will stem down to this weird word blackberri. And it will also stem down the singular blackberry to the same thing. So, there's a rule that produces this. But it's just a rule. It's not dictionary-based. It's not very smart. And it only has some rules built in that work for this.

But you will definitely hit limits. And the other thing, by the way, why I picked blackberry as an example: you have some annoying languages like German, Korean, and others that compound nouns, like blackberry, where you basically have two words in one. "Black" would never find "blackberry" in the simplest form because it's not a complete token.

There are various ways to work around that, and they all come with their own downsides. Either you have a dictionary, or you extract so-called n-grams. It's like groups of characters, and then you match on those groups. But all of those are just some of the many tools for how we try to make this a bit better or smarter, and it all has limitations.

I hope that answers the question and makes sense. So, there are dictionaries, but they're generally not free or not under an easy license available. For some languages, by the way, even the stemmers are not freely available. I think there is a stemmer or analyzer for Hebrew. I think that has also like some commercial license or at least you can't use it for free or free in commercial products.

Though licensing with machine learning models is also its own dark secret. Yeah. Yes. That is what an n-gram is doing. Let me see if I can show that. An n-gram is normally a character group, often a trigram. This is way too small. Somehow I have weirdly overwritten my Command-Plus so I can't use that.

Let me make this slightly larger. Okay. Here we basically use one or two letters as the groups, which is way too small. But just to show the example -- and this is very hard to read -- let me copy that over to my console. There you can -- oops.

There you can see this. But this is a great question. So, we'll use an n-gram tokenizer for "quick fox". And then you can see the tokens that I extract here are the first letter, the first two, the second, the second and third, et cetera. And you end up with a ton of tokens.

The downside is, A, you have to do more work when you store this. B, it creates a lot of storage on disk because you extract so many different tokens. And then your search will also be pretty expensive, because normally you would at least do trigrams. But even that creates a ton of tokens and a ton of matches.

And then you need to find the ones with the most matches. And it works. But, A, it is pretty expensive on disk but also at query time. And it might also create undesired results or results that are a bit unexpected for the end user. It is, I would call it, again, a very dumb tool that works reasonably well for some scenarios.

But it's only one of many potential factors. What you could potentially do, and I don't have a full example for that, but we could build it quickly, what you would probably do in reality is store a text in more than one way. So, you might store it with stop words and without stop words and maybe with n-grams.

And then you give a lower weight to the n-grams and say, if I have an exact match, then I want this first. But if I don't have anything in the exact matches, then I want to look into my n-gram list. And then I want to take whatever is coming up next.

So, even keyword-based search will be more complex if you combine different methods. N-grams are interesting, but, again, they're a dumb but pretty heavy hammer. Use them in the right scenario. Sorry, quick question about this n-gram. Is it by default one or two? Yes. But you could redefine that.

So, we can, let me go back to the docs. For the n-gram tokenizer, you can set min_gram and max_gram. If you set both to three, you would have trigrams, where it's always groups of three, like 1, 2, 3, then 2, 3, 4, et cetera. You could also have something called edge n-gram, where you expect that somebody types the first few letters right, and then you only start from the beginning but not in the middle of the word, which sometimes avoids unexpected results.

And, of course, it reduces the number of tokens quite a bit. So, somewhere in here, edge n-gram. Let's just copy that over so I won't type. So, here we have edge n-gram with "quick", and you can see it only does the first and the first two letters, but nothing else.

And, in reality, you would probably define this as something like 2 to 5, or more, or whatever else you want. But, here, we only do it from the start and nothing else, which reduces the tokens tremendously. But, of course, if you have blackberry and you want to match the berry, you're out of luck.
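
The two analysis calls from this part look roughly like this; the min/max values are the defaults of one and two used in the demo:

```
GET _analyze
{
  "tokenizer": { "type": "ngram", "min_gram": 1, "max_gram": 2 },
  "text": "quick fox"
}

GET _analyze
{
  "tokenizer": { "type": "edge_ngram", "min_gram": 1, "max_gram": 2 },
  "text": "quick"
}
```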

Makes sense. Anybody else? Anything else? Yeah, so, if you have multiple languages, do not mix them up. That will just create chaos. Because we'll get to that in a moment. But, how keyword search works is basically word frequency. And if you mix languages, it screws up all frequencies and statistics.

So, what you would do is, you would have a field for English and then you would have a field for whatever the abbreviation for Hebrew is. And then you would need to define the right analyzer for that specific field. So, you break it out either into different fields or you could even do different indices.
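
A minimal sketch of that separate-fields approach, using some of the built-in language analyzers (the field and index names are placeholders):

```
PUT /user_yourhandle-multilang
{
  "mappings": {
    "properties": {
      "quote_en": { "type": "text", "analyzer": "english" },
      "quote_de": { "type": "text", "analyzer": "german" },
      "quote_fr": { "type": "text", "analyzer": "french" }
    }
  }
}
```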

And ideally, we even have that built in. We have language identification. Even if you just provide a couple of words, it will guess, or not guess, it will infer the language with a very high degree of certainty. Especially Hebrew will be very easy to identify; if you have your own diacritics, it's easy.

But even if you just throw random languages at it, it will have a very good chance, just with a few words, to know this is this language, and then you can treat it the right way.

Good. Let's continue. So, we have done all of these searches. We have done slop. One more thing before we get into the relevance. One other very heavy hammer that people often overuse is fuzziness. Bless you. If you have a misspelling, so I misspelled Obi-Wan Kenobi. We already know that this is broken out into two different words or tokens.

It will still match your Obi-Wan because we have this fuzziness, which allows edits. It's a Levenshtein distance. So, you can have one edit. You could either give it an absolute value, like you can have one edit, which could be one character too much, too little, or one character different.

You could set it to two. You can't do more because otherwise you match almost anything. And auto is kind of smart because, depending on how long the token that you're searching for is, it will set a specific value. With auto fuzziness, zero to two characters is zero edits.

Three to five characters is one. And after that, it's two. So, you can match these. Will this one match? Yes, no, yes, no, and why? No, because you go to T and the value. Yes. So, we have -- we have -- both of those are misspelled. It still matches.

Why? They get tokenized separately and you can't have a single... Yes. That is a bit of a gotcha. So, yes. You need to know the tokenizer. So, we tokenize with standard, so it's two tokens, and then the fuzziness applies per token, which is another slightly surprising thing. But, yes, that's how you end up here.
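
The fuzzy query looks roughly like this; the misspelling is just an example:

```
GET /user_yourhandle/_search
{
  "query": {
    "match": {
      "quote": {
        "query": "Obi-Wna Kenobi",
        "fuzziness": "AUTO"
      }
    }
  }
}
```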

Okay. Now, we could look at how the Levenshtein distance works behind the scenes, but it's basically a Levenshtein automaton which looks something like this. If you search for food and you have two edits, this is how the automaton would work in the background to figure out all the possible permutations.

It's a fancy algorithm that was, I think, pretty hard to implement, but it's in Lucene nowadays. Okay. Now, let's talk about scoring. One thing that you have seen here that you don't have in a non-search engine or just in a database is that we have a score. It's like, how well does this match?

How does the score work here? Let's look at the details of that one. So, the basic algorithm, which most of us, or pretty much all of us here, probably know, is term frequency / inverse document frequency, or TF-IDF. It has been slightly tweaked; the newer implementation is called BM25, which stands for best match, and it's the 25th iteration of the best match algorithm.

So, what it looks like is: you have the term frequency. If I search for droid, how many times does droid appear in the text that I'm looking at? And it's basically the square root of that. So, the assumption is, if a text contains droid once, this is the relevancy.

If I have a text that contains droid 10 times, this is the relevancy. The tweak is: with TF-IDF, that curve just keeps growing; BM25 says, once you hit, like, five droids in a text, it doesn't really get much more relevant anymore. So, it kind of flattens out the curve.

That is the idea of term frequency. The next thing is the inverse document frequency, which is almost the inverse curve. The assumption here is over my entire text, this is how often the term Droid appears. So, if a term is rare, it is much more relevant than if a term is very common, then it's kind of, like, less relevant.

Basically, the assumption is: rare is relevant and interesting; very common is not very interesting anymore. And then it just works its curve out like that. And the final thing, the field length norm, is: the shorter a field is when you have a match, the more relevant it is.

Which assumes, like, if you have a short title and your keyword appears there, it's much more relevant than if there's a very long text body and your keyword and you have a match there. And these are the three main components of TF-IDF. So, let's take a look at how this looks like.
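
Written out, the classic Lucene TF-IDF building blocks and the BM25 per-term score look roughly like this; k_1 and b are the usual BM25 tuning parameters (around 1.2 and 0.75 by default), N is the number of documents, df the document frequency, |d| the field length, and avgdl the average field length:

```
\mathrm{tf}(t,d) = \sqrt{\mathrm{freq}(t,d)} \qquad
\mathrm{idf}(t) = 1 + \log\frac{N}{\mathrm{df}(t)+1} \qquad
\mathrm{norm}(d) = \frac{1}{\sqrt{|d|}}

\mathrm{score}_{\mathrm{BM25}}(t,d) =
  \ln\!\Bigl(1 + \tfrac{N - \mathrm{df}(t) + 0.5}{\mathrm{df}(t) + 0.5}\Bigr) \cdot
  \frac{\mathrm{freq}(t,d)\,(k_1+1)}
       {\mathrm{freq}(t,d) + k_1\bigl(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\bigr)}
```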

You can make this a bit more complicated. This will show you why something matches. Don't be confused by the -- or let me take that out for the first try. So, I'm looking for father. And I am -- no, I am your father. And Obi-Wan never told you what happened to your father.

One is more relevant than the other. Why is the first one more relevant than the second one? Yeah. Term frequency is the same. Both contain father once. The inverse document frequency is also the same because we are looking for the same term. The only difference is that the second one is longer than the first one.

And that's why the first one is more relevant here. So, this is very simple. And then, if you're unsure why something is calculated in a specific way, you can add this "explain": true. And then it will tell you all the details: okay, we have father. And it then calculates basically all the different pieces of the formula for you and shows you how it did the calculation.
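
The explain flag is just an extra field on the search request, roughly:

```
GET /user_yourhandle/_search
{
  "explain": true,
  "query": {
    "match": { "quote": "father" }
  }
}
```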

So, you can debug that if you need to. But it's probably a bit too much output for the everyday use case. And then you can customize the score if you want to. Here I'm doing a random score. So, my two fathers -- this is a bit hard to show -- they will just be in random order because their score is here randomly assigned.

But you could do this more intelligently, where you combine the score with, I don't know, the margin on the product that you sell or the rating, and you build a custom score for things like that.
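
A minimal sketch of the random version; in practice you would swap the random function for something like a field_value_factor on a rating or margin field (those field names would be your own):

```
GET /user_yourhandle/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "quote": "father" } },
      "random_score": {},
      "boost_mode": "replace"
    }
  }
}
```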

So, you can influence that any way you want. There is one thing that I see every now and then that is a very bad idea, but we'll skip this one because it's probably a bit too much. This one, by the way, is the full formula, or maybe I'll just show you the parts that I skipped.

What happens if you search for two terms and they're not the same, they don't have the same relevancy? So, what the calculation behind the scenes basically looks like is let's say we search for father. Father is very rare, that's why it's much more relevant than your. Your is pretty common.

And then we have a document that contains your father. It's kind of like this axis. This will be the best match. But will a document that only contains father be more relevant or only your? Intuitively, the one with just father will be more relevant. But how does it calculate that?

It basically calculates: this is the relevancy of father, this is the ideal document, and this is your. And then it looks at which one has the smaller angle. And that is the one that is more relevant. So, if you have a multi-term search, you can figure out which term is more relevant and how they are combined.

And then you can also have the coordination factor which basically rewards documents containing more of the terms that you're searching for. So, if I'm searching for three terms like I am father, whatever. If a document contains all three, this will be the formula that combines the scores of all three and multiplies it by three divided by three.

If it only contains two of them, it would only have the relevancy of 2/3 and with one 1/3. And then you put it all together and this is the formula that happens behind the scenes and you don't have to do that in your head, luckily. Cool. We have seen these.

One thing that we see every now and then is that people try to translate the score into percentages. Like you say, this is a 100% score and this is only like a 50% match. Who wants to do that? Hopefully nobody, because the Lucene documentation is pretty explicit about that.

You should not think about the problem in that way, because it doesn't work. And I'll show you why it doesn't work or how this breaks. Let's take another example. Let's say we take this short text: these are my father's machines. I couldn't think of a good Star Wars quote to use here, but bear with me.

So what remains if I run this through my analyzer? My father machine. These are the three tokens that remain. Now, I will store that. You remember the three tokens that we have stored. And if I search for my father machine, you might be inclined to say this is the perfect score.

This is like 100%. Agreed? Because all the three tokens that I have stored in these are my father's machines are there. So this must be like my perfect match. So it's 3.2, that would be 100%. The problem now is every time you add or remove a document, the statistics will change and your score will change.

So if I delete that document and I search the same thing again, I don't know what percentage this is now. Is this now the new 100% the best document or is this a zero point or, I don't know, 20%? How does this compare? And then you can play funny tricks where these droids are my father's father's machines.

And you can see I have a term frequency of 2 for father here. So if I store that one then and then search it, is this now 100%, is this now 110%? So don't try to translate scores into percentages. They're only relevant within one query. They're also not comparable across queries.

They're really just sorting within one query to do that. Okay. Let me get rid of this one again. Now, we've seen the limitations of keyword search. We don't want to define our synonyms. We might want to extract a bit more meaning. So we'll do some simple examples to extend.

I will add, from OpenAI, a text embedding model, text embedding small. I'm basically connecting the inference API for text embeddings here in my instance. I have removed the API key. You will need to use your own API key if you want to use it. But it is already configured. So let me pull up the inference services that we have here.
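
Setting up such inference endpoints looks roughly like this; the endpoint names are placeholders, and the exact OpenAI model id and the 128 dimensions are my assumption based on what's described in a moment:

```
PUT _inference/text_embedding/openai-small
{
  "service": "openai",
  "service_settings": {
    "api_key": "<your OpenAI API key>",
    "model_id": "text-embedding-3-small",
    "dimensions": 128
  }
}

PUT _inference/sparse_embedding/my-elser
{
  "service": "elser",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  }
}
```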

I have done -- or I have added two different models. One sparse, one dense. Let's go to these. By the way, if you try to do this with a 100% score, don't do this. Because it will just not work. Okay. Not everybody has worked with dense vectors, right? So I have a couple of graphics coming back to our Star Wars theme, just to look at how that works.

So what you do with dense vectors is we keep this very simple. This one just has a single dimension. And it has, like, the axis is pretty much like realistic Star Wars characters and cartoonish Star Wars characters. And this one falls on the realistic side and that other one is just cartoonish.

And you have a model behind the scenes that can rate those images and figure out where they fall. Now, in reality, you will have more dimensions than one. And you will also have floating point precision. So it's not just, like, minus one, zero, or one. But you will have more dimensions.

So, for example, here, a dimension for human versus machine. In a realistic model, the dimensions are not labeled as nicely and clearly understandable. The machine has learned what they represent. But they're not representing an actual thing that you can extract like that. But in our simple example here, we can say this Leia character is realistic and a human, versus, I don't know, Darth Vader is cartoonish and somewhere between human and machine.

So this is the representation in the vector space. And then you could have, like I said, you could have floating point values and then you can have different characters. And similar characters, like, both of those are human. Without the hand, he's only, like, not quite as human anymore, so he's a bit lower down here.

So he's a bit closer to the machines. So you can have all of your entities in this vector space. And then if you search for something, you could figure out, like, which characters are the closest to this one. And again, in reality, you will have hundreds of dimensions. It will be much harder to say, like, these are the explicit things and this is why it works like that.

It will depend on how good your model is in interpreting your data and extracting the right meaning from it. But that is the general idea of dense vector representation. You have your documents or sometimes it's like chunks of documents that are represented in this vector space and then you try to find something that is close to it for that.

Does that make sense for everybody or any specific questions? So it's a bit more opaque, I want to say. It's not quite as easy because you say, like, these five characters match these other five characters here. But you need to trust or evaluate that you have the right model to figure out how these things connect.

So let's see how that looks. I have one dense vector model down here. We have the OpenAI embedding. This one is a very small model. It only has 128 dimensions. The results will not be great, but for demonstrating, it's actually helpful. So we'll see that. The other model that we have -- let me show you the output of that.

So if I take my text, these are not the droids you are looking for, this is the representation. It's basically an array of floating point values that will be stored and then you just look for similar floating point values. And then you have these are not the droids you are looking for.

Here on the previous one, dense text embedding. This one here does sparse embedding. The main model used for that is called SPLADE. Our variant of SPLADE, we call it ELSER. It's kind of like a slightly improved SPLADE, but the concept is still the same. What you get is, you take your words, and this is not just TF-IDF.

This is a learned representation where I take all of my tokens and then expand them and say, like, for this text, these are all the tokens that I think are relevant. And this number here tells me how relevant they are. Again, not all of these make sense intuitively. And you might get some funky results, for example, with foreign languages.

This currently only supports English. But these are all the terms that we have extracted. Normally, yeah, you get, like, 100-something or so. So, the idea is that this text is represented by all of these tokens. And the higher the score here, the more important it is. And what you will do is, you store that behind the scenes.

When you search for something, you will generate a similar list, and then you look for the ones that have an overlap, and you basically multiply the scores together, and the ones with the highest values will then find the most relevant document. This is insofar interesting or nice because it's a bit easier to interpret.

It's not just a long array of floating point values. Sometimes these terms don't make sense, though. The main downside of this is that it gets pretty expensive at query time. Because you store a ton of different tokens here for this. When you retrieve it, the search query will generate a similarly long list of terms.

And if you have a large enough text body, a query might hit a very large percentage of your entire stored documents with these OR matches. Because, basically, these are just a lot of ORs that you combine, calculate the score, and then return the most or the highest ranking results.

So, it's an interesting approach. It didn't gain as much traction as dense vector models, but as a first step or an easy and interpretable step, it can be a good starting point to dive into the details here. So, "these are not the droids you're looking for" is basically represented by this embedding here.

So, it's like this entire list of terms with this, yeah, with this relevancy, basically. This is the representation of this string. And then, when I search for something, I will generate the same list and then I basically try to match the two together. Like for what has the most or the highest matches here.
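
If you want to look at these representations yourself, you can call the inference endpoints directly, roughly like this (same placeholder endpoint names as set up earlier):

```
POST _inference/sparse_embedding/my-elser
{
  "input": "These are not the droids you are looking for"
}

POST _inference/text_embedding/openai-small
{
  "input": "These are not the droids you are looking for"
}
```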

Make sense? Yes, we'll do that in a second. I will create a new index. This one keeps the configuration from before, but I'm adding semantic_text fields for the sparse model and the dense model. So, I've created this one. And now I'll just pull in the three documents from my other index.

As you can see here, it says three documents were moved over. So, we can then start searching here. And if I look at that, the first document is still "these are not the droids you're looking for". Just like for keyword search, you don't see the extracted tokens here.

We also don't show you the dense vector representation or the sparse vector representation. Those are just stored behind the scenes for querying, but there's no real point in retrieving them because you're not going to do anything with that huge array of dense vectors. It will just slow down your searches.
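
The new index and the reindex look roughly like this; semantic_text needs a recent Elasticsearch version, the names are placeholders, and the copy_to setup is my reconstruction of what's described next:

```
PUT /user_yourhandle-semantic
{
  "mappings": {
    "properties": {
      "quote": {
        "type": "text",
        "copy_to": ["quote_sparse", "quote_dense"]
      },
      "quote_sparse": { "type": "semantic_text", "inference_id": "my-elser" },
      "quote_dense":  { "type": "semantic_text", "inference_id": "openai-small" }
    }
  }
}

POST _reindex
{
  "source": { "index": "user_yourhandle" },
  "dest":   { "index": "user_yourhandle-semantic" }
}
```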

You can look at the mapping and you can see I'm basically copying my existing quote field to these other two that I can also search those. Okay. So, if I look for machine on my original quote, will it find anything? No? No, because it only had -- these are not the droids you're looking for.

And this is still the keyword search. It doesn't work, and it shouldn't work. That's exactly the result that we want out of this here. Now, if I go to the ELSER field and I say machine, then it will match here. These are not the droids you're looking for. And you can see this one matches pretty well, I don't know, at 0.9.

But it also has some overlap, with no I am your father. I mean, it is much lower in terms of relevance. But something had an overlap here. And only the third document, Obi-Wan never told you what happened to your father. Only that one is not in our result list at all.

But there was something here. I don't know the expansion. We would need to basically run -- where was it? We would need to run this one here for all the strings and look then for the expansion of the query. And then there would be some overlap, and that's how we retrieve that one.
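
For reference, the semantic queries here look roughly like this, one against the sparse field and one against the dense field, which comes up in a moment:

```
GET /user_yourhandle-semantic/_search
{
  "query": {
    "semantic": { "field": "quote_sparse", "query": "machine" }
  }
}

GET /user_yourhandle-semantic/_search
{
  "query": {
    "semantic": { "field": "quote_dense", "query": "machine" }
  }
}
```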

Is that a threshold that you have? You could define a threshold. It will, though, depend -- let's see. This is not the droids you're looking for. Let's say if I -- if I say -- I'm not sure if this will change anything. I mean, the relevance here is still -- it's still 10x or so.

But, yeah, this one still -- we'll just have a very low -- it's still -- terms you look for. The score just totally jumps around. It's a bit hard to define the threshold. Because here you can see, in my previous query, we might have said 0.2 is the cutoff point.

But now it's actually 0.4, even though it's not super relevant. So it might be a bit tricky, or you might need to have a more dynamic threshold depending on how many terms you're looking for and what is a relevant result. In the bigger picture, the assumption would be if you have hundreds of thousands or even millions of documents, you will probably not have the problem that anything that is so remotely connected will actually be in the top 10 or 20 or whatever list that you want to retrieve.

So for larger proper data sets, this should be less of an issue. With my hello world example of three documents, it can be a bit misleading. But, yes, you can have a cutoff point if you figure out what for your data set and your queries is a good cutoff point.

You could define the cutoff point. No, sorry, you have three documents. How come it's only showing two? Is it because of -- So the query gets expanded into, I don't know, those 100 tokens or whatever. And then for those two, there is some overlap, but the third one just didn't have any overlap.

But I -- so we -- okay, we can do that. It's just a bit tricky to figure out that the term that has the overlap. So we will need to take this one, machine -- no, I am your father. Let's take this one. What you need to do is to figure that one out.

I don't know, actually, we should be able -- let me see. Let's see. This is a pretty long output. Somewhere I was actually hoping that it would show me the term that has matched here. Okay. I see something -- okay, there is something puppet that seems to be the overlap.

How much sense that term expansion for the stored text and the query text makes is a bit of a different discussion. But in here with that explained true, you can actually see how it matched and what happened behind the scenes. If you have any really hard or weird queries or something that is hard to explain, to debug that.

But the third one didn't match. Now, if I take the dense vector model with OpenAI and I search for machine, how many results do you expect to get back from this one? 0, 1, 2, 3. Yes, 3. Why 3? Yes, because there's always some match. That is the other -- or let me run the query first.

These are not the droids you're looking for. This one is the first one. I don't think that this model is generally great because here the results are super close. It is -- I mean, the droids with the machines, that is the first one. But the score is super close to the second one, which is no, I am your father, which feels pretty unrelated.

And Obi-Wan never told you what happened to your father. Even that one is still with a reasonably close score. But why do we have those? Because if we say, what is the relevance, I mean, it's further away, but it's always like there's always kind of like some angle to it, even if the kind of like the angle here or depending on the similarity calculation that you do, but it's still always related.

There is no easy way to say something is totally unrelated. That is, by the way, one good thing about keyword search where it was relatively easy to have a cutoff point of things that are totally not relevant, where you're not going to confuse your users. Whereas here, if you don't have great matches, you might get almost -- it's not random, but it's potentially -- it looks very unrelated to your end users what you might return.

just because it's very hard to show. Yes? Is it fair to say, then, that the OpenAI embedding search is worse for this kind of toy example, because the magnitude of difference is -- I'm careful with worse, because it's really a hello world example, so I don't take this as a quality measurement in any way.

I -- yeah, I mean, the OpenAI model with 128 dimensions is very few dimensions. I think it will probably be cheap, but not give you great results necessarily. But don't use this as a benchmark. I think it's just a good way to see that this is now much harder, because now you need to pick the right machine learning model to actually figure out what is a good match.

With keyword-based search, it was a bit of a different story. There you need to pay more attention to, like, how do I tokenize, and do I have the right language, and do I do stemming or not stemming. But most of that work is relatively, I want to say, almost algorithmic, and then you can figure that out, and you configure it, and then it's very predictable at query time.

Whereas with the dense vector representation, you really need to evaluate for the queries that you run and the data that you have, like, is that relevant, and is this an improvement or not? It's very easy to get going and just throw a dense vector model together, and you will match -- you will always match something that might be an advantage over the lexical search where you don't have any matches, which sometimes is the other problem that nothing comes back and you would want to have at least some results.

Here it might just be unrelated. So that can be tricky. That you want to have at least some results is, by the way, a funny story that a European e-commerce store once told me. They said they accidentally deleted, I think, two-thirds of the data that they had for the products that you could buy.

And then I asked them, okay, so how much revenue did you lose because of that? And they said, basically nothing, because as long as you show some somewhat relevant results quickly enough, people will still buy something. Having no results is probably the worst. So for an e-commerce store, you might want to show stuff a bit further out, because people might still buy it.

But it really depends on the -- I'm coming to you in a moment -- it really depends on your use case. E-commerce is kind of like one extreme where you want to show always something for people to buy. If you have a database of legal cases or something like that, you probably don't want that approach, because that will go horribly wrong.

So it is very domain specific. That's, I think, also the good thing about search, because it keeps a lot of people employed, because it's not an easy problem. It's almost job security, because it depends so much on: this is the data that you have, this is the query that people run, this is the expectation of what will happen, and this is, for this domain, the right behavior.

So there's no easy right or wrong with the checkbox. And the other thing is you might make -- if you tune it, you might make it better for one case, but worse for 20 others. That's why a robust evaluation set is normally very important, though very rare. A lot of people YOLO it, and you will see that in the results.

And for the e-commerce store, it probably works well enough. Sorry, you had a question. Can I limit the semantic enrichment to a subset of my index based off the properties of the document? So if I have a very large shared index with a lot of customers, and I want to enable AI for a subset of the index, can I say, hey, only do the semantic enrichment if the document has this property where maybe it's like an AI customer?

Yeah, so the way we would do it in our product is that you would probably have two different indices with different mappings. Yeah, but then it's not so fun, like the customer upgrades and I have to migrate them to the new index. Dave, please. Yeah, so if you, for example, have an index in Elasticsearch, you can think of it almost like a sparse table, right?

So there's no penalty for having a field that is not populated. So either in your application or an ingest processor, you could have an inference statement and say, Yeah, that's how we do it now. No, we'll only move it over. With this automatic way where you kind of turn it off.

No, the problem is the data structure. Like, if the field is there -- the data structure that we build in the background is called HNSW. And either we build that data structure or we don't build it. Yeah, so if you had, you know, 10 billion entries in your vector index, your index is set up for vectors, right?

And if you just don't populate the thing that is either putting in a dense vector or triggering the inference to create a dense vector to put in there, then the index is just going to be, you know, a bunch of pointers, and none of them head towards the HNSW, and it won't show up in search results.

The penalty is nothing, right? But you're going to have to manage what does or does not create the vector. You could do that in an ingest processor by just saying, Hey, we're going to use the copy command to have two copies of the text, one that's meant for non-vector indexing, one that's meant for actual vector indexing.

You'd have to manage that with some tricky, complex AI technology called if-then-else, right? Somewhere inside of your ingest pipeline, and then it would work just fine. Yeah. One more question. When we did HNSW last week, we found that it was extremely slow at write time, and the community suggested that we freeze our index if we were going to use HNSW.

Force merge, or? Yeah, I think just freeze writes. They said, build the index and freeze it, otherwise you'll put a ton of load on the computer. I mean, yes. What we found is that some of the defaults, settings that have kind of been around in Elasticsearch for 10 years, like the merge scheduler, really optimize for keyword search, and for high update workloads on HNSW we've got some suggestions.

They take a little bit of parameter tuning to go and find something right for your IOPS and for your actual update workload. So sometimes it's about the merge scheduler and not doing kind of an inefficient HNSW build when it's not important for the use case. Okay. The other thing I'd say is that sometimes friends don't let friends run Elasticsearch 8.11 -- upgrade, upgrade, upgrade.

They put a lot of optimization work in there. It should be simple. That's great. So the reason -- it used to be that way. The reason why that is: merging -- because you have the immutable segment structure in Elasticsearch, and HNSW graphs you cannot easily merge. You basically need to rebuild them.

The one trick -- I forgot which version it was. I'm not sure, Dave, if you remember. I think it was even before 8.11. But basically, if we do a merge, we would take the largest segment with not deleted documents and basically plop the new documents on top of it rather than starting from scratch from two HNSW data structures.

There's another optimization somewhere now in 9.0 that will make that a lot faster. So it really depends on the version that you have. And there are a couple of tricks that you can play. But yeah, that is one of the downsides of the way immutable segments work and HNSW is built: you can't merge it together as easily as other data structures, because you really need to rebuild the HNSW data structure, or take the largest one and then plop the other one in.

Okay. Some of the things where we just, like -- we fixed it in the next version of Lucene, so I want you to -- We found like HNSW was too slow and then HNSW broke our CPU and then we moved to find them, and now we're . Stay true.

Yeah. Yeah. Might have been a while ago. Yeah? So for like traditional document search, you know what I'm saying, like, hey, please find me a document that contains my search query, right? For R in the context of RAG, it might be something more like, hey, come up with a fun plan for my weekend, right?

And then the documents that we want to find don't necessarily look like the search query, right? Yeah. So like one approach to that is you just give -- it's an agent and you give it a search tool, and it searches, right? So like I'm just curious what you -- how do you think about that in general?

Yeah. I feel like RAG has been very heavily abused. The mental model, I think, started off as: you do retrieval and then you do the generation. But you could do the generation earlier on as well, where you do query rewriting and expansion. So my favorite example for that is you're looking for a recipe.

You don't need to have the LLM regenerate the recipe. You just want to find the recipe. But maybe you have a scenario where you forgot what the thing is called that you want to cook. And then you could use the LLM, for example, to tell you what you're looking for.

Like you say, like, oh, I'm looking for this Italian dish that has like these layers of pasta and then some meat in between. And then the LLM says, oh, you're looking for lasagna. And then you basically do the generation first or a query rewriting and then search and then get the results.

As a very explicit example here. Your example would look very different and probably smarter than my example. But query rewriting is one thing. There's also this concept of HyDE, hypothetical document embeddings, where your documents and your queries often look very different, and you use an LLM to generate something from the query that looks closer to the documents that you have.

And then you match the documents against that because they're more similar in structure. So there are all kinds of interesting things that you can do. Like I said earlier, "it depends" is becoming a bigger and bigger factor. But, yeah, your use case might be, maybe, a multi-step retrieval where you first figure things out -- I know the example from an e-commerce store where it's like, I'm going to a theme party from the 1920s, give me some suggestions.

And then the LLM will need to figure out, like, what am I searching for? And then it can retrieve the right items and rewrite the query and then actually give you proper suggestions. But it's not just running a query anymore. Yeah? Yeah? We use instruction-tuned embedding models. And it's kind of like .

Along with your query, you can say, like, this is the kind of thing I am doing. And you can say . Like, you can have separate document and query embeddings. We try to embed queries from documents we're going to find, like the text. You can have instructions when you embed with models so that, instead of saying, like, I'm actually creating documents, you have to .

And then, I don't know, why do you think you have a problem? Yeah? How should we be thinking about the number of dimensions in the embedding model? Is, like, a 512-dimensional model necessarily better than a 123-dimensional one? Definitely not necessarily. Yeah. It's an interesting question. That feels almost like a blast from the past.

I remember, like, two or three years ago, there was this big debate of, like, how many dimensions does each data store support and, like, how many dimensions should you have? And at first, it looked like, oh, more dimensions is always better. But then it turned out more dimensions are very expensive.

So, it really depends on the model and what you're trying to solve. Like, if you can get away with fewer dimensions, it's potentially much cheaper and faster. But I don't think there is a hard rule. Maybe the model with more dimensions can express more because it can just hold more data, and then that will come in handy.

But maybe it's not necessary for a specific use case and then you're just wasting a lot of resources. I don't think there is an easy answer to say, like, yes, for this use case, you need at least 4,000 dimensions. It will depend. But it depends on the model, how many dimensions it will output, and then maybe you have some quantization in the background to reduce that again or reduce either the number of dimensions or the fidelity per dimension.

So there are a lot of different tradeoffs in that performance consideration. But it will mostly come down to how well the model works for the use case that you're trying to solve. Yeah. So that is one area. Historically, I want to say, what you would do is you would have a golden data set, and then you would know what people are searching for, and then you would have human experts who rate your queries.

And then you run different queries against it and then you see, like, is it getting better or is it getting worse? Now LLMs open a new opportunity: you might still have human experts in the loop to help out a bit, but the LLMs might actually be good at evaluating the results.

So almost nobody has, like, the golden data set to test against. But you can either look at the behavior of your end users and try to infer something from that, or you have an LLM that evaluates what you have, or you have a human together with an LLM evaluate the results.

So you have various tools, but, again, it depends -- it's really not an easy question of saying, like, this is the right thing. Maybe you can get away with something simple. The classic approach, I want to say, is you looked at the clickstream of how your users behaved, and then you saw, like, they clicked on the first or up to the third result.

If they did that and didn't just go back and click on something else, but stuck on the page, the result was potentially good. If they don't click on anything and just leave, it might be very bad. If they go to the second or third page, it might also not be great.

So there are some quality signals that you can infer from that, or you really look into the quality aspect and try to evaluate what people were doing and how it behaves. But you can make this anything from relatively simple to pretty complicated. What else? Obviously, if I search with the expanded query, it will still find my father example.

And this one here will still, again, match my droids, pretty much like the OpenAI example. One thing that I wanted to show you is what is also happening behind the scenes here: this is a very long segment, like, it's a lot of information with different speakers. What I have created here, though -- we have created multiple chunks behind the scenes.

And if I search for that -- I think looking for murder in the Skywalker saga works pretty well here -- it finds the document that I have retrieved, but it can also highlight. So here I say, show me the fragment that actually matched best. And if I search here for murder, it didn't find anything.

But in this highlighted segment here, the term that it found was kill, and it was that one that got expanded here. So here I have broken up my long text field into multiple chunks, and there are multiple strategies. You can do that by page, by paragraph, by sentence.

You could do it overlapping or not overlapping. Which strategy works best will depend on how you want to retrieve and on your use case. But you want to kind of reduce the context per element that you're matching, because there's only so much context that a dense vector representation can hold.

So you want to chunk that up. Especially if you have a full book, you want to break it up into at least individual pages, and then find the relevant part where the match is, and then you can actually link back to that. The point of this query here is also to show you: I didn't define any chunks.

I didn't say like, okay, send this representation of a dense vector there and then when it comes back, interpret again. This is all happening behind the scenes just to make this easier. So the entire behavior here is still very similar to the keyword matching even though there's a lot more magic happening behind the scenes.
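The talk doesn't show the mapping behind this, but the kind of field that chunks and embeds automatically behind the scenes looks roughly like this -- index name, field name, and inference endpoint are placeholders:

```
PUT transcripts
{
  "mappings": {
    "properties": {
      // semantic_text chunks long text and stores one embedding per chunk, all handled server-side
      "text": {
        "type": "semantic_text",
        "inference_id": "my-embedding-endpoint"
      }
    }
  }
}

GET transcripts/_search
{
  "query": {
    // the query text is embedded for you; no manual vectors or chunk handling in the request
    "semantic": {
      "field": "text",
      "query": "murder in the Skywalker saga"
    }
  }
}
```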

Just to keep that very simple. So let's see. Okay. How does everybody feel about long JSON queries? We'll see about alternatives and maybe we can make this a bit simpler again. But let me show you one more way of looking at it. We call them retrievers. They're a more powerful mechanism to actually combine different types of searches.

Combining different types of searches, let me get from my slides actually. When we talk about combining searches and how this all plays together. This is kind of my little interactive map of what you do when you do retrieval or what your searches do. We started here in the lexical keyword search and then we run the match query and we're matching these strings.

This, often combined with some rank features, is what we call full text search. The rank features could be a specific signal that you extract, or really anything you use to influence the ranking: it could be the margin on the product, how many people bought something, what the rating is.

There are many different signals that you could include, not just the match on the text, but any other signals that you want to combine for retrieving that. And then you have full text search as a whole. On top of that -- I kept it to the side here -- you might have a Boolean filter where you have a hard include or exclude of certain attributes. This does not contribute to the score; it's just black and white, included or excluded, whereas the rest here calculates a score for you, for how well you match.

And then this was kind of the algorithmic side. And then we have the machine learning, the learned side, or the semantic search, where you have a model behind the scenes, split into the dense vector embeddings and the sparse vector embeddings -- vector search and learned sparse retrieval, I think those are the two common terms.

And the interesting thing is, all of these, including the learned sparse one, are sparse vector representations in the background, and only this one here is the dense vector representation. And then when you combine any grouping down here into one search, that is what we would call hybrid search. There can be big discussions about what exactly hybrid search is or isn't, but I will stick to the definition that as soon as you combine more than one type of search -- it could be sparse and dense, or dense and keyword, or maybe even two dense vector searches -- it's hybrid search, because you have multiple approaches. And then you can either boost them together, or you could do re-ranking, which is becoming more and more popular.

One thing that we lean heavily into is RRF, reciprocal rank fusion, which doesn't rely on the score but on the position of each result in each search mechanism. So it basically says: the lexical search had this document at position four and the dense vector search had it at position two, and then it kind of evens out the positions and gives you an overall position by blending them together, rather than looking at the individual scores, because they might be totally different.
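As a rough sketch of what that looks like as a query -- index, field, and model names are made up here -- an rrf retriever blends a lexical and a dense retriever by position:

```
GET transcripts/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        // lexical leg: classic keyword match
        { "standard": { "query": { "match": { "text": "droids" } } } },
        // dense leg: kNN over the embedding field
        {
          "knn": {
            "field": "text_embedding",
            "query_vector_builder": {
              "text_embedding": {
                "model_id": "my-embedding-model",
                "model_text": "droids"
              }
            },
            "k": 10,
            "num_candidates": 50
          }
        }
      ],
      "rank_constant": 60,
      "rank_window_size": 50
    }
  }
}
```

Each document's final score is the sum of 1 / (rank_constant + rank) across the two lists, which is exactly the blending by position rather than by raw score.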

So this is kind of like the information retrieval map overall. And, okay, we didn't do a lot of filters, but I think filters are intuitively relatively clear: you just say, like, I'm only interested in users with this ID, or whatever other criteria. It could be a geo-based filter, like only things within 10 kilometers, or only products that came out in the last year.

Like a hard yes or no. All the others will give you a value for the relevance and then you can blend that potentially together to give you the overall results. That is kind of like the total map of search. Can you give an example of the signal one too?

Yeah, for signals -- so we have our own data structure for these rank features. It could be, for example, the rating of a book, and then you combine the keyword match -- say you search for murder mysteries -- with another feature, which is how well the books are rated, and factor that in.

Or it could be your margin on the product, or the stock you have available, and you would want to show the product where you have more in stock. Or it might even be something simple like a clickstream, like what people have clicked before. There are a lot of different signals that you could include in all of this searching.
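A minimal sketch of that kind of signal -- index and field names are made up -- is a rank_feature field next to the text, where the rating nudges the score instead of acting as a hard filter:

```
PUT books
{
  "mappings": {
    "properties": {
      "title":  { "type": "text" },
      "rating": { "type": "rank_feature" }
    }
  }
}

GET books/_search
{
  "query": {
    "bool": {
      // the keyword match does the retrieving...
      "must":   { "match": { "title": "murder mystery" } },
      // ...and the rank feature boosts better-rated books within those matches
      "should": { "rank_feature": { "field": "rating" } }
    }
  }
}
```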

Any other questions? Is everybody good for now? Yeah? You would have to normalize them. Depending on the comparison that you do for dense vectors, it might be between 0 and 1. But you saw that for the keyword search, also depending on how many words I was searching for, it might be a much higher value.

There is no real ceiling for that. Or you could add a boost and say, like, this field is 20 times more important than this other field. There is no real max value that you would have here. You could normalize the score and then basically say, like, I'll take the highest value in this sub query as 100% and then reduce everything down by that factor.

And then I combine them. Maybe that works well. RRF is a very simple paper. I think it's like two pages. And it really just takes the different positions. I think it's one divided by 60 plus the position, where 60 is a constant that they figured out makes sense. So a document at position four in one list and position two in another gets 1/(60+4) + 1/(60+2), roughly 0.03. And then you add those values for each document together across the different searches.

And then that value gives you the overall position. It really doesn't look at the score anymore; it blends the different positions together, how they interleave and what should be first or second. Yeah? Yeah. So, just for vector search, why should I use Elasticsearch over pgvector or something like that?

I'm sure that . My data is already in the database. We sync changes, probably via CDC, change data capture. So, there's one extra . Like, pgvector, like, it's right there. I was just curious, what sort of systems, like, what have you seen in production? I mean, pgvector will always be there because, like, if you are already using Postgres, it's very easy to add.

I think then the question is, like, does it have all the features that you need? For example, Postgres doesn't even do BM25. It has some matching, but it's not the full BM25 algorithm because I don't think it keeps all the statistics. It will be a question of, like, scaling out Postgres can be a problem and then just, like, the breadth of all the search features.

If you only need vector search, I think my or our default question back to that is, like, do you really only need vector search? Maybe for your use case, but for many use cases, you probably need hybrid search. One area, for example, where vector search will not do great is, like, if somebody searches for, like, a brand.

Because there is no easy representation in most models for the specific brand, and it will be very hard to beat keyword search. And also your users will be very angry when they know you have this word somewhere in your documents or in your data set, but you don't give them the result back.

So there are many scenarios where you probably want hybrid search. I feel like -- two years ago this started with just vector search, but the overall trend is moving more toward hybrid search, because you probably want some sort of keyword search, and then you want that combined, probably with some model for the added benefit and extra context, but you often want the combination.

It might also depend a bit on the types of queries that your users run. So if your users run single-word queries, like I've done in my examples, that's often not really ideal for vector search, because, like any machine learning model, it lives off extra context.

So depending on that, I've seen some people build searches where, if you search for one or two words, they do keyword search, but if you search for more, they might switch over to vector search. So it depends a bit on the context what works. If you really only need vector search, and your data is small enough for pgvector to do all of that, and Postgres is your primary data store, then that's probably where you will do well.

But there are plenty of scenarios where not all of those boxes will be ticked. My last question. Specifically for code: let's say you have a git repository, and there are, like, thousands of commits, right? And so you have two options, either you embed at a file level, or you embed at a chunk level, right?

But I don't want to pay the penalty across thousands of unchanged versions -- like, the file hasn't changed for a thousand commits, but just for this one it has changed, right? So I've got all these shadow copies of the same thing. Have you seen, like, what are some tips and tricks that people use to not have exploding storage costs? And this might not be an Elastic problem but a general vector problem: how do I pay the penalty of storing the same embedding only once, and only re-embed when it changes?

But so, you would create -- so it's one dataset basically with thousands of files that all are chunked together, and so one change would invalidate all of them, or -- No, no. So, say I have a repository, right? It has like 5,000 commits, and there's one file, .

It didn't change for 4,999 commits, but on the 5,000th commit, it did change, right? And so, if it hasn't changed, I want to point the file to the one embedding, right? But only when the file contents change do I want to re-ingest, right?

But for those 4,999 times, I don't want to store it again -- like, this hash has the same embedding, same embedding. Ah, so -- I'm not sure if this is a problem you have seen with some of your customers, but -- I think the way we might solve it is that if you create the hash of the file and use that as the ID, and you only use the operation create, then it would reject any duplicate writes, and you would at least not ingest and then create the vector representation again.

You will still send it over again, and it would need to get rejected. So the doc ID would have to be the hash of the file? Yes. If you have that doc ID, and you set the operation to just create, and not update or upsert, then the duplicate would just be rejected and you would only write it once.

I'm not sure if that is a great use case or if you might want to keep, like, I don't know, an outside cache of, like, all the hashes that you've already had and deduplicate it there, but that would be the elasticsearch solution of, like, using the hash as the ID and then just writing to that.
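A minimal sketch of that approach -- the index name and hash are placeholders -- is to use the content hash as the document ID together with the create operation, so a repeated write of the same content is rejected instead of re-triggering the embedding:

```
// first write succeeds and triggers whatever enrichment the index does
PUT code-files/_create/9f2b1d4e0a
{
  "path": "src/droids.py",
  "content": "..."
}

// sending the identical file again hits the same ID and fails with a 409 version_conflict_engine_exception
PUT code-files/_create/9f2b1d4e0a
{
  "path": "src/droids.py",
  "content": "..."
}
```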

Okay. And although with create on that, yeah. That is, I think, the intuitive or most native approach that we could offer for that. Yeah. I think there was some other question somewhere. Yeah. Yeah. I just wanted to add on to the Postgres question a minute ago. Postgres does have a native full-text search.

It's pretty good. It's a . There has also been recently a lot of work. So there is a . But from what I remember, the default Postgres full-text search does not do full BM25, but it only does -- it doesn't have all the statistics, I think, from what I remember.

Right. Yeah. Any other questions? Joe, please, go ahead. Do you plan to cover more on the retriever thing, like the retriever concept? To show now? Yeah. I mean, how much JSON do you want to see? Well, okay. I'm practically interested in -- I mean, you're modeling -- so you have multiple -- I love the retriever concept.

It's a great concept. I think you have multiple different candidates. What kind of flexibility do you have on that? Because I'm not interested in using this. I'm interested in re-scoring them. Not necessarily looking at any, but I'm modeling the effect. You're just retrieving a bunch of stuff. Can they?

And then you're actually elevating your . Yes. I'm sure there's . Maybe for -- before we dive into that, for everybody else: re-scoring is, like, let's say we have a million documents, and we have one cheaper way of retrieving them, and we retrieve the top, I don't know, 1,000 candidates, and then we have a more expensive but higher quality way of re-scoring them. Then we run this more expensive re-scoring on just the top 1,000 to get our ultimate list of results.

But the re-scoring algorithm would be too expensive to run across, like, a million documents; that's why you don't want to do that. That's why you have a two-step process. And that's why you might want to have the re-scoring. So, yes, in Elasticsearch you can now do re-scoring, because it's becoming more and more popular.

I don't have a full example there, but we do have a re-ranking model built in by default now; let me pull that up. So, we currently have the version 1 re-ranking, but we have a built-in re-ranking model now as well. So, among the tasks that we can do, you can see here we have the other tasks, for example, the dense text embedding, and now we have a re-ranking task that you can also call.

Your question? How do you express that? Okay. It might be too easy. No, re-ranking is good. Let me -- somehow my keyboard binding is broken. This is very annoying. Okay. We re-rank results. Let me see. Somewhere here there should be -- so, there's learn to rank, but it should not be the only one.

This is what we want. Okay. We have our re-ranking model. Unless, Dave, you know off the top of your head where we have the right docs for this. The organization of our docs I can't help you with -- but retrievers -- Yeah, retrievers could find them. Starting in 8.16 or 8.17, the retrievers API added a specific parent-level retriever that -- it's nested around the outside of the retrievers that go inside it.

And it's specifically called the text re-ranking retriever. And so if you have a cross encoder, say from Hugging Face, or you're using the Elastic re-ranker -- you're using one of these things that implements that kind of inference task of taking a bunch of things and re-ranking them against the query, right?

Taking the full token stream, right? And the full token stream with the context and doing that. So you can target a parent level text field of the documents that are being retrieved. So it works really well for the one document or chunk, kind of re-ranking use case. I've also seen people just do it outside in the second API call, say if you wanted to do it on a highlighted thing.

Or if you wanted to do re-ranking on sub-document chunks, that works pretty well with the API. But there's a text re-ranker retriever that specifically got added in 8.16 or 8.17. Yeah, so I think this is a simple example. Like, we have a standard match. Like, this will be very cheap.

And then we have the text-similarity re-ranker, which uses our Elastic re-ranker. That falls back to that model behind the scenes. So you can think about it like this -- I was a functional programmer, so don't mind the parentheses -- you would have the text re-ranker retriever. Inside of that, you have the RRF.

Inside of that, you would have lexical and KNN as peers. And it works from the inside out: hey, do each of those retrieval methodologies, do, like, the Venn diagram, find the best results. And then take the full text of those results and run them through the re-ranker. It's almost like a little mini LLM asking: does this actually answer the question?
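A rough sketch of that nesting -- index, field, query text, and inference endpoint names are placeholders, and the exact name of the built-in Elastic re-ranker endpoint depends on your version:

```
GET transcripts/_search
{
  "retriever": {
    "text_similarity_reranker": {
      // the re-ranker re-scores whatever the inner retriever tree returns
      "retriever": {
        "rrf": {
          "retrievers": [
            { "standard": { "query": { "match": { "text": "who murders whom in the saga" } } } },
            {
              "knn": {
                "field": "text_embedding",
                "query_vector_builder": {
                  "text_embedding": {
                    "model_id": "my-embedding-model",
                    "model_text": "who murders whom in the saga"
                  }
                },
                "k": 10,
                "num_candidates": 50
              }
            }
          ]
        }
      },
      "field": "text",
      "inference_id": "my-reranker-endpoint",
      "inference_text": "who murders whom in the saga",
      "rank_window_size": 100
    }
  }
}
```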

And then it ranks by the outcome. It's pretty good. The cool thing about the re-ranker is you can run it on structured lexical retrieval. You don't have to run it on a vector. You can run it on anything you want. So you don't have to pay for vector search on everything.

Or maybe the text is too small for vector search, and you don't need the model to lock onto it there. The re-ranker, when you run it on actual customer data sets -- they're like, yeah, our evaluation score is bumped by 10 points.

Basically, for free, it feels like cheap. Right? So when you run against, like, a Gemini API, and you're like, wow, why is this 10 points better than the Amazon one? It's because they threw on their retriever . Right? So there's a lot of black box stuff out there that we're exposing.

So don't be scared when we tell you how it works inside. But this is what the leading retrieval technology is doing under the hood and reselling to you as if, you know, it's all AI. Right? Yeah. Does that answer the question? Yeah. So my wish list is: I want to be able to do re-ranking on sub-documents in the retrievers API.

A lot of my things are about sub-document retrieval. Right now, I've got to do it outside of the retrievers API, but I'm bending here as a developer. Yeah. So just to give you an example -- I don't think I have a re-ranking example here, but this one uses a classic keyword match retriever.

And then we have -- we normalize the score here. I think somebody else asked about normalizing -- or we had a discussion about normalizing. We do min/max normalization. We weight this with two. And then I use the OpenAI embeddings, again normalized, with a weight of 1.5.
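I don't have his exact query here, but one way to express that kind of weighted, min/max-normalized blend is a linear retriever along these lines -- a sketch only, with placeholder names, and the syntax comes from recent versions:

```
GET transcripts/_search
{
  "retriever": {
    "linear": {
      "retrievers": [
        {
          // keyword leg, min/max normalized, weighted 2x
          "retriever": { "standard": { "query": { "match": { "text": "droid robot" } } } },
          "weight": 2,
          "normalizer": "minmax"
        },
        {
          // dense leg (e.g. OpenAI embeddings), min/max normalized, weighted 1.5x
          "retriever": {
            "knn": {
              "field": "text_embedding",
              "query_vector_builder": {
                "text_embedding": {
                  "model_id": "my-openai-embedding-endpoint",
                  "model_text": "droid robot"
                }
              },
              "k": 10,
              "num_candidates": 50
            }
          },
          "weight": 1.5,
          "normalizer": "minmax"
        }
      ]
    }
  }
}
```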

And then they get blended together, and you get results that won't surprise you: "these are not the droids you're looking for", if you search for droid and robot, will be by far the highest ranking document. You had a question somewhere. Yes. How much control does Elastic give you -- if you're doing the re-ranking thing, currently we do something similar, but we, like, do the different steps of the re-rankers at, kind of, like, different levels of the distribution hierarchy.

So, like, you have sharded processing nodes, and you do some ranking there, and, like, once you rejoin, before you return the result from the query engine, that's when you run a final re-ranker. So, we would retrieve, like, x candidates -- and you could define the number of candidates -- and then we would run the re-ranking on top of those.

So, that will be a trade-off for you, like, the larger the window is, the slower it will be, but the potentially higher quality your overall results will be because you will just have everything in your data set that you can then re-rank at the end of the day. Is that what you meant or you wanted something per node or -- Yeah, I mean, like, right now we, like, are actually confirmed, like, specifically, we're kind of, like, we're in the compute and that happens, so that you can, like, get out for our rest of the week and do, like, keep re-ranking and then do some re-ranking of one, but -- but maybe that doesn't apply to these .

Like, how the query is mechanically running? Yeah, I don't think that's how we do it. So, what you can control here is, like, this is a window of, like, what you might retrieve and then we have the minimum score, like a cutoff point, to throw out what might not be relevant anyway, to keep that a bit cheaper, that's what we have here.

Those are the retrievers. And then you could do the RRF that I've explained, where you blend results together. All of that is easy. One final note: if you got tired of all the JSON, we have a new way of defining those queries as well. Here we have a match operator, like the one we have used all the time, that you can use either on a keyword field, but also on a dense or sparse vector embedding, and then you can just run a query on that and get the scores from that.
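What's on screen here is presumably ES|QL; a rough sketch of a query in that style -- index and field names are guesses, and the full-text MATCH function plus _score metadata need a fairly recent version -- would be:

```
POST _query
{
  "query": """
    FROM transcripts METADATA _score
    | WHERE MATCH(quote, "droids")
    | SORT _score DESC
    // keep only the fields shown on screen: the quote, the speaker, and the relevance score
    | KEEP quote, speaker, _score
    | LIMIT 5
  """
}
```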

So, it is a piped query language; it's a bit more like, I don't know, like a shell. But if you don't want to type all the JSON anymore, this is how you can do that. And here my screen size is a bit off. But, yeah, you get the quote that we retrieved, the speaker, and the score.

Maybe I'll take out the speaker to make this slightly more readable. No, it broke. Oh. This is, you could write queries with a fraction of the JSON. This will also support funny things like joins. It doesn't have every single search feature yet, but it's getting pretty close. So this is more like a closing out at the end.

If you're tired of all the JSON queries, you don't have to write JSON queries anymore. This is nice both for, like, observability use cases where you have, like, just like aggregations and things like that, but it's also very helpful for full text search now if you want to write different queries.

I think the main answer is that the language support in the different languages like Java, etc., is not very strong yet. You basically give it strings and then it gives you a result back that you need to parse out again. So it is not as strongly typed on the client side yet as the other languages.

Any final questions? Yes? Can you just talk about hybrid search? I was just curious, like, what are kind of the recommended best practices? Like, we also do hybrid search today. What we do is we trigger two Elastic queries: one to do, like, a basic keyword search, the other one to do, like, a kNN vector search, and then we walk through some code, a rewrite, and we get the final results.

But, like, with the retrievers you just showed us, is it better, like, to combine those two queries in one retriever, and, like, would that make the results significantly better? Like, each of the queries will have its own stats and normalizer and things; I don't know, just, like, in general, it sounds better.

I mean, we can make your life easier. Yeah. It's just all behind one single query endpoint. So you could use the two different methods to retrieve and then you could still re-rank but all from one single query so you don't have to do it yourself. I mean, it's not like we want to stop you, but you don't have to and we can make your life a bit easier.

I mean, it's only one single query that you need to run and, like, one single round trip to the server that you need to do. Yeah. But, like, I was just curious comparing to the two queries method, would you do that? I mean, if you still need to do the retrieval, like, you do the retrieval, like, all the individual pieces are still there.

If you have two parts of the query, you will still retrieve those -- if that is the main cost -- and then you have the re-ranking, so you're not getting out of those completely. But you can just do it in one single request that you send. We take care of all of that for you and then send you one result set back, rather than sending more back to your application.

So it will potentially be a little less work on the Elasticsearch side, but it will mostly be less work on your application side. So if you don't have any problems, you may not notice, but you're running two query passes, right? When you could be running one, and you're denying the optimizer the opportunity to take any shortcuts and say, oh, there are no more results that are better.

Yeah. So you're potentially going to use a little bit more performance and resources. So if it's not hurting you, by all means, keep going. But at some point, you're going to start vertically scaling your hardware when you don't need to; you could get further, it's just at a higher cost.

Yeah. Perfect. Thank you so much. I hope everybody learned something. I will leave the instance running for today or so, so you can still play around with the queries if you feel like it. Thanks a lot for joining. We want stickers. We have stickers up there. We also have a booth the next few days.

Come join and get some proper swag from us there. Thank you. See you around. Thanks. Bye.