
Information Retrieval from the Ground Up - Philipp Krenn, Elastic



00:00:00.000 | Let's get going. Audio is okay for everybody. I have some slight feedback, but I'll try to
00:00:20.520 | manage. I hope it's okay for you. Hi, I'm Philipp. Let's talk a bit about retrieval. I'll show you
00:00:27.300 | some retrieval from the ground up. We'll keep it pretty hands on. You will have a chance
00:00:32.660 | to follow along and do everything that I show you as well. I have a demo instance that you
00:00:37.020 | can use. Or you can just watch me if you have any questions, ask at any moment. If anything
00:00:44.340 | is too small, reach out, and we'll try to make it larger. We'll try to adjust as we
00:00:48.760 | go along. So, I guess we're not over RAG yet. But RAG is a thing. And we'll focus on the
00:00:57.760 | R in RAG, the retrieval in retrieval augmented generation. We'll just focus on the retrieval. Just let's
00:01:04.760 | see where we are with retrieval. Quick show of hands. Who has done RAG before? Okay. That's
00:01:10.760 | about half or so. Who has done anything with vector search and RAG? Do I need vector search
00:01:19.580 | for RAG or can I do anything else? Yeah. Yeah. So, you can do anything. Retrieval is actually
00:01:28.620 | a very old thing. Depending on how you define it, it might be 50, 70, whatever years old.
00:01:34.560 | It's just getting the right context to the generation. I'll ignore all the generation
00:01:39.820 | for today. We'll keep it very simple. We'll just focus on the retrieval part of getting
00:01:43.220 | the right information in. Partially from the old stuff, like the classics. But we'll get
00:01:49.440 | to some new things as well as we go along. Who has done keyword search before? Just that
00:01:56.840 | is fewer than vector search, I feel like. Which almost reminds me of like 15 years ago or so,
00:02:04.140 | when NoSQL came up, like more people had done MongoDB, Redis, whatever else, rather than
00:02:09.020 | SQL. That has changed again. I think it will be kind of similar for retrieval. The way I
00:02:15.380 | would always say that vector search is a feature of retrieval. It's only one of multiple features
00:02:22.820 | or many features that you want in retrieval. And we'll see a bit why and how and we'll dive
00:02:27.180 | into those details. So, I work for Elastic, the company behind Elasticsearch. We're the most downloaded,
00:02:33.720 | deployed, whatever else, search engine. We do vector search, we do keyword search, we do
00:02:38.280 | hybrid search. We'll dive into various examples. Everything that I will show you works -- well,
00:02:44.960 | the query language is Elasticsearch. But if you use anything built on Apache Lucene, everything
00:02:50.540 | behaves very similarly. If you use something that is a clone or close to Lucene, like anything
00:02:56.680 | built on Tantivy or anything like that, it will be very similar. The foundation, keyword search
00:03:03.680 | and vector search will apply broadly everywhere. So, let's get going. We'll keep this pretty
00:03:09.800 | hands on. Who remembers in Star Wars when he's making that hand gesture? What is the quote?
00:03:16.920 | These are not the droids you're looking for. We'll keep this relatively Star Wars based. Feel
00:03:26.040 | free to come in and fill in the seats on the sides or wherever. I'm afraid we have, I think, one chair over
00:03:31.740 | there otherwise and one down there. Otherwise, it's getting a bit full. Okay. Let's look at
00:03:40.040 | what "these are not the droids you're looking for" does for search. And I will start
00:03:45.200 | kind of like with the classic approach. Keyword search or lexical search is like you search
00:03:50.040 | for the words that you have stored and we want to find what is relevant in our examples.
00:03:56.760 | If you want to follow along, there is a gist which has all the code that I'm showing you.
00:04:03.160 | So, let's go to ela.st/ai.engineer. There is one important thing. It's that I have one
00:04:11.000 | shared instance basically for everybody. So, you can all just use this without signing up
00:04:15.400 | for any accounts or anything. So, this is just a cloud instance that you can use. There is my handle.
00:04:21.240 | It's in the index name. If you don't want to fight and overwrite each other's data, replace that with
00:04:28.840 | your unique handle or something that is specific to you. Because otherwise, you will all work on
00:04:33.320 | the same index and kind of like overwrite each other's data. You can also just watch me. If you
00:04:37.960 | don't have a computer handy, that's fine. But if you want to follow along, ela.st/ai.engineer,
00:04:44.360 | there will be a gist. It will have the connection string. Like there is a URL and then the credentials
00:04:49.800 | are workshop, workshop. If you go into log in, it will say log in with Elasticsearch. That's where you
00:04:55.000 | use workshop, workshop. Then you will be able to log in. And you can just run all the queries that I'm
00:05:00.360 | showing you. You can try out stuff. If you have any questions, shout. I have a couple of colleagues
00:05:05.960 | dispersed in the room. So, if we have too many questions, we will somehow divide and conquer.
00:05:11.080 | So, let's get going and see what we have here. And I will show you most of the stuff live.
00:05:19.960 | I think this is large enough in the back row. If it's not large enough for anybody, shout and we will
00:05:24.440 | see how much larger I can make this. And let me turn off the Wi-Fi and hope that my wired connection
00:05:33.640 | is good enough. Let's refresh to see. Ooh. Maybe we will use my phone after all.
00:05:44.760 | Okay. Let's try this again.
00:06:07.320 | Okay. This is no good. Out you go.
00:06:26.600 | Okay. Hardest problem of the day solved. We have network.
00:06:34.200 | Okay. So, we have the sentence. These are not the droids you are looking for. And we will start
00:06:38.040 | with the classic keyword or lexical search. Like, what happens behind the scenes?
00:06:41.880 | So, what you generally want to do is you basically want to extract the individual words and then make
00:06:47.480 | them searchable. So, here, I'm not storing anything. I'm just looking at, like, how would that look like
00:06:53.240 | if I stored something? I'm using this underscore analyze endpoint to see what I will actually store in
00:07:01.080 | the background to make them searchable. So, these are not the droids you are looking for. And you see,
00:07:06.120 | these are not the droids you are looking for. In Western languages, the first step that happens is
00:07:21.160 | the tokenization. In Western languages, it's pretty simple. It's normally any white spaces and punctuation
00:07:25.800 | marks where you just break out the individual tokens. Especially Asian languages are a bit more
00:07:31.880 | complicated around that. But we will gloss over that for today. And we have a couple of interesting
00:07:37.320 | pieces of information here. So, we have the token. So, this is the first token. We have the start offset
00:07:43.720 | and the end offset. Why would I need a start and end offset? Why would I extract and then store that
00:07:50.200 | potentially? Any guesses? Yeah? Yes. Especially if you have a longer text, you would want to have that
00:07:58.360 | highlighting feature that you want to say, this is where my hit actually was. So, if I'm searching for
00:08:03.720 | these, which is maybe not a great word, but you would very easily be able to highlight where you had
00:08:08.120 | actually the match. And the trick that you're doing in search, generally what differentiates it from a
00:08:14.040 | database is a database just stores what you give it and then does basically almost everything at query
00:08:19.560 | or search time. Whereas a search engine does a lot of the work at ingestion or when you store the data.
00:08:25.480 | So, we break out the individual tokens. We calculate these offsets and store them. So, whenever we have a
00:08:31.400 | match afterwards, we never need to reanalyze the actual text, which could potentially be multiple pages
00:08:36.760 | long. But we could just highlight where we have that match because we have extracted those positions.
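(For reference, a minimal sketch of the analysis call being shown here; the `_analyze` endpoint returns each token with its start offset, end offset, and position:)

```
GET _analyze
{
  "tokenizer": "standard",
  "text": "These are not the droids you are looking for"
}
```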
00:08:41.640 | We have a position. Why would I want to store the position with the text that I have?
00:08:46.600 | Yeah? Annotation. So, the main use case that you have is if you have these positions and later on,
00:08:57.880 | we'll briefly look at if you want to look for a phrase, if you want to look for this word followed
00:09:02.040 | by that word. So, you could then just look for all the text that contain these words. But then you
00:09:07.880 | could also just compare the positions and basically look for n, n plus 1, et cetera. And you never need
00:09:12.840 | to look at the string again. But you can just look at the positions to figure out like this was one
00:09:17.480 | continuous phrase. Even if you have broken it out into the individual tokens. Most of the things that
00:09:23.960 | we see here are of type alphanum, for alphanumeric. An alternative type would be synonym. We'll skip over
00:09:30.440 | synonym definition because it's not fun to define tons of synonyms. But this is all the things that we're
00:09:36.200 | storing here in the background. You can also customize this analysis. And that is one of the features,
00:09:41.960 | again, of full text search and lexical searches that you preprocess a lot of the information to make
00:09:47.400 | that search afterwards faster. So, here you can see I'm stripping out the HTML because nobody's going
00:09:53.320 | to search for this emphasis tag. I use a standard tokenizer that breaks up, for example, on dashes.
00:10:01.480 | You will see that. Alternatives would be white space that you only break up on white spaces.
00:10:05.880 | I lowercase everything, which is most of the times what you want because nobody searches in Google
00:10:13.880 | with proper casing, or at least maybe my parents do. But nobody else searches with proper casing in Google.
00:10:20.200 | We remove stop words. We'll get to stop words in a moment. And we do stemming with the snowball stemmer.
00:10:27.800 | What stemming is it basically reduces a word down to the root. So, you don't care about singular,
00:10:32.280 | plural or like the flexion of a verb anymore. But you really care more about the concept. So,
00:10:38.360 | if I run through that analysis, does anybody want to guess what will remain of
00:10:43.800 | this phrase or which tokens will be extracted and in what form?
00:10:47.400 | Not a lot will remain. Two?
00:10:57.080 | Droid and Look?
00:10:58.520 | Yeah, close. So, we'll actually have three. So, we have Droid, You and Look. And you can see
00:11:08.840 | all the others were stop words which were removed. The stemming
00:11:13.720 | reduced looking down to look because we don't care if it looks, looking, look. We just reduce it to the
00:11:19.960 | word stem. So, we do this when we store the data. And by the way, when you search afterwards, your text
00:11:26.040 | will run through the same analysis that you would have exact matches. So, you don't need to do anything
00:11:30.520 | like a like search anymore in the future. So, this will be much more performant than anything that you
00:11:35.800 | would do in a relational database because you have direct matches. And we'll look at the data structure
00:11:40.200 | behind it in a moment. But what we get is Droid, You and Look with the right positions. So, for example,
00:11:47.640 | if we search for Droid, You, we could easily retrieve that because we have the positions,
00:11:52.200 | even though that is a weird phrase. Do we start indexing at zero or one?
00:11:57.320 | 0, yes. It's the only right way. There is a different discussion here. So, we are -- the positions are
00:12:09.080 | based starting at zero. And these are the tokens that are remaining. If you do this for a different
00:12:15.640 | language, like you might hear I'm a native German speaker. This is the text in German. And you would,
00:12:21.720 | if you use a German analyzer, it would know the rules for German and then would analyze the text in the
00:12:27.800 | right way. So, then you would have remaining Droid, den, such. Anybody wants to guess what happens if I
00:12:36.280 | have the wrong language for a text? It will go very poorly. Because the -- so, how this works is, basically,
00:12:47.160 | you have rules for every single language. It's like, what is the stop word? How does stemming work?
00:12:51.720 | If you apply the wrong rules, you basically just get wrong stuff out. So, it will not do what you want.
00:12:58.520 | So, what you get here is, like, this is an article. But, well, in English, the rule is an S at the end
00:13:06.200 | just gets stemmed away, even though this doesn't make any sense. So, you apply the wrong rules and you
00:13:10.840 | just produce pretty much garbage. So, don't do that. Just to give you another example,
00:13:16.360 | French, this is the same phrase in French. And then you see Droid, La, and Recherche are the words
00:13:27.480 | that are remaining in these examples. Otherwise, it works the same. But you need to have the right
00:13:31.960 | analysis for what you're doing. Otherwise, you'll just produce garbage. A couple of things as we're
00:13:39.160 | going along. The stop word list, by default, which you could overwrite, is relatively short. This is
00:13:45.160 | linguists have spent many years figuring out what are the right list of stop words. And you don't want
00:13:49.960 | to have too many or too few. In English, I always forget, I think it's 33 or so. This is where you
00:13:56.120 | can find it in the source code. It's -- I don't want to say well hidden, but it's not easy to find either.
00:14:00.440 | So, every language has, like, a list of stop words that are defined and that will be automatically removed. For
00:14:06.440 | "These are not the droids you are looking for", by accident, more or less, we had a lot of stop words,
00:14:11.640 | and that's why not a lot remained here in the phrase. And then for all other languages,
00:14:15.320 | you will have a similar list of stop words. Should you always remove stop words?
00:14:25.560 | Yes, no. Yes. That is, by the way, another good one: "not" -- I'm not sure if everybody heard that.
00:14:36.040 | The comment was about not. One important thing here, we're talking about lexical or keyword search,
00:14:41.800 | which is dumb, but scalable. It doesn't understand if there is a droid or there's no droid. It's just
00:14:49.640 | defined as a stop word. It does just keyword matching. That is, in vector search or anything
00:14:56.200 | with a machine learning model behind it will be a bit of a different story afterwards, where these
00:15:00.840 | things might make a difference. But this is very simple, because it just matches on similar strings,
00:15:06.120 | basically. It doesn't understand the context. It doesn't know what's going on. That's why the linguists
00:15:10.840 | decided "not" is a good stop word. You could overwrite that if, for your specific use case, this is not a
00:15:17.240 | good idea. Always removing stop words, yes, no, maybe. So, our favorite phrase is it depends.
00:15:27.560 | And then you have to explain, like, what it depends on. So, what it depends on is there are scenarios
00:15:34.360 | where removing all stop words does not give you the desired result. And maybe you want to have, like,
00:15:39.400 | a text with and without stop words. Like, sometimes stop words are just, like, a lot of noise that blow up
00:15:44.520 | the index size and don't really add a lot of value. That's why we have to find them and try to remove
00:15:48.840 | them by default. But if you had, for example, to be or not to be, these are all stop words.
00:15:54.840 | It would all be gone when you run it through analysis. So, it is tricky to figure out, like,
00:16:02.600 | what is the right balance for stop words or what works for your use case. But you might have unexpected surprises
00:16:08.040 | in all of this. Okay. We have seen the German examples. Let's do some more queries. Or let's
00:16:17.560 | actually store something. So far, we only pretended or we only looked at what would happen if we would
00:16:23.240 | store something. Now, I'm actually creating an index. Again, if you're running this yourself,
00:16:28.680 | please use a different name than me. Just replace all my handle instances with your handle or whatever you
00:16:37.400 | want. Since this is a shared instance. If you have too many collisions, I might jump to another instance
00:16:42.440 | that I have as a backup in the background. But what I'm doing here is I'm creating this analysis pipeline
00:16:48.360 | that I have looked at before. Like, I'm throwing out the HTML. I use a standard tokenizer, lower casing,
00:16:53.400 | stop word removal, and stemming. And then I call this my analyzer. And then I'm basically applying this
00:17:00.760 | my analyzer on a field called "quote". We call this a mapping. It's kind of like the equivalent of a schema in a
00:17:09.000 | relational database. But this defines how different fields behave. Okay. And somebody did not replace
00:17:19.800 | the query. By the way, you need to keep the user_ prefix. Let me quickly do this myself. Oops. I should have seen
00:17:38.840 | this one coming. We want to replace, and we'll use, oops, oops. Please don't copy that. And I want to
00:18:07.160 | let's try it again. So we're creating our own index. And now I just to double check, I'll just again run this
00:18:19.160 | underscore analyze against this field that I've set up to just double check that I've set it up correctly.
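(A minimal sketch of what that index creation might look like; user_myhandle is a placeholder for your own handle, and the analysis chain mirrors the steps described above:)

```
PUT user_myhandle
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "snowball"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "quote": { "type": "text", "analyzer": "my_analyzer" }
    }
  }
}
```

The double check is then a GET user_myhandle/_analyze call with "field": "quote" and a sample text, which runs the text through the analyzer attached to that field.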
00:18:37.160 | And now I'm actually starting to store documents. Bless you. So we'll store -- these are not the droids you're
00:18:45.160 | looking for. I have two others that I'll index just so we have a bit more to search. No, I am your father.
00:18:53.880 | Any guesses what will remain here? Father. Father. Yeah.
00:19:00.680 | Okay. Let's try this out. Let me copy my -- this one actually has way fewer stop words than you would expect.
00:19:11.160 | Let's quickly do this. Since I didn't do the HTML removal, let's take these out manually.
00:19:23.560 | So what you get is, no, I am your father. And this was stupid because this was not what I wanted. We need to run this against the right analysis.
00:19:33.160 | This happens when you copy-paste. Okay, um, uh, sorry. And we'll do text.
00:19:47.320 | No, I think I've patched this back together. Okay, I am your father. So "no" is the only stop word in this list, actually.
00:20:01.000 | "No" is on the stop word list, all the others are not. Um, okay, let's try another one.
00:20:19.720 | Obi-Wan never told you what happened to your father. How many tokens will Obi-Wan be? Two? One?
00:20:32.040 | No, Obi-Wan will be two. Like Obi-Wan -- because we use the default tokenizer or standard tokenizer,
00:20:41.880 | that one breaks up at dashes. If you had used another tokenizer like white space, that would keep it
00:20:46.760 | together because it breaks up in white spaces. So there are various reasons why you want or would
00:20:51.560 | not want to do it. I don't want to go into all the details. But there are a lot of things to do
00:20:55.880 | right or wrong when you ingest the data, which will then allow you to query the data in specific ways.
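(A small sketch of that difference, using the `_analyze` endpoint again; the standard tokenizer splits Obi-Wan on the dash, the whitespace tokenizer keeps it together:)

```
GET _analyze
{
  "tokenizer": "standard",
  "text": "Obi-Wan"
}

GET _analyze
{
  "tokenizer": "whitespace",
  "text": "Obi-Wan"
}
```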
00:21:01.160 | So, for example, if you would have an email address, that one is also weirdly broken up. Like,
00:21:09.960 | you might use, like, there's a dedicated tokenizer for URL and email addresses. So, depending on what type of
00:21:16.200 | data you have, you will need to process the data the right way because pretty much all the smart
00:21:21.400 | pieces are kind of like at ingestion here to make the search afterwards easier. So, you can easily do
00:21:27.800 | that. Let's see. Let's index all my three documents so that we can actually search for them. Now, if I
00:21:35.720 | start searching for droid, it should match. These are not the droids you're looking for. Yes or no. Because this
00:21:43.800 | one is singular and uppercase, and the droid that we stored was plural and lowercase. Will that match,
00:21:49.240 | yes or no? Yes. Why? Because of the stemming.
00:21:54.520 | Yes, we had the stemming. We had the lower casing. And when we search, so we store the text, it runs
00:22:01.560 | through this pipeline or the analysis. And for the search, it does the same thing. So, it will lowercase the
00:22:07.880 | droid. It has stemmed down the droids in the text to droid and then we have an exact match. So, what the
00:22:16.360 | data structure behind the scene actually looks like. The magic is kind of like in this so-called inverted
00:22:23.080 | index. What the inverted index is, is these are all the tokens that remained that I have extracted.
00:22:30.920 | I have alphabetically sorted them. And they basically have a pointer and say in this document, like with
00:22:36.120 | the IDs 1, 2, 3 that I have stored, we have how many occurrences? Like 0, 1, yeah, nothing had 2.
00:22:45.080 | And then we also know at which position they appeared. So, search for droid now. This is what I have stored.
00:22:55.560 | I lowercase the droid to droid. I have an exact match here. Then I go through the list and see,
00:23:00.600 | retrieve this document, skip this one, skip this one. And at position 4, you have that hit. And then you
00:23:06.440 | could easily highlight that. So, you have almost done all the hard work at ingestion and this retrieval
00:23:11.640 | afterwards will be very fast and efficient. That's the classic data structure for search, the inverted
00:23:17.880 | index where you have this alphabetic list of all the tokens that you have extracted to do that.
00:23:23.160 | And this will just be built in the background for you and that's how you can retrieve all of this.
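(The droid search as a sketch; the match query analyzes the search text with the same analyzer as the stored field, so the lowercased, stemmed droid finds the stored token:)

```
GET user_myhandle/_search
{
  "query": {
    "match": { "quote": "Droid" }
  }
}
```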
00:23:26.920 | Let's look at a few other queries and how they behave. If I search for robot, will I find anything?
00:23:41.240 | No, because there was no robot. There was a droid. We could now define a synonym and say, like,
00:23:51.160 | all droids are robots, for example. Who likes creating synonym lists? Nobody anymore. Okay.
00:24:00.920 | Normally, I would have said that's the Stockholm syndrome because there is sometimes somebody who
00:24:05.320 | likes creating synonym lists because they have done that for so many years. But it got easier nowadays.
00:24:10.840 | Now you can use LLMs to generate the synonyms. So, it can get a bit easier to create them. But they're
00:24:16.200 | still limited because you have always this mapping. So, with synonyms, you can expand the right way.
00:24:21.480 | Where it gets trickier if you have homonyms. If a word has multiple meanings, like a bat could be the
00:24:28.360 | animal or it could be the thing you hit a ball with. There it just gets trickier because there is no meaning
00:24:34.360 | behind the words or no context. So, you just match strings and that is inherently limited. But, like I said,
00:24:40.680 | it's dumb, but it scales very well. And that's why it has been around for a long time. And it does
00:24:46.920 | surprisingly well for many things because there's not a lot of things that are unexpected or that can
00:24:51.640 | go totally wrong. Now, other things that you can do. You could do a phrase search where you say,
00:24:58.680 | I am your father. Will this find anything? Yes. Because we had no, I am your father.
00:25:11.320 | What happens if I say, for example, I am, let's see, I am not your father. Yes, no? No. No. Why? So, you're right.
00:25:27.080 | Looking for an exact match based on the position.
00:25:29.240 | Not as a stop word. But not as a stop word. But you're right because the positions still don't match.
00:25:38.280 | So, the stop word not would be filtered out, but it still doesn't match because the positions are off.
00:25:45.240 | That is one of the things that sometimes can be confusing. So, even if something is a stop word
00:25:51.240 | and will be filtered out, it doesn't work like that. One thing that you can do is, though,
00:25:56.520 | that the factor is called slop, where you basically say if there is something missing,
00:26:01.480 | it would still work. So, I am your father and I am father with slop zero, that's kind of like the
00:26:09.000 | implicit one. Will not find anything. But if I say one, then I basically say, like,
00:26:13.880 | there can be a one-off in there. Like, one word can be missing.
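(A sketch of such a phrase query; slop is the parameter on the match_phrase query:)

```
GET user_myhandle/_search
{
  "query": {
    "match_phrase": {
      "quote": {
        "query": "I am father",
        "slop": 1
      }
    }
  }
}
```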
00:26:24.040 | However, I am his father. Here, his would not match. So, this still will not work.
00:26:28.680 | The slop is really just to skip a word. Yeah?
00:26:31.960 | What about I'm your father?
00:26:33.880 | I'm your father?
00:26:37.080 | I assume that -- no, I'm possibly your father. I assume that won't work.
00:26:41.000 | Ah. That will not work.
00:26:43.240 | How would you get that to work?
00:26:45.160 | There you might need to do something like a synonym where you say 'm gets replaced by am.
00:26:52.680 | Or we will need to have some more machine learning capabilities behind the scenes to do stuff like
00:26:57.640 | that. Are there any libraries that would predefine contractions like that?
00:27:02.760 | So, what is built in is generally a very simple set of rules. What you will need to do for things
00:27:11.800 | like this is normally you need a dictionary. The problem around these is they are normally not available for
00:27:17.000 | free or open source. Funnily enough, they are often coming out of university, the dictionaries,
00:27:24.200 | because they have a lot of free labor. The students. That's why the universities have been creating a lot
00:27:31.240 | of dictionaries. But they often come out under the weirdest licenses. That's why they are not very widely
00:27:35.720 | available. But, yes, there is a smarter or more powerful approach if you have a dictionary and you
00:27:41.480 | can do these things. For example, one thing to show is like, maybe that's a good thing to also mention.
00:27:50.920 | You don't always get words out of the stemming. It's not a dictionary. It doesn't really get what
00:27:58.520 | you're doing. It just applies some rules. So, for example, Blackberry. Blackberry. Sorry, Blackberries,
00:28:09.000 | I think that this will be stemmed down differently. Ah, sorry, I need English. Without English, this will
00:28:15.160 | not work. So, this will stem down to this weird word "blackberri". And it will also stem down the singular
00:28:27.160 | Blackberry. So, there's a rule that applies this. But it's just a rule. It's not dictionary-based.
00:28:32.600 | It's not very smart. And it only has some rules built in that work for this. But you will definitely
00:28:38.680 | hit limits. And the other thing, by the way, why I picked Blackberry as an example, you have some
00:28:45.720 | annoying languages like German, Korean, and others that compound nouns like Blackberry, where you have
00:28:53.000 | basically two words. Black would never find Blackberry in the simplest form because it's not a complete
00:28:59.880 | string. There are various ways to work around that that all come with their own downsides. And either
00:29:05.960 | you have a dictionary or you extract the so-called n-grams. It's like groups of characters, and then you match
00:29:10.680 | groups of characters. But all of those are one of the many tools how we try to make this a bit better or smarter,
00:29:17.560 | but it all has limitations. I hope that answers the question and makes sense.
00:29:22.360 | So, there are dictionaries, but they're generally not free or not under an easy license available.
00:29:29.080 | For some languages, by the way, even the stemmers are not freely available. I think there is a stemmer
00:29:35.880 | or analyzer for Hebrew. I think that has also like some commercial license or at least you can't use it for
00:29:43.240 | free or free in commercial products. Though licensing with machine learning models is also its own dark
00:29:51.240 | secret. Yeah.
00:29:52.360 | Yes. That is what an n-gram is doing. Let me see if I can.
00:30:18.120 | An n-gram is normally a character group, normally a trigram. This is way too small.
00:30:23.800 | Somehow I have weirdly overwritten my command plus so I can't use that. Let me make this slightly larger.
00:30:32.440 | Okay. Here we basically use one or two letters as word groups, which is way too small.
00:30:46.360 | But just to show the example, and this is very hard to read. Let me copy that over to my console.
00:30:52.920 | There you can -- there you can -- oops. There you can see this. But this is a great question.
00:31:03.080 | So, we'll use n-gram for quick fox. And then you can see the tokens that I extract here are the first letter,
00:31:10.200 | the first two, the second, the second and third, et cetera. And you end up with a ton of tokens.
00:31:16.760 | The downside is, A, you have to do more work when you store this. B, it creates a lot of storage on
00:31:23.640 | disk because you extract so many different tokens. And then your search will also be pretty expensive
00:31:28.120 | because normally you would at least do trigrams. But even that creates a ton of tokens and a ton of
00:31:36.440 | matches. And then you need to find the ones with the most matches. And it works. But A, it is pretty
00:31:42.280 | expensive in disk but also query time. And it might also create undecided results or results that are a bit
00:31:49.160 | unexpected for the end user. It is, I would call it, again, it's a very dumb tool that works reasonably
00:31:56.760 | well for some scenarios. But it's only one of many potential factors. What you could potentially do is,
00:32:03.400 | and I don't have a full example for that, but we could build it quickly, what you would do in reality
00:32:09.080 | probably, you might store a text more than one way. So, you might store it, like, with stop words and
00:32:15.720 | without stop words and maybe with engrams. And then you give a lower weight to the engrams and say,
00:32:22.040 | like, if I have an exact match, then I want this first. But if I don't have anything in the exact
00:32:26.040 | matches, then I want to look into my engram list. And then I want to kind of, like, take whatever is
00:32:31.400 | coming up next. So, even keyword-based search will be more complex if you combine different methods.
00:32:38.680 | N-grams are interesting, but, again, they're a dumb but pretty heavy hammer.
00:32:45.320 | Use them in the right scenario.
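(A sketch of trying the n-gram tokenizer directly, with min_gram and max_gram spelled out; one and two are the defaults discussed next:)

```
GET _analyze
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 1,
    "max_gram": 2
  },
  "text": "quick fox"
}
```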
00:32:48.040 | Sorry, quick question about this n-gram. Is it by default one or two?
00:32:51.560 | Yes. But you could redefine that. So, we can, let me go back to the docs. For the n-gram tokenizer,
00:33:01.000 | you can say min_gram and max_gram. If you set both to three, you would have trigrams, where it's always
00:33:06.840 | groups of three, like 1, 2, 3, 2, 3, 4, et cetera. You could also have something called edge n-gram,
00:33:14.200 | where you expect that somebody types the first few letters right, and then you only start from the
00:33:18.600 | beginning but not in the middle of the word, which sometimes avoids unexpected results. And, of course,
00:33:24.680 | reduces the number of tokens quite a bit. So, somewhere in here, edge n-gram.
00:33:38.120 | Let's just copy that over so I won't type. So, here we have edge n-gram with quick, and you can see it
00:33:47.400 | only does the first and the first two letters, but nothing else. And, in reality, you would probably
00:33:53.160 | define this like 2, 2, 5, or more, or whatever else you want. But, here, we only do from the start and
00:33:59.880 | nothing else, which reduces the tokens tremendously. But, of course, if you have blackberry and you want to
00:34:06.840 | match the berry, you're out of luck. Makes sense. Anybody else? Anything else?
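(The edge n-gram variant as a sketch; with these settings the text quick should only produce q and qu:)

```
GET _analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 1,
    "max_gram": 2
  },
  "text": "quick"
}
```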
00:34:16.120 | Yeah, so, if you have multiple languages, do not mix them up. That will just create chaos. Because we'll get to that in a moment. But, how keyword search works is basically word frequency.
00:34:23.240 | And if you mix languages, it screws up all frequencies and statistics. So, what you would do is, either you
00:34:44.520 | have a field for English, and then you would have a field for whatever the abbreviation for
00:34:53.800 | Hebrew is. And then you would have that. And then you would need to define the right analyzer for that
00:35:03.080 | specific field. So, you break it out either into different fields or you could even do different indices.
00:35:07.080 | And ideally, we even have that built in. We have language identification. Even if you just provide a couple of
00:35:16.360 | words, it will guess, or not guess, it will infer the language with a very high degree of certainty.
00:35:34.440 | Especially Hebrew will be very easy to identify. If you have your own diacritics, it's easy. But even if
00:35:44.760 | you just throw random languages at it, it will have a very good chance, just with a few words, to know
00:35:49.880 | this is this language and then you can treat it the right way. Good. Let's continue.
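(A sketch of the per-language field layout, assuming English and German fields with the built-in language analyzers; all names are placeholders:)

```
PUT user_myhandle-languages
{
  "mappings": {
    "properties": {
      "quote_en": { "type": "text", "analyzer": "english" },
      "quote_de": { "type": "text", "analyzer": "german" }
    }
  }
}
```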
00:36:03.320 | So, we have done all of these searches. We have done slop. One more thing before we get into the
00:36:12.200 | relevance. One other very heavy hammer that people often overuse is fuzziness. So, bless you. If you
00:36:21.240 | have a misspelling, so I misspelled Obi-Wan Kenobi. We already know that this is broken out into two different
00:36:29.080 | words or tokens. It will still match your Obi-Wan because we have this
00:36:33.240 | fuzziness, which allows edits. It's like a Levenshtein distance. So, you can have one. By default here,
00:36:42.200 | you could either give it an absolute value, like you can have one edit, which could be one character
00:36:49.400 | too much, too little, or one character different. You could set it to two or three. You can't do
00:36:55.800 | more because otherwise you match almost anything. And auto is kind of smart because, depending on
00:37:03.160 | on how long the token that you're searching for, it will set a specific value. If you have zero to two
00:37:10.120 | characters, auto fuzziness, I think is one from two to -- no, zero to two characters is zero. Three to five
00:37:18.440 | characters is one. And after that, it's two. So, you can match these. Will this one match?
00:37:31.720 | Yes, no, yes, no, and why?
00:37:36.680 | No, because you go over the edit value.
00:37:40.680 | Yes. So, we have -- we have -- both of those are misspelled.
00:37:43.960 | It still matches. Why?
00:37:48.360 | They get tokenized separately, and each one can have a single edit.
00:37:52.680 | Yes. That is a bit of a gotcha. So, yes. You need to know the tokenizer. So, we tokenize with standard,
00:37:58.920 | so it's two tokens, and then the fuzziness applies per token, which is another slightly surprising
00:38:04.920 | thing. But, yes, that's how you end up here. Okay. Now, we could look at how the Levenshtein
00:38:14.600 | distance works behind the scenes, but it's basically a Levenshtein automaton which looks something like
00:38:21.240 | this. If you search for food and you have two edits, this is how the automaton would work in the
00:38:25.320 | background to figure out, like, what are all the possible permutations. It's a fancy algorithm that
00:38:30.440 | was, I think, pretty hard to implement, but it's in Lucene nowadays. Okay. Now, let's talk about
00:38:38.520 | scoring. One thing that you have seen that you don't have anywhere or in a non-search engine or
00:38:44.200 | just in a database is that we have a score. It's like, how well does this match? How does the score
00:38:50.680 | work here? Let's look at the details of that one. So, the basic algorithm, which most of us,
00:39:01.640 | or pretty much all of us here, use is term frequency / inverse document frequency, or TF-IDF. It has been slightly
00:39:09.720 | tweaked, like the new implementation is called BM25, which stands for best match, and it's the 25th
00:39:15.240 | iteration of the best match algorithm. So, what they look like is you have the term frequency. If I
00:39:22.200 | search for Droid, how many times does Droid appear in the text that I'm looking for? And it's basically
00:39:28.840 | the square root of that. So, the assumption is if a text contains Droid once, this is the relevancy. If I
00:39:35.880 | have a text that contains Droid 10 times, this is the relevancy. The tweak between TF-IDF, that one just
00:39:42.680 | keeps growing, BM25 says, like, once you hit, like, five Droid in a text, it doesn't really get much more
00:39:48.840 | relevant anymore. So, it kind of, like, flattens out the curve. That is the idea of term frequency.
00:39:55.000 | The next thing is the inverse document frequency, which is almost the inverse curve. The assumption here is
00:40:04.120 | over my entire text, this is how often the term Droid appears. So, if a term is rare, it is much more
00:40:13.080 | relevant than if a term is very common, then it's kind of, like, less relevant. Basically, the assumption
00:40:18.920 | is rare is relevant and interesting. Very common is not very interesting anymore. And then it's kind of,
00:40:24.520 | like, just works its curve out like that. And the final thing is the field length norm is, like,
00:40:32.280 | the shorter a field is and you have a match, the more relevant it is. Which assumes, like,
00:40:38.680 | if you have a short title and your keyword appears there, it's much more relevant than if there's a
00:40:42.600 | very long text body and you have a match there. And these are the three main components
00:40:49.240 | of TF-IDF. So, let's take a look at how this looks like. You can make this a bit more complicated.
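(For reference, the classic Lucene versions of these three components are roughly the following; this is a sketch of TF-IDF scoring, not the exact BM25 variant:)

```latex
\mathrm{tf}(t,d) = \sqrt{\mathrm{freq}(t,d)} \qquad
\mathrm{idf}(t) = 1 + \ln\frac{N}{\mathrm{df}(t)+1} \qquad
\mathrm{norm}(d) = \frac{1}{\sqrt{\lvert d \rvert}}
```

Here N is the total number of documents, df(t) is how many documents contain the term, and |d| is the number of terms in the field.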
00:40:56.200 | This will show you why something matches. Don't be confused by the -- or let me take that out for the
00:41:05.240 | first try. So, I'm looking for father. And I am -- no, I am your father. And Obi-Wan never told you what
00:41:12.600 | happened to your father. One is more relevant than the other. Why is the first one more relevant than
00:41:19.160 | the second one? Yeah. Term frequency is the same. Both contain father ones. The inverse document frequency
00:41:29.560 | is also the same because we are looking for the same term. The only difference is that the second one
00:41:34.840 | is longer than the first one. And that's why it's more relevant here. So, this is very simple. And you
00:41:44.360 | can then, if you're unsure why something is calculated in a specific way, you can add this explained true.
00:41:49.800 | And then it will tell you all the details of, like, okay, we have father. And it then calculates basically
00:41:57.880 | all the different pieces of the formula for you and shows you how it did the calculation. So, you can
00:42:01.960 | debug that if you need to. But it's probably a bit too much output for the everyday use case.
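(As a sketch, the explain flag just sits next to the query:)

```
GET user_myhandle/_search
{
  "explain": true,
  "query": {
    "match": { "quote": "father" }
  }
}
```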
00:42:07.640 | And then you can customize the score if you want to. Here I'm doing a random score. So,
00:42:16.280 | my two fathers -- this is a bit hard to show -- they will just be in random order because their score is
00:42:21.960 | here randomly assigned. But you could do this more intelligently that you combine, like, the score and,
00:42:29.240 | like, if you have, I don't know, the margin on the product that you sell or the rating that you
00:42:34.200 | include that in the rating somehow and you can build a custom score for things like that.
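(A sketch of the random scoring via the function_score query; in a real setup you would replace random_score with something like a field_value_factor on a rating field:)

```
GET user_myhandle/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "quote": "father" } },
      "random_score": {}
    }
  }
}
```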
00:42:38.120 | So, you can influence that any way you want. One thing that I see every now and then that is a
00:42:46.680 | very bad idea and we'll skip this one because it's probably a bit too much. This one, by the way,
00:42:52.040 | is the total formula that you can do or maybe I'll show you the parts that I skipped. What happens if
00:42:58.200 | you search for two terms and they're not the same, they don't have the same relevancy? So, what the
00:43:03.560 | calculation behind the scenes basically looks like is let's say we search for father. Father is very rare,
00:43:11.080 | that's why it's much more relevant than your. Your is pretty common. And then we have a document that
00:43:16.040 | contains your father. It's kind of like this axis. This will be the best match. But will a document
00:43:22.680 | that only contains father be more relevant or only your? Intuitively, the one with just father will
00:43:29.800 | be more relevant. But how does it calculate that? It basically calculates like this is the relevancy of
00:43:35.880 | father. This is the ideal document and this is your. And then it looks like which one has the shorter
00:43:41.880 | angle. And this is the one that is more relevant. So, if you have a multi-term search, you can figure
00:43:49.000 | out which term is more relevant and how they are combined. And then you can also have the coordination
00:43:54.040 | factor which basically rewards documents containing more of the terms that you're searching for. So,
00:43:59.080 | if I'm searching for three terms like I am father, whatever. If a document contains all three, this
00:44:09.320 | will be the formula that combines the scores of all three and multiplies it by three divided by three.
00:44:14.920 | If it only contains two of them, it would only have the relevancy of 2/3 and with one 1/3. And then you
00:44:21.080 | put it all together and this is the formula that happens behind the scenes and you don't have to do that
00:44:25.800 | in your head, luckily. Cool. We have seen these. One thing that we see every now and then is that
00:44:34.600 | people try to translate the score into percentages. Like you say, this is a 100% score and this is only like
00:44:44.120 | a 50% match. Who wants to do that? Hopefully nobody, because the Lucene documentation is pretty
00:44:52.680 | explicit about that. You should not think about the problem in that way, because it doesn't work.
00:44:59.480 | And I'll show you why it doesn't work or how this breaks. Let's take another example.
00:45:05.560 | Let's say we take this short text. These are my father's machines. I couldn't think of a good Star Wars quote to
00:45:18.280 | use here, but bear with me. So what remains if I run this through my analyzer? My father machine.
00:45:23.720 | These are the three tokens that remain. Now, I will store that. You remember the three tokens that we have
00:45:31.160 | stored. And if I search for my father machine, you might be inclined to say this is the perfect score.
00:45:43.960 | This is like 100%. Agreed? Because all the three tokens that I have stored in these are my father's
00:45:51.400 | machines are there. So this must be like my perfect match. So it's 3.2, that would be 100%.
00:45:57.080 | The problem now is every time you add or remove a document, the statistics will change and your score
00:46:03.400 | will change. So if I delete that document and I search the same thing again, I don't know what percentage
00:46:10.520 | this is now. Is this now the new 100% the best document or is this a zero point or, I don't know,
00:46:15.800 | 20%? How does this compare? And then you can play funny tricks where these droids are my father's
00:46:24.840 | father's machines. And you can see I have a term frequency of 2 for father here. So if I store that one
00:46:32.200 | then and then search it, is this now 100%, is this now 110%? So don't try to translate scores into
00:46:44.200 | percentages. They're only relevant within one query. They're also not comparable across queries. They're
00:46:50.200 | really just sorting within one query to do that. Okay. Let me get rid of this one again.
00:46:57.400 | Now, we've seen the limitations of keyword search. We don't want to define our synonyms. We might want
00:47:06.840 | to extract a bit more meaning. So we'll do some simple examples to extend. I will add, from OpenAI,
00:47:17.400 | a text embedding model, text-embedding-3-small. I'm basically connecting that inference API for text embeddings
00:47:24.520 | here in my instance. I have removed the API key. You will need to use your own API key if you want to
00:47:30.760 | use it. But it is already configured. So let me pull up the inference services that we have here. I have
00:47:38.280 | done -- or I have added two different models. One sparse, one dense. Let's go to these. By the way,
00:47:49.000 | if you try to do this with a 100% score, don't do this. Because it will just not work. Okay.
00:47:58.520 | Not everybody has worked with dense vectors, right? So I have a couple of graphics coming back to our
00:48:05.480 | Star Wars theme, just to look at how that works. So what you do with dense vectors is we keep this
00:48:12.840 | very simple. This one just has a single dimension. And it has, like, the axis is pretty much like
00:48:20.280 | realistic Star Wars characters and cartoonish Star Wars characters. And this one falls on the realistic
00:48:25.560 | side and that other one is just cartoonish. And you have a model behind the scenes that can rate those
00:48:31.720 | images and figure out where they fall. Now, in reality, you will have more dimensions than one.
00:48:38.760 | And you will also have floating point precision. So it's not just, like, minus one, zero, or one.
00:48:45.400 | But you will have more dimensions. So, for example, here, in human and machine, and in a realistic model,
00:48:55.000 | you don't have -- the dimensions are not labeled as nicely and clearly understandable. The machine has
00:48:59.800 | learned what they represent. But they're not representing an actual thing that you can extract like that.
00:49:04.680 | But in our simple example here, now, we can say this layer character is realistic and a human versus,
00:49:14.200 | I don't know, the Darth Vader is cartoonish and, I don't know, somewhere between human and machine.
00:49:23.080 | So this is the representation in the vector space. And then you could have, like I said,
00:49:27.080 | you could have floating point values and then you can have different characters. And similar characters,
00:49:32.840 | like, both of those are human. Without the hand, he's only, like, not quite as human anymore,
00:49:38.200 | so he's a bit lower down here. So he's a bit closer to the machines. So you can have all of your entities in
00:49:47.960 | this vector space. And then if you search for something, you could figure out, like,
00:49:51.880 | which characters are the closest to this one. And again, in reality, you will have hundreds of
00:49:57.880 | dimensions. It will be much harder to say, like, these are the explicit things and this is why it
00:50:03.320 | works like that. It will depend on how good your model is in interpreting your data and extracting the
00:50:09.960 | right meaning from it. But that is the general idea of dense vector representation. You have your documents
00:50:18.200 | or sometimes it's like chunks of documents that are represented in this vector space and then you
00:50:24.360 | try to find something that is close to it for that. Does that make sense for everybody or any specific
00:50:31.800 | questions? So it's a bit more opaque, I want to say. It's not quite as easy because you say, like,
00:50:39.720 | these five characters match these other five characters here. But you need to trust or evaluate
00:50:45.880 | that you have the right model to figure out how these things connect. So let's see how that looks like.
00:50:56.280 | I have one dense vector model down here. We have OpenAI embedding. This one is a very small model. It
00:51:05.800 | only has 128 dimensions. The results will not be great, but it's actually for demonstrating it actually
00:51:14.760 | helpful. So we'll see that. The other model that we have, and let me show you the output of that. So if I
00:51:20.760 | take my text, these are not the droids you are looking for, this is the representation. It's basically
00:51:26.840 | an array of floating point values that will be stored and then you just look for similar floating point
00:51:32.600 | values. And then you have these are not the droids you are looking for. Here on the previous one,
00:51:37.320 | dense text embedding. This one here does sparse embedding. The main model used for that
00:51:47.720 | is called SPLADE. Our variant of SPLADE, we call it ELSER. It's kind of like a slightly improved SPLADE,
00:51:56.280 | but the concept is still the same. What you get is, you take your words, and this is not just a TF-IDF.
00:52:05.960 | This is a learned representation where I take all of my tokens and then expand them and say, like, for this text,
00:52:14.680 | these are all the tokens that I think are relevant. And this number here tells me how relevant they are.
00:52:20.920 | Again, not all of these make sense intuitively. And you might get some funky results, for example,
00:52:29.720 | with foreign languages. This currently only supports English. But these are all the terms that we have
00:52:37.240 | extracted. Normally, yeah, you get, like, 100-something or so. So, the idea is that this text is represented
00:52:46.920 | by all of these tokens. And the higher the score here, the more important it is. And what you will do is,
00:52:52.920 | you store that behind the scenes. When you search for something, you will generate a similar list,
00:52:58.680 | and then you look for the ones that have an overlap, and you basically multiply the scores together, and the
00:53:04.120 | ones with the highest values will then find the most relevant document. This is insofar interesting or
00:53:12.040 | nice because it's a bit easier to interpret. It's not just, like, long array of floating point values.
00:53:17.880 | Sometimes these don't make sense. The main downside of this, though, is that it gets pretty expensive at
00:53:25.640 | query time. Because you store a ton of different tokens here for this. When you retrieve it, the search query will
00:53:33.560 | generate a similar long list of terms. And if you have a large enough text body, a query might hit a very
00:53:41.960 | large percentage of your entire stored documents with these OR matches. Because, basically, these are just a lot of ORs that you combine, calculate the score, and then return the most or the highest ranking results.
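(A sketch of asking the sparse endpoint for its expansion, assuming an ELSER endpoint registered as my-elser; the response maps tokens to weights:)

```
POST _inference/sparse_embedding/my-elser
{
  "input": "These are not the droids you are looking for"
}
```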
00:53:54.840 | So, it's an interesting approach. It didn't gain as much traction as dense vector models, but it can be, as a first step or an easy and interpretable step, it can be a good starting point to dive into the details here.
00:54:08.280 | So, "these are not the droids you're looking for" is basically represented by this embedding here.
00:54:22.520 | So, it's like this entire list of terms with this, yeah, with this relevancy, basically. This is the representation of this string. And then, when I search for something, I will generate the same
00:54:36.280 | list and then I basically try to match the two together. Like for what has the most or the highest matches here.
00:54:44.920 | Make sense?
00:54:48.280 | Yes, we'll do that in a second.
00:54:57.720 | I will create a new index. This one keeps the configuration from before, but I'm adding this semantic text for the sparse model and the dense model.
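(A sketch of that index; the semantic_text fields reference the inference endpoints, copy_to feeds them from the original quote field, and the analyzer settings from before are omitted for brevity. All names are placeholders:)

```
PUT user_myhandle-semantic
{
  "mappings": {
    "properties": {
      "quote": {
        "type": "text",
        "copy_to": ["quote_sparse", "quote_dense"]
      },
      "quote_sparse": {
        "type": "semantic_text",
        "inference_id": "my-elser"
      },
      "quote_dense": {
        "type": "semantic_text",
        "inference_id": "my-openai-embedding"
      }
    }
  }
}
```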
00:55:21.000 | So, I've created this one. And now I'll just put three documents. I have my other index.
00:55:26.600 | As you can see here, it says three documents were moved over.
00:55:30.600 | So, we can then start searching here. And if I look at that, the first document is still,
00:55:36.680 | these are not the droids you're looking for. You don't see, like for the, for a keyword search,
00:55:41.480 | you don't see the extracted tokens here. We also don't show you the dense vector representation or the
00:55:47.080 | sparse vector representation. Those are just stored behind the scenes for querying, but there's no real
00:55:53.080 | point in retrieving them because you're not going to do anything with that huge array of dense vectors.
00:55:58.040 | It will just slow down your searches. You can look at the mapping and you can see I'm basically copying my
00:56:06.920 | existing quote field to these other two that I can also search those.
00:56:10.920 | Okay. So, if I look for machine on my original quote, will it find anything?
00:56:21.080 | No, because it only had -- these are not the droids you're looking for. And this is still the keyword search.
00:56:32.040 | It doesn't work, shouldn't work. That's exactly the result that we want out of this here. Now,
00:56:48.040 | if I say answer and I say machine, then it will match here. These are not the droids you're looking for.
00:56:55.400 | And you can see this one matches pretty well, I don't know, at 0.9. But it also has some overlap,
00:57:01.800 | with no I am your father. I mean, it is much lower in terms of relevance. But something had an overlap
00:57:08.360 | here. And only the third document, Obi-Wan never told you what happened to your father. Only that one
00:57:15.320 | is not in our result list at all. But there was something here. I don't know the expansion. We would
00:57:21.000 | need to basically run -- where was it? We would need to run this one here for all the strings and look
00:57:30.840 | then for the expansion of the query. And then there would be some overlap, and that's how we retrieve
00:57:34.680 | that one. Is that a threshold that you have?
00:57:39.400 | You could define a threshold. It will, though, depend -- let's see.
00:57:46.040 | This is not the droids you're looking for. Let's say if I -- if I say -- I'm not sure if this will change anything.
00:57:56.840 | I mean, the relevance here is still -- it's still 10x or so. But, yeah, this one still -- we'll just have a
00:58:12.440 | very low -- it's still -- terms you look for. The score just totally jumps around. It's a bit hard to define the threshold.
00:58:19.320 | Because here you can see, in my previous query, we might have said 0.2 is the cutoff point. But now it's
00:58:28.760 | actually 0.4, even though it's not super relevant. So it might be a bit tricky, or you might need to
00:58:34.920 | have a more dynamic threshold depending on how many terms you're looking for and what is a relevant result.
00:58:40.440 | In the bigger picture, the assumption would be if you have hundreds of thousands or even millions of
00:58:46.920 | documents, you will probably not have the problem that anything that is so remotely connected will
00:58:53.320 | actually be in the top 10 or 20 or whatever list that you want to retrieve. So for larger proper data
00:59:00.360 | sets, this should be less of an issue. With my hello world example of three documents, it can be a bit
00:59:06.520 | misleading. But, yes, you can have a cutoff point if you figure out what for your data set and your
00:59:11.160 | queries is a good cutoff point. You could define the cutoff point.
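(If you do want one, a sketch with min_score on the search request, paired with the semantic query; the 0.4 here is just an example value, not a recommendation:)

```
GET user_myhandle-semantic/_search
{
  "min_score": 0.4,
  "query": {
    "semantic": {
      "field": "quote_sparse",
      "query": "machine"
    }
  }
}
```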
00:59:14.360 | No, sorry, you have three documents. How come it's only showing two? Is it because of --
00:59:18.280 | So the query gets expanded into, I don't know, those 100 tokens or whatever. And then for those two,
00:59:25.640 | there is some overlap, but the third one just didn't have any overlap. But I -- so we -- okay, we can do that.
00:59:33.400 | It's just a bit tricky to figure out that the term that has the overlap. So we will need to take this
00:59:40.280 | one, machine -- no, I am your father. Let's take this one. What you need to do is to figure that one
00:59:48.760 | out. I don't know, actually, we should be able -- let me see. Let's see.
01:00:09.320 | This is a pretty long output. Somewhere I was actually hoping that it would show me the term that has
01:00:20.040 | matched here. Okay. I see something -- okay, there is something puppet that seems to be the overlap.
01:00:28.360 | How much sense that term expansion for the stored text and the query text makes is a bit of a different
01:00:35.880 | discussion. But in here with that explained true, you can actually see how it matched and what happened
01:00:41.880 | behind the scenes. If you have any really hard or weird queries or something that is hard to explain,
01:00:46.760 | to debug that. But the third one didn't match. Now, if I take the dense vector model with OpenAI and I
01:00:54.280 | Now, if I take the dense vector model with OpenAI and I search for machine, how many results do you expect to get back from this one? 0, 1, 2, 3? Yes, 3. Why 3?
01:01:08.120 | Yes, because there's always some match. That is the other -- or let me run the query first.
01:01:19.240 | These are not the droids you're looking for. This one is the first one. I don't think that this model
01:01:23.880 | is generally great because here the results are super close. It is -- I mean, the droids with the
01:01:29.320 | machines, that is the first one. But the score is super close to the second one, which is no, I am
01:01:34.520 | your father, which feels pretty unrelated. And Obi-Wan never told you what happened to your father. Even
01:01:39.320 | that one is still with a reasonably close score. But why do we have those? Because if we ask
01:01:49.080 | what the relevance is -- I mean, it's further away, but there's always kind of like
01:01:55.960 | some angle to it, depending on the similarity calculation that
01:02:01.800 | you do, so it's still always somewhat related. There is no easy way to say something is totally unrelated.
01:02:07.080 | That is, by the way, one good thing about keyword search where it was relatively easy to have a cutoff
01:02:14.520 | point of things that are totally not relevant, where you're not going to confuse your users.
01:02:18.600 | Whereas here, if you don't have great matches, you might get almost -- it's not random, but what you
01:02:25.000 | return potentially looks very unrelated to your end users,
01:02:29.240 | just because it's very hard to draw that line. Yes?
01:02:31.000 | Is it fair to say, then, that the OpenAI embedding search is worse for this kind of toy example,
01:02:38.200 | because the magnitude of difference is --
01:02:39.880 | I'm careful with worse, because it's really a hello world example, so I don't take this as a
01:02:47.080 | quality measurement in any way. I -- yeah, I mean, the OpenAI model with 128 dimensions is very few
01:02:54.520 | dimensions. I think it will probably be cheap, but not give you great results necessarily. But don't use
01:02:59.880 | this as a benchmark. I think it's just a good way to see that this is now much harder, because now you
01:03:07.400 | need to pick the right machine learning model to actually figure out what is a good match. With
01:03:12.760 | keyword-based search, it was a bit of a different story. There you need to pay more attention to,
01:03:16.680 | like, how do I tokenize, and do I have the right language, and do I do stemming or not stemming.
01:03:22.120 | But most of that work is relatively, I want to say, almost algorithmic, and then you can figure that
01:03:28.120 | out, and you configure it, and then it's very predictable at query time. Whereas with the dense
01:03:33.640 | vector representation, you really need to evaluate for the queries that you run and the data that you
01:03:39.320 | have, like, is that relevant, and is this an improvement or not? It's very easy to get going and
01:03:45.320 | just throw a dense vector model together, and you will match -- you will always match something
01:03:51.080 | that might be an advantage over the lexical search where you don't have any matches, which
01:03:55.160 | sometimes is the other problem that nothing comes back and you would want to have at least some
01:03:59.480 | results. Here it might just be unrelated. So that can be tricky. That you want to have some results is,
01:04:09.160 | by the way, a funny story that a European e-commerce store once told me. They said they accidentally
01:04:15.880 | deleted, I think, two-thirds of their data that they had for the products that you could buy.
01:04:21.320 | And then I asked them, like, okay, so how much revenue did you lose because of that? And they
01:04:26.120 | said, basically nothing, because as long as you showed some somewhat relevant results quickly enough,
01:04:32.920 | people would still buy that. So only if you have no results, that's probably the worst.
01:04:37.080 | So for an e-commerce store, you might want to show stuff a bit further out, because people might still
01:04:42.840 | buy it. But it really depends on the -- I'm coming to you in a moment -- it really depends on your use
01:04:47.960 | case. E-commerce is kind of like one extreme where you want to show always something for people to buy.
01:04:53.800 | If you have a database of legal cases or something like that, you probably don't want that approach,
01:05:00.040 | because that will go horribly wrong. So it is very domain specific. That's, I think,
01:05:06.200 | also the good thing about search, because it keeps a lot of people employed, because it's not an easy
01:05:10.280 | problem. It's almost job security, because it depends so much on: this is the data that you have,
01:05:16.840 | and this is the query that people run, and this is the expectation of what will happen, and this
01:05:20.680 | is, for this domain, the right behavior. So there's no easy right or wrong with a checkbox. And the
01:05:27.720 | other thing is you might make -- if you tune it, you might make it better for one case, but worse for 20
01:05:32.440 | others. That's why a robust evaluation set is normally very important, though very rare. A lot of people
01:05:39.960 | YOLO it, and you will see that in the results. And for the e-commerce store, it probably works well enough.
01:05:44.520 | Sorry, you had a question. Can I limit the semantic enrichment to a subset of my index based off
01:05:51.320 | the properties of the document? So if I have a very large shared index with a lot of customers,
01:05:56.280 | and I want to enable AI for a subset of the index, can I say, hey, only do the semantic enrichment if the
01:06:03.080 | document has this property where maybe it's like an AI customer? Yeah, so the way we would do it in
01:06:09.800 | our product is that you would probably have two different indices with different mappings.
01:06:15.560 | Yeah, but then it's not so fun, like the customer upgrades and I have to migrate them to the new
01:06:21.160 | index. Dave, please. Yeah, so if you, for example, have an index in Elasticsearch, you can think of it
01:06:31.240 | almost like a sparse table, right? So there's no penalty for having a field that is not populated.
01:06:36.920 | So either in your application or an ingest processor, you could have an inference statement and say,
01:06:42.200 | Yeah, that's how we do it now. But we'd only move it over
01:06:45.000 | with this automatic way where you kind of turn it off.
01:06:47.800 | No, the problem is the data structure -- like, whether the field is there. So the data structure that we build
01:06:52.440 | in the background is called HNSW. And either we build that data structure or we don't build it.
01:06:57.560 | Yeah, so if you had, you know, 10 billion entries in your vector index, your index is set up for vectors,
01:07:07.560 | right? And you just don't populate the thing that is either putting in a dense vector or triggering the
01:07:14.920 | inference to create a dense vector to put into there, then it's just going to be a, you know,
01:07:20.120 | a bunch of -- the index is just a bunch of pointers, and none of them head towards the HNSW, and it won't
01:07:24.600 | show up in search results. The penalty is nothing, right? But you're going to have
01:07:30.120 | to manage what does or does not create the vector. You could do that in an ingest processor by just
01:07:36.680 | saying, Hey, we're going to use the copy command to have two copies of the text, one that's meant for
01:07:42.120 | non-vector indexing, one that's meant for actual vector indexing. You'd have to manage that with some
01:07:46.920 | tricky, complex AI technology called if-then-else, right? Somewhere inside of your ingesting pipeline,
01:07:54.120 | then it would work just fine. Yeah.
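A minimal sketch of that if-then-else as an ingest pipeline, with the pipeline name, condition field, endpoint id, and field names all as hypothetical placeholders; every ingest processor accepts an if condition, so the inference processor only runs for documents that opt in:

    PUT _ingest/pipeline/maybe-embed
    {
      "processors": [
        {
          "inference": {
            "if": "ctx.ai_enabled == true",
            "model_id": "my-embedding-endpoint",
            "input_output": [
              { "input_field": "body", "output_field": "body_embedding" }
            ]
          }
        }
      ]
    }

Documents without ai_enabled set to true skip the processor entirely, so no vector is created and nothing is added to the HNSW structure for them.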
01:08:02.440 | One more question. When we did HNSW last week, we found it was extremely slow at write times, and the community suggested that we freeze our index
01:08:08.920 | if we were going to use HNSW. Force merge or? Yeah, I think just freeze writes. They said,
01:08:17.160 | they said build the index and freeze it, otherwise you'll put a ton of load on the computer. I mean, yes.
01:08:21.960 | What we found is that some of the defaults -- settings that have kind of been around in Elasticsearch for 10
01:08:26.840 | years, like the merge scheduler -- really optimize for keyword search, and for high-update
01:08:33.400 | workloads on HNSW, we've got some suggestions. They take a little bit of parameter tuning to go
01:08:39.960 | and find something right for your IOPS and for your actual update workload. So sometimes it's about the
01:08:44.920 | merge scheduler and not doing kind of an inefficient HNSW build when it's not important for the use case.
01:08:51.480 | Okay. The other thing I'd say is that sometimes friends don't let friends run Elasticsearch 8.11:
01:08:58.920 | upgrade, upgrade, upgrade. They put a lot of optimization work in here. It should be simpler.
01:09:03.240 | That's great.
01:09:06.120 | The reason why that is, it's the merging -- because you have the immutable segment structure in
01:09:14.040 | Elasticsearch. And HNSW you cannot easily merge; you basically need to rebuild it. The one trick -- I forgot which
01:09:21.320 | version it was. I'm not sure, Dave, if you remember. I think it was even before 8.11. But basically,
01:09:26.120 | if we do a merge, we would take the largest segment with the non-deleted documents and basically plop the
01:09:31.880 | new documents on top of it rather than starting from scratch from two HNSW data structures. There's
01:09:37.560 | another optimization somewhere now in 9.0 that will make that a lot faster. So it really depends on
01:09:44.040 | the version that you have. And there are a couple of tricks that you can play. But yeah,
01:09:48.760 | that is one of the downsides of the way immutable segments work and HNSW is built:
01:09:54.840 | you can't merge it as easily together as other data structures, because you really
01:09:59.480 | need to rebuild the HNSW data structure, or take the largest one and then plop the other one in.
01:10:04.520 | Okay. Some of these things we just, like -- we fixed it in the next version of Lucene, so I want you to --
01:10:09.320 | We found, like, HNSW was too slow, and then it broke our CPU, and then we moved off it,
01:10:15.960 | and now we're .
01:10:17.320 | Stay tuned. Yeah.
01:10:18.680 | Yeah. Might have been a while ago. Yeah?
01:10:21.640 | So for like traditional document search, you know what I'm saying, like, hey,
01:10:25.960 | please find me a document that contains my search query, right? For R in the context of RAG,
01:10:32.200 | it might be something more like, hey, come up with a fun plan for my weekend, right? And then the
01:10:37.480 | documents that we want to find don't necessarily look like the search query, right? Yeah.
01:10:41.480 | So like one approach to that is you just give -- it's an agent and you give it a search tool,
01:10:46.840 | and it searches, right? So like I'm just curious what you -- how do you think about that in general?
01:10:51.160 | Yeah. I feel like RAG has been very heavily abused. Or, like, the mental model I
01:10:56.280 | think started off as: you do retrieval and then you do the generation. But you could do the
01:11:00.360 | generation earlier on as well, where you do query rewriting and expansion. So my favorite
01:11:07.560 | example for that is you're looking for a recipe. You don't need to have the LLM regenerate the recipe.
01:11:16.200 | You just want to find the recipe. But maybe you have a scenario where you forgot what the thing is
01:11:20.600 | called that you want to cook. And then you could use the LLM, for example, to tell you what you're
01:11:24.600 | looking for. Like you say, like, oh, I'm looking for this Italian dish that has like these layers of
01:11:31.640 | pasta and then some meat in between. And then the LLM says, oh, you're looking for lasagna. And then
01:11:36.280 | you basically do the generation first or a query rewriting and then search and then get the results.
01:11:42.440 | as a very explicit example here. Your example would look very different and probably smarter than my
01:11:49.480 | example. But query rewriting is one thing. There's also this concept of HyDE, where your documents and
01:11:58.840 | your queries often look very different, and you use an LLM to generate something from the query that
01:12:04.200 | looks closer to the documents that you have. And then you match the documents together because they're
01:12:09.560 | more similar in structure. So there are all kinds of interesting things that you can do. Like I said
01:12:15.640 | earlier, "it depends" is becoming a bigger and bigger factor. But, yeah, your use case
01:12:21.080 | might be, yeah, maybe a multi-step retrieval where you figure out -- like, I don't know,
01:12:29.080 | I know the example from an e-commerce store where it's like, I'm going to a theme party from the 1920s,
01:12:36.520 | give me some suggestions. And then the LLM will need to figure out, like, what am I searching for?
01:12:40.680 | And then it can retrieve the right items and rewrite the query and then actually give you proper
01:12:44.920 | suggestions. But it's not just running a query anymore. Yeah?
01:12:50.600 | Yeah?
01:12:51.100 | We use instruction-tuned embedding models.
01:12:54.100 | And it's kind of like .
01:12:59.000 | Along with your query, you can say, like, this is the kind of thing I am doing.
01:13:04.000 | And you can say .
01:13:07.000 | Like, you can have separate document and query embeddings.
01:13:13.500 | We try to embed queries so they look like the documents we're going to find, like the text.
01:13:18.500 | You can have an instruction when you embed, so that instead of saying, like, I'm actually
01:13:23.340 | creating documents, you have to .
01:13:26.400 | And then, I don't know, why do you think you have a problem?
01:13:30.400 | Yeah?
01:13:37.900 | How should we be thinking about the number of dimensions in the embedding model?
01:13:41.400 | Is, like, a 512-dimensional model necessarily better than a 128?
01:13:46.400 | Definitely not necessarily.
01:13:48.400 | Yeah.
01:13:49.400 | It's an interesting question.
01:13:51.300 | That feels almost like a blast from the past.
01:13:53.240 | I remember, like, two or three years ago, there was this big debate of, like, how many dimensions
01:13:57.800 | does each data store support and, like, how many dimensions should you have?
01:14:02.000 | And at first, it looked like, oh, more dimensions is always better.
01:14:04.300 | But then it turned out more dimensions are very expensive.
01:14:07.240 | So, it really depends on the model and what you're trying to solve.
01:14:10.300 | Like, if you can get away with fewer dimensions, it's potentially much cheaper and faster.
01:14:15.240 | But I don't think there is a hard rule, like, maybe the model with more dimensions can express
01:14:21.840 | more, because it just has more capacity, and then that will come in handy.
01:14:26.580 | But maybe it's not necessary for a specific use case and then you're just wasting a lot
01:14:29.740 | of resources.
01:14:30.740 | I don't think there is an easy answer to say, like, yes, for this use case, you need at least
01:14:36.220 | 4,000 dimensions.
01:14:39.660 | It will depend.
01:14:40.660 | But it depends on the model, how many dimensions it will output, and then maybe you have some
01:14:44.880 | quantization in the background to reduce that again or reduce either the number of dimensions
01:14:49.200 | or the fidelity per dimension.
01:14:52.940 | So there are a lot of different tradeoffs in that performance consideration.
01:14:55.980 | But it will mostly rely on, like, how good does the model work for the use case that you're
01:15:01.260 | trying to do?
01:15:22.260 | Yeah.
01:15:23.260 | So that is one area.
01:15:27.620 | So I want to say historically what you would do is you would have a golden data set and then
01:15:33.760 | you would know what people are searching for and then you would have human experts who rate
01:15:37.260 | your queries.
01:15:38.380 | And then you run different queries against it and then you see, like, is it getting better
01:15:42.080 | or is it getting worse?
01:15:44.500 | Now LLMs open a new opportunity where you might have human experts in the loop to help them out
01:15:50.360 | a bit, but they might be actually good at evaluating the results.
01:15:55.200 | So, almost nobody has, like, the golden data set to test against.
01:16:00.420 | But you can either look at the behavior of your end users and try to infer something
01:16:05.600 | from that, or you have an LLM that evaluates what you have, or you have a human together
01:16:12.060 | with an LLM evaluate the results.
01:16:14.900 | So you have various tools, but, again, it depends -- it's really not an easy question
01:16:22.780 | of saying, like, this is the right thing.
01:16:25.040 | Maybe you can get away with something simple.
01:16:26.680 | So the classic approach I want to say is, like, you looked at the clickstream of how your users
01:16:31.520 | behaved, and then you saw, like, if they clicked on the first or up to the third result,
01:16:36.200 | the result was potentially good -- if they didn't just go back and then click on something else,
01:16:39.920 | but stuck on the page.
01:16:42.220 | If they don't click on anything and just leave, it might be very bad.
01:16:45.680 | If they go to the second or third page, it might also not be great.
01:16:48.720 | So there are some quality signals that you can infer from that or you really look into
01:16:52.540 | the quality aspect and try to evaluate, like, what people were doing and how it behaves.
01:16:58.420 | But you can make this from relatively simple to pretty complicated.
01:17:14.100 | What else?
01:17:17.100 | Obviously, if I search with the sparse query expansion, it will find my father example.
01:17:27.720 | And this one here will still, again, match my droids, pretty much like the OpenAI example.
01:17:36.720 | One thing that I wanted to show you is what is also happening behind the scenes here. This is
01:17:41.100 | a very long segment; like, it's a lot of information with different speakers.
01:17:47.600 | What I have created here, though -- we have created multiple chunks behind the scenes.
01:17:54.640 | And if I search for that, I think looking for murder in the Skywalker saga works pretty well
01:18:02.820 | here.
01:18:03.820 | It finds the document that I have retrieved, but it can also highlight -- so here I say,
01:18:09.760 | show me the fragment that actually matched best here.
01:18:13.760 | And if I look for the literal word murder here, it doesn't appear anywhere.
01:18:18.700 | But in this highlighted segment here, the term that it found was kill, and
01:18:24.800 | it was that one that was expanded here.
01:18:28.380 | So here I have broken up my long text field into multiple chunks and there are multiple
01:18:33.320 | strategies.
01:18:34.320 | You can do that by page, by paragraph, by sentence.
01:18:39.120 | You could do it overlapping or not overlapping.
01:18:43.280 | Which strategy is best will depend on how you want to retrieve and what works best for your use case.
01:18:48.540 | But you want to kind of like reduce the context per element that you're matching because there's
01:18:53.140 | only so much context that a dense vector representation can hold.
01:18:57.400 | So you want to chunk that up. Especially if you have, like, a full book, you want to break
01:19:01.120 | it up into at least individual pages.
01:19:04.120 | And then find the relevant part where the match is.
01:19:06.960 | And then you can actually link back to that.
01:19:09.720 | The point in this query here is also to show you, I didn't define any chunks.
01:19:14.740 | I didn't say like, okay, send this representation of a dense vector there and then when it comes
01:19:20.640 | back, interpret again.
01:19:22.900 | This is all happening behind the scenes just to make this easier.
01:19:25.500 | So the entire behavior here is still very similar to the keyword matching even though there's
01:19:30.220 | a lot more magic happening behind the scenes.
01:19:34.220 | Just to keep that very simple.
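The behavior in this demo corresponds to the semantic_text field type in recent Elasticsearch versions; a minimal sketch, with the index name, field name, and inference endpoint as hypothetical placeholders (if you omit inference_id, a preconfigured default endpoint is used):

    PUT /transcripts
    {
      "mappings": {
        "properties": {
          "content": {
            "type": "semantic_text",
            "inference_id": "my-elser-endpoint"
          }
        }
      }
    }

    GET /transcripts/_search
    {
      "query": {
        "semantic": { "field": "content", "query": "murder in the Skywalker saga" }
      },
      "highlight": {
        "fields": { "content": { "number_of_fragments": 1 } }
      }
    }

Chunking at index time and picking the best-matching fragment for highlighting at query time both happen behind the scenes, which is the simplicity being described here.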
01:19:37.220 | So let's see.
01:19:38.940 | Okay.
01:19:39.940 | How does everybody feel about long adjacent queries?
01:19:45.940 | We'll see about alternatives and maybe we can make this a bit simpler again.
01:19:50.940 | But let me show you one more way of looking at it.
01:19:55.940 | We call them retrievers.
01:19:57.940 | They're a more powerful mechanism to actually combine different types of searches.
01:20:05.280 | Combining different types of searches -- let me get to my slides, actually.
01:20:08.660 | When we talk about combining searches and how this all plays together.
01:20:12.660 | This is kind of my little interactive map of what you do when you do retrieval or what your
01:20:19.660 | searches do.
01:20:20.660 | We started here in the lexical keyword search and then we run the match query and we're matching
01:20:28.380 | these strings.
01:20:31.180 | This, often combined with some rank features, is what we call full text search.
01:20:37.820 | The rank features could be either you extract a specific signal or it could also be something,
01:20:42.380 | however you influence that ranking, it could be the margin on the product, how many people
01:20:47.900 | bought something, what the rating is.
01:20:49.900 | There are many different signals that you could include, not just with the match of the text,
01:20:55.820 | but any other signals that you want to combine for retrieving that.
01:20:59.580 | And then you have full text search as a whole.
01:21:02.460 | On top of that -- I kept it to the side here -- you might have a Boolean filter where you have a hard
01:21:10.620 | include or exclude of certain attributes. This does not contribute to the score; it is just
01:21:16.780 | black and white, included or excluded, whereas the rest here calculates a score for
01:21:22.940 | how well you match.
01:21:24.220 | And then this was kind of like the algorithmic side.
01:21:29.660 | And then we have the machine learning, the learned side, or semantic search, where you have a model
01:21:35.100 | behind the scenes, split into the dense vector embeddings and the sparse vector embeddings --
01:21:41.900 | vector search and learned sparse retrieval, I think those are the two common terms.
01:21:47.180 | And the interesting thing is, all of the others, including the learned sparse one, are sparse
01:21:56.860 | representations in the background, and only this one here is the dense vector representation.
01:22:03.180 | And then when you combine any grouping down here into one search, this is what we
01:22:13.420 | would call hybrid search. Even though there can be big discussions about what exactly hybrid search is
01:22:18.460 | or not, I will definitely stick to the definition that as soon as you combine more than one type of
01:22:23.900 | search -- it could be sparse and dense, or it could be dense and keyword, or maybe you combine two
01:22:30.140 | dense vector searches -- it is hybrid search, because you have multiple approaches. And then you can either
01:22:37.500 | boost them together, or you could do re-ranking, which is becoming more and more popular. One thing that we
01:22:43.020 | lean heavily into is RRF, which is reciprocal rank fusion, which doesn't rely on the score but on the
01:22:51.180 | position of each document in each search mechanism. So it basically says, like, the lexical search had this
01:22:57.340 | document at position four and the dense vector search had it at position two, and then it kind of evens out the
01:23:03.740 | positions and gives you an overall position by blending them together rather than looking at the individual
01:23:09.020 | scores, because they might be totally different. So this is kind of like the information retrieval
01:23:15.420 | map overall. And, okay, we didn't do a lot of filters, but I think filters are intuitively
01:23:21.180 | relatively clear that you just say like I'm only interested in users with this ID or whatever other
01:23:26.220 | criteria. It could be a geo-based filter like only things within 10 kilometers or only products that came
01:23:31.740 | out in the last year. Like a hard yes or no. All the others will give you a value for the relevance and then you
01:23:42.140 | can blend that potentially together to give you the overall results. That is kind of like the
01:23:47.180 | total map of search. Can you give an example of the signal one too?
01:23:54.300 | Yeah, for signal, so we have our own data structure for these rank features. It could be, for example,
01:24:01.180 | the rating of a book and then you combine the keyword match for, I don't know, you search for murder
01:24:15.340 | mysteries but then another feature would be how well they are ranked and then you would see that. Or it
01:24:22.700 | could be your margin on the product or the stock you have available and you would want to show the product
01:24:28.060 | where you have more in stock. Or it might even be a simple like a click stream like what have people
01:24:34.220 | clicked before. There are a lot of different signals that you could include in all of this searching then.
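A minimal sketch of such a signal using the rank_feature field type, with the index and field names as hypothetical placeholders; the rank_feature query in a should clause nudges the score without being required to match:

    PUT /books
    {
      "mappings": {
        "properties": {
          "title":  { "type": "text" },
          "rating": { "type": "rank_feature" }
        }
      }
    }

    GET /books/_search
    {
      "query": {
        "bool": {
          "must":   [ { "match": { "title": "murder mystery" } } ],
          "should": [ { "rank_feature": { "field": "rating" } } ]
        }
      }
    }

Margin, stock levels, or click counts would work the same way: index them as rank features and let them contribute to the blended score.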
01:24:39.420 | Any other questions? Are everybody good for now? Yeah?
01:24:47.900 | You would have to normalize them. Depending on the comparison that you do for dense vectors, it might be
01:25:13.500 | between 0 and 1. But you saw that for the keyword search, also depending on how many words I was
01:25:19.340 | searching for, it might be a much higher value. There is no real ceiling for that. Or you could add a
01:25:26.060 | boost and say, like, this field is 20 times more important than this other field. There is no real
01:25:32.220 | max value that you would have here. You could normalize the score and then basically say, like,
01:25:37.260 | I'll take the highest value in this sub query as 100% and then reduce everything down by that factor.
01:25:44.060 | And then I combine them. Maybe that works well. RRF is a very simple paper, I think it's like two pages.
01:25:51.020 | And it really just takes the different positions. I think it's one divided by 60 -- a constant
01:25:57.340 | they figured out made sense -- plus the position. And then you add those values for each
01:26:03.580 | document together, and that value gives you the overall position. It really just -- it doesn't look
01:26:10.620 | at the score anymore, but it blends the different positions together, like how they are interleaving
01:26:15.100 | and what should be first or second.
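Written out, the formula from the paper is, for a document $d$ over the set of result lists $R$:

    $$\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + \mathrm{rank}_r(d)}, \qquad k = 60$$

So the document from the example, at position 4 in the lexical list and position 2 in the dense list, gets 1/(60+4) + 1/(60+2), roughly 0.0156 + 0.0161 = 0.0318, and the final order simply sorts by that blended value.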
01:26:19.580 | Yeah?
01:26:20.580 | So, just for vector search, why should I use Elasticsearch over pgvector or something like that?
01:26:27.580 | I'm sure that .
01:26:30.580 | My data is already in the database.
01:26:34.580 | We sync it, probably via CDC, change data capture.
01:26:39.580 | So, there's one extra .
01:26:42.580 | Like, pgvector, like it's right there.
01:26:46.580 | I was just curious, what sort of systems, like, what have you seen in production?
01:26:59.580 | I mean, PG vector will always be there because, like, if you are already using Postgres, it's
01:27:04.580 | very easy to add.
01:27:05.580 | I think then the question is, like, does it have all the features that you need?
01:27:09.580 | For example, Postgres doesn't even do BM25.
01:27:13.580 | It has some matching, but it's not the full BM25 algorithm because I don't think it keeps
01:27:17.720 | all the statistics.
01:27:19.760 | It will be a question of, like, scaling out Postgres can be a problem and then just, like, the breadth
01:27:24.820 | of all the search features.
01:27:27.180 | If you only need vector search, I think my or our default question back to that is, like,
01:27:33.740 | do you really only need vector search?
01:27:36.920 | Maybe for your use case, but for many use cases, you probably need hybrid search.
01:27:41.540 | One area, for example, where vector search will not do great is, like, if somebody searches
01:27:46.380 | for, like, a brand.
01:27:48.720 | Because there is no easy representation in most models for the specific brand and it will be
01:27:53.740 | very hard to beat keyword search.
01:27:55.160 | So it will be very hard -- and also your users will be very angry when they know you have
01:27:59.300 | this word somewhere in your documents or in your data set, but you don't give them the result
01:28:03.500 | back.
01:28:04.500 | So there are many scenarios where you probably want hybrid search. I feel like -- we
01:28:10.580 | started two years ago with just vector search, but I feel like the overall
01:28:15.100 | trend is moving more to hybrid search, because you probably want some sort of keyword search and
01:28:21.260 | then you want to have that combined, probably with some model for the added benefit and extra
01:28:27.180 | context, but you often want the combination.
01:28:30.880 | It might also depend a bit on, like, the types of queries that your users run.
01:28:34.900 | So if your users run single-word queries, like I've done in my examples, that's often not
01:28:39.860 | really ideal for vector search, because, like any machine learning model,
01:28:44.440 | it lives off extra context.
01:28:48.080 | So depending on that, I've seen some people build searches where it's like if you search
01:28:52.740 | for one or two words, they do keyword search, but if you search for more, they might switch
01:28:56.620 | over to vector search.
01:28:57.620 | So it depends a bit on the context what works.
01:29:00.820 | If you really only need vector search and PG vector is small enough to do all of that, and
01:29:08.440 | Postgres is your primary data store, then that's probably where you will do well.
01:29:12.900 | But there are plenty of scenarios where not all of those boxes
01:29:16.800 | will necessarily be ticked.
01:29:17.800 | My last question.
01:29:18.800 | Specifically for code. Let's say you have a file, and it's like -- you've got a Git repository,
01:29:25.680 | there's like thousands of commits, right?
01:29:27.680 | And so you have two options: either you embed at a file level, or you embed at a chunk level,
01:29:34.140 | right?
01:29:35.140 | But I don't want to pay the penalty across thousands of unchanged -- like, the file hasn't changed for
01:29:38.040 | a thousand commits, but just for this one it has changed, right?
01:29:42.500 | So I want to cut any shadow copies of the same thing. And have you seen, like, what are some
01:29:48.500 | tips and tricks that people use to not have exploding storage costs? And, like, this might not be
01:29:58.960 | an Elastic problem, but a general vector problem: like, how do I just pay the penalty once of storing
01:30:13.420 | the same embedding, and only when it changes, I re-embed and then, uh, in that sense.
01:30:22.880 | But so, you would create, so it's one dataset basically with thousands of files that all are
01:30:28.920 | chunked together, and so one change would invalidate all of them, or --
01:30:32.880 | No, no.
01:30:33.880 | So, think: I have a repository, right?
01:30:34.880 | It has like 5,000 commits, and there's one file, . It didn't change for 4,999 commits, but on the 5,000th commit, it did change, right?
01:30:47.880 | And so, if it hasn't changed, I want to only point the file to the one existing embedding, right?
01:30:53.880 | But only when the inserted file contents change, I need -- I want to re-ingest, right?
01:31:00.880 | But for those 4,999 times, I don't want to store it again -- like, this hash has the same embedding, same embedding.
01:31:09.880 | Ah, so --
01:31:10.880 | I'm not sure if this is a problem.
01:31:11.880 | You have seen with some of the customers, but --
01:31:13.880 | Maybe so that --
01:31:15.880 | I think the way we might solve it is that if you create the hash of the file and use that
01:31:22.880 | as the ID, and you only use the operation create, and it would reject any duplicate writes,
01:31:29.880 | you would at least not ingest and then create the vector representation again.
01:31:34.880 | You will still send it over again, and it would need to get rejected.
01:31:39.880 | If a doc ID would have to be the hash of the file.
01:31:42.880 | If you have that doc ID, and then you need to set the operation to just create and not update
01:31:47.880 | or upset, then it would just be rejected and you would only write it once.
01:31:52.880 | I'm not sure if that is a great use case or if you might want to keep, like, I don't know,
01:31:57.880 | an outside cache of, like, all the hashes that you've already had and deduplicate it there,
01:32:01.880 | but that would be the elasticsearch solution of, like, using the hash as the ID and then just
01:32:05.880 | writing to that.
01:32:06.880 | Okay.
01:32:07.880 | And also with create on that, yeah.
01:32:08.880 | That is, I think, the intuitive or most native approach that we could offer for that.
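A minimal sketch of that pattern, with the index name and document body as hypothetical placeholders; the _create endpoint (the create op type) returns a 409 conflict instead of overwriting when the ID already exists:

    PUT /files/_create/<sha256-of-file-contents>
    {
      "path": "src/example.py",
      "content": "...file contents..."
    }

The first write for a given hash succeeds and triggers whatever embedding setup is attached to the index; every identical file afterwards is rejected, so the vector is only computed and stored once.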
01:32:19.880 | Yeah.
01:32:20.880 | I think there was some other question somewhere.
01:32:21.880 | Yeah.
01:32:22.880 | Yeah.
01:32:23.880 | I just wanted to add on to the Postgres question a minute ago.
01:32:26.880 | Postgres does have a native .
01:32:29.880 | It's pretty good.
01:32:30.880 | It's a .
01:32:33.880 | There has also been recently a lot of work.
01:32:36.880 | So there is a .
01:32:43.880 | But from what I remember, the default Postgres full-text search does not do full BM25,
01:32:53.880 | but it only does -- it doesn't have all the statistics, I think, from what I remember.
01:32:56.880 | Right.
01:32:57.880 | Yeah.
01:32:58.880 | Any other questions?
01:33:02.880 | Joe, please, go ahead.
01:33:05.880 | Do you plan to cover more on the receiver thing, like the receiver concept?
01:33:12.880 | To show now?
01:33:13.880 | Yeah.
01:33:14.880 | I mean, how much JSON do you want to see?
01:33:15.880 | Well, okay.
01:33:16.880 | I'm practically interested in -- I mean, you're modeling -- so you have multiple -- I love the
01:33:22.880 | retriever concept.
01:33:23.880 | It's a great concept.
01:33:24.880 | I think you have multiple different candidates.
01:33:27.880 | What kind of flexibility do you have on that?
01:33:30.880 | Because I'm not interested in using this.
01:33:36.880 | I'm interested in re-scoring them.
01:33:39.880 | Not necessarily looking at any, but I'm modeling the effect.
01:33:41.880 | You're just receiving a bunch of stuff.
01:33:44.880 | You're just retrieving a bunch of stuff.
01:33:45.880 | And then you're actually elevating your .
01:33:49.880 | I'm sure there's .
01:33:52.880 | Maybe for -- before we dive into that, for everybody else, like, re-scoring is like,
01:33:57.880 | let's say we have a million documents, and then we have one cheaper way of retrieving them,
01:34:01.880 | and we retrieve the top, I don't know, 1,000 candidates, and then we have a more expensive
01:34:06.880 | way, but higher quality way of actually re-scoring them, then we will run this more expensive re-scoring
01:34:12.880 | on just the top 1,000 to get our ultimate list of results.
01:34:18.880 | But the re-scoring algorithm would be too expensive to run across,
01:34:22.880 | like, a million documents; that's why you don't want to do that.
01:34:25.880 | That's why you have a two-step process.
01:34:27.880 | And that's why you might want to have the re-scoring.
01:34:30.880 | So, yes -- in Elasticsearch, you can now do re-scoring, because it becomes
01:34:35.880 | more and more popular.
01:34:37.880 | I don't have a full example there, but we do have a re-scoring model built
01:34:43.880 | in by default now, let me pull that up.
01:34:50.880 | So, we have currently the version 1 re-ranking, but we have a built-in re-ranking model now
01:34:57.880 | as well.
01:34:58.880 | So, for one of the tasks that we can do, you can see here we have the other tasks, for example,
01:35:05.880 | the dense text embedding, now we have a re-ranking task that you can also call.
01:35:11.880 | Your question?
01:35:12.880 | How do you express that?
01:35:14.880 | Okay.
01:35:15.880 | It might be too easy.
01:35:26.880 | No, re-ranking is good.
01:35:28.880 | Let me -- somehow my keyboard binding is broken.
01:35:33.880 | This is very annoying.
01:35:37.880 | Okay.
01:35:40.880 | We re-rank results.
01:35:42.880 | Let me see.
01:35:43.880 | Somewhere here there should be -- so, there's learning to rank, but it should not be the only one.
01:35:54.880 | This is what we want.
01:35:55.880 | Okay.
01:35:56.880 | We have our re-ranking model.
01:36:01.880 | Unless, Dave, you know from the top of your head where we have the right docs for this.
01:36:11.880 | The organization of our docs can't help you there -- but retrievers --
01:36:11.880 | Yeah, retrievers could find them --
01:36:13.880 | Starting in 8.16 or 8.17, the retrievers API added a specific parent level retriever that
01:36:21.880 | you -- it's like nested around outside the retrievers that go inside this.
01:36:26.880 | And it's specifically called the text re-ranking retriever.
01:36:31.880 | And so if you have a cross encoder, say from Hugging Face, or you're using the Elastic re-ranker --
01:36:36.880 | using one of these things that complies with that kind of inference task of taking a bunch of things and re-ranking them against the query, right?
01:36:46.880 | Taking the full token stream, right?
01:36:48.880 | The full token stream with the context, and doing that.
01:36:51.880 | So you can target a parent level text field of the documents that are being retrieved.
01:36:57.880 | So it works really well for the one document or chunk, kind of re-ranking use case.
01:37:03.880 | I've also seen people just do it outside in the second API call, say if you wanted to do it on a highlighted thing.
01:37:09.880 | Or if you wanted to do re-ranking sub-document chunks, that works pretty well for the API.
01:37:21.880 | But there's a text re-ranker retriever that specifically got added in 8.16 or 8.17.
01:37:29.880 | Yeah, so I think this is a simple example.
01:37:31.880 | Like, we have a standard match.
01:37:33.880 | Like, this will be very cheap.
01:37:35.880 | And then we have the text-similarity re-ranker, which uses our elastic re-ranker.
01:37:41.880 | That falls back to that model behind the scenes.
01:37:44.880 | So you can think about it.
01:37:46.880 | I was a functional programmer, so don't mind the parentheses.
01:37:50.880 | But you would have, like, the text re-ranker retriever.
01:37:54.880 | Inside of that, you have the RRF.
01:37:56.880 | Inside of that, you would have lexical and KNN as peers.
01:38:00.880 | And it works from the inside out.
01:38:02.880 | Hey, do each of those retrieval methodologies.
01:38:05.880 | Do, like, the Venn diagram.
01:38:07.880 | Find the best results.
01:38:08.880 | And then take the full text of those results and run them on re-ranker.
01:38:12.880 | It's almost like a little mini LLM asking: does this actually answer the question?
01:38:16.880 | And then ranking by what comes out.
01:38:18.880 | It's pretty good.
01:38:19.880 | The cool thing about the re-ranker is you can run it on structured lexical retrieval.
01:38:25.880 | You don't have to run it on a vector.
01:38:28.880 | You can run it on anything you want.
01:38:30.880 | So maybe you don't want to pay for vector search on everything.
01:38:32.880 | Or maybe the text is too small for vector search;
01:38:35.880 | you don't need the model to actually lock on to the stuff there.
01:38:39.880 | The re-ranker, when you run it on just kind of actual customer data sets,
01:38:44.880 | they're like, yeah, our evaluation score bumped by 10 points.
01:38:49.880 | Basically for free; it feels like cheating.
01:38:53.880 | Right?
01:38:54.880 | So when you run against, like, a Gemini API,
01:38:57.880 | and you're like, wow, why is this 10 points better than the Amazon one?
01:39:00.880 | It's because they threw on their retriever .
01:39:03.880 | Right?
01:39:04.880 | So there's a lot of black box stuff out there that we're exposing.
01:39:08.880 | So don't be scared if we're telling you how it works inside.
01:39:13.880 | But this is what the leading retrieval technology is doing under the hood
01:39:18.880 | and reselling to you as if, you know, it's all AI.
01:39:23.880 | Right?
01:39:24.880 | Yeah.
01:39:25.880 | Does that answer the question?
01:39:28.880 | Yeah.
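A minimal sketch of the nesting Dave described, with the index, fields, and inference endpoint ids as hypothetical placeholders; a text_similarity_reranker wraps an rrf retriever, which in turn wraps a lexical and a kNN retriever as peers:

    GET /quotes/_search
    {
      "retriever": {
        "text_similarity_reranker": {
          "retriever": {
            "rrf": {
              "retrievers": [
                { "standard": { "query": { "match": { "quote": "droid" } } } },
                { "knn": {
                    "field": "quote_vector",
                    "query_vector_builder": {
                      "text_embedding": { "model_id": "my-embedding-endpoint", "model_text": "droid" }
                    },
                    "k": 10,
                    "num_candidates": 50 } }
              ]
            }
          },
          "field": "quote",
          "inference_id": "my-rerank-endpoint",
          "inference_text": "droid",
          "rank_window_size": 50
        }
      }
    }

It works from the inside out: the two inner retrievers each fetch candidates, RRF blends them, and the re-ranker re-scores the top rank_window_size hits against the query text.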
01:39:29.880 | So my wish list is: I want to be able to do, in the retrievers API, re-ranking on sub-documents.
01:39:39.880 | A lot of my things are about sub-document retrieval.
01:39:41.880 | Right now, I've got to do it outside of the retrievers API, but I'm bending here as a developer.
01:39:50.880 | Yeah.
01:39:51.880 | So just to give you an example -- I don't think I have a re-ranking example here,
01:39:56.880 | but this one uses a classic keyword match retriever.
01:40:03.880 | And then we normalize the score here.
01:40:06.880 | I think somebody else asked about normalizing -- or we had a discussion about the normalizing.
01:40:10.880 | We do a min/max normalization.
01:40:12.880 | We weight this with two.
01:40:14.880 | And then I use the OpenAI embeddings, again normalized, with a weight of 1.5.
01:40:21.880 | And then they get blended together, and you get a result that won't surprise you:
01:40:25.880 | these are not the droids you're looking for,
01:40:27.880 | if you search for droid and robot, will be by far the highest-ranking document.
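A minimal sketch of that weighted, normalized blend using the linear retriever available in recent versions, with the index, field names, and inference endpoint as hypothetical placeholders; the weights mirror the demo:

    GET /quotes/_search
    {
      "retriever": {
        "linear": {
          "retrievers": [
            {
              "retriever": { "standard": { "query": { "match": { "quote": "droid robot" } } } },
              "normalizer": "minmax",
              "weight": 2.0
            },
            {
              "retriever": { "knn": {
                  "field": "quote_vector",
                  "query_vector_builder": {
                    "text_embedding": { "model_id": "my-embedding-endpoint", "model_text": "droid robot" }
                  },
                  "k": 10,
                  "num_candidates": 50 } },
              "normalizer": "minmax",
              "weight": 1.5
            }
          ]
        }
      }
    }

Min/max normalization rescales each sub-query's scores to the 0..1 range before the weights are applied, which is what makes the two otherwise incomparable score ranges blendable at all.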
01:40:33.880 | You had a question somewhere.
01:40:35.880 | How much control do you have there? If you're doing the re-ranking thing -- currently we do something
01:40:40.880 | similar, but we, like, do the different steps of the re-rankers at kind of, like, different
01:40:43.880 | levels of the distribution hierarchy.
01:40:44.880 | So, like, you have, like, sharded processing nodes, and you
01:40:53.880 | do some ranking there, and, like, once you rejoin, before you return
01:40:58.880 | the result from the query engine, that's when you run a final re-ranker.
01:41:03.880 | So, we would retrieve, like, x candidates and you could define the number of candidates
01:41:19.880 | and then we would run the re-ranking on top of those.
01:41:22.880 | So, that will be a trade-off for you, like, the larger the window is, the slower it will
01:41:27.880 | be, but the potentially higher quality your overall results will be because you will just
01:41:32.880 | have everything in your data set that you can then re-rank at the end of the day.
01:41:38.880 | Is that what you meant or you wanted something per node or --
01:41:43.880 | Yeah, I mean, like, right now we, like, control specifically
01:41:48.880 | where the compute happens, so that you can, like, spread it out
01:41:53.880 | and do, like, cheap re-ranking first and then do some re-ranking at the end,
01:41:58.880 | but -- but maybe that doesn't apply to these .
01:42:03.880 | Like, how the query is mechanically running?
01:42:06.880 | Yeah, I don't think that's how we do it.
01:42:09.880 | So, what you can control here is, like, this is a window of, like, what you might retrieve
01:42:14.880 | and then we have the minimum score, like a cutoff point, to throw out what might not be relevant
01:42:19.880 | anyway, to keep that a bit cheaper, that's what we have here.
01:42:29.880 | Those are the retrievers.
01:42:30.880 | And then you could do the RRF that I've explained where you blend results together.
01:42:34.880 | All of that is easy.
01:42:36.880 | One final note: if you got tired of all the JSON, we have a new way of defining those queries
01:42:45.880 | as well, where here we have a match operator, like the one we have used all the time, that
01:42:51.880 | you can use on a keyword field, but it could also be either a dense or a sparse vector
01:42:55.880 | embedding, and then you can just run a query on that and just get the scores from that.
01:43:01.880 | So, it is a piped language; it's a bit more like, I don't know, like a shell.
01:43:05.880 | But if you don't want to type all the JSON anymore, this is how you can do that.
01:43:10.880 | And here my screen size is a bit off.
01:43:12.880 | But, yeah, you get the quote that we retrieved, the speaker, and the score.
01:43:18.880 | Maybe I'll take out the speaker to make this slightly more readable.
01:43:25.880 | No, it broke.
01:43:37.880 | With this, you can write queries with a fraction of the JSON.
01:43:43.220 | This will also support funny things like joins.
01:43:46.220 | It doesn't have every single search feature yet, but it's getting pretty close.
01:43:50.640 | So this is more like a closing out at the end.
01:43:53.320 | If you're tired of all the JSON queries, you don't have to write JSON queries anymore.
01:43:59.040 | This is nice both for, like, observability use cases where you have, like, just like aggregations
01:44:04.380 | and things like that, but it's also very helpful for full text search now if you want to write
01:44:09.100 | different queries.
01:44:10.100 | I think the main answer is that the client support in the different languages, like Java, etc.,
01:44:15.220 | is not very strong yet.
01:44:16.220 | You basically give it strings and then it gives you a result back that you need to parse
01:44:19.940 | out again.
01:44:20.940 | So it is not as strongly typed on the client side yet as the other languages.
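A minimal sketch of what such a query looks like in ES|QL, with the index and field names as hypothetical placeholders; the MATCH function and the _score metadata column are available for full-text search in recent versions:

    FROM quotes METADATA _score
    | WHERE MATCH(quote, "droid robot")
    | SORT _score DESC
    | KEEP quote, speaker, _score
    | LIMIT 3

Each pipe stage transforms the output of the previous one, which is what gives it the shell-like feel mentioned above.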
01:44:29.100 | Any final questions?
01:44:31.100 | You just talked about hybrid search, and I was just curious, like, what are kind of the
01:44:35.820 | recommended best practices?
01:44:36.820 | Like, we also do hybrid search today. What we do is we trigger two Elasticsearch queries,
01:44:42.540 | one to do, like, a basic keyword search, the other one to do, like, a kNN search,
01:44:46.540 | and then we walk through some, like, code to rewrite and merge, and we get the final results.
01:44:51.820 | But, like, from the retrievers you just showed us, like, is it better, like, to combine those
01:44:53.540 | two queries in one retriever? And, like, would that make the results significantly better?
01:45:06.540 | Like, each of the queries will have its own stats and normalizer and things --
01:45:10.540 | I don't know, just, like, in general, it sounds better.
01:45:12.540 | I mean, we can make your life easier.
01:45:16.360 | Yeah.
01:45:17.360 | It's just all behind one single query endpoint.
01:45:20.300 | So you could use the two different methods to retrieve and then you could still re-rank
01:45:25.420 | but all from one single query so you don't have to do it yourself.
01:45:29.300 | I mean, it's not like we want to stop you, but you don't have to and we can make your life
01:45:32.800 | a bit easier.
01:45:33.800 | I mean, it's only one single query that you need to run and, like, one single round
01:45:46.480 | trip to the server that you need to do.
01:45:48.400 | Yeah.
01:45:49.400 | But, like, I was just curious comparing to the two queries method, would you do that?
01:45:56.280 | I mean, if you still need to do the retrieval, like, you do the retrieval, like, all the individual
01:46:00.540 | pieces are still there.
01:46:01.540 | If you have two parts of the query, you will still retrieve those if that is the main cost
01:46:06.140 | and then you have the re-ranking so you're not getting out of those completely.
01:46:10.460 | But you can just do it in one single request that you send.
01:46:12.860 | We take care of all of that for you and then send you one result set back rather than sending
01:46:17.140 | more back to your application.
01:46:18.900 | So it will potentially be a little less work on the Elasticsearch side, but it will mostly
01:46:24.140 | be less work on your application side.
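A minimal sketch of collapsing those two queries into one request with an rrf retriever, with the index, field names, and inference endpoint as hypothetical placeholders:

    GET /products/_search
    {
      "retriever": {
        "rrf": {
          "retrievers": [
            { "standard": { "query": { "match": { "description": "running shoes" } } } },
            { "knn": {
                "field": "description_vector",
                "query_vector_builder": {
                  "text_embedding": { "model_id": "my-embedding-endpoint", "model_text": "running shoes" }
                },
                "k": 20,
                "num_candidates": 100 } }
          ]
        }
      }
    }

One round trip, and the rank blending happens server-side instead of in your application code.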
01:46:27.280 | So if you don't have any problems, you may not notice, but you're running two map-reduces,
01:46:42.880 | right?
01:46:43.880 | When you could be running one, and you're denying the optimizer the opportunity to do any short
01:46:50.460 | circuits to say, oh, there are no more results that are better.
01:46:55.060 | Yeah.
01:46:56.060 | So you're potentially going to spend a little bit more performance and resources.
01:47:17.660 | So if it's not hurting you, by all means, keep going.
01:47:20.660 | But at some point, you're going to start vertically scaling your hardware when you
01:47:24.720 | don't need to; you can get further, it's just at a higher cost.
01:47:35.320 | Yeah.
01:47:36.320 | Perfect.
01:47:39.320 | Thank you so much.
01:47:40.320 | I hope everybody learned something.
01:47:42.320 | I will leave the instance running for today or so, so you can still play around with the queries
01:47:46.320 | if you feel like it.
01:47:48.320 | Thanks a lot for joining.
01:47:49.320 | If you want stickers --
01:47:50.320 | we have stickers up there.
01:47:51.320 | We also have a booth the next few days.
01:47:52.960 | Come join and get some proper swag from us there.
01:47:58.300 | Thank you.
01:47:59.300 | See you around.
01:47:59.620 | Thanks.