
Information Retrieval from the Ground Up - Philipp Krenn, Elastic



00:00:00.000 | Let's get going. Audio is okay for everybody. I have some slight feedback, but I'll try to
00:00:20.520 | manage. I hope it's okay for you. Hi, I'm Philipp. Let's talk a bit about retrieval. I'll show you
00:00:27.300 | some retrieval from the ground up. We'll keep it pretty hands on. You will have a chance
00:00:32.660 | to follow along and do everything that I show you as well. I have a demo instance that you
00:00:37.020 | can use. Or you can just watch me if you have any questions, ask at any moment. If anything
00:00:44.340 | is too small, reach out, and we'll try to make it larger. We'll try to adjust as we
00:00:48.760 | go along. So, I guess we're not over RAG yet. But RAG is a thing. And we'll focus on the
00:00:57.760 | R in RAG, the retrieval in retrieval augmented generation. We'll just focus on the retrieval. Just let's
00:01:04.760 | see where we are with retrieval. Quick show of hands. Who has done RAG before? Okay. That's
00:01:10.760 | about half or so. Who has done anything with vector search and RAG? Do I need vector search
00:01:19.580 | for RAG or can I do anything else? Yeah. Yeah. So, you can do anything. Retrieval is actually
00:01:28.620 | a very old thing. Depending on how you define it, it might be 50, 70, whatever years old.
00:01:34.560 | It's just getting the right context to the generation. I'll ignore all the generation
00:01:39.820 | for today. We'll keep it very simple. We'll just focus on the retrieval part of getting
00:01:43.220 | the right information in. Partially from the old stuff, like the classics. But we'll get
00:01:49.440 | to some new things as well as we go along. Who has done keyword search before? Just that
00:01:56.840 | is fewer than vector search, I feel like. Which almost reminds me of like 15 years ago or so,
00:02:04.140 | when NoSQL came up, like more people had done MongoDB, Redis, whatever else, rather than
00:02:09.020 | SQL. That has changed again. I think it will be kind of similar for retrieval. The way I
00:02:15.380 | would always say that vector search is a feature of retrieval. It's only one of multiple features
00:02:22.820 | or many features that you want in retrieval. And we'll see a bit why and how and we'll dive
00:02:27.180 | into those details. So, I work for Elastic, the company behind Elasticsearch. We're the most downloaded,
00:02:33.720 | deployed, whatever else, search engine. We do vector search, we do keyword search, we do
00:02:38.280 | hybrid search. We'll dive into various examples. Everything that I will show you works -- well,
00:02:44.960 | the query language is Elasticsearch. But if you use anything built on Apache Lucene, everything
00:02:50.540 | behaves very similarly. If you use something that is a clone or close to Lucene, like anything
00:02:56.680 | built on Tantivy or anything like that, it will be very similar. The foundation, keyword search
00:03:03.680 | and vector search will apply broadly everywhere. So, let's get going. We'll keep this pretty
00:03:09.800 | hands on. Who remembers in Star Wars when he's making that hand gesture? What is the quote?
00:03:16.920 | These are not the droids you're looking for. We'll keep this relatively Star Wars based. Feel
00:03:26.040 | free to come in and fill in the seats on the sides or wherever. I'm afraid we have, I think, one chair over
00:03:31.740 | there otherwise and one down there. Otherwise, it's getting a bit full. Okay. Let's look at
00:03:40.040 | what "these are not the droids you're looking for" does for search. And I will start
00:03:45.200 | kind of like with the classic approach. Keyword search or lexical search is like you search
00:03:50.040 | for the words that you have stored and we want to find what is relevant in our examples.
00:03:56.760 | If you want to follow along, there is a gist which has all the code that I'm showing you.
00:04:03.160 | So, let's go to ela.st/ai.engineer. There is one important thing. It's that I have one
00:04:11.000 | shared instance basically for everybody. So, you can all just use this without signing up
00:04:15.400 | for any accounts or anything. So, this is just a cloud instance that you can use. There is my handle.
00:04:21.240 | It's in the index name. If you don't want to fight and overwrite each other's data, replace that with
00:04:28.840 | your unique handle or something that is specific to you. Because otherwise, you will all work on
00:04:33.320 | the same index and kind of like overwrite each other's data. You can also just watch me. If you
00:04:37.960 | don't have a computer handy, that's fine. But if you want to follow along, ela.st/ai.engineer,
00:04:44.360 | there will be a gist. It will have the connection string. Like there is a URL and then the credentials
00:04:49.800 | are workshop, workshop. If you go into log in, it will say log in with Elasticsearch. That's where you
00:04:55.000 | use workshop, workshop. Then you will be able to log in. And you can just run all the queries that I'm
00:05:00.360 | showing you. You can try out stuff. If you have any questions, shout. I have a couple of colleagues
00:05:05.960 | dispersed in the room. So, if we have too many questions, we will somehow divide and conquer.
00:05:11.080 | So, let's get going and see what we have here. And I will show you most of the stuff live.
00:05:19.960 | I think this is large enough in the back row. If it's not large enough for anybody, shout and we will
00:05:24.440 | see how much larger I can make this. And let me turn off the Wi-Fi and hope that my wired connection
00:05:33.640 | is good enough. Let's refresh to see. Ooh. Maybe we will use my phone after all.
00:05:44.760 | Okay. Let's try this again.
00:06:07.320 | Okay. This is no good. Out you go.
00:06:26.600 | Okay. Hardest problem of the day solved. We have network.
00:06:34.200 | Okay. So, we have the sentence. These are not the droids you are looking for. And we will start
00:06:38.040 | with the classic keyword or lexical search. Like, what happens behind the scenes?
00:06:41.880 | So, what you generally want to do is you basically want to extract the individual words and then make
00:06:47.480 | them searchable. So, here, I'm not storing anything. I'm just looking at, like, how would that look like
00:06:53.240 | if I stored something? I'm using this underscore analyze endpoint to see what I will actually store in
00:07:01.080 | the background to make them searchable. So, these are not the droids you are looking for. And you see,
00:07:06.120 | these are not the droids you are looking for. In Western languages, the first step that happens is
00:07:21.160 | the tokenization. In Western languages, it's pretty simple. It's normally any white spaces and punctuation
00:07:25.800 | marks where you just break out the individual tokens. Especially Asian languages are a bit more
00:07:31.880 | complicated around that. But we will gloss over that for today. And we have a couple of interesting
00:07:37.320 | pieces of information here. So, we have the token. So, this is the first token. We have the start offset
00:07:43.720 | and the end offset. Why would I need a start and end offset? Why would I extract and then store that
00:07:50.200 | potentially? Any guesses? Yeah? Yes. Especially if you have a longer text, you would want to have that
00:07:58.360 | highlighting feature that you want to say, this is where my hit actually was. So, if I'm searching for
00:08:03.720 | these, which is maybe not a great word, but you would very easily be able to highlight where you had
00:08:08.120 | actually the match. And the trick that you're doing in search, generally what differentiates it from a
00:08:14.040 | database is a database just stores what you give it and then does basically almost everything at query
00:08:19.560 | or search time. Whereas a search engine does a lot of the work at ingestion or when you store the data.
00:08:25.480 | So, we break out the individual tokens. We calculate these offsets and store them. So, whenever we have a
00:08:31.400 | match afterwards, we never need to reanalyze the actual text, which could potentially be multiple pages
00:08:36.760 | long. But we could just highlight where we have that match because we have extracted those positions.
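(For reference, a minimal sketch of the analysis call being shown here; the `_analyze` endpoint returns each token with its start offset, end offset, and position:)

```
GET _analyze
{
  "tokenizer": "standard",
  "text": "These are not the droids you are looking for"
}
```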
00:08:41.640 | We have a position. Why would I want to store the position with the text that I have?
00:08:46.600 | Yeah? Annotation. So, the main use case that you have is if you have these positions and later on,
00:08:57.880 | we'll briefly look at if you want to look for a phrase, if you want to look for this word followed
00:09:02.040 | by that word. So, you could then just look for all the text that contain these words. But then you
00:09:07.880 | could also just compare the positions and basically look for n, n plus 1, et cetera. And you never need
00:09:12.840 | to look at the string again. But you can just look at the positions to figure out like this was one
00:09:17.480 | continuous phrase. Even if you have broken it out into the individual tokens. Most of the things that
00:09:23.960 | we see here are of type alphanum, for alphanumeric. An alternative type would be synonym. We'll skip over
00:09:30.440 | synonym definition because it's not fun to define tons of synonyms. But this is all the things that we're
00:09:36.200 | storing here in the background. You can also customize this analysis. And that is one of the features,
00:09:41.960 | again, of full text search and lexical searches that you preprocess a lot of the information to make
00:09:47.400 | that search afterwards faster. So, here you can see I'm stripping out the HTML because nobody's going
00:09:53.320 | to search for this emphasis tag. I use a standard tokenizer that breaks up, for example, on dashes.
00:10:01.480 | You will see that. Alternatives would be white space that you only break up on white spaces.
00:10:05.880 | I lowercase everything, which is most of the times what you want because nobody searches in Google
00:10:13.880 | with proper casing, or at least maybe my parents do. But nobody else searches with proper casing in Google.
00:10:20.200 | We remove stop words. We'll get to stop words in a moment. And we do stemming with the snowball stemmer.
00:10:27.800 | What stemming is it basically reduces a word down to the root. So, you don't care about singular,
00:10:32.280 | plural or like the flexion of a verb anymore. But you really care more about the concept. So,
00:10:38.360 | if I run through that analysis, does anybody want to guess what will remain of
00:10:43.800 | this phrase or which tokens will be extracted and in what form?
00:10:47.400 | Not a lot will remain. Two?
00:10:57.080 | Droid and Look?
00:10:58.520 | Yeah, close. So, we'll actually have three. So, we have Droid, You and Look. And you can see
00:11:08.840 | all the others were stop words which were removed. The stemming
00:11:13.720 | reduced looking down to look because we don't care if it looks, looking, look. We just reduce it to the
00:11:19.960 | word stem. So, we do this when we store the data. And by the way, when you search afterwards, your text
00:11:26.040 | will run through the same analysis that you would have exact matches. So, you don't need to do anything
00:11:30.520 | like a like search anymore in the future. So, this will be much more performant than anything that you
00:11:35.800 | would do in a relational database because you have direct matches. And we'll look at the data structure
00:11:40.200 | behind it in a moment. But what we get is Droid, You and Look with the right positions. So, for example,
00:11:47.640 | if we search for Droid, You, we could easily retrieve that because we have the positions,
00:11:52.200 | even though that is a weird phrase. Do we start indexing at zero or one?
00:11:57.320 | 0, yes. It's the only right way. There is a different discussion here. So, we are -- the positions are
00:12:09.080 | based starting at zero. And these are the tokens that are remaining. If you do this for a different
00:12:15.640 | language, like you might hear I'm a native German speaker. This is the text in German. And you would,
00:12:21.720 | if you use a German analyzer, it would know the rules for German and then would analyze the text in the
00:12:27.800 | right way. So, then you would have remaining Droid, den, such. Anybody wants to guess what happens if I
00:12:36.280 | have the wrong language for a text? It will go very poorly. Because the -- so, how this works is, basically,
00:12:47.160 | you have rules for every single language. It's like, what is the stop word? How does stemming work?
00:12:51.720 | If you apply the wrong rules, you basically just get wrong stuff out. So, it will not do what you want.
00:12:58.520 | So, what you get here is, like, this is an article. But, well, in English, the rule is an S at the end
00:13:06.200 | just gets stemmed away, even though this doesn't make any sense. So, you apply the wrong rules and you
00:13:10.840 | just produce pretty much garbage. So, don't do that. Just to give you another example,
00:13:16.360 | French, this is the same phrase in French. And then you see Droid, La, and Recherche are the words
00:13:27.480 | that are remaining in these examples. Otherwise, it works the same. But you need to have the right
00:13:31.960 | analysis for what you're doing. Otherwise, you'll just produce garbage. A couple of things as we're
00:13:39.160 | going along. The stop word list, by default, which you could overwrite, is relatively short. This is
00:13:45.160 | linguists have spent many years figuring out what are the right list of stop words. And you don't want
00:13:49.960 | to have too many or too few. In English, I always forget, I think it's 33 or so. This is where you
00:13:56.120 | can find it in the source code. It's -- I don't want to say well hidden, but it's not easy to find either.
00:14:00.440 | So, every language has, like, a list of stop words that are defined and that will be automatically removed. For
00:14:06.440 | "These are not the droids you are looking for", by accident, more or less, we had a lot of stop words,
00:14:11.640 | and that's why not a lot remained here in the phrase. And then for all other languages,
00:14:15.320 | you will have a similar list of stop words. Should you always remove stop words?
00:14:25.560 | Yes, no. Yes. That is, by the way, another good one: "not" -- I'm not sure if everybody heard that.
00:14:36.040 | The comment was about not. One important thing here, we're talking about lexical or keyword search,
00:14:41.800 | which is dumb, but scalable. It doesn't understand if there is a droid or there's no droid. It's just
00:14:49.640 | defined as a stop word. It does just keyword matching. That is, in vector search or anything
00:14:56.200 | with a machine learning model behind it will be a bit of a different story afterwards, where these
00:15:00.840 | things might make a difference. But this is very simple, because it just matches on similar strings,
00:15:06.120 | basically. It doesn't understand the context. It doesn't know what's going on. That's why the linguists
00:15:10.840 | decided "not" is a good stop word. You could overwrite that if, for your specific use case, this is not a
00:15:17.240 | good idea. Always removing stop words, yes, no, maybe. So, our favorite phrase is it depends.
00:15:27.560 | And then you have to explain, like, what it depends on. So, what it depends on is there are scenarios
00:15:34.360 | where removing all stop words does not give you the desired result. And maybe you want to have, like,
00:15:39.400 | a text with and without stop words. Like, sometimes stop words are just, like, a lot of noise that blow up
00:15:44.520 | the index size and don't really add a lot of value. That's why we have to find them and try to remove
00:15:48.840 | them by default. But if you had, for example, to be or not to be, these are all stop words.
00:15:54.840 | It would all be gone when you run it through analysis. So, it is tricky to figure out, like,
00:16:02.600 | what is the right balance for stop words or what works for your use case. But you might have unexpected surprises
00:16:08.040 | in all of this. Okay. We have seen the German examples. Let's do some more queries. Or let's
00:16:17.560 | actually store something. So far, we only pretended or we only looked at what would happen if we would
00:16:23.240 | store something. Now, I'm actually creating an index. Again, if you're running this yourself,
00:16:28.680 | please use a different name than me. Just replace all my handle instances with your handle or whatever you
00:16:37.400 | want. Since this is a shared instance. If you have too many collisions, I might jump to another instance
00:16:42.440 | that I have as a backup in the background. But what I'm doing here is I'm creating this analysis pipeline
00:16:48.360 | that I have looked at before. Like, I'm throwing out the HTML. I use a standard tokenizer, lower casing,
00:16:53.400 | stop word removal, and stemming. And then I call this my analyzer. And then I'm basically applying this
00:17:00.760 | my analyzer on a field called "quote". We call this a mapping. It's kind of like the equivalent of a schema in a
00:17:09.000 | relational database. But this defines how different fields behave. Okay. And somebody did not replace
00:17:19.800 | the query. By the way, you need to keep the user_ prefix. Let me quickly do this myself. Oops. I should have seen
00:17:38.840 | this one coming. We want to replace, and we'll use, oops, oops. Please don't copy that. And I want to
00:18:07.160 | let's try it again. So we're creating our own index. And now I just to double check, I'll just again run this
00:18:19.160 | underscore analyze against this field that I've set up to just double check that I've set it up correctly.
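(A minimal sketch of what that index creation might look like; user_myhandle is a placeholder for your own handle, and the analysis chain mirrors the steps described above:)

```
PUT user_myhandle
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "snowball"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "quote": { "type": "text", "analyzer": "my_analyzer" }
    }
  }
}
```

The double check is then a GET user_myhandle/_analyze call with "field": "quote" and a sample text, which runs the text through the analyzer attached to that field.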
00:18:37.160 | And now I'm actually starting to store documents. Bless you. So we'll store -- these are not the droids you're
00:18:45.160 | looking for. I have two others that I'll index just so we have a bit more to search. No, I am your father.
00:18:53.880 | Any guesses what will remain here? Father. Father. Yeah.
00:19:00.680 | Okay. Let's try this out. Let me copy my -- this one actually has way fewer stop words than you would expect.
00:19:11.160 | Let's quickly do this. Since I didn't do the HTML removal, let's take these out manually.
00:19:23.560 | So what you get is, no, I am your father. And this was stupid because this was not what I wanted. We need to run this against the right analysis.
00:19:33.160 | This happens when you copy-paste. Okay, um, uh, sorry. And we'll do text.
00:19:47.320 | No, I think I've patched this back together. Okay, I am your father. So "no" is the only stop word in this list, actually.
00:20:01.000 | "No" is on the stop word list, all the others are not. Um, okay, let's try another one.
00:20:19.720 | Obi-Wan never told you what happened to your father. How many tokens will Obi-Wan be? Two? One?
00:20:32.040 | No, Obi-Wan will be two. Like Obi-Wan -- because we use the default tokenizer or standard tokenizer,
00:20:41.880 | that one breaks up at dashes. If you had used another tokenizer like white space, that would keep it
00:20:46.760 | together because it breaks up in white spaces. So there are various reasons why you want or would
00:20:51.560 | not want to do it. I don't want to go into all the details. But there are a lot of things to do
00:20:55.880 | right or wrong when you ingest the data, which will then allow you to query the data in specific ways.
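(A small sketch of that difference, using the `_analyze` endpoint again; the standard tokenizer splits Obi-Wan on the dash, the whitespace tokenizer keeps it together:)

```
GET _analyze
{
  "tokenizer": "standard",
  "text": "Obi-Wan"
}

GET _analyze
{
  "tokenizer": "whitespace",
  "text": "Obi-Wan"
}
```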
00:21:01.160 | So, for example, if you would have an email address, that one is also weirdly broken up. Like,
00:21:09.960 | you might use, like, there's a dedicated tokenizer for URL and email addresses. So, depending on what type of
00:21:16.200 | data you have, you will need to process the data the right way because pretty much all the smart
00:21:21.400 | pieces are kind of like at ingestion here to make the search afterwards easier. So, you can easily do
00:21:27.800 | that. Let's see. Let's index all my three documents so that we can actually search for them. Now, if I
00:21:35.720 | start searching for droid, it should match. These are not the droids you're looking for. Yes or no. Because this
00:21:43.800 | one is singular and uppercase, and the droid that we stored was plural and lowercase. Will that match,
00:21:49.240 | yes or no? Yes. Why? Because of the stemming.
00:21:54.520 | Yes, we had the stemming. We had the lower casing. And when we search, so we store the text, it runs
00:22:01.560 | through this pipeline or the analysis. And for the search, it does the same thing. So, it will lowercase the
00:22:07.880 | droid. It has stemmed down the droids in the text to droid and then we have an exact match. So, what the
00:22:16.360 | data structure behind the scene actually looks like. The magic is kind of like in this so-called inverted
00:22:23.080 | index. What the inverted index is, is these are all the tokens that remained that I have extracted.
00:22:30.920 | I have alphabetically sorted them. And they basically have a pointer and say in this document, like with
00:22:36.120 | the IDs 1, 2, 3 that I have stored, we have how many occurrences? Like 0, 1, yeah, nothing had 2.
00:22:45.080 | And then we also know at which position they appeared. So, search for droid now. This is what I have stored.
00:22:55.560 | I lowercase the droid to droid. I have an exact match here. Then I go through the list and see,
00:23:00.600 | retrieve this document, skip this one, skip this one. And at position 4, you have that hit. And then you
00:23:06.440 | could easily highlight that. So, you have almost done all the hard work at ingestion and this retrieval
00:23:11.640 | afterwards will be very fast and efficient. That's the classic data structure for search, the inverted
00:23:17.880 | index where you have this alphabetic list of all the tokens that you have extracted to do that.
00:23:23.160 | And this will just be built in the background for you and that's how you can retrieve all of this.
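(The droid search as a sketch; the match query analyzes the search text with the same analyzer as the stored field, so the lowercased, stemmed droid finds the stored token:)

```
GET user_myhandle/_search
{
  "query": {
    "match": { "quote": "Droid" }
  }
}
```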
00:23:26.920 | Let's look at a few other queries and how they behave. If I search for robot, will I find anything?
00:23:41.240 | No, because there was no robot. There was a droid. We could now define a synonym and say, like,
00:23:51.160 | all droids are robots, for example. Who likes creating synonym lists? Nobody anymore. Okay.
00:24:00.920 | Normally, I would have said that's the Stockholm syndrome because there is sometimes somebody who
00:24:05.320 | likes creating synonym lists because they have done that for so many years. But it got easier nowadays.
00:24:10.840 | Now you can use LLMs to generate the synonyms. So, it can get a bit easier to create them. But they're
00:24:16.200 | still limited because you have always this mapping. So, with synonyms, you can expand the right way.
00:24:21.480 | Where it gets trickier if you have homonyms. If a word has multiple meanings, like a bat could be the
00:24:28.360 | animal or it could be the thing you hit a ball with. There it just gets trickier because there is no meaning
00:24:34.360 | behind the words or no context. So, you just match strings and that is inherently limited. But, like I said,
00:24:40.680 | it's dumb, but it scales very well. And that's why it has been around for a long time. And it does
00:24:46.920 | surprisingly well for many things because there's not a lot of things that are unexpected or that can
00:24:51.640 | go totally wrong. Now, other things that you can do. You could do a phrase search where you say,
00:24:58.680 | I am your father. Will this find anything? Yes. Because we had no, I am your father.
00:25:11.320 | What happens if I say, for example, I am, let's see, I am not your father. Yes, no? No. No. Why? So, you're right.
00:25:27.080 | Looking for an exact match based on the position.
00:25:29.240 | Not as a stop word. But not as a stop word. But you're right because the positions still don't match.
00:25:38.280 | So, the stop word not would be filtered out, but it still doesn't match because the positions are off.
00:25:45.240 | That is one of the things that sometimes can be confusing. So, even if something is a stop word
00:25:51.240 | and will be filtered out, it doesn't work like that. One thing that you can do is, though,
00:25:56.520 | that the factor is called slop, where you basically say if there is something missing,
00:26:01.480 | it would still work. So, I am your father and I am father with slop zero, that's kind of like the
00:26:09.000 | implicit one. Will not find anything. But if I say one, then I basically say, like,
00:26:13.880 | there can be a one-off in there. Like, one word can be missing.
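(A sketch of such a phrase query; slop is the parameter on the match_phrase query:)

```
GET user_myhandle/_search
{
  "query": {
    "match_phrase": {
      "quote": {
        "query": "I am father",
        "slop": 1
      }
    }
  }
}
```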
00:26:24.040 | However, I am his father. Here, his would not match. So, this still will not work.
00:26:28.680 | The slop is really just to skip a word. Yeah?
00:26:31.960 | What about I'm your father?
00:26:33.880 | I'm your father?
00:26:37.080 | I assume that -- no, I'm possibly your father. I assume that won't work.
00:26:41.000 | Ah. That will not work.
00:26:43.240 | How would you get that to work?
00:26:45.160 | There you might need to do something like a synonym where you say 'm gets replaced by am.
00:26:52.680 | Or we will need to have some more machine learning capabilities behind the scenes to do stuff like
00:26:57.640 | that. Are there any libraries that would predefine contractions like that?
00:27:02.760 | So, what is built in is generally a very simple set of rules. What you will need to do for things
00:27:11.800 | like this is normally you need a dictionary. The problem around these is they are normally not available for
00:27:17.000 | free or open source. Funnily enough, they are often coming out of university, the dictionaries,
00:27:24.200 | because they have a lot of free labor. The students. That's why the universities have been creating a lot
00:27:31.240 | of dictionaries. But they often come out under the weirdest licenses. That's why they are not very widely
00:27:35.720 | available. But, yes, there is a smarter or more powerful approach if you have a dictionary and you
00:27:41.480 | can do these things. For example, one thing to show is like, maybe that's a good thing to also mention.
00:27:50.920 | You don't always get words out of the stemming. It's not a dictionary. It doesn't really get what
00:27:58.520 | you're doing. It just applies some rules. So, for example, Blackberry. Blackberry. Sorry, Blackberries,
00:28:09.000 | I think that this will be stemmed down differently. Ah, sorry, I need English. Without English, this will
00:28:15.160 | not work. So, this will stem down to this weird word "blackberri". And it will also stem down the singular
00:28:27.160 | Blackberry. So, there's a rule that applies this. But it's just a rule. It's not dictionary-based.
00:28:32.600 | It's not very smart. And it only has some rules built in that work for this. But you will definitely
00:28:38.680 | hit limits. And the other thing, by the way, why I picked Blackberry as an example, you have some
00:28:45.720 | annoying languages like German, Korean, and others that compound nouns like Blackberry, where you have
00:28:53.000 | basically two words. Black would never find Blackberry in the simplest form because it's not a complete
00:28:59.880 | string. There are various ways to work around that that all come with their own downsides. And either
00:29:05.960 | you have a dictionary or you extract the so-called n-grams. It's like groups of characters, and then you match
00:29:10.680 | groups of characters. But all of those are one of the many tools how we try to make this a bit better or smarter,
00:29:17.560 | but it all has limitations. I hope that answers the question and makes sense.
00:29:22.360 | So, there are dictionaries, but they're generally not free or not under an easy license available.
00:29:29.080 | For some languages, by the way, even the stemmers are not freely available. I think there is a stemmer
00:29:35.880 | or analyzer for Hebrew. I think that has also like some commercial license or at least you can't use it for
00:29:43.240 | free or free in commercial products. Though licensing with machine learning models is also its own dark
00:29:51.240 | secret. Yeah.
00:29:52.360 | Yes. That is what an n-gram is doing. Let me see if I can.
00:30:18.120 | An n-gram is normally a character group, normally a trigram. This is way too small.
00:30:23.800 | Somehow I have weirdly overwritten my command plus so I can't use that. Let me make this slightly larger.
00:30:32.440 | Okay. Here we basically use one or two letters as word groups, which is way too small.
00:30:46.360 | But just to show the example, and this is very hard to read. Let me copy that over to my console.
00:30:52.920 | There you can -- there you can -- oops. There you can see this. But this is a great question.
00:31:03.080 | So, we'll use n-gram for quick fox. And then you can see the tokens that I extract here are the first letter,
00:31:10.200 | the first two, the second, the second and third, et cetera. And you end up with a ton of tokens.
00:31:16.760 | The downside is, A, you have to do more work when you store this. B, it creates a lot of storage on
00:31:23.640 | disk because you extract so many different tokens. And then your search will also be pretty expensive
00:31:28.120 | because normally you would at least do trigrams. But even that creates a ton of tokens and a ton of
00:31:36.440 | matches. And then you need to find the ones with the most matches. And it works. But A, it is pretty
00:31:42.280 | expensive in disk but also query time. And it might also create undecided results or results that are a bit
00:31:49.160 | unexpected for the end user. It is, I would call it, again, it's a very dumb tool that works reasonably
00:31:56.760 | well for some scenarios. But it's only one of many potential factors. What you could potentially do is,
00:32:03.400 | and I don't have a full example for that, but we could build it quickly, what you would do in reality
00:32:09.080 | probably, you might store a text more than one way. So, you might store it, like, with stop words and
00:32:15.720 | without stop words and maybe with engrams. And then you give a lower weight to the engrams and say,
00:32:22.040 | like, if I have an exact match, then I want this first. But if I don't have anything in the exact
00:32:26.040 | matches, then I want to look into my engram list. And then I want to kind of, like, take whatever is
00:32:31.400 | coming up next. So, even keyword-based search will be more complex if you combine different methods.
00:32:38.680 | N-grams are interesting, but, again, they're a dumb but pretty heavy hammer.
00:32:45.320 | Use them in the right scenario.
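(A sketch of trying the n-gram tokenizer directly, with min_gram and max_gram spelled out; one and two are the defaults discussed next:)

```
GET _analyze
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 1,
    "max_gram": 2
  },
  "text": "quick fox"
}
```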
00:32:48.040 | Sorry, quick question about this n-gram. Is it by default one or two?
00:32:51.560 | Yes. But you could redefine that. So, we can, let me go back to the docs. For the n-gram tokenizer,
00:33:01.000 | you can say min_gram and max_gram. If you set both to three, you would have trigrams, where it's always
00:33:06.840 | groups of three, like 1, 2, 3, 2, 3, 4, et cetera. You could also have something called edge n-gram,
00:33:14.200 | where you expect that somebody types the first few letters right, and then you only start from the
00:33:18.600 | beginning but not in the middle of the word, which sometimes avoids unexpected results. And, of course,
00:33:24.680 | reduces the number of tokens quite a bit. So, somewhere in here, edge n-gram.
00:33:38.120 | Let's just copy that over so I won't type. So, here we have edge n-gram with quick, and you can see it
00:33:47.400 | only does the first and the first two letters, but nothing else. And, in reality, you would probably
00:33:53.160 | define this like 2, 2, 5, or more, or whatever else you want. But, here, we only do from the start and
00:33:59.880 | nothing else, which reduces the tokens tremendously. But, of course, if you have blackberry and you want to
00:34:06.840 | match the berry, you're out of luck. Makes sense. Anybody else? Anything else?
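(The edge n-gram variant as a sketch; with these settings the text quick should only produce q and qu:)

```
GET _analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 1,
    "max_gram": 2
  },
  "text": "quick"
}
```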
00:34:16.120 | Yeah, so, if you have multiple languages, do not mix them up. That will just create chaos. Because we'll get to that in a moment. But, how keyword search works is basically word frequency.
00:34:23.240 | And if you mix languages, it screws up all frequencies and statistics. So, what you would do is, either you
00:34:44.520 | have a field for English, and then you would have a field for whatever the abbreviation for
00:34:53.800 | Hebrew is. And then you would have that. And then you would need to define the right analyzer for that
00:35:03.080 | specific field. So, you break it out either into different fields or you could even do different indices.
00:35:07.080 | And ideally, we even have that built in. We have language identification. Even if you just provide a couple of
00:35:16.360 | words, it will guess, or not guess, it will infer the language with a very high degree of certainty.
00:35:34.440 | Especially Hebrew will be very easy to identify. If you have your own diacritics, it's easy. But even if
00:35:44.760 | you just throw random languages at it, it will have a very good chance, just with a few words, to know
00:35:49.880 | this is this language and then you can treat it the right way. Good. Let's continue.
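(A sketch of the per-language field layout, assuming English and German fields with the built-in language analyzers; all names are placeholders:)

```
PUT user_myhandle-languages
{
  "mappings": {
    "properties": {
      "quote_en": { "type": "text", "analyzer": "english" },
      "quote_de": { "type": "text", "analyzer": "german" }
    }
  }
}
```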
00:36:03.320 | So, we have done all of these searches. We have done slop. One more thing before we get into the
00:36:12.200 | relevance. One other very heavy hammer that people often overuse is fuzziness. So, bless you. If you
00:36:21.240 | have a misspelling, so I misspelled Obi-Wan Kenobi. We already know that this is broken out into two different
00:36:29.080 | words or tokens. It will still match your Obi-Wan because we have this
00:36:33.240 | fuzziness, which allows edits. It's like a Levenshtein distance. So, you can have one. By default here,
00:36:42.200 | you could either give it an absolute value, like you can have one edit, which could be one character
00:36:49.400 | too much, too little, or one character different. You could set it to two or three. You can't do
00:36:55.800 | more because otherwise you match almost anything. And auto is kind of smart because, depending on
00:37:03.160 | on how long the token that you're searching for, it will set a specific value. If you have zero to two
00:37:10.120 | characters, auto fuzziness, I think is one from two to -- no, zero to two characters is zero. Three to five
00:37:18.440 | characters is one. And after that, it's two. So, you can match these. Will this one match?
00:37:31.720 | Yes, no, yes, no, and why?
00:37:36.680 | No, because you go over the edit value.
00:37:40.680 | Yes. So, we have -- we have -- both of those are misspelled.
00:37:43.960 | It still matches. Why?
00:37:48.360 | They get tokenized separately, and each one can have a single edit.
00:37:52.680 | Yes. That is a bit of a gotcha. So, yes. You need to know the tokenizer. So, we tokenize with standard,
00:37:58.920 | so it's two tokens, and then the fuzziness applies per token, which is another slightly surprising
00:38:04.920 | thing. But, yes, that's how you end up here. Okay. Now, we could look at how the Levenshtein
00:38:14.600 | distance works behind the scenes, but it's basically a Levenshtein automaton which looks something like
00:38:21.240 | this. If you search for food and you have two edits, this is how the automaton would work in the
00:38:25.320 | background to figure out, like, what are all the possible permutations. It's a fancy algorithm that
00:38:30.440 | was, I think, pretty hard to implement, but it's in Lucene nowadays. Okay. Now, let's talk about
00:38:38.520 | scoring. One thing that you have seen that you don't have anywhere or in a non-search engine or
00:38:44.200 | just in a database is that we have a score. It's like, how well does this match? How does the score
00:38:50.680 | work here? Let's look at the details of that one. So, the basic algorithm, which most of us,
00:39:01.640 | or pretty much all of us here, use is term frequency / inverse document frequency, or TF-IDF. It has been slightly
00:39:09.720 | tweaked, like the new implementation is called BM25, which stands for best match, and it's the 25th
00:39:15.240 | iteration of the best match algorithm. So, what they look like is you have the term frequency. If I
00:39:22.200 | search for Droid, how many times does Droid appear in the text that I'm looking for? And it's basically
00:39:28.840 | the square root of that. So, the assumption is if a text contains Droid once, this is the relevancy. If I
00:39:35.880 | have a text that contains Droid 10 times, this is the relevancy. The tweak between TF-IDF, that one just
00:39:42.680 | keeps growing, BM25 says, like, once you hit, like, five Droid in a text, it doesn't really get much more
00:39:48.840 | relevant anymore. So, it kind of, like, flattens out the curve. That is the idea of term frequency.
00:39:55.000 | The next thing is the inverse document frequency, which is almost the inverse curve. The assumption here is
00:40:04.120 | over my entire text, this is how often the term Droid appears. So, if a term is rare, it is much more
00:40:13.080 | relevant than if a term is very common, then it's kind of, like, less relevant. Basically, the assumption
00:40:18.920 | is rare is relevant and interesting. Very common is not very interesting anymore. And then it's kind of,
00:40:24.520 | like, just works its curve out like that. And the final thing is the field length norm is, like,
00:40:32.280 | the shorter a field is and you have a match, the more relevant it is. Which assumes, like,
00:40:38.680 | if you have a short title and your keyword appears there, it's much more relevant than if there's a
00:40:42.600 | very long text body and you have a match there. And these are the three main components
00:40:49.240 | of TF-IDF. So, let's take a look at how this looks like. You can make this a bit more complicated.
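(For reference, the classic Lucene versions of these three components are roughly the following; this is a sketch of TF-IDF scoring, not the exact BM25 variant:)

```latex
\mathrm{tf}(t,d) = \sqrt{\mathrm{freq}(t,d)} \qquad
\mathrm{idf}(t) = 1 + \ln\frac{N}{\mathrm{df}(t)+1} \qquad
\mathrm{norm}(d) = \frac{1}{\sqrt{\lvert d \rvert}}
```

Here N is the total number of documents, df(t) is how many documents contain the term, and |d| is the number of terms in the field.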
00:40:56.200 | This will show you why something matches. Don't be confused by the -- or let me take that out for the
00:41:05.240 | first try. So, I'm looking for father. And I am -- no, I am your father. And Obi-Wan never told you what
00:41:12.600 | happened to your father. One is more relevant than the other. Why is the first one more relevant than
00:41:19.160 | the second one? Yeah. Term frequency is the same. Both contain father ones. The inverse document frequency
00:41:29.560 | is also the same because we are looking for the same term. The only difference is that the second one
00:41:34.840 | is longer than the first one. And that's why it's more relevant here. So, this is very simple. And you
00:41:44.360 | can then, if you're unsure why something is calculated in a specific way, you can add this explained true.
00:41:49.800 | And then it will tell you all the details of, like, okay, we have father. And it then calculates basically
00:41:57.880 | all the different pieces of the formula for you and shows you how it did the calculation. So, you can
00:42:01.960 | debug that if you need to. But it's probably a bit too much output for the everyday use case.
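(As a sketch, the explain flag just sits next to the query:)

```
GET user_myhandle/_search
{
  "explain": true,
  "query": {
    "match": { "quote": "father" }
  }
}
```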
00:42:07.640 | And then you can customize the score if you want to. Here I'm doing a random score. So,
00:42:16.280 | my two fathers -- this is a bit hard to show -- they will just be in random order because their score is
00:42:21.960 | here randomly assigned. But you could do this more intelligently that you combine, like, the score and,
00:42:29.240 | like, if you have, I don't know, the margin on the product that you sell or the rating that you
00:42:34.200 | include that in the rating somehow and you can build a custom score for things like that.
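(A sketch of the random scoring via the function_score query; in a real setup you would replace random_score with something like a field_value_factor on a rating field:)

```
GET user_myhandle/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "quote": "father" } },
      "random_score": {}
    }
  }
}
```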
00:42:38.120 | So, you can influence that any way you want. One thing that I see every now and then that is a
00:42:46.680 | very bad idea and we'll skip this one because it's probably a bit too much. This one, by the way,
00:42:52.040 | is the total formula that you can do or maybe I'll show you the parts that I skipped. What happens if
00:42:58.200 | you search for two terms and they're not the same, they don't have the same relevancy? So, what the
00:43:03.560 | calculation behind the scenes basically looks like is let's say we search for father. Father is very rare,
00:43:11.080 | that's why it's much more relevant than your. Your is pretty common. And then we have a document that
00:43:16.040 | contains your father. It's kind of like this axis. This will be the best match. But will a document
00:43:22.680 | that only contains father be more relevant or only your? Intuitively, the one with just father will
00:43:29.800 | be more relevant. But how does it calculate that? It basically calculates like this is the relevancy of
00:43:35.880 | father. This is the ideal document and this is your. And then it looks like which one has the shorter
00:43:41.880 | angle. And this is the one that is more relevant. So, if you have a multi-term search, you can figure
00:43:49.000 | out which term is more relevant and how they are combined. And then you can also have the coordination
00:43:54.040 | factor which basically rewards documents containing more of the terms that you're searching for. So,
00:43:59.080 | if I'm searching for three terms like I am father, whatever. If a document contains all three, this
00:44:09.320 | will be the formula that combines the scores of all three and multiplies it by three divided by three.
00:44:14.920 | If it only contains two of them, it would only have the relevancy of 2/3 and with one 1/3. And then you
00:44:21.080 | put it all together and this is the formula that happens behind the scenes and you don't have to do that
00:44:25.800 | in your head, luckily. Cool. We have seen these. One thing that we see every now and then is that
00:44:34.600 | people try to translate the score into percentages. Like you say, this is a 100% score and this is only like
00:44:44.120 | a 50% match. Who wants to do that? Hopefully nobody, because the Lucene documentation is pretty
00:44:52.680 | explicit about that. You should not think about the problem in that way, because it doesn't work.
00:44:59.480 | And I'll show you why it doesn't work or how this breaks. Let's take another example.
00:45:05.560 | Let's say we take this short text. These are my father's machines. I couldn't think of a good Star Wars quote to
00:45:18.280 | use here, but bear with me. So what remains if I run this through my analyzer? My father machine.
00:45:23.720 | These are the three tokens that remain. Now, I will store that. You remember the three tokens that we have
00:45:31.160 | stored. And if I search for my father machine, you might be inclined to say this is the perfect score.
00:45:43.960 | This is like 100%. Agreed? Because all the three tokens that I have stored in these are my father's
00:45:51.400 | machines are there. So this must be like my perfect match. So it's 3.2, that would be 100%.
00:45:57.080 | The problem now is every time you add or remove a document, the statistics will change and your score
00:46:03.400 | will change. So if I delete that document and I search the same thing again, I don't know what percentage
00:46:10.520 | this is now. Is this now the new 100% the best document or is this a zero point or, I don't know,
00:46:15.800 | 20%? How does this compare? And then you can play funny tricks where these droids are my father's
00:46:24.840 | father's machines. And you can see I have a term frequency of 2 for father here. So if I store that one
00:46:32.200 | then and then search it, is this now 100%, is this now 110%? So don't try to translate scores into
00:46:44.200 | percentages. They're only relevant within one query. They're also not comparable across queries. They're
00:46:50.200 | really just sorting within one query to do that. Okay. Let me get rid of this one again.
00:46:57.400 | Now, we've seen the limitations of keyword search. We don't want to define our synonyms. We might want
00:47:06.840 | to extract a bit more meaning. So we'll do some simple examples to extend. I will add, from OpenAI,
00:47:17.400 | a text embedding model, text-embedding-3-small. I'm basically connecting that inference API for text embeddings
00:47:24.520 | here in my instance. I have removed the API key. You will need to use your own API key if you want to
00:47:30.760 | use it. But it is already configured. So let me pull up the inference services that we have here. I have
00:47:38.280 | done -- or I have added two different models. One sparse, one dense. Let's go to these. By the way,
00:47:49.000 | if you try to do this with a 100% score, don't do this. Because it will just not work. Okay.
00:47:58.520 | Not everybody has worked with dense vectors, right? So I have a couple of graphics coming back to our
00:48:05.480 | Star Wars theme, just to look at how that works. So what you do with dense vectors is we keep this
00:48:12.840 | very simple. This one just has a single dimension. And it has, like, the axis is pretty much like
00:48:20.280 | realistic Star Wars characters and cartoonish Star Wars characters. And this one falls on the realistic
00:48:25.560 | side and that other one is just cartoonish. And you have a model behind the scenes that can rate those
00:48:31.720 | images and figure out where they fall. Now, in reality, you will have more dimensions than one.
00:48:38.760 | And you will also have floating point precision. So it's not just, like, minus one, zero, or one.
00:48:45.400 | But you will have more dimensions. So, for example, here, in human and machine, and in a realistic model,
00:48:55.000 | you don't have -- the dimensions are not labeled as nicely and clearly understandable. The machine has
00:48:59.800 | learned what they represent. But they're not representing an actual thing that you can extract like that.
00:49:04.680 | But in our simple example here, now, we can say this layer character is realistic and a human versus,
00:49:14.200 | I don't know, the Darth Vader is cartoonish and, I don't know, somewhere between human and machine.
00:49:23.080 | So this is the representation in the vector space. And then you could have, like I said,
00:49:27.080 | you could have floating point values and then you can have different characters. And similar characters,
00:49:32.840 | like, both of those are human. Without the hand, he's only, like, not quite as human anymore,
00:49:38.200 | so he's a bit lower down here. So he's a bit closer to the machines. So you can have all of your entities in
00:49:47.960 | this vector space. And then if you search for something, you could figure out, like,
00:49:51.880 | which characters are the closest to this one. And again, in reality, you will have hundreds of
00:49:57.880 | dimensions. It will be much harder to say, like, these are the explicit things and this is why it
00:50:03.320 | works like that. It will depend on how good your model is in interpreting your data and extracting the
00:50:09.960 | right meaning from it. But that is the general idea of dense vector representation. You have your documents
00:50:18.200 | or sometimes it's like chunks of documents that are represented in this vector space and then you
00:50:24.360 | try to find something that is close to it for that. Does that make sense for everybody or any specific
00:50:31.800 | questions? So it's a bit more opaque, I want to say. It's not quite as easy because you say, like,
00:50:39.720 | these five characters match these other five characters here. But you need to trust or evaluate
00:50:45.880 | that you have the right model to figure out how these things connect. So let's see how that looks like.
00:50:56.280 | I have one dense vector model down here. We have OpenAI embedding. This one is a very small model. It
00:51:05.800 | only has 128 dimensions. The results will not be great, but it's actually for demonstrating it actually
00:51:14.760 | helpful. So we'll see that. The other model that we have, and let me show you the output of that. So if I
00:51:20.760 | take my text, these are not the droids you are looking for, this is the representation. It's basically
00:51:26.840 | an array of floating point values that will be stored and then you just look for similar floating point
00:51:32.600 | values. And then you have these are not the droids you are looking for. Here on the previous one,
00:51:37.320 | dense text embedding. This one here does sparse embedding. The main model used for that
00:51:47.720 | is called SPLADE. Our variant of SPLADE, we call it ELSER. It's kind of like a slightly improved SPLADE,
00:51:56.280 | but the concept is still the same. What you get is, you take your words, and this is not just a TF-IDF.
00:52:05.960 | This is a learned representation where I take all of my tokens and then expand them and say, like, for this text,
00:52:14.680 | these are all the tokens that I think are relevant. And this number here tells me how relevant they are.
00:52:20.920 | Again, not all of these make sense intuitively. And you might get some funky results, for example,
00:52:29.720 | with foreign languages. This currently only supports English. But these are all the terms that we have
00:52:37.240 | extracted. Normally, yeah, you get, like, 100-something or so. So, the idea is that this text is represented
00:52:46.920 | by all of these tokens. And the higher the score here, the more important it is. And what you will do is,
00:52:52.920 | you store that behind the scenes. When you search for something, you will generate a similar list,
00:52:58.680 | and then you look for the ones that have an overlap, and you basically multiply the scores together, and the
00:53:04.120 | ones with the highest values will then find the most relevant document. This is insofar interesting or
00:53:12.040 | nice because it's a bit easier to interpret. It's not just, like, long array of floating point values.
00:53:17.880 | Sometimes these don't make sense. The main downside of this, though, is that it gets pretty expensive at
00:53:25.640 | query time. Because you store a ton of different tokens here for this. When you retrieve it, the search query will
00:53:33.560 | generate a similar long list of terms. And if you have a large enough text body, a query might hit a very
00:53:41.960 | large percentage of your entire stored documents with these OR matches. Because, basically, these are just a lot of ORs that you combine, calculate the score, and then return the most or the highest ranking results.
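(A sketch of asking the sparse endpoint for its expansion, assuming an ELSER endpoint registered as my-elser; the response maps tokens to weights:)

```
POST _inference/sparse_embedding/my-elser
{
  "input": "These are not the droids you are looking for"
}
```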
00:53:54.840 | So, it's an interesting approach. It didn't gain as much traction as dense vector models, but it can be, as a first step or an easy and interpretable step, it can be a good starting point to dive into the details here.
00:54:08.280 | So, "these are not the droids you're looking for" is basically represented by this embedding here.
00:54:22.520 | So, it's like this entire list of terms with this, yeah, with this relevancy, basically. This is the representation of this string. And then, when I search for something, I will generate the same
00:54:36.280 | list and then I basically try to match the two together. Like for what has the most or the highest matches here.
00:54:44.920 | Make sense?
00:54:48.280 | Yes, we'll do that in a second.
00:54:57.720 | I will create a new index. This one keeps the configuration from before, but I'm adding this semantic text for the sparse model and the dense model.
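(A sketch of that index; the semantic_text fields reference the inference endpoints, copy_to feeds them from the original quote field, and the analyzer settings from before are omitted for brevity. All names are placeholders:)

```
PUT user_myhandle-semantic
{
  "mappings": {
    "properties": {
      "quote": {
        "type": "text",
        "copy_to": ["quote_sparse", "quote_dense"]
      },
      "quote_sparse": {
        "type": "semantic_text",
        "inference_id": "my-elser"
      },
      "quote_dense": {
        "type": "semantic_text",
        "inference_id": "my-openai-embedding"
      }
    }
  }
}
```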
00:55:21.000 | So, I've created this one. And now I'll just put three documents. I have my other index.
00:55:26.600 | As you can see here, it says three documents were moved over.
00:55:30.600 | So, we can then start searching here. And if I look at that, the first document is still,
00:55:36.680 | these are not the droids you're looking for. You don't see, like for the, for a keyword search,
00:55:41.480 | you don't see the extracted tokens here. We also don't show you the dense vector representation or the
00:55:47.080 | sparse vector representation. Those are just stored behind the scenes for querying, but there's no real
00:55:53.080 | point in retrieving them because you're not going to do anything with that huge array of dense vectors.
00:55:58.040 | It will just slow down your searches. You can look at the mapping and you can see I'm basically copying my
00:56:06.920 | existing quote field to these other two that I can also search those.
00:56:10.920 | Okay. So, if I look for machine on my original quote, will it find anything?
00:56:21.080 | No, because it only had -- these are not the droids you're looking for. And this is still the keyword search.
00:56:32.040 | It doesn't work, shouldn't work. That's exactly the result that we want out of this here. Now,
00:56:48.040 | if I say answer and I say machine, then it will match here. These are not the droids you're looking for.
00:56:55.400 | And you can see this one matches pretty well, I don't know, at 0.9. But it also has some overlap,
00:57:01.800 | with no I am your father. I mean, it is much lower in terms of relevance. But something had an overlap
00:57:08.360 | here. And only the third document, Obi-Wan never told you what happened to your father. Only that one
00:57:15.320 | is not in our result list at all. But there was something here. I don't know the expansion. We would
00:57:21.000 | need to basically run -- where was it? We would need to run this one here for all the strings and look
00:57:30.840 | then for the expansion of the query. And then there would be some overlap, and that's how we retrieve
00:57:34.680 | that one. Is that a threshold that you have?
00:57:39.400 | You could define a threshold. It will, though, depend -- let's see.
00:57:46.040 | This is not the droids you're looking for. Let's say if I -- if I say -- I'm not sure if this will change anything.
00:57:56.840 | I mean, the relevance here is still -- it's still 10x or so. But, yeah, this one still -- we'll just have a
00:58:12.440 | very low -- it's still -- terms you look for. The score just totally jumps around. It's a bit hard to define the threshold.
00:58:19.320 | Because here you can see, in my previous query, we might have said 0.2 is the cutoff point. But now it's
00:58:28.760 | actually 0.4, even though it's not super relevant. So it might be a bit tricky, or you might need to
00:58:34.920 | have a more dynamic threshold depending on how many terms you're looking for and what is a relevant result.
00:58:40.440 | In the bigger picture, the assumption would be if you have hundreds of thousands or even millions of
00:58:46.920 | documents, you will probably not have the problem that anything that is so remotely connected will
00:58:53.320 | actually be in the top 10 or 20 or whatever list that you want to retrieve. So for larger proper data
00:59:00.360 | sets, this should be less of an issue. With my hello world example of three documents, it can be a bit
00:59:06.520 | misleading. But, yes, you can have a cutoff point if you figure out what for your data set and your
00:59:11.160 | queries is a good cutoff point. You could define the cutoff point.
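(If you do want one, a sketch with min_score on the search request, paired with the semantic query; the 0.4 here is just an example value, not a recommendation:)

```
GET user_myhandle-semantic/_search
{
  "min_score": 0.4,
  "query": {
    "semantic": {
      "field": "quote_sparse",
      "query": "machine"
    }
  }
}
```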
00:59:14.360 | No, sorry, you have three documents. How come it's only showing two? Is it because of --
00:59:18.280 | So the query gets expanded into, I don't know, those 100 tokens or whatever. And then for those two,
00:59:25.640 | there is some overlap, but the third one just didn't have any overlap. But I -- so we -- okay, we can do that.
00:59:33.400 | It's just a bit tricky to figure out that the term that has the overlap. So we will need to take this
00:59:40.280 | one, machine -- no, I am your father. Let's take this one. What you need to do is to figure that one
00:59:48.760 | out. I don't know, actually, we should be able -- let me see. Let's see.
01:00:09.320 | This is a pretty long output. Somewhere I was actually hoping that it would show me the term that has
01:00:20.040 | matched here. Okay. I see something -- okay, there is something puppet that seems to be the overlap.
01:00:28.360 | How much sense that term expansion for the stored text and the query text makes is a bit of a different
01:00:35.880 | discussion. But in here with that explained true, you can actually see how it matched and what happened
01:00:41.880 | behind the scenes. If you have any really hard or weird queries or something that is hard to explain,
01:00:46.760 | to debug that. But the third one didn't match. Now, if I take the dense vector model with OpenAI and I
01:00:54.280 | Now, if I take the dense vector model with OpenAI and I search for machine, how many results do you expect to get back from this one? 0, 1, 2, 3? Yes, 3. Why 3?
01:01:08.120 | Yes, because there's always some match. That is the other -- or let me run the query first.
01:01:19.240 | These are not the droids you're looking for. This one is the first one. I don't think that this model
01:01:23.880 | is generally great because here the results are super close. It is -- I mean, the droids with the
01:01:29.320 | machines, that is the first one. But the score is super close to the second one, which is no, I am
01:01:34.520 | your father, which feels pretty unrelated. And Obi-Wan never told you what happened to your father. Even
01:01:39.320 | that one is still with a reasonably close score. But why do we have those? Because if we ask
01:01:49.080 | what the relevance is -- I mean, it's further away, but there's always kind of like
01:01:55.960 | some angle to it, depending on the similarity calculation that
01:02:01.800 | you do, so it's still always somewhat related. There is no easy way to say something is totally unrelated.
01:02:07.080 | That is, by the way, one good thing about keyword search where it was relatively easy to have a cutoff
01:02:14.520 | point of things that are totally not relevant, where you're not going to confuse your users.
01:02:18.600 | Whereas here, if you don't have great matches, you might get almost -- it's not random, but what you
01:02:25.000 | return potentially looks very unrelated to your end users,
01:02:29.240 | just because it's very hard to draw that line. Yes?
01:02:31.000 | Is it fair to say, then, that the OpenAI embedding search is worse for this kind of toy example,
01:02:38.200 | because the magnitude of difference is --
01:02:39.880 | I'm careful with worse, because it's really a hello world example, so I don't take this as a
01:02:47.080 | quality measurement in any way. I -- yeah, I mean, the OpenAI model with 128 dimensions is very few
01:02:54.520 | dimensions. I think it will probably be cheap, but not give you great results necessarily. But don't use
01:02:59.880 | this as a benchmark. I think it's just a good way to see that this is now much harder, because now you
01:03:07.400 | need to pick the right machine learning model to actually figure out what is a good match. With
01:03:12.760 | keyword-based search, it was a bit of a different story. There you need to pay more attention to,
01:03:16.680 | like, how do I tokenize, and do I have the right language, and do I do stemming or not stemming.
01:03:22.120 | But most of that work is relatively, I want to say, almost algorithmic, and then you can figure that
01:03:28.120 | out, and you configure it, and then it's very predictable at query time. Whereas with the dense
01:03:33.640 | vector representation, you really need to evaluate for the queries that you run and the data that you
01:03:39.320 | have, like, is that relevant, and is this an improvement or not? It's very easy to get going and
01:03:45.320 | just throw a dense vector model together, and you will match -- you will always match something
01:03:51.080 | that might be an advantage over the lexical search where you don't have any matches, which
01:03:55.160 | sometimes is the other problem that nothing comes back and you would want to have at least some
01:03:59.480 | results. Here it might just be unrelated. So that can be tricky. That you want to have some results is,
01:04:09.160 | by the way, a funny story that a European e-commerce store once told me. They said they accidentally
01:04:15.880 | deleted, I think, two-thirds of their data that they had for the products that you could buy.
01:04:21.320 | And then I asked them, like, okay, so how much revenue did you lose because of that? And they
01:04:26.120 | said, basically nothing, because as long as you showed some somewhat relevant results quickly enough,
01:04:32.920 | people would still buy that. So only if you have no results, that's probably the worst.
01:04:37.080 | So for an e-commerce store, you might want to show stuff a bit further out, because people might still
01:04:42.840 | buy it. But it really depends on the -- I'm coming to you in a moment -- it really depends on your use
01:04:47.960 | case. E-commerce is kind of like one extreme where you want to show always something for people to buy.
01:04:53.800 | If you have a database of legal cases or something like that, you probably don't want that approach,
01:05:00.040 | because that will go horribly wrong. So it is very domain specific. That's, I think,
01:05:06.200 | also the good thing about search, because it keeps a lot of people employed, because it's not an easy
01:05:10.280 | problem. It's almost job security, because it depends so much on: this is the data that you have,
01:05:16.840 | and this is the query that people run, and this is the expectation of what will happen, and this
01:05:20.680 | is, for this domain, the right behavior. So there's no easy right or wrong with a checkbox. And the
01:05:27.720 | other thing is you might make -- if you tune it, you might make it better for one case, but worse for 20
01:05:32.440 | others. That's why a robust evaluation set is normally very important, though very rare. A lot of people
01:05:39.960 | YOLO it, and you will see that in the results. And for the e-commerce store, it probably works well enough.
01:05:44.520 | Sorry, you had a question. Can I limit the semantic enrichment to a subset of my index based off
01:05:51.320 | the properties of the document? So if I have a very large shared index with a lot of customers,
01:05:56.280 | and I want to enable AI for a subset of the index, can I say, hey, only do the semantic enrichment if the
01:06:03.080 | document has this property where maybe it's like an AI customer? Yeah, so the way we would do it in
01:06:09.800 | our product is that you would probably have two different indices with different mappings.
01:06:15.560 | Yeah, but then it's not so fun, like the customer upgrades and I have to migrate them to the new
01:06:21.160 | index. Dave, please. Yeah, so if you, for example, have an index in Elasticsearch, you can think of it
01:06:31.240 | almost like a sparse table, right? So there's no penalty for having a field that is not populated.
01:06:36.920 | So either in your application or an ingest processor, you could have an inference statement and say,
01:06:42.200 | Yeah, that's how we do it now. But we'd only move it over
01:06:45.000 | with this automatic way where you kind of turn it off.
01:06:47.800 | No, the problem is the data structure -- like, whether the field is there. So the data structure that we build
01:06:52.440 | in the background is called HNSW. And either we build that data structure or we don't build it.
01:06:57.560 | Yeah, so if you had, you know, 10 billion entries in your vector index, your index is set up for vectors,
01:07:07.560 | right? And you just don't populate the thing that is either putting in a dense vector or triggering the
01:07:14.920 | inference to create a dense vector to put into there, then it's just going to be a, you know,
01:07:20.120 | a bunch of -- the index is just a bunch of pointers, and none of them head towards the HNSW, and it won't
01:07:24.600 | show up in search results. The penalty is nothing, right? But you're going to have
01:07:30.120 | to manage what does or does not create the vector. You could do that in an ingest processor by just
01:07:36.680 | saying, Hey, we're going to use the copy command to have two copies of the text, one that's meant for
01:07:42.120 | non-vector indexing, one that's meant for actual vector indexing. You'd have to manage that with some
01:07:46.920 | tricky, complex AI technology called if-then-else, right? Somewhere inside of your ingesting pipeline,
01:07:54.120 | then it would work just fine. Yeah.
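A minimal sketch of that if-then-else as an ingest pipeline, with the pipeline name, condition field, endpoint id, and field names all as hypothetical placeholders; every ingest processor accepts an if condition, so the inference processor only runs for documents that opt in:

    PUT _ingest/pipeline/maybe-embed
    {
      "processors": [
        {
          "inference": {
            "if": "ctx.ai_enabled == true",
            "model_id": "my-embedding-endpoint",
            "input_output": [
              { "input_field": "body", "output_field": "body_embedding" }
            ]
          }
        }
      ]
    }

Documents without ai_enabled set to true skip the processor entirely, so no vector is created and nothing is added to the HNSW structure for them.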
01:08:02.440 | One more question. When we did HNSW last week, we found it was extremely slow at write times, and the community suggested that we freeze our index
01:08:08.920 | if we were going to use HNSW. Force merge or? Yeah, I think just freeze writes. They said,
01:08:17.160 | they said build the index and freeze it, otherwise you'll put a ton of load on the computer. I mean, yes.
01:08:21.960 | What we found is that some of the defaults -- settings that have kind of been around in Elasticsearch for 10
01:08:26.840 | years, like the merge scheduler -- really optimize for keyword search, and for high-update
01:08:33.400 | workloads on HNSW, we've got some suggestions. They take a little bit of parameter tuning to go
01:08:39.960 | and find something right for your IOPS and for your actual update workload. So sometimes it's about the
01:08:44.920 | merge scheduler and not doing kind of an inefficient HNSW build when it's not important for the use case.
01:08:51.480 | Okay. The other thing I'd say is that sometimes friends don't let friends run Elasticsearch 8.11:
01:08:58.920 | upgrade, upgrade, upgrade. They put a lot of optimization work in here. It should be simpler.
01:09:03.240 | That's great.
01:09:06.120 | The reason why that is, it's the merging -- because you have the immutable segment structure in
01:09:14.040 | Elasticsearch. And HNSW you cannot easily merge; you basically need to rebuild it. The one trick -- I forgot which
01:09:21.320 | version it was. I'm not sure, Dave, if you remember. I think it was even before 8.11. But basically,
01:09:26.120 | if we do a merge, we would take the largest segment with the non-deleted documents and basically plop the
01:09:31.880 | new documents on top of it rather than starting from scratch from two HNSW data structures. There's
01:09:37.560 | another optimization somewhere now in 9.0 that will make that a lot faster. So it really depends on
01:09:44.040 | the version that you have. And there are a couple of tricks that you can play. But yeah,
01:09:48.760 | that is one of the downsides of the way immutable segments work and HNSW is built:
01:09:54.840 | you can't merge it as easily together as other data structures, because you really
01:09:59.480 | need to rebuild the HNSW data structure, or take the largest one and then plop the other one in.
01:10:04.520 | Okay. Some of these things we just, like -- we fixed it in the next version of Lucene, so I want you to --
01:10:09.320 | We found, like, HNSW was too slow, and then it broke our CPU, and then we moved off it,
01:10:15.960 | and now we're .
01:10:17.320 | Stay tuned. Yeah.
01:10:18.680 | Yeah. Might have been a while ago. Yeah?
01:10:21.640 | So for like traditional document search, you know what I'm saying, like, hey,
01:10:25.960 | please find me a document that contains my search query, right? For R in the context of RAG,
01:10:32.200 | it might be something more like, hey, come up with a fun plan for my weekend, right? And then the
01:10:37.480 | documents that we want to find don't necessarily look like the search query, right? Yeah.
01:10:41.480 | So like one approach to that is you just give -- it's an agent and you give it a search tool,
01:10:46.840 | and it searches, right? So like I'm just curious what you -- how do you think about that in general?
01:10:51.160 | Yeah. I feel like RAG has been very heavily abused. Or, like, the mental model I
01:10:56.280 | think started off as: you do retrieval and then you do the generation. But you could do the
01:11:00.360 | generation earlier on as well, where you do query rewriting and expansion. So my favorite
01:11:07.560 | example for that is you're looking for a recipe. You don't need to have the LLM regenerate the recipe.
01:11:16.200 | You just want to find the recipe. But maybe you have a scenario where you forgot what the thing is
01:11:20.600 | called that you want to cook. And then you could use the LLM, for example, to tell you what you're
01:11:24.600 | looking for. Like you say, like, oh, I'm looking for this Italian dish that has like these layers of
01:11:31.640 | pasta and then some meat in between. And then the LLM says, oh, you're looking for lasagna. And then
01:11:36.280 | you basically do the generation first or a query rewriting and then search and then get the results.
01:11:42.440 | as a very explicit example here. Your example would look very different and probably smarter than my
01:11:49.480 | example. But query rewriting is one thing. There's also this concept of HyDE, where your documents and
01:11:58.840 | your queries often look very different, and you use an LLM to generate something from the query that
01:12:04.200 | looks closer to the documents that you have. And then you match the documents together because they're
01:12:09.560 | more similar in structure. So there are all kinds of interesting things that you can do. Like I said
01:12:15.640 | earlier, "it depends" is becoming a bigger and bigger factor. But, yeah, your use case
01:12:21.080 | might be, yeah, maybe a multi-step retrieval where you figure out -- like, I don't know,
01:12:29.080 | I know the example from an e-commerce store where it's like, I'm going to a theme party from the 1920s,
01:12:36.520 | give me some suggestions. And then the LLM will need to figure out, like, what am I searching for?
01:12:40.680 | And then it can retrieve the right items and rewrite the query and then actually give you proper
01:12:44.920 | suggestions. But it's not just running a query anymore. Yeah?
01:12:50.600 | Yeah?
01:12:51.100 | We use instruction-tuned embedding models.
01:12:54.100 | And it's kind of like .
01:12:59.000 | Along with your query, you can say, like, this is the kind of thing I am doing.
01:13:04.000 | And you can say .
01:13:07.000 | Like, you can have separate document and query embeddings.
01:13:13.500 | We try to embed queries so they look like the documents we're going to find, like the text.
01:13:18.500 | You can have an instruction when you embed, so that instead of saying, like, I'm actually
01:13:23.340 | creating documents, you have to .
01:13:26.400 | And then, I don't know, why do you think you have a problem?
01:13:30.400 | Yeah?
01:13:37.900 | How should we be thinking about the number of dimensions in the embedding model?
01:13:41.400 | Is, like, a 512-dimensional model necessarily better than a 128?
01:13:46.400 | Definitely not necessarily.
01:13:48.400 | Yeah.
01:13:49.400 | It's an interesting question.
01:13:51.300 | That feels almost like a blast from the past.
01:13:53.240 | I remember, like, two or three years ago, there was this big debate of, like, how many dimensions
01:13:57.800 | does each data store support and, like, how many dimensions should you have?
01:14:02.000 | And at first, it looked like, oh, more dimensions is always better.
01:14:04.300 | But then it turned out more dimensions are very expensive.
01:14:07.240 | So, it really depends on the model and what you're trying to solve.
01:14:10.300 | Like, if you can get away with fewer dimensions, it's potentially much cheaper and faster.
01:14:15.240 | But I don't think there is a hard rule, like, maybe the model with more dimensions can express
01:14:21.840 | more, because it just has more capacity, and then that will come in handy.
01:14:26.580 | But maybe it's not necessary for a specific use case and then you're just wasting a lot
01:14:29.740 | of resources.
01:14:30.740 | I don't think there is an easy answer to say, like, yes, for this use case, you need at least
01:14:36.220 | 4,000 dimensions.
01:14:39.660 | It will depend.
01:14:40.660 | But it depends on the model, how many dimensions it will output, and then maybe you have some
01:14:44.880 | quantization in the background to reduce that again or reduce either the number of dimensions
01:14:49.200 | or the fidelity per dimension.
01:14:52.940 | So there are a lot of different tradeoffs in that performance consideration.
01:14:55.980 | But it will mostly rely on, like, how good does the model work for the use case that you're
01:15:01.260 | trying to do?
01:15:22.260 | Yeah.
01:15:23.260 | So that is one area.
01:15:27.620 | So I want to say historically what you would do is you would have a golden data set and then
01:15:33.760 | you would know what people are searching for and then you would have human experts who rate
01:15:37.260 | your queries.
01:15:38.380 | And then you run different queries against it and then you see, like, is it getting better
01:15:42.080 | or is it getting worse?
01:15:44.500 | Now LLMs open a new opportunity where you might have human experts in the loop to help them out
01:15:50.360 | a bit, but they might be actually good at evaluating the results.
01:15:55.200 | So, almost nobody has, like, the golden data set to test against.
01:16:00.420 | But you can either look at the behavior of your end users and try to infer something
01:16:05.600 | from that, or you have an LLM that evaluates what you have, or you have a human together
01:16:12.060 | with an LLM evaluate the results.
01:16:14.900 | So you have various tools, but, again, it depends -- it's really not an easy question
01:16:22.780 | of saying, like, this is the right thing.
01:16:25.040 | Maybe you can get away with something simple.
01:16:26.680 | So the classic approach I want to say is, like, you looked at the clickstream of how your users
01:16:31.520 | behaved, and then you saw, like, if they clicked on the first or up to the third result,
01:16:36.200 | the result was potentially good -- if they didn't just go back and then click on something else,
01:16:39.920 | but stuck on the page.
01:16:42.220 | If they don't click on anything and just leave, it might be very bad.
01:16:45.680 | If they go to the second or third page, it might also not be great.
01:16:48.720 | So there are some quality signals that you can infer from that or you really look into
01:16:52.540 | the quality aspect and try to evaluate, like, what people were doing and how it behaves.
01:16:58.420 | But you can make this from relatively simple to pretty complicated.
01:17:14.100 | What else?
01:17:17.100 | Obviously, if I search with the sparse query expansion, it will find my father example.
01:17:27.720 | And this one here will still, again, match my droids, pretty much like the OpenAI example.
01:17:36.720 | One thing that I wanted to show you is what is also happening behind the scenes here. This is
01:17:41.100 | a very long segment; like, it's a lot of information with different speakers.
01:17:47.600 | What I have created here, though -- we have created multiple chunks behind the scenes.
01:17:54.640 | And if I search for that, I think looking for murder in the Skywalker saga works pretty well
01:18:02.820 | here.
01:18:03.820 | It finds the document that I have retrieved, but it can also highlight -- so here I say,
01:18:09.760 | show me the fragment that actually matched best here.
01:18:13.760 | And if I look for the literal word murder here, it doesn't appear anywhere.
01:18:18.700 | But in this highlighted segment here, the term that it found was kill, and
01:18:24.800 | it was that one that was expanded here.
01:18:28.380 | So here I have broken up my long text field into multiple chunks and there are multiple
01:18:33.320 | strategies.
01:18:34.320 | You can do that by page, by paragraph, by sentence.
01:18:39.120 | You could do it overlapping or not overlapping.
01:18:43.280 | Which strategy is best will depend on how you want to retrieve and what works best for your use case.
01:18:48.540 | But you want to kind of like reduce the context per element that you're matching because there's
01:18:53.140 | only so much context that a dense vector representation can hold.
01:18:57.400 | So you want to chunk that up. Especially if you have, like, a full book, you want to break
01:19:01.120 | it up into at least individual pages.
01:19:04.120 | And then find the relevant part where the match is.
01:19:06.960 | And then you can actually link back to that.
01:19:09.720 | The point in this query here is also to show you, I didn't define any chunks.
01:19:14.740 | I didn't say like, okay, send this representation of a dense vector there and then when it comes
01:19:20.640 | back, interpret again.
01:19:22.900 | This is all happening behind the scenes just to make this easier.
01:19:25.500 | So the entire behavior here is still very similar to the keyword matching even though there's
01:19:30.220 | a lot more magic happening behind the scenes.
01:19:34.220 | Just to keep that very simple.
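The behavior in this demo corresponds to the semantic_text field type in recent Elasticsearch versions; a minimal sketch, with the index name, field name, and inference endpoint as hypothetical placeholders (if you omit inference_id, a preconfigured default endpoint is used):

    PUT /transcripts
    {
      "mappings": {
        "properties": {
          "content": {
            "type": "semantic_text",
            "inference_id": "my-elser-endpoint"
          }
        }
      }
    }

    GET /transcripts/_search
    {
      "query": {
        "semantic": { "field": "content", "query": "murder in the Skywalker saga" }
      },
      "highlight": {
        "fields": { "content": { "number_of_fragments": 1 } }
      }
    }

Chunking at index time and picking the best-matching fragment for highlighting at query time both happen behind the scenes, which is the simplicity being described here.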
01:19:37.220 | So let's see.
01:19:38.940 | Okay.
01:19:39.940 | How does everybody feel about long adjacent queries?
01:19:45.940 | We'll see about alternatives and maybe we can make this a bit simpler again.
01:19:50.940 | But let me show you one more way of looking at it.
01:19:55.940 | We call them retrievers.
01:19:57.940 | They're a more powerful mechanism to actually combine different types of searches.
01:20:05.280 | Combining different types of searches -- let me get to my slides, actually.
01:20:08.660 | When we talk about combining searches and how this all plays together.
01:20:12.660 | This is kind of my little interactive map of what you do when you do retrieval or what your
01:20:19.660 | searches do.
01:20:20.660 | We started here in the lexical keyword search and then we run the match query and we're matching
01:20:28.380 | these strings.
01:20:31.180 | This, often combined with some rank features, is what we call full text search.
01:20:37.820 | The rank features could be either you extract a specific signal or it could also be something,
01:20:42.380 | however you influence that ranking, it could be the margin on the product, how many people
01:20:47.900 | bought something, what the rating is.
01:20:49.900 | There are many different signals that you could include, not just with the match of the text,
01:20:55.820 | but any other signals that you want to combine for retrieving that.
01:20:59.580 | And then you have full text search as a whole.
01:21:02.460 | On top of that -- I kept it to the side here -- you might have a Boolean filter where you have a hard
01:21:10.620 | include or exclude of certain attributes. This does not contribute to the score; it is just
01:21:16.780 | black and white, included or excluded, whereas the rest here calculates a score for
01:21:22.940 | how well you match.
01:21:24.220 | And then this was kind of like the algorithmic side.
01:21:29.660 | And then we have the machine learning, the learned side, or semantic search, where you have a model
01:21:35.100 | behind the scenes, split into the dense vector embeddings and the sparse vector embeddings --
01:21:41.900 | vector search and learned sparse retrieval, I think those are the two common terms.
01:21:47.180 | And the interesting thing is, all of the others, including the learned sparse one, are sparse
01:21:56.860 | representations in the background, and only this one here is the dense vector representation.
01:22:03.180 | And then when you combine any grouping down here into one search, this is what we
01:22:13.420 | would call hybrid search. Even though there can be big discussions about what exactly hybrid search is
01:22:18.460 | or not, I will definitely stick to the definition that as soon as you combine more than one type of
01:22:23.900 | search -- it could be sparse and dense, or it could be dense and keyword, or maybe you combine two
01:22:30.140 | dense vector searches -- it is hybrid search, because you have multiple approaches. And then you can either
01:22:37.500 | boost them together, or you could do re-ranking, which is becoming more and more popular. One thing that we
01:22:43.020 | lean heavily into is RRF, which is reciprocal rank fusion, which doesn't rely on the score but on the
01:22:51.180 | position of each document in each search mechanism. So it basically says, like, the lexical search had this
01:22:57.340 | document at position four and the dense vector search had it at position two, and then it kind of evens out the
01:23:03.740 | positions and gives you an overall position by blending them together rather than looking at the individual
01:23:09.020 | scores, because they might be totally different. So this is kind of like the information retrieval
01:23:15.420 | map overall. And, okay, we didn't do a lot of filters, but I think filters are intuitively
01:23:21.180 | relatively clear that you just say like I'm only interested in users with this ID or whatever other
01:23:26.220 | criteria. It could be a geo-based filter like only things within 10 kilometers or only products that came
01:23:31.740 | out in the last year. Like a hard yes or no. All the others will give you a value for the relevance and then you
01:23:42.140 | can blend that potentially together to give you the overall results. That is kind of like the
01:23:47.180 | total map of search. Can you give an example of the signal one too?
01:23:54.300 | Yeah, for signal, so we have our own data structure for these rank features. It could be, for example,
01:24:01.180 | the rating of a book and then you combine the keyword match for, I don't know, you search for murder
01:24:15.340 | mysteries but then another feature would be how well they are ranked and then you would see that. Or it
01:24:22.700 | could be your margin on the product or the stock you have available and you would want to show the product
01:24:28.060 | where you have more in stock. Or it might even be a simple like a click stream like what have people
01:24:34.220 | clicked before. There are a lot of different signals that you could include in all of this searching then.
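A minimal sketch of such a signal using the rank_feature field type, with the index and field names as hypothetical placeholders; the rank_feature query in a should clause nudges the score without being required to match:

    PUT /books
    {
      "mappings": {
        "properties": {
          "title":  { "type": "text" },
          "rating": { "type": "rank_feature" }
        }
      }
    }

    GET /books/_search
    {
      "query": {
        "bool": {
          "must":   [ { "match": { "title": "murder mystery" } } ],
          "should": [ { "rank_feature": { "field": "rating" } } ]
        }
      }
    }

Margin, stock levels, or click counts would work the same way: index them as rank features and let them contribute to the blended score.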
01:24:39.420 | Any other questions? Are everybody good for now? Yeah?
01:24:47.900 | You would have to normalize them. Depending on the comparison that you do for dense vectors, it might be
01:25:13.500 | between 0 and 1. But you saw that for the keyword search, also depending on how many words I was
01:25:19.340 | searching for, it might be a much higher value. There is no real ceiling for that. Or you could add a
01:25:26.060 | boost and say, like, this field is 20 times more important than this other field. There is no real
01:25:32.220 | max value that you would have here. You could normalize the score and then basically say, like,
01:25:37.260 | I'll take the highest value in this sub query as 100% and then reduce everything down by that factor.
01:25:44.060 | And then I combine them. Maybe that works well. RRF is a very simple paper, I think it's like two pages.
01:25:51.020 | And it really just takes the different positions. I think it's one divided by 60 -- a constant
01:25:57.340 | they figured out made sense -- plus the position. And then you add those values for each
01:26:03.580 | document together, and that value gives you the overall position. It really just -- it doesn't look
01:26:10.620 | at the score anymore, but it blends the different positions together, like how they are interleaving
01:26:15.100 | and what should be first or second.
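Written out, the formula from the paper is, for a document $d$ over the set of result lists $R$:

    $$\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + \mathrm{rank}_r(d)}, \qquad k = 60$$

So the document from the example, at position 4 in the lexical list and position 2 in the dense list, gets 1/(60+4) + 1/(60+2), roughly 0.0156 + 0.0161 = 0.0318, and the final order simply sorts by that blended value.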
01:26:19.580 | Yeah?
01:26:20.580 | So, just for vector search, why should I use Elasticsearch over pgvector or something like that?
01:26:27.580 | I'm sure that .
01:26:30.580 | My data is already in the database.
01:26:34.580 | We sync it, probably via CDC, change data capture.
01:26:39.580 | So, there's one extra .
01:26:42.580 | Like, pgvector, like it's right there.
01:26:46.580 | I was just curious, what sort of systems, like, what have you seen in production?
01:26:59.580 | I mean, PG vector will always be there because, like, if you are already using Postgres, it's
01:27:04.580 | very easy to add.
01:27:05.580 | I think then the question is, like, does it have all the features that you need?
01:27:09.580 | For example, Postgres doesn't even do BM25.
01:27:13.580 | It has some matching, but it's not the full BM25 algorithm because I don't think it keeps
01:27:17.720 | all the statistics.
01:27:19.760 | It will be a question of, like, scaling out Postgres can be a problem and then just, like, the breadth
01:27:24.820 | of all the search features.
01:27:27.180 | If you only need vector search, I think my or our default question back to that is, like,
01:27:33.740 | do you really only need vector search?
01:27:36.920 | Maybe for your use case, but for many use cases, you probably need hybrid search.
01:27:41.540 | One area, for example, where vector search will not do great is, like, if somebody searches
01:27:46.380 | for, like, a brand.
01:27:48.720 | Because there is no easy representation in most models for the specific brand and it will be
01:27:53.740 | very hard to beat keyword search.
01:27:55.160 | So it will be very hard -- and also your users will be very angry when they know you have
01:27:59.300 | this word somewhere in your documents or in your data set, but you don't give them the result
01:28:03.500 | back.
01:28:04.500 | So there are many scenarios where you probably want hybrid search. I feel like -- we
01:28:10.580 | started two years ago with just vector search, but I feel like the overall
01:28:15.100 | trend is moving more to hybrid search, because you probably want some sort of keyword search and
01:28:21.260 | then you want to have that combined, probably with some model for the added benefit and extra
01:28:27.180 | context, but you often want the combination.
01:28:30.880 | It might also depend a bit on, like, the types of queries that your users run.
01:28:34.900 | So if your users run single-word queries, like I've done in my examples, that's often not
01:28:39.860 | really ideal for vector search, because, like any machine learning model,
01:28:44.440 | it lives off extra context.
01:28:48.080 | So depending on that, I've seen some people build searches where it's like if you search
01:28:52.740 | for one or two words, they do keyword search, but if you search for more, they might switch
01:28:56.620 | over to vector search.
01:28:57.620 | So it depends a bit on the context what works.
01:29:00.820 | If you really only need vector search and PG vector is small enough to do all of that, and
01:29:08.440 | Postgres is your primary data store, then that's probably where you will do well.
01:29:12.900 | But there are plenty of scenarios where not all of those boxes
01:29:16.800 | will necessarily be ticked.
01:29:17.800 | My last question.
01:29:18.800 | Specifically for code. Let's say you have a file, and it's like -- you've got a Git repository,
01:29:25.680 | there's like thousands of commits, right?
01:29:27.680 | And so you have two options: either you embed at a file level, or you embed at a chunk level,
01:29:34.140 | right?
01:29:35.140 | But I don't want to pay the penalty across thousands of unchanged -- like, the file hasn't changed for
01:29:38.040 | a thousand commits, but just for this one it has changed, right?
01:29:42.500 | So I want to cut any shadow copies of the same thing. And have you seen, like, what are some
01:29:48.500 | tips and tricks that people use to not have exploding storage costs? And, like, this might not be
01:29:58.960 | an Elastic problem, but a general vector problem: like, how do I just pay the penalty once of storing
01:30:13.420 | the same embedding, and only when it changes, I re-embed and then, uh, in that sense.
01:30:22.880 | But so, you would create, so it's one dataset basically with thousands of files that all are
01:30:28.920 | chunked together, and so one change would invalidate all of them, or --
01:30:32.880 | No, no.
01:30:33.880 | So, think: I have a repository, right?
01:30:34.880 | It has like 5,000 commits, and there's one file, . It didn't change for 4,999 commits, but on the 5,000th commit, it did change, right?
01:30:47.880 | And so, if it hasn't changed, I want to only point the file to the one existing embedding, right?
01:30:53.880 | But only when the inserted file contents change, I need -- I want to re-ingest, right?
01:31:00.880 | But for those 4,999 times, I don't want to store it again -- like, this hash has the same embedding, same embedding.
01:31:09.880 | Ah, so --
01:31:10.880 | I'm not sure if this is a problem.
01:31:11.880 | You have seen with some of the customers, but --
01:31:13.880 | Maybe so that --
01:31:15.880 | I think the way we might solve it is that if you create the hash of the file and use that
01:31:22.880 | as the ID, and you only use the operation create, and it would reject any duplicate writes,
01:31:29.880 | you would at least not ingest and then create the vector representation again.
01:31:34.880 | You will still send it over again, and it would need to get rejected.
01:31:39.880 | If a doc ID would have to be the hash of the file.
01:31:42.880 | If you have that doc ID, and then you need to set the operation to just create and not update
01:31:47.880 | or upset, then it would just be rejected and you would only write it once.
01:31:52.880 | I'm not sure if that is a great use case or if you might want to keep, like, I don't know,
01:31:57.880 | an outside cache of, like, all the hashes that you've already had and deduplicate it there,
01:32:01.880 | but that would be the elasticsearch solution of, like, using the hash as the ID and then just
01:32:05.880 | writing to that.
01:32:06.880 | Okay.
01:32:07.880 | And also with create on that, yeah.
01:32:08.880 | That is, I think, the intuitive or most native approach that we could offer for that.
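A minimal sketch of that pattern, with the index name and document body as hypothetical placeholders; the _create endpoint (the create op type) returns a 409 conflict instead of overwriting when the ID already exists:

    PUT /files/_create/<sha256-of-file-contents>
    {
      "path": "src/example.py",
      "content": "...file contents..."
    }

The first write for a given hash succeeds and triggers whatever embedding setup is attached to the index; every identical file afterwards is rejected, so the vector is only computed and stored once.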
01:32:19.880 | Yeah.
01:32:20.880 | I think there was some other question somewhere.
01:32:21.880 | Yeah.
01:32:22.880 | Yeah.
01:32:23.880 | I just wanted to add on to the Postgres question a minute ago.
01:32:26.880 | Postgres does have a native .
01:32:29.880 | It's pretty good.
01:32:30.880 | It's a .
01:32:33.880 | There has also been recently a lot of work.
01:32:36.880 | So there is a .
01:32:43.880 | But from what I remember, the default Postgres full-text search does not do full BM25,
01:32:53.880 | but it only does -- it doesn't have all the statistics, I think, from what I remember.
01:32:56.880 | Right.
01:32:57.880 | Yeah.
01:32:58.880 | Any other questions?
01:33:02.880 | Joe, please, go ahead.
01:33:05.880 | Do you plan to cover more on the receiver thing, like the receiver concept?
01:33:12.880 | To show now?
01:33:13.880 | Yeah.
01:33:14.880 | I mean, how much JSON do you want to see?
01:33:15.880 | Well, okay.
01:33:16.880 | I'm practically interested in -- I mean, you're modeling -- so you have multiple -- I love the
01:33:22.880 | retriever concept.
01:33:23.880 | It's a great concept.
01:33:24.880 | I think you have multiple different candidates.
01:33:27.880 | What kind of flexibility do you have on that?
01:33:30.880 | Because I'm not interested in using this.
01:33:36.880 | I'm interested in re-scoring them.
01:33:39.880 | Not necessarily looking at any, but I'm modeling the effect.
01:33:41.880 | You're just receiving a bunch of stuff.
01:33:44.880 | You're just retrieving a bunch of stuff.
01:33:45.880 | And then you're actually elevating your .
01:33:49.880 | I'm sure there's .
01:33:52.880 | Maybe for -- before we dive into that, for everybody else, like, re-scoring is like,
01:33:57.880 | let's say we have a million documents, and then we have one cheaper way of retrieving them,
01:34:01.880 | and we retrieve the top, I don't know, 1,000 candidates, and then we have a more expensive
01:34:06.880 | way, but higher quality way of actually re-scoring them, then we will run this more expensive re-scoring
01:34:12.880 | on just the top 1,000 to get our ultimate list of results.
01:34:18.880 | But the re-scoring algorithm would be too expensive to run across,
01:34:22.880 | like, a million documents; that's why you don't want to do that.
01:34:25.880 | That's why you have a two-step process.
01:34:27.880 | And that's why you might want to have the re-scoring.
01:34:30.880 | So, yes -- in Elasticsearch, you can now do re-scoring, because it becomes
01:34:35.880 | more and more popular.
01:34:37.880 | I don't have a full example there, but we do have a re-scoring model built
01:34:43.880 | in by default now, let me pull that up.
01:34:50.880 | So, we have currently the version 1 re-ranking, but we have a built-in re-ranking model now
01:34:57.880 | as well.
01:34:58.880 | So, for one of the tasks that we can do, you can see here we have the other tasks, for example,
01:35:05.880 | the dense text embedding, now we have a re-ranking task that you can also call.
01:35:11.880 | Your question?
01:35:12.880 | How do you express that?
01:35:14.880 | Okay.
01:35:15.880 | It might be too easy.
01:35:26.880 | No, re-ranking is good.
01:35:28.880 | Let me -- somehow my keyboard binding is broken.
01:35:33.880 | This is very annoying.
01:35:37.880 | Okay.
01:35:40.880 | We re-rank results.
01:35:42.880 | Let me see.
01:35:43.880 | Somewhere here there should be -- so, there's learning to rank, but it should not be the only one.
01:35:54.880 | This is what we want.
01:35:55.880 | Okay.
01:35:56.880 | We have our re-ranking model.
01:36:01.880 | Unless, Dave, you know from the top of your head where we have the right docs for this.
01:36:11.880 | The organization of our docs can't help you there -- but retrievers --
01:36:11.880 | Yeah, retrievers could find them --
01:36:13.880 | Starting in 8.16 or 8.17, the retrievers API added a specific parent level retriever that
01:36:21.880 | you -- it's like nested around outside the retrievers that go inside this.
01:36:26.880 | And it's specifically called the text re-ranking retriever.
01:36:31.880 | And so if you have a cross encoder, say from Hugging Face, or you're using the Elastic re-ranker --
01:36:36.880 | using one of these things that complies with that kind of inference task of taking a bunch of things and re-ranking them against the query, right?
01:36:46.880 | Taking the full token stream, right?
01:36:48.880 | The full token stream with the context, and doing that.
01:36:51.880 | So you can target a parent level text field of the documents that are being retrieved.
01:36:57.880 | So it works really well for the one document or chunk, kind of re-ranking use case.
01:37:03.880 | I've also seen people just do it outside in the second API call, say if you wanted to do it on a highlighted thing.
01:37:09.880 | Or if you wanted to do re-ranking sub-document chunks, that works pretty well for the API.
01:37:21.880 | But there's a text re-ranker retriever that specifically got added in 8.16 or 8.17.
01:37:29.880 | Yeah, so I think this is a simple example.
01:37:31.880 | Like, we have a standard match.
01:37:33.880 | Like, this will be very cheap.
01:37:35.880 | And then we have the text-similarity re-ranker, which uses our elastic re-ranker.
01:37:41.880 | That falls back to that model behind the scenes.
01:37:44.880 | So you can think about it.
01:37:46.880 | I was a functional programmer, so don't mind the parentheses.
01:37:50.880 | But you would have, like, the text re-ranker retriever.
01:37:54.880 | Inside of that, you have the RRF.
01:37:56.880 | Inside of that, you would have lexical and KNN as peers.
01:38:00.880 | And it works from the inside out.
01:38:02.880 | Hey, do each of those retrieval methodologies.
01:38:05.880 | Do, like, the Venn diagram.
01:38:07.880 | Find the best results.
01:38:08.880 | And then take the full text of those results and run them on re-ranker.
01:38:12.880 | It's almost like a little mini LLM asking: does this actually answer the question?
01:38:16.880 | And then ranking by what comes out.
01:38:18.880 | It's pretty good.
01:38:19.880 | The cool thing about the re-ranker is you can run it on structured lexical retrieval.
01:38:25.880 | You don't have to run it on a vector.
01:38:28.880 | You can run it on anything you want.
01:38:30.880 | So maybe you don't want to pay for vector search on everything.
01:38:32.880 | Or maybe the text is too small for vector search;
01:38:35.880 | you don't need the model to actually lock on to the stuff there.
01:38:39.880 | The re-ranker, when you run it on just kind of actual customer data sets,
01:38:44.880 | they're like, yeah, our evaluation score bumped by 10 points.
01:38:49.880 | Basically for free; it feels like cheating.
01:38:53.880 | Right?
01:38:54.880 | So when you run against, like, a Gemini API,
01:38:57.880 | and you're like, wow, why is this 10 points better than the Amazon one?
01:39:00.880 | It's because they threw on their retriever .
01:39:03.880 | Right?
01:39:04.880 | So there's a lot of black box stuff out there that we're exposing.
01:39:08.880 | So don't be scared if we're telling you how it works inside.
01:39:13.880 | But this is what the leading retrieval technology is doing under the hood
01:39:18.880 | and reselling to you as if, you know, it's all AI.
01:39:23.880 | Right?
01:39:24.880 | Yeah.
01:39:25.880 | Does that answer the question?
01:39:28.880 | Yeah.
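A minimal sketch of the nesting Dave described, with the index, fields, and inference endpoint ids as hypothetical placeholders; a text_similarity_reranker wraps an rrf retriever, which in turn wraps a lexical and a kNN retriever as peers:

    GET /quotes/_search
    {
      "retriever": {
        "text_similarity_reranker": {
          "retriever": {
            "rrf": {
              "retrievers": [
                { "standard": { "query": { "match": { "quote": "droid" } } } },
                { "knn": {
                    "field": "quote_vector",
                    "query_vector_builder": {
                      "text_embedding": { "model_id": "my-embedding-endpoint", "model_text": "droid" }
                    },
                    "k": 10,
                    "num_candidates": 50 } }
              ]
            }
          },
          "field": "quote",
          "inference_id": "my-rerank-endpoint",
          "inference_text": "droid",
          "rank_window_size": 50
        }
      }
    }

It works from the inside out: the two inner retrievers each fetch candidates, RRF blends them, and the re-ranker re-scores the top rank_window_size hits against the query text.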
01:39:29.880 | So my wish list is: I want to be able to do, in the retrievers API, re-ranking on sub-documents.
01:39:39.880 | A lot of my things are about sub-document retrieval.
01:39:41.880 | Right now, I've got to do it outside of the retrievers API, but I'm bending here as a developer.
01:39:50.880 | Yeah.
01:39:51.880 | So just to give you an example -- I don't think I have a re-ranking example here,
01:39:56.880 | but this one uses a classic keyword match retriever.
01:40:03.880 | And then we normalize the score here.
01:40:06.880 | I think somebody else asked about normalizing -- or we had a discussion about the normalizing.
01:40:10.880 | We do a min/max normalization.
01:40:12.880 | We weight this with two.
01:40:14.880 | And then I use the OpenAI embeddings, again normalized, with a weight of 1.5.
01:40:21.880 | And then they get blended together, and you get a result that won't surprise you:
01:40:25.880 | these are not the droids you're looking for,
01:40:27.880 | if you search for droid and robot, will be by far the highest-ranking document.
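A minimal sketch of that weighted, normalized blend using the linear retriever available in recent versions, with the index, field names, and inference endpoint as hypothetical placeholders; the weights mirror the demo:

    GET /quotes/_search
    {
      "retriever": {
        "linear": {
          "retrievers": [
            {
              "retriever": { "standard": { "query": { "match": { "quote": "droid robot" } } } },
              "normalizer": "minmax",
              "weight": 2.0
            },
            {
              "retriever": { "knn": {
                  "field": "quote_vector",
                  "query_vector_builder": {
                    "text_embedding": { "model_id": "my-embedding-endpoint", "model_text": "droid robot" }
                  },
                  "k": 10,
                  "num_candidates": 50 } },
              "normalizer": "minmax",
              "weight": 1.5
            }
          ]
        }
      }
    }

Min/max normalization rescales each sub-query's scores to the 0..1 range before the weights are applied, which is what makes the two otherwise incomparable score ranges blendable at all.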
01:40:33.880 | You had a question somewhere.
01:40:35.880 | How much control do you have there? If you're doing the re-ranking thing -- currently we do something
01:40:40.880 | similar, but we, like, do the different steps of the re-rankers at kind of, like, different
01:40:43.880 | levels of the distribution hierarchy.
01:40:44.880 | So, like, you have, like, sharded processing nodes, and you
01:40:53.880 | do some ranking there, and, like, once you rejoin, before you return
01:40:58.880 | the result from the query engine, that's when you run a final re-ranker.
01:41:03.880 | So, we would retrieve, like, x candidates and you could define the number of candidates
01:41:19.880 | and then we would run the re-ranking on top of those.
01:41:22.880 | So, that will be a trade-off for you, like, the larger the window is, the slower it will
01:41:27.880 | be, but the potentially higher quality your overall results will be because you will just
01:41:32.880 | have everything in your data set that you can then re-rank at the end of the day.
01:41:38.880 | Is that what you meant or you wanted something per node or --
01:41:43.880 | Yeah, I mean, like, right now we, like, control specifically
01:41:48.880 | where the compute happens, so that you can, like, spread it out
01:41:53.880 | and do, like, cheap re-ranking first and then do some re-ranking at the end,
01:41:58.880 | but -- but maybe that doesn't apply to these .
01:42:03.880 | Like, how the query is mechanically running?
01:42:06.880 | Yeah, I don't think that's how we do it.
01:42:09.880 | So, what you can control here is, like, this is a window of, like, what you might retrieve
01:42:14.880 | and then we have the minimum score, like a cutoff point, to throw out what might not be relevant
01:42:19.880 | anyway, to keep that a bit cheaper, that's what we have here.
01:42:29.880 | Those are the retrievers.
01:42:30.880 | And then you could do the RRF that I've explained where you blend results together.
01:42:34.880 | All of that is easy.
01:42:36.880 | One final note: if you got tired of all the JSON, we have a new way of defining those queries
01:42:45.880 | as well, where here we have a match operator, like the one we have used all the time, that
01:42:51.880 | you can use on a keyword field, but it could also be either a dense or a sparse vector
01:42:55.880 | embedding, and then you can just run a query on that and just get the scores from that.
01:43:01.880 | So, it is a piped language; it's a bit more like, I don't know, like a shell.
01:43:05.880 | But if you don't want to type all the JSON anymore, this is how you can do that.
01:43:10.880 | And here my screen size is a bit off.
01:43:12.880 | But, yeah, you get the quote that we retrieved, the speaker, and the score.
01:43:18.880 | Maybe I'll take out the speaker to make this slightly more readable.
01:43:25.880 | No, it broke.
01:43:37.880 | With this, you can write queries with a fraction of the JSON.
01:43:43.220 | This will also support funny things like joins.
01:43:46.220 | It doesn't have every single search feature yet, but it's getting pretty close.
01:43:50.640 | So this is more like a closing out at the end.
01:43:53.320 | If you're tired of all the JSON queries, you don't have to write JSON queries anymore.
01:43:59.040 | This is nice both for, like, observability use cases where you have, like, just like aggregations
01:44:04.380 | and things like that, but it's also very helpful for full text search now if you want to write
01:44:09.100 | different queries.
01:44:10.100 | I think the main answer is that the client support in the different languages, like Java, etc.,
01:44:15.220 | is not very strong yet.
01:44:16.220 | You basically give it strings and then it gives you a result back that you need to parse
01:44:19.940 | out again.
01:44:20.940 | So it is not as strongly typed on the client side yet as the other languages.
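A minimal sketch of what such a query looks like in ES|QL, with the index and field names as hypothetical placeholders; the MATCH function and the _score metadata column are available for full-text search in recent versions:

    FROM quotes METADATA _score
    | WHERE MATCH(quote, "droid robot")
    | SORT _score DESC
    | KEEP quote, speaker, _score
    | LIMIT 3

Each pipe stage transforms the output of the previous one, which is what gives it the shell-like feel mentioned above.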
01:44:29.100 | Any final questions?
01:44:31.100 | You just talked about hybrid search, and I was just curious, like, what are kind of the
01:44:35.820 | recommended best practices?
01:44:36.820 | Like, we also do hybrid search today. What we do is we trigger two Elasticsearch queries,
01:44:42.540 | one to do, like, a basic keyword search, the other one to do, like, a kNN search,
01:44:46.540 | and then we walk through some, like, code to rewrite and merge, and we get the final results.
01:44:51.820 | But, like, from the retrievers you just showed us, like, is it better, like, to combine those
01:44:53.540 | two queries in one retriever? And, like, would that make the results significantly better?
01:45:06.540 | Like, each of the queries will have its own stats and normalizer and things --
01:45:10.540 | I don't know, just, like, in general, it sounds better.
01:45:12.540 | I mean, we can make your life easier.
01:45:16.360 | Yeah.
01:45:17.360 | It's just all behind one single query endpoint.
01:45:20.300 | So you could use the two different methods to retrieve and then you could still re-rank
01:45:25.420 | but all from one single query so you don't have to do it yourself.
01:45:29.300 | I mean, it's not like we want to stop you, but you don't have to and we can make your life
01:45:32.800 | a bit easier.
01:45:33.800 | I mean, it's only one single query that you need to run and, like, one single round
01:45:46.480 | trip to the server that you need to do.
01:45:48.400 | Yeah.
01:45:49.400 | But, like, I was just curious comparing to the two queries method, would you do that?
01:45:56.280 | I mean, if you still need to do the retrieval, like, you do the retrieval, like, all the individual
01:46:00.540 | pieces are still there.
01:46:01.540 | If you have two parts of the query, you will still retrieve those if that is the main cost
01:46:06.140 | and then you have the re-ranking so you're not getting out of those completely.
01:46:10.460 | But you can just do it in one single request that you send.
01:46:12.860 | We take care of all of that for you and then send you one result set back rather than sending
01:46:17.140 | more back to your application.
01:46:18.900 | So it will potentially be a little less work on the Elasticsearch side, but it will mostly
01:46:24.140 | be less work on your application side.
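A minimal sketch of collapsing those two queries into one request with an rrf retriever, with the index, field names, and inference endpoint as hypothetical placeholders:

    GET /products/_search
    {
      "retriever": {
        "rrf": {
          "retrievers": [
            { "standard": { "query": { "match": { "description": "running shoes" } } } },
            { "knn": {
                "field": "description_vector",
                "query_vector_builder": {
                  "text_embedding": { "model_id": "my-embedding-endpoint", "model_text": "running shoes" }
                },
                "k": 20,
                "num_candidates": 100 } }
          ]
        }
      }
    }

One round trip, and the rank blending happens server-side instead of in your application code.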
01:46:27.280 | So if you don't have any problems, you may not notice, but you're running two map-reduces,
01:46:42.880 | right?
01:46:43.880 | When you could be running one, and you're denying the optimizer the opportunity to do any short
01:46:50.460 | circuits to say, oh, there are no more results that are better.
01:46:55.060 | Yeah.
01:46:56.060 | So you're potentially going to spend a little bit more performance and resources.
01:47:17.660 | So if it's not hurting you, by all means, keep going.
01:47:20.660 | But at some point, you're going to start vertically scaling your hardware when you
01:47:24.720 | don't need to; you can get further, it's just at a higher cost.
01:47:35.320 | Yeah.
01:47:36.320 | Perfect.
01:47:39.320 | Thank you so much.
01:47:40.320 | I hope everybody learned something.
01:47:42.320 | I will leave the instance running for today or so, so you can still play around with the queries
01:47:46.320 | if you feel like it.
01:47:48.320 | Thanks a lot for joining.
01:47:49.320 | If you want stickers --
01:47:50.320 | we have stickers up there.
01:47:51.320 | We also have a booth the next few days.
01:47:52.960 | Come join and get some proper swag from us there.
01:47:58.300 | Thank you.
01:47:59.300 | See you around.
01:47:59.620 | Thanks.