Information Retrieval from the Ground Up - Philipp Krenn, Elastic

00:00:00.000 |
Let's get going. Audio is okay for everybody. I have some slight feedback, but I'll try to 00:00:20.520 |
manage. I hope it's okay for you. Hi, I'm Philipp. Let's talk a bit about retrieval. I'll show you 00:00:27.300 |
some retrieval from the ground up. We'll keep it pretty hands on. You will have a chance 00:00:32.660 |
to follow along and do everything that I show you as well. I have a demo instance that you 00:00:37.020 |
can use. Or you can just watch me if you have any questions, ask at any moment. If anything 00:00:44.340 |
is too small, reach out, and we'll try to make it larger. We'll try to adjust as we 00:00:48.760 |
go along. So, I guess we're not over RAG yet. But RAG is a thing. And we'll focus on the 00:00:57.760 |
R in RAG, the retrieval in retrieval-augmented generation. We'll just focus on the retrieval. Let's 00:01:04.760 |
see where we are with retrieval. Quick show of hands. Who has done RAG before? Okay. That's 00:01:10.760 |
about half or so. Who has done anything with vector search and RAG? Do I need vector search 00:01:19.580 |
for RAG or can I do anything else? Yeah. So, you can do anything. Retrieval is actually 00:01:28.620 |
a very old thing. Depending on how you define it, it might be 50, 70, whatever years old. 00:01:34.560 |
It's just getting the right context to the generation. I'll ignore all the generation 00:01:39.820 |
for today. We'll keep it very simple. We'll just focus on the retrieval part of getting 00:01:43.220 |
the right information in. Partially from the old stuff, like the classics. But we'll get 00:01:49.440 |
to some new things as well as we go along. Who has done keyword search before? Just that 00:01:56.840 |
is fewer than vector search, I feel like. Which almost reminds me of like 15 years ago or so, 00:02:04.140 |
when NoSQL came up, like more people had done MongoDB, Redis, whatever else, rather than 00:02:09.020 |
SQL. That has changed again. I think it will be kind of similar for retrieval. The way I 00:02:15.380 |
would always say that vector search is a feature of retrieval. It's only one of multiple features 00:02:22.820 |
or many features that you want in retrieval. And we'll see a bit why and how and we'll dive 00:02:27.180 |
into those details. So, I work for Elastic, the company behind Elasticsearch. We're the most downloaded, 00:02:33.720 |
deployed, whatever else, search engine. We do vector search, we do keyword search, we do 00:02:38.280 |
hybrid search. We'll dive into various examples. Everything that I will show you works -- well, 00:02:44.960 |
the query language is Elasticsearch. But if you use anything built on Apache Lucene, everything 00:02:50.540 |
behaves very similarly. If you use something that is a clone or close to Lucene, like anything 00:02:56.680 |
built on Tantivy or anything like that, it will be very similar. The foundation, keyword search 00:03:03.680 |
and vector search will apply broadly everywhere. So, let's get going. We'll keep this pretty 00:03:09.800 |
hands on. Who remembers in Star Wars when he's making that hand gesture? What is the quote? 00:03:16.920 |
These are not the droids you're looking for. We'll keep this relatively Star Wars based. Feel 00:03:26.040 |
free to come in and fill in on the sides or wherever. I'm afraid we have, I think, one chair over 00:03:31.740 |
there otherwise and one down there. Otherwise, it's getting a bit full. Okay. Let's look at 00:03:40.040 |
what "these are not the droids you're looking for" does for search. And I will start 00:03:45.200 |
kind of like with the classic approach. Keyword search or lexical search is like you search 00:03:50.040 |
for the words that you have stored and we want to find what is relevant in our examples. 00:03:56.760 |
If you want to follow along, there is a gist which has all the code that I'm showing you. 00:04:03.160 |
So, let's go to the last slash AI dot engineer. There is one important thing. It's I have one 00:04:11.000 |
shared instance basically for everybody. So, you can all just use this without signing up 00:04:15.400 |
for any accounts or anything. So, this is just a cloud instance that you can use. There is my handle. 00:04:21.240 |
It's in the index name. If you don't want to fight and overwrite each other's data, replace that with 00:04:28.840 |
your unique handle or something that is specific to you. Because otherwise, you will all work on 00:04:33.320 |
the same index and kind of like overwrite each other's data. You can also just watch me. If you 00:04:37.960 |
don't have a computer handy, that's fine. But if you want to follow along, last slash AI dot engineer, 00:04:44.360 |
there will be a gist. It will have the connection string. Like there is a URL and then the credentials 00:04:49.800 |
are workshop, workshop. If you go into log in, it will say log in with Elasticsearch. That's where you 00:04:55.000 |
use workshop, workshop. Then you will be able to log in. And you can just run all the queries that I'm 00:05:00.360 |
showing you. You can try out stuff. If you have any questions, shout. I have a couple of colleagues 00:05:05.960 |
dispersed in the room. So, if we have too many questions, we will somehow divide and conquer. 00:05:11.080 |
So, let's get going and see what we have here. And I will show you most of the stuff live. 00:05:19.960 |
I think this is large enough in the back row. If it's not large enough for anybody, shout and we will 00:05:24.440 |
see how much larger I can make this. And let me turn off the Wi-Fi and hope that my wired connection 00:05:33.640 |
is good enough. Let's refresh to see. Ooh. Maybe we will use my phone after all. 00:06:26.600 |
Okay. Hardest problem of the day solved. We have network. 00:06:34.200 |
Okay. So, we have the sentence. These are not the droids you are looking for. And we will start 00:06:38.040 |
with the classic keyword or lexical search. Like, what happens behind the scenes? 00:06:41.880 |
So, what you generally want to do is you basically want to extract the individual words and then make 00:06:47.480 |
them searchable. So, here, I'm not storing anything. I'm just looking at, like, how would that look like 00:06:53.240 |
if I stored something? I'm using this underscore analyze endpoint to see what I will actually store in 00:07:01.080 |
the background to make them searchable. So, these are not the droids you are looking for. And you see, 00:07:06.120 |
these are not the droids you are looking for. In Western languages, the first step that happens is 00:07:21.160 |
the tokenization. In Western languages, it's pretty simple. It's normally any white spaces and punctuation 00:07:25.800 |
marks where you just break out the individual tokens. Especially Asian languages are a bit more 00:07:31.880 |
complicated around that. But we will gloss over that for today. And we have a couple of interesting 00:07:37.320 |
pieces of information here. So, we have the token. So, this is the first token. We have the start offset 00:07:43.720 |
and the end offset. Why would I need a start and end offset? Why would I extract and then store that 00:07:50.200 |
potentially? Any guesses? Yeah? Yes. Especially if you have a longer text, you would want to have that 00:07:58.360 |
highlighting feature that you want to say, this is where my hit actually was. So, if I'm searching for 00:08:03.720 |
these, which is maybe not a great word, but you would very easily be able to highlight where you had 00:08:08.120 |
actually the match. And the trick that you're doing in search, generally what differentiates it from a 00:08:14.040 |
database is a database just stores what you give it and then does basically almost everything at query 00:08:19.560 |
or search time. Whereas a search engine does a lot of the work at ingestion or when you store the data. 00:08:25.480 |
So, we break out the individual tokens. We calculate these offsets and store them. So, whenever we have a 00:08:31.400 |
match afterwards, we never need to reanalyze the actual text, which could potentially be multiple pages 00:08:36.760 |
long. But we could just highlight where we have that match because we have extracted those positions. 00:08:41.640 |
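As a rough sketch (the exact request in the gist may differ a bit), that _analyze call looks something like this:

```
GET /_analyze
{
  "analyzer": "standard",
  "text": "These are not the droids you are looking for"
}
```

Every token in the response comes back with token, start_offset, end_offset, type, and position, which is exactly the information we are talking about here.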
We have a position. Why would I want to store the position with the text that I have? 00:08:46.600 |
Yeah? Annotation. So, the main use case that you have is if you have these positions and later on, 00:08:57.880 |
we'll briefly look at if you want to look for a phrase, if you want to look for this word followed 00:09:02.040 |
by that word. So, you could then just look for all the text that contain these words. But then you 00:09:07.880 |
could also just compare the positions and basically look for n, n plus 1, et cetera. And you never need 00:09:12.840 |
to look at the string again. But you can just look at the positions to figure out like this was one 00:09:17.480 |
continuous phrase. Even if you have broken it out into the individual tokens. Most of the things that 00:09:23.960 |
we see here is alpha num for alpha numeric. An alternative would be synonyms. We'll skip over 00:09:30.440 |
synonym definition because it's not fun to define tons of synonyms. But this is all the things that we're 00:09:36.200 |
storing here in the background. You can also customize this analysis. And that is one of the features, 00:09:41.960 |
again, of full text search and lexical searches that you preprocess a lot of the information to make 00:09:47.400 |
that search afterwards faster. So, here you can see I'm stripping out the HTML because nobody's going 00:09:53.320 |
to search for this emphasis tag. I use a standard tokenizer that breaks up, for example, on dashes. 00:10:01.480 |
You will see that. Alternatives would be white space that you only break up on white spaces. 00:10:05.880 |
I lowercase everything, which is most of the times what you want because nobody searches in Google 00:10:13.880 |
with proper casing or at least maybe my parents. But nobody else searches with proper casing in Google. 00:10:20.200 |
We remove stop words. We'll get to stop words in a moment. And we do stemming with the snowball stemmer. 00:10:27.800 |
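Put together as an ad-hoc request, that analysis chain looks roughly like this (a sketch; the gist may name or tune things differently):

```
GET /_analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase", "stop", "snowball"],
  "text": "These are <em>not</em> the droids you are looking for"
}
```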
What stemming does is it basically reduces a word down to its root. So, you don't care about singular, 00:10:32.280 |
plural, or the inflection of a verb anymore. But you really care more about the concept. So, 00:10:38.360 |
if I run through that analysis, does anybody want to guess what will remain of 00:10:43.800 |
this phrase or which tokens will be extracted and in what form? 00:10:58.520 |
Yeah, close. So, we'll actually have three. So, we have Droid, You and Look. And you can see 00:11:08.840 |
all the others were stop words which were removed. The stemming 00:11:13.720 |
reduced looking down to look because we don't care if it looks, looking, look. We just reduce it to the 00:11:19.960 |
word stem. So, we do this when we store the data. And by the way, when you search afterwards, your text 00:11:26.040 |
will run through the same analysis that you would have exact matches. So, you don't need to do anything 00:11:30.520 |
like a like search anymore in the future. So, this will be much more performant than anything that you 00:11:35.800 |
would do in a relational database because you have direct matches. And we'll look at the data structure 00:11:40.200 |
behind it in a moment. But what we get is Droid, You and Look with the right positions. So, for example, 00:11:47.640 |
if we search for Droid, You, we could easily retrieve that because we have the positions, 00:11:52.200 |
even though that is a weird phrase. Do we start indexing at zero or one? 00:11:57.320 |
0, yes. It's the only right way. There is a different discussion here. So, we are -- the positions are 00:12:09.080 |
based starting at zero. And these are the tokens that are remaining. If you do this for a different 00:12:15.640 |
language, like you might hear I'm a native German speaker. This is the text in German. And you would, 00:12:21.720 |
if you use a German analyzer, it would know the rules for German and then would analyze the text in the 00:12:27.800 |
right way. So, then you would have remaining Droid, den, such. Anybody wants to guess what happens if I 00:12:36.280 |
have the wrong language for a text? It will go very poorly. Because the -- so, how this works is, basically, 00:12:47.160 |
you have rules for every single language. It's like, what is the stop word? How does stemming work? 00:12:51.720 |
If you apply the wrong rules, you basically just get wrong stuff out. So, it will not do what you want. 00:12:58.520 |
So, what you get here is, like, this is an article. But, well, in English, the rule is an S at the end 00:13:06.200 |
just gets stemmed away, even though this doesn't make any sense. So, you apply the wrong rules and you 00:13:10.840 |
just produce pretty much garbage. So, don't do that. Just to give you another example, 00:13:16.360 |
French, this is the same phrase in French. And then you see Droid, La, and Recherche are the words 00:13:27.480 |
that are remaining in these examples. Otherwise, it works the same. But you need to have the right 00:13:31.960 |
analysis for what you're doing. Otherwise, you'll just produce garbage. A couple of things as we're 00:13:39.160 |
going along. The stop word list, by default, which you could overwrite, is relatively short. This is 00:13:45.160 |
linguists have spent many years figuring out what are the right list of stop words. And you don't want 00:13:49.960 |
to have too many or too few. In English, I always forget, I think it's 33 or so. This is where you 00:13:56.120 |
can find it in the source code. It's -- I don't want to say well hidden, but it's not easy to find either. 00:14:00.440 |
So, every language has, like, a list of stop words that are defined and that will be automatically removed. For 00:14:06.440 |
"these are not the droids you are looking for", by accident, more or less, we had a lot of stop words, 00:14:11.640 |
which is why not a lot remained here in the phrase. And then for all other languages, 00:14:15.320 |
you will have a similar list of stop words. Should you always remove stop words? 00:14:25.560 |
Yes? No? Yes -- that is, by the way, a very good point about "not". I'm not sure if everybody heard that; 00:14:36.040 |
the comment was about "not". One important thing here: we're talking about lexical or keyword search, 00:14:41.800 |
which is dumb, but scalable. It doesn't understand if there is a droid or there's no droid. It's just 00:14:49.640 |
defined as a stop word. It does just keyword matching. That is, in vector search or anything 00:14:56.200 |
with a machine learning model behind it will be a bit of a different story afterwards, where these 00:15:00.840 |
things might make a difference. But this is very simple, because it just matches on similar strings, 00:15:06.120 |
basically. It doesn't understand the context. It doesn't know what's going on. That's why the linguists 00:15:10.840 |
decided "not" is a good stop word. You could overwrite that if, for your specific use case, this is not a 00:15:17.240 |
good idea. Always removing stop words, yes, no, maybe. So, our favorite phrase is it depends. 00:15:27.560 |
And then you have to explain, like, what it depends on. So, what it depends on is there are scenarios 00:15:34.360 |
where removing all stop words does not give you the desired result. And maybe you want to have, like, 00:15:39.400 |
a text with and without stop words. Like, sometimes stop words are just, like, a lot of noise that blow up 00:15:44.520 |
the index size and don't really add a lot of value. That's why we have defined them and try to remove 00:15:48.840 |
them by default. But if you had, for example, to be or not to be, these are all stop words. 00:15:54.840 |
It would all be gone when you run it through analysis. So, it is tricky to figure out, like, 00:16:02.600 |
what is the right balance for stop words or what works for your use case. But you might have unexpected surprises 00:16:08.040 |
in all of this. Okay. We have seen the German examples. Let's do some more queries. Or let's 00:16:17.560 |
actually store something. So far, we only pretended or we only looked at what would happen if we would 00:16:23.240 |
store something. Now, I'm actually creating an index. Again, if you're running this yourself, 00:16:28.680 |
please use a different name than me. Just replace all my handle instances with your handle or whatever you 00:16:37.400 |
want. Since this is a shared instance. If you have too many collisions, I might jump to another instance 00:16:42.440 |
that I have as a backup in the background. But what I'm doing here is I'm creating this analysis pipeline 00:16:48.360 |
that I have looked at before. Like, I'm throwing out the HTML. I use a standard tokenizer, lower casing, 00:16:53.400 |
stop word removal, and stemming. And then I call this my analyzer. And then I'm basically applying this 00:17:00.760 |
my analyzer on a field called "quote". We call this a mapping. It's kind of like the equivalent of a schema in a 00:17:09.000 |
relational database. But this defines how different fields behave. Okay. And somebody did not replace 00:17:19.800 |
the name in the query. By the way, you need to keep the user_ prefix. Let me quickly do this myself. Oops. I should have seen 00:17:38.840 |
this coming. We want to replace the handle, and we'll use -- oops. Please don't copy that. 00:18:07.160 |
Let's try it again. So we're creating our own index. And now, just to double check, I'll again run this 00:18:19.160 |
underscore analyze against this field that I've set up to just double check that I've set it up correctly. 00:18:37.160 |
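For reference, the index creation roughly looks like the sketch below; the index name is a placeholder (keep the user_ prefix and swap in your own handle), and the gist may configure a few more things:

```
PUT /user_yourhandle-starwars
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "snowball"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "quote": { "type": "text", "analyzer": "my_analyzer" }
    }
  }
}

GET /user_yourhandle-starwars/_analyze
{
  "field": "quote",
  "text": "These are <em>not</em> the droids you are looking for"
}
```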
And now I'm actually starting to store documents. Bless you. So we'll store -- these are not the droids you're 00:18:45.160 |
looking for. I have two others that I'll index just so we have a bit more to search. No, I am your father. 00:18:53.880 |
Any guesses what will remain here? Father. Father. Yeah. 00:19:00.680 |
Okay. Let's try this out. Let me copy my -- this one actually has way fewer stop words than you would expect. 00:19:11.160 |
Let's quickly do this. Since I didn't do the HTML removal, let's take these out manually. 00:19:23.560 |
So what you get is, no, I am your father. And this was stupid because this was not what I wanted. We need to run this against the right analysis. 00:19:33.160 |
This happens when you copy-paste. Okay, um, uh, sorry. And we'll do text. 00:19:47.320 |
No, I think I've patched this back together. Okay: i, am, your, father remain. So "no" is the only stop word in this list, actually. 00:20:01.000 |
"No" is on the stop word list; all the others are not. Um, okay, let's try another one. 00:20:19.720 |
Obi-Wan never told you what happened to your father. How many tokens will Obi-Wan be? Two? One? 00:20:32.040 |
No, Obi-Wan will be two. Like Obi-Wan -- because we use the default tokenizer or standard tokenizer, 00:20:41.880 |
that one breaks up at dashes. If you had used another tokenizer like white space, that would keep it 00:20:46.760 |
together because it breaks up in white spaces. So there are various reasons why you want or would 00:20:51.560 |
not want to do it. I don't want to go into all the details. But there are a lot of things to do 00:20:55.880 |
right or wrong when you ingest the data, which will then allow you to query the data in specific ways. 00:21:01.160 |
So, for example, if you would have an email address, that one is also weirdly broken up. Like, 00:21:09.960 |
you might use, like, there's a dedicated tokenizer for URL and email addresses. So, depending on what type of 00:21:16.200 |
data you have, you will need to process the data the right way because pretty much all the smart 00:21:21.400 |
pieces are kind of like an ingestion here to make the search afterwards easier. So, you can easily do 00:21:27.800 |
that. Let's see. Let's index all my three documents so that we can actually search for them. Now, if I 00:21:35.720 |
start searching for droid, it should match. These are not the droids you're looking for. Yes or no. Because this 00:21:43.800 |
one is singular and uppercase, and the droid that we stored was plural and lowercase. Will that match, 00:21:49.240 |
yes or no? Yes. Why? Because of the stemming. 00:21:54.520 |
Yes, we had the stemming. We had the lower casing. And when we search, so we store the text, it runs 00:22:01.560 |
through this pipeline or the analysis. And for the search, it does the same thing. So, it will lowercase the 00:22:07.880 |
droid. It has stemmed down the droids in the text to droid and then we have an exact match. So, what the 00:22:16.360 |
data structure behind the scene actually looks like. The magic is kind of like in this so-called inverted 00:22:23.080 |
index. What the inverted index is, is these are all the tokens that remained that I have extracted. 00:22:30.920 |
I have alphabetically sorted them. And they basically have a pointer and say in this document, like with 00:22:36.120 |
the IDs 1, 2, 3 that I have stored, we have how many occurrences? Like 0, 1, yeah, nothing had 2. 00:22:45.080 |
And then we also know at which position they appeared. So, search for droid now. This is what I have stored. 00:22:55.560 |
I lowercase the droid to droid. I have an exact match here. Then I go through the list and see, 00:23:00.600 |
retrieve this document, skip this one, skip this one. And at position 4, you have that hit. And then you 00:23:06.440 |
could easily highlight that. So, you have almost done all the hard work at ingestion and this retrieval 00:23:11.640 |
afterwards will be very fast and efficient. That's the classic data structure for search, the inverted 00:23:17.880 |
index where you have this alphabetic list of all the tokens that you have extracted to do that. 00:23:23.160 |
And this will just be built in the background for you and that's how you can retrieve all of this. 00:23:26.920 |
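Indexing the three quotes and running that search looks roughly like this (index name is again a placeholder):

```
PUT /user_yourhandle-starwars/_doc/1
{ "quote": "These are not the droids you're looking for." }

PUT /user_yourhandle-starwars/_doc/2
{ "quote": "No, I am your father." }

PUT /user_yourhandle-starwars/_doc/3
{ "quote": "Obi-Wan never told you what happened to your father." }

GET /user_yourhandle-starwars/_search
{
  "query": {
    "match": { "quote": "Droid" }
  }
}
```

The query text goes through the same analyzer, so Droid becomes droid and matches the stemmed, lowercased token in the inverted index.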
Let's look at a few other queries and how they behave. If I search for robot, will I find anything? 00:23:41.240 |
No, because there was no robot. There was a droid. We could now define a synonym and say, like, 00:23:51.160 |
all droids are robots, for example. Who likes creating synonym lists? Nobody anymore. Okay. 00:24:00.920 |
Normally, I would have said that's the Stockholm syndrome because there is sometimes somebody who 00:24:05.320 |
likes creating synonym lists because they have done that for so many years. But it got easier nowadays. 00:24:10.840 |
Now you can use LLMs to generate the synonyms. So, it can get a bit easier to create them. But they're 00:24:16.200 |
still limited because you have always this mapping. So, with synonyms, you can expand the right way. 00:24:21.480 |
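If you did want the droid/robot equivalence, the classic tool is a synonym token filter; a minimal ad-hoc sketch (not from the gist):

```
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    { "type": "synonym_graph", "synonyms": ["droid, robot"] }
  ],
  "text": "robot"
}
```

In a real index you would put a filter like this into the analyzer, typically on the search side, so that a search for robot also finds droid.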
Where it gets trickier if you have homonyms. If a word has multiple meanings, like a bat could be the 00:24:28.360 |
animal or it could be the thing you hit a ball with. There it just gets trickier because there is no meaning 00:24:34.360 |
behind the words or no context. So, you just match strings and that is inherently limited. But, like I said, 00:24:40.680 |
it's dumb, but it scales very well. And that's why it has been around for a long time. And it does 00:24:46.920 |
surprisingly well for many things because there's not a lot of things that are unexpected or that can 00:24:51.640 |
go totally wrong. Now, other things that you can do. You could do a phrase search where you say, 00:24:58.680 |
I am your father. Will this find anything? Yes. Because we had no, I am your father. 00:25:11.320 |
What happens if I say, for example, I am, let's see, I am not your father. Yes, no? No. No. Why? So, you're right. 00:25:27.080 |
Looking for an exact match based on the positions. 00:25:29.240 |
"Not" is a stop word. Yes, "not" is a stop word. But you're right, because the positions still don't match. 00:25:38.280 |
So, the stop word not would be filtered out, but it still doesn't match because the positions are off. 00:25:45.240 |
That is one of the things that sometimes can be confusing. So, even if something is a stop word 00:25:51.240 |
and will be filtered out, it doesn't work like that. One thing that you can do, though, 00:25:56.520 |
is use a factor called slop, where you basically say that if there is something missing, 00:26:01.480 |
it would still work. So, searching for "I am father" against "I am your father" with slop zero, which is the 00:26:09.000 |
implicit default, will not find anything. But if I set slop to one, then I basically say, like, 00:26:13.880 |
there can be a one-off in there. Like, one word can be missing. 00:26:24.040 |
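The phrase query with slop looks roughly like this:

```
GET /user_yourhandle-starwars/_search
{
  "query": {
    "match_phrase": {
      "quote": {
        "query": "I am father",
        "slop": 1
      }
    }
  }
}
```

With the default slop of 0 this finds nothing; with slop 1 it matches "No, I am your father" because one position may be skipped.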
However, I am his father. Here, his would not match. So, this still will not work. 00:26:28.680 |
The slop is really just to skip a word. Yeah? 00:26:37.080 |
I assume that -- no, I possibly am your father. I assume that won't work. 00:26:45.160 |
There you might need to do something like a synonym where you say 'm gets replaced by am. 00:26:52.680 |
Or we will need to have some more machine learning capabilities behind the scenes to do stuff like 00:26:57.640 |
that. Are there any libraries that would predefine contractions like that? 00:27:02.760 |
So, what is built in is generally a very simple set of rules. What you will need to do for things 00:27:11.800 |
like this is normally you need a dictionary. The problem around these is they are normally not available for 00:27:17.000 |
free or open source. Funnily enough, they are often coming out of university, the dictionaries, 00:27:24.200 |
because they have a lot of free labor. The students. That's why the universities have been creating a lot 00:27:31.240 |
of dictionaries. But they often come out under the weirdest licenses. That's why they are not very widely 00:27:35.720 |
available. But, yes, there is a smarter or more powerful approach if you have a dictionary and you 00:27:41.480 |
can do these things. For example, one thing to show is like, maybe that's a good thing to also mention. 00:27:50.920 |
You don't always get words out of the stemming. It's not a dictionary. It doesn't really get what 00:27:58.520 |
you're doing. It just applies some rules. So, for example, blackberry. Sorry, blackberries -- 00:28:09.000 |
I think that this will be stemmed down differently. Ah, sorry, I need the English analyzer. Without English, this will 00:28:15.160 |
not work. So, this will stem down to this weird token, blackberri. And it will also stem the singular 00:28:27.160 |
blackberry down to the same thing. So, there's a rule that applies this. But it's just a rule. It's not dictionary-based. 00:28:32.600 |
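You can see that behavior directly with the snowball filter set to English; roughly:

```
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    { "type": "snowball", "language": "English" }
  ],
  "text": "blackberries blackberry"
}
```

Both forms should come back as roughly the same stem (something like blackberri), which is the point: a rule is being applied, not a dictionary lookup.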
It's not very smart. And it only has some rules built in that work for this. But you will definitely 00:28:38.680 |
hit limits. And the other thing, by the way, why I picked Blackberry as an example, you have some 00:28:45.720 |
annoying languages like German, Korean, and others that build compound nouns like blackberry, where you have 00:28:53.000 |
basically two words. Black would never find blackberry in the simplest form because it's not a complete 00:28:59.880 |
string. There are various ways to work around that, and they all come with their own downsides. Either 00:29:05.960 |
you have a dictionary or you extract so-called n-grams, like groups of letters, and then you match on 00:29:10.680 |
those groups. But all of those are just some of the many tools we use to try to make this a bit better or smarter, 00:29:17.560 |
but it all has limitations. I hope that answers the question and makes sense. 00:29:22.360 |
So, there are dictionaries, but they're generally not free or not under an easy license available. 00:29:29.080 |
For some languages, by the way, even the stemmers are not freely available. I think there is a stemmer 00:29:35.880 |
or analyzer for Hebrew. I think that has also like some commercial license or at least you can't use it for 00:29:43.240 |
free or free in commercial products. Though licensing with machine learning models is also its own dark art. 00:29:52.360 |
Yes. That is what an n-gram is doing. Let me see if I can. 00:30:18.120 |
An n-gram is normally a group of characters, typically a trigram. This is way too small. 00:30:23.800 |
Somehow I have weirdly overwritten my command plus so I can't use that. Let me make this slightly larger. 00:30:32.440 |
Okay. Here we basically use one or two letters as word groups, which is way too small. 00:30:46.360 |
But just to show the example, and this is very hard to read. Let me copy that over to my console. 00:30:52.920 |
There you can -- there you can -- oops. There you can see this. But this is a great question. 00:31:03.080 |
So, we'll use an n-gram tokenizer for "quick fox". And then you can see the tokens that I extract here are the first letter, 00:31:10.200 |
the first two, the second, the second and third, et cetera. And you end up with a ton of tokens. 00:31:16.760 |
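That demo is roughly this request, with the deliberately tiny sizes of one and two:

```
GET /_analyze
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 1,
    "max_gram": 2
  },
  "text": "quick fox"
}
```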
The downside is, A, you have to do more work when you store this. B, it creates a lot of storage on 00:31:23.640 |
disk because you extract so many different tokens. And then your search will also be pretty expensive 00:31:28.120 |
because normally you would do at least trigrams. But even that creates a ton of tokens and a ton of 00:31:36.440 |
matches. And then you need to find the ones with the most matches. And it works. But A, it is pretty 00:31:42.280 |
expensive in disk but also at query time. And it might also create undesired results or results that are a bit 00:31:49.160 |
unexpected for the end user. It is, I would call it, again, it's a very dumb tool that works reasonably 00:31:56.760 |
well for some scenarios. But it's only one of many potential factors. What you could potentially do is, 00:32:03.400 |
and I don't have a full example for that, but we could build it quickly, what you would do in reality 00:32:09.080 |
probably, you might store a text more than one way. So, you might store it, like, with stop words and 00:32:15.720 |
without stop words and maybe with n-grams. And then you give a lower weight to the n-grams and say, 00:32:22.040 |
like, if I have an exact match, then I want this first. But if I don't have anything in the exact 00:32:26.040 |
matches, then I want to look into my engram list. And then I want to kind of, like, take whatever is 00:32:31.400 |
coming up next. So, even keyword-based search will be more complex if you combine different methods. 00:32:38.680 |
N-grams are interesting, but, again, they're a dumb but pretty heavy hammer. 00:32:45.320 |
Use them in the right scenario. 00:32:48.040 |
Sorry, quick question about this n-gram. Is it by default one or two? 00:32:51.560 |
Yes. But you could redefine that. So, we can, let me go back to the docs. For the n-gram tokenizer, 00:33:01.000 |
you can set min_gram and max_gram. If you set both to three, you would have trigrams, where it's always 00:33:06.840 |
groups of three, like 1-2-3, 2-3-4, et cetera. You could also have something called edge n-gram, 00:33:14.200 |
where you expect that somebody types the first few letters right, and then you only start from the 00:33:18.600 |
beginning but not in the middle of the word, which sometimes avoids unexpected results. And, of course, 00:33:24.680 |
reduces the number of tokens quite a bit. So, somewhere in here, edge n-gram. 00:33:38.120 |
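The edge n-gram variant of the same request looks roughly like this:

```
GET /_analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 1,
    "max_gram": 2
  },
  "text": "quick"
}
```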
Let's just copy that over so I won't type. So, here we have edge n-gram with quick, and you can see it 00:33:47.400 |
only does the first and the first two letters, but nothing else. And, in reality, you would probably 00:33:53.160 |
define this like 2, 2, 5, or more, or whatever else you want. But, here, we only do from the start and 00:33:59.880 |
nothing else, which reduces the tokens tremendously. But, of course, if you have blackberry and you want to 00:34:06.840 |
match the berry, you're out of luck. Makes sense. Anybody else? Anything else? 00:34:16.120 |
Yeah, so, if you have multiple languages, do not mix them up. That will just create chaos. Because we'll get to that in a moment. But, how keyword search works is basically word frequency. 00:34:23.240 |
And if you mix languages, it screws up all frequencies and statistics. So, what you would do is, you 00:34:44.520 |
would have one field for English and then another field for whatever the abbreviation for 00:34:53.800 |
Hebrew is. And then you would need to define the right analyzer for that 00:35:03.080 |
specific field. So, you break it out either into different fields or you could even do different indices. 00:35:07.080 |
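A sketch of the per-language-field idea with built-in analyzers (Hebrew needs a plugin, so this uses English and German; all names are made up):

```
PUT /user_yourhandle-multilang
{
  "mappings": {
    "properties": {
      "quote_en": { "type": "text", "analyzer": "english" },
      "quote_de": { "type": "text", "analyzer": "german" }
    }
  }
}
```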
And ideally, we even have that built in. We have language identification. Even if you just provide a couple of 00:35:16.360 |
words, it will guess, or not guess, it will infer the language with a very high degree of certainty. 00:35:25.640 |
Especially Hebrew will be very easy to identify; if a language has its own script, it's easy. But even if 00:35:44.760 |
you just throw random languages at it, it will have a very good chance, just with a few words, to know 00:35:49.880 |
this is this language and then you can treat it the right way. Good. Let's continue. 00:36:03.320 |
So, we have done all of these searches. We have done slop. One more thing before we get into the 00:36:12.200 |
relevance. One other very heavy hammer that people often overuse is fuzziness. So, bless you. If you 00:36:21.240 |
have a misspelling, so I misspelled Obi-Wan Kenobi. We already know that this is broken out into two different 00:36:29.080 |
words or tokens. It will still match your Obi-Wan because we have this 00:36:33.240 |
fuzziness, which allows edits. It's like a Levenshtein distance. So, you can have one. By default here, 00:36:42.200 |
you could either give it an absolute value, like you can have one edit, which could be one character 00:36:49.400 |
too much, too little, or one character different. You could set it to two. You can't do 00:36:55.800 |
more because otherwise you match almost anything. And auto is kind of smart because, depending 00:37:03.160 |
on how long the token that you're searching for is, it will set a specific value. If you have zero to two 00:37:10.120 |
characters, auto fuzziness, I think, is -- no, zero to two characters is zero. Three to five 00:37:18.440 |
characters is one. And after that, it's two. So, you can match these. Will this one match? 00:37:40.680 |
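The fuzzy query looks roughly like this; the exact misspelling used on screen is made up here:

```
GET /user_yourhandle-starwars/_search
{
  "query": {
    "match": {
      "quote": {
        "query": "Obe-Wen Kenobee",
        "fuzziness": "AUTO"
      }
    }
  }
}
```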
Yes. So, we have -- both of those are misspelled. 00:37:48.360 |
They get tokenized separately, so you can't have a single fuzziness across the whole phrase? 00:37:52.680 |
Yes. That is a bit of a gotcha. So, yes. You need to know the tokenizer. So, we tokenize with standard, 00:37:58.920 |
so it's two tokens, and then the fuzziness applies per token, which is another slightly surprising 00:38:04.920 |
thing. But, yes, that's how you end up here. Okay. Now, we could look at how the Levenshtein 00:38:14.600 |
distance works behind the scenes, but it's basically a Levenshtein automaton which looks something like 00:38:21.240 |
this. If you search for food and you have two edits, this is how the automaton would work in the 00:38:25.320 |
background to figure out, like, what are all the possible permutations. It's a fancy algorithm that 00:38:30.440 |
was, I think, pretty hard to implement, but it's in Lucene nowadays. Okay. Now, let's talk about 00:38:38.520 |
scoring. One thing that you have seen, that you don't have in a non-search engine or 00:38:44.200 |
just in a database, is that we have a score. It's like, how well does this match? How does the score 00:38:50.680 |
work here? Let's look at the details of that one. So, the basic algorithm, which most of us, 00:39:01.640 |
or pretty much all of us here, know: term frequency inverse document frequency, or TF-IDF. It has been slightly 00:39:09.720 |
tweaked, like the new implementation is called BM25, which stands for best match, and it's the 25th 00:39:15.240 |
iteration of the best match algorithm. So, what they look like is you have the term frequency. If I 00:39:22.200 |
search for Droid, how many times does Droid appear in the text that I'm looking for? And it's basically 00:39:28.840 |
the square root of that. So, the assumption is: if a text contains droid once, it has this relevancy. If I 00:39:35.880 |
have a text that contains droid 10 times, it has a higher relevancy. The tweak is that with TF-IDF that score just 00:39:42.680 |
keeps growing, whereas BM25 says, like, once you hit five droids in a text, it doesn't really get much more 00:39:48.840 |
relevant anymore. So, it kind of, like, flattens out the curve. That is the idea of term frequency. 00:39:55.000 |
The next thing is the inverse document frequency, which is almost the inverse curve. The assumption here is 00:40:04.120 |
over my entire text, this is how often the term Droid appears. So, if a term is rare, it is much more 00:40:13.080 |
relevant than if a term is very common, then it's kind of, like, less relevant. Basically, the assumption 00:40:18.920 |
is rare is relevant and interesting. Very common is not very interesting anymore. And then it's kind of, 00:40:24.520 |
like, just works its curve out like that. And the final thing is the field length norm is, like, 00:40:32.280 |
the shorter a field is and you have a match, the more relevant it is. Which assumes, like, 00:40:38.680 |
if you have a short title and your keyword appears there, it's much more relevant than if there's a 00:40:42.600 |
very long text body and your keyword and you have a match there. And these are the three main components 00:40:49.240 |
of TF-IDF. So, let's take a look at how this looks like. You can make this a bit more complicated. 00:40:56.200 |
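Written out, the pieces just described look roughly like this; the exact normalizations and the idf form vary a bit between Lucene versions:

```latex
% Classic Lucene TF-IDF components, roughly:
%   tf(t,d)   = sqrt( freq(t,d) )
%   idf(t)    = 1 + ln( numDocs / (docFreq(t) + 1) )
%   norm(d)   = 1 / sqrt( numTermsInField(d) )
% BM25 keeps the idf idea but saturates the term frequency
% (defaults roughly k1 = 1.2, b = 0.75):
\[
\mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{idf}(t) \cdot
  \frac{\mathrm{freq}(t, D)\,(k_1 + 1)}
       {\mathrm{freq}(t, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
\]
```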
This will show you why something matches. Don't be confused by the -- or let me take that out for the 00:41:05.240 |
first try. So, I'm looking for father. And I am -- no, I am your father. And Obi-Wan never told you what 00:41:12.600 |
happened to your father. One is more relevant than the other. Why is the first one more relevant than 00:41:19.160 |
the second one? Yeah. Term frequency is the same. Both contain father once. The inverse document frequency 00:41:29.560 |
is also the same because we are looking for the same term. The only difference is that the second one 00:41:34.840 |
is longer than the first one. And that's why it's more relevant here. So, this is very simple. And you 00:41:44.360 |
can then, if you're unsure why something is calculated in a specific way, add this explain: true. 00:41:49.800 |
And then it will tell you all the details of, like, okay, we have father. And it then calculates basically 00:41:57.880 |
all the different pieces of the formula for you and shows you how it did the calculation. So, you can 00:42:01.960 |
debug that if you need to. But it's probably a bit too much output for the everyday use case. 00:42:07.640 |
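For reference, the explain flag is just an extra top-level parameter on the search, roughly:

```
GET /user_yourhandle-starwars/_search
{
  "explain": true,
  "query": {
    "match": { "quote": "father" }
  }
}
```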
And then you can customize the score if you want to. Here I'm doing a random score. So, 00:42:16.280 |
my two fathers -- this is a bit hard to show -- they will just be in random order because their score is 00:42:21.960 |
here randomly assigned. But you could do this more intelligently that you combine, like, the score and, 00:42:29.240 |
like, if you have, I don't know, the margin on the product that you sell or the rating that you 00:42:34.200 |
include that in the scoring somehow, and you can build a custom score for things like that. 00:42:38.120 |
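The random ordering is a function_score query; a sketch:

```
GET /user_yourhandle-starwars/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "quote": "father" } },
      "random_score": {}
    }
  }
}
```

Instead of random_score you could, for example, combine the text score with a field_value_factor on a rating or margin field.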
So, you can influence that any way you want. One thing that I see every now and then that is a 00:42:46.680 |
very bad idea and we'll skip this one because it's probably a bit too much. This one, by the way, 00:42:52.040 |
is the total formula that you can do or maybe I'll show you the parts that I skipped. What happens if 00:42:58.200 |
you search for two terms and they're not the same, they don't have the same relevancy? So, what the 00:43:03.560 |
calculation behind the scenes basically looks like is let's say we search for father. Father is very rare, 00:43:11.080 |
that's why it's much more relevant than your. Your is pretty common. And then we have a document that 00:43:16.040 |
contains your father. It's kind of like this axis. This will be the best match. But will a document 00:43:22.680 |
that only contains father be more relevant or only your? Intuitively, the one with just father will 00:43:29.800 |
be more relevant. But how does it calculate that? It basically calculates like this is the relevancy of 00:43:35.880 |
father. This is the ideal document and this is your. And then it looks like which one has the shorter 00:43:41.880 |
angle. And this is the one that is more relevant. So, if you have a multi-term search, you can figure 00:43:49.000 |
out which term is more relevant and how they are combined. And then you can also have the coordination 00:43:54.040 |
factor which basically rewards documents containing more of the terms that you're searching for. So, 00:43:59.080 |
if I'm searching for three terms like I am father, whatever. If a document contains all three, this 00:44:09.320 |
will be the formula that combines the scores of all three and multiplies it by three divided by three. 00:44:14.920 |
If it only contains two of them, it would only have the relevancy of 2/3 and with one 1/3. And then you 00:44:21.080 |
put it all together and this is the formula that happens behind the scenes and you don't have to do that 00:44:25.800 |
in your head, luckily. Cool. We have seen these. One thing that we see every now and then is that 00:44:34.600 |
people try to translate the score into percentages. Like you say, this is a 100% score and this is only like 00:44:44.120 |
a 50% match. Who wants to do that? Hopefully nobody, because the Lucene documentation is pretty 00:44:52.680 |
explicit about that. You should not think about the problem in that way, because it doesn't work. 00:44:59.480 |
And I'll show you why it doesn't work or how this breaks. Let's take another example. 00:45:05.560 |
Let's say we take this short text: these are my father's machines. I couldn't think of a good Star Wars quote to 00:45:18.280 |
use here, but bear with me. So what remains if I run this through my analyzer? My father machine. 00:45:23.720 |
These are the three tokens that remain. Now, I will store that. You remember the three tokens that we have 00:45:31.160 |
stored. And if I search for my father machine, you might be inclined to say this is the perfect score. 00:45:43.960 |
This is like 100%. Agreed? Because all the three tokens that I have stored in these are my father's 00:45:51.400 |
machines are there. So this must be like my perfect match. So it's 3.2, that would be 100%. 00:45:57.080 |
The problem now is every time you add or remove a document, the statistics will change and your score 00:46:03.400 |
will change. So if I delete that document and I search the same thing again, I don't know what percentage 00:46:10.520 |
this is now. Is this now the new 100% the best document or is this a zero point or, I don't know, 00:46:15.800 |
20%? How does this compare? And then you can play funny tricks where these droids are my father's 00:46:24.840 |
father's machines. And you can see I have a term frequency of 2 for father here. So if I store that one 00:46:32.200 |
then and then search it, is this now 100%, is this now 110%? So don't try to translate scores into 00:46:44.200 |
percentages. They're only relevant within one query. They're also not comparable across queries. They're 00:46:50.200 |
really just sorting within one query to do that. Okay. Let me get rid of this one again. 00:46:57.400 |
Now, we've seen the limitations of keyword search. We don't want to define our synonyms. We might want 00:47:06.840 |
to extract a bit more meaning. So we'll do some simple examples to extend. I will add, from OpenAI, 00:47:17.400 |
the text-embedding-3-small model. I'm basically connecting that, via the inference API for text embeddings, 00:47:24.520 |
here in my instance. I have removed the API key. You will need to use your own API key if you want to 00:47:30.760 |
use it. But it is already configured. So let me pull up the inference services that we have here. I have 00:47:38.280 |
done -- or I have added two different models. One sparse, one dense. Let's go to these. By the way, 00:48:49.000 |
if you still try to turn this into a 100% score: don't. It will just not work. Okay. 00:47:58.520 |
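Setting up the two endpoints looks roughly like this on recent Elasticsearch versions; the endpoint names are placeholders and you need your own OpenAI API key:

```
PUT /_inference/text_embedding/openai-embeddings
{
  "service": "openai",
  "service_settings": {
    "api_key": "<your-api-key>",
    "model_id": "text-embedding-3-small",
    "dimensions": 128
  }
}

PUT /_inference/sparse_embedding/my-elser
{
  "service": "elser",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  }
}

GET /_inference/_all
```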
Not everybody has worked with dense vectors, right? So I have a couple of graphics coming back to our 00:48:05.480 |
Star Wars theme, just to look at how that works. So what you do with dense vectors is we keep this 00:48:12.840 |
very simple. This one just has a single dimension. And it has, like, the axis is pretty much like 00:48:20.280 |
realistic Star Wars characters and cartoonish Star Wars characters. And this one falls on the realistic 00:48:25.560 |
side and that other one is just cartoonish. And you have a model behind the scenes that can rate those 00:48:31.720 |
images and figure out where they fall. Now, in reality, you will have more dimensions than one. 00:48:38.760 |
And you will also have floating point precision. So it's not just, like, minus one, zero, or one. 00:48:45.400 |
But you will have more dimensions. So, for example, here we have human versus machine. In a realistic model, 00:48:55.000 |
you don't have that -- the dimensions are not labeled as nicely and clearly understandable. The machine has 00:48:59.800 |
learned what they represent. But they're not representing an actual thing that you can extract like that. 00:49:04.680 |
But in our simple example here, now, we can say this Leia character is realistic and a human versus, 00:49:14.200 |
I don't know, the Darth Vader is cartoonish and, I don't know, somewhere between human and machine. 00:49:23.080 |
So this is the representation in the vector space. And then you could have, like I said, 00:49:27.080 |
you could have floating point values and then you can have different characters. And similar characters, 00:49:32.840 |
like, both of those are human. Without the hand, he's only, like, not quite as human anymore, 00:49:38.200 |
so he's a bit lower down here. So he's a bit closer to the machines. So you can have all of your entities in 00:49:47.960 |
this vector space. And then if you search for something, you could figure out, like, 00:49:51.880 |
which characters are the closest to this one. And again, in reality, you will have hundreds of 00:49:57.880 |
dimensions. It will be much harder to say, like, these are the explicit things and this is why it 00:50:03.320 |
works like that. It will depend on how good your model is in interpreting your data and extracting the 00:50:09.960 |
right meaning from it. But that is the general idea of dense vector representation. You have your documents 00:50:18.200 |
or sometimes it's like chunks of documents that are represented in this vector space and then you 00:50:24.360 |
try to find something that is close to it for that. Does that make sense for everybody or any specific 00:50:31.800 |
questions? So it's a bit more opaque, I want to say. It's not quite as easy because you say, like, 00:50:39.720 |
these five characters match these other five characters here. But you need to trust or evaluate 00:50:45.880 |
that you have the right model to figure out how these things connect. So let's see how that looks like. 00:50:56.280 |
I have one dense vector model down here. We have OpenAI embedding. This one is a very small model. It 00:51:05.800 |
only has 128 dimensions. The results will not be great, but for demonstrating it is actually 00:51:14.760 |
helpful. So we'll see that. The other model that we have, and let me show you the output of that. So if I 00:51:20.760 |
take my text, these are not the droids you are looking for, this is the representation. It's basically 00:51:26.840 |
an array of floating point values that will be stored and then you just look for similar floating point 00:51:32.600 |
values. And then you have these are not the droids you are looking for. Here on the previous one, 00:51:37.320 |
dense text embedding. This one here does sparse embedding. For sparse, the main model used 00:51:47.720 |
is called SPLADE. Ours, we call it ELSER; it's kind of like a slightly improved SPLADE, 00:51:56.280 |
but the concept is still the same. What you get is, you take your words, and this is not just a TF-IDF. 00:52:05.960 |
This is a learned representation where I take all of my tokens and then expand them and say, like, for this text, 00:52:14.680 |
these are all the tokens that I think are relevant. And this number here tells me how relevant they are. 00:52:20.920 |
Again, not all of these make sense intuitively. And you might get some funky results, for example, 00:52:29.720 |
with foreign languages. This currently only supports English. But these are all the terms that we have 00:52:37.240 |
extracted. Normally, yeah, you get, like, 100-something or so. So, the idea is that this text is represented 00:52:46.920 |
by all of these tokens. And the higher the score here, the more important it is. And what you will do is, 00:52:52.920 |
you store that behind the scenes. When you search for something, you will generate a similar list, 00:52:58.680 |
and then you look for the ones that have an overlap, and you basically multiply the scores together, and the 00:53:04.120 |
ones with the highest values will then find the most relevant document. This is insofar interesting or 00:53:12.040 |
nice because it's a bit easier to interpret. It's not just, like, long array of floating point values. 00:53:17.880 |
Sometimes these don't make sense. The main downside of this, though, is that it gets pretty expensive: 00:53:25.640 |
you store a ton of different tokens here for this. When you retrieve it, the search query will 00:53:33.560 |
generate a similar long list of terms. And if you have a large enough text body, a query might hit a very 00:53:41.960 |
large percentage of your entire stored documents with these OR matches. Because, basically, these are just a lot of ORs that you combine, calculate the score, and then return the most or the highest ranking results. 00:53:54.840 |
So, it's an interesting approach. It didn't gain as much traction as dense vector models, but it can be, as a first step or an easy and interpretable step, it can be a good starting point to dive into the details here. 00:54:08.280 |
So, "these are not the droids you're looking for" is basically represented by this embedding here. 00:54:22.520 |
So, it's like this entire list of terms with this, yeah, with this relevancy, basically. This is the representation of this string. And then, when I search for something, I will generate the same 00:54:36.280 |
list and then I basically try to match the two together. Like for what has the most or the highest matches here. 00:54:57.720 |
I will create a new index. This one keeps the configuration from before, but I'm adding this semantic text for the sparse model and the dense model. 00:55:21.000 |
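Roughly, the new index adds two semantic_text fields fed from the existing quote field via copy_to (names are placeholders, the custom analyzer settings from before are omitted here, and semantic_text needs a fairly recent Elasticsearch version):

```
PUT /user_yourhandle-starwars-semantic
{
  "mappings": {
    "properties": {
      "quote": {
        "type": "text",
        "copy_to": ["quote_sparse", "quote_dense"]
      },
      "quote_sparse": {
        "type": "semantic_text",
        "inference_id": "my-elser"
      },
      "quote_dense": {
        "type": "semantic_text",
        "inference_id": "openai-embeddings"
      }
    }
  }
}
```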
So, I've created this one. And now I'll just put three documents. I have my other index. 00:55:26.600 |
As you can see here, it says three documents were moved over. 00:55:30.600 |
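Moving the existing documents over is a _reindex call, roughly:

```
POST /_reindex
{
  "source": { "index": "user_yourhandle-starwars" },
  "dest": { "index": "user_yourhandle-starwars-semantic" }
}
```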
So, we can then start searching here. And if I look at that, the first document is still, 00:55:36.680 |
these are not the droids you're looking for. You don't see, like for the, for a keyword search, 00:55:41.480 |
you don't see the extracted tokens here. We also don't show you the dense vector representation or the 00:55:47.080 |
sparse vector representation. Those are just stored behind the scenes for querying, but there's no real 00:55:53.080 |
point in retrieving them because you're not going to do anything with that huge array of dense vectors. 00:55:58.040 |
It will just slow down your searches. You can look at the mapping and you can see I'm basically copying my 00:56:06.920 |
existing quote field to these other two, so that I can also search those. 00:56:10.920 |
Okay. So, if I look for machine on my original quote, will it find anything? 00:56:21.080 |
No, because it only had -- these are not the droids you're looking for. And this is still the keyword search. 00:56:32.040 |
It doesn't work, shouldn't work. That's exactly the result that we want out of this here. Now, 00:56:48.040 |
if I use ELSER and I say machine, then it will match here: these are not the droids you're looking for. 00:56:55.400 |
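The two searches side by side, roughly (field names as in the mapping sketch above):

```
GET /user_yourhandle-starwars-semantic/_search
{
  "query": {
    "match": { "quote": "machine" }
  }
}

GET /user_yourhandle-starwars-semantic/_search
{
  "query": {
    "semantic": {
      "field": "quote_sparse",
      "query": "machine"
    }
  }
}
```

The lexical match finds nothing; the semantic query against the ELSER-backed field brings the droids quote back on top.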
And you can see this one matches pretty well, I don't know, at 0.9. But it also has some overlap, 00:57:01.800 |
with no I am your father. I mean, it is much lower in terms of relevance. But something had an overlap 00:57:08.360 |
here. And only the third document, Obi-Wan never told you what happened to your father. Only that one 00:57:15.320 |
is not in our result list at all. But there was something here. I don't know the expansion. We would 00:57:21.000 |
need to basically run -- where was it? We would need to run this one here for all the strings and look 00:57:30.840 |
then for the expansion of the query. And then there would be some overlap, and that's how we retrieve 00:57:39.400 |
You could define a threshold. It will, though, depend -- let's see. 00:57:46.040 |
These are not the droids you're looking for. Let's say if I -- I'm not sure if this will change anything. 00:57:56.840 |
I mean, the relevance here is still -- it's still 10x or so. But, yeah, this one still just has a 00:58:12.440 |
very low score for the terms you look for. The score just totally jumps around. It's a bit hard to define the threshold. 00:58:19.320 |
Because here you can see, in my previous query, we might have said 0.2 is the cutoff point. But now it's 00:58:28.760 |
actually 0.4, even though it's not super relevant. So it might be a bit tricky, or you might need to 00:58:34.920 |
have a more dynamic threshold depending on how many terms you're looking for and what is a relevant result. 00:58:40.440 |
In the bigger picture, the assumption would be if you have hundreds of thousands or even millions of 00:58:46.920 |
documents, you will probably not have the problem that anything that is so remotely connected will 00:58:53.320 |
actually be in the top 10 or 20 or whatever list that you want to retrieve. So for larger proper data 00:59:00.360 |
sets, this should be less of an issue. With my hello world example of three documents, it can be a bit 00:59:06.520 |
misleading. But, yes, you can have a cutoff point if you figure out what for your data set and your 00:59:11.160 |
queries is a good cutoff point. You could define the cutoff point. 00:59:14.360 |
No, sorry, you have three documents. How come it's only showing two? Is it because of -- 00:59:18.280 |
So the query gets expanded into, I don't know, those 100 tokens or whatever. And then for those two, 00:59:25.640 |
there is some overlap, but the third one just didn't have any overlap. But I -- so we -- okay, we can do that. 00:59:33.400 |
It's just a bit tricky to figure out that the term that has the overlap. So we will need to take this 00:59:40.280 |
one, machine -- no, I am your father. Let's take this one. What you need to do is to figure that one 00:59:48.760 |
out. I don't know, actually, we should be able -- let me see. Let's see. 01:00:09.320 |
This is a pretty long output. Somewhere I was actually hoping that it would show me the term that has 01:00:20.040 |
matched here. Okay. I see something -- okay, there is something puppet that seems to be the overlap. 01:00:28.360 |
How much sense that term expansion for the stored text and the query text makes is a bit of a different 01:00:35.880 |
discussion. But in here with that explained true, you can actually see how it matched and what happened 01:00:41.880 |
behind the scenes. If you have any really hard or weird queries or something that is hard to explain, 01:00:46.760 |
to debug that. But the third one didn't match. Now, if I take the dense vector model with OpenAI and I 01:00:54.280 |
search for machine, how many results do you expect to get back from this one? 0, 1, 2, 3. Yes, 3. Why 3? 01:01:08.120 |
Yes, because there's always some match. That is the other -- or let me run the query first. 01:01:19.240 |
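Same query shape, just pointed at the dense field, roughly:

```
GET /user_yourhandle-starwars-semantic/_search
{
  "query": {
    "semantic": {
      "field": "quote_dense",
      "query": "machine"
    }
  }
}
```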
These are not the droids you're looking for. This one is the first one. I don't think that this model 01:01:23.880 |
is generally great because here the results are super close. It is -- I mean, the droids with the 01:01:29.320 |
machines, that is the first one. But the score is super close to the second one, which is no, I am 01:01:34.520 |
your father, which feels pretty unrelated. And Obi-Wan never told you what happened to your father. Even 01:01:39.320 |
that one is still with a reasonably close score. But why do we have those? Because if we say, 01:01:49.080 |
what is the relevance, I mean, it's further away, but there's always kind of like 01:01:55.960 |
some angle to it -- the angle here, or depending on the similarity calculation that 01:02:01.800 |
you do -- but it's still always related. There is no easy way to say something is totally unrelated. 01:02:07.080 |
That is, by the way, one good thing about keyword search where it was relatively easy to have a cutoff 01:02:14.520 |
point of things that are totally not relevant, where you're not going to confuse your users. 01:02:18.600 |
Whereas here, if you don't have great matches, you might get almost -- it's not random, but it's 01:02:25.000 |
potentially -- it looks very unrelated to your end users what you might return. 01:02:31.000 |
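To illustrate why a dense vector query always returns something, here is a toy sketch: cosine similarity between any two non-zero vectors is just a number, so even an unrelated document still gets a score. The vectors below are made up, not real embeddings.

```python
# Toy illustration: cosine similarity always yields a value, so dense vector
# search has no built-in notion of "no match". Vectors are made up, not real embeddings.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([0.9, 0.1, 0.0])
docs = {
    "related document":   np.array([0.8, 0.2, 0.1]),
    "unrelated document": np.array([0.1, 0.2, 0.9]),
}
for name, vec in docs.items():
    # both get a score; the unrelated one is just lower, never "absent"
    print(name, round(cosine(query, vec), 3))
```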
Is it fair to say, then, that the OpenAI embedding search is worse for this kind of toy example? 01:02:39.880 |
I'm careful with worse, because it's really a hello world example, so I don't take this as a 01:02:47.080 |
quality measurement in any way. I -- yeah, I mean, the OpenAI model with 128 dimensions is very few 01:02:54.520 |
dimensions. I think it will probably be cheap, but not give you great results necessarily. But don't use 01:02:59.880 |
this as a benchmark. I think it's just a good way to see that this is now much harder, because now you 01:03:07.400 |
need to pick the right machine learning model to actually figure out what is a good match. With 01:03:12.760 |
keyword-based search, it was a bit of a different story. There you need to pay more attention to, 01:03:16.680 |
like, how do I tokenize, and do I have the right language, and do I do stemming or not stemming. 01:03:22.120 |
But most of that work is relatively, I want to say, almost algorithmic, and then you can figure that 01:03:28.120 |
out, and you configure it, and then it's very predictable at query time. Whereas with the dense 01:03:33.640 |
vector representation, you really need to evaluate for the queries that you run and the data that you 01:03:39.320 |
have, like, is that relevant, and is this an improvement or not? It's very easy to get going and 01:03:45.320 |
just throw a dense vector model together, and you will always match something. 01:03:51.080 |
That might be an advantage over the lexical search where you don't have any matches, which 01:03:55.160 |
sometimes is the other problem -- that nothing comes back and you would want to have at least some 01:03:59.480 |
results. Here it might just be unrelated. So that can be tricky. That you want to have some results is, 01:04:09.160 |
by the way, a funny story that the European e-commerce store once told me, they said they accidentally 01:04:15.880 |
deleted, I think, two-thirds of their data that they had for the products that you could buy. 01:04:21.320 |
And then I asked them, like, okay, so how much revenue did you lose because of that? And they 01:04:26.120 |
said, basically nothing, because as long as you showed some somewhat relevant results quickly enough, 01:04:32.920 |
people would still buy that. So only if you have no results, that's probably the worst. 01:04:37.080 |
So for an e-commerce store, you might want to show stuff a bit further out, because people might still 01:04:42.840 |
buy it. But it really depends on the -- I'm coming to you in a moment -- it really depends on your use 01:04:47.960 |
case. E-commerce is kind of like one extreme where you want to show always something for people to buy. 01:04:53.800 |
If you have a database of legal cases or something like that, you probably don't want that approach, 01:05:00.040 |
because that will go horribly wrong. So it is very domain specific. That's, I think, 01:05:06.200 |
also the good thing about search, because it keeps a lot of people employed, because it's not an easy 01:05:10.280 |
problem. It's almost job security, because it depends much on the -- this is the data that you have, 01:05:16.840 |
and this is the query that people run, and this is the expectation of what will happen, and this 01:05:20.680 |
is for this domain, the right behavior. So there's no easy right or wrong with the checkbox. And the 01:05:27.720 |
other thing is you might make -- if you tune it, you might make it better for one case, but worse for 20 01:05:32.440 |
others. That's why a robust evaluation set is normally very important, though very rare. A lot of people 01:05:39.960 |
YOLO it, and you will see that in the results. And for the e-commerce store, it probably works well enough. 01:05:44.520 |
Sorry, you had a question. Can I limit the semantic enrichment to a subset of my index based off 01:05:51.320 |
the properties of the document? So if I have a very large shared index with a lot of customers, 01:05:56.280 |
and I want to enable AI for a subset of the index, can I say, hey, only do the semantic enrichment if the 01:06:03.080 |
document has this property where maybe it's like an AI customer? Yeah, so the way we would do it in 01:06:09.800 |
our product is that you would probably have two different indices with different mappings. 01:06:15.560 |
Yeah, but then it's not so fun, like the customer upgrades and I have to migrate them to the new 01:06:21.160 |
index. Dave, please. Yeah, so if you, for example, have an index in Elasticsearch, you can think of it 01:06:31.240 |
almost like a sparse table, right? So there's no penalty for having a field that is not populated. 01:06:36.920 |
So either in your application or an ingest processor, you could have an inference statement and say, 01:06:42.200 |
Yeah, that's how we do it now. No, we'll only move it over. 01:06:45.000 |
With this automatic way where you kind of turn it off. 01:06:47.800 |
No, the problem is the data structure, like if the field is there, so the data structure that we build 01:06:52.440 |
in the background is called HNSW. And either we build a data structure or we don't build it. 01:06:57.560 |
Yeah, so if you had, you know, 10 billion entries in your vector index, your index is set up for vectors, 01:07:07.560 |
right? And you just don't populate the thing that is either putting in a dense vector or triggering the 01:07:14.920 |
inference to create a dense vector to put into there, then it's just going to be, you know, 01:07:20.120 |
the index is just a bunch of pointers and none of them head towards the HNSW and it won't 01:07:24.600 |
show up in search results. The penalty is nothing, right? But you're going to have 01:07:30.120 |
to manage what does or does not create the vector. You could do that in an ingest processor by just 01:07:36.680 |
saying, Hey, we're going to use the copy command to have two copies of the text, one that's meant for 01:07:42.120 |
non-vector indexing, one that's meant for actual vector indexing. You'd have to manage that with some 01:07:46.920 |
tricky, complex AI technology called if-then-else, right? Somewhere inside of your ingesting pipeline, 01:07:54.120 |
then it would work just fine. Yeah. One more question. When we did HNSW last week, 01:08:02.440 |
we found that it was extremely slow at write times and the community suggested that we freeze our index 01:08:08.920 |
if we were going to use HNSW. Force merge or? Yeah, I think just freeze writes. They said 01:08:17.160 |
build the index and freeze it, otherwise you'll put a ton of load on the computer. I mean, yes. 01:08:21.960 |
What we found is that some of the defaults that have been around in Elasticsearch for 10 01:08:26.840 |
years -- settings like the merge scheduler -- really optimize for keyword search, and for high update 01:08:33.400 |
workloads on HNSW, we've got some suggestions. They take a little bit of parameter tuning to go 01:08:39.960 |
and find something right for your IOPS and for your actual update workload. So sometimes it's about the 01:08:44.920 |
merge scheduler and not doing kind of an inefficient HNSW build when it's not important for the use case. 01:08:51.480 |
Okay. The other thing you'd say is that sometimes friends don't let friends run Elasticsearch 8.11, 01:08:58.920 |
upgrade, upgrade, upgrade. They put a lot of optimization work in here. It should be simple. 01:09:03.240 |
That's great. So the reason -- It used to be that. 01:09:06.120 |
The reason why that is, it's like merging -- so because you have the immutable segment structure in 01:09:14.040 |
Elasticsearch. And HNSW, you cannot easily merge. You basically need to rebuild them. The one trick -- I forgot which 01:09:21.320 |
version it was. I'm not sure, Dave, if you remember. I think it was even before 8.11. But basically, 01:09:26.120 |
if we do a merge, we would take the largest segment with not deleted documents and basically plop the 01:09:31.880 |
new documents on top of them rather than starting from scratch from two HNSW data structures. There's 01:09:37.560 |
another optimization somewhere now in 9.0 that will make that a lot faster. So it really depends on the 01:09:44.040 |
the version that you have. And there are a couple of tricks that you can play. But yeah, 01:09:48.760 |
that is one of the downsides of like the way immutable segments work and HNSW is built, 01:09:54.840 |
that you can't easily just merge it as easily together as other data structures because you really 01:09:59.480 |
need to rebuild the HNSW data structure or like take the largest one and then plop the other one in. 01:10:04.520 |
Okay. Some of the things where we just like -- we fixed it in the next version of Lucene, so I want you to -- 01:10:09.320 |
We found like HNSW was too slow and then HNSW broke our CPU and then we moved to find them, 01:10:21.640 |
So for like traditional document search, you know what I'm saying, like, hey, 01:10:25.960 |
please find me a document that contains my search query, right? For R in the context of RAG, 01:10:32.200 |
it might be something more like, hey, come up with a fun plan for my weekend, right? And then the 01:10:37.480 |
documents that we want to find don't necessarily look like the search query, right? Yeah. 01:10:41.480 |
So like one approach to that is you just give -- it's an agent and you give it a search tool, 01:10:46.840 |
and it searches, right? So like I'm just curious what you -- how do you think about that in general? 01:10:51.160 |
Yeah. I feel like RAG has been very heavily abused. It's -- or like the mental model I 01:10:56.280 |
think started off as like you do retrieval and then you do the generation, but you could do the 01:11:00.360 |
generation earlier on as well, that you do the rewriting and expanded query. So I -- my favorite 01:11:07.560 |
example for that is you're looking for a recipe. You don't need to have the LLM regenerate the recipe. 01:11:16.200 |
You just want to find the recipe. But maybe you have a scenario where you forgot what the thing is 01:11:20.600 |
called that you want to cook. And then you could use the LLM, for example, to tell you what you're 01:11:24.600 |
looking for. Like you say, like, oh, I'm looking for this Italian dish that has like these layers of 01:11:31.640 |
pasta and then some meat in between. And then the LLM says, oh, you're looking for lasagna. And then 01:11:36.280 |
you basically do the generation first or a query rewriting and then search and then get the results. 01:11:42.440 |
as a very explicit example here. Your example would look very different and probably smarter than my 01:11:49.480 |
example. But query rewriting is one thing. There's also this concept of HyDE, where your documents and 01:11:58.840 |
your queries often look very different. And that you use an LLM to generate something from the query that 01:12:04.200 |
looks closer to the documents that you have. And then you match the documents together because they're 01:12:09.560 |
more similar in structure. So there are all kinds of interesting things that you can do. Like I said 01:12:15.640 |
earlier, it depends is becoming a bigger and bigger factor. But, yeah, your use case 01:12:21.080 |
might be, yeah, maybe a multi-step retrieval where you figure out, like, oh, you look, I don't know, 01:12:29.080 |
I know the example from an e-commerce store where it's like, I'm going to a theme party from the 1920s, 01:12:36.520 |
give me some suggestions. And then the LLM will need to figure out, like, what am I searching for? 01:12:40.680 |
And then it can retrieve the right items and rewrite the query and then actually give you proper 01:12:44.920 |
suggestions. But it's not just running a query anymore. Yeah? 01:12:59.000 |
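As a rough sketch of that "generate first, then retrieve" flow from the lasagna example: the rewrite_query function below is a hypothetical stand-in for whatever LLM call you use, and the index and field names are made up.

```python
# Rough sketch of query rewriting before retrieval (the lasagna example above).
# rewrite_query is a hypothetical stand-in for an LLM call; index/field names are made up.
import requests

def rewrite_query(user_text: str) -> str:
    # Stand-in for an LLM that turns a vague description into concrete search terms,
    # e.g. "Italian dish with layers of pasta and meat" -> "lasagna".
    return "lasagna"

user_text = "Italian dish with layers of pasta and some meat in between"
search_terms = rewrite_query(user_text)

resp = requests.post(
    "http://localhost:9200/recipes/_search",
    json={"query": {"match": {"title": search_terms}}},
    timeout=10,
)
print([hit["_source"]["title"] for hit in resp.json()["hits"]["hits"]])
```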
Along with your theory, you can say, like, this is the kind of thing I am doing. 01:13:07.000 |
Like, you can have document query embeddings. 01:13:13.500 |
We try to embed queries from documents we're going to find, like the text. 01:13:18.500 |
You can have instruction when you embed models that, instead of saying, like, I'm actually 01:13:26.400 |
And then, I don't know, why do you think you have a problem? 01:13:37.900 |
How should we be thinking about the number of dimensions in the embedding model? 01:13:41.400 |
Is, like, a 512-dimensional model necessarily better than a 128-dimensional one? 01:13:51.300 |
That feels almost like a blast from the past. 01:13:53.240 |
I remember, like, two or three years ago, there was this big debate of, like, how many dimensions 01:13:57.800 |
does each data store support and, like, how many dimensions should you have? 01:14:02.000 |
And at first, it looked like, oh, more dimensions is always better. 01:14:04.300 |
But then it turned out more dimensions are very expensive. 01:14:07.240 |
So, it really depends on the model and what you're trying to solve. 01:14:10.300 |
Like, if you can get away with fewer dimensions, it's potentially much cheaper and faster. 01:14:15.240 |
But I don't think there is a hard rule. Like, maybe the model with more dimensions can express 01:14:21.840 |
more because it just has more data and then that will come in handy. 01:14:26.580 |
But maybe it's not necessary for a specific use case and then you're just wasting a lot of resources. 01:14:30.740 |
I don't think there is an easy answer to say, like, yes, for this use case, you need at least 01:14:40.660 |
But it depends on the model, how many dimensions it will output, and then maybe you have some 01:14:44.880 |
quantization in the background to reduce that again -- either the number of dimensions or the precision of each dimension. 01:14:52.940 |
So there are a lot of different tradeoffs in that performance consideration. 01:14:55.980 |
But it will mostly rely on, like, how well does the model work for the use case that you're solving. 01:15:27.620 |
So I want to say historically what you would do is you would have a golden data set and then 01:15:33.760 |
you would know what people are searching for and then you would have human experts who rate the results. 01:15:38.380 |
And then you run different queries against it and then you see, like, is it getting better or worse. 01:15:44.500 |
Now LLMs open a new opportunity, where you might have human experts in the loop to help them out 01:15:50.360 |
a bit, but the LLMs might actually be good at evaluating the results. 01:15:55.200 |
So almost nobody has, like, the golden data set to test against. 01:16:00.420 |
But you can either look at the behavior of your end users and try to infer something 01:16:05.600 |
from that, or you have an LLM that evaluates, like, what you have, or you have a human together with the LLM. 01:16:14.900 |
So you have various tools, but, again, it depends -- really not an easy question. 01:16:25.040 |
Maybe you can get away with something simple. 01:16:26.680 |
So the classic approach I want to say is, like, you looked at the clickstream of how your users 01:16:31.520 |
behaved, and then you saw, like, if they clicked on the first or up to the third result, 01:16:36.200 |
the result was potentially good, if they didn't just go back and then click on something else. 01:16:42.220 |
If they don't click on anything and just leave, it might be very bad. 01:16:45.680 |
If they go to the second or third page, it might also not be great. 01:16:48.720 |
So there are some quality signals that you can infer from that or you really look into 01:16:52.540 |
the quality aspect and try to evaluate, like, what people were doing and how it behaves. 01:16:58.420 |
But you can make this from relatively simple to pretty complicated. 01:17:17.100 |
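A tiny sketch of the kind of clickstream signal described here; the session format and thresholds are made up, but it shows the idea of labeling sessions from where, or whether, users clicked.

```python
# Tiny sketch of inferring quality signals from a clickstream: a click in the top
# three is a good sign, no click is a bad one, a click far down is a weak result.
# The session format and thresholds here are made up.
sessions = [
    {"query": "droids", "clicked_rank": 1},
    {"query": "father", "clicked_rank": 12},
    {"query": "machine", "clicked_rank": None},  # user left without clicking
]

def label(clicked_rank):
    if clicked_rank is None:
        return "bad: no click at all"
    if clicked_rank <= 3:
        return "good: clicked in the top 3"
    return "weak: clicked far down the list"

for s in sessions:
    print(s["query"], "->", label(s["clicked_rank"]))
```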
Obviously, if I search for that word, the query expansion will find my father example. 01:17:27.720 |
And this one here will still, again, match my droids pretty much like the OpenAI example. 01:17:36.720 |
One thing that I wanted to show you what is also happening behind the scenes here, this is 01:17:41.100 |
a very long segment, like, it's a lot of information with different speakers. 01:17:47.600 |
What I have created here, though, is we have created multiple chunks behind the scenes. 01:17:54.640 |
And if I search for that, I think looking for murder in the Skywalker saga works pretty well 01:18:03.820 |
It finds the document that I have retrieved, but it can also highlight -- so here I say, 01:18:09.760 |
show me the fragment that actually matched best here. 01:18:13.760 |
And if I search here for murder, it didn't find anything. 01:18:18.700 |
But I think the term that it found was in this highlighted segment here -- it found kill and similar terms. 01:18:28.380 |
So here I have broken up my long text field into multiple chunks and there are multiple chunking strategies. 01:18:34.320 |
You can do that by page, by paragraph, by sentence. 01:18:39.120 |
You could do it overlapping or not overlapping. 01:18:43.280 |
Which of the many strategies works best will depend on how you want to retrieve and on your use case. 01:18:48.540 |
But you want to kind of like reduce the context per element that you're matching because there's 01:18:53.140 |
only so much context that a dense vector representation can hold. 01:18:57.400 |
So you want to chunk that up -- especially if you have like a full book, you want to break it up. 01:19:04.120 |
And then find the relevant part where the match is. 01:19:09.720 |
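In the demo this chunking happens behind the scenes, but as a minimal sketch of one manual strategy -- fixed-size word windows with overlap, with arbitrary sizes:

```python
# Minimal sketch of one chunking strategy: fixed-size word windows with overlap.
# The sizes are arbitrary; chunking by page, paragraph, or sentence, with or
# without overlap, are equally valid choices depending on the use case.
def chunk_words(text: str, size: int = 100, overlap: int = 20) -> list[str]:
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

book = "A long time ago in a galaxy far, far away there was a story. " * 50
for i, chunk in enumerate(chunk_words(book, size=40, overlap=10)):
    print(i, len(chunk.split()), "words")
```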
The point in this query here is also to show you, I didn't define any chunks. 01:19:14.740 |
I didn't say like, okay, send this representation of a dense vector there and then match against it when it comes back. 01:19:22.900 |
This is all happening behind the scenes just to make this easier. 01:19:25.500 |
So the entire behavior here is still very similar to the keyword matching even though there's 01:19:30.220 |
a lot more magic happening behind the scenes. 01:19:39.940 |
How does everybody feel about long JSON queries? 01:19:45.940 |
We'll see about alternatives and maybe we can make this a bit simpler again. 01:19:50.940 |
But let me show you one more way of looking at it. 01:19:57.940 |
They're a more powerful mechanism to actually combine different types of searches. 01:20:05.280 |
Combining different types of searches, let me get from my slides actually. 01:20:08.660 |
When we talk about combining searches and how this all plays together. 01:20:12.660 |
This is kind of my little interactive map of what you do when you do retrieval or what your 01:20:20.660 |
We started here in the lexical keyword search and then we run the match query and we're matching 01:20:31.180 |
This, often combined with some rank features, is what we call full text search. 01:20:37.820 |
The rank features could be either you extract a specific signal or it could also be something, 01:20:42.380 |
however you influence that ranking, it could be the margin on the product, how many people have clicked on it before. 01:20:49.900 |
There are many different signals that you could include, not just with the match of the text, 01:20:55.820 |
but any other signals that you want to combine for retrieving that. 01:20:59.580 |
And then you have full text search as a whole. 01:21:02.460 |
On top of that, I kept it to the side here, you might have a Boolean filter where you have a hard 01:21:10.620 |
include or exclude of certain attributes, this does not contribute to the score, this is just like 01:21:16.780 |
black and white, this is included or excluded, whereas this here calculates the score for you, 01:21:24.220 |
And then this was kind of like the algorithmic side. 01:21:29.660 |
And then we have this machine learning, the learn side or the semantic search where you have a model 01:21:35.100 |
behind the scenes split into the dense vector embeddings and the sparse vector embeddings 01:21:41.900 |
for vector search or learned sparse retrieval, I think those are the two common terms. 01:21:47.180 |
And the interesting thing is, all of these, including the learned sparse one, use a sparse vector 01:21:56.860 |
representation in the background, and only this one here is a dense vector representation. 01:22:03.180 |
And then when you combine any grouping down here for one search, this is then what we 01:22:13.420 |
would call hybrid search, even though there can be big discussions of like what is exactly hybrid search 01:22:18.460 |
or not. I will definitely stick to the definition that as soon as you combine more than one type of 01:22:23.900 |
search -- it could be sparse and dense, or dense and keyword, or maybe two 01:22:30.140 |
dense vector searches -- it's hybrid search, because you have multiple approaches. And then you can either 01:22:37.500 |
boost them together, you could do re-ranking which is becoming more and more popular. One thing that we 01:22:43.020 |
lean heavily into is RRF, which is reciprocal rank fusion, that doesn't rely on the score but on the 01:22:51.180 |
position of each document in each search mechanism. So it basically says, like, the lexical search had this 01:22:57.340 |
document at position four and the dense vector search had it at position two, and then it kind of like evens out the 01:23:03.740 |
position and gives you an overall position by blending them together rather than looking at the individual 01:23:09.020 |
scores because they might be totally different. So this is kind of like the information retrieval 01:23:15.420 |
map overall and we have, okay, we didn't do a lot of filters, but I think filters are intuitively 01:23:21.180 |
relatively clear that you just say like I'm only interested in users with this ID or whatever other 01:23:26.220 |
criteria. It could be a geo-based filter like only things within 10 kilometers or only products that came 01:23:31.740 |
out in the last year. Like a hard yes or no. All the others will give you a value for the relevance and then you 01:23:42.140 |
can blend that potentially together to give you the overall results. That is kind of like the 01:23:47.180 |
total map of search. Can you give an example of the signal one too? 01:23:54.300 |
Yeah, for signal, so we have our own data structure for these rank features. It could be, for example, 01:24:01.180 |
the rating of a book and then you combine the keyword match for, I don't know, you search for murder 01:24:15.340 |
mysteries, but then another feature would be how well they are rated and then you would see that. Or it 01:24:22.700 |
could be your margin on the product or the stock you have available and you would want to show the product 01:24:28.060 |
where you have more in stock. Or it might even be a simple like a click stream like what have people 01:24:34.220 |
clicked before. There are a lot of different signals that you could include in all of this searching then. 01:24:39.420 |
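As a hedged sketch of how such a signal can be wired up in Elasticsearch, using the rank_feature field type and query; the index name, field names, and the boost value are made-up placeholders.

```python
# Sketch of combining a text match with a ranking signal via a rank_feature field.
# Index name, field names, and the boost value are made-up placeholders.
import requests

# Mapping (done once): the rating is stored as a rank_feature field.
mapping = {"mappings": {"properties": {
    "title":  {"type": "text"},
    "rating": {"type": "rank_feature"},
}}}
requests.put("http://localhost:9200/books", json=mapping, timeout=10)

# Query: the text match drives relevance, the rating nudges better-rated books up.
body = {"query": {"bool": {
    "must":   [{"match": {"title": "murder mystery"}}],
    "should": [{"rank_feature": {"field": "rating", "boost": 2.0}}],
}}}
resp = requests.post("http://localhost:9200/books/_search", json=body, timeout=10)
print(resp.json()["hits"]["total"])
```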
Any other questions? Are everybody good for now? Yeah? 01:24:47.900 |
You would have to normalize them. Depending on the comparison that you do for dense vectors, it might be 01:25:13.500 |
between 0 and 1. But you saw that for the keyword search, also depending on how many words I was 01:25:19.340 |
searching for, it might be a much higher value. There is no real ceiling for that. Or you could add a 01:25:26.060 |
boost and say, like, this field is 20 times more important than this other field. There is no real 01:25:32.220 |
max value that you would have here. You could normalize the score and then basically say, like, 01:25:37.260 |
I'll take the highest value in this sub query as 100% and then reduce everything down by that factor. 01:25:44.060 |
And then I combine them. Maybe that works well. RRF is a very simple paper. I think it's like two pages. 01:25:51.020 |
And it really just takes the different positions. I think it's one divided by 60 plus the position, where 60 is a constant 01:25:57.340 |
that they figured out made sense. And then you add those values for each 01:26:03.580 |
document together. And then that value gives you the overall position. It really just, it doesn't look 01:26:10.620 |
at the score anymore, but it blends the different positions together and like how they are interleaving 01:26:20.580 |
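A small sketch of that calculation: each ranker contributes 1 / (60 + position) per document, and the sums decide the blended order. The two result lists below are made up.

```python
# Sketch of reciprocal rank fusion: each ranker contributes 1 / (k + position)
# per document (k = 60 in the paper), and documents are sorted by the summed value.
# The two result lists are made up.
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for position, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + position)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

lexical = ["doc_b", "doc_a", "doc_c"]  # order from the keyword search
dense   = ["doc_a", "doc_c", "doc_b"]  # order from the vector search
print(rrf([lexical, dense]))           # blended order, ignoring the raw scores
```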
So, just for vector search, why should I use Elasticsearch over pgvector or something like that? 01:26:34.580 |
We sync changes, probably via CDC, change data capture. 01:26:46.580 |
I was just curious, what sort of systems, like, what have you seen in production? 01:26:59.580 |
I mean, PG vector will always be there because, like, if you are already using Postgres, it's 01:27:05.580 |
I think then the question is, like, does it have all the features that you need? 01:27:13.580 |
It has some matching, but it's not the full BM25 algorithm because I don't think it keeps all the statistics. 01:27:19.760 |
It will be a question of, like, scaling out Postgres can be a problem, and then just, like, the breadth of features. 01:27:27.180 |
If you only need vector search -- I think my or our default question back to that is, like, do you really only need vector search? 01:27:36.920 |
Maybe for your use case, but for many use cases, you probably need hybrid search. 01:27:41.540 |
One area, for example, where vector search will not do great is, like, if somebody searches for a specific brand name. 01:27:48.720 |
Because there is no easy representation in most models for the specific brand and it will be hard to match. 01:27:55.160 |
So there will be very -- and also your users will be very angry when they know you have 01:27:59.300 |
this word somewhere in your documents or in your data set, but you don't give them the result. 01:28:04.500 |
So there are many scenarios where you probably want hybrid search. I feel like that's the -- we 01:28:10.580 |
started two years ago, we started with just vector search, but I feel like the overall 01:28:15.100 |
trend is coming more to hybrid search, because you probably want some sort of keyword search and 01:28:21.260 |
then you want to have that combined, probably with some model for the added benefit and extra relevance. 01:28:30.880 |
It might also depend a bit on, like, the types of queries that your users run. 01:28:34.900 |
So if your users run single word queries, like I've done in my examples, that's often not 01:28:39.860 |
really ideal for vector search, because any machine learning model lives off context, and a single word doesn't give it much. 01:28:48.080 |
So depending on that, I've seen some people build searches where it's like if you search 01:28:52.740 |
for one or two words, they do keyword search, but if you search for more, they might fall back to vector search. 01:28:57.620 |
So it depends a bit on the context what works. 01:29:00.820 |
If you really only need vector search and PG vector is small enough to do all of that, and 01:29:08.440 |
Postgres is your primary data store, then that's probably where you will do well. 01:29:12.900 |
But there are plenty of scenarios where not all of those boxes are necessarily ticked. 01:29:18.800 |
Specifically for code, let's say you have a file, and it's, like, in a git repository, 01:29:27.680 |
And so you have two options, either you embed at a file level, or you embed at a chunk level, 01:29:35.140 |
But I don't want to pay the penalty across thousands of unchanged versions, like the file hasn't changed for a 01:29:38.040 |
thousand commits, but just for this one it has changed, right? 01:29:42.500 |
So I'd get many shadow copies of the same thing, and have you seen, like, what are some, like, 01:29:48.500 |
tips and tricks that people use to not have exploding storage costs? And, like, this might not be 01:29:58.960 |
an Elastic problem, but a general vector problem, like how do I just pay the penalty once of storing 01:30:13.420 |
the same embedding, and only when it changes, I re-embed and then, uh, in that sense. 01:30:22.880 |
But so, you would create, so it's one dataset basically with thousands of files that all are 01:30:28.920 |
chunked together, and so one change would invalidate all of them, or -- 01:30:33.880 |
So, I think I have a file that is in a repository, right? 01:30:34.880 |
It has like 5,000 commits, and there's one file. It didn't change for 4,999 commits, but on the 5,000th commit, it did change, right? 01:30:47.880 |
And so, if it hasn't changed, I only want to compute the embedding for the file once, right? 01:30:53.880 |
But only when the file contents change, I need -- I want to re-ingest, right? 01:31:00.880 |
But for those 4,999 times, I don't want to store, like, this hash has the same embedding, same embedding. 01:31:11.880 |
You have seen with some of the customers, but -- 01:31:15.880 |
I think the way we might solve it is that if you create the hash of the file and use that 01:31:22.880 |
as the ID, and you only use the operation create, and it would reject any duplicate writes, 01:31:29.880 |
you would at least not ingest and then create the vector representation again. 01:31:34.880 |
You will still send it over again, and it would need to get rejected. 01:31:39.880 |
If a doc ID would have to be the hash of the file. 01:31:42.880 |
If you have that doc ID, and then you need to set the operation to just create and not update 01:31:47.880 |
or upsert, then it would just be rejected and you would only write it once. 01:31:52.880 |
I'm not sure if that is a great use case or if you might want to keep, like, I don't know, 01:31:57.880 |
an outside cache of, like, all the hashes that you've already had and deduplicate it there, 01:32:01.880 |
but that would be the Elasticsearch solution of, like, using the hash as the ID and then just rejecting duplicates. 01:32:08.880 |
That is, I think, the intuitive or most native approach that we could offer for that. 01:32:20.880 |
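A small sketch of that approach: hash the content, use it as the document ID, and index with the create operation so an unchanged file is rejected instead of re-embedded. The index and field names are made up.

```python
# Sketch of content-hash deduplication: the hash is the document ID and we use
# the _create endpoint (create, not upsert), so re-sending unchanged content is
# rejected with a 409 instead of being re-ingested and re-embedded.
# Index and field names are made-up placeholders.
import hashlib
import requests

def index_once(content: str, index: str = "code-files") -> int:
    doc_id = hashlib.sha256(content.encode("utf-8")).hexdigest()
    resp = requests.put(
        f"http://localhost:9200/{index}/_create/{doc_id}",
        json={"content": content},
        timeout=10,
    )
    return resp.status_code  # 201 on the first write, 409 for a duplicate

print(index_once("def main():\n    pass\n"))
print(index_once("def main():\n    pass\n"))  # same content: rejected the second time
```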
I think there was some other question somewhere. 01:32:23.880 |
I just wanted to add on to the Postgres question a minute ago. 01:32:43.880 |
But from what I remember, the default Postgres full-text search does not do full BM25, 01:32:53.880 |
but it only does -- it doesn't have all the statistics, I think, from what I remember. 01:33:05.880 |
Do you plan to cover more on the retriever thing, like the retriever concept? 01:33:16.880 |
I'm particularly interested in -- I mean, you're modeling -- so you have multiple -- I love the 01:33:24.880 |
I think you have multiple different candidates. 01:33:27.880 |
What kind of flexibility do you have on that? 01:33:39.880 |
Not necessarily looking at any, but I'm modeling the effect. 01:33:52.880 |
Maybe for -- before we dive into that, for everybody else, like, re-scoring is like, 01:33:57.880 |
let's say we have a million documents, and then we have one cheaper way of retrieving them, 01:34:01.880 |
and we retrieve the top, I don't know, 1,000 candidates, and then we have a more expensive 01:34:06.880 |
way, but higher quality way of actually re-scoring them, then we will run this more expensive re-scoring 01:34:12.880 |
on just the top 1,000 to get our final list of results. 01:34:18.880 |
But the re-scoring algorithm would be too expensive to run across, 01:34:22.880 |
like, a million documents; that's why you don't want to do that. 01:34:27.880 |
And that's why you might want to have the re-scoring. 01:34:30.880 |
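As a toy sketch of that two-stage idea -- a cheap first pass over everything, an expensive scorer only on the surviving candidates. Both scoring functions below are hypothetical stand-ins, not real models.

```python
# Toy sketch of two-stage retrieval: a cheap first pass selects candidates, an
# expensive scorer (think cross-encoder) re-scores only those. Both functions
# are hypothetical stand-ins, not real models.
def cheap_score(query: str, doc: str) -> float:
    return len(set(query.lower().split()) & set(doc.lower().split()))  # crude term overlap

def expensive_score(query: str, doc: str) -> float:
    return float(len(doc))  # stand-in for a slow, higher-quality model

def search(query: str, corpus: list[str], candidates: int = 1000, top: int = 10) -> list[str]:
    first_pass = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:candidates]
    rescored = sorted(first_pass, key=lambda d: expensive_score(query, d), reverse=True)
    return rescored[:top]

docs = [
    "these are not the droids you are looking for",
    "no, I am your father",
    "hello world",
]
print(search("droids", docs, candidates=2, top=1))
```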
So, yes, we have a -- you can in Elasticsearch now, you can do re-scoring because it becomes 01:34:37.880 |
I don't have a full example there, but we do have like -- we do have a re-scoring model built 01:34:50.880 |
So, we have currently the version 1 re-ranking, but we have a built-in re-ranking model now 01:34:58.880 |
So, for one of the tasks that we can do, you can see here we have the other tasks, for example, 01:35:05.880 |
the dense text embedding, now we have a re-ranking task that you can also call. 01:35:28.880 |
Let me -- somehow my keyboard binding is broken. 01:35:43.880 |
Somewhere here there should be -- so, there's learn to rank, but it should not be the only 01:36:01.880 |
Unless, Dave, you know from the top of your head where we have the right docs for this. 01:36:08.880 |
organization of our docs can't help you with -- but retrievers -- 01:36:13.880 |
Starting in 8.16 or 8.17, the retrievers API added a specific parent level retriever that 01:36:21.880 |
you -- it's, like, nested around the outside of the retrievers that go inside it. 01:36:26.880 |
And it's specifically called the text re-ranking retriever. 01:36:31.880 |
And so if you have a cross encoder, say from Hugging Face, or you're using the Elastic re-ranker, 01:36:36.880 |
using one of these things that complies with that kind of inference task of taking a bunch of things and re-ranking them against the query, right? 01:36:48.880 |
And the full token stream with the context and doing that. 01:36:51.880 |
So you can target a parent level text field of the documents that are being retrieved. 01:36:57.880 |
So it works really well for the one document or chunk, kind of re-ranking use case. 01:37:03.880 |
I've also seen people just do it outside in the second API call, say if you wanted to do it on a highlighted thing. 01:37:09.880 |
Or if you wanted to do re-ranking sub-document chunks, that works pretty well for the API. 01:37:21.880 |
But there's a text re-ranker retriever that specifically got added in 8.16 or 8.17. 01:37:35.880 |
And then we have the text-similarity re-ranker, which uses our elastic re-ranker. 01:37:41.880 |
That falls back to that model behind the scenes. 01:37:46.880 |
I was a functional programmer, so don't mind the parentheses. 01:37:50.880 |
But you would have, like, the text re-ranker retriever. 01:37:56.880 |
Inside of that, you would have lexical and KNN as peers. 01:38:02.880 |
Hey, do each of those retrieval methodologies. 01:38:08.880 |
And then take the full text of those results and run them on re-ranker. 01:38:12.880 |
It's almost like a little mini LLM saying, do you actually answer the question? 01:38:19.880 |
The cool thing about the re-ranker is you can run it on structured lexical retrieval. 01:38:30.880 |
So you don't have to pay for vector search on everything. 01:38:32.880 |
Or maybe the text is too small for the vector search. 01:38:35.880 |
You don't need the model there to actually lock on to the stuff. 01:38:39.880 |
The re-ranker, when you run it on just kind of actual customer data sets, 01:38:44.880 |
how to do it, they're like, yeah, our evaluation score is bumped by 10 points. 01:38:57.880 |
and you're like, wow, why is this 10 points better than the Amazon one? 01:39:04.880 |
So there's a lot of black box stuff out there that we're exposing. 01:39:08.880 |
So don't be scared if we're telling you how it works inside. 01:39:13.880 |
But this is what the leading retrieval technology is doing under the hood 01:39:18.880 |
and reselling to you as if they're, you know, it's all AI. 01:39:29.880 |
So my wish list is I want to be able to do, in the retrievers API, re-ranking on sub-documents. 01:39:39.880 |
A lot of my things are about sub-document retrieval. 01:39:41.880 |
Right now, I've got to do it outside of the retriever's API, but I'm bending here as a developer. 01:39:51.880 |
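For illustration only, a rough sketch of the retriever tree described above: a text-similarity re-ranker wrapping a lexical retriever and a kNN retriever as peers. The field names, inference endpoint IDs, and some parameter names here are assumptions from memory and may differ between versions, so check the docs for your release.

```python
# Rough sketch of the described retriever tree: a text_similarity_reranker wrapping
# a lexical (standard) retriever and a kNN retriever, fused with RRF underneath.
# Field names, endpoint IDs, and exact parameter names are assumptions and may
# differ between Elasticsearch versions.
import requests

body = {
    "retriever": {
        "text_similarity_reranker": {
            "retriever": {
                "rrf": {
                    "retrievers": [
                        {"standard": {"query": {"match": {"text": "droids"}}}},
                        {"knn": {
                            "field": "text_embedding",
                            "query_vector_builder": {"text_embedding": {
                                "model_id": "my-embedding-endpoint",  # assumed endpoint name
                                "model_text": "droids",
                            }},
                            "k": 10,
                            "num_candidates": 50,
                        }},
                    ]
                }
            },
            "field": "text",                         # text handed to the re-ranker
            "inference_id": "my-reranker-endpoint",  # assumed re-ranker endpoint name
            "rank_window_size": 50,
        }
    }
}
resp = requests.post("http://localhost:9200/starwars/_search", json=body, timeout=30)
for hit in resp.json()["hits"]["hits"][:3]:
    print(hit["_score"], hit["_source"].get("text"))
```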
So just to give you the example of, like, I don't think I have a re-ranking example here, 01:39:56.880 |
but this one uses a classic keyword match retriever. 01:40:03.880 |
And then we have -- we normalize here the score. 01:40:06.880 |
I think somebody else asked about the normalize -- or we had a discussion about the normalizing. 01:40:14.880 |
And then I use the OpenAI embeddings with a -- again, normalized with a weight of 1.5. 01:40:21.880 |
And then they will get blended together and you get the results that won't surprise you 01:40:25.880 |
that these are not the droids you're looking for. 01:40:27.880 |
If you search for droid and robot, will be by far the highest ranking document. 01:40:35.880 |
How much control does Elastic give you there -- if you're doing the re-ranking thing, currently we do something 01:40:40.880 |
similar, but we, like, do the different steps of the re-rankers at kind of, like, different stages. 01:40:44.880 |
So, like, you have, like, sharded, distributed processing nodes and you 01:40:53.880 |
do some ranking there and, like, once you, like, rejoin before you return 01:40:58.880 |
the result from the query engine, that's when you run a final re-ranker. 01:41:03.880 |
So, we would retrieve, like, x candidates and you could define the number of candidates 01:41:19.880 |
and then we would run the re-ranking on top of those. 01:41:22.880 |
So, that will be a trade-off for you, like, the larger the window is, the slower it will 01:41:27.880 |
be, but the potentially higher quality your overall results will be because you will just 01:41:32.880 |
have everything in your data set that you can then re-rank at the end of the day. 01:41:38.880 |
Is that what you meant or you wanted something per node or -- 01:41:43.880 |
Yeah, I mean, like, right now we, like, are actually confirmed, like, specifically, 01:41:48.880 |
we're kind of, like, we're in the compute and that happens, so that you can, like, get out 01:41:53.880 |
for our rest of the week and do, like, keep re-ranking and then do some re-ranking of one, 01:41:58.880 |
but -- but maybe that doesn't apply to these . 01:42:09.880 |
So, what you can control here is, like, this is a window of, like, what you might retrieve 01:42:14.880 |
and then we have the minimum score, like a cutoff point, to throw out what might not be relevant 01:42:19.880 |
anyway, to keep that a bit cheaper, that's what we have here. 01:42:30.880 |
And then you could do the RRF that I've explained where you blend results together. 01:42:36.880 |
One final note, if you got tired of all the JSON, we have a new way of defining those queries 01:42:45.880 |
as well, where here we have a match operator, like, the one we have used all the time, that 01:42:51.880 |
you can use either on a keyword field, but it could also be either a dense or a sparse vector 01:42:55.880 |
embedding, and then you can just run a query on that, and then just get the scores from that. 01:43:01.880 |
So, it is a piped query language, it's a bit more like, I don't know, like a shell. 01:43:05.880 |
But if you don't want to type all the JSON anymore, this is how you can do that. 01:43:12.880 |
But, yeah, you get the quote that we retrieved, the speaker, and the score. 01:43:18.880 |
Maybe I'll take out the speaker to make this slightly more readable. 01:43:37.880 |
This is, you could write queries with a fraction of the JSON. 01:43:43.220 |
This will also support funny things like joins. 01:43:46.220 |
It doesn't have every single search feature yet, but it's getting pretty close. 01:43:50.640 |
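As a hedged sketch of what such a query can look like when sent to the _query endpoint: the index and field names are made up, and the exact function names and score handling vary between versions, so treat this as an illustration rather than exact syntax.

```python
# Sketch of a full text search in the piped query language, sent to the _query
# endpoint. Index/field names are made up; exact function names and the _score
# handling vary between versions, so check the docs for your release.
import requests

esql = """
FROM starwars METADATA _score
| WHERE MATCH(quote, "droids")
| KEEP quote, speaker, _score
| SORT _score DESC
| LIMIT 5
"""
resp = requests.post("http://localhost:9200/_query", json={"query": esql}, timeout=10)
print(resp.json()["values"])
```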
So this is more like a closing out at the end. 01:43:53.320 |
If you're tired of all the JSON queries, you don't have to write JSON queries anymore. 01:43:59.040 |
This is nice both for, like, observability use cases where you have, like, just like aggregations 01:44:04.380 |
and things like that, but it's also very helpful for full text search now if you want to write queries. 01:44:10.100 |
I think the main answer is that the language support in the different languages like Java, etc., is still limited. 01:44:16.220 |
You basically give it strings and then it gives you a result back that you need to parse 01:44:20.940 |
So it is not as strongly typed on the client side yet as the other languages. 01:44:31.100 |
Can you just talk about hybrid search and I was just curious, like, what is kind of the 01:44:36.820 |
like, we also do hybrid search today, what we do is we trigger two elastic queries. 01:44:42.540 |
One to do, like, a basic keyword search, the other one to do, like, a k-NN search, 01:44:46.540 |
and then we walk through some, like, code here, a rewrite, and we get the final results. 01:44:51.820 |
But, like, from the retrievers you just showed us, like, is it better, like, to combine those 01:44:53.540 |
two queries in one retriever and, like, would that make the results significantly better? 01:45:06.540 |
Like, each of the queries will have its own stats and normalizer and things 01:45:10.540 |
and, like, more, I don't know, just, like, in general, it sounds better. 01:45:17.360 |
It's just all behind one single query endpoint. 01:45:20.300 |
So you could use the two different methods to retrieve and then you could still re-rank 01:45:25.420 |
but all from one single query so you don't have to do it yourself. 01:45:29.300 |
I mean, it's not like we want to stop you, but you don't have to, and we can make your life easier. 01:45:33.800 |
I mean, it's only one single query that you need to run and, like, one single round trip. 01:45:49.400 |
But, like, I was just curious comparing to the two queries method, would you do that? 01:45:56.280 |
I mean, if you still need to do the retrieval -- like, you do the retrieval for all the individual parts. 01:46:01.540 |
If you have two parts of the query, you will still retrieve those if that is the main cost 01:46:06.140 |
and then you have the re-ranking so you're not getting out of those completely. 01:46:10.460 |
But you can just do it in one single request that you send. 01:46:12.860 |
We take care of all of that for you and then send you one result set back rather than sending 01:46:18.900 |
So it will potentially be a little less work on the Elasticsearch side, but it will mostly 01:46:27.280 |
So if you don't have any problems, you may not notice, but you're running two searches 01:46:43.880 |
when you could be running one, and you're denying the optimizer the opportunity to do any short 01:46:50.460 |
circuits, to say, oh, there are no more results that are better. 01:46:56.060 |
So you're potentially going to spend a little bit more performance and resources. 01:47:02.060 |
So if it's not hurting you, by all means, keep going. 01:47:20.660 |
But at some point, you're going to start vertically scaling your hardware when you 01:47:24.720 |
don't need to. You can get further, it's just a higher expense. 01:47:42.320 |
I will leave the instance running for today or so, so you can still play around with the queries. 01:47:52.960 |
Come join and get some proper swag from us there.