Layering every technique in RAG, one query at a time - David Karam, Pi Labs (fmr. Google Search)

Chapters
0:00 Introduction and Context
1:41 Quality Engineering Loop and Mindset
4:09 In-Memory Retrieval
4:50 Term-Based Retrieval (BM25)
5:18 Relevance Embeddings (Vector Search)
6:15 Re-Rankers (Cross Encoders)
7:59 Custom Embeddings
9:40 Domain-Specific Ranking Signals
11:09 User Preference Signals
12:17 Query Orchestration (Fan Out)
14:26 Supplementary Retrieval
16:09 Distillation
17:14 Punting the Problem and Graceful Degradation
I'll just give you all a little bit of context. My co-founder and I, and a lot of our team, come from Google Search, and we're all nerds for information retrieval and search, so there's going to be a little bit of that. I'm just going to go through a whole bunch of ways you can actually shore up and improve your RAG systems.
One thing that I personally sometimes struggle with is that there's a lot of talk about specific techniques, like you do RL this way, and it doesn't help me orient in the space. What are all these things, and how do they hang together? A lot of it is a whole bunch of buzzwords and hype and such. So a lot of what I'll do today is just try to set up a framework: if you are trying to shore up the quality of your system, where do all the things you hear about day in, day out, fit? What we always did at Google, and what we always do at Pi Labs, is just look at things, look at cases, look at queries. That's really the essence of quality engineering.
If you do want the slides, there are about 50 of them, and you can catch them here: withpi.ai/rag-talk should point to the slides. And as I mentioned, plain English, no hype, no buzz.
Before we go into the techniques and get into the weeds of it: you're always trying to solve some product problem. And generally, the best way to visualize this is that you have a certain quality bar you want to reach. There was a very interesting talk this week about how benchmarks aren't really helpful; what matters is a bar high enough that you can actually put the thing out into the world. So you're trying to come up with different ways to get there, and those ways are sort of the techniques. This is sort of your own personal benchmark.
You start with some of the easy bars you want to hit, then there are medium benchmarks and hard benchmarks, and depending on what you want to reach, you keep going. This is what we call the quality engineering loop. OK, say you want to build a CRM: you get your quality there through the simplest way you can.
Now, the reason I say this is because one of the biggest problems, I think, is people reaching for complex techniques when many times you actually don't need these things. The mindset is what I call complexity-adjusted impact, or stay lazy, in the sense of: always look at what's broken.
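To make that "look at what's broken" loop concrete, here's a tiny loss-analysis sketch: run an eval set through the system, collect the failures, and bucket them by failure type so the next technique you invest in matches the biggest bucket. All the function names and failure labels here are hypothetical placeholders, not anything from the talk.

```python
# Sketch of a quality engineering loop: run queries, grade answers, and
# bucket failures so you can pick the highest complexity-adjusted-impact fix.
from collections import Counter

def run_rag(query: str) -> str:
    raise NotImplementedError  # placeholder for your RAG pipeline

def grade(query: str, answer: str, expected: str) -> bool:
    raise NotImplementedError  # placeholder grader (human label or LLM judge)

def diagnose(query: str, answer: str) -> str:
    """Placeholder: label the failure, e.g. 'missed_keyword',
    'wrong_domain_sense', 'ignored_constraint', 'bad_ranking'."""
    raise NotImplementedError

def loss_analysis(eval_set: list[dict]) -> Counter:
    buckets = Counter()
    for case in eval_set:
        answer = run_rag(case["query"])
        if not grade(case["query"], answer, case["expected"]):
            buckets[diagnose(case["query"], answer)] += 1
    return buckets  # e.g. Counter({'missed_keyword': 12, 'ignored_constraint': 4})
```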
We'll go through a lot of techniques today, and the most important two columns are the ones to the right. Most times, BM25 is pretty easy, and it does shore up your quality quite a bit. But should I build custom embeddings for retrieval? That only makes sense when plain relevance embeddings don't do enough for your domain, and then you're willing to put in all that work and effort.
This is the whole "is RAG dead, is RAG not dead" debate. Maybe you just pollute the context window too much, and the documents are not attended to properly by the LLM. Here are the five things that are breaking.
So now you try something very simple, which is term-based retrieval, BM25: query terms, the frequency of those query terms, the length of the document. It breaks when queries don't have that nature, like I was saying, when they don't look like a keyword-based search.
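As a rough illustration, here's a minimal BM25 sketch using the rank_bm25 Python package; the corpus and query are made up, and whitespace tokenization stands in for whatever tokenizer you'd actually use.

```python
# Minimal term-based retrieval sketch with BM25 (rank_bm25 package).
# Corpus and query are toy examples; tokenization here is just str.split().
from rank_bm25 import BM25Okapi

corpus = [
    "iPhone battery life and charging tips",
    "How to reset your Android phone",
    "Best laptop batteries for travel",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)  # scores use term frequency, doc length, IDF

query = "iphone battery life"
scores = bm25.get_scores(query.lower().split())

# Rank documents by BM25 score, highest first.
for score, doc in sorted(zip(scores, corpus), reverse=True):
    print(f"{score:.2f}  {doc}")
```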
Relevance embeddings are pretty interesting, and vector space can handle way more nuance than keyword space. But they also fail in certain ways, especially when you're looking for keyword matching. It's actually pretty easy to know when things work and when they don't. I went to ChatGPT and asked for examples, and you can see exactly what's going on here. If your query stream looks like "iPhone battery life", keyword retrieval handles it; if it looks like "how long does an iPhone last before I need to charge it again", then you absolutely need things like vector search. This is where you need to be tuned to what every technique gives you before you go and invest in it. When you do your loss analysis and you see that most of your queries actually look like the ones on the right-hand side, then you should absolutely start investing in this area. You add vector search because your query set looks exactly like that.
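A minimal sketch of that kind of relevance-embedding retrieval, assuming the sentence-transformers package and a small off-the-shelf model; the documents are toy examples.

```python
# Minimal vector-search sketch: embed query and documents with a bi-encoder,
# then rank by cosine similarity. Model choice and corpus are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "iPhone battery typically lasts a full day on one charge.",
    "Steps to replace a laptop battery safely.",
    "Android phones with the longest battery life in 2024.",
]
doc_embs = model.encode(docs, convert_to_tensor=True)

query = "how long does an iPhone last before I need to charge it again"
query_emb = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query vector and each document vector.
scores = util.cos_sim(query_emb, doc_embs)[0]
for score, doc in sorted(zip(scores.tolist(), docs), reverse=True):
    print(f"{score:.3f}  {doc}")
```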
And this is where re-rankers help quite a bit. When people say re-rankers, they're usually referring to cross encoders. The way relevance embeddings worked was that you get a vector for the query and a vector for the document and compare them. Cross encoders actually take both the query and the document, and they give you a score while attending to both at the same time. Now, they are more powerful, but they're also pretty expensive, so you end up retrieving a bigger candidate set with something cheap and then ranking a smaller set of things with a technique like that. But it is really powerful, and you should use it.
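Here's a rough sketch of that retrieve-then-re-rank pattern, assuming sentence-transformers' CrossEncoder class and a public MS MARCO cross-encoder checkpoint; the candidate list stands in for whatever your first-stage retriever returned.

```python
# Two-stage sketch: a cheap first stage produces candidates, then a cross
# encoder scores (query, document) pairs jointly and re-ranks the short list.
from sentence_transformers import CrossEncoder

query = "how long does an iPhone last before I need to charge it again"

# Pretend these came back from BM25 or vector search (first stage).
candidates = [
    "iPhone battery typically lasts a full day on one charge.",
    "Steps to replace a laptop battery safely.",
    "Android phones with the longest battery life in 2024.",
]

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc) for doc in candidates])

# Keep the order from the cross encoder, not the first-stage retriever.
for score, doc in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.3f}  {doc}")
```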
And when you hit those cases, then you move to the next thing. This is a big problem: these are all proxy metrics, and at the end of the day, your application is your application. This is something we learned in Google Search over the last 15, 20 years: you look at a lot of other things than just relevance. The talk from Harvey and LanceDB was really, really interesting here.
He gave the example of a query that has so much semantics for the legal domain that it's impossible to catch it with just relevance. What does a word like "regime" mean? It actually has a very specific meaning in legal terms. And then there are things that are very specific to the domain that need to be retrieved, like laws and regulations and such.
So you say, you know what, just fetching on relevance is not enough for me; I need to go and model my own domain in its own vector space, and now I can actually fetch some of these things. Again, go back to ChatGPT, isn't this interesting: I asked it to give me a list of things that would fail in a standard relevance search, and you start to see that all these things would fail. Words like "moot" don't mean the same thing. Words like "material" don't mean the same thing. When you have a vocabulary that is so specific and just off the standard distribution, that's when this investment pays off. You need to have evals, you need to look at the things that are breaking, and decide that, oh, the things that are breaking have to do with the vocabulary just being out of distribution for a standard relevance model. So again, don't overthink it.
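As a sketch of what "modeling your own domain in its own vector space" can look like in practice, here's a minimal fine-tune of an off-the-shelf embedding model on (query, relevant passage) pairs, using the classic sentence-transformers fit API with MultipleNegativesRankingLoss; the legal-style pairs are made up for illustration.

```python
# Sketch: adapt a general embedding model to domain vocabulary by training on
# in-domain (query, relevant passage) pairs. Data below is illustrative only.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# In a real system these pairs would come from your own labeled or mined data.
train_examples = [
    InputExample(texts=["is the motion moot", "The court dismissed the motion as moot because..."]),
    InputExample(texts=["material breach of contract", "A breach is material when it defeats the purpose..."]),
    InputExample(texts=["change in regulatory regime", "The statute establishes a new licensing regime..."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch negatives: every other passage in the batch acts as a negative.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
)
model.save("legal-embeddings-v0")  # then index your corpus with this model
```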
Embeddings are very interesting, but now think about the cases where relevance alone doesn't work for retrieval. I asked ChatGPT, give me cheap gifts for my son, and then I followed up with, but I have a budget of $50, because when I said "cheap" it started giving me things like $10 gifts. It then gave me $15 and $40 options, both of which are below $50. In standard information-retrieval terms, that budget is a signal, and it's not being translated into the query, and it's definitely not being translated into the ranking.
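A tiny sketch of what "translating that signal" could look like: pull the budget out of the conversation and turn it into a structured constraint the retrieval layer can actually use. The regex and the filter function are hypothetical stand-ins for however you'd really extract constraints (often with an LLM).

```python
# Sketch: turn a conversational budget ("I have a budget of 50") into a
# structured signal and apply it as a filter at retrieval time.
import re
from typing import Optional

def extract_budget(conversation: str) -> Optional[float]:
    """Very naive budget extraction; a real system might use an LLM here."""
    match = re.search(r"budget of \$?(\d+(?:\.\d+)?)", conversation.lower())
    return float(match.group(1)) if match else None

def apply_budget(results: list[dict], budget: Optional[float]) -> list[dict]:
    """Drop items over budget; relevance ranking stays otherwise unchanged."""
    if budget is None:
        return results
    return [r for r in results if r["price"] <= budget]

conversation = "Give me cheap gifts for my son. But I have a budget of 50."
results = [
    {"title": "Lego set", "price": 40.0},
    {"title": "Game console", "price": 300.0},
    {"title": "Puzzle", "price": 15.0},
]
print(apply_budget(results, extract_budget(conversation)))
```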
So now you have to think: OK, I have ranking, and I need the ranking to see the semantics of my corpus and my queries. Things like price signals, things like merchant signals; if you're doing podcasts, how many times has an episode been listened to? Then the ranking will see things that are, for example, more popular. There was a talk that mentioned the PageRank algorithm: how many things outside of my document point to me? That has nothing to do with relevance and everything to do with the shape of the data. So this is a signal about the shape of the data. Think of it like you have horizontal semantics and vertical semantics, and if you're in a vertical domain where the semantics are very verticalized, say you're doing a CRM or you're doing emails, and it's a very complex bar that you're trying to hit, understand that relevance will be a very tiny, tiny part of your ranking. The harder you try to go, the more you're going to hit this wall. At sufficient complexity, things will keep breaking.
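A minimal sketch of folding those domain signals into the ranking function: a weighted blend of the relevance score with price and popularity signals. The weights, the normalization, and the item data are made-up placeholders you'd tune against your own evals.

```python
# Sketch: combine a relevance score with domain-specific signals
# (price fit, popularity) into one ranking score. Weights are illustrative.
import math

def rank_score(relevance: float, price: float, budget: float, views: int) -> float:
    # Price signal: 1.0 when comfortably under budget, 0.0 when over it.
    price_fit = max(0.0, min(1.0, (budget - price) / budget)) if budget else 0.5
    # Popularity signal: log-scaled view/listen count squashed into [0, 1).
    popularity = math.log1p(views) / (1.0 + math.log1p(views))
    # Weighted blend; relevance is only one part of the final score.
    return 0.5 * relevance + 0.3 * price_fit + 0.2 * popularity

items = [
    {"title": "Lego set", "relevance": 0.82, "price": 40.0, "views": 1200},
    {"title": "Puzzle", "relevance": 0.78, "price": 15.0, "views": 9000},
    {"title": "Game console", "relevance": 0.90, "price": 300.0, "views": 50000},
]
budget = 50.0
ranked = sorted(
    items,
    key=lambda i: rank_score(i["relevance"], i["price"], budget, i["views"]),
    reverse=True,
)
for item in ranked:
    print(item["title"])
```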
The thing that breaks even with custom semantics is user preference. Even when you get to all this and you're saying, I'm doing relevance and I'm doing price signals and merchant signals, users still surprise you. They're clicking on stuff you thought they weren't going to click on, and they're not clicking on things you thought they would. This is where you need to bring in the click signal. We're not going to talk about how to implement it in depth, but in this case, for example, you have to be able to collect a click-through signal, and then you take that signal and combine it with the rest. So if you look at your ranking function, it's doing: I want this semi-structured price signal, plus I want the user preference in there, and then you take all these signals and you add them up. That's how you go from "oh, it's just relevance" to "no, it's relevance and my domain semantics and my user preferences all rolled up into one."
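To make the click signal concrete, here's a sketch of a smoothed click-through-rate estimate per document that gets added into the same kind of blended score as above; the smoothing constants and logging tables are arbitrary, hypothetical placeholders.

```python
# Sketch: estimate a per-document click-through rate from logged impressions
# and clicks, with smoothing so rarely shown documents aren't over-trusted,
# then fold it into the blended ranking score.

PRIOR_CLICKS, PRIOR_IMPRESSIONS = 1.0, 20.0  # arbitrary smoothing prior

def click_signal(clicks: int, impressions: int) -> float:
    """Smoothed CTR in [0, 1]; behaves like the prior until data accumulates."""
    return (clicks + PRIOR_CLICKS) / (impressions + PRIOR_IMPRESSIONS)

def final_score(relevance: float, domain_signal: float, ctr: float) -> float:
    # Same idea as before: relevance is just one term among several.
    return 0.4 * relevance + 0.3 * domain_signal + 0.3 * ctr

# Hypothetical logs: (clicks, impressions) per document.
logs = {"doc_a": (5, 40), "doc_b": (0, 3), "doc_c": (50, 1000)}
candidates = [
    {"id": "doc_a", "relevance": 0.7, "domain_signal": 0.6},
    {"id": "doc_b", "relevance": 0.9, "domain_signal": 0.5},
    {"id": "doc_c", "relevance": 0.6, "domain_signal": 0.8},
]
ranked = sorted(
    candidates,
    key=lambda c: final_score(c["relevance"], c["domain_signal"], click_signal(*logs[c["id"]])),
    reverse=True,
)
for c in ranked:
    print(c["id"])
```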
That happens a lot, and this goes into more orchestration. You have an LLM and you're telling it to use a certain tool, and there is an impedance mismatch between what the search engine expects and what the LLM sends it. Let's say you tune the search engine and it expects keyword queries; you cannot describe all of that to the LLM, and the LLM is reasoning about your application without knowing those details. So one thing that we've seen many companies do is take more control of the actual orchestration. I took this screenshot from AI Mode in Google; you have to catch it, because after the animation goes away it disappears. It takes the broad question, figures out what all the sub-queries in it are, and then fans them out. Now you might think, hey, why isn't the LLM doing it? The LLM is kind of doing it, but the LLM doesn't know enough about your search engine, and there's only so much you can teach the LLM just through prompting about what to expect from the search back end on the other side.
This is why people are still asking, is the agent autonomous or not? It will take a while for this to be solved, because it's unclear where the boundary is. Should the search engine be able to handle anything, and the LLM will just throw whatever its way? Or should the LLM know about what the search engine can support so it can tailor its queries? Right now, you need control, because the quality depends on it. If you have this broad assistant input and you're turning it into narrow queries, it's very, very hard to know, without knowing enough about your application, that these are the queries that matter and not the ones on the left-hand side. If you send the thing on the left-hand side to a search engine, it will absolutely tip over. This is where you need to calibrate the boundary.
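Here's a rough sketch of that fan-out orchestration: ask an LLM to decompose the broad assistant-style input into narrow, search-friendly sub-queries, then send each one to the retrieval back end and merge. `call_llm` and `search_backend` are hypothetical placeholders for your own model client and search engine, not real APIs.

```python
# Sketch: query fan-out. An LLM rewrites one broad request into several
# narrow queries shaped the way the search back end expects, then we fan out.
import json

def call_llm(prompt: str) -> str:
    """Placeholder for your LLM client (OpenAI, Gemini, local model, ...)."""
    raise NotImplementedError

def search_backend(query: str, k: int = 5) -> list[dict]:
    """Placeholder for your tuned search engine (expects short keyword-ish queries)."""
    raise NotImplementedError

FANOUT_PROMPT = """Break the user request below into 2-5 short search queries
that a keyword-oriented search engine can handle. Return a JSON list of strings.

User request: {request}"""

def fan_out(request: str) -> list[dict]:
    sub_queries = json.loads(call_llm(FANOUT_PROMPT.format(request=request)))
    results, seen_ids = [], set()
    for q in sub_queries:
        for hit in search_backend(q):
            if hit["id"] not in seen_ids:   # de-duplicate across sub-queries
                seen_ids.add(hit["id"])
                results.append(hit)
    return results  # hand these to a re-ranker before building the context
```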
Are you sending the queries to all the right back ends? This is another place where it all fails, and one technique here is called supplementary retrieval. This is something you notice clients do quite a bit, and it's similar to what we talked about with dynamic content. In Google Search there was this query, "falafel", which was really hard to do. And you think, we're in Google Search, and it stumped an organization of 6,000 people; what's so hard about this query? You need a lot of context to actually understand enough about it: at which point do I want to show you restaurants? What Google ended up doing is asking supplementary queries alongside the original one, and that is a great technique to just increase the recall even more.
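A small sketch of supplementary retrieval under those assumptions: alongside the original query, issue a few extra related queries (here generated by a placeholder LLM call) and union the results to boost recall. `call_llm` and `search_backend` are the same hypothetical placeholders as in the fan-out sketch.

```python
# Sketch: supplementary retrieval. Run the original query as-is, plus a few
# supplementary interpretations of it, and merge the result sets for recall.
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder LLM client

def search_backend(query: str, k: int = 5) -> list[dict]:
    raise NotImplementedError  # placeholder search engine

SUPPLEMENT_PROMPT = """Give 2-3 supplementary search queries that cover other
plausible intents behind this query (e.g. 'falafel' -> 'falafel restaurants near me').
Return a JSON list of strings.

Query: {query}"""

def retrieve_with_supplements(query: str) -> list[dict]:
    queries = [query] + json.loads(call_llm(SUPPLEMENT_PROMPT.format(query=query)))
    merged, seen = [], set()
    for q in queries:
        for hit in search_backend(q):
            if hit["id"] not in seen:
                seen.add(hit["id"])
                merged.append(hit)
    return merged  # higher recall; downstream ranking decides what surfaces
```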
I tried to generate an image for this, but then I realized there's actually a pretty good real one: somebody took a server rack and threw it off the roof. I mean, Sam Altman keeps complaining about GPUs melting. This is the part where you need to start doing distillation. Distillation is a very interesting thing: you hold the quality bar constant while bringing cost and latency down. The reason you can do that is kind of like in that graph: a very large language model is actually mostly overqualified for the task you want to do, because what you really want to do is just one thing. Like Perplexity: they're doing question answering. When you use Perplexity in certain contexts, it's really fast, because they trained this one model to do this one very specific thing, which is just be really, really good at answering questions. So I wouldn't do it unless latency becomes a really big problem. If I can make it in two seconds, users don't churn.
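As a sketch of the distillation idea under these assumptions: use the big, slow model as a teacher to label a batch of representative queries, write the pairs out as training data, and fine-tune a much smaller student on exactly that one task. `teacher_answer` is a hypothetical wrapper around your large model; the fine-tuning step itself would use whatever trainer fits your student model.

```python
# Sketch: distillation data generation. The expensive teacher model labels
# representative production queries; the small student is then fine-tuned on
# these (input, teacher_answer) pairs to do just this one task, fast.
import json

def teacher_answer(query: str, context: str) -> str:
    """Placeholder for the large, slow, high-quality model."""
    raise NotImplementedError

def build_distillation_set(samples: list[dict], out_path: str) -> None:
    with open(out_path, "w") as f:
        for s in samples:
            record = {
                "input": f"Question: {s['query']}\nContext: {s['context']}",
                "target": teacher_answer(s["query"], s["context"]),
            }
            f.write(json.dumps(record) + "\n")

# samples would come from logged real traffic, e.g.:
# build_distillation_set(logged_queries, "distill_train.jsonl")
# Then fine-tune a small model on that JSONL and evaluate it against the same
# quality bar the teacher was hitting, watching latency and cost.
```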
When everything fails, make sure it's not your fault. But I'll say there's something really important here: you have to match your product surface to how magical it can seem, because if you try to be more magical than your product surface can absorb, that's a problem. Probably a more complex version of this is a human in the loop: in some cases the bot can handle things on its own, and you have to decide when the human needs to be in the loop. A much simpler example is Google Shopping. There are some cases where Google has a lot of great data and the fidelity of the understanding is really high, and then it shows what we call a high-promise UI: there are filters, because it understands this really well. And there are things Google does not understand at all. If you understand more, you show a more high-promise UI; if not, you degrade it to something that is still workable.
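A minimal sketch of that kind of graceful degradation: pick the UI treatment based on how confident the system is in its understanding, falling back to a plainer but still workable experience. The confidence field and the threshold values are made-up placeholders.

```python
# Sketch: graceful degradation. High-confidence understanding gets the
# "high-promise" UI (filters, structured results); low confidence degrades
# to a plain result list instead of pretending to understand.

HIGH_CONFIDENCE = 0.8   # illustrative thresholds, tuned against your evals
LOW_CONFIDENCE = 0.4

def choose_surface(understanding_confidence: float) -> str:
    if understanding_confidence >= HIGH_CONFIDENCE:
        return "structured_results_with_filters"   # high-promise UI
    if understanding_confidence >= LOW_CONFIDENCE:
        return "plain_ranked_list"                 # still workable
    return "handoff_or_clarifying_question"        # punt gracefully

for conf in (0.95, 0.6, 0.2):
    print(conf, "->", choose_surface(conf))
```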
It has to be part of everything. Always understand that there's only so much engineering you can do until you have to actually change your product to accommodate this sort of stochastic nature, depending on the level of your understanding. And again, I'll flash these two slides at the end, because you can absolutely get into theoretical debates. In this domain, when you're doing this sort of thing, ask: are there easy things I can do? If not, are there at least medium things I could do? Or do I need to go and do some really, really hard things? But always remember, the choice is on you. And if you want the slides, they're here again. It's always nice to find friends who are working on the same problems, so reach out, and I'm happy to talk about RAG challenges and such, and some of the models.