
Layering every technique in RAG, one query at a time - David Karam, Pi Labs (fmr. Google Search)


Chapters

0:00 Introduction and Context
1:41 Quality Engineering Loop and Mindset
4:09 In-Memory Retrieval
4:50 Term-Based Retrieval (BM25)
5:18 Relevance Embeddings (Vector Search)
6:15 Re-Rankers (Cross Encoders)
7:59 Custom Embeddings
9:40 Domain-Specific Ranking Signals
11:09 User Preference Signals
12:17 Query Orchestration (Fan Out)
14:26 Supplementary Retrieval
16:09 Distillation
17:14 Punting the Problem and Graceful Degradation

Transcript

I'll just give you all a little bit of context. So my co-founder and I, and a lot of our team, were actually working on Google Search. And then we left and started Pi Labs. And I loved the exit talk. And we're all nerds for information retrieval and search. And so there's going to be a little bit of that.

Just going to go through a whole bunch of ways you can actually shore up and improve your RAG systems. I think one thing that I personally sometimes struggle with is there's a lot of talk that's too much in the weeds. Like, oh, specific techniques, and you do RL this way, and you can tune the model this way.

It doesn't help me orient in the space. Like, what are all these things, and how do they hang together? Or you have the complete opposite, which is a whole bunch of buzzwords and hype and such. And like, RAG is dead. No, RAG is not dead. It's like agents, like, wait, what?

So, you know, I think a lot of what I'll do today is just what I call plain English. Just trying to set up a framework, right? Very centered around, OK, if you are trying to shore up the quality of your system, how do you do that?

And then where do all the things you hear about day in, day out fit? And then just how to approach that. And I'll give a lot of examples. I think one thing that I always love, and we always did at Google, we always do at Pi Labs, is just look at things, look at cases, look at queries, see what's working, see what's not working.

And that's really the essence of like quality engineering, as we used to call it at Google. If you do want the slides, there's like 50 slides. And I set a challenge for myself to go through 50 slides in 19 minutes. But you can catch the slides here if you want.

I'll flash this towards the end as well. withpi.ai/rag-talk should point to the slides that we're going through. And as I mentioned, plain English, no hype, no buzz, no debates. All right, so how to think about techniques. Before we go into techniques and get into the weeds of it, why does this even matter?

And the way we always think about it is: always start with outcomes. You're always trying to solve some product problem. And generally, the best way to visualize something like this, you have a certain quality bar you want to reach. And there was a very interesting talk this week about how, you know, benchmarks aren't really helpful.

And absolutely, evals are helpful. You're trying to launch a CRM agent. And you sort of have a launch bar, like a place where you feel comfortable that you can actually put it out into the world. And techniques fit somewhere here. You have that like kind of end metric. And you're trying to like come up with different ways to shore up the quality.

And those ways are sort of the techniques there. And you know, this is sort of your own personal benchmark. You start with some of the easy bars you want to hit. And then there are medium benchmarks and hard benchmarks. So these are query sets you're setting up. And then, you know, depending on what you want to reach and in what time frame, you end up trying different things.

And this is what we call the quality engineering loop. You sort of baseline yourself. OK, you want to achieve, you know, you want a CRM agent. And this is the easy query set. And your quality is there just through the simplest way you can try it. Do a loss analysis.

OK, what's broken? There were a lot of eval talks this week. And then what we call quality engineering. Now, the reason I say this is because, OK, techniques fit in this last bucket. And one of the biggest problems, I think, is that people sometimes start there.

And it doesn't make any sense. Because you say, oh, do I need BM25? Or do I need vector retrieval? It's like, I don't know. What are you trying to do? And what is your query set? And where are things failing? Because many times you actually don't need these things.

And you end up implementing them. It doesn't make a lot of sense. Anyway, so usually the thing I say is what I call complexity-adjusted impact or stay lazy in a sense of like, always look at what's broken. And if it's not broken, don't fix it. And if it is broken, do fix it.

And we'll go through a lot of techniques today. But this is a good way to think about them. It's just a cluster. It's a catalog of stuff. The most important two columns are the ones to the right, difficulty and impact. And if it's easy, go ahead and try it.

And most times, like BM25 -- BM25 is pretty easy. You should absolutely try it. And it does shore up your quality quite a bit. But should I build custom embeddings for retrieval? Like, I don't know. Let's take a look. This is actually really, really hard. Harvey gave a talk. They build custom embeddings.

But they have a really hard problem space. And just relevance embeddings don't do enough for them. And then they're willing to put all that work and effort. All right, queries, examples. Let's start stuff. First technique, in-memory retrieval. Easiest thing, bring all your documents. Shove them all to the LLM.

This is the whole, is RAG dead, is RAG not dead, context windows debate. Well, context windows are pretty easy. So you should definitely start there. One example, NotebookLM. Very nice product. You actually put in five documents. You just ask questions about them. You don't need any RAG. Just shove the whole thing in.
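In code, the whole technique is barely a technique. Here's a minimal sketch, assuming a placeholder `call_llm` for whatever model client you actually use:

```python
# Minimal sketch of in-memory "retrieval": no index at all, just put every
# document into the prompt. `call_llm` is a stand-in for whatever LLM client
# you use -- an assumption for illustration, not a specific API.

def answer_in_memory(question: str, documents: list[str], call_llm) -> str:
    # Concatenate the whole corpus into the context window.
    context = "\n\n---\n\n".join(documents)
    prompt = (
        "Answer the question using only the documents below.\n\n"
        f"Documents:\n{context}\n\n"
        f"Question: {question}\n"
    )
    return call_llm(prompt)
```

This works right up until the corpus no longer fits in the context window, or the model stops attending to the passage that matters -- which is exactly the failure mode that comes next.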

Now, the context might get too long. And this is where it breaks, right? Maybe things don't fit in memory. Or maybe you just pollute the context window too much. So this is where you start to think, oh, OK, that's what's happening. I have too many documents. Oh, that's what's happening.

The documents are not attended to properly by the LLM. And here are, like, the things that are breaking. OK, great. Let's move to the next one. So now you try something very simple, which is, can I retrieve just based on terms? So BM25, what is BM25? BM25 is kind of like four things.

Query terms, frequency of those query terms, length of the document, and just how rare a certain term is. It's a very nice thing. It actually works pretty well. And it's very easy to try. And it has a problem: when queries don't have that nature of, like, a keyword-based search, it doesn't work.
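To make those four ingredients concrete, here's a toy BM25 scorer written from scratch (in practice you'd use a search library or engine, but the math is just this):

```python
import math
from collections import Counter

# Toy BM25: query terms, term frequency, document length, and term rarity (IDF).
# k1 and b are the standard BM25 free parameters.

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    tokenized = [d.lower().split() for d in docs]
    avg_len = sum(len(d) for d in tokenized) / len(tokenized)
    n = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rarity: rare terms are worth more.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency, dampened and normalized by document length.
            norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc) / avg_len))
            score += idf * norm
        scores.append(score)
    return scores
```

For a query like "iPhone battery life" this does great; for "how long does an iPhone last before I need to charge it again" there may be zero overlapping terms, which is the limitation the next technique addresses.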

And this is where you bring in something like relevance embeddings. And relevance embeddings are pretty interesting, because now you're in vector space. And vector space can handle way more nuance than, like, keyword space. But, you know, they also fail in certain ways, especially when you're looking for keyword matching.

And it's actually pretty easy to know when things work and when they don't. Actually, this came from ChatGPT. Like, I went to ChatGPT and I asked, hey, give me a bunch of queries, ones that work for standard term matching, and ones that work for relevance embeddings. And you can see exactly what's going on here, right?

If your query stream looks like iPhone battery life, then you don't need vector search. But if they look something like, oh, how long does an iPhone, like, last before I need to charge it again, then you absolutely need, like, things like vector search. And this is where you need to be, like, tuned to what every technique gives you before you go and invest in it.
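A minimal sketch of the embedding side, assuming a placeholder `embed` function that wraps whatever embedding model or API you use:

```python
import numpy as np

# Sketch of embedding-based retrieval. `embed` is a placeholder for your
# embedding model of choice; it's assumed to return one vector per input string.

def top_k_by_embedding(query: str, docs: list[str], embed, k: int = 5) -> list[str]:
    doc_vecs = np.array(embed(docs))          # shape: (num_docs, dim)
    query_vec = np.array(embed([query]))[0]   # shape: (dim,)

    # Cosine similarity between the query and every document.
    doc_norms = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    sims = doc_norms @ query_norm

    best = np.argsort(-sims)[:k]
    return [docs[i] for i in best]
```

The "how long does an iPhone last before I need to charge it again" query lands near battery-life documents even with no keyword overlap, which is exactly where BM25 alone falls short.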

And when you do your loss analysis and you see, oh, most of my queries actually look like the ones on the right-hand side, then you should absolutely start investing in this area. All right. Now you did BM25. You did vector because your query sets look exactly like that. And now you have a combined candidate set.

And this is where re-rankers help quite a bit. And when people say re-rankers, they're usually referring to cross encoders. And this is a specific architecture. If you remember, the architecture for the relevance embeddings was you're getting a vector for the query, and you're getting a vector for a document, and then you're just measuring distance.

Now, cross encoders are more sophisticated. They actually take both the query and the document, and they give you a score while attending to both at the same time. And that's why they're much more powerful. Now, they are more powerful, but they're actually pretty expensive. And that is a failure state as well.

You can't do it on all your documents. So now you have to add this fancy thing where you're retrieving a lot of things and then ranking a smaller set of things with a technique like that. But it is really powerful, and you should use it. And it fails in certain cases.
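One common way to wire that up is a cross-encoder from the sentence-transformers library; the checkpoint name below is just a widely used public example, not a recommendation:

```python
from sentence_transformers import CrossEncoder

# Load once; re-loading per query would be wasteful.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 10) -> list[str]:
    # The cross-encoder scores the query and each document *together*,
    # which is why it's stronger -- and more expensive -- than a bi-encoder.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_n]]
```

The typical pipeline: BM25 and vector search pull back a few hundred candidates cheaply, and the cross-encoder only scores that smaller set.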

And now when you hit those cases, then you move to the next thing. Now, where does it fail? It's still just relevance. And this is a big problem with, like, you know, standard embeddings and standard re-rankers. They only measure semantic similarity. And there's a thing: these are all proxy metrics.

At the end, your application is your application, and your set of information needs is your set of information needs. And you try to proxy that with relevance. But relevance is not ranking. And this is something that we learned in Google Search, like, 15, 20 years ago: what brings the magic of Google Search?

Well, they look at a lot of other things than just relevance. And this came from -- actually, the talk from Harvey and LanceDB was really, really interesting. And he gave the example of this query, right? It's a really interesting query. Like, it has so much semantics for the legal domain that it's impossible to catch these with just relevance.

And again, what does a word like regime mean? That's a very specific legal term. Material, what does it mean? It actually has a very specific meaning in the legal domain. And then there are things that are very specific to the domain that need to be retrieved, like laws and regulations and such.

And this is where you get to building things like custom embeddings. And you say, you know what, just fetching on relevance is not enough for me. And now I need to go and, like, model my own domain in its own vector space. And now I can actually fetch some of these things.
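As a rough sketch of what building custom embeddings can look like: fine-tune a generic relevance model on (query, passage) pairs mined from your own domain. The training data below is invented for illustration, and the sentence-transformers API details vary by library version:

```python
# Fine-tuning domain embeddings with sentence-transformers, starting from a
# generic relevance model. The examples are made up; real pairs would be mined
# from clicks, citations, or annotations in your own corpus.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

base = SentenceTransformer("all-MiniLM-L6-v2")  # generic starting point

# Pairs where generic relevance fails: legal vocabulary like "moot",
# "material", "regime" has to pull the right passages.
train_examples = [
    InputExample(texts=["is the claim moot", "Mootness doctrine: when a case is no longer justiciable..."]),
    InputExample(texts=["material breach of contract", "A breach is material when it defeats the essential purpose..."]),
    # ... thousands more pairs from your own domain
]

loader = DataLoader(train_examples, shuffle=True, batch_size=32)
# In-batch negatives: every other passage in the batch is treated as a negative.
loss = losses.MultipleNegativesRankingLoss(base)

base.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
base.save("legal-embeddings-v1")
```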

Now, again, go back to ChatGPT: like, is this interesting? Should I actually even do it? So I asked it to give me a list of things that would fail in a standard relevance search in the legal domain. And you start to see, like, oh, all these things would fail.

The words, like, moot don't mean the same thing. Words, like, material don't mean the same thing. And when you have a vocabulary that is so specific and just off, you will not get good results, right? So now, how do you match that? Again, you need to have evals.

You need to have query sets. You need to look at things that are breaking and decide that, oh, the things that are breaking have to do with the vocabulary just being out of distribution of a standard relevance model. And that's how you decide, right? So don't, like, again, don't think too much about it.

Like, oh, should I do it? Should I not do it? Like, what is your query telling you? What is your data telling you? And then go and try to do it or not do it. There's also an example from shopping. So embeddings are very interesting because they help you a lot with retrieval and recall.

But you still need good ranking, right? So now, if you think relevance doesn't work for retrieval, it also probably doesn't work for ranking. There's an example I pulled from Perplexity. I was just trying to break it today. It didn't take too much to break it. I asked, like, give me cheap gifts for my son.

And I followed up with a query, like, but I have a budget of 50 bucks or more. Because when I said cheap, it started giving me, like, $10 things. You know, cheap for me is, like, $50. But it didn't know that. So it's fine. I told it that. But when I said $50 or more, it still gave me $15 and $40, both of which are actually below $50.

And this is kind of interesting, right? Because what we call, like, in standard terms, like, for information retrieval, this is a signal. It's a price signal. And it's not being caught. And it's not being translated into the query. And it's definitely not being translated into the ranking. So now you have to, like, think of, OK, I have ranking.

And I need the ranking to see the semantics of my corpus and my queries. And this has a very specific meaning. Like, when you think of your corpus and queries, again, it's not just relevance. Relevance helps you with natural language. But things like price signals, things like merchant signals, if you're doing, like, podcasts, how many times has it been listened to?

It's a very important signal. It has nothing to do with relevance, right? And in many, many applications, you will see that things that are, for example, more popular tend to rank more highly. And there was a talk that mentioned, like, the PageRank algorithm. PageRank is not about relevance. It's about prominence.

How many things outside of my document point to me? That has nothing to do with relevance and everything to do with the structure of the web corpus. So that's the shape of the data. So this is a signal about the shape of the data and not a signal about, like, the relevance.
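A toy version of what a blended ranking function looks like once relevance is only one signal among several; the fields and weights here are made up for illustration, and in practice they come from tuning against your evals (or from a learned model):

```python
import math
from dataclasses import dataclass

# Toy blended scorer for a shopping-style corpus.

@dataclass
class Product:
    title: str
    price: float
    review_count: int

def rank_score(relevance: float, product: Product, budget_min: float | None) -> float:
    # Price signal: respect a stated budget like "$50 or more".
    price_fit = 1.0 if budget_min is None or product.price >= budget_min else 0.0
    # Popularity signal: about the shape of the corpus, not about relevance
    # (think PageRank, listen counts, review counts).
    popularity = math.log1p(product.review_count)
    return 2.0 * relevance + 3.0 * price_fit + 0.5 * popularity
```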

And, you know, the best way to think about it: you have horizontal semantics and then you have vertical semantics. And if you're in a vertical domain where the semantics are very verticalized, right, let's say you're doing a CRM or you're doing emails, and it's a very complex bar that you're trying to hit that is way beyond just natural language, understand that relevance will be a very tiny, tiny part of the semantic universe.

And the harder you try to go, the more you're going to hit this wall and the more you-- all right, this breaks again. Things keep breaking. I'm sorry. At sufficient complexity, things will keep breaking. So now the thing that breaks with even custom semantics is user preference. Because even when you get to all this, OK, you're saying, I'm doing relevance and I'm doing price signals and merchant signals.

I'm doing everything. Now I know the shopping domain. Well, now you don't know the shopping domain. Because now users are using your product. They're clicking on stuff you thought they were not going to click on. And they're not clicking on things you thought they were going to click on.

And this is where you need to bring in the click signal, the thumbs up, thumbs down signal. Now, these things get very complex. So we're not going to talk about how to implement them. Just because, again, in this case, for example, you have to be able to build a click-through prediction signal.

And then you take that signal and then you combine it with all your other signals. So now if you look at your ranking function, it's doing, OK, I want it to be relevant. I want it to have this semi-structured price signal and query understanding related to that. Plus, I want to get the user preference in that.

And then you take all these signals and you add them up, and that becomes your ranking score. So it becomes a very blended function. And this is how you go from, oh, it's just relevance, to, oh no, it's relevance plus my domain semantics, to relevance plus my semantics plus my user preferences, all rolled up into one.
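A sketch of what that can look like, assuming you log clicks and train something simple like a logistic-regression click model; the features, data, and weights below are invented:

```python
# Fold user preference in: train a click-through prediction model on logged
# (features, clicked) data, then blend its prediction with the other signals.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Logged features per (query, result) pair, e.g. [relevance, price_fit, popularity, position].
X = np.array([[0.9, 1.0, 3.2, 1],
              [0.7, 0.0, 5.1, 2],
              [0.4, 1.0, 0.3, 3]])
y = np.array([1, 0, 0])  # did the user click?

ctr_model = LogisticRegression().fit(X, y)

def final_score(relevance, price_fit, popularity, position):
    p_click = ctr_model.predict_proba([[relevance, price_fit, popularity, position]])[0, 1]
    # One blended ranking function: hand-set weights here; often these are learned too.
    return 2.0 * relevance + 3.0 * price_fit + 0.5 * popularity + 4.0 * p_click
```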

I'll mention two more things. Calling with the wrong queries. That's happening a lot, because this goes into more orchestration. And you're trying to do complex things, especially now when you have agents and you're telling them to use a certain tool. This is happening quite a bit because there is an impedance mismatch between what the search engine expects.

Right? Let's say you tune the search engine and it expects keyword queries or expects even more complex queries. But you cannot describe all of that to the LLM. And the LLM is reasoning about your application and then making queries by itself. And this is a big problem. So one thing that we've seen many companies do, we've done this also at Google, you actually take more control of the actual orchestration.

So you take the big query and you make n smaller queries out of it. I took this screenshot from AI Mode in Google. And it's very brief. You have to catch it before the animation goes away, but you see it's actually making a bunch of queries. It's making 15 queries. It's making 20 queries.

So what we call fan out. Take very complex thing, try to figure out what are all the sub queries in it, and then fan them out. Now you might think, hey, why isn't the LLM doing it? The LLM is kind of doing it, but the LLM doesn't know about your tool.

It doesn't know enough about your search engine. I love MCP, but I'm not a big believer that you can actually teach the LLM, just through prompting, what to expect from the search backend on the other end. This is why people are still like, oh, is the agent autonomous? Do I need to do workflows?

This is very, very complicated. And it will take a while for this to be solved. Because again, it's unclear where the boundary is. Is it the search engine should be able to handle more complex things? And then the LLM will just throw anything its way? Or is it the other way around?

The LLM has to have more information about what the search engine can support so it can tailor its queries. And right now, you need control because the quality is still not there. So it looks like this. If you have sort of like this assistant input, and you're turning it into these narrow queries, like, for example, what is David working on?

This has very, very specific semantics. And it's more like, oh, Jira tickets David. Slack threads David. And it's very, very hard to know without knowing enough about your application that these are the queries that matter and not on the ones on the left-hand side. And if you send the thing on the left-hand side to a search engine, it will absolutely tip over, unless it understands your domain.
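A rough sketch of fan-out, with `call_llm` and `search` as placeholders for your own stack (and assuming the LLM returns a clean JSON list, which in practice you'd validate):

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Fan-out: decompose the assistant input into narrow queries the search
# engine actually understands, then issue them in parallel.

def fan_out(assistant_input: str, call_llm, search, max_queries: int = 10) -> list[dict]:
    prompt = (
        "Break this request into short keyword-style search queries for an "
        "internal search engine (Jira, Slack, Drive). Return a JSON list of strings.\n\n"
        f"Request: {assistant_input}"
    )
    # e.g. "What is David working on?" -> ["jira tickets David", "slack threads David", ...]
    sub_queries = json.loads(call_llm(prompt))[:max_queries]

    with ThreadPoolExecutor(max_workers=8) as pool:
        result_sets = list(pool.map(search, sub_queries))

    # Flatten; downstream ranking decides what actually gets shown.
    return [hit for results in result_sets for hit in results]
```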

And this is where you need to calibrate the boundary. OK. So now you're asking all the right queries. Are you asking them to all the right back ends? And this is another place where it all fails. And this is what we call-- one technique is called supplementary retrieval. This is something you notice like clients do quite a bit, which is they don't call search enough.

And sometimes people try to over-optimize. When you're trying to get high recall, you should always be searching more. I was always like, just search more. This is similar to what we talked about with the in-memory retrieval. Just give it more things. It rarely hurts to give it more things.

I know in the description we said there was this query, falafel, which was really hard to do. And then you think, oh, we're in Google Search. And it's a very simple Middle Eastern dish. And it stumped an organization of 6,000 people. Like, oh my god, what's so hard about this query?

What's so hard about this query is like, it's an ambiguous intent. So you need to reach to a lot of back ends to actually understand enough about it, right? Because you might be asking about food. At which point, I want to show you restaurants. You might be asking this for pictures.

At which point, I want to show you images. Now what Google ended up doing is they ask all the back ends, and then they put the whole thing in. And I would recommend this; it's a great technique to increase the recall even more.
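A minimal sketch of supplementary retrieval: send the same ambiguous query to every backend in parallel and let ranking sort it out; the backend names and `search_*` callables are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

# Supplementary retrieval: don't guess the intent up front; ask every backend.

def retrieve_everywhere(query: str, backends: dict) -> dict:
    # backends = {"web": search_web, "images": search_images, "local": search_places, ...}
    with ThreadPoolExecutor(max_workers=len(backends)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in backends.items()}
    return {name: fut.result() for name, fut in futures.items()}
```

For a query like "falafel", the web, images, and local/restaurants backends all return candidates; the downstream ranker and the UI decide which intent to surface.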

Just call more things. And don't try to be skimpy, unless you're running into some real cost overload. And that's the last one. You're running into cost overloads. GPUs are melting. I tried to generate an image, but then I realized there's actually a pretty good image that is real. Somebody took a server rack and threw it from the roof.

This was-- I didn't need to go to ChatGPT and generate this image. Apparently, this was an advertisement. Pretty expensive one. All right. So this happens a lot. Like when you get to a certain scale, and you have all these back ends, and you're making all these queries, and it's just getting very, very complex.

And this-- I mean, Google's there. Perplexity's there. I mean, Sam Altman keeps complaining about GPUs melting. And this is the part where you need to start doing distillation. And distillation is a very interesting thing, because to do that, you have to learn how to fine-tune models. And this gets to be a little bit complex.

You sort of have to hold the quality bar constant while you decrease the size of the model. The reason you can do that is kind of like in that graph. Like, hey, hire me. I know everything. Actually, I'm firing you. You're overqualified. A very large language model is actually mostly overqualified for the task you want to do.

Because what you really want to do is just one thing. Like Perplexity, they're doing question answering. And they're pretty fast. I mean, when you use Perplexity in certain contexts, it's really, really fast, which is amazing. Because they trained this one model to do this one very specific thing, which is just be really, really good at question answering.

And this is very hard. So I wouldn't do it unless latency becomes a really important thing for your users, right? Like, oh, the thing is taking 10 seconds. Users churn. If I can make it in two seconds, users don't churn. Actually, that's a really good place to be, because then you can use this technique and just bring everything down.
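If you do go there, the classic recipe is Hinton-style knowledge distillation: train a small student to match the big teacher's softened outputs while still fitting the labels. A generic sketch in PyTorch, not any particular company's setup:

```python
import torch
import torch.nn.functional as F

# Knowledge-distillation loss: soft targets from the teacher + hard targets
# from the ground truth. Backprop only through the student; keep the teacher frozen.

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5) -> torch.Tensor:
    # Soft targets: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: ordinary cross-entropy against the ground truth.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```

The eval set is what lets you hold the quality bar constant while the model shrinks: you keep distilling until latency is where you need it and the metrics haven't moved.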

All right. You've done everything, and things are still failing. This is everybody. OK, what do you do? We have a bunch of engineers here. What do you do when everything fails? Yes. You blame the product manager. That's the last trick in the book. When everything fails, make sure it's not your fault.

But I'll say, there's something really important here. Quality engineering will never-- it'll never be 100%. Things will always fail. These are stochastic systems. So then you have to punt the problem. You have to punt it upwards. So it's kind of a joke, but it's not a joke. The design of the product matters a lot to how magical it can seem, because if you try to be more magical than your product surface can absorb, you will run into a bunch of problems.

I use a very simple example. Probably a more complex one would be a human in the loop for customer support, where you're like, OK, in some cases, the bot can handle by its own, but then you'd like to punt to a human. This is basically UX design, right? Like, when do you trust the machine to do what the machine needs to do, and when does the human need to be in the loop?

This is a much simpler example from Google Shopping. There's some cases where Google has a lot of great data, so what we call high understanding. The fidelity of the understanding is really high. And then it shows what we call a high-promise UI. Like, I'll show you things. You can click on them.

There's reviews. There's filters, because I just understand this really well. And there's things Google does not understand at all, mostly just web documents, a bag of words. And what's really interesting is that the UI changes. If you understand more, you show a more kind of, like, filterable, high-promise UI.

If you don't understand enough, you actually degrade your experience. But you degrade it to something that is still workable. Like, I'll show you 10 things, you choose. Oh, no, I know exactly what you want, I'll show you one thing. And this is really, really important. It has to be part of every product. Always understand there's only so much engineering you can do until you have to actually change your product to accommodate this sort of stochastic nature.
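A toy sketch of that idea: pick the surface based on how well you understood the query, with made-up thresholds and modes; the point is just that the UI's promise scales with understanding:

```python
# Graceful degradation / upgrade: the product surface follows the system's
# confidence in its own understanding. Thresholds here are illustrative only.

def choose_surface(understanding: float, results: list) -> dict:
    if understanding > 0.9 and results:
        # High understanding: high-promise UI -- one confident answer, filters, reviews.
        return {"mode": "direct_answer", "items": results[:1]}
    if understanding > 0.5:
        # Medium understanding: show a ranked list and let the user choose.
        return {"mode": "ranked_list", "items": results[:10]}
    # Low understanding: degrade to something still workable, or punt to a human.
    return {"mode": "handoff_or_browse", "items": results[:10], "escalate": True}
```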

So gracefully degrade, gracefully upgrade, depending on the level of your understanding. And again, I'll flash these two slides at the end. Always remember what you're doing, because you can absolutely get into theoretical debates. Again, context window versus RAG. This versus that. Like, you know, agents versus-- I don't know.

Just everything is empirical. In this domain, when you're doing this sort of thing -- oh, I have my evals, I'm trying to step by step go up, I have a toolbox at my disposal -- everything, everything is empirical. So again, baseline, analyze your losses, and then look at your toolbox and see, are there easy things here I can do?

If not, are there at least medium things I could do? If not, you know, should I hire more people and do, like, some really, really hard things? But always remember, like, the choice is on you, and you should be principled, because this can be an absolute waste of time if you're doing it too far ahead of the curve.

All right, again, the slides are here. I think-- oh, I achieved it. 30 seconds left. And if you want the slides, they're here, again. And reach out to us. We're always happy to talk. I think I was very happy with the exit talk, because it's always nice to find, like, friends who are nerds in information retrieval.

We are too. So reach out, and we're happy to talk about, you know, RAG challenges and such and some of the models we are building. All right, thank you so much. We'll see you next time.