Layering every technique in RAG, one query at a time - David Karam, Pi Labs (fmr. Google Search)

Chapters
0:00 Introduction and Context
1:41 Quality Engineering Loop and Mindset
4:09 In-Memory Retrieval
4:50 Term-Based Retrieval (BM25)
5:18 Relevance Embeddings (Vector Search)
6:15 Re-Rankers (Cross Encoders)
7:59 Custom Embeddings
9:40 Domain-Specific Ranking Signals
11:09 User Preference Signals
12:17 Query Orchestration (Fan Out)
14:26 Supplementary Retrieval
16:09 Distillation
17:14 Punting the Problem and Graceful Degradation
I'll just give you all a little bit of context. My co-founder and I, and a lot of our team, come from Google Search, and we're all nerds for information retrieval and search, so there's going to be a little bit of that. I'm just going to go through a whole bunch of ways you can actually shore up and improve your RAG systems.
One thing that I personally sometimes struggle with is that there's a lot of talk about specific techniques, like you do RL this way, and it doesn't help me orient in the space. What are all these things, and how do they hang together? A lot of it is a whole bunch of buzzwords and hype and such. So a lot of what I'll do today is just try to set up a framework: if you are trying to shore up the quality of your system, where do all the things you hear about day in, day out, fit? What we always did at Google, and what we always do at Pi Labs, is just look at things, look at cases, look at queries. That's really the essence of quality engineering.
If you do want the slides, there are about 50 of them, and you can catch them here: withpi.ai/rag-talk should point to the slides. And as I mentioned, plain English, no hype, no buzz.
Before we go into the techniques and get into the weeds of it: you're always trying to solve some product problem. And generally, the best way to visualize this is that you have a certain quality bar you want to reach. There was a very interesting talk this week about how benchmarks aren't really helpful; what matters is a bar high enough that you can actually put the thing out into the world. So you're trying to come up with different ways to get there, and those ways are sort of the techniques. This is sort of your own personal benchmark.
You start with some of the easy bars you want to hit, then there are medium benchmarks and hard benchmarks, and depending on what you want to reach, you keep going. This is what we call the quality engineering loop. OK, say you want to build a CRM: you get your quality there through the simplest way you can.
Now, the reason I say this is because one of the biggest problems, I think, is people reaching for complex techniques when many times you actually don't need these things. The mindset is what I call complexity-adjusted impact, or stay lazy, in the sense of: always look at what's broken.
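To make that "look at what's broken" loop concrete, here's a tiny loss-analysis sketch: run an eval set through the system, collect the failures, and bucket them by failure type so the next technique you invest in matches the biggest bucket. All the function names and failure labels here are hypothetical placeholders, not anything from the talk.

```python
# Sketch of a quality engineering loop: run queries, grade answers, and
# bucket failures so you can pick the highest complexity-adjusted-impact fix.
from collections import Counter

def run_rag(query: str) -> str:
    raise NotImplementedError  # placeholder for your RAG pipeline

def grade(query: str, answer: str, expected: str) -> bool:
    raise NotImplementedError  # placeholder grader (human label or LLM judge)

def diagnose(query: str, answer: str) -> str:
    """Placeholder: label the failure, e.g. 'missed_keyword',
    'wrong_domain_sense', 'ignored_constraint', 'bad_ranking'."""
    raise NotImplementedError

def loss_analysis(eval_set: list[dict]) -> Counter:
    buckets = Counter()
    for case in eval_set:
        answer = run_rag(case["query"])
        if not grade(case["query"], answer, case["expected"]):
            buckets[diagnose(case["query"], answer)] += 1
    return buckets  # e.g. Counter({'missed_keyword': 12, 'ignored_constraint': 4})
```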
We'll go through a lot of techniques today, and the most important two columns are the ones to the right. Most times, BM25 is pretty easy, and it does shore up your quality quite a bit. But should I build custom embeddings for retrieval? That only makes sense when plain relevance embeddings don't do enough for your domain, and then you're willing to put in all that work and effort.
This is the whole "is RAG dead, is RAG not dead" debate. Maybe you just pollute the context window too much, and the documents are not attended to properly by the LLM. Here are the five things that are breaking.
So now you try something very simple, which is term-based retrieval, BM25: query terms, the frequency of those query terms, the length of the document. It breaks when queries don't have that nature, like I was saying, when they don't look like a keyword-based search.
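As a rough illustration, here's a minimal BM25 sketch using the rank_bm25 Python package; the corpus and query are made up, and whitespace tokenization stands in for whatever tokenizer you'd actually use.

```python
# Minimal term-based retrieval sketch with BM25 (rank_bm25 package).
# Corpus and query are toy examples; tokenization here is just str.split().
from rank_bm25 import BM25Okapi

corpus = [
    "iPhone battery life and charging tips",
    "How to reset your Android phone",
    "Best laptop batteries for travel",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)  # scores use term frequency, doc length, IDF

query = "iphone battery life"
scores = bm25.get_scores(query.lower().split())

# Rank documents by BM25 score, highest first.
for score, doc in sorted(zip(scores, corpus), reverse=True):
    print(f"{score:.2f}  {doc}")
```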
Relevance embeddings are pretty interesting, and vector space can handle way more nuance than keyword space. But they also fail in certain ways, especially when you're looking for keyword matching. It's actually pretty easy to know when things work and when they don't. I went to ChatGPT and asked for examples, and you can see exactly what's going on here. If your query stream looks like "iPhone battery life", keyword retrieval handles it; if it looks like "how long does an iPhone last before I need to charge it again", then you absolutely need things like vector search. This is where you need to be tuned to what every technique gives you before you go and invest in it. When you do your loss analysis and you see that most of your queries actually look like the ones on the right-hand side, then you should absolutely start investing in this area. You add vector search because your query set looks exactly like that.
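A minimal sketch of that kind of relevance-embedding retrieval, assuming the sentence-transformers package and a small off-the-shelf model; the documents are toy examples.

```python
# Minimal vector-search sketch: embed query and documents with a bi-encoder,
# then rank by cosine similarity. Model choice and corpus are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "iPhone battery typically lasts a full day on one charge.",
    "Steps to replace a laptop battery safely.",
    "Android phones with the longest battery life in 2024.",
]
doc_embs = model.encode(docs, convert_to_tensor=True)

query = "how long does an iPhone last before I need to charge it again"
query_emb = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query vector and each document vector.
scores = util.cos_sim(query_emb, doc_embs)[0]
for score, doc in sorted(zip(scores.tolist(), docs), reverse=True):
    print(f"{score:.3f}  {doc}")
```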
And this is where re-rankers help quite a bit. When people say re-rankers, they're usually referring to cross encoders. The way relevance embeddings worked was that you get a vector for the query and a vector for the document and compare them. Cross encoders actually take both the query and the document, and they give you a score while attending to both at the same time. Now, they are more powerful, but they're also pretty expensive, so you end up retrieving a bigger candidate set with something cheap and then ranking a smaller set of things with a technique like that. But it is really powerful, and you should use it.
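Here's a rough sketch of that retrieve-then-re-rank pattern, assuming sentence-transformers' CrossEncoder class and a public MS MARCO cross-encoder checkpoint; the candidate list stands in for whatever your first-stage retriever returned.

```python
# Two-stage sketch: a cheap first stage produces candidates, then a cross
# encoder scores (query, document) pairs jointly and re-ranks the short list.
from sentence_transformers import CrossEncoder

query = "how long does an iPhone last before I need to charge it again"

# Pretend these came back from BM25 or vector search (first stage).
candidates = [
    "iPhone battery typically lasts a full day on one charge.",
    "Steps to replace a laptop battery safely.",
    "Android phones with the longest battery life in 2024.",
]

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc) for doc in candidates])

# Keep the order from the cross encoder, not the first-stage retriever.
for score, doc in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.3f}  {doc}")
```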
And when you hit those cases, then you move to the next thing. This is a big problem: these are all proxy metrics, and at the end of the day, your application is your application. This is something we learned in Google Search over the last 15, 20 years: you look at a lot of other things than just relevance. The talk from Harvey and LanceDB was really, really interesting here.
He gave the example of a query that has so much semantics for the legal domain that it's impossible to catch it with just relevance. What does a word like "regime" mean? It actually has a very specific meaning in legal terms. And then there are things that are very specific to the domain that need to be retrieved, like laws and regulations and such.
So you say, you know what, just fetching on relevance is not enough for me; I need to go and model my own domain in its own vector space, and now I can actually fetch some of these things. Again, go back to ChatGPT, isn't this interesting: I asked it to give me a list of things that would fail in a standard relevance search, and you start to see that all these things would fail. Words like "moot" don't mean the same thing. Words like "material" don't mean the same thing. When you have a vocabulary that is so specific and just off the standard distribution, that's when this investment pays off. You need to have evals, you need to look at the things that are breaking, and decide that, oh, the things that are breaking have to do with the vocabulary just being out of distribution for a standard relevance model. So again, don't overthink it.
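As a sketch of what "modeling your own domain in its own vector space" can look like in practice, here's a minimal fine-tune of an off-the-shelf embedding model on (query, relevant passage) pairs, using the classic sentence-transformers fit API with MultipleNegativesRankingLoss; the legal-style pairs are made up for illustration.

```python
# Sketch: adapt a general embedding model to domain vocabulary by training on
# in-domain (query, relevant passage) pairs. Data below is illustrative only.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# In a real system these pairs would come from your own labeled or mined data.
train_examples = [
    InputExample(texts=["is the motion moot", "The court dismissed the motion as moot because..."]),
    InputExample(texts=["material breach of contract", "A breach is material when it defeats the purpose..."]),
    InputExample(texts=["change in regulatory regime", "The statute establishes a new licensing regime..."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch negatives: every other passage in the batch acts as a negative.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
)
model.save("legal-embeddings-v0")  # then index your corpus with this model
```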
Embeddings are very interesting, but now think about the cases where relevance alone doesn't work for retrieval. I asked ChatGPT, give me cheap gifts for my son, and then I followed up with, but I have a budget of $50, because when I said "cheap" it started giving me things like $10 gifts. It then gave me $15 and $40 options, both of which are below $50. In standard information-retrieval terms, that budget is a signal, and it's not being translated into the query, and it's definitely not being translated into the ranking.
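A tiny sketch of what "translating that signal" could look like: pull the budget out of the conversation and turn it into a structured constraint the retrieval layer can actually use. The regex and the filter function are hypothetical stand-ins for however you'd really extract constraints (often with an LLM).

```python
# Sketch: turn a conversational budget ("I have a budget of 50") into a
# structured signal and apply it as a filter at retrieval time.
import re
from typing import Optional

def extract_budget(conversation: str) -> Optional[float]:
    """Very naive budget extraction; a real system might use an LLM here."""
    match = re.search(r"budget of \$?(\d+(?:\.\d+)?)", conversation.lower())
    return float(match.group(1)) if match else None

def apply_budget(results: list[dict], budget: Optional[float]) -> list[dict]:
    """Drop items over budget; relevance ranking stays otherwise unchanged."""
    if budget is None:
        return results
    return [r for r in results if r["price"] <= budget]

conversation = "Give me cheap gifts for my son. But I have a budget of 50."
results = [
    {"title": "Lego set", "price": 40.0},
    {"title": "Game console", "price": 300.0},
    {"title": "Puzzle", "price": 15.0},
]
print(apply_budget(results, extract_budget(conversation)))
```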
So now you have to think: OK, I have ranking, and I need the ranking to see the semantics of my corpus and my queries. Things like price signals, things like merchant signals; if you're doing podcasts, how many times has an episode been listened to? Then the ranking will see things that are, for example, more popular. There was a talk that mentioned the PageRank algorithm: how many things outside of my document point to me? That has nothing to do with relevance and everything to do with the shape of the data. So this is a signal about the shape of the data. Think of it like you have horizontal semantics and vertical semantics, and if you're in a vertical domain where the semantics are very verticalized, say you're doing a CRM or you're doing emails, and it's a very complex bar that you're trying to hit, understand that relevance will be a very tiny, tiny part of your ranking. The harder you try to go, the more you're going to hit this wall. At sufficient complexity, things will keep breaking.
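A minimal sketch of folding those domain signals into the ranking function: a weighted blend of the relevance score with price and popularity signals. The weights, the normalization, and the item data are made-up placeholders you'd tune against your own evals.

```python
# Sketch: combine a relevance score with domain-specific signals
# (price fit, popularity) into one ranking score. Weights are illustrative.
import math

def rank_score(relevance: float, price: float, budget: float, views: int) -> float:
    # Price signal: 1.0 when comfortably under budget, 0.0 when over it.
    price_fit = max(0.0, min(1.0, (budget - price) / budget)) if budget else 0.5
    # Popularity signal: log-scaled view/listen count squashed into [0, 1).
    popularity = math.log1p(views) / (1.0 + math.log1p(views))
    # Weighted blend; relevance is only one part of the final score.
    return 0.5 * relevance + 0.3 * price_fit + 0.2 * popularity

items = [
    {"title": "Lego set", "relevance": 0.82, "price": 40.0, "views": 1200},
    {"title": "Puzzle", "relevance": 0.78, "price": 15.0, "views": 9000},
    {"title": "Game console", "relevance": 0.90, "price": 300.0, "views": 50000},
]
budget = 50.0
ranked = sorted(
    items,
    key=lambda i: rank_score(i["relevance"], i["price"], budget, i["views"]),
    reverse=True,
)
for item in ranked:
    print(item["title"])
```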
The thing that breaks even with custom semantics is user preference. Even when you get to all this and you're saying, I'm doing relevance and I'm doing price signals and merchant signals, users still surprise you. They're clicking on stuff you thought they weren't going to click on, and they're not clicking on things you thought they would. This is where you need to bring in the click signal. We're not going to talk about how to implement it in depth, but in this case, for example, you have to be able to collect a click-through signal, and then you take that signal and combine it with the rest. So if you look at your ranking function, it's doing: I want this semi-structured price signal, plus I want the user preference in there, and then you take all these signals and you add them up. That's how you go from "oh, it's just relevance" to "no, it's relevance and my domain semantics and my user preferences all rolled up into one."
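To make the click signal concrete, here's a sketch of a smoothed click-through-rate estimate per document that gets added into the same kind of blended score as above; the smoothing constants and logging tables are arbitrary, hypothetical placeholders.

```python
# Sketch: estimate a per-document click-through rate from logged impressions
# and clicks, with smoothing so rarely shown documents aren't over-trusted,
# then fold it into the blended ranking score.

PRIOR_CLICKS, PRIOR_IMPRESSIONS = 1.0, 20.0  # arbitrary smoothing prior

def click_signal(clicks: int, impressions: int) -> float:
    """Smoothed CTR in [0, 1]; behaves like the prior until data accumulates."""
    return (clicks + PRIOR_CLICKS) / (impressions + PRIOR_IMPRESSIONS)

def final_score(relevance: float, domain_signal: float, ctr: float) -> float:
    # Same idea as before: relevance is just one term among several.
    return 0.4 * relevance + 0.3 * domain_signal + 0.3 * ctr

# Hypothetical logs: (clicks, impressions) per document.
logs = {"doc_a": (5, 40), "doc_b": (0, 3), "doc_c": (50, 1000)}
candidates = [
    {"id": "doc_a", "relevance": 0.7, "domain_signal": 0.6},
    {"id": "doc_b", "relevance": 0.9, "domain_signal": 0.5},
    {"id": "doc_c", "relevance": 0.6, "domain_signal": 0.8},
]
ranked = sorted(
    candidates,
    key=lambda c: final_score(c["relevance"], c["domain_signal"], click_signal(*logs[c["id"]])),
    reverse=True,
)
for c in ranked:
    print(c["id"])
```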
That happens a lot, and this goes into more orchestration. You have an LLM and you're telling it to use a certain tool, and there is an impedance mismatch between what the search engine expects and what the LLM sends it. Let's say you tune the search engine and it expects keyword queries; you cannot describe all of that to the LLM, and the LLM is reasoning about your application without knowing those details. So one thing that we've seen many companies do is take more control of the actual orchestration. I took this screenshot from AI Mode in Google; you have to catch it, because after the animation goes away it disappears. It takes the broad question, figures out what all the sub-queries in it are, and then fans them out. Now you might think, hey, why isn't the LLM doing it? The LLM is kind of doing it, but the LLM doesn't know enough about your search engine, and there's only so much you can teach the LLM just through prompting about what to expect from the search back end on the other side.
This is why people are still asking, is the agent autonomous or not? It will take a while for this to be solved, because it's unclear where the boundary is. Should the search engine be able to handle anything, and the LLM will just throw whatever its way? Or should the LLM know about what the search engine can support so it can tailor its queries? Right now, you need control, because the quality depends on it. If you have this broad assistant input and you're turning it into narrow queries, it's very, very hard to know, without knowing enough about your application, that these are the queries that matter and not the ones on the left-hand side. If you send the thing on the left-hand side to a search engine, it will absolutely tip over. This is where you need to calibrate the boundary.
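Here's a rough sketch of that fan-out orchestration: ask an LLM to decompose the broad assistant-style input into narrow, search-friendly sub-queries, then send each one to the retrieval back end and merge. `call_llm` and `search_backend` are hypothetical placeholders for your own model client and search engine, not real APIs.

```python
# Sketch: query fan-out. An LLM rewrites one broad request into several
# narrow queries shaped the way the search back end expects, then we fan out.
import json

def call_llm(prompt: str) -> str:
    """Placeholder for your LLM client (OpenAI, Gemini, local model, ...)."""
    raise NotImplementedError

def search_backend(query: str, k: int = 5) -> list[dict]:
    """Placeholder for your tuned search engine (expects short keyword-ish queries)."""
    raise NotImplementedError

FANOUT_PROMPT = """Break the user request below into 2-5 short search queries
that a keyword-oriented search engine can handle. Return a JSON list of strings.

User request: {request}"""

def fan_out(request: str) -> list[dict]:
    sub_queries = json.loads(call_llm(FANOUT_PROMPT.format(request=request)))
    results, seen_ids = [], set()
    for q in sub_queries:
        for hit in search_backend(q):
            if hit["id"] not in seen_ids:   # de-duplicate across sub-queries
                seen_ids.add(hit["id"])
                results.append(hit)
    return results  # hand these to a re-ranker before building the context
```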
Are you sending the queries to all the right back ends? This is another place where it all fails, and one technique here is called supplementary retrieval. This is something you notice clients do quite a bit, and it's similar to what we talked about with dynamic content. In Google Search there was this query, "falafel", which was really hard to do. And you think, we're in Google Search, and it stumped an organization of 6,000 people; what's so hard about this query? You need a lot of context to actually understand enough about it: at which point do I want to show you restaurants? What Google ended up doing is asking supplementary queries alongside the original one, and that is a great technique to just increase the recall even more.
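A small sketch of supplementary retrieval under those assumptions: alongside the original query, issue a few extra related queries (here generated by a placeholder LLM call) and union the results to boost recall. `call_llm` and `search_backend` are the same hypothetical placeholders as in the fan-out sketch.

```python
# Sketch: supplementary retrieval. Run the original query as-is, plus a few
# supplementary interpretations of it, and merge the result sets for recall.
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder LLM client

def search_backend(query: str, k: int = 5) -> list[dict]:
    raise NotImplementedError  # placeholder search engine

SUPPLEMENT_PROMPT = """Give 2-3 supplementary search queries that cover other
plausible intents behind this query (e.g. 'falafel' -> 'falafel restaurants near me').
Return a JSON list of strings.

Query: {query}"""

def retrieve_with_supplements(query: str) -> list[dict]:
    queries = [query] + json.loads(call_llm(SUPPLEMENT_PROMPT.format(query=query)))
    merged, seen = [], set()
    for q in queries:
        for hit in search_backend(q):
            if hit["id"] not in seen:
                seen.add(hit["id"])
                merged.append(hit)
    return merged  # higher recall; downstream ranking decides what surfaces
```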
I tried to generate an image for this, but then I realized there's actually a pretty good real one: somebody took a server rack and threw it off the roof. I mean, Sam Altman keeps complaining about GPUs melting. This is the part where you need to start doing distillation. Distillation is a very interesting thing: you hold the quality bar constant while bringing cost and latency down. The reason you can do that is kind of like in that graph: a very large language model is actually mostly overqualified for the task you want to do, because what you really want to do is just one thing. Like Perplexity: they're doing question answering. When you use Perplexity in certain contexts, it's really fast, because they trained this one model to do this one very specific thing, which is just be really, really good at answering questions. So I wouldn't do it unless latency becomes a really big problem. If I can make it in two seconds, users don't churn.
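As a sketch of the distillation idea under these assumptions: use the big, slow model as a teacher to label a batch of representative queries, write the pairs out as training data, and fine-tune a much smaller student on exactly that one task. `teacher_answer` is a hypothetical wrapper around your large model; the fine-tuning step itself would use whatever trainer fits your student model.

```python
# Sketch: distillation data generation. The expensive teacher model labels
# representative production queries; the small student is then fine-tuned on
# these (input, teacher_answer) pairs to do just this one task, fast.
import json

def teacher_answer(query: str, context: str) -> str:
    """Placeholder for the large, slow, high-quality model."""
    raise NotImplementedError

def build_distillation_set(samples: list[dict], out_path: str) -> None:
    with open(out_path, "w") as f:
        for s in samples:
            record = {
                "input": f"Question: {s['query']}\nContext: {s['context']}",
                "target": teacher_answer(s["query"], s["context"]),
            }
            f.write(json.dumps(record) + "\n")

# samples would come from logged real traffic, e.g.:
# build_distillation_set(logged_queries, "distill_train.jsonl")
# Then fine-tune a small model on that JSONL and evaluate it against the same
# quality bar the teacher was hitting, watching latency and cost.
```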
When everything fails, make sure it's not your fault. But I'll say there's something really important here: you have to match your product surface to how magical it can seem, because if you try to be more magical than your product surface can absorb, that's a problem. Probably a more complex version of this is a human in the loop: in some cases the bot can handle things on its own, and you have to decide when the human needs to be in the loop. A much simpler example is Google Shopping. There are some cases where Google has a lot of great data and the fidelity of the understanding is really high, and then it shows what we call a high-promise UI: there are filters, because it understands this really well. And there are things Google does not understand at all. If you understand more, you show a more high-promise UI; if not, you degrade it to something that is still workable.
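A minimal sketch of that kind of graceful degradation: pick the UI treatment based on how confident the system is in its understanding, falling back to a plainer but still workable experience. The confidence field and the threshold values are made-up placeholders.

```python
# Sketch: graceful degradation. High-confidence understanding gets the
# "high-promise" UI (filters, structured results); low confidence degrades
# to a plain result list instead of pretending to understand.

HIGH_CONFIDENCE = 0.8   # illustrative thresholds, tuned against your evals
LOW_CONFIDENCE = 0.4

def choose_surface(understanding_confidence: float) -> str:
    if understanding_confidence >= HIGH_CONFIDENCE:
        return "structured_results_with_filters"   # high-promise UI
    if understanding_confidence >= LOW_CONFIDENCE:
        return "plain_ranked_list"                 # still workable
    return "handoff_or_clarifying_question"        # punt gracefully

for conf in (0.95, 0.6, 0.2):
    print(conf, "->", choose_surface(conf))
```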
It has to be part of everything. Always understand that there's only so much engineering you can do until you have to actually change your product to accommodate this sort of stochastic nature, depending on the level of your understanding. And again, I'll flash these two slides at the end, because you can absolutely get into theoretical debates. In this domain, when you're doing this sort of thing, ask: are there easy things I can do? If not, are there at least medium things I could do? Or do I need to go and do some really, really hard things? But always remember, the choice is on you. And if you want the slides, they're here again. It's always nice to find friends who are working on the same problems, so reach out, and I'm happy to talk about RAG challenges and such, and some of the models.