
Layering every technique in RAG, one query at a time - David Karam, Pi Labs (fmr. Google Search)


Chapters

0:00 Introduction and Context
1:41 Quality Engineering Loop and Mindset
4:09 In-Memory Retrieval
4:50 Term-Based Retrieval (BM25)
5:18 Relevance Embeddings (Vector Search)
6:15 Re-Rankers (Cross Encoders)
7:59 Custom Embeddings
9:40 Domain-Specific Ranking Signals
11:09 User Preference Signals
12:17 Query Orchestration (Fan Out)
14:26 Supplementary Retrieval
16:09 Distillation
17:14 Punting the Problem and Graceful Degradation


00:00:00.040 | I'll just give you all a little bit of context.
00:00:16.980 | So my co-founder and I, and a lot of our team,
00:00:19.440 | we're actually working on Google Search.
00:00:20.860 | And then we left and started Pi Labs.
00:00:22.240 | And I loved the exit talk.
00:00:24.720 | And we're all nerds for information retrieval and search.
00:00:27.800 | And so there's going to be a little bit of that.
00:00:30.540 | Just going to go through a whole bunch of ways
00:00:32.460 | you can actually shore up and improve your RAG systems.
00:00:35.280 | I think one thing that I personally sometimes struggle
00:00:37.880 | with is there's a lot of talk about things sometimes
00:00:40.700 | like too much in the weeds.
00:00:41.900 | Like, oh, specific techniques, and you do RL this way,
00:00:44.160 | and you can tune the model this way.
00:00:45.580 | It's like, that doesn't help me orient in the space.
00:00:47.440 | Like, what are all these things, and how do I hang on them?
00:00:51.040 | Or you have the complete opposite, which
00:00:52.460 | is like a whole bunch of buzzwords and hype and such.
00:00:54.600 | And like, RAG is dead.
00:00:55.660 | No, RAG is not dead.
00:00:56.660 | It's like agents, like, wait, what?
00:00:59.160 | So just, you know, I think a lot of what I'll do today
00:01:01.520 | is just what I call like plain English.
00:01:03.940 | Just trying to like set up a framework, right?
00:01:06.300 | Like very centered around like, OK,
00:01:07.740 | if you are trying to shore up the quality of your system,
00:01:09.760 | how do you do that?
00:01:10.780 | And then where do all the things you hear about day in, day out,
00:01:13.560 | like fit?
00:01:14.400 | And then just how to approach that.
00:01:15.900 | And I'll give a lot of examples.
00:01:17.120 | I think one thing that I always love,
00:01:18.800 | and we always did in Google, we always do in Pi Labs,
00:01:21.840 | is just like look at things, look at cases, look at queries,
00:01:24.900 | see what's working, see what's not working.
00:01:26.440 | And that's really the essence of like quality engineering,
00:01:28.800 | as we used to call it at Google.
00:01:30.180 | If you do want the slides, there's like 50 slides.
00:01:32.220 | And I set a challenge for myself to go
00:01:33.960 | through 50 slides in 19 minutes.
00:01:35.960 | But you can catch the slides here if you want.
00:01:37.800 | I'll flash this towards the end as well.
00:01:39.680 | withpi.ai/rag-talk, it should point to the slides
00:01:43.440 | that we're going through.
00:01:44.420 | And as I mentioned, plain English, no hype, no buzz,
00:01:47.380 | no debates, no like.
00:01:48.920 | All right, so how to think about techniques.
00:01:50.300 | Before we go into techniques and get into the weeds of it,
00:01:52.560 | why does this even matter?
00:01:54.280 | And the way we always think about it
00:01:55.560 | is like always start with outcomes.
00:01:56.940 | You're always trying to solve some product problem.
00:01:59.420 | And generally, the best way to visualize something like this,
00:02:01.800 | you have a certain quality bar you want to reach.
00:02:04.040 | And there was a very interesting talk this week
00:02:06.560 | about how, you know, benchmarks aren't really helpful.
00:02:08.720 | And absolutely, evals are helpful.
00:02:10.120 | You're trying to launch a CRM agent.
00:02:11.680 | And you sort of have a launch bar,
00:02:13.020 | like a place where you feel comfortable
00:02:14.480 | that you can actually put it out into the world.
00:02:16.800 | And techniques fit somewhere here.
00:02:18.780 | You have that like kind of end metric.
00:02:21.140 | And you're trying to like come up with different ways
00:02:23.380 | to shore up the quality.
00:02:24.340 | And those ways are like sort of the techniques there.
00:02:26.720 | And you know, this is sort of your own personal benchmark.
00:02:29.060 | You start with some of the easy bars you want to hit.
00:02:32.580 | And then there's like medium benchmarks and hard benchmarks.
00:02:34.640 | So these are query sets you're setting up.
00:02:36.920 | And then, you know, depending on what you want to reach
00:02:39.320 | and at what time frame, then you end up
00:02:42.020 | trying different things.
00:02:43.160 | And this is what we call like quality engineering loop.
00:02:45.400 | You sort of like baseline yourself.
00:02:46.980 | OK, you want to achieve, you know, you want CRM.
00:02:49.480 | And this is the easy query set.
00:02:51.140 | And your quality is there just through the simplest way
00:02:53.740 | you can try it.
00:02:54.640 | Do a loss analysis.
00:02:55.820 | OK, what's broken?
00:02:56.840 | There were a lot of eval talks this week.
00:02:58.480 | And then what we call quality engineering.
00:03:00.280 | Now, the reason I say this is because like, OK,
00:03:02.520 | techniques fit in this last bucket.
00:03:04.720 | And one of the biggest problems, I think, is that
00:03:07.480 | people sometimes start there.
00:03:08.720 | And it doesn't make any sense.
00:03:09.940 | Because you say, oh, do I need BM25?
00:03:11.680 | Or do I need like vector retrieval?
00:03:14.140 | It's like, I don't know.
00:03:15.140 | What are you trying to do?
00:03:16.220 | And what is your query set?
00:03:17.260 | And where are things failing?
00:03:18.500 | Because many times you actually don't need these things.
00:03:20.200 | And you end up implementing them.
00:03:21.200 | It doesn't make a lot of sense.
00:03:22.900 | Anyway, so usually the thing I say
00:03:25.180 | is what I call complexity-adjusted impact, or stay
00:03:28.240 | lazy, in the sense of: always look at what's broken.
00:03:31.020 | And if it's not broken, don't fix it.
00:03:32.400 | And if it is broken, do fix it.
00:03:34.520 | And we'll go through a lot of techniques today.
00:03:36.500 | But this is a good way to think about them.
00:03:38.380 | It's just a cluster.
00:03:39.100 | It's a catalog of stuff.
00:03:40.600 | The most important two columns are the ones to the right,
00:03:42.900 | difficulty and impact.
00:03:44.260 | And if it's easy, go ahead and try it.
00:03:45.660 | And most times, like BM25-- BM25 is pretty easy.
00:03:48.380 | You should absolutely try it.
00:03:49.420 | And it does shore up your quality quite a bit.
00:03:52.900 | But should I build custom embeddings for retrieval?
00:03:55.260 | Like, I don't know.
00:03:56.080 | Let's take a look.
00:03:56.680 | This is actually really, really hard.
00:03:58.320 | Harvey gave a talk.
00:03:59.140 | They build custom embeddings.
00:04:00.460 | But they have a really hard problem space.
00:04:02.260 | And just relevance embeddings don't do enough for them.
00:04:05.660 | And then they're willing to put all that work and effort.
00:04:08.720 | All right, queries, examples.
00:04:10.000 | Let's start stuff.
00:04:10.760 | First technique, in-memory retrieval.
00:04:12.900 | Easiest thing, bring all your documents.
00:04:15.220 | Shove them all to the LLM.
00:04:17.020 | This is the whole like, is RAG dead, is RAG not dead,
00:04:18.900 | context windows.
00:04:19.700 | Well, context windows are pretty easy.
00:04:21.140 | So you should definitely start there.
00:04:22.860 | One example, NotebookLM.
00:04:24.860 | Very nice product.
00:04:25.640 | You actually put in five documents.
00:04:27.220 | You just ask questions about them.
00:04:28.620 | You don't need any RAG.
00:04:29.460 | Just shove the whole thing in.
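
For illustration (not from the talk), a minimal sketch of in-memory retrieval in Python; call_llm is a hypothetical stand-in for whatever LLM client you use:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your LLM client of choice."""
    raise NotImplementedError

def answer_from_full_context(question: str, documents: list[str]) -> str:
    # No index, no ranking: concatenate every document into the prompt
    # and let the model's context window do all the work.
    context = "\n\n---\n\n".join(documents)
    prompt = (
        "Answer the question using only the documents below.\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )
    return call_llm(prompt)
```

That is the whole technique; the failure modes listed next (documents don't fit, context gets polluted) are what push you toward real retrieval.
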
00:04:30.520 | Now, the context might get too long.
00:04:32.820 | And this is where it breaks, right?
00:04:34.080 | Maybe things don't fit in memory.
00:04:36.140 | Or maybe you just pollute the context window too much.
00:04:38.620 | So this is where you start to think, oh, OK,
00:04:40.980 | that's what's happening.
00:04:42.300 | I have too many documents.
00:04:43.400 | Oh, that's what's happening.
00:04:44.300 | The documents are not attended properly by the LLM.
00:04:47.260 | And here are, like, the five things that are breaking.
00:04:49.060 | OK, great.
00:04:49.520 | Let's move to the next one.
00:04:50.840 | So now you try something very simple, which is,
00:04:52.820 | can I retrieve just based on terms?
00:04:54.560 | So BM25, what is BM25?
00:04:56.120 | BM25 is kind of like four things.
00:04:58.060 | Query terms, frequency of those query terms, length of the document,
00:05:02.020 | and just how rare a certain term is.
00:05:05.020 | It's a very nice thing.
00:05:06.020 | It actually works pretty well.
00:05:07.300 | And it's very easy to try.
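
To make those four ingredients concrete, here is a from-scratch BM25 scorer in Python (an illustrative sketch, not the talk's code; tokenization is naive whitespace splitting):

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score docs against a query using BM25's four ingredients:
    query terms, their frequency, document length, and term rarity."""
    corpus = [d.lower().split() for d in docs]           # naive tokenization
    avg_len = sum(len(d) for d in corpus) / len(corpus)
    n = len(corpus)
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for d in corpus if term in d)     # how rare is the term?
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            freq = tf[term]                              # term frequency
            # Longer documents are penalized via the length normalization in b.
            denom = freq + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * freq * (k1 + 1) / denom
        scores.append(score)
    return scores

docs = ["iPhone battery life tips and tricks",
        "charging an Android phone overnight",
        "battery recycling drop-off locations"]
print(bm25_scores("iphone battery life", docs))  # the first doc scores highest
```
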
00:05:09.220 | And it has a problem, though: when queries
00:05:12.520 | don't have that nature, like, as I was saying,
00:05:15.540 | that nature of, like, a keyword-based search,
00:05:17.640 | it doesn't work.
00:05:18.460 | And this is where you bring in something
00:05:19.840 | like relevance embeddings.
00:05:20.860 | And relevance embeddings are pretty interesting,
00:05:22.600 | because now you're in vector space.
00:05:24.240 | And vector space can handle way more nuance than, like, keyword space.
00:05:28.600 | But, you know, they also fail in certain ways,
00:05:30.540 | especially when you're looking for keyword matching.
00:05:32.860 | And it's actually pretty easy to know when things work
00:05:35.000 | and when they don't.
00:05:35.700 | Actually, I generated this.
00:05:36.500 | Like, I went to ChatGPT and I asked, like, hey,
00:05:38.040 | give me a bunch of keywords, ones that
00:05:39.580 | work for, like, standard term matching,
00:05:41.300 | and ones that work for relevance embedding.
00:05:42.960 | And you can see, like, exactly what's going on here, right?
00:05:45.220 | If your query stream looks like iPhone battery life,
00:05:48.140 | then you don't need vector search.
00:05:51.440 | But if they look something like, oh,
00:05:52.860 | how long does an iPhone, like, last before I need to charge it
00:05:55.000 | again, then you absolutely need, like, things like vector search.
00:05:58.160 | And this is where you need to be, like, tuned
00:05:59.820 | to what every technique gives you before you go and invest in it.
00:06:02.580 | And when you do your loss analysis and you see, oh, most of my queries
00:06:05.380 | actually look like the ones on the right-hand side,
00:06:07.840 | then you should absolutely start investing in this area.
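
For illustration, a minimal sketch of vector retrieval, assuming the open-source sentence-transformers library and one of its stock models:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any stock relevance embedding model

docs = [
    "The iPhone typically lasts about a day of normal use per charge.",
    "How to replace a cracked phone screen at home.",
]
doc_vecs = model.encode(docs, convert_to_tensor=True)

# A paraphrased query with little keyword overlap: the case where
# BM25 struggles and embeddings shine.
query = "how long does an iPhone last before I need to charge it again"
query_vec = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_vec, doc_vecs)[0]  # cosine similarity in vector space
print(docs[int(scores.argmax())])              # battery-life doc wins despite no shared keywords
```
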
00:06:10.720 | All right.
00:06:11.380 | Now you did BM25.
00:06:12.420 | You did vector because your query sets look exactly like that.
00:06:15.760 | And now you have a combined candidate set.
00:06:18.040 | And this is where re-rankers help quite a bit.
00:06:20.140 | And when people say re-rankers, they're usually referring
00:06:22.000 | to, like, cross encoders.
00:06:23.260 | And this is a specific architecture.
00:06:24.600 | If you remember the architecture here
00:06:25.720 | for the relevance embeddings was you're getting a vector for the query,
00:06:29.600 | and you're getting a vector for a document,
00:06:30.880 | and then you're just measuring distance.
00:06:32.600 | Now, cross encoders are more sophisticated.
00:06:34.300 | They actually take both the query and the document,
00:06:36.440 | and they give you a score while attending to both at the same time.
00:06:39.880 | And that's what makes them much more powerful.
00:06:41.600 | Now, they are more powerful, but they're actually pretty expensive.
00:06:44.380 | And now this is a failure state as well.
00:06:46.180 | You can't do it on all your documents.
00:06:47.920 | So now you have to add, like, this fancy thing
00:06:49.620 | where you're retrieving a lot of things
00:06:51.180 | and then ranking a smaller set of things with a technique like that.
00:06:55.480 | But it is really powerful, and you should use it.
00:06:57.240 | And it fails in certain cases.
00:06:59.540 | And now when you hit those cases, then you move to the next thing.
00:07:02.400 | Now, where does it fail?
00:07:03.980 | It's still just relevance.
00:07:05.320 | And this is a big problem with, like, you know,
00:07:06.640 | standard embeddings and standard re-rankers.
00:07:08.680 | They only measure semantic similarity.
00:07:11.360 | And there's a thing, like, these are all proxy metrics.
00:07:13.480 | At the end, like, your application is your application,
00:07:15.240 | and your set of information needs
00:07:16.300 | is your set of information needs.
00:07:17.880 | And you try to proxy with relevance.
00:07:19.840 | But relevance is not ranking.
00:07:21.020 | And this is something that we learned in Google Search
00:07:24.260 | over, like, 15, 20 years: you know,
00:07:26.040 | what brings the magic of Google Search?
00:07:27.320 | Well, they look at a lot of other things than just relevance.
00:07:30.420 | And this is-- you know, this came from, like, actually,
00:07:32.480 | the talk from Harvey and LanceDB was really, really interesting.
00:07:35.500 | And he gave the example of this query, right?
00:07:37.800 | It's a really interesting query.
00:07:38.760 | Like, it has so much semantics for the legal domain
00:07:43.460 | that it's impossible to catch these with just relevance.
00:07:47.340 | And again, what does a word like "regime" mean?
00:07:49.000 | That's a very specific, like, legal term.
00:07:50.560 | Material, what does it mean?
00:07:51.900 | It actually has a very specific meaning in the legal domain.
00:07:54.780 | And then there's, like, things that are very specific to the domain
00:07:56.960 | that need to be retrieved, like laws and regulations and such.
00:07:59.800 | And this is where you get to building things
00:08:01.380 | like custom embeddings.
00:08:02.300 | And you say, you know what, just fetching on relevance is not enough for me.
00:08:05.520 | And now I need to go and, like, model my own domain in its own vector space.
00:08:10.020 | And now I can actually fetch some of these things.
00:08:12.300 | Now, again, go back to ChatGPT, like, is this interesting?
00:08:15.220 | Should I actually even do it?
00:08:16.520 | So I asked it to give me a list of things that would fail in a standard relevance search
00:08:20.520 | in the legal domain.
00:08:21.520 | And you start to see, like, oh, all these things would fail.
00:08:23.860 | The words, like, moot don't mean the same thing.
00:08:26.280 | Words, like, material don't mean the same thing.
00:08:28.860 | And when you have a vocabulary that is so specific and just off,
00:08:32.500 | you will not get good results, right?
00:08:35.000 | So now, how do you match that?
00:08:36.460 | Like, you need to have, again, you need to have evals.
00:08:38.260 | You need to have query sets.
00:08:39.120 | You need to look at things that are breaking and decide that,
00:08:41.460 | oh, the things that are breaking have to do with the vocabulary
00:08:44.000 | just being out of distribution of a standard relevance model.
00:08:48.360 | And that's how you decide, right?
00:08:49.920 | So don't, like, again, don't think too much about it.
00:08:52.260 | Like, oh, should I do it?
00:08:53.200 | Should I not do it?
00:08:53.880 | Like, what is your query telling you?
00:08:55.620 | What is your data telling you?
00:08:56.780 | And then go and try to do it or not do it.
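
For a flavor of what that investment looks like, here is a heavily simplified fine-tuning sketch using sentence-transformers; the legal training pairs are invented, and gathering thousands of real ones is most of the cost:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # generic starting checkpoint

train_examples = [
    # Teach the model the legal senses of "material" and "moot".
    InputExample(texts=["material breach of the agreement",
                        "a violation serious enough to excuse performance"]),
    InputExample(texts=["the appeal is moot",
                        "the dispute no longer presents a live controversy"]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch negatives: each pair is a positive; every other passage in
# the batch serves as a negative example.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```
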
00:08:58.960 | There's also an example from shopping.
00:09:01.780 | So embeddings are very interesting because they
00:09:04.080 | help you a lot with retrieval and recall.
00:09:06.020 | But you still need good ranking, right?
00:09:08.060 | So now, if you think relevance doesn't work with retrieval,
00:09:11.620 | it also probably doesn't work with ranking.
00:09:13.620 | There's an example I pulled from perplexity.
00:09:15.160 | I was just trying to break it today.
00:09:16.980 | It didn't take too much to break it.
00:09:19.000 | I asked, like, give me cheap gift ideas for my son.
00:09:21.880 | And I followed up with a query, like, but I have a budget of 50
00:09:24.100 | bucks or more.
00:09:24.720 | Because when I said cheap, it started giving me, like, $10.
00:09:27.460 | You know, cheap for me is, like, $50.
00:09:29.560 | But it didn't know that.
00:09:30.380 | So it's fine.
00:09:30.760 | I told it that.
00:09:31.340 | But when I said $50 or more, it still
00:09:33.060 | gave me $15 and $40, both of which are actually below $50.
00:09:37.940 | And this is kind of interesting, right?
00:09:39.240 | Because what we call, like, in standard terms,
00:09:41.260 | like, for information retrieval, this is a signal.
00:09:43.360 | It's a price signal.
00:09:44.600 | And it's not being caught.
00:09:45.640 | And it's not being translated into the query.
00:09:47.020 | And it's definitely not being translated into the ranking.
00:09:49.520 | So now you have to, like, think of, OK, I have ranking.
00:09:53.100 | And I need the ranking to see the semantics of my corpus
00:09:55.600 | and my queries.
00:09:56.920 | And this has a very specific meaning.
00:09:58.520 | Like, when you think of your corpus and queries, again,
00:10:00.860 | it's not just relevance.
00:10:01.780 | Relevance helps you with natural language.
00:10:03.620 | But things like price signals, things like merchant signals,
00:10:06.640 | if you're doing, like, podcasts, how many times has it been listened to?
00:10:09.500 | It's a very important signal.
00:10:10.580 | It has nothing to do with relevance, right?
00:10:12.620 | And in many, many applications, you
00:10:14.420 | will see things that are, for example, more popular
00:10:16.820 | tend to rank more highly.
00:10:18.620 | And there was a talk that mentioned, like, the PageRank algorithm.
00:10:21.500 | PageRank is not about relevance.
00:10:23.000 | It's about prominence.
00:10:24.240 | How many things outside of my document point to me?
00:10:27.980 | That has nothing to do with relevance and everything
00:10:29.800 | to do with the structure of the web corpus.
00:10:31.680 | So that's the shape of the data.
00:10:32.960 | So this is a signal about the shape of the data
00:10:34.960 | and not a signal about, like, the relevance.
00:10:37.880 | And, you know, best way to think about it,
00:10:39.540 | think of it like you have horizontal semantics
00:10:41.140 | and then you have vertical semantics.
00:10:42.700 | And if you're in a vertical domain where the semantics are very verticalized,
00:10:46.020 | right, let's say you're doing a CRM or you're doing emails,
00:10:49.400 | and it's a very complex bar that you're trying to hit
00:10:51.880 | that is way beyond just natural language,
00:10:54.060 | understand that relevance will be a very tiny, tiny part
00:10:56.340 | of the semantic universe.
00:10:57.560 | And the harder you try to go, the more you're going to hit this wall
00:11:00.360 | and the more you--
00:11:01.460 | all right, this breaks again.
00:11:03.260 | Things keep breaking.
00:11:04.160 | I'm sorry.
00:11:05.800 | At sufficient complexity, things will keep breaking.
00:11:08.200 | So now the thing that breaks with even custom semantics
00:11:10.980 | is user preference.
00:11:12.520 | Because even when you get to all this, OK, you're saying,
00:11:14.680 | I'm doing relevance and I'm doing price signals and merchant signals.
00:11:17.380 | I'm doing everything.
00:11:18.220 | Now I know the shopping domain.
00:11:19.800 | Now you don't know the shopping domain.
00:11:21.160 | Because now users are using your product.
00:11:22.540 | They're clicking on stuff you thought they're not going to click on.
00:11:25.400 | And they're not clicking on things
00:11:27.740 | you thought they were going to click on.
00:11:29.680 | And this is where you need to bring in the click signal,
00:11:31.880 | thumbs up, thumbs down signal.
00:11:33.420 | Now, these things get very complex.
00:11:35.700 | So we're not going to talk about how to implement them.
00:11:38.400 | Just because, again, in this case, for example,
00:11:39.860 | you have to be able to build a click-through
00:11:42.480 | prediction signal.
00:11:43.360 | And then you take that signal and then you combine it
00:11:45.060 | with all your other signals.
00:11:46.340 | So now if you look at your ranking function, it's doing, OK,
00:11:48.560 | I want it to be relevant.
00:11:50.000 | I want it to have this semi-structured price signal
00:11:53.380 | and query understanding related to that.
00:11:55.780 | Plus, I want to get the user preference in that.
00:11:57.560 | And then you take all these signals and you add them,
00:11:59.400 | and that becomes your ranking score.
00:12:01.060 | So it becomes a very balanced function.
00:12:03.080 | And this is how you go from, like, oh, it's just relevance,
00:12:05.900 | to, oh, no, it's not just relevance,
00:12:07.560 | to, oh, no, it's not just relevance and my semantics
00:12:10.960 | and my user preferences all rolled up into one.
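
As a sketch, a blended ranking score might look like this; the signals and weights are invented for illustration, and in practice you tune or learn them from your evals and click data:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    title: str
    relevance: float      # semantic similarity from embeddings / re-ranker
    price_fit: float      # 1.0 if within the user's stated budget, else 0.0
    popularity: float     # shape-of-the-data signal, e.g. normalized listens
    predicted_ctr: float  # output of a click-through prediction model

def ranking_score(c: Candidate) -> float:
    # A weighted sum is the simplest combiner; learned rankers are the
    # usual next step once you have enough preference data.
    return (0.5 * c.relevance + 0.2 * c.price_fit
            + 0.1 * c.popularity + 0.2 * c.predicted_ctr)

candidates = [
    Candidate("LEGO set, $45", 0.8, 1.0, 0.7, 0.3),
    Candidate("Sticker pack, $8", 0.9, 0.0, 0.9, 0.1),
]
print(max(candidates, key=ranking_score).title)  # budget-aware pick wins over pure relevance
```
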
00:12:14.380 | I'll mention two more things.
00:12:17.220 | You're calling the wrong queries.
00:12:18.920 | That's happening a lot because this goes into more orchestration.
00:12:21.840 | And you're trying to do complex things,
00:12:23.780 | especially now when you have agents
00:12:25.340 | and you're telling them to use a certain tool.
00:12:27.080 | This is happening quite a bit because there
00:12:29.160 | is an impedance mismatch between what the search engine expects.
00:12:33.060 | Right?
00:12:33.560 | Let's say you tune the search engine and it expects keyword queries
00:12:35.900 | or expects even more complex queries.
00:12:38.820 | But you cannot describe all of that to the LLM.
00:12:40.860 | And the LLM is reasoning about your application
00:12:42.760 | and then making queries by itself.
00:12:45.020 | And this is a big problem.
00:12:46.160 | So one thing that we've seen many companies do,
00:12:48.220 | we've done this also at Google, you actually
00:12:50.500 | take more control of the actual orchestration.
00:12:52.560 | So you take the big query and you make
00:12:54.940 | n smaller queries out of it.
00:12:56.820 | I took the screenshot from AI mode in Google.
00:12:59.260 | And it's very brief.
00:13:01.100 | You have to catch it before the animation goes away,
00:13:03.580 | but you see it's actually making many queries.
00:13:05.620 | It's making 15 queries.
00:13:06.640 | It's making 20 queries.
00:13:08.540 | So what we call fan out.
00:13:10.060 | Take very complex thing, try to figure out
00:13:11.920 | what are all the sub queries in it, and then fan them out.
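
A minimal sketch of fan-out; call_llm and search_backend are hypothetical stubs for your LLM client and your tuned search engine:

```python
import asyncio
import json

async def call_llm(prompt: str) -> str:
    raise NotImplementedError  # your LLM client

async def search_backend(query: str) -> list[dict]:
    raise NotImplementedError  # your tuned search engine

async def fan_out(user_request: str) -> list[dict]:
    # Step 1: have the LLM decompose the broad request into the narrow,
    # keyword-ish queries the backend was actually tuned for.
    prompt = (
        "Break this request into short keyword queries for an enterprise "
        "search engine. Return a JSON list of strings.\n\n"
        f"Request: {user_request}"
    )
    sub_queries = json.loads(await call_llm(prompt))
    # e.g. "what is David working on?" -> ["jira tickets David", "slack threads David"]

    # Step 2: issue all sub-queries in parallel and merge the results.
    results = await asyncio.gather(*(search_backend(q) for q in sub_queries))
    return [hit for hits in results for hit in hits]
```
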
00:13:15.040 | Now you might think, hey, why isn't the LLM doing it?
00:13:17.160 | The LLM is kind of doing it, but the LLM doesn't know
00:13:19.240 | about your tool.
00:13:19.840 | It doesn't know enough about your search engine.
00:13:22.160 | I love MCP, but I'm not a big believer
00:13:23.880 | that you can actually teach the LLM just through prompting
00:13:27.180 | what to expect from the search on the other back end.
00:13:30.520 | This is why people are still like, oh, is it agent autonomous?
00:13:33.220 | Do I need to do workflows?
00:13:34.700 | This is very, very complicated.
00:13:36.280 | And it will take a while for this to be solved.
00:13:38.040 | Because again, it's unclear where the boundary is.
00:13:40.500 | Is it the search engine should be able to handle
00:13:42.780 | more complex things?
00:13:43.660 | And then the LLM will just throw anything its way?
00:13:45.840 | Or is it the other way around?
00:13:46.920 | The LLM has to have more information
00:13:48.380 | about what the search engine can support so it can tailor it.
00:13:51.900 | And right now, you need control because the quality
00:13:54.840 | is still not there.
00:13:56.180 | So this looks like this.
00:13:58.320 | If you have sort of like this assistant input,
00:13:59.900 | and you're turning it into these narrow queries,
00:14:01.820 | like for example, what is David working on?
00:14:03.500 | This has very, very specific semantics.
00:14:05.520 | And it's more like, oh, Jira tickets David.
00:14:07.080 | Slack threads David.
00:14:08.920 | And it's very, very hard to know without knowing enough
00:14:11.360 | about your application that these are the queries that
00:14:13.140 | matter and not on the ones on the left-hand side.
00:14:15.320 | And if you send the thing on the left-hand side
00:14:16.980 | to a search engine, it will absolutely tip over,
00:14:19.400 | unless it understands your domain.
00:14:21.340 | And this is where you need to calibrate the boundary.
00:14:25.380 | So now you're asking all the right queries.
00:14:26.880 | Are you asking them to all the right back ends?
00:14:28.380 | And this is another place where it all fails.
00:14:30.940 | And this is what we call--
00:14:32.200 | one technique is called supplementary retrieval.
00:14:33.880 | This is something you notice like clients do quite a bit,
00:14:36.380 | which is they don't call search enough.
00:14:38.620 | And sometimes people try to over-optimize.
00:14:40.820 | When you're trying to get high recall,
00:14:42.600 | you should always be searching more.
00:14:44.200 | Like I was always like, just search more.
00:14:45.660 | Like this is similar to what we talked about dynamic content,
00:14:47.900 | like the in-memory retrieval.
00:14:50.500 | Just like, just give more things.
00:14:52.580 | So it never fails to give more things.
00:14:54.220 | I know in the description we said, like,
00:14:55.860 | there was this query falafel, which was really hard to do.
00:14:59.300 | And then you think like, oh, we're in Google search.
00:15:01.340 | And it's a very simple Middle Eastern dish.
00:15:03.080 | And it stumped an organization of 6,000 people.
00:15:05.320 | Like, oh my god, what's so hard about this query?
00:15:07.340 | What's so hard about this query is like,
00:15:08.760 | it's an ambiguous intent.
00:15:10.640 | So you need to reach to a lot of back ends
00:15:12.440 | to actually understand enough about it, right?
00:15:14.720 | Because you might be asking about food.
00:15:16.040 | At which point, I want to show you restaurants.
00:15:17.720 | You might be asking this for pictures.
00:15:19.400 | At which point, I want to show you images.
00:15:21.380 | Now what Google ended up doing is that they ask
00:15:23.320 | all the back ends, and then
00:15:24.720 | they put the whole thing in.
00:15:26.200 | And I would recommend this; it's
00:15:28.180 | a great technique to just increase the recall even more.
00:15:31.440 | Just call more things.
00:15:33.460 | And don't try to be skimpy, unless you're
00:15:35.500 | running through some real cost overload.
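
A sketch of supplementary retrieval for an ambiguous query like falafel; the backend registry here is hypothetical:

```python
import asyncio

async def query_backend(backend: str, query: str) -> list[dict]:
    raise NotImplementedError  # web, local/maps, images, recipes, ...

async def supplementary_retrieve(query: str) -> list[dict]:
    backends = ["web", "local", "images", "recipes"]
    # Don't guess the single right backend up front; missing the user's
    # actual intent costs more than the extra calls (until it doesn't).
    per_backend = await asyncio.gather(
        *(query_backend(b, query) for b in backends)
    )
    return [hit for hits in per_backend for hit in hits]
```
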
00:15:38.180 | And that's the last one.
00:15:39.640 | You're running into cost overloads.
00:15:40.860 | GPUs are melting.
00:15:42.040 | I tried to generate an image, but then I realized
00:15:43.760 | there's actually a pretty good image that is real.
00:15:45.500 | Somebody took a server rack and threw it from the roof.
00:15:49.140 | This was-- I didn't need to go to ChatGPT
00:15:51.960 | and generate this image.
00:15:53.520 | Apparently, this was an advertisement.
00:15:55.300 | Pretty expensive one.
00:15:57.660 | All right.
00:15:58.220 | So this happens a lot.
00:15:59.080 | Like when you get to a certain scale,
00:16:00.780 | and you have all these back ends,
00:16:01.980 | and you're making all these queries,
00:16:03.160 | and it's just getting very, very complex.
00:16:05.740 | And this-- I mean, Google's there.
00:16:07.240 | Perplexity's there.
00:16:08.080 | I mean, Sam Altman keeps complaining about GPUs melting.
00:16:11.580 | And this is the part where you need to start doing distillation.
00:16:14.280 | And distillation is a very interesting thing,
00:16:16.680 | because to do that, you have to learn how
00:16:18.340 | to fine-tune models.
00:16:19.240 | And this gets to be a little bit complex.
00:16:21.240 | You sort of have to hold the quality bar constant
00:16:23.380 | while you decrease the size of the model.
00:16:25.800 | The reason you can do that is kind of like in that graph.
00:16:28.960 | Like, hey, hire me.
00:16:29.680 | I know everything.
00:16:30.760 | Actually, I'm firing you.
00:16:32.540 | You're overqualified.
00:16:33.360 | Like an LLM, a very large language model is actually
00:16:36.460 | mostly overqualified for the task you want to do.
00:16:39.660 | Because what you really want to do is just one thing.
00:16:41.600 | Like Perplexity, they're doing question answering.
00:16:44.360 | And they're pretty fast.
00:16:45.360 | I mean, when you use Perplexity in certain contexts,
00:16:47.080 | they're really, really fast,
00:16:47.540 | which is amazing.
00:16:48.640 | Because they trained this one model to do this one very
00:16:50.900 | specific thing, which is just be really, really good
00:16:53.340 | at question answering.
00:16:55.420 | And this is very hard.
00:16:56.880 | So I wouldn't do it unless latency becomes a really
00:17:00.300 | important thing for your users, right?
00:17:01.700 | Like, oh, the thing is taking 10 seconds.
00:17:03.440 | Users churn.
00:17:04.400 | If I can make it in two seconds, users don't churn.
00:17:06.540 | Actually, that's a really good place to be,
00:17:07.980 | because then you can use this technique
00:17:09.200 | and just bring everything down.
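
For reference, the classic distillation loss looks like this in PyTorch: the small student is trained to match the big teacher's softened output distribution, which is how you shrink the model while trying to hold the quality bar constant (a generic sketch, not Perplexity's recipe):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    # Soften both distributions so the teacher's "dark knowledge"
    # (relative probabilities of wrong answers) comes through.
    student = F.log_softmax(student_logits / temperature, dim=-1)
    teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # T^2 keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student, teacher, reduction="batchmean") * temperature ** 2

# Sanity check: a student that already matches the teacher has ~zero loss.
logits = torch.randn(4, 32)
print(distillation_loss(logits, logits))  # tensor close to 0
```
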
00:17:11.860 | All right.
00:17:12.760 | You've done everything.
00:17:13.800 | Things are still failing.
00:17:15.320 | This happens to everybody.
00:17:17.540 | OK, what do you do?
00:17:18.460 | We have a bunch of engineers here.
00:17:19.660 | What do you do when everything fails?
00:17:23.560 | You blame the product manager.
00:17:25.060 | That's the last trick in the book.
00:17:28.260 | When everything fails, make sure it's not your fault.
00:17:31.000 | But I'll say, there's something really important here.
00:17:34.000 | Quality engineering will never--
00:17:35.600 | it'll never be 100%.
00:17:36.640 | Things will always fail.
00:17:37.480 | These are stochastic systems.
00:17:38.740 | So then you have to punt the problem.
00:17:40.160 | You have to punt it upwards.
00:17:41.440 | So it's kind of a joke, but it's not a joke.
00:17:43.780 | The design of the product matters a lot
00:17:45.700 | to how magical it can seem, because if you try
00:17:48.640 | to be more magical than your product surface can absorb,
00:17:52.100 | you will run into a bunch of problems.
00:17:54.880 | I use a very simple example.
00:17:56.760 | Probably a more complex one would be a human in the loop
00:17:59.940 | for customer support, where you're
00:18:01.480 | like, OK, in some cases, the bot can handle things on its own,
00:18:04.000 | but then you'd like to punt to a human.
00:18:06.000 | This is basically UX design, right?
00:18:07.860 | Like, when do you trust the machine
00:18:09.360 | to do what the machine needs to do,
00:18:10.760 | and when does the human need to be in the loop?
00:18:12.580 | This is a much simpler example from Google Shopping.
00:18:15.880 | There's some cases where Google has a lot of great data,
00:18:18.360 | so what we call high understanding.
00:18:19.900 | The fidelity of the understanding is really high.
00:18:22.140 | And then it shows what we call a high-promise UI.
00:18:24.220 | Like, I'll show you things.
00:18:25.120 | You can click on them.
00:18:25.820 | There's reviews.
00:18:26.440 | There's filters, because I just understand this really well.
00:18:29.400 | And there's things Google does not understand at all,
00:18:31.520 | mostly just web documents, a bag of words.
00:18:33.800 | And what's really interesting about the UI
00:18:35.460 | is the UI changes.
00:18:36.580 | If you understand more, you show a more kind of, like,
00:18:39.300 | filterable, high-promise UI.
00:18:41.000 | If you don't understand enough, you actually
00:18:42.600 | degrade your experience.
00:18:43.940 | But you degrade it to something that is still workable.
00:18:46.180 | Like, I'll show you 10 things you choose.
00:18:48.160 | Oh, no, I know exactly what you want.
00:18:49.620 | I'll show you one thing.
00:18:50.960 | And this is really, really important.
00:18:52.300 | It has to be part of every product, and this is sort of like--
00:18:54.780 | always understand, like, there's only so much engineering
00:18:56.700 | you can do until you have to, like, actually change your product
00:18:59.700 | to accommodate this sort of stochastic nature.
00:19:01.820 | So gracefully degrade, gracefully upgrade,
00:19:03.840 | depending on, like, the level of your understanding.
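
A sketch of what graceful degradation can look like in code; the thresholds and fields are invented, and you would calibrate them against your own evals:

```python
def choose_ui(understanding: float, results: list[dict]) -> dict:
    if understanding > 0.9 and results:
        # High-fidelity understanding: commit to one answer, with
        # filters, reviews, and other high-promise affordances.
        return {"mode": "single_answer", "items": results[:1], "filters": True}
    if understanding > 0.5:
        # Partial understanding: hedge with a ranked list of ten.
        return {"mode": "ranked_list", "items": results[:10], "filters": False}
    # Low understanding: degrade to plain links, or punt to a human.
    return {"mode": "plain_links", "items": results, "escalate": True}
```
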
00:19:05.940 | And again, I'll flash these two slides at the end.
00:19:07.740 | Like, always remember what you're doing,
00:19:09.240 | because you can absolutely get into theoretical debates.
00:19:11.580 | Again, context window versus RAG.
00:19:13.700 | This versus that.
00:19:14.580 | Like, is this, you know, agents versus--
00:19:16.400 | I don't know.
00:19:17.220 | Just everything is empirical.
00:19:18.700 | In this domain, when you're doing, like, this sort of thing,
00:19:21.000 | oh, I have my evals.
00:19:22.600 | I'm trying to, like, step by step go up.
00:19:24.360 | I have, like, a toolbox at my disposal.
00:19:26.680 | Everything, everything is empirical.
00:19:28.320 | So again, baseline, analyze your losses,
00:19:32.380 | and then look at your toolbox and see,
00:19:34.800 | are there easy things here I can do?
00:19:36.540 | If not, are there at least medium things I could do?
00:19:38.520 | If not, you know, should I hire more people
00:19:40.480 | and do, like, some really, really hard things?
00:19:42.700 | But always remember, like, the choice is on you,
00:19:44.600 | and you should be principled, because this
00:19:46.160 | can be an absolute waste of time if you're
00:19:48.440 | doing it too far ahead of the curve.
00:19:50.540 | All right, again, the slides are here.
00:19:52.400 | I think-- oh, I achieved it.
00:19:54.600 | 30 seconds left.
00:19:56.680 | And if you want the slides, they're here, again.
00:19:58.820 | And reach out to us.
00:20:00.740 | We're always happy to talk.
00:20:01.920 | I think I was very happy with the exit talk,
00:20:03.320 | because it's always nice to find, like, friends who
00:20:05.100 | are nerds in information retrieval.
00:20:07.720 | We are also such.
00:20:08.700 | So reach out and happy to talk about, you know,
00:20:11.480 | RAG challenges and such and some of the models
00:20:13.380 | we are building.
00:20:14.880 | All right, thank you so much.
00:20:16.660 | We'll see you next time.