Stanford CS25: V3 I Retrieval Augmented Language Models
00:00:00.000 |
- Hey guys, welcome to our last lecture of this quarter. 00:00:15.360 |
He's the CEO of Contextual AI, the enterprise LLM company, 00:00:24.960 |
And previously he was the head of research at Hugging Face. 00:00:39.120 |
and studied philosophy and cognitive AI in undergrad. 00:00:42.560 |
And his work focuses on machine learning as well as NLP, 00:00:51.200 |
and better tools for evaluation and many more. 00:01:00.120 |
So I guess I have to sort of stand here in the corner 00:01:12.280 |
There were a couple of things I could talk about, 00:01:27.720 |
I think this is really one of the coolest topics 00:01:32.960 |
So I'll just give you an overview of what's been happening 00:01:35.960 |
and what I think are the interesting questions 00:01:39.200 |
So first of all, obviously, in case you've missed it, 00:01:54.860 |
If you thought OpenAI, then I'm angry with you, right? 00:02:06.360 |
and you factorize out the token probabilities, right? 00:02:16.640 |
So I'm bringing this up because I was talking to someone 00:02:19.920 |
and they were like, "OpenAI invented language models." 00:02:28.320 |
and this is the oldest one I could find, actually. 00:02:32.680 |
There's a very nice paper from 2003 from Bengio 00:02:45.360 |
And as it turns out, if you make them really big 00:02:48.200 |
and you parameterize them with these massive neural nets, 00:02:55.680 |
And that's why we're all so excited about this stuff. 00:02:58.180 |
So if we think about this from a classic CS perspective, 00:03:10.240 |
and then the task of the model is to predict the next token. 00:03:16.000 |
And so that's why it was so easy to come up with this 00:03:19.880 |
in 1991 already, because the idea is very intuitive. 00:03:23.860 |
But for a long time, what was really broken with this 00:03:28.460 |
And this, I think a lot of people kind of misunderstand 00:03:46.320 |
So we're much better at sort of telling people 00:03:52.800 |
We don't prompt it in a very weird way so that it speaks 00:04:01.240 |
in the style of a pirate or Shakespeare or something, 00:04:10.160 |
actually turns out to be super, super rare in just web data. 00:04:14.160 |
So what you need to do is you need to fix the user interface 00:04:45.040 |
But if you talk to anyone, especially in an enterprise, 00:04:56.400 |
because there are all these familiar problems, 00:04:57.920 |
probably a bunch of you are working on these problems 00:05:12.120 |
why these models are saying what they're saying. 00:05:16.400 |
And so this was a big problem with sort of ChatGPT, 00:05:22.000 |
and they keep updating it every once in a while. 00:05:24.960 |
that's always completely up to date, that never goes stale. 00:05:27.960 |
You want to be able to revise the information in the system. 00:05:36.760 |
which means that you need to be able to remove information 00:05:38.900 |
from the language model or maybe revise facts, 00:05:43.840 |
So again, this is a very interesting area of study 00:05:57.720 |
So different people have different use cases, 00:05:59.760 |
you have different data, if you're a company, 00:06:01.640 |
or if you want to have a language model on your own data, 00:06:19.720 |
but the way to understand what is going on here 00:06:25.760 |
we have the input and the prompt just like before, 00:06:27.680 |
but now instead of just giving those two things, 00:06:37.040 |
And the retriever is very often pretty simple, 00:06:53.960 |
from the perspective of these two separate paradigms. 00:06:57.680 |
So if you've ever taken an exam, I'm sure you have, right? 00:07:09.720 |
where you have all of this information in the book 00:07:14.800 |
So it's a very similar thing with RAG, right? 00:07:18.480 |
where you can give it access to this external information, 00:07:27.440 |
without having to memorize all of it in its parameters. 00:07:30.360 |
So the other, I think, useful distinction here 00:07:33.720 |
is that cramming everything into your parameters, 00:07:40.680 |
is we're adding this non-parametric retrieval component. 00:07:49.240 |
All right, so why does that actually solve these issues? 00:08:06.240 |
And so you can customize your language model system 00:08:16.720 |
and you can revise it if everything goes wrong, 00:08:30.320 |
And actually one really nice way to ground things 00:08:43.640 |
or even multimodal data that it retrieves separately. 00:08:46.600 |
So if you do that, then you get less hallucination, 00:08:48.760 |
because you can always point back to your source, 00:08:52.600 |
And you get attribution because you don't know 00:09:05.440 |
we're gonna talk about this basic architecture. 00:09:09.000 |
And so it kind of looks like a pretty simple thing, right? 00:09:12.800 |
But there are actually lots and lots of questions 00:09:14.800 |
you can ask about what this system should really look like. 00:09:31.280 |
and then there are things like this query encoder, 00:09:43.440 |
Is it like a full document, or is it a paragraph, 00:09:45.560 |
or a chunk, or a sentence, or a couple of words? 00:09:50.720 |
And as you'll see, there are lots of possible answers 00:10:17.960 |
we have this retriever, which one do we update? 00:10:33.000 |
These are the kinds of questions that you have to answer 00:10:37.840 |
And then during test time, you have this entire system, 00:10:46.000 |
So there's also different things you can do there, right? 00:10:49.720 |
So give it different indices during test time 00:10:59.680 |
I think if you ask someone now, like, what is RAG, 00:11:09.680 |
So going back to this question of train time, test time, 00:11:16.840 |
that we don't necessarily have control over, right? 00:11:34.120 |
And then the vector database just does search 00:11:43.720 |
So this only works because of in-context learning. 00:12:06.560 |
this frozen thing itself with just the vector database, 00:12:12.720 |
And the starting point for everything retrieval 00:12:22.160 |
So TF-IDF is basically a sparse retrieval method 00:12:28.840 |
that looks at documents and queries, so D and Q. 00:12:33.280 |
And then there are basically two terms that matter: 00:12:37.280 |
one is the TF, the term frequency, and the other is the IDF, the inverse document frequency. 00:12:42.120 |
is actually a really nice idea from Karen Spärck Jones, 00:12:48.040 |
But the basic idea is that you want to look at the words 00:12:53.560 |
that don't occur in lots of different documents. 00:12:53.560 |
So you want to have sort of the special words. 00:13:06.440 |
It gives you a score for document query overlap. 00:13:13.720 |
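To make that concrete, here is a minimal toy sketch (not from the lecture): it scores a handful of documents against a query by summing tf times idf over the query terms; the tokenization and example documents are made up.

```python
import math
from collections import Counter

def tfidf_scores(query, documents):
    """Toy TF-IDF: score = sum over query terms of tf(term, doc) * idf(term).
    idf down-weights terms that occur in many documents, so 'special' words count more."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    df = Counter()                       # document frequency per term
    for tokens in tokenized:
        df.update(set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term in tf:
                score += tf[term] * math.log(n_docs / df[term])
        scores.append(score)
    return scores

docs = ["the cat sat on the mat",
        "warsaw is the capital of poland",
        "the population of warsaw grew quickly"]
print(tfidf_scores("warsaw population", docs))
```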
So there's all these weird, different parameters, 00:13:22.320 |
So there's a couple of tweaks you can do there. 00:13:35.800 |
sort of the preceding 24 experiments failed, right? 00:13:39.120 |
So it's literally the 25th one that seemed to work, 00:13:53.840 |
It's sparse because most words never occur, right? 00:14:04.840 |
But so that's actually kind of a nice property 00:14:07.080 |
if you want to do fast search on a CPU, right? 00:14:19.360 |
which is really one of the first neural instances 00:14:23.840 |
sort of open book question answering paradigm. 00:14:28.800 |
like how many of Warsaw's inhabitants, blah, blah. 00:14:37.040 |
based on the sparse, so BM25, I think, in this case. 00:14:53.080 |
So this, I think, is really the first instance 00:15:01.280 |
that you use for answering complicated questions 00:15:09.000 |
there was a bunch of work on dense retrieval. 00:15:15.320 |
so this is just like word embeddings, basically vectors, 00:15:20.240 |
so they're much smaller in terms of dimensionality. 00:15:26.760 |
is that it's not really about specific words, right? 00:15:35.080 |
which you couldn't really do with a sparse representation. 00:15:47.440 |
but at the time that people started thinking about this, 00:15:53.000 |
a vector representation for an entire sequence of words. 00:15:56.320 |
So a sentence representation or a passage representation. 00:15:59.560 |
So there are all these cool systems like ORQA 00:16:11.400 |
And the way to get the latent variable to work, 00:16:14.720 |
to be good enough essentially to train the entire system 00:16:18.200 |
is to pre-train the retriever on relevant information. 00:16:21.600 |
So for ORQA, they do something called the inverse cloze task. 00:16:30.280 |
that are sort of relevant to the preceding passage. 00:16:33.520 |
And in DPR, they just train it on a supervised thing. 00:16:40.840 |
you can do better than BM25 if you add lots of documents 00:16:53.960 |
is that you can do them very, very efficiently 00:16:56.440 |
on the GPU as well if you know what you're doing. 00:17:03.080 |
is maximum inner product search, MIPS, right? 00:17:14.040 |
And so there's this really brilliant piece of work 00:17:28.160 |
they're sort of re-implementations of this FAISS idea. 00:17:32.080 |
but it's all basically the same idea, it's just FAISS. 00:17:35.000 |
And so FAISS really powers a lot of this stuff. 00:17:41.640 |
about a vector database, just think about FAISS, 00:17:45.720 |
So obviously, you can go beyond dot product, yes? 00:17:59.200 |
No, so it's just basic off-the-shelf ANN algorithms. 00:18:14.600 |
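As a rough illustration of what FAISS-style maximum inner product search looks like in practice, here is a minimal sketch; it assumes the `faiss` and `numpy` packages are installed and uses random vectors as stand-ins for real passage and query embeddings.

```python
import numpy as np
import faiss  # assumes faiss-cpu (or faiss-gpu) is installed

d = 768                                                  # embedding dimensionality
passages = np.random.rand(10_000, d).astype("float32")   # stand-in passage embeddings
queries = np.random.rand(5, d).astype("float32")         # stand-in query embeddings

index = faiss.IndexFlatIP(d)    # exact maximum inner product search (MIPS)
index.add(passages)             # index all passage vectors

scores, ids = index.search(queries, 4)   # top-4 passages per query
print(ids[0], scores[0])
```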
product quantization is and things like that? 00:18:17.000 |
So there are basically, so you have a bunch of vectors 00:18:20.440 |
and you can just compute the full dot product, 00:18:24.880 |
So what you can do is try to compress subspaces 00:18:28.480 |
of the vector, and then just look at the kind of centroids. 00:18:31.880 |
So you can quantize sub-vectors of the full vector 00:18:36.520 |
and then do much faster search over just the centroids. 00:19:00.680 |
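A hedged sketch of that product-quantization idea using FAISS's IVF-PQ index; the sizes here (8 sub-vectors, 8 bits each, 100 coarse cells) are arbitrary choices for illustration, not recommended settings.

```python
import numpy as np
import faiss

d, nlist, m, nbits = 768, 100, 8, 8        # 8 sub-vectors, 8 bits (256 centroids) each
xb = np.random.rand(50_000, d).astype("float32")

quantizer = faiss.IndexFlatL2(d)                        # coarse quantizer over full vectors
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
index.train(xb)      # learn coarse centroids plus a small codebook per sub-vector
index.add(xb)        # store compressed codes instead of full float vectors
index.nprobe = 10    # number of coarse cells to visit at search time

distances, ids = index.search(xb[:3], 5)
print(ids)
```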
And then at the end, you get these two vectors 00:19:05.760 |
But you can do all kinds of much fancier things 00:19:12.440 |
So a really nice example from one of your colleagues 00:19:22.920 |
So instead of just having this dot product here, 00:19:40.560 |
So it's sort of Omar's joke, actually, this name, 00:19:55.320 |
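The late-interaction scoring behind ColBERT can be sketched in a few lines; this toy version (random embeddings, no normalization or training) just shows the MaxSim idea of scoring a document by each query token's best-matching document token.

```python
import numpy as np

def late_interaction_score(query_tok_embs, doc_tok_embs):
    """ColBERT-style MaxSim: for every query token, take the similarity of its
    best-matching document token, then sum those maxima into one relevance score."""
    sims = query_tok_embs @ doc_tok_embs.T      # (n_query_tokens, n_doc_tokens)
    return float(sims.max(axis=1).sum())

rng = np.random.default_rng(0)
q = rng.standard_normal((5, 128))       # stand-in per-token query embeddings
doc_a = rng.standard_normal((80, 128))  # stand-in per-token document embeddings
doc_b = rng.standard_normal((60, 128))
print(late_interaction_score(q, doc_a), late_interaction_score(q, doc_b))
```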
one of the nice things about these vector databases 00:20:09.440 |
they basically have sparse meet dense in a way. 00:20:16.080 |
with sparse is that you can't really handle synonyms 00:20:22.120 |
like a BERT model, look at kind of this one word 00:20:25.520 |
in your sequence, try to see which other words 00:20:31.720 |
So now you can give all these synonyms to a sparse vector, 00:20:38.120 |
And so I have a much more efficient way to do search 00:20:41.000 |
without sort of giving up on all the cool stuff 00:20:49.280 |
And this other idea I really like is called DRAGON. 00:20:57.840 |
So if you want to take something off the shelf right now 00:21:01.760 |
then this DRAGON or DRAGON+ is probably the thing 00:21:11.480 |
to make the model better and better over time 00:21:16.040 |
And that gives you very good representations. 00:21:29.480 |
if you look at sort of the developer community around DRAGON 00:21:32.120 |
is that they're all doing hybrid search right now. 00:21:34.840 |
So you can actually just combine the search results 00:21:37.200 |
from your sparse BM25 or whatever thing, or SPLADE, 00:21:44.040 |
and then you'll get this ranking that works even better. 00:22:01.000 |
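One simple way to do the hybrid combination described here is reciprocal rank fusion over the two ranked lists; this is an illustrative sketch with made-up document IDs, not any particular library's implementation.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids: each doc scores sum of 1 / (k + rank)
    over the lists it appears in, so agreement between sparse and dense helps."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse_ranking = ["doc3", "doc1", "doc7", "doc2"]   # e.g. from BM25 or SPLADE
dense_ranking = ["doc1", "doc9", "doc3", "doc4"]    # e.g. from DRAGON+
print(reciprocal_rank_fusion([sparse_ranking, dense_ranking]))
```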
On the earlier slide, has there been any work on benchmark 00:22:14.440 |
has there been any benchmarking studies in this? 00:22:28.680 |
if you literally look for retrieval augmentation 00:22:31.080 |
reduces hallucination, then you'll find the paper. 00:22:55.320 |
So if there's like a brand name or something like that, 00:23:00.120 |
then like, let's say the brand is Apple, right? 00:23:02.560 |
You don't want to find stuff about the pears, right? 00:23:05.160 |
So that's what you would do with a dense retriever. 00:23:08.480 |
So it really kind of depends on what you want to use it for. 00:23:24.440 |
it would realize Apple the company is different. 00:23:28.120 |
- No, so if they were actually contextualized, then yes, 00:23:31.520 |
but very often it's a frozen retrieval system, right? 00:24:04.920 |
the components of the vector are literally the other words. 00:24:22.040 |
So basically it's a one big matrix of documents as rows 00:24:26.720 |
and the columns are the words in the documents. 00:24:29.320 |
And then you just count how often a word occurs 00:24:42.520 |
we call them sparse embeddings or sparse retrieval 00:24:49.640 |
Because most words don't occur in that document. 00:25:08.920 |
but how do we actually make this retriever good 00:25:11.320 |
for the context that it is going to be used in, right? 00:25:14.520 |
So can we contextualize the retriever for the generator, 00:25:20.040 |
where we might not have access to the weights? 00:25:24.160 |
we just send it to some API, we get some stuff back. 00:25:28.200 |
And so one paper I really like is called Replug. 00:25:31.560 |
So just to kind of explain what this looks like, 00:25:42.040 |
And now you compute the likelihood. 00:25:53.040 |
And then you'll give each one of the retrieved documents 00:25:57.000 |
separately to this generator, to your language model. 00:26:00.680 |
So you can look at the perplexity of the correct answer 00:26:06.200 |
So now we have these two probability distributions 00:26:12.640 |
to make sure that we can actually retrieve the documents 00:26:20.440 |
So super simple idea, works really, really well. 00:26:26.520 |
And the nice thing about this is it's completely agnostic 00:26:26.520 |
So this will work for any sort of encoder, decoder, 00:26:39.000 |
but for most language models, you can get that, 00:26:44.040 |
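A hedged sketch of that REPLUG-style training signal as described: the retriever's softmax over the retrieved documents is pulled, with a KL term, toward a distribution given by how likely the frozen language model finds the gold answer when conditioned on each document. The function and tensor names here are assumptions; in practice the LM scores would come from an API's log-probabilities.

```python
import torch
import torch.nn.functional as F

def replug_retriever_loss(retriever_scores, lm_answer_logprobs, temperature=1.0):
    """Sketch of a REPLUG-style objective.

    retriever_scores:   (num_docs,) query-document similarities from the retriever
    lm_answer_logprobs: (num_docs,) log-likelihood of the gold answer when the frozen
                        LM is conditioned on each retrieved document separately
    The retriever distribution is pushed, via KL divergence, toward the distribution
    implied by how much each document actually helps the language model.
    """
    log_p_retriever = F.log_softmax(retriever_scores / temperature, dim=-1)
    with torch.no_grad():                                  # the LM side stays frozen
        p_lm = F.softmax(lm_answer_logprobs / temperature, dim=-1)
    return F.kl_div(log_p_retriever, p_lm, reduction="sum")

scores = torch.randn(5, requires_grad=True)   # would come from the query/doc encoders
lm_lls = torch.randn(5)                       # would come from scoring the gold answer
loss = replug_retriever_loss(scores, lm_lls)
loss.backward()                               # gradients flow only into the retriever
print(loss.item())
```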
And then there's this other really nice approach. 00:26:53.440 |
you're literally updating the dense representations, right? 00:26:58.360 |
So your encoder, basically, for your dense representation. 00:26:58.360 |
on in-context retrieval augmented language models, 00:27:09.840 |
where the whole paper is basically about just doing BM25 00:27:14.080 |
and just giving stuff directly to the context 00:27:16.160 |
of the language model and things kind of work. 00:27:22.640 |
where the retriever is this very old sparse algorithm, 00:27:29.040 |
But then they have this really awesome section 00:27:31.320 |
where they show that you can just have this re-ranker 00:27:40.240 |
So now you still keep the language model completely fixed. 00:27:43.280 |
So that's sort of this part of the loss here. 00:27:46.800 |
So you have kind of a stop gradient on the parameters theta. 00:27:46.800 |
But now you have this kind of rank function here 00:27:59.880 |
or anything like that that works on top of the things 00:28:11.400 |
So we're slowly progressing towards having a system 00:28:14.520 |
that is much more optimized for being properly 00:28:18.240 |
retrieval augmented in a way where it's useful 00:28:20.440 |
and contextualized for what you want to use it for. 00:28:23.240 |
So yeah, just to point out kind of what that looks like 00:28:28.400 |
So you just have this extra step essentially, right? 00:28:30.960 |
So we have our retriever, then we have a re-ranker, 00:28:57.240 |
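Putting that frozen pipeline together, a minimal sketch might look like the following; `retrieve`, `rerank`, and `generate` are placeholder callables standing in for, say, a BM25 index, a trainable re-ranker, and a black-box LM API.

```python
def answer_with_rag(question, retrieve, rerank, generate, k=20, top_n=3):
    """Frozen RAG pipeline: retrieve -> re-rank -> prepend context -> generate."""
    candidates = retrieve(question, k)              # cheap first-stage retrieval
    ranked = rerank(question, candidates)           # contextualize for the generator
    context = "\n\n".join(ranked[:top_n])           # keep only the best few passages
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)

# toy stand-ins so the sketch runs end to end
docs = ["Warsaw is the capital of Poland.", "Paris is the capital of France."]
retrieve = lambda q, k: docs[:k]
rerank = lambda q, cands: sorted(
    cands, key=lambda d: -sum(w in d.lower() for w in q.lower().split()))
generate = lambda prompt: f"(frozen LM would answer here, given: {prompt[:40]}...)"
print(answer_with_rag("What is the capital of Poland?", retrieve, rerank, generate))
```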
Some of them do, but yeah, there are all kinds of tricks 00:29:13.120 |
so that you can backprop all the way through it, 00:29:13.120 |
then you can do a reinforce style loss on the retrieval. 00:29:19.960 |
And then you just pass the kind of log-likelihood 00:29:19.960 |
is to optimize both the retriever and the generator. 00:29:46.680 |
where you want everything to work together, right? 00:29:56.880 |
One is your retriever, the other is your language model. 00:30:01.440 |
is thrown over the fence and then you hope for the best. 00:30:04.080 |
So instead of that, we have everything much closer 00:30:13.560 |
with a generator was RAG, retrieval augmented generation 00:30:19.440 |
And it's very similar to what we've already seen. 00:30:29.360 |
that gets given to this generator that generates the answer. 00:30:48.600 |
but we'll also update the part of the retriever 00:30:56.000 |
we actually have two different ways of doing this. 00:30:59.160 |
And this is probably something that when we talk about this 00:31:02.640 |
if you think about this long enough, then you'll think like 00:31:05.800 |
okay, but when actually do I need to retrieve? 00:31:08.720 |
Like do I retrieve every time I generate a new token 00:31:16.800 |
Or maybe I want to retrieve every N tokens, right? 00:31:23.600 |
As we'll see that's also something people have done. 00:31:33.840 |
is that this frozen thing doesn't really work all that well. 00:31:48.240 |
The whole point of the paper is that you want to optimize it. 00:31:54.640 |
we call this frozen thing Frankenstein's monster 00:32:00.440 |
You sort of, yeah, it's really like Frankenstein 00:32:02.720 |
and just put it together and then it sort of walks, you know 00:32:13.760 |
because there are so many opportunities to do better 00:32:18.240 |
So one of the limitations of the original RAG architecture 00:32:24.640 |
is that it only supports a very small k, right? 00:32:24.640 |
but how do you really get that to fit, right? 00:32:38.120 |
So one thing you can do is you first encode things 00:32:45.680 |
or only the few sort of top level representations 00:33:10.600 |
towards more decoder only architectures, right? 00:33:23.440 |
And so another like pure decoder language model 00:33:33.400 |
which I think is very elegant in its simplicity. 00:33:36.680 |
So it's basically you just have a normal language model 00:33:39.880 |
but you interpolate the normal language model weights 00:33:46.960 |
So basically you have some sort of prompts, right? 00:33:49.320 |
So like Obama's birthplace is, you go to your big corpus 00:33:54.840 |
You look at the words that come next to the similar things. 00:34:07.280 |
between your retrieved kind of non-parametric memory scores 00:34:13.440 |
So this is very late fusion in a sense, right? 00:34:20.320 |
the pure language model probabilities or likelihoods. 00:34:23.000 |
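The kNN-LM late fusion is literally a convex combination of two next-token distributions; here is a toy sketch with made-up numbers for the "Obama's birthplace is ___" example.

```python
import numpy as np

def knn_lm_interpolate(p_lm, p_knn, lam=0.25):
    """kNN-LM late fusion: p(w) = lam * p_knn(w) + (1 - lam) * p_lm(w),
    mixing the parametric LM with a distribution built from retrieved neighbors."""
    return lam * p_knn + (1.0 - lam) * p_lm

vocab = ["hawaii", "kenya", "chicago"]
p_lm = np.array([0.4, 0.1, 0.5])     # toy LM distribution for "Obama's birthplace is ___"
p_knn = np.array([0.9, 0.05, 0.05])  # toy distribution from nearest-neighbor contexts
print(dict(zip(vocab, knn_lm_interpolate(p_lm, p_knn).round(3))))
```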
So this works really well and it scales especially well 00:34:30.680 |
So if you have trillions and trillions of tokens in there 00:34:37.440 |
because you can really rely on this big source corpus 00:34:48.000 |
where they showed that you can have a 25 times smaller 00:34:51.840 |
retrieval augmented language model trained from scratch. 00:34:57.680 |
that outperforms this 25 times bigger language model 00:35:09.560 |
because you can rely on this external memory. 00:35:17.520 |
So there was a lot of excitement about "Retro" 00:35:19.720 |
when it was announced, but it's a "Deep Mind" paper. 00:35:24.640 |
nothing really to validate that this actually works. 00:35:27.680 |
And so very recently there has been a bit of work 00:35:34.080 |
where they have this hybrid between the "Retro" architecture 00:35:40.840 |
sort of they put the top one or the top K results 00:35:44.040 |
in the context of the language model after all. 00:35:46.480 |
So it's sort of a crossover between "Rag" and "Retro" 00:35:50.000 |
and they showed some really nice results here 00:35:52.000 |
but I think it's sort of pointing to this big flaw 00:36:15.600 |
and that's why we need to do this in context "Rag" 00:36:18.560 |
on top of "Retro" to actually get it to work. 00:36:38.200 |
Yeah, so there are even like distributed FAISS packages 00:36:46.320 |
So in terms of compute it's actually not that hard anymore 00:37:03.520 |
then it actually gives you a gain over the pure GPT model. 00:37:06.600 |
So it starts from a GPT and then they kind of retrofit 00:37:11.080 |
So in short, I think there's still a lot of work 00:37:18.760 |
And "Retro" kind of showed that it might be possible 00:37:24.400 |
And this is really one of the interesting open questions. 00:37:43.520 |
So let's go all the way with the contextualization now. 00:37:52.560 |
what we actually did is we only updated the query encoder. 00:37:56.880 |
So updating the document encoder is very expensive. 00:38:01.800 |
So one of the first papers actually kind of the OG 00:38:04.640 |
of the non-frozen dense retrieval augmented methods 00:38:16.520 |
that did this properly where they updated it all the way 00:38:21.640 |
So can someone explain to me why it's expensive 00:38:27.120 |
So let's say we have a trillion tokens in our corpus. 00:38:41.080 |
Now we back propagate the gradient through the retriever. 00:38:52.000 |
We need to re-encode the entire internet, right? 00:38:59.720 |
Which, and so if this is like trillions of tokens 00:39:38.680 |
Then they stop, they re-encode the entire internet 00:39:45.560 |
They have this very fancy sort of sharding mechanisms 00:39:48.600 |
where they take down certain parts of their entire index 00:39:59.720 |
have been thinking about, not exactly the LoRA idea 00:40:02.000 |
but similar versions of that are around like, 00:40:07.720 |
so that you don't have to do this asynchronously? 00:40:10.960 |
So one of the downsides of this REALM architecture 00:40:23.040 |
It's not really gen AI in the modern paradigm. 00:40:26.280 |
But if you wanna read like one paper on this topic 00:40:31.800 |
The other one that is really, really good to read 00:40:43.680 |
with a bunch of folks, the folks who did like RAG 00:40:54.560 |
of everything that's happening in this architecture. 00:41:06.560 |
They haven't really been compared in a head to head setting. 00:41:14.120 |
So that's really too complicated to go into detail here 00:41:20.440 |
So one is this loss we've basically seen before, right? 00:41:24.800 |
So we've seen this, I think with the in-context RAG one, 00:41:28.000 |
right, so we have a stop gradient on the language model 00:41:32.680 |
The other one is what we've seen with Replug. 00:41:35.440 |
So this is basically exactly the Replug loss, right? 00:41:37.680 |
So we have the KL divergence of the documents 00:41:53.360 |
how does that affect my perplexity of the language model? 00:41:57.520 |
And so this one I think is actually quite elegant 00:42:02.280 |
because that really gets to like how valuable 00:42:09.080 |
So they compare all of these different versions 00:42:13.520 |
and what you can see is that the kind of Replug style loss 00:42:20.080 |
they perform a lot better than all of these others. 00:42:22.320 |
So this fixed retriever or no joint pre-training, 00:42:29.960 |
And as you can see, you can do really a lot better 00:42:41.360 |
how do you actually like train that entire system? 00:42:44.200 |
Like what data or what tasks do you train this on? 00:42:46.800 |
So they also experiment with a bunch of different versions. 00:42:50.680 |
So one is doing prefix LM, if you're familiar with that. 00:42:59.840 |
and then they predict the next chunk from that chunk. 00:43:11.120 |
Then they just do T5 style sort of denoising. 00:43:17.800 |
And then they have this title for section generation piece. 00:43:40.440 |
that they look into going back to what we talked about, 00:43:49.200 |
or do we maybe have to do some sort of re-ranking 00:43:54.680 |
And quite surprisingly, I think they find that 00:44:01.400 |
is actually already basically good enough in many cases. 00:44:05.120 |
So that's nice because it's much more efficient 00:44:08.160 |
if you don't have to update your documents all the time. 00:44:11.680 |
I think the real question here though is like, 00:44:14.560 |
how good is your document representation to begin with? 00:44:17.400 |
So you need to have a very, very high quality 00:44:21.800 |
If you don't have that, then this will not work. 00:44:24.840 |
then you get a very nice kind of query side fine-tuning thing. 00:44:28.280 |
So the Atlas paper is about trying to do few-shot 00:44:39.280 |
So it's how many examples are given in the context. 00:44:50.080 |
if you compare like the closed-book equivalent model 00:44:58.120 |
That's really the only takeaway of this entire section. 00:45:02.680 |
But I think that that's really saying something 00:45:08.520 |
in terms of what we should be thinking about. 00:45:31.040 |
So in Atlas, Atlas basically tries everything. 00:45:40.080 |
but I swap in like a sort of Common Crawl index. 00:45:47.360 |
the main finding is just the more, the better. 00:45:50.840 |
So it's really just like the bigger your index, 00:45:53.160 |
the more likely you are to find the exact right thing 00:46:19.040 |
So it introduces a lot of these new architectural changes 00:46:27.040 |
and the grouped-query attention for faster inference. 00:46:32.640 |
on designing a generator specifically for RAG, 00:46:36.840 |
leveraging, for example, where Mistral 7B currently is. 00:46:40.880 |
Because for example, like the sliding window attention, 00:46:43.600 |
I could see how that could be adapted to the RAG case. 00:46:49.880 |
what makes Mistral special is a bit different from mine. 00:46:52.840 |
So I don't think that the sliding attention window thing 00:47:10.040 |
I guess you're asking sort of about the architecture 00:47:16.400 |
So I think that's basically what Retro tried to do. 00:47:20.760 |
So Retro actually, some of the people on the Retro paper 00:47:27.600 |
So they have this chunk cross-attention idea here. 00:47:34.080 |
but the way it does attention over the things you retrieve 00:47:46.840 |
but using this slightly different chunk cross-attention. 00:47:51.960 |
So I think the sliding window attention point 00:47:54.760 |
I was trying to get at was that it uses a fixed window 00:47:59.040 |
so that whenever you're doing the query key computation 00:48:15.480 |
if you use a fixed window when you're doing attention, 00:48:19.520 |
it is possible that you actually are leaving, 00:48:23.200 |
you're only looking at a fixed span of information. 00:48:29.760 |
so that you could make it better for the rag case 00:48:39.800 |
So for me, what Mistral is doing with the sliding window, 00:48:47.520 |
So we had all these lightweight convolutional nets 00:48:52.440 |
and you would do convolutions over it and then pool, 00:48:55.160 |
and then you would still get the information out. 00:49:08.600 |
So I think that definitely is an interesting direction 00:49:13.160 |
- Yeah, so I think it's like not too crazy to say, 00:49:22.560 |
so that they could be better adapted to the rag case? 00:49:30.280 |
I think one question is just how do you do the attention 00:49:47.800 |
and when you talk about putting the retrieval in the context, 00:49:53.120 |
are you saying that you only do it at the beginning 00:49:59.160 |
so this is, it's not exactly every layer sort of, 00:50:10.600 |
So it's not every layer that you do the retrieval, right? 00:50:22.880 |
so you generate and then you can retrieve again. 00:50:32.840 |
you retrieve once at the beginning and then you give it. 00:50:45.160 |
so here you don't actually give it as context at all, 00:50:51.920 |
So here you let the decoder kind of attend over it. 00:51:01.280 |
- So I don't think cross-attention really works, yeah. 00:51:16.600 |
in which updating the retriever is not so necessary 00:51:31.040 |
or any way to update those document or, yeah. 00:51:35.320 |
- Yeah, so you do want to update the retriever, right? 00:51:49.720 |
Natural Questions, Wizard of Wikipedia, and FEVER. 00:51:52.280 |
So they're really very kind of knowledge-intensive tasks. 00:51:56.920 |
So in that case, if you already have a very good system 00:52:00.080 |
like DPR that is specifically pre-trained for those tasks, 00:52:03.920 |
then you only need to update the query encoder. 00:52:06.840 |
So I would expect that if you move beyond this 00:52:09.400 |
to kind of general language modeling things like Retro, 00:52:13.080 |
then you probably do want to update the document encoder 00:52:33.800 |
as long as we have a good (indistinct) knowledge 00:52:50.880 |
then yeah, you don't get really good performance. 00:52:54.040 |
So that's sort of like your closed book performance, right? 00:53:04.840 |
As you can see, there are pretty big gaps there. 00:53:22.720 |
Like, so what about like more hierarchical retrieval? 00:53:29.800 |
but there's some kind of like groups of chunks 00:53:34.680 |
- There's been some interesting work on doing that 00:53:41.480 |
So first you want to find the relevant document. 00:53:56.200 |
and then sort of expand the context around it 00:54:11.800 |
can you compare RAG versus like long context efforts? 00:54:24.560 |
- Yeah, so everybody understands this question, right? 00:54:33.240 |
so that basically you can like take Harry Potter 00:54:38.760 |
like what is the name of like Harry Potter's owl 00:54:42.440 |
And then it can just attend over the entire thing. 00:54:48.120 |
to answer that one question is super inefficient, right? 00:54:51.840 |
So most of Harry Potter has nothing to do with the owl. 00:55:05.000 |
is a much more efficient way to solve this problem. 00:55:30.360 |
then you're going to move towards a RAG style architecture. 00:55:40.600 |
So let's talk about some other interesting questions. 00:55:59.600 |
I can probably do that on my own with the language model 00:56:09.800 |
So what we ideally want to be able to do is to say, 00:56:15.880 |
and I'm going to learn when I want to kind of expend 00:56:25.400 |
this is called FLARE, for active retrieval augmented generation, 00:56:28.480 |
where they basically have the language model decide 00:56:39.720 |
that you can see in the field around kind of agents, right? 00:56:42.560 |
So we can talk a little bit more about that too. 00:56:47.760 |
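A very rough sketch of that "decide when to retrieve" loop, in the spirit of FLARE rather than a faithful reproduction of it: generate a piece of the answer, and only if the model's own token confidence drops below a threshold, retrieve and regenerate that piece. The callables here are placeholders for a real LM and a real search backend.

```python
def active_generate(question, generate_with_conf, retrieve, max_steps=8, threshold=0.6):
    """Only retrieve when the model seems unsure. `generate_with_conf` returns
    (next_sentence, min_token_prob) and `retrieve` returns a list of passages;
    both are placeholder callables."""
    answer, context = "", ""
    for _ in range(max_steps):
        sentence, confidence = generate_with_conf(question, context, answer)
        if not sentence:                                   # model says it is done
            break
        if confidence < threshold:                         # unsure: look it up, retry
            context = "\n".join(retrieve(question + " " + sentence))
            sentence, _ = generate_with_conf(question, context, answer)
        answer += sentence
    return answer
```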
that I think we've also kind of covered already here 00:56:54.240 |
we can do re-rankers, we can do query-side only. 00:56:59.360 |
which is quite close, I think, to the idea you proposed, 00:57:02.920 |
where you first use BM25 to create a batch, basically, 00:57:11.960 |
And now you have this kind of in-batch update. 00:57:18.840 |
that is just in your batch using this other model. 00:57:22.000 |
And now you can update this model on the fly. 00:57:25.720 |
about doing the full kind of document-side update. 00:57:33.920 |
you can basically solve any problem just by looking it up. 00:57:37.600 |
So rather than cramming it into your parameters, 00:57:49.520 |
I think that's going to happen in the next year or two, 00:57:55.120 |
there's a bunch of lawsuits against OpenAI and other places 00:57:58.240 |
around where does the data exactly come from. 00:58:05.000 |
is to have a RAG system that you train on data 00:58:12.080 |
but now during test time, you can give it a data store 00:58:14.920 |
that has maybe slightly riskier information in it. 00:58:18.560 |
So this massive index of all the stuff on the internet, 00:58:21.600 |
including some things that are maybe higher risk, 00:58:29.760 |
your retrieval augmented language model, I should say, 00:58:33.440 |
because it was trained on data that is public domain. 00:58:42.640 |
to a lot of the kind of compliance and legal risk 00:58:48.480 |
- There's a great paper also from one of your colleagues 00:58:57.360 |
I think this is also kind of a fascinating phenomenon. 00:59:01.400 |
but language models are very similar to humans 00:59:09.320 |
So if you give them a bunch of things that you've retrieved, 00:59:12.400 |
what they will look at are the first things you list 00:59:18.440 |
So if it actually respected the rank function, 00:59:21.400 |
then this curve would go down all the way, right? 00:59:25.520 |
So I think that's a very interesting observation, 00:59:30.040 |
which kind of shows how brittle these systems can be, right? 00:59:37.640 |
where like the order of the retrieved context 00:59:40.240 |
matters a lot in whether you get the right answer or not. 01:00:11.040 |
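One simple, commonly used mitigation (an illustrative sketch, not something from that paper) is to reorder the retrieved passages so the highest-ranked ones sit at the edges of the prompt, where the model actually looks.

```python
def order_for_long_context(ranked_passages):
    """Place the best-ranked passages at the start and end of the prompt,
    pushing the weakest ones toward the middle, where they get the least attention."""
    front, back = [], []
    for i, passage in enumerate(ranked_passages):
        (front if i % 2 == 0 else back).append(passage)
    return front + back[::-1]

print(order_for_long_context(["p1", "p2", "p3", "p4", "p5"]))
# -> ['p1', 'p3', 'p5', 'p4', 'p2']: best at the edges, weakest in the middle
```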
and then the question was, how do you do that? 01:00:12.960 |
So the way that you do that is using REINFORCE. 01:00:18.720 |
So some of the older papers were playing with this, 01:00:23.760 |
so I think the replug solution is sort of more elegant 01:00:34.080 |
And if you just do REINFORCE, it's very high variance. 01:00:55.480 |
and again, we're sort of like thinking more and more 01:01:01.960 |
to the FLARE results from earlier with active retrieval 01:01:04.400 |
that doesn't necessarily have to be some index 01:01:06.760 |
that you own, it can be just some web search, right? 01:01:11.840 |
you don't really have access to the web search necessarily. 01:01:18.640 |
But I just wanted to kind of put this in your mind, 01:01:21.200 |
like this is another thing you can do, right? 01:01:23.960 |
And if we take this really to the general form, 01:01:26.360 |
then you can think of language models as just tool users. 01:01:31.080 |
So rather than just retrieval augmenting language models, 01:01:48.000 |
is how do you actually get the system to learn stuff, right? 01:01:51.680 |
So we're gonna need RL if we want this system 01:01:53.960 |
to really learn how to take these actions properly. 01:01:56.920 |
And so, yeah, this has been taken to the extreme 01:02:09.640 |
and then you basically do some natural language inference 01:02:22.440 |
of open questions that people have looked at, 01:02:31.280 |
we established at the beginning of the lecture 01:02:33.080 |
that this is pretty important for getting things to work, 01:02:58.160 |
so that that also follows the instructions properly, 01:03:16.880 |
So frameworks like LlamaIndex and LangChain, 01:03:19.560 |
and there's all these open source vector databases 01:03:23.200 |
and they're all sort of about making RAG really easy, 01:03:48.080 |
before we give the final thing to the language model. 01:03:51.160 |
There's zero-shot re-ranking with a large language model. 01:04:19.360 |
and then it gives the right answer based on that. 01:04:40.760 |
there are still lots of very interesting open questions. 01:04:59.600 |
that you don't necessarily have to pre-train. 01:05:01.440 |
So maybe there's something wrong with how we do that. 01:05:07.240 |
So I think there's a really interesting question here 01:05:16.040 |
so basically decouple all the memorization to this index. 01:05:19.680 |
So I have a language model that doesn't know anything. 01:05:25.720 |
because that always comes from this retriever. 01:05:29.480 |
then you get very interesting scaling trade-offs, right? 01:05:33.880 |
and do your retrieval to do a lot of the heavy lifting 01:05:38.720 |
which is nice because that's a cached computation, right? 01:05:47.200 |
than kind of self-attention in the language model. 01:05:57.720 |
but I'm not sure how long we're gonna keep vector databases 01:06:00.640 |
because I think re-rankers probably work just as well, 01:06:06.360 |
and BM25 is much more efficient than a vector database. 01:06:09.680 |
So I don't really see why we need dedicated vector databases. 01:06:19.040 |
of maybe Silicon Valley investment strategies 01:06:27.200 |
are basically becoming database companies now. 01:06:34.800 |
there are a lot of pretty good sparse databases 01:06:38.040 |
out there already like Postgres and things like that. 01:06:40.560 |
And they're also all adding vectors to their databases. 01:06:44.040 |
So I think that's all gonna kind of coalesce into databases. 01:06:48.880 |
So I think there are some interesting things to look at 01:07:04.720 |
And then I think there's this massive open question 01:07:10.000 |
So right now we just look at downstream performance, 01:07:15.360 |
but if you mess up the retrieval, it's very hard to measure. 01:07:19.000 |
But how to measure whether your retrieval is right 01:07:24.560 |
where they try to take like the harmonic mean 01:07:32.280 |
because we don't really have very good data sets 01:07:35.600 |
So I think that's a very cool problem to work on as well. 01:07:43.120 |
am always very excited about is multimodality. 01:07:46.040 |
And so why would we stop with RAG systems with just text? 01:07:58.360 |
where we have a language model enhanced to see 01:08:01.920 |
where you can just give a kind of a computer vision pipeline 01:08:20.600 |
because there's no open source version of that. 01:08:23.960 |
So we've done some early work on this in 2021 01:08:30.160 |
and there's some more recent work out of FAIR 01:08:37.720 |
multimodality with GPT-4V and things like that 01:08:41.480 |
So everything is kind of going in that direction. 01:09:00.680 |
So it's really about systems over models, right? 01:09:05.000 |
and your retriever and they're kind of separate. 01:09:06.640 |
It's about thinking from a systems perspective 01:09:14.720 |
that in deep learning things have always progressed 01:09:21.520 |
Like back in the day in computer vision or NLP, 01:09:26.640 |
And all of that just doesn't exist anymore now 01:09:32.000 |
And so that's what's going to happen here too. 01:09:36.520 |
like there's this chunker thing in your documents, right? 01:09:46.640 |
And so, yeah, I think like trading off costs and quality 01:09:52.000 |
that's really like where this stuff is going to come in. 01:09:54.240 |
So language models right now, they're amazing, 01:10:03.280 |
So what you want to do is make it much more efficient 01:10:18.480 |
So if you're interested in this, I'm at Stanford. 01:10:20.720 |
So I can work with you on research projects on these topics, 01:10:30.320 |
- Well, sorry, I had a question from earlier. 01:10:39.120 |
I think really super helpful earlier about Mistral 7B. 01:10:51.600 |
several different layers of convolutional layers. 01:10:53.680 |
And the top convolutional layers are able to see 01:11:02.400 |
you're able to tune the filter sizes and the strides. 01:11:07.000 |
So you're able to see a different receptive field. 01:11:10.120 |
And I was wondering if you could see that same innovation 01:11:15.360 |
because you have different transformer layers 01:11:20.560 |
And if you can tune, I guess, the transformer architecture, 01:11:27.720 |
perhaps we can do some optimization in the transformer realm 01:11:30.800 |
that we have already done in convolution layers. 01:11:48.320 |
And the transformer is slightly more optimized 01:11:52.200 |
but the convolutional model was actually slightly better 01:12:01.680 |
- It's probably the advantage of the re-ranker 01:12:01.680 |
over BM25, but does that give up a lot of the advantages 01:12:06.760 |
of this massive search, or is the trade-off worth it? 01:12:10.640 |
and then just narrow it down with dense search. 01:12:25.200 |
So you often see that kind of as a two-stage process 01:12:31.760 |
and then you use the dense one to filter it down. 01:12:35.400 |
- Yeah, everyone's trying to maybe adapt their 01:12:38.960 |
large-scale model to almost domain-specific areas. 01:12:43.360 |
Like I think there are mainly two ways to approach it. 01:12:52.800 |
And another way is just, the main topic of this lecture 01:13:03.000 |
of the low-cost advantage of the retrieval-augmented way? 01:13:12.160 |
with those tuning methods, fine-tuning type learning? 01:13:15.640 |
- Yeah, so I think actually what's gonna happen 01:13:19.000 |
is that all of this will come together, right? 01:13:31.960 |
So why would you just take the retrieval-augmented system 01:13:35.440 |
if you can also fine-tune it on the thing you care about? 01:13:54.720 |
You said it's gonna become a database kind of thing, 01:14:03.080 |
And because you've got so much of the learning part, 01:14:13.560 |
So do you have any idea whether it's just a database problem? 01:14:13.560 |
that recently, their stock has done really well, 01:14:27.560 |
they have some dedicated retrieval hardware coming out. 01:14:59.200 |
- Yes, I think so, if you take it to the extreme. 01:15:04.200 |
is that if you contextualize an existing language model 01:15:12.480 |
So if you do replug on GPT-4, GPT-4 might still hallucinate. 01:15:17.480 |
So it could basically just ignore all the stuff 01:15:20.000 |
you retrieved and just do whatever it wants anyway. 01:15:39.560 |
So it's really all grounded in whatever is in your index. 01:15:51.200 |
I'm sort of frustrated that a lot of people in the field 01:15:53.640 |
misunderstand what hallucination even means, right? 01:15:56.320 |
So a lot of people are conflating hallucination 01:16:00.720 |
So they're like, oh, the model made a mistake. 01:16:06.920 |
Hallucination, I think is very specific kind of, 01:16:10.960 |
So I have some sort of counterfactual ground truth. 01:16:19.320 |
And so, yeah, I think there's a bunch of folks 01:16:22.800 |
at Stanford also working on better measurements 01:16:25.200 |
of hallucination and definitions and things like that. 01:16:44.800 |
So if we're talking about like hallucination and, 01:16:59.800 |
making mistakes before it was called making mistakes. 01:17:09.080 |
I guess this is solving the hallucination question 01:17:17.600 |
And so, you know, if I generate the building documents 01:17:20.160 |
saying, "Oh, well, I've never been a president," 01:17:24.520 |
Are you considering work on that, on this ground truth? 01:17:27.800 |
- Yeah, so I like the sort of silo idea you mentioned there as well. 01:17:27.800 |
So I think the whole point is that you can have 01:17:38.320 |
So I think you could say, "I only trust the archive," 01:17:47.520 |
And so you can make decisions in your architecture 01:17:49.680 |
during test time about what you define as ground truth. 01:17:59.160 |
You can control for how grounded you want it to be 01:18:04.040 |
So that's another kind of misconception about hallucinations. 01:18:07.840 |
Like sometimes hallucinations are actually good, right? 01:18:12.040 |
and you wanted to come up with some cool new ideas, 01:18:16.800 |
So I think what you want to have is kind of a tunable knob 01:18:19.760 |
where you say like, "Oh, now you can hallucinate, 01:18:21.720 |
"and now maybe you should really tell me the truth only." 01:18:34.840 |
The sampling temperature is already sort of onto that, how much you're sampling. 01:18:34.840 |
So how flat your distribution is that you sample from. 01:18:56.560 |
it can still come up with random stuff, right?
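That "how flat your distribution is" knob is just the sampling temperature; here is a toy sketch (not from the lecture) of what it does to the next-token distribution.

```python
import numpy as np

def next_token_distribution(logits, temperature=1.0):
    """Softmax with a temperature knob: a high temperature flattens the distribution
    (more adventurous, more hallucination-prone sampling), a low one sharpens it
    toward the most likely, most 'grounded' token."""
    scaled = np.asarray(logits, dtype=float) / temperature
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

logits = [2.0, 1.0, 0.2]
print(next_token_distribution(logits, temperature=0.5).round(3))  # peaked
print(next_token_distribution(logits, temperature=2.0).round(3))  # flat
```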