
Stanford CS25: V3 I Retrieval Augmented Language Models



00:00:00.000 | - Hey guys, welcome to our last lecture of this quarter.
00:00:05.000 | And we're very happy to have Douwe here.
00:00:15.360 | He's the CEO of Contextual AI, the enterprise LLM company,
00:00:20.360 | as well as an adjunct professor
00:00:22.320 | in symbolic systems here at Stanford.
00:00:24.960 | And previously he was the head of research at Hugging Face.
00:00:28.080 | And before that, a research scientist
00:00:29.880 | at Facebook AI Research.
00:00:32.400 | He received his PhD and master's
00:00:34.200 | from the University of Cambridge,
00:00:36.080 | as well as a master's in logic
00:00:37.400 | from the University of Amsterdam,
00:00:39.120 | and studied philosophy and cognitive AI in undergrad.
00:00:42.560 | And his work focuses on machine learning as well as NLP,
00:00:46.540 | specifically on developing better models
00:00:48.480 | for language understanding and generation,
00:00:51.200 | and better tools for evaluation and many more.
00:00:54.680 | Yeah, give it up for Douwe.
00:00:56.880 | - Right, thank you.
00:01:00.120 | So I guess I have to sort of stand here in the corner
00:01:02.760 | so people can see me on the Zoom as well.
00:01:05.000 | Yeah, thanks so much for having me here.
00:01:09.280 | So I asked Stephen what I should talk about.
00:01:12.280 | There were a couple of things I could talk about,
00:01:14.120 | multimodality or evaluation.
00:01:16.600 | And this was the preferred topic, I guess,
00:01:20.120 | because the others were already covered.
00:01:22.420 | So yeah, I'm very happy to talk to you
00:01:25.160 | about everything retrieval augmentation.
00:01:27.720 | I think this is really one of the coolest topics
00:01:30.740 | right now in our field.
00:01:32.960 | So I'll just give you an overview of what's been happening
00:01:35.960 | and what I think are the interesting questions
00:01:38.360 | to think about.
00:01:39.200 | So first of all, obviously, in case you've missed it,
00:01:43.420 | we are in the age of language models.
00:01:45.920 | And I just wanted to do a quick poll here
00:01:48.840 | in this not super big audience.
00:01:51.200 | I guess there's more people on the Zoom,
00:01:52.480 | but who invented language models?
00:01:54.860 | If you thought OpenAI, then I'm angry with you, right?
00:02:00.840 | So actually, this is a very, very old idea.
00:02:04.160 | So the idea is just you take a sequence
00:02:06.360 | and you factorize out the token probabilities, right?
00:02:09.800 | And so it wasn't invented by OpenAI.
00:02:12.880 | It's not like a few years old.
00:02:14.680 | It's actually several decades old.
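For reference, the factorization being described is just the chain rule over tokens; a language model simply parameterizes each conditional:

```latex
% Probability of a sequence as a product of per-token conditionals.
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})
```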
00:02:16.640 | So I'm bringing this up because I was talking to someone
00:02:19.920 | and they were like, "OpenAI invented language models."
00:02:22.300 | And I was like, "You're kidding me, right?"
00:02:24.440 | So I went back to the literature
00:02:28.320 | and this is the oldest one I could find, actually.
00:02:30.280 | 1991, first neural language model.
00:02:32.680 | There's a very nice paper from 2003 from Bengio
00:02:36.720 | where they actually have word embeddings
00:02:39.800 | and everything already in there.
00:02:42.000 | So obviously, these are LMs, not LLMs.
00:02:45.360 | And as it turns out, if you make them really big
00:02:48.200 | and you parameterize them with these massive neural nets,
00:02:51.560 | then you get something really powerful
00:02:53.000 | that really shows emergent properties.
00:02:55.680 | And that's why we're all so excited in this stuff.
00:02:58.180 | So if we think about this from a classic CS perspective,
00:03:02.940 | there's input-output, right?
00:03:04.240 | There's this kind of thing in the middle.
00:03:05.980 | It's the generator.
00:03:07.240 | So we take a sequence, the input sequence,
00:03:10.240 | and then the task of the model is to predict the next token.
00:03:14.340 | Very, very simple model.
00:03:16.000 | And so that's why it was so easy to come up with this
00:03:19.880 | in 1991 already, because the idea is very intuitive.
00:03:23.860 | But for a long time, what was really broken with this
00:03:26.880 | was the user interface.
00:03:28.460 | And this, I think a lot of people kind of misunderstand
00:03:33.000 | what ChatGPT was about.
00:03:34.960 | That's really what ChatGPT fixed.
00:03:37.080 | So that initially you had to come up
00:03:39.360 | with these very weird prompts
00:03:41.000 | in order to get your language model
00:03:42.440 | to do what you wanted it to do.
00:03:44.400 | And humans are terrible at this, right?
00:03:46.320 | So we're much better at sort of telling people
00:03:49.240 | or things around us what we want, right?
00:03:51.100 | So if we have a dog, we say, "Sit."
00:03:52.800 | We don't prompt it in a very weird way so that it sits,
00:03:56.280 | right?
00:03:57.120 | And it's the same with the language model.
00:03:58.360 | If you wanted to generate some rap lyrics
00:04:01.240 | in the style of a pirate or Shakespeare or something,
00:04:04.160 | then you tell it generate some rap lyrics
00:04:06.200 | in the style of a pirate, right?
00:04:08.000 | So that kind of instruction data
00:04:10.160 | actually turns out to be super, super rare in just web data.
00:04:14.160 | So what you need to do is you need to fix the user interface
00:04:16.840 | to the language model.
00:04:18.040 | And the classic recipe for doing that
00:04:20.280 | is the sequence basically that ChatGPT used.
00:04:23.440 | So you prompt the model in a specific way,
00:04:25.040 | you instruction finetune the model,
00:04:26.560 | and you do some alignment, RLHF,
00:04:29.560 | whatever you do on top of that.
00:04:31.840 | So that's the first thing.
00:04:32.880 | So now you have a working language model
00:04:35.000 | with a working user interface.
00:04:37.800 | So are we done then?
00:04:39.200 | Obviously we're not, right?
00:04:41.120 | So right now language models
00:04:43.140 | are kind of taking the world by storm.
00:04:45.040 | But if you talk to anyone, especially in an enterprise,
00:04:47.960 | for example, where they have
00:04:48.800 | very strict accuracy requirements,
00:04:51.800 | they will tell you that
00:04:52.760 | they can't really productionize this yet.
00:04:55.160 | And the reason is
00:04:56.400 | because there are all these familiar problems,
00:04:57.920 | probably a bunch of you are working on these problems
00:04:59.920 | right now around hallucination.
00:05:03.320 | So these models, they kind of make up stuff
00:05:05.360 | very often with very high confidence,
00:05:06.960 | which is even more scary in a way.
00:05:10.640 | Attribution, so we don't really know
00:05:12.120 | why these models are saying what they're saying.
00:05:14.800 | Staleness, they go out of date.
00:05:16.400 | And so this was a big problem with sort of ChatGPT,
00:05:18.840 | not knowing anything that happened
00:05:20.320 | after a certain cutoff date,
00:05:22.000 | and they keep updating it every once in a while.
00:05:23.960 | But you want to have a system
00:05:24.960 | that's always completely up to date, that never goes stale.
00:05:27.960 | You want to be able to revise the information in the system.
00:05:31.440 | So if you're a European organization,
00:05:34.540 | you have to worry about GDPR,
00:05:36.760 | which means that you need to be able to remove information
00:05:38.900 | from the language model or maybe revise facts,
00:05:42.140 | which we don't really know how to do.
00:05:43.840 | So again, this is a very interesting area of study
00:05:47.320 | for a lot of folks, model editing.
00:05:49.840 | But so this is something
00:05:50.960 | that we really want to be able to fix.
00:05:53.320 | And then there's this big question
00:05:55.040 | of how do you customize these models?
00:05:57.720 | So different people have different use cases,
00:05:59.760 | you have different data, if you're a company,
00:06:01.640 | or if you want to have a language model on your own data,
00:06:04.320 | how do you make it work on your own data?
00:06:06.680 | So one of the solutions
00:06:08.960 | that everybody has started using right now
00:06:11.400 | is to couple it to an external memory.
00:06:13.240 | So that's really just RAG, right?
00:06:15.240 | This whole lecture is basically about RAG,
00:06:19.720 | but the way to understand what is going on here
00:06:22.840 | is we have this generator just like before,
00:06:25.760 | we have the input and the prompt just like before,
00:06:27.680 | but now instead of just giving those two things,
00:06:31.000 | we give this additional context.
00:06:32.680 | So we contextualize the language model
00:06:34.960 | using things we've retrieved.
00:06:37.040 | And the retriever is very often pretty simple,
00:06:40.600 | it's just a query and a document encoder.
00:06:43.000 | And then you get a bunch of documents,
00:06:45.240 | you give them as context to the model.
00:06:47.920 | So super simple architecture.
00:06:49.880 | And I think it's useful to think about it
00:06:53.960 | from the perspective of these two separate paradigms.
00:06:57.680 | So if you've ever taken an exam, I'm sure you have, right?
00:07:01.160 | You can have a closed book exam
00:07:02.440 | where you have to memorize all of this,
00:07:03.760 | so you have to cram all the knowledge
00:07:05.120 | into your parameters, your neurons,
00:07:08.440 | or you have an open book exam
00:07:09.720 | where you have all of this information in the book
00:07:11.920 | that you can access when you do the exam.
00:07:14.800 | So it's a very similar thing with RAG, right?
00:07:16.720 | You can just make it an open book setting
00:07:18.480 | where you can give it access to this external information,
00:07:21.160 | Wikipedia, or something else,
00:07:22.880 | or basically the entire internet,
00:07:25.560 | and then have the language model do its job
00:07:27.440 | without having to memorize all of it in its parameters.
00:07:30.360 | So the other, I think, useful distinction here
00:07:33.720 | is that cramming everything into your parameters,
00:07:36.800 | that's the parametric approach, right?
00:07:38.600 | So what we're doing with RAG
00:07:40.680 | is we're adding this non-parametric retrieval component.
00:07:44.120 | So you might call this semi-parametric
00:07:47.440 | if you want to give this a name.
00:07:49.240 | All right, so why does that actually solve these issues?
00:07:54.880 | And so the answer is basically
00:07:56.720 | that if you have this separate index,
00:07:58.960 | right, this separate retriever,
00:08:00.600 | you can swap it in, you can swap it out,
00:08:02.320 | you can replace it with a new index,
00:08:04.480 | so you can really customize it.
00:08:06.240 | And so you can customize your language model system
00:08:09.360 | for what the user really wants to see.
00:08:11.720 | And then obviously you can update this index
00:08:14.440 | so it doesn't really go stale,
00:08:16.720 | and you can revise it
00:08:18.800 | if anything goes wrong.
00:08:19.960 | The other thing you get is grounding, right?
00:08:23.440 | So that's initially why I became interested
00:08:25.760 | in this kind of architecture,
00:08:27.040 | because I was thinking a lot about grounding
00:08:28.720 | and multimodality and things like that.
00:08:30.320 | And actually one really nice way to ground things
00:08:32.840 | is to find some other information
00:08:34.920 | that you can ground your generation in.
00:08:36.600 | And so you really want the language model
00:08:38.600 | to only say things that it has evidence for
00:08:41.640 | in this other piece of text,
00:08:43.640 | or even multimodal data that it retrieves separately.
00:08:46.600 | So if you do that, then you get less hallucination,
00:08:48.760 | because you can always point back to your source,
00:08:50.600 | it's always grounded in your source.
00:08:52.600 | And you get attribution because you know
00:08:54.520 | why the model is saying what it's saying:
00:08:56.160 | it's because it found this thing here.
00:08:58.960 | Is that clear?
00:09:02.040 | All right, so for the rest of this lecture,
00:09:05.440 | we're gonna talk about this basic architecture.
00:09:09.000 | And so it kind of looks like a pretty simple thing, right?
00:09:12.800 | But there are actually lots and lots of questions
00:09:14.800 | you can ask about what this system should really look like.
00:09:18.840 | And this doesn't even cover
00:09:21.040 | half the questions you can ask.
00:09:23.080 | So it really is about how do we optimize
00:09:25.800 | this entire system, right?
00:09:27.920 | So we have these separate components,
00:09:29.400 | the retriever, the generator,
00:09:31.280 | and then there are things like this query encoder,
00:09:34.640 | how do we encode queries?
00:09:35.960 | How do we do the retrieval?
00:09:38.080 | Do we update the documents encoder?
00:09:40.320 | How do we actually define a document, right?
00:09:43.440 | Is it like a full document, or is it a paragraph,
00:09:45.560 | or a chunk, or a sentence, or a couple of words?
00:09:48.600 | So there are lots of questions to ask.
00:09:50.720 | And as you'll see, there are lots of possible answers
00:09:54.720 | to these questions as well.
00:09:56.680 | So this is what we'll cover.
00:09:59.200 | So there are lots of architectures
00:10:03.680 | going into these questions.
00:10:06.040 | And I think as we go through them,
00:10:08.760 | it's useful for you to think about
00:10:10.600 | what happens during training time
00:10:12.120 | and what happens during test time, right?
00:10:14.360 | So during training time, it's really,
00:10:16.720 | okay, we have this language model,
00:10:17.960 | we have this retriever, which one do we update?
00:10:21.800 | How do we update them?
00:10:22.960 | How do we train this entire system?
00:10:24.800 | Do we maybe not train it at all?
00:10:27.000 | Do we pre-train it from scratch?
00:10:28.480 | Do we initialize it with components
00:10:31.040 | that were already separately trained?
00:10:33.000 | These are the kinds of questions that you have to answer
00:10:35.120 | if you wanna design a system like this.
00:10:37.840 | And then during test time, you have this entire system,
00:10:41.520 | right, so actually multiple models in a way
00:10:43.760 | that are working together.
00:10:46.000 | So there's also different things you can do there, right?
00:10:49.720 | So give it different indices during test time
00:10:52.000 | or manipulate kind of how you're sampling,
00:10:54.560 | things like that.
00:10:55.480 | So the starting point for all of this stuff,
00:10:59.680 | I think if you ask someone now, like, what is RAG,
00:11:02.320 | they will think of this thing.
00:11:04.600 | So this is frozen RAG, basically.
00:11:07.160 | There's no training here at all.
00:11:09.680 | So going back to this question of train time, test time,
00:11:12.160 | there's only test time here.
00:11:13.320 | Train time happens separately
00:11:14.960 | with these kind of black box models
00:11:16.840 | that we don't necessarily have control over, right?
00:11:19.000 | So there's this document embedding model,
00:11:22.000 | whatever is currently at the top
00:11:23.720 | of some open source leaderboard.
00:11:25.920 | You use that to, oops, sorry,
00:11:29.240 | to get some vectors that you then use
00:11:32.040 | to create this vector database.
00:11:34.120 | And then the vector database just does search
00:11:36.040 | and it gives the information from the search
00:11:38.440 | to the language model.
00:11:39.840 | And it just passes it as the context, right?
00:11:43.720 | So this only works because of in-context learning.
00:11:47.840 | And I think as a machine learner myself,
00:11:51.760 | this feels very inelegant.
00:11:54.160 | So what this lecture is about is,
00:11:56.280 | can we do better than this frozen thing?
00:11:59.480 | So let's start from the left side of this.
00:12:04.960 | Like, okay, if we want to outperform
00:12:06.560 | this frozen thing itself with just the vector database,
00:12:09.560 | like, what would that look like
00:12:11.000 | from a retrieval perspective?
00:12:12.720 | And the starting point for everything retrieval
00:12:16.640 | is TF-IDF.
00:12:18.080 | Does everybody know what TF-IDF is?
00:12:20.200 | No, okay.
00:12:22.160 | So TF-IDF is basically a sparse retrieval method
00:12:26.360 | where you have a score function
00:12:28.840 | that looks at documents and queries, so D and Q.
00:12:33.280 | And then there are basically two terms that matter.
00:12:35.240 | One is the TF, the term frequency,
00:12:37.280 | and the other is the IDF, the inverse document frequency.
00:12:40.680 | So this inverse document frequency
00:12:42.120 | is actually a really nice idea from Karen Spärck Jones,
00:12:45.120 | a really underrated researcher.
00:12:46.520 | She's done some amazing work.
00:12:48.040 | But the basic idea is that you want to look at the words
00:12:52.240 | that are very special,
00:12:53.560 | the ones that don't occur in lots of different documents.
00:12:55.880 | And so the overlap between the word "the"
00:12:58.400 | doesn't really matter, right?
00:12:59.440 | Like, "the" occurs everywhere.
00:13:01.480 | So you want to have sort of the special words.
00:13:04.040 | So that's what TF-IDF does in a nutshell.
00:13:06.440 | It gives you a score for document query overlap.
00:13:10.000 | And then you can do all kinds of things here
00:13:12.440 | with how you weight it.
00:13:13.720 | So there's all these weird, different parameters,
00:13:15.600 | like this B and things like that,
00:13:17.640 | that allow you to make it better
00:13:19.560 | than just having the TF-IDF score.
00:13:22.320 | So there's a couple of tweaks you can do there.
00:13:24.480 | So BM25, actually, in case you're wondering,
00:13:27.040 | stands for Best Match 25.
00:13:29.360 | So I tried to discover, like,
00:13:31.200 | where does the 25 actually come from?
00:13:33.960 | That's because the prior,
00:13:35.800 | sort of the preceding 24 experiments failed, right?
00:13:39.120 | So it's literally the 25th one that seemed to work,
00:13:41.480 | and that's why it's called BM25.
00:13:44.000 | It's bizarre, right?
00:13:44.800 | But so this is sparse retrieval.
00:13:48.720 | It's just counting words, right?
00:13:49.960 | So you have this massive, massive vector
00:13:52.200 | of all these word occurrences.
00:13:53.840 | It's sparse because most words never occur, right?
00:13:56.360 | So it's sort of like a vector
00:13:58.080 | of vocabulary size dimensions.
00:14:02.160 | So most of that is obviously zero.
00:14:04.840 | But so that's actually kind of a nice property
00:14:07.080 | if you want to do fast search on a CPU, right?
00:14:09.680 | Because on a CPU, sparse dot product
00:14:12.400 | is very easy to compute.
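To make the sparse idea concrete, here is a minimal sketch of TF-IDF-style scoring as a sparse dot product; the example documents are made up, and the weighting is simplified rather than the exact BM25 formula.

```python
import math
from collections import Counter

# Minimal sketch of sparse lexical scoring (TF-IDF flavour, not exact BM25):
# documents and queries live in vocabulary-sized count vectors, and the score
# is a sparse dot product weighted by inverse document frequency.
docs = [
    "warsaw is the capital of poland",
    "the dog sat on the mat",
]
doc_tfs = [Counter(d.split()) for d in docs]
N = len(docs)

def idf(term: str) -> float:
    # Rare terms get a high weight; a word like "the" gets a low one.
    df = sum(1 for tf in doc_tfs if term in tf)
    return math.log((N + 1) / (df + 1)) + 1.0

def score(query: str, tf: Counter) -> float:
    # Sparse dot product: only terms shared by query and document contribute.
    return sum(tf[t] * idf(t) for t in query.split() if t in tf)

query = "capital of poland"
ranked = sorted(range(N), key=lambda i: score(query, doc_tfs[i]), reverse=True)
print(ranked)  # document 0 should come out on top
```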
00:14:14.440 | So this is used in the system called DrQA,
00:14:19.360 | which is really one of the first neural instances
00:14:22.400 | of this open domain,
00:14:23.840 | sort of open book question answering paradigm.
00:14:27.280 | So you have a question,
00:14:28.800 | like how many of Warsaw's inhabitants, blah, blah.
00:14:31.720 | So you want to ask, basically, Wikipedia
00:14:34.160 | what the answer is for this.
00:14:35.280 | So then you have this document retriever
00:14:37.040 | based on sparse retrieval methods, so BM25, I think, in this case.
00:14:41.600 | You pass that to
00:14:44.040 | a document reader model, which I think
00:14:48.280 | was still an LSTM at the time,
00:14:50.680 | and then that model gives you the answer.
00:14:53.080 | So this, I think, is really the first instance
00:14:56.440 | of having sort of this separation
00:14:58.080 | between a retrieval and a generator system
00:15:01.280 | that you use for answering complicated questions
00:15:03.520 | based on sort of open domain knowledge.
00:15:06.640 | So after the sparse stuff,
00:15:09.000 | there was a bunch of work on dense retrieval.
00:15:12.800 | And so the advantage of dense retrieval,
00:15:15.320 | so this is just like word embeddings, basically vectors,
00:15:18.000 | like they're dense now, no longer sparse,
00:15:20.240 | so they're much smaller in terms of dimensionality.
00:15:24.080 | And a nice advantage of dense retrieval
00:15:26.760 | is that it's not really about specific words, right?
00:15:29.000 | So if there are synonyms,
00:15:31.440 | you can still find the relevant document,
00:15:35.080 | which you couldn't really do with a sparse representation.
00:15:37.760 | So that's really the advantage of dense
00:15:39.760 | is that you get like semantic similarity.
00:15:41.920 | So you can do this over word embeddings.
00:15:46.040 | That doesn't really work all that well,
00:15:47.440 | but at the time that people started thinking about this,
00:15:50.240 | BERT was already out there,
00:15:51.440 | and BERT is really great for giving you
00:15:53.000 | a vector representation for an entire sequence of words.
00:15:56.320 | So a sentence representation or a passage representation.
00:15:59.560 | So there are all these cool systems like ORQA
00:16:01.920 | and DPR, the Dense Passage Retriever,
00:16:05.200 | where they essentially use the retrieval
00:16:08.840 | as a kind of latent variable in the system.
00:16:11.400 | And the way to get the latent variable to work,
00:16:14.720 | to be good enough essentially to train the entire system
00:16:18.200 | is to pre-train the retriever on relevant information.
00:16:21.600 | So for ORQA, they do something called the inverse cloze task.
00:16:25.640 | So they do kind of a cloze task
00:16:27.080 | where you want to find passages
00:16:30.280 | that are sort of relevant to the preceding passage.
00:16:33.520 | And in DPR, they just train it on a supervised thing.
00:16:36.680 | But really the core idea here is that,
00:16:39.000 | as you can see in this graph here,
00:16:40.840 | you can do better than BM25 if you add lots of documents
00:16:44.560 | and the way you compute the score function
00:16:46.320 | is much simpler, it's just a dot product.
00:16:48.360 | So the nice thing about dot products
00:16:53.960 | is that you can do them very, very efficiently
00:16:56.440 | on the GPU as well if you know what you're doing.
00:17:00.560 | So what you really want to get at
00:17:03.080 | is maximum inner product search, MIPS, right?
00:17:05.480 | This is one of the kind of core ideas
00:17:07.080 | of a lot of this stuff.
00:17:08.720 | And you can do MIPS with ANN,
00:17:12.200 | approximate nearest neighbor search.
00:17:14.040 | And so there's this really brilliant piece of work
00:17:17.960 | out of FAIR, from my colleagues at the time,
00:17:20.960 | called FAISS, which really underlies
00:17:23.240 | all of these modern vector databases, right?
00:17:26.160 | So all the popular ones,
00:17:28.160 | they're sort of re-implementations of this FAISS idea.
00:17:30.480 | One is in Rust, one is in Go,
00:17:32.080 | but it's all basically the same idea, it's just FAISS.
00:17:35.000 | And so FAISS really powers a lot of this stuff.
00:17:39.320 | And whenever somebody tells you something
00:17:41.640 | about a vector database, just think about FAISS,
00:17:44.120 | very fast dot product.
00:17:45.720 | So obviously, you can go beyond dot product, yes?
00:17:51.880 | - What is it, what is FAISS?
00:17:53.640 | - What is FAISS?
00:17:55.240 | So it's an open source library,
00:17:57.040 | Facebook AI similarity search.
00:17:59.200 | No, so it's just basic off-the-shelf ANN algorithms.
00:18:06.640 | Yeah, so there are all kinds of different,
00:18:13.160 | I don't know if you, do you know what like
00:18:14.600 | product quantization is and things like that?
00:18:17.000 | So there are basically, so you have a bunch of vectors
00:18:20.440 | and you can just compute the full dot product,
00:18:23.440 | which is sort of inefficient, right?
00:18:24.880 | So what you can do is try to compress subspaces
00:18:28.480 | of the vector, and then just look at the kind of centroids.
00:18:31.880 | So you can quantize sub-vectors of the full vector
00:18:36.520 | and then do much faster search over just the centroids.
00:18:39.640 | It's a good question, any other questions?
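As a rough illustration of what FAISS-style search looks like in practice (assuming the faiss package is installed; the vectors below are random placeholders for real passage embeddings):

```python
import numpy as np
import faiss  # Facebook AI Similarity Search

d, n = 128, 10_000
xb = np.random.rand(n, d).astype("float32")   # document embeddings
xq = np.random.rand(5, d).astype("float32")   # query embeddings

# Exact maximum inner product search (MIPS).
flat = faiss.IndexFlatIP(d)
flat.add(xb)
scores, ids = flat.search(xq, 10)

# Approximate search with product quantization: sub-vectors are quantized to
# centroids, which is the compression described above. Note this index
# defaults to L2 distance; a real setup would configure the metric.
nlist, m = 100, 8                       # coarse clusters, PQ sub-vectors
quantizer = faiss.IndexFlatIP(d)
ivfpq = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
ivfpq.train(xb)
ivfpq.add(xb)
scores_ann, ids_ann = ivfpq.search(xq, 10)
```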
00:18:44.400 | All right, so about this dot product idea.
00:18:50.920 | So what we have here is,
00:18:53.120 | some people call this a Siamese network,
00:18:55.200 | I guess it is, right?
00:18:56.040 | So you have two different BERT models
00:18:59.000 | or whatever your encoder is here.
00:19:00.680 | And then at the end, you get these two vectors
00:19:02.480 | and then you just do dot product
00:19:03.960 | so you get one single score.
00:19:05.760 | But you can do all kinds of much fancier things
00:19:08.040 | if you're willing to give up
00:19:09.840 | on this bi-encoder approach, right?
00:19:12.440 | So a really nice example from one of your colleagues
00:19:19.440 | here at Stanford is ColBERT.
00:19:19.440 | So what this does is late interaction.
00:19:22.920 | So instead of just having this dot product here,
00:19:25.640 | you have a kind of more complicated version
00:19:29.480 | of computing a score where you aggregate
00:19:31.480 | over sort of maximum similarity scores
00:19:33.520 | between different words.
00:19:35.120 | So I only recently actually discovered
00:19:36.920 | that this is called ColBERT
00:19:37.960 | because of the late night show, Colbert.
00:19:40.560 | So it's sort of Omar's joke, actually, this name,
00:19:43.600 | but just so you know, if you run into it.
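A toy, shape-level contrast between the bi-encoder dot product and ColBERT-style MaxSim scoring; the embeddings here are random stand-ins for real BERT outputs.

```python
import torch

dim = 64
q_tokens = torch.randn(8, dim)    # per-token query embeddings
d_tokens = torch.randn(100, dim)  # per-token document embeddings

# Bi-encoder: pool each side into one vector, score with a single dot product.
q_vec, d_vec = q_tokens.mean(0), d_tokens.mean(0)
biencoder_score = torch.dot(q_vec, d_vec)

# ColBERT-style late interaction: for every query token, take its maximum
# similarity over all document tokens, then sum over query tokens.
sim = q_tokens @ d_tokens.T               # (8, 100) token-level similarities
maxsim_score = sim.max(dim=1).values.sum()
```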
00:19:48.960 | So, but I think if we look at kind of where
00:19:52.520 | the state of the art has been going now,
00:19:55.320 | one of the nice things about these vector databases
00:19:57.520 | is that they're super efficient, right?
00:19:58.960 | So dot product is much more efficient
00:20:00.800 | than this late interaction stuff,
00:20:02.200 | especially if you do the approximate
00:20:03.840 | nearest neighbor search.
00:20:05.080 | But there's been some really cool work.
00:20:08.000 | So things like SPLADE,
00:20:09.440 | they basically have sparse meet dense in a way.
00:20:14.560 | So one of the big problems, as I said,
00:20:16.080 | with sparse is that you can't really handle synonyms
00:20:18.280 | and things like that.
00:20:19.440 | But what you could do is take a dense model,
00:20:22.120 | like a BERT model, look at kind of this one word
00:20:25.520 | in your sequence, try to see which other words
00:20:28.120 | fit in the same slot.
00:20:29.480 | So that gives you the synonyms.
00:20:31.720 | So now you can give all these synonyms to a sparse vector,
00:20:36.000 | and then you can just do sparse dot product.
00:20:38.120 | And so you have a much more efficient way to do search
00:20:41.000 | without sort of giving up on all the cool stuff
00:20:45.440 | that you get from a dense representation.
00:20:48.120 | So that's one thing.
00:20:49.280 | And this other idea I really like is called DRAGON.
00:20:52.720 | So this I think is really the best
00:20:56.400 | generalized dense retriever.
00:20:57.840 | So if you want to take something off the shelf right now
00:20:59.760 | and just go to Hugging Face or something,
00:21:01.760 | then this DRAGON or DRAGON+ is probably the thing
00:21:04.560 | you want to use for a dense retriever.
00:21:06.560 | And the way they train this is through this
00:21:09.080 | progressive data augmentation strategy
00:21:11.480 | to make the model better and better over time
00:21:13.600 | by sampling very difficult negatives.
00:21:16.040 | And that gives you very good representations.
00:21:20.440 | And so the other thing about this,
00:21:22.600 | I think this is the only sort of final point
00:21:24.880 | about retrieval in general
00:21:27.080 | is that what we see happening right now,
00:21:29.480 | if you look at sort of the developer community around DRAGON
00:21:32.120 | is that they're all doing hybrid search right now.
00:21:34.840 | So you can actually just combine the search results
00:21:37.200 | from your sparse BN25 or whatever thing, or SPLADE,
00:21:41.280 | and you can combine them with your DRAGON,
00:21:44.040 | and then you'll get this ranking that works even better.
00:21:47.160 | So then you kind of get best of both worlds,
00:21:48.840 | but then you get all these questions
00:21:50.080 | about how do you combine the results.
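One simple and common way to combine the two ranked lists is reciprocal rank fusion; this is just a sketch of that idea, not necessarily what any particular production system does.

```python
# Merge a sparse (BM25) ranking with a dense (e.g. DRAGON) ranking via
# reciprocal rank fusion. Inputs are ranked lists of document ids; k=60 is
# the commonly used smoothing constant.
def reciprocal_rank_fusion(rankings, k: int = 60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d3", "d1", "d7", "d2"]
dense_ranking = ["d1", "d4", "d3", "d9"]
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
```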
00:21:52.160 | Any questions on this part?
00:21:55.760 | - Oh, can you hear me?
00:21:59.040 | - Yes.
00:21:59.960 | - Oh, sorry.
00:22:01.000 | On the earlier slide, has there been any work benchmarking
00:22:05.520 | how much less hallucination RAG incurs
00:22:08.200 | over a closed book question answering,
00:22:11.040 | for example, directly asking
00:22:12.520 | the large language model the question,
00:22:14.440 | has there been any benchmarking studies in this?
00:22:17.480 | - Yeah, so there's a great paper,
00:22:19.280 | if I can say so myself,
00:22:20.800 | on the fact that retrieval augmentation
00:22:22.560 | reduces hallucination.
00:22:24.520 | It's from 2021, I think.
00:22:26.120 | So yeah, you can just find,
00:22:28.680 | if you literally look for retrieval augmentation
00:22:31.080 | reduces hallucination, then you'll find the paper.
00:22:34.240 | - Thank you.
00:22:38.200 | (indistinct)
00:22:40.600 | - Yeah, so very often you want to have
00:22:47.560 | very precise word overlap for things
00:22:51.920 | where you don't want to have the synonyms
00:22:53.800 | or the kind of nearest neighbors, right?
00:22:55.320 | So if there's like a brand name or something like that,
00:23:00.120 | then like, let's say the brand is Apple, right?
00:23:02.560 | You don't want to find stuff about pears, right?
00:23:05.160 | So that's what you would do with a dense retriever.
00:23:08.480 | So it really kind of depends on what you want to use it for.
00:23:12.120 | That's why hybrid is probably the way to go.
00:23:14.320 | It's a good question.
00:23:17.000 | - Like with the dense retriever,
00:23:19.320 | isn't it contextualized, in the sense that
00:23:24.440 | it would realize Apple, the company, would be different?
00:23:28.120 | - No, so if they were actually contextualized, then yes,
00:23:31.520 | but very often it's a frozen retrieval system, right?
00:23:35.160 | That's one of the problems
00:23:36.120 | with all the frozen rack stuff.
00:23:37.960 | (indistinct)
00:23:44.360 | No, so the sort of document and the query,
00:24:00.120 | they're the same, right?
00:24:01.280 | So they're either sparse or they're dense.
00:24:03.760 | So if they're sparse,
00:24:04.920 | the components of the vector are literally the other words.
00:24:08.120 | (indistinct)
00:24:12.320 | So it's literally counts, right?
00:24:22.040 | So basically it's one big matrix with documents as rows
00:24:26.720 | and the columns are the words in the documents.
00:24:29.320 | And then you just count how often a word occurs
00:24:31.600 | in a document, right?
00:24:33.160 | So that's as far as that.
00:24:35.240 | (indistinct)
00:24:37.640 | Yeah, and so in the field,
00:24:42.520 | we call them sparse embeddings or sparse retrieval
00:24:46.520 | because most of that vector is zero, right?
00:24:49.640 | Because most words don't occur in that document.
00:24:52.080 | Does that make sense?
00:24:55.240 | - Yeah.
00:24:56.080 | - Cool.
00:25:00.440 | So let's talk about doing slightly better.
00:25:05.000 | So going back to Stephen's question about,
00:25:07.000 | okay, we have this kind of retrieval thing,
00:25:08.920 | but how do we actually make this retriever good
00:25:11.320 | for the context that is going to be used in, right?
00:25:14.520 | So can we contextualize the retriever for the generator,
00:25:18.320 | even if it's a generator
00:25:20.040 | where we might not have access to the weights?
00:25:22.200 | So it could be a GPT-4 model,
00:25:24.160 | we just send it to some API, we get some stuff back.
00:25:28.200 | And so one paper I really like is called Replug.
00:25:31.560 | So just to kind of explain what this looks like,
00:25:35.040 | so you have this context,
00:25:36.240 | you have a retriever that we do
00:25:38.320 | the standard retrieval step with,
00:25:39.840 | this is a dense retriever.
00:25:42.040 | And now, sorry, and now you compute the likelihood.
00:25:46.880 | So basically just normalize the scores
00:25:48.960 | that you get for the top K documents
00:25:51.480 | to get a distribution here.
00:25:53.040 | And then you'll give each one of the retrieved documents
00:25:57.000 | separately to this generator, to your language model.
00:26:00.680 | So you can look at the perplexity of the correct answer
00:26:04.200 | for that language model, right?
00:26:06.200 | So now we have these two probability distributions
00:26:08.840 | or two likelihoods essentially,
00:26:10.360 | and we can minimize the KL divergence
00:26:12.640 | to make sure that we can actually retrieve the documents
00:26:16.000 | that lead to the lowest perplexity
00:26:18.200 | on the right answer for the language model.
00:26:20.440 | So super simple idea, works really, really well.
00:26:26.520 | And the nice thing about this is it's completely agnostic
00:26:29.560 | of what happens upstream, right?
00:26:31.080 | So this will work for any sort of encoder, decoder,
00:26:33.600 | for any language model.
00:26:35.760 | What you need is a perplexity score,
00:26:39.000 | but for most language models, you can get that,
00:26:41.440 | not necessarily all of them.
00:26:43.200 | So that's one thing.
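A minimal sketch of that training signal, assuming the per-document answer log-likelihoods have already been obtained from the frozen language model; the exact REPLUG loss differs in details.

```python
import torch
import torch.nn.functional as F

# Retrieval scores would come from the dense retriever (query · document);
# lm_logprobs would be the log-likelihood of the correct answer when each
# retrieved document is given to the (black-box) language model.
retrieval_scores = torch.randn(8, requires_grad=True)  # top-k document scores
lm_logprobs = torch.randn(8)                           # from the frozen LM, no gradient

p_retriever = F.log_softmax(retrieval_scores, dim=0)   # retrieval likelihood (log space)
q_lm = F.softmax(lm_logprobs, dim=0)                   # LM "preference" over the same docs

# KL(Q_LM || P_retriever): pushes the retriever towards documents that make
# the LM assign higher probability to the right answer.
loss = F.kl_div(p_retriever, q_lm, reduction="sum")
loss.backward()   # gradients flow into the retriever only
```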
00:26:44.040 | And then there's this other really nice approach.
00:26:46.680 | (indistinct)
00:26:51.920 | So in the retriever,
00:26:53.440 | you're literally updating the dense representations, right?
00:26:58.360 | So your encoder, basically, for your dense representation.
00:27:01.000 | That's a good question.
00:27:01.840 | We'll get into that a little bit more.
00:27:04.680 | So there's another paper
00:27:06.640 | on in-context retrieval augmented language models,
00:27:09.840 | where the whole paper is basically about just doing BM25
00:27:14.080 | and just giving stuff directly to the context
00:27:16.160 | of the language model and things kind of work.
00:27:18.000 | So it's sort of frozen rag,
00:27:19.800 | but even more primitive in a way
00:27:22.640 | where the retriever is this very old sparse algorithm,
00:27:26.960 | but it works really, really well.
00:27:29.040 | But then they have this really awesome section
00:27:31.320 | where they show that you can just have this re-ranker
00:27:35.120 | on top of the BM25 results
00:27:37.640 | and you can backprop into this re-ranker.
00:27:40.240 | So now you still keep the language model completely fixed.
00:27:43.280 | So that's sort of this part of the loss here.
00:27:46.800 | So you have kind of a stop gradient on the parameters theta.
00:27:49.480 | That's just your language model.
00:27:51.360 | But now you have this kind of rank function here
00:27:55.240 | that you can backprop into, right?
00:27:57.000 | So that's your re-ranker.
00:27:58.280 | It's basically, it can be a BERT model
00:27:59.880 | or anything like that that works on top of the things
00:28:01.840 | you initially retrieved from your BM25.
00:28:04.160 | And now you have this BERT re-ranker
00:28:06.200 | that you can backprop into.
00:28:07.560 | So this also works really, really nicely.
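Roughly, the gradient flow looks like this; the loss is a simplified stand-in for the one in the paper, and `lm_gain` is a hypothetical per-candidate signal from the frozen language model.

```python
import torch
import torch.nn.functional as F

# BM25 proposes candidates, a small trainable re-ranker rescores them, and the
# language model itself stays frozen: its contribution is detached, i.e. a
# stop-gradient. This just shows where gradients go.
rerank_scores = torch.randn(16, requires_grad=True)  # re-ranker outputs for 16 BM25 candidates
lm_gain = torch.randn(16).detach()                   # how much each candidate helps the frozen LM

probs = F.softmax(rerank_scores, dim=0)
loss = -(probs * lm_gain).sum()   # maximize expected LM gain under the re-ranker
loss.backward()                   # only the re-ranker parameters get updated
```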
00:28:11.400 | So we're slowly progressing towards having a system
00:28:14.520 | that is much more optimized for being properly
00:28:18.240 | retrieval augmented in a way where it's useful
00:28:20.440 | and contextualized for what you want to use it for.
00:28:23.240 | So yeah, just to point out kind of what that looks like
00:28:27.560 | with this re-ranker.
00:28:28.400 | So you just have this extra step essentially, right?
00:28:30.960 | So we have our retriever, then we have a re-ranker,
00:28:33.120 | then we have our generator and our output.
00:28:35.240 | - (indistinct)
00:28:39.960 | - No, not necessarily.
00:28:41.320 | So for this one you do, yeah.
00:28:45.520 | But so for re-plug you don't, right?
00:28:48.840 | - Yeah. - Yeah.
00:28:51.000 | Yeah, yeah, yeah.
00:28:52.080 | So basically, yeah, you need to get...
00:28:54.040 | - (indistinct)
00:28:55.640 | - Not all of them.
00:28:57.240 | Some of them do, but yeah, there are all kinds of tricks
00:29:00.160 | you can do on top of that, yeah.
00:29:01.760 | So basically the question is how do we get
00:29:07.560 | sort of gradients flowing into this, right?
00:29:09.360 | So if you don't actually have access
00:29:11.280 | to the full parameters of the model
00:29:13.120 | so that you can backprop all the way through it,
00:29:14.960 | then you can do a REINFORCE-style loss on the retrieval.
00:29:19.960 | And then you just pass the kind of log-likelihood
00:29:22.800 | if you have access to that
00:29:24.520 | or some other kind of black box function.
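A sketch of what such a REINFORCE-style update could look like when the language model is a black box that only returns a scalar reward; the reward value here is a placeholder.

```python
import torch
import torch.nn.functional as F

# Policy-gradient update for the retriever: sample a document, get a scalar
# reward from the black-box LM (say, the log-likelihood of the right answer),
# and weight the sampled document's log-probability by that reward.
retrieval_scores = torch.randn(8, requires_grad=True)
log_probs = F.log_softmax(retrieval_scores, dim=0)

sampled_doc = torch.multinomial(log_probs.exp(), num_samples=1)  # sample a document
reward = 0.7  # scalar returned by the black-box LM for this document (placeholder)

loss = -reward * log_probs[sampled_doc].squeeze()
loss.backward()   # gradients flow into the retriever only
```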
00:29:26.560 | All right, so the next thing you can do
00:29:36.120 | is to optimize both the retriever and the generator.
00:29:39.280 | And so this really starts getting
00:29:43.080 | to the proper kind of contextualization
00:29:45.440 | of the entire architecture
00:29:46.680 | where you want everything to work together, right?
00:29:48.960 | So rather than having this frozen thing
00:29:50.480 | where everything is basically not aware
00:29:52.760 | that the other part exists, right?
00:29:54.160 | It's like two halves of the brain
00:29:55.400 | they're not talking to each other.
00:29:56.880 | One is your retriever, the other is your language model.
00:29:59.040 | There's no connection.
00:29:59.880 | They're just like sort of like something
00:30:01.440 | is thrown over the fence and then you hope for the best.
00:30:04.080 | So instead of that, we have everything much closer
00:30:06.360 | and learning together.
00:30:08.920 | So one of the first ways of doing this
00:30:13.560 | with a generator was RAG, retrieval augmented generation
00:30:17.400 | which we did at FAIR in 2020.
00:30:19.440 | And it's very similar to what we've already seen.
00:30:23.840 | We basically have this retriever here
00:30:25.720 | that works over different documents.
00:30:27.440 | You get some score function
00:30:29.360 | that gets given to this generator that generates the answer.
00:30:33.640 | And now you want to backprop all the way
00:30:35.800 | and update your generator as well, right?
00:30:38.120 | So in the previous two architectures
00:30:40.040 | we saw you keep the generator fixed.
00:30:42.000 | You backprop into your retriever
00:30:44.640 | but here we update everything.
00:30:46.680 | Well, not exactly everything as you'll see
00:30:48.600 | but we'll also update the part of the retriever
00:30:52.320 | and the generator.
00:30:53.280 | So in this RAG model,
00:30:56.000 | we actually have two different ways of doing this.
00:30:59.160 | And this is probably something that when we talk about this
00:31:02.640 | if you think about this long enough, then you'll think like
00:31:05.800 | okay, but when actually do I need to retrieve?
00:31:08.720 | Like do I retrieve every time I generate a new token
00:31:12.240 | or do I just retrieve once
00:31:14.000 | and then generate an entire sequence, right?
00:31:16.800 | Or maybe I want to retrieve every N tokens, right?
00:31:20.760 | So these are hyperparameters
00:31:21.800 | or maybe I want to learn when to retrieve.
00:31:23.600 | As we'll see that's also something people have done.
00:31:27.000 | So these are two different ways to do it.
00:31:29.160 | And what we do in this paper
00:31:32.200 | basically the whole point of the paper
00:31:33.840 | is that this frozen thing doesn't really work all that well.
00:31:37.400 | So I think what people call RAG now
00:31:40.160 | is usually refers to the frozen thing
00:31:43.640 | but the whole paper basically
00:31:44.920 | would never have been accepted anywhere
00:31:46.640 | if we had just done the frozen thing.
00:31:48.240 | The whole point of the paper is that you want to optimize it.
00:31:52.400 | And so at my company Contextual
00:31:54.640 | we call this frozen thing Frankenstein's monster
00:31:57.080 | because it's really like you cobble together
00:31:58.880 | these different pieces, right?
00:32:00.440 | You sort of, yeah, it's really like Frankenstein
00:32:02.720 | and just put it together and then it sort of walks, you know
00:32:06.000 | but it doesn't really have the soul.
00:32:07.200 | It doesn't really actually work.
00:32:08.720 | It's not the real thing.
00:32:09.920 | So that's great for everyone here, I think
00:32:13.760 | because there are so many opportunities to do better
00:32:15.920 | than what most people are using right now.
00:32:18.240 | So one of the limitations of the original RAG architecture
00:32:24.640 | is that it only supports a very small K, right?
00:32:27.600 | So if you have lots and lots of documents
00:32:30.960 | then the problem is that you have to fit all
00:32:33.000 | of them in the context
00:32:34.160 | but how do you really get that to fit, right?
00:32:38.120 | So one thing you can do is you first encode things
00:32:43.120 | so that you get one single representation
00:32:45.680 | or only the few sort of top level representations
00:32:48.240 | then you concatenate those
00:32:49.840 | and then you just feed them to the decoder.
00:32:51.600 | So this is FiD, fusion-in-decoder.
00:32:54.440 | And as you can see, this scales
00:32:56.520 | to a much higher number of passages
00:33:00.280 | and that leads to corresponding improvements
00:33:03.240 | in the scores that you care about.
00:33:05.520 | So that's a really cool idea.
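A shape-level sketch of the fusion-in-decoder trick: each passage is encoded independently, and only the decoder attends over the concatenated encoder states. The tensors below are random stand-ins for real encoder outputs.

```python
import torch

n_passages, passage_len, hidden = 50, 256, 768

# Encode each (question + passage) pair separately: (n_passages, len, hidden).
encoder_states = torch.randn(n_passages, passage_len, hidden)

# Fuse by flattening into one long sequence the decoder can cross-attend to:
# (1, n_passages * len, hidden). The decoder sees all 50 * 256 positions, but
# no encoder self-attention was ever computed across passages.
fused = encoder_states.reshape(1, n_passages * passage_len, hidden)
print(fused.shape)  # torch.Size([1, 12800, 768])
```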
00:33:08.440 | And so we're slowly moving
00:33:10.600 | towards more decoder only architectures, right?
00:33:13.560 | So in RAG, we have this BART model
00:33:15.320 | it's sort of an encoder decoder architecture
00:33:17.360 | but here you just have this decoder
00:33:18.880 | that does some fancy attention
00:33:21.400 | over stuff that you retrieved before.
00:33:23.440 | And so another like pure decoder language model
00:33:29.800 | architecture is this one, kNN-LM,
00:33:33.400 | which I think is very elegant in its simplicity.
00:33:36.680 | So it's basically you just have a normal language model
00:33:39.880 | but you interpolate the normal language model weights
00:33:43.920 | with things that you retrieved.
00:33:46.960 | So basically you have some sort of prompts, right?
00:33:49.320 | So like Obama's birthplace is, you go to your big corpus
00:33:52.880 | you find similar things.
00:33:54.840 | You look at the words that come next to the similar things.
00:33:58.520 | You rank that thing, you sample your top K
00:34:01.880 | you renormalize that.
00:34:03.440 | So now you have a bunch of scores
00:34:05.720 | and now you can just interpolate
00:34:07.280 | between your retrieved kind of non-parametric memory scores
00:34:11.320 | and your parametric language model scores.
00:34:13.440 | So this is very late fusion in a sense, right?
00:34:16.120 | At the very end, you combine these two
00:34:18.560 | and it allows you to re-weight
00:34:20.320 | the pure language model probabilities or likelihoods.
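A minimal sketch of that interpolation; the neighbour tokens, the retrieval scores, and the mixing weight lambda are all placeholders.

```python
import torch

# kNN-LM: the final next-token distribution is a mixture of the parametric LM
# distribution and a distribution built from the next tokens of retrieved
# neighbours. lam is a tuned hyperparameter.
vocab_size, lam = 50_000, 0.25

p_lm = torch.softmax(torch.randn(vocab_size), dim=0)      # parametric LM distribution
p_knn = torch.zeros(vocab_size)                           # non-parametric memory distribution
neighbour_next_tokens = torch.tensor([42, 42, 17, 999])   # tokens following retrieved contexts
neighbour_scores = torch.softmax(torch.randn(4), dim=0)   # normalized retrieval scores
p_knn.index_add_(0, neighbour_next_tokens, neighbour_scores)

p_final = lam * p_knn + (1 - lam) * p_lm                  # late fusion of the two
```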
00:34:23.000 | So this works really well and it scales especially well
00:34:26.720 | if you have a huge retrieval corpus, right?
00:34:30.680 | So if you have trillions and trillions of tokens in there
00:34:33.200 | you can have a much smaller language model
00:34:35.400 | that does not that much heavy lifting
00:34:37.440 | because you can really rely on this big source corpus
00:34:40.480 | that you're working from.
00:34:41.920 | And so that idea was exploited by this paper
00:34:45.720 | called "Retro Out of Deep Mind"
00:34:48.000 | where they showed that you can have a 25 times smaller
00:34:51.840 | retrieval augmented language model trained from scratch.
00:34:54.680 | So really pre-trained entirely from scratch
00:34:57.680 | that outperforms this 25 times bigger language model
00:35:01.440 | on the same data in terms of perplexity
00:35:03.400 | which is pretty impressive, right?
00:35:05.240 | So this architecture is much more efficient
00:35:07.880 | than a parametric model
00:35:09.560 | because you can rely on this external memory.
00:35:12.080 | So if your external memory is big enough
00:35:14.680 | you can get pretty huge gains.
00:35:17.520 | So there was a lot of excitement about "Retro"
00:35:19.720 | when it was announced, but it's a "Deep Mind" paper.
00:35:22.520 | So there's really no open source,
00:35:24.640 | nothing really to validate that this actually works.
00:35:27.680 | And so very recently there has been a bit of work
00:35:31.120 | from NVIDIA called "Retro++"
00:35:34.080 | where they have this hybrid between the "Retro" architecture
00:35:38.840 | and then they do basically "Rag"
00:35:40.840 | sort of they put the top one or the top K results
00:35:44.040 | in the context of the language model after all.
00:35:46.480 | So it's sort of a crossover between "Rag" and "Retro"
00:35:50.000 | and they showed some really nice results here
00:35:52.000 | but I think it's sort of pointing to this big flaw,
00:35:55.880 | which is: why is there still no good
00:35:58.240 | open source Retro model?
00:35:59.680 | That probably tells you something
00:36:02.720 | about whether it actually really works.
00:36:04.440 | I spent a lot of time in my career
00:36:06.280 | trying to reproduce "Deep Mind" papers
00:36:08.280 | that didn't necessarily always work.
00:36:11.320 | And so I think the same is true for "Retro"
00:36:15.600 | and that's why we need to do this in context "Rag"
00:36:18.560 | on top of "Retro" to actually get it to work.
00:36:21.960 | But could it just be a 2.8?
00:36:24.440 | There's such a treat on both end.
00:36:28.640 | Yeah, but so...
00:36:29.640 | So "Deep Mind" and stuff.
00:36:32.640 | No, so doing retrieval over that big corpus
00:36:36.360 | is not that difficult actually.
00:36:38.200 | Yeah, so there are even like distributed FAISS packages
00:36:42.920 | you can just do everything yourself.
00:36:44.680 | So, yeah.
00:36:46.320 | So in terms of compute it's actually not that hard anymore
00:36:49.560 | to reproduce something like this.
00:36:52.400 | But I've tried several times
00:36:54.400 | and it's not really reproducible.
00:36:56.440 | So the only way to get it to work
00:36:58.640 | is if you do this in context "Rag"
00:37:00.280 | on top of the "Retro" thing.
00:37:01.440 | And then as you can see here in the results
00:37:03.520 | then it actually gives you a gain over the pure GPT model.
00:37:06.600 | So it starts from a GPT and then they kind of retrofit
00:37:09.320 | as they call it the GPT model.
00:37:11.080 | So in short, I think there's still a lot of work
00:37:15.120 | to be done in pre-training these systems
00:37:17.320 | really from scratch.
00:37:18.760 | And "Retro" kind of showed that it might be possible
00:37:20.840 | but we don't necessarily know exactly
00:37:23.040 | how to do it the right way.
00:37:24.400 | And this is really one of the interesting open questions.
00:37:27.400 | Any questions on that?
00:37:30.360 | Online?
00:37:35.600 | No, okay.
00:37:41.320 | Then we'll move on.
00:37:43.520 | So let's go all the way with the contextualization now.
00:37:48.520 | So with "Retro" and with "Rag"
00:37:52.560 | what we actually did is we only updated the query encoder.
00:37:56.880 | So updating the document encoder is very expensive.
00:38:01.800 | So one of the first papers actually kind of the OG
00:38:04.640 | of the non-frozen dense retrieval augmented methods
00:38:08.080 | is this paper called "Realm".
00:38:10.400 | This is really like visionary work.
00:38:12.480 | This was basically the first kind of version
00:38:16.520 | that did this properly where they updated it all the way
00:38:19.840 | including the document encoder.
00:38:21.640 | So can someone explain to me why it's expensive
00:38:25.320 | to update the document encoder?
00:38:27.120 | So let's say we have a trillion tokens in our corpus.
00:38:34.360 | So now we go all the way.
00:38:37.160 | So we basically do a forward pass.
00:38:39.600 | We get a gradient at the end.
00:38:41.080 | Now we back propagate the gradient through the retriever.
00:38:43.800 | We update the query encoder.
00:38:45.240 | Now we have to update the document encoder.
00:38:48.040 | So what do we then need to do
00:38:49.200 | after we've updated the document encoder?
00:38:52.000 | We need to re-encode the entire internet, right?
00:38:54.920 | So basically every single gradient update
00:38:57.200 | we have to re-encode whatever our index is.
00:38:59.720 | Which, and so if this is like trillions of tokens
00:39:02.320 | it's like re-encoding the internet
00:39:04.200 | after every batch update.
00:39:06.240 | So that's not very efficient.
00:39:09.400 | (indistinct)
00:39:11.800 | - Yeah.
00:39:24.800 | Yeah, that's one way to do it.
00:39:28.840 | So there are a bunch of different ways
00:39:31.000 | to update the document encoder.
00:39:33.160 | So what they do in Realm
00:39:35.040 | is they basically do it for T batches.
00:39:38.680 | Then they stop, they re-encode the entire internet
00:39:41.760 | and then they train again.
00:39:43.520 | So it's sort of asynchronous updates.
00:39:45.560 | They have this very fancy sort of sharding mechanisms
00:39:48.600 | where they take down certain parts of their entire index
00:39:52.880 | and then update them kind of on the fly.
00:39:54.880 | So you can do it, it's just very expensive.
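A toy, runnable sketch of that asynchronous schedule, with a single matrix standing in for the document encoder and random vectors standing in for the corpus; everything here is a placeholder for the real thing.

```python
import numpy as np

# Train with a slightly stale index for T steps, then pause and re-encode the
# whole corpus with the current document encoder, then rebuild the index.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, 32))      # raw document features
doc_encoder = rng.normal(size=(32, 32))     # stand-in for the document encoder
T = 100                                     # refresh interval in steps

index = corpus @ doc_encoder                # "re-encode the internet" once up front
for step in range(1, 501):
    query = rng.normal(size=32) @ doc_encoder
    top_doc = np.argmax(index @ query)      # retrieval against the stale index
    doc_encoder += 1e-4 * rng.normal(size=doc_encoder.shape)  # pretend gradient update
    if step % T == 0:
        index = corpus @ doc_encoder        # expensive: rebuild the whole index
```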
00:39:57.760 | So one of the things that a lot of people
00:39:59.720 | have been thinking about, not exactly the LoRA idea
00:40:02.000 | but similar versions of that are around like,
00:40:06.640 | can you make it more efficient
00:40:07.720 | so that you don't have to do this asynchronously?
00:40:10.960 | So one of the downsides of this Realm architecture
00:40:16.320 | is that it's really just a BERT model
00:40:18.480 | but then you do this retrieval augmentation
00:40:20.200 | on a BERT model with other BERT models.
00:40:21.840 | So it's not pretty generative.
00:40:23.040 | It's not really gen AI in the modern paradigm.
00:40:26.280 | But if you wanna read like one paper on this topic
00:40:29.960 | like this is a very good one to read.
00:40:31.800 | The other one that is really, really good to read
00:40:36.480 | is this paper called Atlas.
00:40:38.680 | So Atlas is, so this is out of there
00:40:43.680 | with a bunch of folks, the folks who did like RAG
00:40:46.040 | and the folks who did FID
00:40:47.840 | and really a brilliant set of people.
00:40:51.200 | And this is really a comprehensive analysis
00:40:54.560 | of everything that's happening in this architecture.
00:40:57.360 | So the first question they really look at
00:40:58.920 | is how do we train this retriever?
00:41:00.640 | So we've seen a couple of versions of this
00:41:04.120 | but which one actually works better?
00:41:06.560 | They haven't really been compared in a head to head setting.
00:41:09.680 | So one thing is we have this FID style
00:41:12.040 | sort of attention distillation.
00:41:14.120 | So that's really too complicated to go into detail here
00:41:17.720 | but the others are actually very simple.
00:41:20.440 | So one is this loss we've basically seen before, right?
00:41:24.800 | So we've seen this, I think with the in-context RAG one,
00:41:28.000 | right, so we have a stop gradient on the language model
00:41:30.360 | and then we update the retriever.
00:41:32.680 | The other one is what we've seen with Replug.
00:41:35.440 | So this is basically exactly the Replug loss, right?
00:41:37.680 | So we have the KL divergence of the documents
00:41:42.120 | and sort of the improvement that you see
00:41:44.680 | when you give it that document.
00:41:46.320 | The other thing they have
00:41:48.480 | is basically the inverse of that one.
00:41:50.600 | So if I take this one document out,
00:41:53.360 | how does that affect my perplexity of the language model?
00:41:57.520 | And so this one I think is actually quite elegant
00:42:02.280 | because that really gets to like how valuable
00:42:04.720 | is this one single document for me
00:42:07.320 | answering this question correctly.
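In rough notation (the exact formulation in the Atlas paper differs in details), the two best-performing objectives look something like this:

```latex
% q: query, a: target output, D_K: retrieved top-K documents, s(d,q): retriever score.
% REPLUG-style perplexity distillation: match the retriever's distribution to
% how much each retrieved document helps the language model.
P_{\mathrm{retr}}(d \mid q) = \frac{\exp s(d, q)}{\sum_{d' \in D_K} \exp s(d', q)},
\qquad
\mathcal{L} = \mathrm{KL}\Big(\operatorname{softmax}_{d \in D_K}\big(\log p_{\mathrm{LM}}(a \mid q, d)\big) \,\big\|\, P_{\mathrm{retr}}(\cdot \mid q)\Big)

% Leave-one-out: score each document by how much the LM's likelihood of the
% answer drops when that single document is removed from the retrieved set.
\mathrm{target}(d_k) \propto \exp\big(-\log p_{\mathrm{LM}}(a \mid q, D_K \setminus \{d_k\})\big)
```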
00:42:09.080 | So they compare all of these different versions
00:42:13.520 | and what you can see is that the kind of Replug style loss
00:42:18.400 | and this leave one out loss,
00:42:20.080 | they perform a lot better than all of these others.
00:42:22.320 | So this fixed retriever or no joint pre-training,
00:42:25.200 | these are really kind of the baseline
00:42:26.800 | sort of frozen RAG models or closed book.
00:42:29.960 | And as you can see, you can do really a lot better
00:42:33.240 | if you optimize things.
00:42:35.160 | And so this leave one out thing
00:42:36.920 | is probably the best I would say.
00:42:39.240 | So then the other question is,
00:42:41.360 | how do you actually like train that entire system?
00:42:44.200 | Like what data or what tasks do you train this on?
00:42:46.800 | So they also experiment with a bunch of different versions.
00:42:50.680 | So one is doing prefix LM, if you're familiar with that.
00:42:55.160 | So they basically take a chunk
00:42:57.920 | that occurs somewhere on the internet
00:42:59.840 | and then they predict the next chunk from that chunk.
00:43:03.000 | So it's really like sentence to sentence.
00:43:05.880 | So maybe like skip thought back in the day,
00:43:07.680 | but now you have this retrieval step
00:43:09.320 | where you predict the next sentence.
00:43:11.120 | Then they just do T5 style sort of denoising.
00:43:14.720 | So that's mass language modeling,
00:43:16.000 | if you're familiar with T5.
00:43:17.800 | And then they have this title for section generation piece.
00:43:21.440 | So I think the takeaway from this table
00:43:23.960 | is basically that whatever you do here,
00:43:26.240 | so they're using T5 model.
00:43:28.280 | So whatever you do here needs to be the same
00:43:30.000 | that your language model expects.
00:43:32.240 | So for T5, that's T5 style loss.
00:43:35.600 | And then the next sort of final question
00:43:40.440 | that they look into going back to what we talked about,
00:43:43.520 | how exactly do we update this retriever?
00:43:46.240 | So do we have to update the document encoder
00:43:49.200 | or do we maybe have to do some sort of re-ranking
00:43:52.360 | or do we maybe just update the query?
00:43:54.680 | And quite surprisingly, I think they find that
00:43:57.760 | just updating the query,
00:43:59.000 | so like in the original RAG paper,
00:44:01.400 | is actually already basically good enough in many cases.
00:44:05.120 | So that's nice because it's much more efficient
00:44:08.160 | if you don't have to update your documents all the time.
00:44:11.680 | I think the real question here though is like,
00:44:14.560 | how good is your document representation to begin with?
00:44:17.400 | So you need to have a very, very high quality
00:44:20.360 | embedding model for this to work.
00:44:21.800 | If you don't have that, then this will not work.
00:44:23.880 | But if you do have that,
00:44:24.840 | then you get a very nice kind of query side fine-tuning thing.
00:44:28.280 | So the Atlas paper is about trying to do few-shot
00:44:36.520 | sort of language modeling tasks.
00:44:39.280 | So it's how many examples are given in the context.
00:44:42.160 | Yeah, so the main takeaway here is that
00:44:50.080 | if you compare like the closed-book equivalent model
00:44:52.760 | to the retrieval augmented model,
00:44:55.200 | you see very big improvements.
00:44:58.120 | That's really the only takeaway of this entire section.
00:45:02.680 | But I think that that's really saying something
00:45:08.520 | in terms of what we should be thinking about.
00:45:11.640 | How much time do I have until?
00:45:13.160 | - There's still time.
00:45:16.800 | - Okay, okay.
00:45:18.560 | All right, other questions?
00:45:20.720 | (indistinct)
00:45:23.120 | - Yeah, so they can be different.
00:45:31.040 | So in Atlas, Atlas basically tries everything.
00:45:36.480 | So they also tried to see what happens
00:45:38.320 | if I train this on Wikipedia,
00:45:40.080 | but I swap in like a sort of common crawl index.
00:45:43.760 | So in Atlas, but also in Retro,
00:45:47.360 | the main finding is just the more, the better.
00:45:50.840 | So it's really just like the bigger your index,
00:45:53.160 | the more likely you are to find the exact right thing
00:45:56.760 | and then make the right prediction.
00:45:59.680 | Any other questions on this?
00:46:07.000 | - Oh yeah, sorry.
00:46:08.400 | This is a question about the generator
00:46:10.640 | in the, I guess, the RAG system.
00:46:14.120 | So recently I saw a paper on Mistral 7B.
00:46:19.040 | So it introduces a lot of these new architectural changes
00:46:22.760 | like the sliding window attention
00:46:24.680 | to handle longer sequences at a smaller cost
00:46:27.040 | and the group query attention for faster inference.
00:46:30.040 | I'd like to like know your thoughts
00:46:32.640 | on designing a generator specifically for RAG,
00:46:36.840 | leveraging, for example, where Mistral 7B currently is.
00:46:40.880 | Because for example, like the sliding window attention,
00:46:43.600 | I could see how that could be adapted to the RAG case.
00:46:47.960 | - Yeah, so maybe your read on sort of
00:46:49.880 | what makes Mistral special is a bit different from mine.
00:46:52.840 | So I don't think that the sliding attention window thing
00:46:55.480 | is actually that interesting.
00:46:56.760 | The reason Mistral works so well
00:46:58.200 | is because it's trained on a lot of data
00:47:00.440 | and you can do that more efficiently
00:47:02.040 | because you have sliding window attention
00:47:03.640 | so you don't need to attend to everything.
00:47:05.760 | But so to answer your question,
00:47:10.040 | I guess you're asking sort of about the architecture
00:47:12.760 | of the generator if you know
00:47:14.920 | that there's gonna be a retriever.
00:47:16.400 | So I think that's basically what Retro tried to do.
00:47:20.760 | So Retro actually, some of the people on the Retro paper
00:47:25.600 | are at Mistral now.
00:47:27.600 | So they have this chunk cross-attention idea here.
00:47:32.080 | So you basically have the language model,
00:47:34.080 | but the way it does attention over the things you retrieve
00:47:37.480 | in your Retro architecture,
00:47:40.680 | they kind of get integrated into a model
00:47:44.440 | not using the standard attention mechanism,
00:47:46.840 | but using this slightly different chunk cross-attention.
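(A much-simplified sketch of the chunked cross-attention idea: each chunk of decoder states attends only over the neighbours retrieved for that chunk. The shapes are toy values, there is no causal shifting of chunks, and a stock multi-head attention layer stands in for Retro's exact operator.)

```python
import torch
import torch.nn as nn

class ChunkedCrossAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8, chunk_size=64):
        super().__init__()
        self.chunk_size = chunk_size
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, hidden, neighbours):
        # hidden:     (batch, seq_len, d_model), seq_len divisible by chunk_size
        # neighbours: (batch, n_chunks, n_neighbour_tokens, d_model)
        b, seq_len, d = hidden.shape
        n_chunks = seq_len // self.chunk_size
        h = hidden.reshape(b * n_chunks, self.chunk_size, d)
        n = neighbours.reshape(b * n_chunks, -1, d)
        out, _ = self.attn(query=h, key=n, value=n)   # each chunk attends to its own neighbours
        return out.reshape(b, seq_len, d)

cca = ChunkedCrossAttention()
hidden = torch.randn(2, 128, 512)              # 2 sequences of 128 tokens, 2 chunks each
neighbours = torch.randn(2, 2, 3 * 64, 512)    # 3 retrieved neighbours of 64 tokens per chunk
print(cca(hidden, neighbours).shape)           # torch.Size([2, 128, 512])
```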
00:47:50.160 | - Oh, okay.
00:47:51.960 | So I think the sliding window attention point
00:47:54.760 | I was trying to get at was that it uses a fixed window
00:47:59.040 | so that whenever you're doing the query key computation
00:48:02.600 | with the query vectors and the key vectors,
00:48:05.840 | you're using a fixed window attention.
00:48:07.800 | So I think my idea was to actually,
00:48:11.600 | one, use a dynamic window
00:48:13.520 | because, for example, in the RAG case,
00:48:15.480 | if you use a fixed window when you're doing attention,
00:48:19.520 | it is possible that you actually are leaving,
00:48:23.200 | you're only looking at a fixed span of information.
00:48:26.880 | So maybe you could adapt Mistral
00:48:29.760 | so that you could make it better for the RAG case
00:48:32.240 | by, for example, making the fixed window
00:48:35.120 | size dynamic, yeah.
00:48:38.440 | - Yeah, I think it's an interesting idea.
00:48:39.800 | So for me, what Mistral is doing with the sliding window,
00:48:44.800 | that's basically like a conv net, right?
00:48:47.520 | So we had all these lightweight convolutional nets
00:48:50.800 | where we would have word embeddings
00:48:52.440 | and you would do convolutions over it and then pool,
00:48:55.160 | and then you would still get the information out.
00:48:57.320 | So it's not that the sliding window
00:48:59.360 | prohibits you from looking earlier,
00:49:01.280 | it's just that that happens higher up
00:49:03.160 | in your transformer sort of.
00:49:05.400 | - Yeah, yeah.
00:49:06.480 | Okay.
00:49:08.600 | So I think that definitely is an interesting direction
00:49:11.680 | to think in, yeah.
00:49:13.160 | - Yeah, so I think it's like not too crazy to say,
00:49:17.080 | are there any architectural changes
00:49:19.000 | that we can introduce into these
00:49:20.720 | 7 billion parameter models
00:49:22.560 | so that they could be better adapted to the RAG case?
00:49:25.840 | - Yeah, so there might be, yeah.
00:49:30.280 | I think one question is just how do you do the attention
00:49:33.760 | over things you've retrieved,
00:49:35.320 | which I think is what you're doing.
00:49:38.000 | Yeah, thanks.
00:49:39.080 | - So just to make sure I understand,
00:49:42.920 | so yes, I mean, in this Retro model,
00:49:45.600 | you are retrieving at each block,
00:49:47.800 | and when you talk about putting the retrieval in the context,
00:49:53.120 | are you saying that you only do it at the beginning
00:49:55.000 | and you don't do it at each block?
00:49:57.200 | - Yeah, so in context,
00:49:59.160 | so this is, it's not exactly every layer sort of,
00:50:01.800 | so it's every token, right?
00:50:02.960 | So every step basically, not every block.
00:50:07.600 | So that wouldn't really make sense.
00:50:10.600 | It's not at every layer that you do the retrieval, right?
00:50:14.440 | Yeah, so it's every step, right?
00:50:16.360 | So this is kind of like what rag token is.
00:50:20.320 | So you retrieve every token,
00:50:22.880 | so you generate and then you can retrieve again.
00:50:25.800 | Or in the case of retro,
00:50:26.840 | you can generate like a chunk
00:50:28.120 | and then you retrieve chunks again.
00:50:29.920 | If you look at the in-context case,
00:50:32.840 | you retrieve once at the beginning and then you give it.
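(To make the contrast concrete, a hedged sketch of the two retrieval schedules; `retrieve` and `generate` are hypothetical helper callables, not a specific API.)

```python
def answer_retrieve_once(question, retrieve, generate, k=5):
    # Frozen in-context RAG: retrieve once, then condition the whole answer on it.
    docs = retrieve(question, k=k)
    prompt = "\n".join(docs) + "\n\nQuestion: " + question + "\nAnswer:"
    return generate(prompt, max_tokens=256)

def answer_retrieve_per_chunk(question, retrieve, generate, chunk_size=16, max_chunks=8):
    # Retro / RAG-token style: re-retrieve as the answer grows, so later chunks
    # can be grounded in evidence relevant to what was just generated.
    answer = ""
    for _ in range(max_chunks):
        docs = retrieve(question + " " + answer, k=3)
        prompt = "\n".join(docs) + "\n\nQuestion: " + question + "\nAnswer so far: " + answer
        next_chunk = generate(prompt, max_tokens=chunk_size)
        if not next_chunk:
            break
        answer += next_chunk
    return answer
```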
00:50:36.800 | - So that's what you're saying.
00:50:37.640 | You're saying that during the retrieval,
00:50:40.320 | nobody has had enough to say.
00:50:42.560 | - Yeah, so the in-context thing,
00:50:45.160 | so here you don't actually give it as context at all,
00:50:50.440 | like directly to the model, right?
00:50:51.920 | So here you let the decoder kind of attend over it.
00:50:56.440 | - Like cross-attention.
00:50:57.800 | - Yeah.
00:50:58.640 | - And that nobody has to do.
00:51:01.280 | - So I don't think cross-attention really works, yeah.
00:51:06.480 | - Yeah.
00:51:07.320 | - Other questions?
00:51:12.880 | - Yeah, we saw the insight that in some cases
00:51:16.600 | retraining the retriever is not so necessary
00:51:19.720 | because of the large cost.
00:51:22.760 | So I'm wondering, what's the insight into the cases,
00:51:25.600 | like what cases really necessarily need
00:51:29.440 | an asynchronous update
00:51:31.040 | or any other way to update those document encoders, yeah.
00:51:35.320 | - Yeah, so you do want to update the retriever, right?
00:51:37.640 | But only part of the retriever is necessary
00:51:40.440 | to be updated for a lot of these cases.
00:51:43.600 | But so I think it,
00:51:46.840 | so these are very specific data sets, right?
00:51:49.720 | Natural Questions, Wizard of Wikipedia, and FEVER.
00:51:52.280 | So they're really very kind of knowledge-intensive tasks.
00:51:56.920 | So in that case, if you already have a very good system
00:52:00.080 | like DPR that is specifically pre-trained for those tasks,
00:52:03.920 | then you only need to update the query encoder.
00:52:06.840 | So I would expect that if you move beyond this
00:52:09.400 | to kind of general language modeling things like Retro,
00:52:13.080 | then you probably do want to update the document encoder
00:52:16.160 | at least in a way where you can scale it.
00:52:18.480 | - So I think that's in the,
00:52:22.480 | these tasks are very knowledge-intensive.
00:52:26.280 | And actually, we covered for (indistinct)
00:52:33.800 | as long as we have a good (indistinct) knowledge
00:52:36.320 | of the documents by those good models.
00:52:41.320 | - Yeah, but so you need to learn
00:52:45.560 | how to kind of query into that index, right?
00:52:48.400 | So if you don't do that,
00:52:50.880 | then yeah, you don't get really good performance.
00:52:54.040 | So that's sort of like your closed book performance, right?
00:52:57.120 | If you just have the language model
00:52:58.680 | and you're just like,
00:52:59.920 | what does the parametric model do on its own
00:53:02.320 | without the retriever?
00:53:03.240 | What does it actually know?
00:53:04.840 | As you can see, there are pretty big gaps there.
00:53:07.240 | Other questions?
00:53:13.840 | Otherwise, I will cover other questions.
00:53:19.800 | - Hello?
00:53:20.640 | - Yeah, go for it.
00:53:21.880 | - A quick question.
00:53:22.720 | Like, so what about like more hierarchical retrieval?
00:53:26.000 | Like I suppose there'll be methods trying to
00:53:28.440 | not just retrieve a single chunk,
00:53:29.800 | but there's some kind of like groups of chunks
00:53:31.600 | or something like that.
00:53:34.680 | - There's been some interesting work on doing that
00:53:37.240 | where you first try to find,
00:53:38.680 | so you can have multiple indices
00:53:40.120 | and they can kind of cascade, right?
00:53:41.480 | So first you want to find the relevant document.
00:53:43.840 | So you have some document representation
00:53:45.640 | and then within that document,
00:53:46.800 | you want to find the relevant chunk.
00:53:49.880 | So you can do it sort of that direction.
00:53:51.360 | You can also do it in reverse.
00:53:52.920 | I think I have something on a slide there
00:53:54.600 | where you can find the chunk
00:53:56.200 | and then sort of expand the context around it
00:53:59.640 | and then give that to the language model.
00:54:01.760 | And so I think, yeah,
00:54:04.160 | there are all kinds of interesting things
00:54:05.560 | you can do there.
00:54:06.440 | - Cool.
00:54:09.320 | Thanks.
00:54:10.160 | I guess another thing, just like,
00:54:11.800 | can you compare RAG versus like long context efforts?
00:54:15.680 | So there are lots of things like around
00:54:18.120 | just having a really long context
00:54:19.920 | and in the extreme, it could replace RAG,
00:54:21.920 | but I don't know, like if it takes.
00:54:24.560 | - Yeah, so everybody understands this question, right?
00:54:28.760 | So there's a trend where we want to have
00:54:31.360 | very long context language models
00:54:33.240 | so that basically you can like take Harry Potter
00:54:35.880 | or something, just put it in the context
00:54:37.640 | and then ask a question,
00:54:38.760 | like what is the name of like Harry Potter's owl
00:54:41.040 | or something, right?
00:54:42.440 | And then it can just attend over the entire thing.
00:54:45.520 | So attending over all of Harry Potter
00:54:48.120 | to answer that one question is super inefficient, right?
00:54:51.840 | So most of Harry Potter has nothing to do with the owl.
00:54:55.040 | So, but you are still kind of reading it
00:54:57.200 | if you do it with the long context window.
00:54:59.840 | So that's why I think doing it the RAG way
00:55:02.560 | where you have this non-parametric component
00:55:05.000 | is a much more efficient way to solve this problem.
00:55:07.920 | And if you actually look at the literature
00:55:09.640 | on long context windows,
00:55:11.920 | the way they solve the problem
00:55:14.080 | of scaling the attention mechanism
00:55:16.360 | is by making it very sparse.
00:55:18.600 | So they're basically turning it,
00:55:20.400 | so that's a different kind of sparse,
00:55:21.840 | but they're turning it into a non-parametric
00:55:24.040 | retrieval problem kind of behind the scenes.
00:55:26.960 | So they're not actually all that different.
00:55:29.000 | If you want to scale long context,
00:55:30.360 | then you're going to move towards a RAG style architecture.
00:55:33.360 | - Cool, thanks.
00:55:36.440 | - All right.
00:55:40.600 | So let's talk about some other interesting questions.
00:55:43.320 | So one thing, and I already alluded to this,
00:55:46.440 | is when do we actually retrieve?
00:55:48.760 | So if we're doing like,
00:55:50.160 | if we want to like retrieve every token,
00:55:53.880 | that's also very inefficient
00:55:55.080 | because I probably don't have to retrieve
00:55:56.960 | to generate the word "the", right?
00:55:59.600 | I can probably do that on my own with the language model
00:56:01.880 | as opposed to having to go and retrieve stuff.
00:56:04.680 | But if I only retrieve once
00:56:06.760 | at the beginning of the sequence,
00:56:08.000 | that's probably also not great, right?
00:56:09.800 | So what we ideally want to be able to do is to say,
00:56:13.080 | okay, sometimes I want to retrieve,
00:56:14.520 | sometimes I don't want to retrieve,
00:56:15.880 | and I'm going to learn when I want to kind of expend
00:56:18.880 | the compute budget on doing the retrieval.
00:56:22.760 | So a nice paper where they have a stab at,
00:56:25.400 | this is called FLARE, for active retrieval augmentation,
00:56:28.480 | where they basically have the language model decide
00:56:31.760 | when it should do a search
00:56:33.320 | and what it should do the search for.
00:56:35.200 | So I think this fits in a general trend
00:56:39.720 | that you can see in the field around kind of agents, right?
00:56:42.560 | So we can talk a little bit more about that too.
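(A rough sketch of the active-retrieval idea in the spirit of FLARE, not its actual implementation; `lm` and `retrieve` are assumed helpers, and confidence is simplified to the minimum token probability of the drafted sentence.)

```python
def active_retrieval_generate(question, lm, retrieve, max_sentences=10, conf_threshold=0.6):
    # Draft the next sentence; if the model was not confident about it,
    # use the draft as a search query, retrieve, and regenerate that sentence.
    context, answer = [], ""
    for _ in range(max_sentences):
        prompt = "\n".join(context) + "\n" + question + "\n" + answer
        draft, min_token_prob = lm(prompt)
        if min_token_prob < conf_threshold:
            context = retrieve(draft)                      # search with the uncertain draft
            draft, _ = lm("\n".join(context) + "\n" + question + "\n" + answer)
        if not draft:
            break
        answer += " " + draft
    return answer.strip()
```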
00:56:44.960 | So this other question
00:56:47.760 | that I think we've also kind of covered already here
00:56:50.480 | is how do we train this at scale, right?
00:56:52.280 | So we can do these asynchronous updates,
00:56:54.240 | we can do re-rankers, we can do query-side only.
00:56:57.120 | There's this really nice paper,
00:56:59.360 | which is quite close, I think, to the idea you proposed,
00:57:02.920 | where you first use BM25 to create a batch, basically,
00:57:07.480 | where everything is very similar
00:57:10.000 | in terms of what you've retrieved.
00:57:11.960 | And now you have this kind of in-batch update.
00:57:16.000 | So it's sort of like a re-ranker
00:57:17.600 | where you encode the information
00:57:18.840 | that is just in your batch using this other model.
00:57:22.000 | And now you can update this model on the fly.
00:57:24.360 | So you don't have to worry too much
00:57:25.720 | about doing the full kind of document-side update.
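(A hedged sketch of that kind of in-batch update, assuming the batch has already been grouped with BM25 so its documents are mutually relevant; the encoders and the contrastive loss are placeholders rather than the paper's exact setup.)

```python
import torch
import torch.nn.functional as F

def in_batch_step(batch_query_feats, batch_doc_feats, query_enc, doc_enc, optimizer):
    # Only the documents inside this BM25-grouped batch are encoded and scored,
    # so the document encoder can be updated without touching the full index.
    q = query_enc(batch_query_feats)          # (B, d)
    d = doc_enc(batch_doc_feats)              # (B, d), re-encoded on the fly, cheap
    scores = q @ d.T                          # (B, B) similarities within the batch
    labels = torch.arange(q.size(0))          # i-th document is the i-th query's positive
    loss = F.cross_entropy(scores, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```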
00:57:28.400 | And again, here, what really matters
00:57:31.120 | is how big is your index?
00:57:32.560 | If you have an amazing index,
00:57:33.920 | you can basically solve any problem just by looking it up.
00:57:37.600 | So rather than cramming it into your parameters,
00:57:39.880 | you can just find it.
00:57:40.960 | This is a really nice paper called "Silo."
00:57:47.000 | So one of the interesting things,
00:57:49.520 | I think that's going to happen in the next year or two,
00:57:52.720 | around language models is there,
00:57:54.040 | and you've seen this already,
00:57:55.120 | there's a bunch of lawsuits against OpenAI and other places
00:57:58.240 | around where does the data exactly come from.
00:58:00.680 | So one very elegant solution, I think,
00:58:05.000 | is to have a RAG system that you train on data
00:58:07.520 | that you know is safe.
00:58:09.480 | So you can train that thing on Wikipedia,
00:58:12.080 | but now during test time, you can give it a data store
00:58:14.920 | that has maybe slightly riskier information in it.
00:58:18.560 | So this massive index of all the stuff on the internet,
00:58:21.600 | including some things that are maybe higher risk,
00:58:26.080 | you can still have them in your index,
00:58:28.080 | but your language model,
00:58:29.760 | your retrieval augmented language model, I should say,
00:58:32.040 | you know that that thing is safe
00:58:33.440 | because it was trained on data that is public domain.
00:58:36.400 | So that's what they do in Silo,
00:58:37.720 | and they show that that works really well.
00:58:39.760 | So that's one possible solution
00:58:42.640 | to a lot of the kind of compliance and legal risk
00:58:45.760 | around language model deployments.
00:58:48.480 | - There's a great paper also from one of your colleagues
00:58:53.480 | around context getting lost in the middle.
00:58:57.360 | I think this is also kind of a fascinating phenomenon.
00:58:59.680 | This is on a frozen RAG system,
00:59:01.400 | but language models are very similar to humans
00:59:06.800 | in what things they pay attention to.
00:59:09.320 | So if you give them a bunch of things that you've retrieved,
00:59:12.400 | what they will look at are the first things you list
00:59:15.080 | and the last things you list,
00:59:16.440 | and they will sort of ignore the middle.
00:59:18.440 | So if it actually respected the rank function,
00:59:21.400 | then this curve would go down all the way, right?
00:59:23.960 | But it sort of goes up.
00:59:25.520 | So I think that's a very interesting observation,
00:59:30.040 | which kind of shows how brittle these systems can be, right?
00:59:34.560 | So if you have a frozen RAG system,
00:59:36.360 | it can be very, very brittle
00:59:37.640 | where like the order of the retrieved context
00:59:40.240 | matters a lot in whether you get the right answer or not.
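(A simple mitigation people use for this is to reorder the retrieved passages so the highest-ranked ones sit at the two ends of the context; a small heuristic sketch, not taken from any particular paper's code.)

```python
def reorder_for_llm(docs_best_first):
    # Alternate documents to the front and back so the strongest evidence
    # lands at the edges of the prompt, where models attend most.
    front, back = [], []
    for i, doc in enumerate(docs_best_first):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

print(reorder_for_llm(["d1", "d2", "d3", "d4", "d5"]))
# ['d1', 'd3', 'd5', 'd4', 'd2'] -> the top-2 documents end up at the two ends
```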
00:59:44.760 | (indistinct)
00:59:47.160 | - Yeah, so what I just described,
01:00:06.760 | someone asked like, how do you actually,
01:00:09.520 | so I said there are other ways to do this,
01:00:11.040 | and then the question was, how do you do that?
01:00:12.960 | So the way that you do that is using REINFORCE.
01:00:15.320 | So yeah, there has been work on doing that.
01:00:18.720 | So some of the older papers were playing with this,
01:00:22.040 | but one of the big problems with,
01:00:23.760 | so I think the REPLUG solution is sort of more elegant
01:00:28.280 | for solving that problem,
01:00:31.160 | because you actually sort of use signal
01:00:33.040 | from the language model.
01:00:34.080 | And if you just do REINFORCE, it's very high variance.
01:00:36.520 | So it's gonna be super finicky
01:00:39.840 | if you don't want to destroy your index.
01:00:42.920 | But people have tried it, yeah.
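(For reference, the REINFORCE-style retriever update being described would look roughly like this; a sketch where the reward is assumed to be the language model's log-likelihood of the gold answer given the sampled document, and a baseline is subtracted to tame the variance.)

```python
import torch

def reinforce_retriever_loss(retrieval_logits, lm_reward, baseline=0.0):
    # Sample one document, then push up its retrieval log-probability in
    # proportion to how much it helped the downstream generation.
    probs = torch.softmax(retrieval_logits, dim=-1)
    doc_idx = torch.multinomial(probs, num_samples=1)
    log_prob = torch.log(probs[doc_idx])
    return -(lm_reward - baseline) * log_prob      # minimizing this maximizes expected reward
```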
01:00:44.480 | So there's some really nice work from OpenAI
01:00:53.440 | where they basically show,
01:00:55.480 | and again, we're sort of like thinking more and more
01:00:57.800 | about agents here, right?
01:01:00.280 | Where they show something very similar
01:01:01.960 | to the FLARE results from earlier with active retrieval
01:01:04.400 | that doesn't necessarily have to be some index
01:01:06.760 | that you own, it can be just some web search, right?
01:01:10.800 | And obviously in this case,
01:01:11.840 | you don't really have access to the web search necessarily.
01:01:14.240 | So Bing or whatever they use here
01:01:16.080 | is not gonna update its parameters.
01:01:18.640 | But I just wanted to kind of put this in your mind,
01:01:21.200 | like this is another thing you can do, right?
01:01:23.960 | And if we take this really to the general form,
01:01:26.360 | then you can think of language models as just tool users.
01:01:31.080 | So rather than just retrieval augmenting language models,
01:01:34.240 | we can tool augment language models
01:01:36.240 | and retrieval is just one of the many tools
01:01:38.320 | that language models have access to.
01:01:40.400 | We can have re-rankers and things
01:01:42.640 | on top of the outputs of these tools.
01:01:45.360 | And so one of the big questions I think
01:01:48.000 | is how do you actually get the system to learn stuff, right?
01:01:51.680 | So we're gonna need RL if we want this system
01:01:53.960 | to really learn how to take these actions properly.
01:01:56.920 | And so, yeah, this has been taken to the extreme
01:02:03.640 | in this sort of Self-RAG architecture
01:02:05.960 | where they have this sort of retrieval step
01:02:07.760 | and it's active and then you criticize it
01:02:09.640 | and then you basically do some natural language inference
01:02:13.160 | and all of that just with one language model
01:02:15.280 | to answer the questions.
01:02:17.200 | So the other missing piece,
01:02:20.840 | so I'm just kind of going through a bunch
01:02:22.440 | of open questions that people have looked at,
01:02:25.880 | but feel free to interrupt me
01:02:27.080 | if there's anything you wanna know.
01:02:29.520 | But so instruction tuning,
01:02:31.280 | we established at the beginning of the lecture
01:02:33.080 | that this is pretty important for getting things to work,
01:02:35.520 | so fixing the user interface.
01:02:38.520 | But the instruction tuning has almost always
01:02:41.760 | only happened on the language model
01:02:43.280 | and not on the entire system.
01:02:45.080 | So I think one of the interesting things
01:02:47.200 | that people are looking at now
01:02:48.600 | with things like RA-DIT and InstructRetro
01:02:50.920 | is how can we instruction fine-tune
01:02:52.360 | an entire retrieval augmented system?
01:02:54.440 | So all the way into the retrieval step,
01:02:57.160 | can we generate data
01:02:58.160 | so that that also follows the instructions properly,
01:03:01.000 | which currently doesn't happen
01:03:02.200 | in any of these model architectures.
01:03:06.080 | And then finally, I think I would be remiss
01:03:08.520 | if I didn't really talk
01:03:09.880 | about what people call advanced RAG.
01:03:12.240 | So like the developer community
01:03:14.120 | has been really doing some awesome stuff.
01:03:16.880 | So like frameworks like LlamaIndex and LangChain,
01:03:19.560 | and there's all these open source vector databases
01:03:21.880 | like Chroma and Weaviate,
01:03:23.200 | and they're all sort of about making RAG really easy,
01:03:26.240 | but this is all frozen RAG, right?
01:03:28.720 | But even with frozen RAG,
01:03:30.080 | you can really do incredible things.
01:03:32.240 | So we mentioned some of these already,
01:03:34.800 | so Child-Parent Recursive Retriever.
01:03:36.680 | So you find small parts
01:03:38.560 | and then you give the big parts around it
01:03:40.280 | to the language model.
01:03:41.600 | You can do hybrid search
01:03:42.880 | where we use reciprocal rank fusion.
01:03:44.880 | So we have like different search results
01:03:46.720 | that we then combine
01:03:48.080 | before we give the final thing to the language model.
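(Reciprocal rank fusion itself is tiny; a sketch of the standard formulation, where each result list contributes 1 / (k + rank) to a document's fused score.)

```python
def reciprocal_rank_fusion(result_lists, k=60):
    # Combine several ranked lists (e.g. BM25 hits and dense hits) by rank alone.
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["a", "b", "c", "d"]
dense_hits = ["c", "a", "e"]
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))  # 'a' and 'c' rise to the top
```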
01:03:51.160 | There's zero-shot re-ranking with a large language model.
01:03:53.880 | So basically the score function
01:03:55.920 | doesn't come from your retrieval.
01:03:57.320 | It comes directly from the language model.
01:04:00.080 | And then Hypothetical Document Embeddings,
01:04:02.640 | which I think is a really cool idea.
01:04:04.040 | So you just, basically you fix hallucination
01:04:07.920 | through hallucination.
01:04:09.800 | So you get a question,
01:04:11.120 | then you let the language model
01:04:12.240 | hallucinate a bunch of possible answers.
01:04:14.480 | Then you go and search for nearest neighbors
01:04:16.640 | to the possible answers,
01:04:17.880 | and you give those as context,
01:04:19.360 | and then it gives the right answer based on that.
01:04:22.120 | So it was really like hallucinating answers.
01:04:24.760 | I think it's a brilliant solution.
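(A hedged sketch of the hypothetical-document-embeddings idea; `lm_generate` and `embed` are assumed helper functions, and the index is just a matrix of precomputed document embeddings.)

```python
import numpy as np

def hyde_retrieve(question, lm_generate, embed, doc_embeddings, doc_ids, k=5):
    # Let the model hallucinate a plausible answer, embed the hallucination,
    # and use it as the search query; the real passages then ground the answer.
    fake_answer = lm_generate(f"Write a short passage answering: {question}")
    q = embed(fake_answer)                  # (d,)
    scores = doc_embeddings @ q             # (num_docs, d) @ (d,) -> (num_docs,)
    top = np.argsort(-scores)[:k]
    return [doc_ids[i] for i in top]
```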
01:04:26.840 | So there's a lot of stuff happening
01:04:29.400 | in the kind of frozen RAG community too,
01:04:32.280 | that I think is very interesting to look at.
01:04:34.520 | So just to wrap up,
01:04:38.280 | kind of looking at the future of this stuff,
01:04:40.760 | there are still lots of very interesting open questions.
01:04:44.480 | So if you're a student thinking about
01:04:46.080 | how to solve any of these,
01:04:48.120 | I think you can have quite a lot of impact.
01:04:50.480 | So how exactly do we do like pre-training
01:04:55.080 | of this architecture?
01:04:56.000 | And do we even need to pre-train?
01:04:57.640 | I think even retro kind of shows
01:04:59.600 | that you don't necessarily have to pre-train.
01:05:01.440 | So maybe there's something wrong with how we do that.
01:05:05.640 | What do scaling laws look like?
01:05:07.240 | So I think there's a really interesting question here
01:05:09.160 | around if I have a huge index
01:05:11.320 | and a very rich encoder
01:05:12.680 | of all the information in that index,
01:05:14.880 | maybe I can move,
01:05:16.040 | so basically decouple all the memorization to this index.
01:05:19.680 | So I have a language model that doesn't know anything.
01:05:21.920 | It just speaks English.
01:05:23.400 | It just sort of reasons on top,
01:05:24.760 | but it has no knowledge
01:05:25.720 | because that always comes from this retriever.
01:05:28.160 | If you can do something like that,
01:05:29.480 | then you get very interesting scaling trade-offs, right?
01:05:31.720 | So you can have a tiny language model
01:05:33.880 | and do your retrieval to do a lot of the heavy lifting
01:05:37.560 | with your retrieval,
01:05:38.720 | which is nice because that's a cached computation, right?
01:05:41.240 | So you can just,
01:05:42.200 | you already have the embeddings.
01:05:43.880 | You just need to do the dot product.
01:05:46.120 | So it's much more efficient
01:05:47.200 | than kind of self-attention in the language model.
01:05:49.720 | Can we move beyond the bi-encoder?
01:05:52.560 | So vector databases,
01:05:54.120 | I like people who build vector databases,
01:05:57.720 | but I'm not sure how long we're gonna keep vector databases
01:06:00.640 | because I think re-rankers probably work just as well,
01:06:06.360 | and BM25 is much more efficient than a vector database.
01:06:09.680 | So I don't really see why we need dedicated vector databases.
01:06:15.960 | And so what we're seeing,
01:06:17.280 | but maybe this is a bit of a critique
01:06:19.040 | of maybe Silicon Valley investment strategies
01:06:22.600 | and things like that,
01:06:23.440 | but a lot of these vector database companies
01:06:27.200 | are basically becoming database companies now.
01:06:29.080 | So they are adding all this sparse stuff
01:06:30.920 | because dense alone is not enough.
01:06:33.800 | And as it turns out,
01:06:34.800 | there are a lot of pretty good sparse databases
01:06:38.040 | out there already like Postgres and things like that.
01:06:40.560 | And they're also all adding vectors to their databases.
01:06:44.040 | So I think that's all gonna kind of coalesce into databases.
01:06:48.880 | So I think there are some interesting things to look at
01:06:55.880 | for kind of the data.
01:06:56.960 | So to this instruction problem,
01:06:59.080 | can we generate much better data
01:07:01.880 | for training RAG systems synthetically?
01:07:04.720 | And then I think there's this massive open question
01:07:06.800 | around how we actually measure
01:07:08.200 | whether the RAG system is any good.
01:07:10.000 | So right now we just look at downstream performance,
01:07:12.600 | which is sort of okay,
01:07:15.360 | but if you mess up the retrieval, it's very hard to measure.
01:07:19.000 | But how to measure whether your retrieval is right
01:07:22.120 | is also very difficult.
01:07:23.240 | So there are some frameworks
01:07:24.560 | where they try to take like the harmonic mean
01:07:26.640 | of your retrieval accuracy
01:07:28.000 | and your language model accuracy.
01:07:30.280 | But I think those are also very shoddy
01:07:32.280 | because we don't really have very good data sets
01:07:34.680 | to measure that on.
01:07:35.600 | So I think that's a very cool problem to work on as well.
01:07:39.160 | So the other problem that I personally
01:07:43.120 | am always very excited about is multimodality.
01:07:46.040 | And so why would we stop at RAG systems with just text?
01:07:51.400 | So you can do the same thing with images.
01:07:54.400 | You can augment language models with vision.
01:07:56.880 | So we did this work on LENS
01:07:58.360 | where we have a language model enhanced to see
01:08:01.920 | where you can just give a kind of a computer vision pipeline
01:08:05.840 | just like a retrieval pipeline
01:08:07.320 | and give that to a frozen language model
01:08:09.200 | and pass it to the context.
01:08:10.600 | And that system actually is an amazing
01:08:12.720 | visual question answering system.
01:08:14.920 | It's close to the state of the art,
01:08:16.920 | sort of like Flamingo from DeepMind,
01:08:19.040 | which is also very hard to reproduce
01:08:20.600 | because there's no open source version of that.
01:08:23.960 | So we've done some early work on this in 2021
01:08:28.160 | where we have this cross-modal retrieval
01:08:30.160 | and there's some more recent work out of FAIR
01:08:32.960 | where they also look at this.
01:08:34.800 | So I think that's really like,
01:08:35.920 | if you look at the trend in the field,
01:08:37.720 | multimodality with GPT-4V and things like that
01:08:40.520 | is really a hot topic.
01:08:41.480 | So everything is kind of going in that direction.
01:08:44.440 | So it's an interesting thing to think about.
01:08:46.640 | So overall, I think it would be nice
01:08:51.560 | if everybody sort of moves away from RAG 1.0,
01:08:54.520 | this frozen Frankenstein RAG,
01:08:56.640 | and moves towards this much more
01:08:58.760 | optimized version, RAG 2.0.
01:09:00.680 | So it's really about systems over models, right?
01:09:03.400 | It's not just your language model
01:09:05.000 | and your retriever and they're kind of separate.
01:09:06.640 | It's about thinking from a systems perspective
01:09:09.520 | about the entire thing
01:09:10.560 | and the problem you're trying to solve.
01:09:12.200 | And so I think that really is the way
01:09:14.720 | that in deep learning things have always progressed
01:09:17.240 | where if you optimize the system end-to-end,
01:09:19.840 | that's always going to win out.
01:09:21.520 | Like back in the day in computer vision or NLP,
01:09:23.560 | we had like parsers and scene parsers
01:09:25.720 | and all this kind of stuff.
01:09:26.640 | And all of that just doesn't exist anymore now
01:09:29.120 | because we optimize the system end-to-end.
01:09:32.000 | And so that's what's going to happen here too.
01:09:35.200 | So if we take that to the extreme,
01:09:36.520 | like there's this chunker thing in your documents, right?
01:09:38.640 | Like cutting it up into pieces,
01:09:40.280 | like you could backprop into that.
01:09:42.160 | Like, why not?
01:09:43.720 | Somebody should really do that.
01:09:46.640 | And so, yeah, I think like trading off costs and quality
01:09:50.080 | and zero-shot domain generalization,
01:09:52.000 | that's really like where this stuff is going to come in.
01:09:54.240 | So language models right now, they're amazing,
01:09:56.320 | but very often they're way too expensive
01:09:58.640 | for being deployed somewhere
01:09:59.920 | where you can actually make money from them
01:10:01.680 | if you're in a company.
01:10:03.280 | So what you want to do is make it much more efficient
01:10:06.360 | and have the right cost quality trade-off.
01:10:08.160 | And the easiest way I can think of
01:10:10.080 | is to do it through retrieval augmentation.
01:10:12.080 | But obviously I'm very biased.
01:10:15.720 | So yeah, that was all I had actually.
01:10:18.480 | So if you're interested in this, I'm at Stanford.
01:10:20.720 | So I can work with you on research projects on these topics,
01:10:24.680 | or if you want, you can also join Contextual
01:10:26.760 | because we work on this stuff every day.
01:10:29.480 | Thank you.
01:10:30.320 | - Well, sorry, I had a question from earlier.
01:10:34.280 | Yeah, I think you said something really,
01:10:39.120 | I think really super helpful earlier about Mistral 7B.
01:10:42.400 | You talked about,
01:10:43.240 | you compared the sliding window attention
01:10:45.400 | to convolutional neural networks.
01:10:46.880 | And I do see the parallel
01:10:48.120 | because with convolutional neural networks,
01:10:49.680 | you have several layers of,
01:10:51.600 | several different layers of convolutional layers.
01:10:53.680 | And the top convolutional layers are able to see
01:10:56.960 | a larger receptive field
01:10:58.280 | in the bottom convolutional layers.
01:11:00.120 | And with convolutional layers,
01:11:02.400 | you're able to tune the filter sizes and the strides.
01:11:07.000 | So you're able to see a different receptive field.
01:11:10.120 | And I was wondering if you could see that same innovation
01:11:12.440 | in Mistral 7B by tuning,
01:11:15.360 | because you have different transformer layers
01:11:17.000 | and each transformer layer will have a span
01:11:19.160 | over a different set of tokens.
01:11:20.560 | And if you can tune, I guess, the transformer architecture,
01:11:23.520 | the way you tune those convolution layers,
01:11:25.560 | the filter sizes, the receptive field,
01:11:27.720 | perhaps we can do some optimization in the transformer realm
01:11:30.800 | that we have already done in convolution layers.
01:11:33.560 | - Yeah, I think that, so that's a good idea.
01:11:36.240 | There's a great paper on lightweight convolutions,
01:11:38.920 | I think from Michael Auli and David Grangier
01:11:42.520 | and a bunch of people,
01:11:43.360 | where it's basically, this came out
01:11:45.720 | at exactly the same time as the transformer.
01:11:48.320 | And the transformer is slightly more optimized
01:11:50.320 | for GPU computation,
01:11:52.200 | but the convolutional model was actually slightly better
01:11:55.120 | than the transformer.
01:11:56.120 | So it's definitely worth exploring.
01:12:00.560 | - Okay, cool, thanks.
01:12:01.680 | - You mentioned the advantage of the re-ranker
01:12:06.760 | over BM25, but does that give up a lot of the advantages
01:12:10.640 | of this massive dense search, or is it a trade-off?
01:12:14.880 | - Yeah, so it depends on the problem.
01:12:16.360 | I think what you probably want to do
01:12:18.280 | is sort of cast a wide net with BM25
01:12:22.240 | and then just narrow it down with dense search.
01:12:25.200 | So you often see that kind of as a two-stage process
01:12:27.840 | where the first one is kind of noisy,
01:12:29.720 | you can add noise actually to your retrieval
01:12:31.760 | and then you use the dense one to filter it down.
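(A sketch of that two-stage recipe, assuming the `rank_bm25` package, or any BM25 implementation, for the wide first pass, and a hypothetical `embed` function returning unit-normalized vectors for the narrowing second pass.)

```python
import numpy as np
from rank_bm25 import BM25Okapi   # assumed available; any BM25 implementation works

def two_stage_retrieve(query, corpus, embed, k_sparse=100, k_final=5):
    # Stage 1: cheap, noisy BM25 over the whole corpus casts a wide net.
    bm25 = BM25Okapi([doc.split() for doc in corpus])
    sparse_scores = bm25.get_scores(query.split())
    candidates = np.argsort(-sparse_scores)[:k_sparse]

    # Stage 2: dense similarity re-scores only the shortlisted candidates.
    q_vec = embed(query)
    rescored = sorted(candidates, key=lambda i: -float(embed(corpus[i]) @ q_vec))
    return [corpus[i] for i in rescored[:k_final]]
```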
01:12:35.400 | - Yeah, everyone's trying to maybe adapt their
01:12:38.960 | large-scale model to almost domain-specific areas.
01:12:43.360 | Like I think there are mainly two ways to approach it.
01:12:46.640 | One way is to use instruction tuning
01:12:49.080 | in a few-shot learning way
01:12:50.520 | or fine-tuning with some efficient method.
01:12:52.800 | And another way, the main topic of this lecture,
01:12:56.200 | is using retrieval augmentation.
01:12:59.200 | So I wonder if you see
01:13:03.000 | a low-cost advantage of the retrieval-augmented way?
01:13:06.880 | Do you think the capacity or the quality
01:13:09.520 | of the retrieval-augmented way can be matched
01:13:12.160 | with those tuning methods, fine-tuning and few-shot learning?
01:13:15.640 | - Yeah, so I think actually what's gonna happen
01:13:19.000 | is that all of this will come together, right?
01:13:21.840 | So if you actually train things
01:13:24.320 | like end-to-end, RAG 2.0 style,
01:13:26.760 | then you can also fine-tune that system
01:13:29.040 | on some use case end-to-end, right?
01:13:31.960 | So why would you just take the retrieval-augmented system
01:13:35.440 | if you can also fine-tune it on the thing you care about?
01:13:38.040 | So I think in the end,
01:13:38.880 | everybody's gonna do all of those things.
01:13:41.240 | And then there's questions like,
01:13:42.280 | how do you do that efficiently?
01:13:43.360 | So that's why you would use it after,
01:13:44.840 | sort of things like that.
01:13:46.080 | I think there was another question.
01:13:51.360 | - I'm curious about hardware.
01:13:54.720 | You said it's gonna become a database kind of thing,
01:13:57.680 | like a smart database,
01:13:58.520 | but what about retrieval of hardware?
01:14:03.080 | And because you've got so much of the learning part,
01:14:08.080 | but what about, because it's huge.
01:14:11.560 | - Yeah, yeah.
01:14:12.400 | - There's trillions of sets.
01:14:13.560 | So do you have any idea it's just a database problem?
01:14:17.080 | - So I don't know if I'm allowed
01:14:18.280 | to say this exactly, actually.
01:14:19.720 | But so one of the biggest chip manufacturers
01:14:24.720 | that recently, their stock has done really well,
01:14:27.560 | they have some dedicated retrieval hardware coming out.
01:14:30.920 | I think soon, or it might already be out.
01:14:32.920 | So yeah, very efficient dense retrieval
01:14:39.600 | is a very big business.
01:14:42.280 | Other questions?
01:14:48.560 | - That's the thing.
01:14:51.920 | Like, do you think RAG is the thing
01:14:54.280 | that will solve the hallucination issue?
01:14:56.880 | We see a similar thing in the business setting.
01:14:59.200 | - Yes, I think so, if you take it to the extreme.
01:15:02.480 | So one of the big problems right now
01:15:04.200 | is that if you contextualize an existing language model
01:15:07.200 | that already hallucinates,
01:15:09.240 | then it's gonna be kind of hard
01:15:11.000 | to get rid of the hallucination, right?
01:15:12.480 | So if you do REPLUG on GPT-4, GPT-4 might still hallucinate.
01:15:17.480 | So it could basically just ignore all the stuff
01:15:20.000 | you retrieved and just do whatever it wants anyway.
01:15:22.880 | So that's one of the reasons why you want
01:15:24.480 | to train the system end-to-end.
01:15:25.800 | And if you take that to the extreme where,
01:15:28.000 | like I said, right, if you can just have
01:15:29.560 | the language model only reason and speak,
01:15:32.840 | so it knows English and reasoning,
01:15:34.280 | but it has no knowledge,
01:15:35.520 | which all comes from somewhere else,
01:15:37.800 | then you can't hallucinate.
01:15:39.560 | So it's really all grounded in whatever is in your index.
01:15:43.040 | But so, on the topic of hallucination,
01:15:51.200 | I'm sort of frustrated that a lot of people in the field
01:15:53.640 | misunderstand what hallucination even means, right?
01:15:56.320 | So a lot of people are conflating hallucination
01:15:58.440 | with correctness or incorrectness.
01:16:00.720 | So they're like, oh, the model made a mistake.
01:16:02.440 | It hallucinated.
01:16:03.280 | It's like, no, it just made a mistake.
01:16:05.480 | That's different from hallucination.
01:16:06.920 | Hallucination, I think is very specific kind of,
01:16:09.800 | I've retrieved something.
01:16:10.960 | So I have some sort of counterfactual ground truth.
01:16:13.720 | And what I'm saying does not correspond
01:16:16.760 | to that ground truth.
01:16:19.320 | And so, yeah, I think there's a bunch of folks
01:16:22.800 | at Stanford also working on better measurements
01:16:25.200 | of hallucination and definitions and things like that.
01:16:27.920 | - If I'm understanding correctly,
01:16:32.560 | then your definition of hallucination
01:16:34.160 | only makes sense in a context with people.
01:16:37.720 | - Yeah, of some ground truth, right?
01:16:39.920 | So hallucination is really like,
01:16:42.800 | there is something that is true, right?
01:16:44.800 | So if we're talking about like hallucination and,
01:16:48.160 | yeah, so if we're talking about
01:16:49.160 | just general parametric language models,
01:16:51.160 | then sort of the ground truth
01:16:52.320 | is whatever we consider to be true, right?
01:16:55.320 | But we had a word for language models
01:16:59.800 | making mistakes before: it was called making mistakes.
01:17:02.720 | - Yeah, on the ground truth,
01:17:09.080 | I guess this is solving the hallucination question
01:17:12.360 | I was looking for that path.
01:17:14.960 | Are you working on ground truth per se?
01:17:17.600 | And so, you know, if I generate the building documents
01:17:20.160 | saying, "Oh, well, I've never been a president,"
01:17:21.960 | then everything falls apart.
01:17:24.520 | Are you considering work on that, on this ground truth?
01:17:27.800 | - Yeah, so I like the sort of silo mentioned there as well.
01:17:31.920 | So I think the whole point is that you can have
01:17:35.320 | different indices and different definitions
01:17:37.240 | of ground truth, right?
01:17:38.320 | So I think you could say, "I only trust arXiv,"
01:17:42.760 | or, "I only trust like peer-reviewed papers
01:17:45.240 | "and not just archive."
01:17:47.520 | And so you can make decisions in your architecture
01:17:49.680 | during test time about what you define as ground truth.
01:17:52.720 | And I also think actually that,
01:17:55.520 | and there's a bunch of work, I think,
01:17:58.200 | happening on this right now.
01:17:59.160 | You can control for how grounded you want it to be
01:18:02.280 | in your ground truth.
01:18:04.040 | So that's another kind of misconception about hallucinations.
01:18:07.840 | Like sometimes hallucinations are actually good, right?
01:18:10.080 | If you have a creative writing assistant
01:18:12.040 | and you wanted to come up with some cool new ideas,
01:18:14.080 | you want the language model to hallucinate.
01:18:16.800 | So I think what you want to have is kind of a tunable knob
01:18:19.760 | where you say like, "Oh, now you can hallucinate,
01:18:21.720 | "and now maybe you should really tell me the truth only."
01:18:24.720 | Anything else?
01:18:33.280 | - It has, I think, a parameter
01:18:34.840 | that's already built into it for how much it makes up.
01:18:37.960 | (indistinct)
01:18:40.720 | - Yeah.
01:18:41.560 | So, but the temperature,
01:18:44.280 | that's just about how you sample, right?
01:18:46.320 | So how flat your distribution is that you sample from.
01:18:49.440 | (indistinct)
01:18:51.000 | Yeah.
01:18:54.480 | So even if you have a low temperature,
01:18:56.560 | it can still come up with random stuff, right?
01:18:59.000 | So it just says that then you're very likely
01:19:01.120 | to do greedy sampling.
01:19:02.600 | So I think what you want to get at
01:19:06.160 | is something more sophisticated than that.
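(For concreteness, temperature only rescales the logits before sampling, so it sharpens or flattens the distribution; it does not ground the output in anything, which is why it is not a hallucination fix. A minimal sketch:)

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0):
    # Low temperature -> close to greedy decoding; high temperature -> flatter,
    # more random sampling. Either way the model can still make things up.
    scaled = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-6)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```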
01:19:09.120 | - Okay, lots of interesting questions.
01:19:17.000 | - Yeah, I like the question.
01:19:19.920 | - Thanks, Douwe, again, for the great talk.
01:19:19.920 | - Thank you.
01:19:20.920 | (upbeat music)