Stanford CS25: V3 I Retrieval Augmented Language Models
00:00:00.000 |
- Hey guys, welcome to our last lecture of this quarter. 00:00:15.360 |
He's the CEO of Contextual AI, the enterprise LLM company, 00:00:24.960 |
And previously he was the head of research at Hugging Face. 00:00:39.120 |
and studied philosophy and cognitive AI in undergrad. 00:00:42.560 |
And his work focuses on machine learning as well as NLP, 00:00:51.200 |
and better tools for evaluation and many more. 00:01:00.120 |
So I guess I have to sort of stand here in the corner 00:01:12.280 |
There were a couple of things I could talk about, 00:01:27.720 |
I think this is really one of the coolest topics 00:01:32.960 |
So I'll just give you an overview of what's been happening 00:01:35.960 |
and what I think are the interesting questions 00:01:39.200 |
So first of all, obviously, in case you've missed it, 00:01:54.860 |
If you thought OpenAI, then I'm angry with you, right? 00:02:06.360 |
and you factorize out the token probabilities, right? 00:02:16.640 |
So I'm bringing this up because I was talking to someone 00:02:19.920 |
and they were like, "OpenAI invented language models." 00:02:28.320 |
and this is the oldest one I could find, actually. 00:02:32.680 |
There's a very nice paper from 2003 from Bengio 00:02:45.360 |
And as it turns out, if you make them really big 00:02:48.200 |
and you parameterize them with these massive neural nets, 00:02:55.680 |
And that's why we're all so excited about this stuff. 00:02:58.180 |
So if we think about this from a classic CS perspective, 00:03:10.240 |
and then the task of the model is to predict the next token. 00:03:16.000 |
And so that's why it was so easy to come up with this 00:03:19.880 |
in 1991 already, because the idea is very intuitive. 00:03:23.860 |
But for a long time, what was really broken with this 00:03:28.460 |
And this, I think a lot of people kind of misunderstand 00:03:46.320 |
So we're much better at sort of telling people 00:03:52.800 |
We don't prompt it in a very weird way so that it speaks 00:04:01.240 |
in the style of a pirate or Shakespeare or something, 00:04:10.160 |
actually turns out to be super, super rare in just web data. 00:04:14.160 |
So what you need to do is you need to fix the user interface 00:04:45.040 |
But if you talk to anyone, especially in an enterprise, 00:04:56.400 |
because there are all these familiar problems, 00:04:57.920 |
probably a bunch of you are working on these problems 00:05:12.120 |
why these models are saying what they're saying. 00:05:16.400 |
And so this was a big problem with sort of ChatGPT, 00:05:22.000 |
and they keep updating it every once in a while. 00:05:24.960 |
that's always completely up to date, that never goes stale. 00:05:27.960 |
You want to be able to revise the information in the system. 00:05:36.760 |
which means that you need to be able to remove information 00:05:38.900 |
from the language model or maybe revise facts, 00:05:43.840 |
So again, this is a very interesting area of study 00:05:57.720 |
So different people have different use cases, 00:05:59.760 |
you have different data, if you're a company, 00:06:01.640 |
or if you want to have a language model on your own data, 00:06:19.720 |
but the way to understand what is going on here 00:06:25.760 |
we have the input and the prompt just like before, 00:06:27.680 |
but now instead of just giving those two things, 00:06:37.040 |
And the retriever is very often pretty simple, 00:06:53.960 |
from the perspective of these two separate paradigms. 00:06:57.680 |
So if you've ever taken an exam, I'm sure you have, right? 00:07:09.720 |
where you have all of this information in the book 00:07:14.800 |
So it's a very similar thing with RAG, right? 00:07:18.480 |
where you can give it access to this external information, 00:07:27.440 |
without having to memorize all of it in its parameters. 00:07:30.360 |
So the other, I think, useful distinction here 00:07:33.720 |
is that cramming everything into your parameters, 00:07:40.680 |
is we're adding this non-parametric retrieval component. 00:07:49.240 |
All right, so why does that actually solve these issues? 00:08:06.240 |
And so you can customize your language model system 00:08:16.720 |
and you can revise it if everything goes wrong, 00:08:30.320 |
And actually one really nice way to ground things 00:08:43.640 |
or even multimodal data that it retrieves separately. 00:08:46.600 |
So if you do that, then you get less hallucination, 00:08:48.760 |
because you can always point back to your source, 00:08:52.600 |
And you get attribution because you don't know 00:09:05.440 |
we're gonna talk about this basic architecture. 00:09:09.000 |
And so it kind of looks like a pretty simple thing, right? 00:09:12.800 |
But there are actually lots and lots of questions 00:09:14.800 |
you can ask about what this system should really look like. 00:09:31.280 |
and then there are things like this query encoder, 00:09:43.440 |
Is it like a full document, or is it a paragraph, 00:09:45.560 |
or a chunk, or a sentence, or a couple of words? 00:09:50.720 |
And as you'll see, there are lots of possible answers 00:10:17.960 |
we have this retriever, which one do we update? 00:10:33.000 |
These are the kinds of questions that you have to answer 00:10:37.840 |
And then during test time, you have this entire system, 00:10:46.000 |
So there's also different things you can do there, right? 00:10:49.720 |
So give it different indices during test time 00:10:59.680 |
I think if you ask someone now, like, what is RAG, 00:11:09.680 |
So going back to this question of train time, test time, 00:11:16.840 |
that we don't necessarily have control over, right? 00:11:34.120 |
And then the vector database just does search 00:11:43.720 |
So this only works because of in-context learning. 00:12:06.560 |
this frozen thing itself with just the vector database, 00:12:12.720 |
And the starting point for everything retrieval 00:12:22.160 |
So TF-IDF is basically a sparse retrieval method 00:12:28.840 |
that looks at documents and queries, so D and Q. 00:12:33.280 |
And then there are basically two terms that matter: 00:12:37.280 |
one is the TF, the term frequency, and the other is the IDF, the inverse document frequency. 00:12:42.120 |
is actually a really nice idea from Karen Spärck Jones, 00:12:48.040 |
But the basic idea is that you want to look at the words 00:12:53.560 |
that don't occur in lots of different documents. 00:12:53.560 |
So you want to have sort of the special words. 00:13:06.440 |
It gives you a score for document query overlap. 00:13:13.720 |
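To make that concrete, here is a minimal toy sketch (not from the lecture): it scores a handful of documents against a query by summing tf times idf over the query terms; the tokenization and example documents are made up.

```python
import math
from collections import Counter

def tfidf_scores(query, documents):
    """Toy TF-IDF: score = sum over query terms of tf(term, doc) * idf(term).
    idf down-weights terms that occur in many documents, so 'special' words count more."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    df = Counter()                       # document frequency per term
    for tokens in tokenized:
        df.update(set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term in tf:
                score += tf[term] * math.log(n_docs / df[term])
        scores.append(score)
    return scores

docs = ["the cat sat on the mat",
        "warsaw is the capital of poland",
        "the population of warsaw grew quickly"]
print(tfidf_scores("warsaw population", docs))
```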
So there's all these weird, different parameters, 00:13:22.320 |
So there's a couple of tweaks you can do there. 00:13:35.800 |
sort of the preceding 24 experiments failed, right? 00:13:39.120 |
So it's literally the 25th one that seemed to work, 00:13:53.840 |
It's sparse because most words never occur, right? 00:14:04.840 |
But so that's actually kind of a nice property 00:14:07.080 |
if you want to do fast search on a CPU, right? 00:14:19.360 |
which is really one of the first neural instances 00:14:23.840 |
sort of open book question answering paradigm. 00:14:28.800 |
like how many of Warsaw's inhabitants, blah, blah. 00:14:37.040 |
based on the sparse, so BM25, I think, in this case. 00:14:53.080 |
So this, I think, is really the first instance 00:15:01.280 |
that you use for answering complicated questions 00:15:09.000 |
there was a bunch of work on dense retrieval. 00:15:15.320 |
so this is just like word embeddings, basically vectors, 00:15:20.240 |
so they're much smaller in terms of dimensionality. 00:15:26.760 |
is that it's not really about specific words, right? 00:15:35.080 |
which you couldn't really do with a sparse representation. 00:15:47.440 |
but at the time that people started thinking about this, 00:15:53.000 |
a vector representation for an entire sequence of words. 00:15:56.320 |
So a sentence representation or a passage representation. 00:15:59.560 |
So there are all these cool systems like ORQA 00:16:11.400 |
And the way to get the latent variable to work, 00:16:14.720 |
to be good enough essentially to train the entire system 00:16:18.200 |
is to pre-train the retriever on relevant information. 00:16:21.600 |
So for ORQA, they do something called the inverse cloze task. 00:16:30.280 |
that are sort of relevant to the preceding passage. 00:16:33.520 |
And in DPR, they just train it on a supervised thing. 00:16:40.840 |
you can do better than BM25 if you add lots of documents 00:16:53.960 |
is that you can do them very, very efficiently 00:16:56.440 |
on the GPU as well if you know what you're doing. 00:17:03.080 |
is maximum inner product search, MIPS, right? 00:17:14.040 |
And so there's this really brilliant piece of work 00:17:28.160 |
they're sort of re-implementations of this FAISS idea. 00:17:32.080 |
but it's all basically the same idea, it's just FAISS. 00:17:35.000 |
And so FAISS really powers a lot of this stuff. 00:17:41.640 |
about a vector database, just think about FAISS, 00:17:45.720 |
So obviously, you can go beyond dot product, yes? 00:17:59.200 |
No, so it's just basic off-the-shelf ANN algorithms. 00:18:14.600 |
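As a rough illustration of what FAISS-style maximum inner product search looks like in practice, here is a minimal sketch; it assumes the `faiss` and `numpy` packages are installed and uses random vectors as stand-ins for real passage and query embeddings.

```python
import numpy as np
import faiss  # assumes faiss-cpu (or faiss-gpu) is installed

d = 768                                                  # embedding dimensionality
passages = np.random.rand(10_000, d).astype("float32")   # stand-in passage embeddings
queries = np.random.rand(5, d).astype("float32")         # stand-in query embeddings

index = faiss.IndexFlatIP(d)    # exact maximum inner product search (MIPS)
index.add(passages)             # index all passage vectors

scores, ids = index.search(queries, 4)   # top-4 passages per query
print(ids[0], scores[0])
```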
product quantization is and things like that? 00:18:17.000 |
So there are basically, so you have a bunch of vectors 00:18:20.440 |
and you can just compute the full dot product, 00:18:24.880 |
So what you can do is try to compress subspaces 00:18:28.480 |
of the vector, and then just look at the kind of centroids. 00:18:31.880 |
So you can quantize sub-vectors of the full vector 00:18:36.520 |
and then do much faster search over just the centroids. 00:19:00.680 |
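A hedged sketch of that product-quantization idea using FAISS's IVF-PQ index; the sizes here (8 sub-vectors, 8 bits each, 100 coarse cells) are arbitrary choices for illustration, not recommended settings.

```python
import numpy as np
import faiss

d, nlist, m, nbits = 768, 100, 8, 8        # 8 sub-vectors, 8 bits (256 centroids) each
xb = np.random.rand(50_000, d).astype("float32")

quantizer = faiss.IndexFlatL2(d)                        # coarse quantizer over full vectors
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
index.train(xb)      # learn coarse centroids plus a small codebook per sub-vector
index.add(xb)        # store compressed codes instead of full float vectors
index.nprobe = 10    # number of coarse cells to visit at search time

distances, ids = index.search(xb[:3], 5)
print(ids)
```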
And then at the end, you get these two vectors 00:19:05.760 |
But you can do all kinds of much fancier things 00:19:12.440 |
So a really nice example from one of your colleagues 00:19:22.920 |
So instead of just having this dot product here, 00:19:40.560 |
So it's sort of Omar's joke, actually, this name, 00:19:55.320 |
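The late-interaction scoring behind ColBERT can be sketched in a few lines; this toy version (random embeddings, no normalization or training) just shows the MaxSim idea of scoring a document by each query token's best-matching document token.

```python
import numpy as np

def late_interaction_score(query_tok_embs, doc_tok_embs):
    """ColBERT-style MaxSim: for every query token, take the similarity of its
    best-matching document token, then sum those maxima into one relevance score."""
    sims = query_tok_embs @ doc_tok_embs.T      # (n_query_tokens, n_doc_tokens)
    return float(sims.max(axis=1).sum())

rng = np.random.default_rng(0)
q = rng.standard_normal((5, 128))       # stand-in per-token query embeddings
doc_a = rng.standard_normal((80, 128))  # stand-in per-token document embeddings
doc_b = rng.standard_normal((60, 128))
print(late_interaction_score(q, doc_a), late_interaction_score(q, doc_b))
```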
one of the nice things about these vector databases 00:20:09.440 |
they basically have sparse meet dense in a way. 00:20:16.080 |
with sparse is that you can't really handle synonyms 00:20:22.120 |
like a BERT model, look at kind of this one word 00:20:25.520 |
in your sequence, try to see which other words 00:20:31.720 |
So now you can give all these synonyms to a sparse vector, 00:20:38.120 |
And so I have a much more efficient way to do search 00:20:41.000 |
without sort of giving up on all the cool stuff 00:20:49.280 |
And this other idea I really like is called DRAGON. 00:20:57.840 |
So if you want to take something off the shelf right now 00:21:01.760 |
then this DRAGON or DRAGON+ is probably the thing 00:21:11.480 |
to make the model better and better over time 00:21:16.040 |
And that gives you very good representations. 00:21:29.480 |
if you look at sort of the developer community around DRAGON 00:21:32.120 |
is that they're all doing hybrid search right now. 00:21:34.840 |
So you can actually just combine the search results 00:21:37.200 |
from your sparse BM25 or whatever thing, or SPLADE, 00:21:44.040 |
and then you'll get this ranking that works even better. 00:22:01.000 |
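One simple way to do the hybrid combination described here is reciprocal rank fusion over the two ranked lists; this is an illustrative sketch with made-up document IDs, not any particular library's implementation.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids: each doc scores sum of 1 / (k + rank)
    over the lists it appears in, so agreement between sparse and dense helps."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse_ranking = ["doc3", "doc1", "doc7", "doc2"]   # e.g. from BM25 or SPLADE
dense_ranking = ["doc1", "doc9", "doc3", "doc4"]    # e.g. from DRAGON+
print(reciprocal_rank_fusion([sparse_ranking, dense_ranking]))
```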
On the earlier slide, has there been any work on benchmark 00:22:14.440 |
has there been any benchmarking studies in this? 00:22:28.680 |
if you literally look for retrieval augmentation 00:22:31.080 |
reduces hallucination, then you'll find the paper. 00:22:55.320 |
So if there's like a brand name or something like that, 00:23:00.120 |
then like, let's say the brand is Apple, right? 00:23:02.560 |
You don't want to find stuff about the pears, right? 00:23:05.160 |
So that's what you would do with a dense retriever. 00:23:08.480 |
So it really kind of depends on what you want to use it for. 00:23:24.440 |
it would realize Apple the company is different. 00:23:28.120 |
- No, so if they were actually contextualized, then yes, 00:23:31.520 |
but very often it's a frozen retrieval system, right? 00:24:04.920 |
the components of the vector are literally the other words. 00:24:22.040 |
So basically it's a one big matrix of documents as rows 00:24:26.720 |
and the columns are the words in the documents. 00:24:29.320 |
And then you just count how often a word occurs 00:24:42.520 |
we call them sparse embeddings or sparse retrieval 00:24:49.640 |
Because most words don't occur in that document. 00:25:08.920 |
but how do we actually make this retriever good 00:25:11.320 |
for the context that it is going to be used in, right? 00:25:14.520 |
So can we contextualize the retriever for the generator, 00:25:20.040 |
where we might not have access to the weights? 00:25:24.160 |
we just send it to some API, we get some stuff back. 00:25:28.200 |
And so one paper I really like is called Replug. 00:25:31.560 |
So just to kind of explain what this looks like, 00:25:42.040 |
And now you compute the likelihood. 00:25:53.040 |
And then you'll give each one of the retrieved documents 00:25:57.000 |
separately to this generator, to your language model. 00:26:00.680 |
So you can look at the perplexity of the correct answer 00:26:06.200 |
So now we have these two probability distributions 00:26:12.640 |
to make sure that we can actually retrieve the documents 00:26:20.440 |
So super simple idea, works really, really well. 00:26:26.520 |
And the nice thing about this is it's completely agnostic 00:26:26.520 |
So this will work for any sort of encoder, decoder, 00:26:39.000 |
but for most language models, you can get that, 00:26:44.040 |
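A hedged sketch of that REPLUG-style training signal as described: the retriever's softmax over the retrieved documents is pulled, with a KL term, toward a distribution given by how likely the frozen language model finds the gold answer when conditioned on each document. The function and tensor names here are assumptions; in practice the LM scores would come from an API's log-probabilities.

```python
import torch
import torch.nn.functional as F

def replug_retriever_loss(retriever_scores, lm_answer_logprobs, temperature=1.0):
    """Sketch of a REPLUG-style objective.

    retriever_scores:   (num_docs,) query-document similarities from the retriever
    lm_answer_logprobs: (num_docs,) log-likelihood of the gold answer when the frozen
                        LM is conditioned on each retrieved document separately
    The retriever distribution is pushed, via KL divergence, toward the distribution
    implied by how much each document actually helps the language model.
    """
    log_p_retriever = F.log_softmax(retriever_scores / temperature, dim=-1)
    with torch.no_grad():                                  # the LM side stays frozen
        p_lm = F.softmax(lm_answer_logprobs / temperature, dim=-1)
    return F.kl_div(log_p_retriever, p_lm, reduction="sum")

scores = torch.randn(5, requires_grad=True)   # would come from the query/doc encoders
lm_lls = torch.randn(5)                       # would come from scoring the gold answer
loss = replug_retriever_loss(scores, lm_lls)
loss.backward()                               # gradients flow only into the retriever
print(loss.item())
```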
And then there's this other really nice approach. 00:26:53.440 |
you're literally updating the dense representations, right? 00:26:58.360 |
So your encoder, basically, for your dense representation. 00:26:58.360 |
on in-context retrieval augmented language models, 00:27:09.840 |
where the whole paper is basically about just doing BM25 00:27:14.080 |
and just giving stuff directly to the context 00:27:16.160 |
of the language model and things kind of work. 00:27:22.640 |
where the retriever is this very old sparse algorithm, 00:27:29.040 |
But then they have this really awesome section 00:27:31.320 |
where they show that you can just have this re-ranker 00:27:40.240 |
So now you still keep the language model completely fixed. 00:27:43.280 |
So that's sort of this part of the loss here. 00:27:46.800 |
So you have kind of a stop gradient on the parameters theta. 00:27:46.800 |
But now you have this kind of rank function here 00:27:59.880 |
or anything like that that works on top of the things 00:28:11.400 |
So we're slowly progressing towards having a system 00:28:14.520 |
that is much more optimized for being properly 00:28:18.240 |
retrieval augmented in a way where it's useful 00:28:20.440 |
and contextualized for what you want to use it for. 00:28:23.240 |
So yeah, just to point out kind of what that looks like 00:28:28.400 |
So you just have this extra step essentially, right? 00:28:30.960 |
So we have our retriever, then we have a re-ranker, 00:28:57.240 |
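Putting that frozen pipeline together, a minimal sketch might look like the following; `retrieve`, `rerank`, and `generate` are placeholder callables standing in for, say, a BM25 index, a trainable re-ranker, and a black-box LM API.

```python
def answer_with_rag(question, retrieve, rerank, generate, k=20, top_n=3):
    """Frozen RAG pipeline: retrieve -> re-rank -> prepend context -> generate."""
    candidates = retrieve(question, k)              # cheap first-stage retrieval
    ranked = rerank(question, candidates)           # contextualize for the generator
    context = "\n\n".join(ranked[:top_n])           # keep only the best few passages
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)

# toy stand-ins so the sketch runs end to end
docs = ["Warsaw is the capital of Poland.", "Paris is the capital of France."]
retrieve = lambda q, k: docs[:k]
rerank = lambda q, cands: sorted(
    cands, key=lambda d: -sum(w in d.lower() for w in q.lower().split()))
generate = lambda prompt: f"(frozen LM would answer here, given: {prompt[:40]}...)"
print(answer_with_rag("What is the capital of Poland?", retrieve, rerank, generate))
```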
Some of them do, but yeah, there are all kinds of tricks 00:29:13.120 |
so that you can backprop all the way through it, 00:29:13.120 |
then you can do a reinforce style loss on the retrieval. 00:29:19.960 |
And then you just pass the kind of log-likelihood 00:29:19.960 |
is to optimize both the retriever and the generator. 00:29:46.680 |
where you want everything to work together, right? 00:29:56.880 |
One is your retriever, the other is your language model. 00:30:01.440 |
is thrown over the fence and then you hope for the best. 00:30:04.080 |
So instead of that, we have everything much closer 00:30:13.560 |
with a generator was RAG, retrieval augmented generation 00:30:19.440 |
And it's very similar to what we've already seen. 00:30:29.360 |
that gets given to this generator that generates the answer. 00:30:48.600 |
but we'll also update the part of the retriever 00:30:56.000 |
we actually have two different ways of doing this. 00:30:59.160 |
And this is probably something that when we talk about this 00:31:02.640 |
if you think about this long enough, then you'll think like 00:31:05.800 |
okay, but when actually do I need to retrieve? 00:31:08.720 |
Like do I retrieve every time I generate a new token 00:31:16.800 |
Or maybe I want to retrieve every N tokens, right? 00:31:23.600 |
As we'll see that's also something people have done. 00:31:33.840 |
is that this frozen thing doesn't really work all that well. 00:31:48.240 |
The whole point of the paper is that you want to optimize it. 00:31:54.640 |
we call this frozen thing Frankenstein's monster 00:32:00.440 |
You sort of, yeah, it's really like Frankenstein 00:32:02.720 |
and just put it together and then it sort of walks, you know 00:32:13.760 |
because there are so many opportunities to do better 00:32:18.240 |
So one of the limitations of the original RAG architecture 00:32:24.640 |
is that it only supports a very small k, right? 00:32:24.640 |
but how do you really get that to fit, right? 00:32:38.120 |
So one thing you can do is you first encode things 00:32:45.680 |
or only the few sort of top level representations 00:33:10.600 |
towards more decoder only architectures, right? 00:33:23.440 |
And so another like pure decoder language model 00:33:33.400 |
which I think is very elegant in its simplicity. 00:33:36.680 |
So it's basically you just have a normal language model 00:33:39.880 |
but you interpolate the normal language model weights 00:33:46.960 |
So basically you have some sort of prompts, right? 00:33:49.320 |
So like Obama's birthplace is, you go to your big corpus 00:33:54.840 |
You look at the words that come next to the similar things. 00:34:07.280 |
between your retrieved kind of non-parametric memory scores 00:34:13.440 |
So this is very late fusion in a sense, right? 00:34:20.320 |
the pure language model probabilities or likelihoods. 00:34:23.000 |
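The kNN-LM late fusion is literally a convex combination of two next-token distributions; here is a toy sketch with made-up numbers for the "Obama's birthplace is ___" example.

```python
import numpy as np

def knn_lm_interpolate(p_lm, p_knn, lam=0.25):
    """kNN-LM late fusion: p(w) = lam * p_knn(w) + (1 - lam) * p_lm(w),
    mixing the parametric LM with a distribution built from retrieved neighbors."""
    return lam * p_knn + (1.0 - lam) * p_lm

vocab = ["hawaii", "kenya", "chicago"]
p_lm = np.array([0.4, 0.1, 0.5])     # toy LM distribution for "Obama's birthplace is ___"
p_knn = np.array([0.9, 0.05, 0.05])  # toy distribution from nearest-neighbor contexts
print(dict(zip(vocab, knn_lm_interpolate(p_lm, p_knn).round(3))))
```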
So this works really well and it scales especially well 00:34:30.680 |
So if you have trillions and trillions of tokens in there 00:34:37.440 |
because you can really rely on this big source corpus 00:34:48.000 |
where they showed that you can have a 25 times smaller 00:34:51.840 |
retrieval augmented language model trained from scratch. 00:34:57.680 |
that outperforms this 25 times bigger language model 00:35:09.560 |
because you can rely on this external memory. 00:35:17.520 |
So there was a lot of excitement about "Retro" 00:35:19.720 |
when it was announced, but it's a "Deep Mind" paper. 00:35:24.640 |
nothing really to validate that this actually works. 00:35:27.680 |
And so very recently there has been a bit of work 00:35:34.080 |
where they have this hybrid between the "Retro" architecture 00:35:40.840 |
sort of they put the top one or the top K results 00:35:44.040 |
in the context of the language model after all. 00:35:46.480 |
So it's sort of a crossover between "Rag" and "Retro" 00:35:50.000 |
and they showed some really nice results here 00:35:52.000 |
but I think it's sort of pointing to this big flaw 00:36:15.600 |
and that's why we need to do this in context "Rag" 00:36:18.560 |
on top of "Retro" to actually get it to work. 00:36:38.200 |
Yeah, so there are even like distributed FAISS packages 00:36:46.320 |
So in terms of compute it's actually not that hard anymore 00:37:03.520 |
then it actually gives you a gain over the pure GPT model. 00:37:06.600 |
So it starts from a GPT and then they kind of retrofit 00:37:11.080 |
So in short, I think there's still a lot of work 00:37:18.760 |
And "Retro" kind of showed that it might be possible 00:37:24.400 |
And this is really one of the interesting open questions. 00:37:43.520 |
So let's go all the way with the contextualization now. 00:37:52.560 |
what we actually did is we only updated the query encoder. 00:37:56.880 |
So updating the document encoder is very expensive. 00:38:01.800 |
So one of the first papers actually kind of the OG 00:38:04.640 |
of the non-frozen dense retrieval augmented methods 00:38:16.520 |
that did this properly where they updated it all the way 00:38:21.640 |
So can someone explain to me why it's expensive 00:38:27.120 |
So let's say we have a trillion tokens in our corpus. 00:38:41.080 |
Now we back propagate the gradient through the retriever. 00:38:52.000 |
We need to re-encode the entire internet, right? 00:38:59.720 |
Which, and so if this is like trillions of tokens 00:39:38.680 |
Then they stop, they re-encode the entire internet 00:39:45.560 |
They have this very fancy sort of sharding mechanisms 00:39:48.600 |
where they take down certain parts of their entire index 00:39:59.720 |
have been thinking about, not exactly the LoRA idea 00:40:02.000 |
but similar versions of that are around like, 00:40:07.720 |
so that you don't have to do this asynchronously? 00:40:10.960 |
So one of the downsides of this REALM architecture 00:40:23.040 |
It's not really gen AI in the modern paradigm. 00:40:26.280 |
But if you wanna read like one paper on this topic 00:40:31.800 |
The other one that is really, really good to read 00:40:43.680 |
with a bunch of folks, the folks who did like RAG 00:40:54.560 |
of everything that's happening in this architecture. 00:41:06.560 |
They haven't really been compared in a head to head setting. 00:41:14.120 |
So that's really too complicated to go into detail here 00:41:20.440 |
So one is this loss we've basically seen before, right? 00:41:24.800 |
So we've seen this, I think with the in-context RAG one, 00:41:28.000 |
right, so we have a stop gradient on the language model 00:41:32.680 |
The other one is what we've seen with Replug. 00:41:35.440 |
So this is basically exactly the Replug loss, right? 00:41:37.680 |
So we have the KL divergence of the documents 00:41:53.360 |
how does that affect my perplexity of the language model? 00:41:57.520 |
And so this one I think is actually quite elegant 00:42:02.280 |
because that really gets to like how valuable 00:42:09.080 |
So they compare all of these different versions 00:42:13.520 |
and what you can see is that the kind of Replug style loss 00:42:20.080 |
they perform a lot better than all of these others. 00:42:22.320 |
So this fixed retriever or no joint pre-training, 00:42:29.960 |
And as you can see, you can do really a lot better 00:42:41.360 |
how do you actually like train that entire system? 00:42:44.200 |
Like what data or what tasks do you train this on? 00:42:46.800 |
So they also experiment with a bunch of different versions. 00:42:50.680 |
So one is doing prefix LM, if you're familiar with that. 00:42:59.840 |
and then they predict the next chunk from that chunk. 00:43:11.120 |
Then they just do T5 style sort of denoising. 00:43:17.800 |
And then they have this title for section generation piece. 00:43:40.440 |
that they look into going back to what we talked about, 00:43:49.200 |
or do we maybe have to do some sort of re-ranking 00:43:54.680 |
And quite surprisingly, I think they find that 00:44:01.400 |
is actually already basically good enough in many cases. 00:44:05.120 |
So that's nice because it's much more efficient 00:44:08.160 |
if you don't have to update your documents all the time. 00:44:11.680 |
I think the real question here though is like, 00:44:14.560 |
how good is your document representation to begin with? 00:44:17.400 |
So you need to have a very, very high quality 00:44:21.800 |
If you don't have that, then this will not work. 00:44:24.840 |
then you get a very nice kind of query side fine-tuning thing. 00:44:28.280 |
So the Atlas paper is about trying to do few-shot 00:44:39.280 |
So it's how many examples are given in the context. 00:44:50.080 |
if you compare like the closed-book equivalent model 00:44:58.120 |
That's really the only takeaway of this entire section. 00:45:02.680 |
But I think that that's really saying something 00:45:08.520 |
in terms of what we should be thinking about. 00:45:31.040 |
So in Atlas, Atlas basically tries everything. 00:45:40.080 |
but I swap in like a sort of Common Crawl index. 00:45:47.360 |
the main finding is just the more, the better. 00:45:50.840 |
So it's really just like the bigger your index, 00:45:53.160 |
the more likely you are to find the exact right thing 00:46:19.040 |
So it introduces a lot of these new architectural changes 00:46:27.040 |
and the grouped-query attention for faster inference. 00:46:32.640 |
on designing a generator specifically for RAG, 00:46:36.840 |
leveraging, for example, where Mistral 7B currently is. 00:46:40.880 |
Because for example, like the sliding window attention, 00:46:43.600 |
I could see how that could be adapted to the RAG case. 00:46:49.880 |
what makes Mistral special is a bit different from mine. 00:46:52.840 |
So I don't think that the sliding attention window thing 00:47:10.040 |
I guess you're asking sort of about the architecture 00:47:16.400 |
So I think that's basically what Retro tried to do. 00:47:20.760 |
So Retro actually, some of the people on the Retro paper 00:47:27.600 |
So they have this chunk cross-attention idea here. 00:47:34.080 |
but the way it does attention over the things you retrieve 00:47:46.840 |
but using this slightly different chunk cross-attention. 00:47:51.960 |
So I think the sliding window attention point 00:47:54.760 |
I was trying to get at was that it uses a fixed window 00:47:59.040 |
so that whenever you're doing the query key computation 00:48:15.480 |
if you use a fixed window when you're doing attention, 00:48:19.520 |
it is possible that you actually are leaving, 00:48:23.200 |
you're only looking at a fixed span of information. 00:48:29.760 |
so that you could make it better for the rag case 00:48:39.800 |
So for me, what Mistral is doing with the sliding window, 00:48:47.520 |
So we had all these lightweight convolutional nets 00:48:52.440 |
and you would do convolutions over it and then pool, 00:48:55.160 |
and then you would still get the information out. 00:49:08.600 |
So I think that definitely is an interesting direction 00:49:13.160 |
- Yeah, so I think it's like not too crazy to say, 00:49:22.560 |
so that they could be better adapted to the rag case? 00:49:30.280 |
I think one question is just how do you do the attention 00:49:47.800 |
and when you talk about putting the retrieval in the context, 00:49:53.120 |
are you saying that you only do it at the beginning 00:49:59.160 |
so this is, it's not exactly every layer sort of, 00:50:10.600 |
So it's not every layer that you do the retrieval, right? 00:50:22.880 |
so you generate and then you can retrieve again. 00:50:32.840 |
you retrieve once at the beginning and then you give it. 00:50:45.160 |
so here you don't actually give it as context at all, 00:50:51.920 |
So here you let the decoder kind of attend over it. 00:51:01.280 |
- So I don't think cross-attention really works, yeah. 00:51:16.600 |
in which updating the retriever is not so necessary 00:51:31.040 |
or any way to update those document or, yeah. 00:51:35.320 |
- Yeah, so you do want to update the retriever, right? 00:51:49.720 |
Natural Questions, Wizard of Wikipedia, and FEVER. 00:51:52.280 |
So they're really very kind of knowledge-intensive tasks. 00:51:56.920 |
So in that case, if you already have a very good system 00:52:00.080 |
like DPR that is specifically pre-trained for those tasks, 00:52:03.920 |
then you only need to update the query encoder. 00:52:06.840 |
So I would expect that if you move beyond this 00:52:09.400 |
to kind of general language modeling things like Retro, 00:52:13.080 |
then you probably do want to update the document encoder 00:52:33.800 |
as long as we have a good (indistinct) knowledge 00:52:50.880 |
then yeah, you don't get really good performance. 00:52:54.040 |
So that's sort of like your closed book performance, right? 00:53:04.840 |
As you can see, there are pretty big gaps there. 00:53:22.720 |
Like, so what about like more hierarchical retrieval? 00:53:29.800 |
but there's some kind of like groups of chunks 00:53:34.680 |
- There's been some interesting work on doing that 00:53:41.480 |
So first you want to find the relevant document. 00:53:56.200 |
and then sort of expand the context around it 00:54:11.800 |
can you compare RAG versus like long context efforts? 00:54:24.560 |
- Yeah, so everybody understands this question, right? 00:54:33.240 |
so that basically you can like take Harry Potter 00:54:38.760 |
like what is the name of like Harry Potter's owl 00:54:42.440 |
And then it can just attend over the entire thing. 00:54:48.120 |
to answer that one question is super inefficient, right? 00:54:51.840 |
So most of Harry Potter has nothing to do with the owl. 00:55:05.000 |
is a much more efficient way to solve this problem. 00:55:30.360 |
then you're going to move towards a RAG style architecture. 00:55:40.600 |
So let's talk about some other interesting questions. 00:55:59.600 |
I can probably do that on my own with the language model 00:56:09.800 |
So what we ideally want to be able to do is to say, 00:56:15.880 |
and I'm going to learn when I want to kind of expend 00:56:25.400 |
this is called FLARE, for active retrieval augmented generation, 00:56:28.480 |
where they basically have the language model decide 00:56:39.720 |
that you can see in the field around kind of agents, right? 00:56:42.560 |
So we can talk a little bit more about that too. 00:56:47.760 |
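A very rough sketch of that "decide when to retrieve" loop, in the spirit of FLARE rather than a faithful reproduction of it: generate a piece of the answer, and only if the model's own token confidence drops below a threshold, retrieve and regenerate that piece. The callables here are placeholders for a real LM and a real search backend.

```python
def active_generate(question, generate_with_conf, retrieve, max_steps=8, threshold=0.6):
    """Only retrieve when the model seems unsure. `generate_with_conf` returns
    (next_sentence, min_token_prob) and `retrieve` returns a list of passages;
    both are placeholder callables."""
    answer, context = "", ""
    for _ in range(max_steps):
        sentence, confidence = generate_with_conf(question, context, answer)
        if not sentence:                                   # model says it is done
            break
        if confidence < threshold:                         # unsure: look it up, retry
            context = "\n".join(retrieve(question + " " + sentence))
            sentence, _ = generate_with_conf(question, context, answer)
        answer += sentence
    return answer
```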
that I think we've also kind of covered already here 00:56:54.240 |
we can do re-rankers, we can do query-side only. 00:56:59.360 |
which is quite close, I think, to the idea you proposed, 00:57:02.920 |
where you first use BM25 to create a batch, basically, 00:57:11.960 |
And now you have this kind of in-batch update. 00:57:18.840 |
that is just in your batch using this other model. 00:57:22.000 |
And now you can update this model on the fly. 00:57:25.720 |
about doing the full kind of document-side update. 00:57:33.920 |
you can basically solve any problem just by looking it up. 00:57:37.600 |
So rather than cramming it into your parameters, 00:57:49.520 |
I think that's going to happen in the next year or two, 00:57:55.120 |
there's a bunch of lawsuits against OpenAI and other places 00:57:58.240 |
around where does the data exactly come from. 00:58:05.000 |
is to have a RAG system that you train on data 00:58:12.080 |
but now during test time, you can give it a data store 00:58:14.920 |
that has maybe slightly riskier information in it. 00:58:18.560 |
So this massive index of all the stuff on the internet, 00:58:21.600 |
including some things that are maybe higher risk, 00:58:29.760 |
your retrieval augmented language model, I should say, 00:58:33.440 |
because it was trained on data that is public domain. 00:58:42.640 |
to a lot of the kind of compliance and legal risk 00:58:48.480 |
- There's a great paper also from one of your colleagues 00:58:57.360 |
I think this is also kind of a fascinating phenomenon. 00:59:01.400 |
but language models are very similar to humans 00:59:09.320 |
So if you give them a bunch of things that you've retrieved, 00:59:12.400 |
what they will look at are the first things you list 00:59:18.440 |
So if it actually respected the rank function, 00:59:21.400 |
then this curve would go down all the way, right? 00:59:25.520 |
So I think that's a very interesting observation, 00:59:30.040 |
which kind of shows how brittle these systems can be, right? 00:59:37.640 |
where like the order of the retrieved context 00:59:40.240 |
matters a lot in whether you get the right answer or not. 01:00:11.040 |
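One simple, commonly used mitigation (an illustrative sketch, not something from that paper) is to reorder the retrieved passages so the highest-ranked ones sit at the edges of the prompt, where the model actually looks.

```python
def order_for_long_context(ranked_passages):
    """Place the best-ranked passages at the start and end of the prompt,
    pushing the weakest ones toward the middle, where they get the least attention."""
    front, back = [], []
    for i, passage in enumerate(ranked_passages):
        (front if i % 2 == 0 else back).append(passage)
    return front + back[::-1]

print(order_for_long_context(["p1", "p2", "p3", "p4", "p5"]))
# -> ['p1', 'p3', 'p5', 'p4', 'p2']: best at the edges, weakest in the middle
```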
and then the question was, how do you do that? 01:00:12.960 |
So the way that you do that is using REINFORCE. 01:00:18.720 |
So some of the older papers were playing with this, 01:00:23.760 |
so I think the replug solution is sort of more elegant 01:00:34.080 |
And if you just do REINFORCE, it's very high variance. 01:00:55.480 |
and again, we're sort of like thinking more and more 01:01:01.960 |
to the FLARE results from earlier with active retrieval 01:01:04.400 |
that doesn't necessarily have to be some index 01:01:06.760 |
that you own, it can be just some web search, right? 01:01:11.840 |
you don't really have access to the web search necessarily. 01:01:18.640 |
But I just wanted to kind of put this in your mind, 01:01:21.200 |
like this is another thing you can do, right? 01:01:23.960 |
And if we take this really to the general form, 01:01:26.360 |
then you can think of language models as just tool users. 01:01:31.080 |
So rather than just retrieval augmenting language models, 01:01:48.000 |
is how do you actually get the system to learn stuff, right? 01:01:51.680 |
So we're gonna need RL if we want this system 01:01:53.960 |
to really learn how to take these actions properly. 01:01:56.920 |
And so, yeah, this has been taken to the extreme 01:02:09.640 |
and then you basically do some natural language inference 01:02:22.440 |
of open questions that people have looked at, 01:02:31.280 |
we established at the beginning of the lecture 01:02:33.080 |
that this is pretty important for getting things to work, 01:02:58.160 |
so that that also follows the instructions properly, 01:03:16.880 |
So frameworks like LlamaIndex and LangChain, 01:03:19.560 |
and there's all these open source vector databases 01:03:23.200 |
and they're all sort of about making RAG really easy, 01:03:48.080 |
before we give the final thing to the language model. 01:03:51.160 |
There's zero-shot re-ranking with a large language model. 01:04:19.360 |
and then it gives the right answer based on that. 01:04:40.760 |
there are still lots of very interesting open questions. 01:04:59.600 |
that you don't necessarily have to pre-train. 01:05:01.440 |
So maybe there's something wrong with how we do that. 01:05:07.240 |
So I think there's a really interesting question here 01:05:16.040 |
so basically decouple all the memorization to this index. 01:05:19.680 |
So I have a language model that doesn't know anything. 01:05:25.720 |
because that always comes from this retriever. 01:05:29.480 |
then you get very interesting scaling trade-offs, right? 01:05:33.880 |
and do your retrieval to do a lot of the heavy lifting 01:05:38.720 |
which is nice because that's a cached computation, right? 01:05:47.200 |
than kind of self-attention in the language model. 01:05:57.720 |
but I'm not sure how long we're gonna keep vector databases 01:06:00.640 |
because I think re-rankers probably work just as well, 01:06:06.360 |
and BM25 is much more efficient than a vector database. 01:06:09.680 |
So I don't really see why we need dedicated vector databases. 01:06:19.040 |
of maybe Silicon Valley investment strategies 01:06:27.200 |
are basically becoming database companies now. 01:06:34.800 |
there are a lot of pretty good sparse databases 01:06:38.040 |
out there already like Postgres and things like that. 01:06:40.560 |
And they're also all adding vectors to their databases. 01:06:44.040 |
So I think that's all gonna kind of coalesce into databases. 01:06:48.880 |
So I think there are some interesting things to look at 01:07:04.720 |
And then I think there's this massive open question 01:07:10.000 |
So right now we just look at downstream performance, 01:07:15.360 |
but if you mess up the retrieval, it's very hard to measure. 01:07:19.000 |
But how to measure whether your retrieval is right 01:07:24.560 |
where they try to take like the harmonic mean 01:07:32.280 |
because we don't really have very good data sets 01:07:35.600 |
So I think that's a very cool problem to work on as well. 01:07:43.120 |
am always very excited about is multimodality. 01:07:46.040 |
And so why would we stop with RAG systems with just text? 01:07:58.360 |
where we have a language model enhanced to see 01:08:01.920 |
where you can just give a kind of a computer vision pipeline 01:08:20.600 |
because there's no open source version of that. 01:08:23.960 |
So we've done some early work on this in 2021 01:08:30.160 |
and there's some more recent work out of FAIR 01:08:37.720 |
multimodality with GPT-4V and things like that 01:08:41.480 |
So everything is kind of going in that direction. 01:09:00.680 |
So it's really about systems over models, right? 01:09:05.000 |
and your retriever and they're kind of separate. 01:09:06.640 |
It's about thinking from a systems perspective 01:09:14.720 |
that in deep learning things have always progressed 01:09:21.520 |
Like back in the day in computer vision or NLP, 01:09:26.640 |
And all of that just doesn't exist anymore now 01:09:32.000 |
And so that's what's going to happen here too. 01:09:36.520 |
like there's this chunker thing in your documents, right? 01:09:46.640 |
And so, yeah, I think like trading off costs and quality 01:09:52.000 |
that's really like where this stuff is going to come in. 01:09:54.240 |
So language models right now, they're amazing, 01:10:03.280 |
So what you want to do is make it much more efficient 01:10:18.480 |
So if you're interested in this, I'm at Stanford. 01:10:20.720 |
So I can work with you on research projects on these topics, 01:10:30.320 |
- Well, sorry, I had a question from earlier. 01:10:39.120 |
I think really super helpful earlier about Mistral 7B. 01:10:51.600 |
several different layers of convolutional layers. 01:10:53.680 |
And the top convolutional layers are able to see 01:11:02.400 |
you're able to tune the filter sizes and the strides. 01:11:07.000 |
So you're able to see a different receptive field. 01:11:10.120 |
And I was wondering if you could see that same innovation 01:11:15.360 |
because you have different transformer layers 01:11:20.560 |
And if you can tune, I guess, the transformer architecture, 01:11:27.720 |
perhaps we can do some optimization in the transformer realm 01:11:30.800 |
that we have already done in convolution layers. 01:11:48.320 |
And the transformer is slightly more optimized 01:11:52.200 |
but the convolutional model was actually slightly better 01:12:01.680 |
- It's probably the advantage of the re-ranker 01:12:01.680 |
over BM25, but does that give up a lot of the advantages 01:12:06.760 |
of this massive search, or is the trade-off worth it? 01:12:10.640 |
and then just narrow it down with dense search. 01:12:25.200 |
So you often see that kind of as a two-stage process 01:12:31.760 |
and then you use the dense one to filter it down. 01:12:35.400 |
- Yeah, everyone's trying to maybe adapt their 01:12:38.960 |
large-scale model to almost domain-specific areas. 01:12:43.360 |
Like I think there are mainly two ways to approach it. 01:12:52.800 |
And another way is just, the main topic of this lecture 01:13:03.000 |
of the low-cost advantage of the retrieval-augmented way? 01:13:12.160 |
with those tuning methods, fine-tuning type learning? 01:13:15.640 |
- Yeah, so I think actually what's gonna happen 01:13:19.000 |
is that all of this will come together, right? 01:13:31.960 |
So why would you just take the retrieval-augmented system 01:13:35.440 |
if you can also fine-tune it on the thing you care about? 01:13:54.720 |
You said it's gonna become a database kind of thing, 01:14:03.080 |
And because you've got so much of the learning part, 01:14:13.560 |
So do you have any idea whether it's just a database problem? 01:14:13.560 |
that recently, their stock has done really well, 01:14:27.560 |
they have some dedicated retrieval hardware coming out. 01:14:59.200 |
- Yes, I think so, if you take it to the extreme. 01:15:04.200 |
is that if you contextualize an existing language model 01:15:12.480 |
So if you do replug on GPT-4, GPT-4 might still hallucinate. 01:15:17.480 |
So it could basically just ignore all the stuff 01:15:20.000 |
you retrieved and just do whatever it wants anyway. 01:15:39.560 |
So it's really all grounded in whatever is in your index. 01:15:51.200 |
I'm sort of frustrated that a lot of people in the field 01:15:53.640 |
misunderstand what hallucination even means, right? 01:15:56.320 |
So a lot of people are conflating hallucination 01:16:00.720 |
So they're like, oh, the model made a mistake. 01:16:06.920 |
Hallucination, I think is very specific kind of, 01:16:10.960 |
So I have some sort of counterfactual ground truth. 01:16:19.320 |
And so, yeah, I think there's a bunch of folks 01:16:22.800 |
at Stanford also working on better measurements 01:16:25.200 |
of hallucination and definitions and things like that. 01:16:44.800 |
So if we're talking about like hallucination and, 01:16:59.800 |
making mistakes before it was called making mistakes. 01:17:09.080 |
I guess this is solving the hallucination question 01:17:17.600 |
And so, you know, if I generate the building documents 01:17:20.160 |
saying, "Oh, well, I've never been a president," 01:17:24.520 |
Are you considering work on that, on this ground truth? 01:17:27.800 |
- Yeah, so I like the sort of silo idea you mentioned there as well. 01:17:27.800 |
So I think the whole point is that you can have 01:17:38.320 |
So I think you could say, "I only trust the archive," 01:17:47.520 |
And so you can make decisions in your architecture 01:17:49.680 |
during test time about what you define as ground truth. 01:17:59.160 |
You can control for how grounded you want it to be 01:18:04.040 |
So that's another kind of misconception about hallucinations. 01:18:07.840 |
Like sometimes hallucinations are actually good, right? 01:18:12.040 |
and you wanted to come up with some cool new ideas, 01:18:16.800 |
So I think what you want to have is kind of a tunable knob 01:18:19.760 |
where you say like, "Oh, now you can hallucinate, 01:18:21.720 |
"and now maybe you should really tell me the truth only." 01:18:34.840 |
The sampling temperature is already sort of onto that, how much you're sampling. 01:18:34.840 |
So how flat your distribution is that you sample from. 01:18:56.560 |
it can still come up with random stuff, right?
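That "how flat your distribution is" knob is just the sampling temperature; here is a toy sketch (not from the lecture) of what it does to the next-token distribution.

```python
import numpy as np

def next_token_distribution(logits, temperature=1.0):
    """Softmax with a temperature knob: a high temperature flattens the distribution
    (more adventurous, more hallucination-prone sampling), a low one sharpens it
    toward the most likely, most 'grounded' token."""
    scaled = np.asarray(logits, dtype=float) / temperature
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

logits = [2.0, 1.0, 0.2]
print(next_token_distribution(logits, temperature=0.5).round(3))  # peaked
print(next_token_distribution(logits, temperature=2.0).round(3))  # flat
```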