Stanford XCS224U: NLU | Information Retrieval, Part 1: Guiding Ideas | Spring 2023
Chapters
0:00 Intro
0:21 NLP is revolutionizing Information Retrieval
1:59 IR is a hard NLU problem
2:53 IR is revolutionizing NLP
5:32 Knowledge-intensive tasks
7:02 Classical IR
8:00 LLMs for everything
8:58 Neural IR
10:52 Retrieval-augmented in-context learning
13:11 IR is more important than ever!
16:55 Blog posts
This is our first screencast on information retrieval. These screencasts will serve both to give you a sense for the current moment in science and technology, and also to help you build a bridge into the homework, which is on retrieval-augmented in-context learning.
Google announced that it was incorporating aspects of its large language models into its search technology. I have a feeling that these two very public announcements were just a glimpse of the changes that were starting, in which large language models would play a direct role in search. I think the startup you.com was really visionary in this sense. Its founder, Richard Socher, is a distinguished alum of this course. you.com was way ahead of the curve in seeing that large language models could power really interesting and powerful aspects of web search. We've seen lots of activity in that space since then. For example, Microsoft has partnered with OpenAI, and when GPT-4 was announced, part of the announcement was a partnership with Morgan Stanley to help Morgan Stanley employees use GPT-4 to find things in their own internal documents. That just shows you that we hear a lot about public web search, but there are also powerful search applications that operate over private document collections inside organizations.
You might ask yourself why this is happening, and a large part of the answer is that information retrieval is a hard natural language understanding problem. Here's an example that brings that point home: a query paired with a relevant passage describing powerful chemical defenses against ingested pathogens. The coloring indicates relevance connections. You'll notice that the keywords in the query have hardly any literal overlap with the terms in the passage. The connections that we need to make here are entirely semantic, and that shows you that the more deeply we understand the language of the query and the document, the better we will do at finding these relevant passages given queries like this.
That's all mainly about NLP improving information retrieval, but for me, the more exciting direction is that information retrieval is now revolutionizing NLP.
Let me use question answering to highlight that. In the standard formulation, the system is given a title, a context passage, and a question, and there is a guarantee that the answer will be a literal substring of the context passage. That is standard QA as formulated in tasks like SQuAD: at train time, you're given a title, context, question, and answer; at test time, you're given a title, context, and question, and you have a guarantee that the answer is a literal substring of that context passage. That used to be a hard problem for our best models, and I think you can see that it's also pretty disconnected from the actual things that we want to do with question answering in the world, where we very rarely get this very rich context or that substring guarantee.
Contrast that with the open formulation of the task. Now, only the question and answer are given at train time, and the title and the context passage need to be retrieved from somewhere, from a large document corpus. There is no guarantee that the answer will be a literal substring of anything in the context or the title. That makes the problem much harder, but it's also much more relevant, because this is simulating actually searching on the web, where you pose a question and you need to retrieve relevant information in order to answer it. It is substantially harder: only the question and answer are available at train time, and at test time just the question, so all the relevant information needs to be retrieved. What you see there is that, to the extent that we can retrieve really good evidence for answering these questions, we will do much better at the task, and that is the crucial role for retrieval in the OpenQA pipeline that you all will be exploring as part of the associated homework and bake-off for this unit.
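To make the contrast concrete, here is a minimal sketch of the two formulations as Python dictionaries. The field names and example text are illustrative assumptions, not the exact schema of SQuAD or of the homework data.

```python
# Standard ("closed") QA example: title, context, and question are all given,
# and the answer is guaranteed to be a literal substring of the context.
squad_style_example = {
    "title": "Gastric acid",
    "context": "Gastric acid and proteases serve as chemical defenses against pathogens.",
    "question": "What serves as a chemical defense against pathogens?",
    "answer": "Gastric acid and proteases",  # literal substring of the context
}

# OpenQA example: only the question (and, at train time, the answer) is given.
# The context must be retrieved from a large corpus, and the answer need not
# be a substring of anything that is retrieved.
openqa_example = {
    "question": "What protects the digestive system against ingested pathogens?",
    "answer": "Gastric acid and proteases",
}
```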
Question answering is really just one example of a family of what you might call knowledge-intensive tasks. It is a prominent member of that family, but we also have things like claim verification, commonsense reasoning, and long-form reading comprehension. These are all transparently tasks that depend very heavily on having rich information about the world informing whatever prediction the system makes.
That's pretty clear, but I'm also interested in taking standard, typically closed NLP tasks and expanding them into more open variants. For example, summarization is standardly posed as a task where you take in a long passage and try to produce a shorter one, but couldn't we make it an open, knowledge-intensive task by augmenting the input with lots of information that we've retrieved? I think it's a reasonable hypothesis that that could improve summarization systems. Similarly, natural language inference is typically posed as a classification task: you're given a premise and a hypothesis and you assign them one of three labels. But wouldn't it be interesting to augment the premise with information about the world that might help the system make better predictions as a classifier? Those are just two examples of how we could take classical problems and reformulate them into knowledge-intensive tasks that would benefit from retrieval, with the result that they could be made more effective and also more scalable to real-world problems.
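As a concrete illustration of that second reformulation, here is a minimal sketch of retrieval-augmented NLI. The retrieve and classify_nli functions are hypothetical placeholders standing in for a real retriever and a real trained classifier; only the overall shape of the pipeline is the point.

```python
def retrieve(query, k=2):
    # Hypothetical retriever: a real system would search a large corpus;
    # here we return canned passages so the example is self-contained.
    corpus = [
        "Hummingbirds are the only birds that can fly backwards.",
        "A hummingbird's rapid wing beat allows it to hover in place.",
    ]
    return corpus[:k]

def classify_nli(premise, hypothesis):
    # Stand-in for a trained NLI classifier returning one of three labels.
    return "entailment"  # placeholder prediction

premise = "The tiny hummingbird hovered at the feeder."
hypothesis = "The bird can fly backwards."

# Closed formulation: the classifier sees only the premise and hypothesis.
closed_label = classify_nli(premise, hypothesis)

# Open, knowledge-intensive formulation: augment the premise with retrieved
# world knowledge before classifying.
evidence = " ".join(retrieve(premise))
open_label = classify_nli(evidence + " " + premise, hypothesis)

print(closed_label, open_label)
```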
Let's talk a little bit about information retrieval approaches, starting with classical IR. In this case, we have a user query coming in, perhaps just a few keywords. What we've done offline, presumably, is create a large index that maps terms to associated relevant documents. It could be a list of documents that contain the term, but we would probably also do some scoring of those documents with respect to the query terms to organize them by relevance. With that index in place, when a query arrives we can do document scoring and give back to the user a ranked list of documents ordered by relevance. Then it's up to the user to figure out which of those documents to check out in looking for an answer to their question. That is the classical search experience as we all know it.
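Here is a minimal sketch of that classical pipeline: an inverted index built offline, plus TF-IDF-style scoring at query time. The toy documents are made up, and TF-IDF is just one standard choice of scoring function; the lecture doesn't commit to a particular one.

```python
from collections import Counter, defaultdict
import math

# Toy document collection (illustrative only).
docs = {
    "d1": "gastric acid and proteases serve as chemical defenses against pathogens",
    "d2": "the stomach digests food with the help of enzymes",
    "d3": "search engines rank documents by relevance to a query",
}

# Offline step: build an inverted index mapping each term to the documents
# that contain it, along with term frequencies.
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term, tf in Counter(text.split()).items():
        index[term][doc_id] = tf

def score(query):
    """Score and rank documents for the query with simple TF-IDF weighting."""
    scores = Counter()
    n_docs = len(docs)
    for term in query.split():
        postings = index.get(term, {})
        if not postings:
            continue  # unseen term contributes nothing
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return scores.most_common()  # ranked list, most relevant first

print(score("chemical defenses against pathogens"))
```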
There is now a movement afoot to replace a lot of that with pure language models; I've called this the LLMs-for-everything approach. The user poses a question, a big language model does some mysterious work and spits out the answer. That is wonderfully convenient, whereas before we had to look through a ranked list of web pages to find our answer. But we know that these models can fabricate information, and so we should be skeptical consumers of their outputs. Since we don't know where this answer came from, we have no information about how it was produced, and we might start to wonder about whether our information need was actually met. Those concerns are serious enough that I think we should be pushing in a different direction.
That different direction is one in which neural information retrieval modules continue to be important players in open, knowledge-intensive tasks for NLP. Neural IR models are going to function a lot like those classical models, except in a much richer semantic space. We're going to start with a big language model, just as we did in the LLMs-for-everything approach, but we're going to use it somewhat differently. The first thing we'll do with that language model is take all the documents in our collection and represent them with the language model. The result will be dense numerical representations that we expect to capture important aspects of the documents' structure and meaning. That is essentially the document index of classical IR, but now it's a bunch of deep learning representations.
Then a query comes in, and the first thing we do with that query is process it, probably using the same large language model, to get back a dense numerical representation of the query. On the basis of all these representations, we can reproduce essentially everything about the classical search experience. The only twist is that scoring happens in a different way, because we're now dealing not with terms and term-based scores but rather with the dense numerical representations that we're accustomed to throughout deep learning. The result of all that scoring is that we give the user back a ranked list of pages. We've reproduced the classical experience for the user, in the sense that they still need to search through those pages and find the answer to their question. We just hope that we're doing a much better job of offering relevant pages, in virtue of the fact that we're operating in this much richer semantic space.
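Here is a minimal sketch of that dense-retrieval setup. The encode function below is a hypothetical stand-in for the language-model encoder (it just returns deterministic random unit vectors so the example runs end to end); everything else mirrors the pipeline described above, with relevance scores computed directly on the dense vectors.

```python
import numpy as np

# Hypothetical stand-in for a neural encoder. In a real system this would be
# a Transformer that maps text to a dense vector; here it returns a
# deterministic random unit vector keyed on the text.
def encode(text, dim=128):
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    vec = rng.normal(size=dim)
    return vec / np.linalg.norm(vec)

# Offline step: encode every document into an "index" of dense vectors.
docs = [
    "Gastric acid and proteases serve as chemical defenses against pathogens.",
    "BERT is a Transformer encoder pretrained with masked language modeling.",
    "Search engines rank documents by relevance to a query.",
]
doc_matrix = np.stack([encode(d) for d in docs])   # shape: (num_docs, dim)

# Online step: encode the query and score it against every document.
query_vec = encode("chemical defenses against ingested pathogens")
scores = doc_matrix @ query_vec                    # cosine similarity (unit vectors)

# Give the user back a ranked list of documents, most relevant first.
for idx in np.argsort(-scores):
    print(f"{scores[idx]:.3f}  {docs[idx]}")
```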
This is a good moment to bridge into in-context learning; the two themes will come together for you in the homework. Let's think about how that bridge is going to happen. Suppose we have a frozen large language model, and we've simply prompted it with a question, "Who is Bert?", and the task is to come up with an answer. The question is the only thing that the system is given.
The question is: how can we effectively answer this question using retrieval? Well, one thing we could do is retrieve from a document store a context passage that we hope will be relevant evidence for answering the question. But there's more that we could do with retrieval. For example, we know that large language models benefit from having demonstrations in their prompts. Maybe we have a train set of questions, and we could retrieve from that set a question to use as a demonstration. At that point, we could either use the training answer to that question, or maybe retrieve an answer, in the hope that that will more closely simulate what the system actually has to do. In any case, we now have this demonstration, and we could go further: for the demonstration's context, we could either use training evidence, like a passage from our QA dataset, or retrieve a passage with our retriever, to function as evidence for this little demonstration.
The guiding hypothesis is that, having woven together all of these retrieved pieces to produce evidence for answering this question, we're going to do a better job at coming up with predicted answers. That is retrieval-augmented in-context learning in its simplest form, where we're using our retriever to find evidence for the prompt. What you'll see in the in-context learning unit is that this is just the start of a very rich set of options that we can employ for effectively developing in-context learning systems that use retrieval to find relevant evidence.
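As a sketch of what such a prompt might look like, here is one way to weave a retrieved demonstration and a retrieved context passage together with the question. The "Context / Question / Answer" format and the hard-coded strings are illustrative assumptions; in the homework these pieces come from a retriever and a training set rather than being written by hand.

```python
# Hypothetical retrieved pieces: in practice, the demonstration comes from a
# train set and the context passages come from a retriever over a corpus;
# here they are hard-coded so the example is self-contained.
demo_question = "Who is Alan Turing?"
demo_context = "Alan Turing was a British mathematician and a pioneer of computer science."
demo_answer = "A British mathematician and pioneer of computer science."

retrieved_context = "Bert is a Muppet character who appears on Sesame Street alongside Ernie."
question = "Who is Bert?"

# Weave the retrieved demonstration and the retrieved evidence into a single
# prompt for a frozen language model.
prompt = (
    f"Context: {demo_context}\n"
    f"Question: {demo_question}\n"
    f"Answer: {demo_answer}\n"
    f"\n"
    f"Context: {retrieved_context}\n"
    f"Question: {question}\n"
    f"Answer:"
)
print(prompt)
```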
That's how these two themes really come together, and I think their coming together is one of the central questions for the fields of NLP and IR in the current moment.
Because really what we're seeing is a lot of worrisome behavior from systems that rely purely on language models. For example, we all saw that Google took a real hit in terms of its stock price for making a minor factual error in one of its demo videos. Maybe that was appropriate given how high-stakes all of this is, but it's funny to think about, because at the same time, OpenAI models were fabricating evidence all over the place.
Here's an amusing example. I asked the system, "Are professional baseball players allowed to glue small wings to their caps?", and I asked the model to offer me some evidence for the answer that it gave. It did indeed dutifully say no and then offer some evidence, but the evidence links that it offered are entirely fabricated. To me, that is easily worse than offering no evidence at all, because we have all become accustomed to having links like these function as ground-truth evidence for the answer given. The fact that that ground truth is being completely fabricated is absolutely worse than offering no evidence at all.
We're going to talk about our Demonstrate-Search-Predict paper. Figure 1 of that paper includes an example with the question, "How many stories are in the castle David Gregory inherited?" On Twitter, a user read our paper and then tried the example with the Bing search engine. They said, "Aha, Bing can answer your very difficult-seeming question, no problem." But then that user immediately followed up by noticing that Bing was actually citing our own paper as evidence for the answer to this question.
I will say that this is deeply worrisome to me. Our paper is not good ground-truth evidence about the castle David Gregory inherited. We used it purely as an illustrative example. For all anyone knows, we might have actually been talking about giving the wrong answer to this question; in fact, our figure does embed some wrong answers. The idea that a scientific research paper about in-context learning with retrieval would be used as evidence about the castle David Gregory inherited is completely baffling to me. That just shows you that simply having some search mechanism doesn't mean that you're doing good search, and what we really need in this context is high-quality search.
I found this next example amusing, but maybe also a little bit worrisome. I prompted ChatGPT with "Write a biography of Christopher Potts from Stanford University." The result is very flattering to me, and some of it we can go ahead and say is truthful. But everything in the box in red is completely false: all of the factual information expressed there is false. I have no complaints about any of these facts, except that they are false. The reason, though, that I'm worried is that I think not everyone is going to get such flattering information when they ask for their biography. We are just on the precipice of seeing really worrisome behavior, with really meaningful downstream effects for people in society, if these language models continue to fabricate evidence in this way. That's why I feel that the current unit, and the work that you do for it, is extremely important and relevant to addressing this growing societal and technological problem.
If you would like a little bit more on this, Omar Khattab, Matei Zaharia, and I wrote two blog posts a few years ago that I think remain extremely relevant. The first is about building scalable, explainable, and adaptive NLP models with retrieval. In the second, a modest proposal for radically better AI-powered web search, we emphasized the importance of provenance for information, and of ground truth in documents, as essential aspects of doing web search, even with big, powerful, fancy large language models. That is the vision that we're going to try to project for you throughout this unit and with our homework.