
Stanford XCS224U: NLU | Information Retrieval, Part 1: Guiding Ideas | Spring 2023


Chapters

0:00 Intro
0:21 NLP is revolutionizing Information Retrieval
1:59 IR is a hard NLU problem
2:53 IR is revolutionizing NLP
5:32 Knowledge-intensive tasks
7:02 Classical IR
8:00 LLMs for everything
8:58 Neural IR
10:52 Retrieval-augmented in-context learning
13:11 IR is more important than ever!
16:55 Blog posts


00:00:00.000 | Welcome everyone.
00:00:05.840 | This is our first screencast on information retrieval.
00:00:08.600 | Let's start with some guiding ideas.
00:00:10.200 | These will serve both to give you a sense for
00:00:12.400 | the current moment in science and technology,
00:00:14.840 | and also help you build a bridge into the homework,
00:00:17.580 | which is on retrieval augmented in context learning.
00:00:21.400 | You might have noticed by now that NLP
00:00:24.440 | is revolutionizing information retrieval.
00:00:26.960 | This is a story that really begins with
00:00:28.700 | the transformer or maybe more properly,
00:00:30.800 | one of its most famous spokesmodels, BERT.
00:00:33.160 | Soon after BERT was launched,
00:00:35.120 | Google announced that it was incorporating aspects of
00:00:37.600 | BERT into its core search technologies.
00:00:40.520 | Microsoft made a similar announcement
00:00:42.920 | with Bing at about the same time.
00:00:45.160 | I have a feeling that these two very public announcements were
00:00:48.360 | just a glimpse of the changes that were starting
00:00:50.560 | to happen with major search engines.
00:00:53.600 | A little bit later, we started to see that
00:00:55.840 | large language models would play a direct role in search.
00:00:59.000 | I think the startup you.com was really visionary in this sense.
00:01:02.560 | I like to highlight you.com because its CEO,
00:01:05.340 | Richard Socher, is a distinguished alum of this course.
00:01:08.900 | you.com was way ahead of the curve in seeing that
00:01:11.620 | large language models could be
00:01:13.800 | really interesting and powerful aspects of web search.
00:01:17.740 | We've seen lots of activity in that space since then.
00:01:20.420 | For example, Microsoft has partnered with OpenAI,
00:01:23.620 | and it's now using OpenAI models
00:01:25.640 | as part of the Bing search experience.
00:01:28.500 | You might have noticed also, from
00:01:29.940 | a different perspective on this,
00:01:31.700 | that when GPT-4 was announced,
00:01:33.500 | part of the announcement was a partnership with
00:01:35.780 | Morgan Stanley to help Morgan Stanley employees
00:01:39.380 | use GPT-4 to find things in their own internal documents.
00:01:43.540 | That just shows you that we hear a lot about public web search,
00:01:47.160 | but there are also powerful search applications that
00:01:49.680 | could happen internal to organizations.
00:01:51.900 | Again powered, in this case
00:01:54.160 | as throughout this entire story,
00:01:56.820 | by the transformer.
00:01:59.060 | You might ask yourself why this is happening,
00:02:01.740 | and I think the fundamental reason is that
00:02:04.140 | information retrieval is simply
00:02:06.260 | a hard natural language understanding problem.
00:02:08.740 | The more powerful our NLU technologies,
00:02:11.520 | the better we can do with retrieval.
00:02:13.580 | Here's an example that brings that point home.
00:02:16.220 | Our query is "what compounds protect
00:02:18.540 | the digestive system against viruses?"
00:02:21.100 | A relevant document is "in the stomach,
00:02:23.800 | gastric acid and proteases serve as
00:02:26.500 | powerful chemical defenses against ingested pathogens."
00:02:30.200 | The coloring indicates relevance connections.
00:02:32.820 | You'll notice that for the keywords in the query and
00:02:35.100 | the document, there is no string overlap.
00:02:38.200 | The connections that we need to make here are
00:02:40.460 | entirely semantic, and that shows you that the more
00:02:43.460 | deeply we understand the language of the query and the document,
00:02:46.800 | the better we're going to be able to do at
00:02:48.860 | finding these relevant passages given queries like this.
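To make that concrete, here is a minimal sketch in Python showing that the query and the passage above share no content terms, which is exactly why purely lexical matching fails here. The tokenization and stopword list are my own illustrative simplifications, not anything from the lecture or a real IR pipeline.

```python
# Minimal sketch of the lexical-mismatch point above; the tokenization and
# stopword list are illustrative simplifications, not a real IR pipeline.
query = "what compounds protect the digestive system against viruses"
passage = ("in the stomach, gastric acid and proteases serve as "
           "powerful chemical defenses against ingested pathogens")

stopwords = {"what", "the", "against", "in", "as", "and"}

def content_terms(text):
    return {tok.strip(",.") for tok in text.lower().split()} - stopwords

print(content_terms(query) & content_terms(passage))
# -> set(): no shared content terms, so the relevance is entirely semantic
```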
00:02:53.060 | That's all mainly about information retrieval,
00:02:55.960 | but I'm an NLP-er,
00:02:57.280 | and for me, the more exciting direction of this is that
00:03:00.560 | information retrieval is now revolutionizing NLP,
00:03:03.640 | and the way it's doing that is by making
00:03:06.140 | our NLP problems more open and
00:03:08.720 | more relevant to actual daily tasks.
00:03:11.880 | Let me use question answering to highlight that.
00:03:14.800 | In the by now standard formulation
00:03:17.680 | of question answering within NLP,
00:03:20.080 | the system is given a title and a context passage and a question,
00:03:25.980 | and the task is to answer that question,
00:03:28.260 | and there is a guarantee that the answer will be
00:03:30.660 | a literal substring of that context passage.
00:03:33.860 | That is standard QA as formulated in tasks like SQuAD,
00:03:37.940 | the Stanford Question Answering Dataset.
00:03:40.100 | Just to repeat, at train time,
00:03:42.260 | you're given a title, context, question, and answer.
00:03:45.740 | At test time, you're given a title, context,
00:03:48.660 | and question, and you have a guarantee that the answer
00:03:51.960 | is a literal substring of that context passage.
00:03:55.300 | That used to be a hard problem for our best models,
00:03:57.780 | but it has grown quite easy,
00:03:59.820 | and I think you can see that it's also pretty
00:04:01.700 | disconnected from actual things that we want to
00:04:04.020 | do with question answering in the world where we very
00:04:06.820 | rarely get this very rich context or that substring guarantee.
00:04:12.340 | We are moving now as a field into
00:04:15.320 | a formulation of QA that I've called OpenQA,
00:04:17.820 | and this will be substantially harder.
00:04:19.700 | In this mode, maybe there's a title,
00:04:21.920 | and a context, and a question,
00:04:23.620 | and the task is to answer.
00:04:25.140 | But now, only the question and answer are given at train time,
00:04:30.180 | and the title and the context passage will need to be
00:04:33.180 | retrieved from somewhere, from a large document corpus,
00:04:37.260 | it could be the web.
00:04:38.780 | Having retrieved it, of course,
00:04:40.980 | we have no guarantee that the answer will be
00:04:43.220 | a literal substring of anything in the context or the title.
00:04:47.180 | This is a substantially harder problem,
00:04:49.140 | but it's also much more relevant because this is simulating
00:04:52.780 | actually searching on the web, where you pose a question and you need to
00:04:56.180 | retrieve relevant information in order to answer the question.
00:05:00.140 | But it's substantially harder: only the question and answer are given at train time,
00:05:03.440 | and at test time, all you're given is
00:05:05.580 | the question and all the relevant information needs to be retrieved.
00:05:09.300 | What you see there is that, to the extent that we can have
00:05:12.180 | really good retrieval technologies find
00:05:14.780 | really good evidence for answering these questions,
00:05:17.420 | we can develop effective systems,
00:05:19.580 | and that is the crucial role for retrieval in this OpenQA pipeline that you all will
00:05:24.580 | be exploring as part of the associated homework and bake-off for this unit.
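To make the contrast concrete, here is a small sketch of the two supervision settings as they might be represented in data. The field names and example texts are my own illustrative choices, not the actual SQuAD or bake-off schema.

```python
# Illustrative instance formats; field names are assumptions for exposition,
# not the official SQuAD or course bake-off schema.

# Standard ("closed") QA, SQuAD-style: title, context, and question are given,
# and the answer is guaranteed to be a literal substring of the context.
squad_style = {
    "title": "Stanford University",
    "context": "Stanford University was founded in 1891 by Leland and Jane Stanford.",
    "question": "When was Stanford University founded?",
    "answer": "1891",
}
assert squad_style["answer"] in squad_style["context"]  # the substring guarantee

# OpenQA: train time gives only question/answer pairs; test time gives only
# the question. Context must be retrieved, with no substring guarantee.
openqa_train_instance = {"question": "When was Stanford University founded?",
                         "answer": "1891"}
openqa_test_instance = {"question": "When was Stanford University founded?"}
```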
00:05:30.020 | Question answering is really just one example of
00:05:35.560 | a family of what you might call knowledge intensive tasks.
00:05:38.740 | I mentioned question answering,
00:05:39.980 | but we also have things like claim verification,
00:05:42.660 | common sense reasoning, long-form reading comprehension,
00:05:46.460 | and information seeking dialogue.
00:05:48.200 | These are all transparently tasks that depend very heavily on having
00:05:52.580 | rich information about the world informing whatever prediction the system makes.
00:05:57.020 | That's pretty clear, but I'm also interested in taking
00:06:00.300 | standard typically closed NLP tasks and expanding them into more open variants.
00:06:05.860 | For example, summarization is standardly just posed as
00:06:09.380 | a task where you take in a long passage and try to
00:06:12.020 | produce a shorter one, but couldn't we make that
00:06:14.940 | a knowledge-intensive task where we augment
00:06:17.180 | the input with lots of information that we've retrieved?
00:06:20.000 | I think it's a reasonable hypothesis that could improve summarization systems.
00:06:24.960 | Similarly, natural language inference is typically just posed as
00:06:28.860 | a closed classification problem: given a premise and a
00:06:31.780 | hypothesis, you assign one of three labels.
00:06:34.540 | But wouldn't it be interesting to augment the premise with information about
00:06:38.340 | the world that might help the system make better predictions as a classifier?
00:06:43.900 | I think those are just two examples of how we could take classical problems,
00:06:47.780 | even classification problems,
00:06:49.600 | and reformulate them into knowledge intensive tasks that would benefit from
00:06:53.660 | retrieval with the result that they could be made more effective
00:06:57.380 | and also more scalable to real-world problems.
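As a sketch of that kind of reformulation, augmenting an NLI premise with retrieved evidence might look like the following. The retriever and the example texts here are stand-ins invented for illustration, not a specific system or dataset.

```python
# Sketch of turning a closed NLI instance into a knowledge-intensive one.
# `retrieve` is a placeholder; in practice it would query a real index.
def retrieve(query, k=1):
    return ["Gastric acid is a digestive fluid formed in the stomach lining."][:k]

premise = "The stomach produces gastric acid."
hypothesis = "The stomach secretes a digestive fluid."

# Closed formulation: the classifier sees only (premise, hypothesis).
closed_input = (premise, hypothesis)

# Knowledge-intensive formulation: augment the premise with retrieved evidence
# before handing the pair to the classifier.
augmented_premise = premise + " " + " ".join(retrieve(premise))
open_input = (augmented_premise, hypothesis)
```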
00:07:01.820 | Let's talk a little bit about information retrieval approaches,
00:07:06.500 | and I'll start with classical IR.
00:07:09.140 | In this case, we have a user query coming in,
00:07:11.900 | when was Stanford University founded?
00:07:14.420 | The first thing that we do is term lookup.
00:07:16.980 | What we've done offline presumably is create a large index
00:07:20.940 | that maps terms to associated relevant documents.
00:07:24.780 | It could be a list of documents that contain the term,
00:07:27.420 | but we would probably also do some scoring of those documents with
00:07:30.980 | respect to these query terms to organize them by relevance.
00:07:35.020 | On the basis of that index,
00:07:36.900 | we can do document scoring and give back to the user
00:07:40.340 | a ranked list of documents ordered by relevance.
00:07:44.540 | Then it's up to the user to figure out which of those documents to
00:07:47.740 | check out in looking for an answer to the question.
00:07:51.740 | That is the classical search experience as we all know it.
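Here is a minimal sketch of that classical pipeline in Python: a toy inverted index built offline, then term lookup and a simple term-frequency score at query time. Real systems would use BM25-style weighting and far more careful text processing; this only shows the shape of the approach, and the documents are made up for illustration.

```python
from collections import defaultdict

# Toy document collection.
docs = {
    "d1": "Stanford University was founded in 1891 by Leland and Jane Stanford",
    "d2": "The university library is located near the main quad",
}

# Offline: build an inverted index mapping each term to the documents that
# contain it, with term counts.
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term][doc_id] = index[term].get(doc_id, 0) + 1

def search(query):
    """Term lookup plus a simple additive term-frequency score; a real engine
    would use something like BM25 instead."""
    scores = defaultdict(float)
    for term in query.lower().split():
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(search("when was stanford university founded"))
# -> [('d1', 5.0), ('d2', 1.0)]: a ranked list for the user to look through
```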
00:07:57.500 | There is now a movement afoot to replace a lot of that with pure language models.
00:08:02.740 | I've called this the LLMs for everything approach.
00:08:05.540 | In this mode, the user's query comes in,
00:08:07.780 | when was Stanford University founded?
00:08:09.820 | A big language model, totally opaque to us,
00:08:12.500 | does some mysterious work and spits out the answer,
00:08:15.660 | Stanford University was founded in 1891.
00:08:19.180 | A real change to the search experience,
00:08:21.780 | whereas before we had to look through a ranked list of web pages to find our answer,
00:08:26.820 | now the answer is given to us directly.
00:08:30.020 | That could be very exciting.
00:08:32.020 | However, we might start to worry.
00:08:34.940 | We know these models can fabricate evidence,
00:08:37.340 | and so we should be skeptical consumers of their outputs.
00:08:40.620 | Since we don't know where this answer came from,
00:08:43.220 | we have no information about how it was produced.
00:08:46.020 | We might start to wonder about whether our information need was actually met,
00:08:50.220 | whether we should trust this string.
00:08:52.420 | I'm deeply concerned about this model,
00:08:55.260 | enough that I think we should be pushing in a different direction.
00:08:58.220 | That's where neural information retrieval modules come in,
00:09:02.060 | continuing to be important players in open, knowledge-intensive tasks for NLP.
00:09:08.820 | Neural IR models are going to function a lot like
00:09:11.620 | those classical models except in a much richer semantic space.
00:09:15.820 | We're going to start with a big language model,
00:09:17.980 | just as we did in the LLMs for everything approach,
00:09:20.420 | but we're going to use it somewhat differently.
00:09:22.700 | The first thing we'll do with that language model is take
00:09:25.380 | all the documents that we have in our collection of
00:09:27.700 | documents and represent them with the language model.
00:09:31.340 | The result of that will be some dense numerical representations that we
00:09:35.900 | expect to capture important aspects of their structure and their meaning.
00:09:41.300 | That is essentially the document index in the classical IR mode,
00:09:46.060 | but now it's a bunch of deep learning representations.
00:09:49.900 | Then the user's query comes in,
00:09:52.420 | and the first thing we do with that query is process it,
00:09:54.820 | probably using the same large language model,
00:09:57.140 | and get back a dense numerical representation of that query.
00:10:01.300 | Then on the basis of all these representations,
00:10:04.380 | we can do scoring and extraction as usual.
00:10:07.900 | At this point, we can reproduce
00:10:09.980 | everything about the classical search experience.
00:10:12.660 | The only twist is that scoring will happen in a different way because we're now
00:10:16.980 | dealing not with terms and scores but rather with
00:10:20.460 | these dense numerical representations that we're accustomed to throughout deep learning.
00:10:25.620 | But the result of all that scoring is that we give the user back a ranked list of pages.
00:10:31.380 | We've reproduced the classical experience for the user in the sense that they now need to
00:10:35.660 | search through those pages and find the answer to their question.
00:10:39.940 | We just hope that we're doing a much better job of offering
00:10:44.420 | relevant pages in virtue of the fact that we're
00:10:47.020 | operating in a much richer semantic space.
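Here is a sketch of that neural pipeline. The `embed` function is a placeholder standing in for a pretrained encoder such as a BERT-style model (its random vectors carry no real semantics; they only show the shapes and the data flow), and scoring is a simple dot product over the dense representations.

```python
import numpy as np

def embed(text):
    """Placeholder encoder. A real system would run a pretrained Transformer
    encoder here; these random unit vectors only illustrate the data flow."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    vec = rng.normal(size=128)
    return vec / np.linalg.norm(vec)

# Offline: encode every document into a dense "index" of vectors.
docs = [
    "Stanford University was founded in 1891.",
    "Gastric acid serves as a chemical defense against ingested pathogens.",
]
doc_vectors = np.stack([embed(d) for d in docs])

# Online: encode the incoming query, then score and rank the documents.
query_vector = embed("when was stanford university founded")
scores = doc_vectors @ query_vector     # dot-product relevance scores
ranking = np.argsort(-scores)           # document indices, best first
print([(docs[i], float(scores[i])) for i in ranking])
```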
00:10:50.980 | This is a good moment to bridge into in-context learning,
00:10:56.220 | which is the other part of this unit,
00:10:58.060 | and they'll all come together for you in the homework.
00:11:00.260 | Let's think about how that bridge is going to happen.
00:11:02.860 | Now we're going to be in the mode of having
00:11:04.740 | a large language model and prompting it.
00:11:07.180 | In this case, we've simply prompted it with a question,
00:11:09.620 | "Who is Bert?", and the task is to come up with an answer.
00:11:13.060 | In the mode that we're operating in,
00:11:14.980 | that is the only thing that the system is given.
00:11:17.460 | This is truly an open QA formulation.
00:11:20.340 | The question is, how can we effectively answer this question using retrieval?
00:11:26.340 | Well, one thing we could do is retrieve, from a document store,
00:11:31.100 | a context passage for that question that we
00:11:34.060 | hope will be relevant evidence for answering that question.
00:11:36.900 | That's given in green here.
00:11:38.620 | But there's more that we could do with retrieval.
00:11:40.900 | For example, we know that large language models,
00:11:43.300 | when they're doing in-context learning,
00:11:45.140 | benefit from having demonstrations.
00:11:47.700 | Maybe we have a train set of questions and we could
00:11:50.860 | just retrieve from that set a question to use.
00:11:54.500 | Then at that point, we could either use the training answer to
00:11:58.740 | that question or maybe retrieve an answer in the hope that
00:12:02.020 | that will more closely simulate what the system actually has to do.
00:12:05.580 | But in any case, we now have this demonstration and we could go on further.
00:12:09.140 | Depending on the train set,
00:12:10.380 | we could either use training evidence like a passage from our QA dataset,
00:12:15.140 | or retrieve a context passage, again,
00:12:18.340 | using a retriever to function as evidence for this little demonstration here.
00:12:23.260 | The guiding hypothesis is that having woven together
00:12:27.340 | training instances with some retrieval steps
00:12:30.020 | to produce evidence for answering this question,
00:12:32.860 | we're going to do a better job at coming up with predicted answers.
00:12:37.300 | That's a simple retrieve-then-read pipeline
00:12:40.980 | where we're using our retriever to find evidence.
00:12:43.220 | What you'll see in the in-context learning unit,
00:12:46.100 | and as you work on the homework,
00:12:48.020 | is that this is just the start of a very rich set of options that we can
00:12:52.020 | employ to effectively develop in-context learning systems
00:12:56.020 | that use retrieval to find relevant evidence.
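A sketch of that retrieve-then-read prompt assembly is below. The retriever, the tiny train set, and the prompt template are all placeholders made up for illustration; they are not the homework's actual API or data.

```python
# Sketch of retrieve-then-read prompt construction for in-context learning.
# `retrieve` and `train_set` are invented placeholders, not a specific system.

def retrieve(query, k=1):
    """Stand-in for a neural retriever over a document store."""
    return ["BERT is a Transformer encoder model released by Google in 2018."][:k]

train_set = [
    {"question": "Who introduced the Transformer architecture?",
     "answer": "Vaswani et al.",
     "evidence": "The Transformer architecture was introduced by Vaswani et al. in 2017."},
]

def build_prompt(question):
    # Demonstration: a training example, shown with its evidence and answer.
    demo = train_set[0]
    demo_block = (f"Context: {demo['evidence']}\n"
                  f"Question: {demo['question']}\n"
                  f"Answer: {demo['answer']}\n")
    # Target: retrieved evidence for the new question, with the answer left
    # for the language model to generate.
    context = " ".join(retrieve(question))
    target_block = (f"Context: {context}\n"
                    f"Question: {question}\n"
                    f"Answer:")
    return demo_block + "\n" + target_block

print(build_prompt("Who is Bert?"))
# The assembled prompt is what gets sent to the language model.
```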
00:12:59.340 | That's how these two themes really come together.
00:13:02.460 | I think that these two themes coming together is one of
00:13:05.580 | the central questions for the field of NLP and IR in the current moment.
00:13:10.860 | Because really what we're seeing is a lot of worrisome behavior
00:13:15.100 | from large language models that are being
00:13:17.140 | deployed as part of search technologies.
00:13:19.740 | For example, we all saw that Google took a real hit in terms of its stock price for
00:13:24.700 | making a minor factual error in one of its demo videos.
00:13:28.980 | Maybe that was appropriate given how high stakes all of this is,
00:13:32.700 | but it's funny to think about because at the same time,
00:13:36.140 | OpenAI models were fabricating evidence all over the place.
00:13:40.260 | This is an amusing example where I have asked the system,
00:13:44.060 | are professional baseball players allowed to glue small wings to their caps?
00:13:48.180 | I asked the model to offer me some evidence for the answer that it gave.
00:13:52.540 | It did indeed dutifully say no and then offer some evidence,
00:13:56.460 | but the evidence links that it offered are entirely fabricated.
00:14:00.260 | They are not real links to web pages.
00:14:02.940 | If you follow them, you get a 404 page.
00:14:05.620 | I find this tremendously frustrating and
00:14:08.660 | easily worse than offering no evidence at all because we have all become accustomed
00:14:14.380 | to seeing URLs and assuming that they do
00:14:16.780 | function as ground truth evidence for the answer given.
00:14:20.220 | The fact that that ground truth is being completely fabricated
00:14:23.520 | is absolutely worse than offering no evidence at all.
00:14:27.380 | Here's another funny case for this.
00:14:29.380 | We're going to talk about our Demonstrate-Search-Predict paper.
00:14:33.220 | Figure 1 of that paper includes an example with the question,
00:14:37.220 | how many stories are in the castle David Gregory inherited?
00:14:41.260 | On Twitter, a user read our paper and then tried the example with the Bing search engine.
00:14:47.620 | They said, "Aha, Bing can answer your very difficult seeming question, no problem."
00:14:52.620 | But then that user immediately followed up by noticing that Bing was actually
00:14:56.440 | citing our own paper as evidence for the answer to this question.
00:15:01.020 | I will say that this is deeply worrisome to me.
00:15:04.080 | Our paper should not be regarded as
00:15:07.440 | good ground truth evidence about the castle David Gregory inherited.
00:15:11.660 | We used it purely as an illustrative example.
00:15:14.880 | If we had had slightly different intentions,
00:15:17.120 | we might have actually been talking about giving the wrong answer to this question.
00:15:20.840 | In fact, our figure does embed some wrong answers.
00:15:23.800 | The idea that a scientific research paper that's about in-context learning with
00:15:28.440 | retrieval would be used as evidence about the castle David Gregory inherited,
00:15:33.400 | anything about that is completely baffling to me.
00:15:36.360 | That just shows you that simply because you have
00:15:38.920 | some search mechanisms doesn't mean that you're doing good search.
00:15:42.320 | But what we really need in this context is high-quality search.
00:15:47.000 | Just to round this out,
00:15:48.920 | I found this amusing but maybe a little bit worrisome.
00:15:52.040 | You should all try this with your own names.
00:15:54.480 | I prompted ChatGPT with "write a biography of Christopher Potts from Stanford University."
00:16:00.880 | I'm very happy with the first paragraph.
00:16:03.640 | It's very flattering to me and we can go ahead and say that it is truthful.
00:16:07.960 | But everything in the box in red is completely false.
00:16:12.120 | All of the factual information expressed there is false.
00:16:16.120 | It's a nice biography of me.
00:16:19.100 | I have no complaints about any of these facts except that they are false.
00:16:22.720 | The reason though that I'm worried is that I think not
00:16:25.920 | everyone is going to get such flattering information when they ask for their biography.
00:16:30.120 | We are just on the precipice of seeing really worrisome behavior with
00:16:34.280 | really meaningful downstream effects for people in society,
00:16:37.720 | if these language models continue to fabricate evidence in this way.
00:16:41.760 | That's why I feel like the current unit and the work that you do for it are
00:16:46.360 | extremely important and relevant to addressing
00:16:49.720 | this growing societal and technological problem.
00:16:53.720 | That sets the stage.
00:16:55.420 | If you would like a little bit more on this, Omar,
00:16:58.420 | Matei Zaharia, and I did two blog posts on this a few years ago
00:17:02.340 | that I think still remain extremely relevant.
00:17:05.500 | The first is building scalable,
00:17:07.120 | explainable, and adaptive NLP models with retrieval.
00:17:10.440 | That's a technical blog post.
00:17:13.440 | A more high-level, outward-looking one is
00:17:16.640 | this modest proposal for radically better AI-powered web search,
00:17:20.180 | where all the way back in 2021,
00:17:22.680 | we were emphasizing the importance of provenance for information and
00:17:27.440 | ground truth in documents as an important aspect of doing web search,
00:17:32.120 | even with big, powerful, fancy, large language models.
00:17:35.640 | That is the vision that we're going to try to project for
00:17:38.360 | you throughout this unit and with our homework.