Stanford XCS224U: NLU | Information Retrieval, Part 1: Guiding Ideas | Spring 2023
Chapters
0:00 Intro
0:21 NLP is revolutionizing Information Retrieval
1:59 IR is a hard NLU problem
2:53 IR is revolutionizing NLP
5:32 Knowledge-intensive tasks
7:02 Classical IR
8:00 LLMs for everything
8:58 Neural IR
10:52 Retrieval-augmented in-context learning
13:11 IR is more important than ever!
16:55 Blog posts
This is our first screencast on information retrieval. These screencasts will serve both to give you a sense for the current moment in science and technology, and also to help you build a bridge into the homework, which is on retrieval-augmented in-context learning.
Google announced that it was incorporating aspects of its large language models into its search technology. I have a feeling that these two very public announcements were just a glimpse of the changes that were starting, in which large language models would play a direct role in search. I think the startup you.com was really visionary in this sense. Its founder, Richard Socher, is a distinguished alum of this course. you.com was way ahead of the curve in seeing that large language models could power really interesting and powerful aspects of web search. We've seen lots of activity in that space since then. For example, Microsoft has partnered with OpenAI, and when GPT-4 was announced, part of the announcement was a partnership with Morgan Stanley to help Morgan Stanley employees use GPT-4 to find things in their own internal documents. That just shows you that we hear a lot about public web search, but there are also powerful search applications that operate over private document collections inside organizations.
You might ask yourself why this is happening, and a large part of the answer is that information retrieval is a hard natural language understanding problem. Here's an example that brings that point home: a query paired with a relevant passage describing powerful chemical defenses against ingested pathogens. The coloring indicates relevance connections. You'll notice that the keywords in the query have hardly any literal overlap with the terms in the passage. The connections that we need to make here are entirely semantic, and that shows you that the more deeply we understand the language of the query and the document, the better we will do at finding these relevant passages given queries like this.
That's all mainly about NLP improving information retrieval, but for me, the more exciting direction is that information retrieval is now revolutionizing NLP.
Let me use question answering to highlight that. In the standard formulation, the system is given a title, a context passage, and a question, and there is a guarantee that the answer will be a literal substring of the context passage. That is standard QA as formulated in tasks like SQuAD: at train time, you're given a title, context, question, and answer; at test time, you're given a title, context, and question, and you have a guarantee that the answer is a literal substring of that context passage. That used to be a hard problem for our best models, and I think you can see that it's also pretty disconnected from the actual things that we want to do with question answering in the world, where we very rarely get this very rich context or that substring guarantee.
Contrast that with the open formulation of the task. Now, only the question and answer are given at train time, and the title and the context passage need to be retrieved from somewhere, from a large document corpus. There is no guarantee that the answer will be a literal substring of anything in the context or the title. That makes the problem much harder, but it's also much more relevant, because this is simulating actually searching on the web, where you pose a question and you need to retrieve relevant information in order to answer it. It is substantially harder: only the question and answer are available at train time, and at test time just the question, so all the relevant information needs to be retrieved. What you see there is that, to the extent that we can retrieve really good evidence for answering these questions, we will do much better at the task, and that is the crucial role for retrieval in the OpenQA pipeline that you all will be exploring as part of the associated homework and bake-off for this unit.
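To make the contrast concrete, here is a minimal sketch of the two formulations as Python dictionaries. The field names and example text are illustrative assumptions, not the exact schema of SQuAD or of the homework data.

```python
# Standard ("closed") QA example: title, context, and question are all given,
# and the answer is guaranteed to be a literal substring of the context.
squad_style_example = {
    "title": "Gastric acid",
    "context": "Gastric acid and proteases serve as chemical defenses against pathogens.",
    "question": "What serves as a chemical defense against pathogens?",
    "answer": "Gastric acid and proteases",  # literal substring of the context
}

# OpenQA example: only the question (and, at train time, the answer) is given.
# The context must be retrieved from a large corpus, and the answer need not
# be a substring of anything that is retrieved.
openqa_example = {
    "question": "What protects the digestive system against ingested pathogens?",
    "answer": "Gastric acid and proteases",
}
```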
Question answering is really just one example of a family of what you might call knowledge-intensive tasks. It is a prominent member of that family, but we also have things like claim verification, commonsense reasoning, and long-form reading comprehension. These are all transparently tasks that depend very heavily on having rich information about the world informing whatever prediction the system makes.
That's pretty clear, but I'm also interested in taking standard, typically closed NLP tasks and expanding them into more open variants. For example, summarization is standardly posed as a task where you take in a long passage and try to produce a shorter one, but couldn't we make it an open, knowledge-intensive task by augmenting the input with lots of information that we've retrieved? I think it's a reasonable hypothesis that that could improve summarization systems. Similarly, natural language inference is typically posed as a classification task: you're given a premise and a hypothesis and you assign them one of three labels. But wouldn't it be interesting to augment the premise with information about the world that might help the system make better predictions as a classifier? Those are just two examples of how we could take classical problems and reformulate them into knowledge-intensive tasks that would benefit from retrieval, with the result that they could be made more effective and also more scalable to real-world problems.
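As a concrete illustration of that second reformulation, here is a minimal sketch of retrieval-augmented NLI. The retrieve and classify_nli functions are hypothetical placeholders standing in for a real retriever and a real trained classifier; only the overall shape of the pipeline is the point.

```python
def retrieve(query, k=2):
    # Hypothetical retriever: a real system would search a large corpus;
    # here we return canned passages so the example is self-contained.
    corpus = [
        "Hummingbirds are the only birds that can fly backwards.",
        "A hummingbird's rapid wing beat allows it to hover in place.",
    ]
    return corpus[:k]

def classify_nli(premise, hypothesis):
    # Stand-in for a trained NLI classifier returning one of three labels.
    return "entailment"  # placeholder prediction

premise = "The tiny hummingbird hovered at the feeder."
hypothesis = "The bird can fly backwards."

# Closed formulation: the classifier sees only the premise and hypothesis.
closed_label = classify_nli(premise, hypothesis)

# Open, knowledge-intensive formulation: augment the premise with retrieved
# world knowledge before classifying.
evidence = " ".join(retrieve(premise))
open_label = classify_nli(evidence + " " + premise, hypothesis)

print(closed_label, open_label)
```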
Let's talk a little bit about information retrieval approaches, starting with classical IR. In this case, we have a user query coming in, perhaps just a few keywords. What we've done offline, presumably, is create a large index that maps terms to associated relevant documents. It could be a list of documents that contain the term, but we would probably also do some scoring of those documents with respect to the query terms to organize them by relevance. With that index in place, when a query arrives we can do document scoring and give back to the user a ranked list of documents ordered by relevance. Then it's up to the user to figure out which of those documents to check out in looking for an answer to their question. That is the classical search experience as we all know it.
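Here is a minimal sketch of that classical pipeline: an inverted index built offline, plus TF-IDF-style scoring at query time. The toy documents are made up, and TF-IDF is just one standard choice of scoring function; the lecture doesn't commit to a particular one.

```python
from collections import Counter, defaultdict
import math

# Toy document collection (illustrative only).
docs = {
    "d1": "gastric acid and proteases serve as chemical defenses against pathogens",
    "d2": "the stomach digests food with the help of enzymes",
    "d3": "search engines rank documents by relevance to a query",
}

# Offline step: build an inverted index mapping each term to the documents
# that contain it, along with term frequencies.
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term, tf in Counter(text.split()).items():
        index[term][doc_id] = tf

def score(query):
    """Score and rank documents for the query with simple TF-IDF weighting."""
    scores = Counter()
    n_docs = len(docs)
    for term in query.split():
        postings = index.get(term, {})
        if not postings:
            continue  # unseen term contributes nothing
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return scores.most_common()  # ranked list, most relevant first

print(score("chemical defenses against pathogens"))
```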
There is now a movement afoot to replace a lot of that with pure language models; I've called this the LLMs-for-everything approach. The user poses a question, a big language model does some mysterious work and spits out the answer. That is wonderfully convenient, whereas before we had to look through a ranked list of web pages to find our answer. But we know that these models can fabricate information, and so we should be skeptical consumers of their outputs. Since we don't know where this answer came from, we have no information about how it was produced, and we might start to wonder about whether our information need was actually met. Those concerns are serious enough that I think we should be pushing in a different direction.
That different direction is one in which neural information retrieval modules continue to be important players in open, knowledge-intensive tasks for NLP. Neural IR models are going to function a lot like those classical models, except in a much richer semantic space. We're going to start with a big language model, just as we did in the LLMs-for-everything approach, but we're going to use it somewhat differently. The first thing we'll do with that language model is take all the documents in our collection and represent them with the language model. The result will be dense numerical representations that we expect to capture important aspects of the documents' structure and meaning. That is essentially the document index of classical IR, but now it's a bunch of deep learning representations.
Then a query comes in, and the first thing we do with that query is process it, probably using the same large language model, to get back a dense numerical representation of the query. On the basis of all these representations, we can reproduce essentially everything about the classical search experience. The only twist is that scoring happens in a different way, because we're now dealing not with terms and term-based scores but rather with the dense numerical representations that we're accustomed to throughout deep learning. The result of all that scoring is that we give the user back a ranked list of pages. We've reproduced the classical experience for the user, in the sense that they still need to search through those pages and find the answer to their question. We just hope that we're doing a much better job of offering relevant pages, in virtue of the fact that we're operating in this much richer semantic space.
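Here is a minimal sketch of that dense-retrieval setup. The encode function below is a hypothetical stand-in for the language-model encoder (it just returns deterministic random unit vectors so the example runs end to end); everything else mirrors the pipeline described above, with relevance scores computed directly on the dense vectors.

```python
import numpy as np

# Hypothetical stand-in for a neural encoder. In a real system this would be
# a Transformer that maps text to a dense vector; here it returns a
# deterministic random unit vector keyed on the text.
def encode(text, dim=128):
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    vec = rng.normal(size=dim)
    return vec / np.linalg.norm(vec)

# Offline step: encode every document into an "index" of dense vectors.
docs = [
    "Gastric acid and proteases serve as chemical defenses against pathogens.",
    "BERT is a Transformer encoder pretrained with masked language modeling.",
    "Search engines rank documents by relevance to a query.",
]
doc_matrix = np.stack([encode(d) for d in docs])   # shape: (num_docs, dim)

# Online step: encode the query and score it against every document.
query_vec = encode("chemical defenses against ingested pathogens")
scores = doc_matrix @ query_vec                    # cosine similarity (unit vectors)

# Give the user back a ranked list of documents, most relevant first.
for idx in np.argsort(-scores):
    print(f"{scores[idx]:.3f}  {docs[idx]}")
```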
This is a good moment to bridge into in-context learning; the two themes will come together for you in the homework. Let's think about how that bridge is going to happen. Suppose we have a frozen large language model, and we've simply prompted it with a question, "Who is Bert?", and the task is to come up with an answer. The question is the only thing that the system is given.
The question is: how can we effectively answer this question using retrieval? Well, one thing we could do is retrieve from a document store a context passage that we hope will be relevant evidence for answering the question. But there's more that we could do with retrieval. For example, we know that large language models benefit from having demonstrations in their prompts. Maybe we have a train set of questions, and we could retrieve from that set a question to use as a demonstration. At that point, we could either use the training answer to that question, or maybe retrieve an answer, in the hope that that will more closely simulate what the system actually has to do. In any case, we now have this demonstration, and we could go further: for the demonstration's context, we could either use training evidence, like a passage from our QA dataset, or retrieve a passage with our retriever, to function as evidence for this little demonstration.
The guiding hypothesis is that, having woven together all of these retrieved pieces to produce evidence for answering this question, we're going to do a better job at coming up with predicted answers. That is retrieval-augmented in-context learning in its simplest form, where we're using our retriever to find evidence for the prompt. What you'll see in the in-context learning unit is that this is just the start of a very rich set of options that we can employ for effectively developing in-context learning systems that use retrieval to find relevant evidence.
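As a sketch of what such a prompt might look like, here is one way to weave a retrieved demonstration and a retrieved context passage together with the question. The "Context / Question / Answer" format and the hard-coded strings are illustrative assumptions; in the homework these pieces come from a retriever and a training set rather than being written by hand.

```python
# Hypothetical retrieved pieces: in practice, the demonstration comes from a
# train set and the context passages come from a retriever over a corpus;
# here they are hard-coded so the example is self-contained.
demo_question = "Who is Alan Turing?"
demo_context = "Alan Turing was a British mathematician and a pioneer of computer science."
demo_answer = "A British mathematician and pioneer of computer science."

retrieved_context = "Bert is a Muppet character who appears on Sesame Street alongside Ernie."
question = "Who is Bert?"

# Weave the retrieved demonstration and the retrieved evidence into a single
# prompt for a frozen language model.
prompt = (
    f"Context: {demo_context}\n"
    f"Question: {demo_question}\n"
    f"Answer: {demo_answer}\n"
    f"\n"
    f"Context: {retrieved_context}\n"
    f"Question: {question}\n"
    f"Answer:"
)
print(prompt)
```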
That's how these two themes really come together, and I think their coming together is one of the central questions for the fields of NLP and IR in the current moment.
Because really what we're seeing is a lot of worrisome behavior from systems that rely purely on language models. For example, we all saw that Google took a real hit in terms of its stock price for making a minor factual error in one of its demo videos. Maybe that was appropriate given how high-stakes all of this is, but it's funny to think about, because at the same time, OpenAI models were fabricating evidence all over the place.
Here's an amusing example. I asked the system, "Are professional baseball players allowed to glue small wings to their caps?", and I asked the model to offer me some evidence for the answer that it gave. It did indeed dutifully say no and then offer some evidence, but the evidence links that it offered are entirely fabricated. To me, that is easily worse than offering no evidence at all, because we have all become accustomed to having links like these function as ground-truth evidence for the answer given. The fact that that ground truth is being completely fabricated is absolutely worse than offering no evidence at all.
We're going to talk about our Demonstrate-Search-Predict paper. Figure 1 of that paper includes an example with the question, "How many stories are in the castle David Gregory inherited?" On Twitter, a user read our paper and then tried the example with the Bing search engine. They said, "Aha, Bing can answer your very difficult-seeming question, no problem." But then that user immediately followed up by noticing that Bing was actually citing our own paper as evidence for the answer to this question.
I will say that this is deeply worrisome to me. Our paper is not good ground-truth evidence about the castle David Gregory inherited. We used it purely as an illustrative example. For all anyone knows, we might have actually been talking about giving the wrong answer to this question; in fact, our figure does embed some wrong answers. The idea that a scientific research paper about in-context learning with retrieval would be used as evidence about the castle David Gregory inherited is completely baffling to me. That just shows you that simply having some search mechanism doesn't mean that you're doing good search, and what we really need in this context is high-quality search.
I found this next example amusing, but maybe also a little bit worrisome. I prompted ChatGPT with "Write a biography of Christopher Potts from Stanford University." The result is very flattering to me, and some of it we can go ahead and say is truthful. But everything in the box in red is completely false: all of the factual information expressed there is false. I have no complaints about any of these facts, except that they are false. The reason, though, that I'm worried is that I think not everyone is going to get such flattering information when they ask for their biography. We are just on the precipice of seeing really worrisome behavior, with really meaningful downstream effects for people in society, if these language models continue to fabricate evidence in this way. That's why I feel that the current unit, and the work that you do for it, is extremely important and relevant to addressing this growing societal and technological problem.
If you would like a little bit more on this, Omar Khattab, Matei Zaharia, and I wrote two blog posts a few years ago that I think remain extremely relevant. The first is about building scalable, explainable, and adaptive NLP models with retrieval. In the second, a modest proposal for radically better AI-powered web search, we emphasized the importance of provenance for information, and of ground truth in documents, as essential aspects of doing web search, even with big, powerful, fancy large language models. That is the vision that we're going to try to project for you throughout this unit and with our homework.