Stanford XCS224U: Natural Language Understanding I Homework 2 I Spring 2023

00:00:00.000 | Welcome, everyone.

00:00:06.280 | This screencast is an overview of assignment two and its associated bake-off.

00:00:11.340 | The name of this combination is "Few Shot Open QA with DSP," and part of the function

00:00:16.260 | of this screencast is to unpack that complicated-sounding title.

00:00:20.880 | Let's begin with a review of different question-answering tasks, and keep in mind that the task you're

00:00:25.980 | confronted with for this assignment and bake-off is the one in the final row, which is very

00:00:30.660 | difficult indeed.

00:00:32.180 | Let's begin at the top.

00:00:33.580 | QA, standard QA, the way this is formulated in the modern phase, as in datasets like SQuAD,

00:00:39.780 | is that you're given a gold evidence passage, and the name of the game is to train a QA

00:00:44.820 | reader that will learn to find answers to questions in those evidence passages.

00:00:49.820 | And in this mode, we don't have a retriever at all, so I put "NA" here.

00:00:54.300 | We've talked about Open QA.

00:00:55.820 | This is the variant where we're not given a passage, but rather we need to learn to

00:01:00.220 | retrieve relevant passages, and then in the standard mode, we train a QA module to learn

00:01:06.500 | how to find answers to questions in those retrieved passages.

00:01:10.260 | And that is already substantially harder because now we have to retrieve good evidence, and

00:01:14.940 | we don't have a guarantee that the answer will even be findable in the evidence that

00:01:18.980 | we've retrieved.

00:01:21.500 | Few Shot QA is something we haven't discussed yet.

00:01:24.100 | This is the task that was really introduced in the GPT-3 paper, and it's hard along a

00:01:29.100 | different dimension.

00:01:30.800 | In this mode, we are given a gold passage.

00:01:33.100 | We could use SQuAD, for example, as the basis for the task, but we're not allowed to do

00:01:38.240 | any task-specific reader training.

00:01:41.020 | We have to rely on a frozen large language model to learn in context how to do what we

00:01:47.020 | want it to do.

00:01:48.780 | And in this mode, there's no retrieval because we just rely on the closed nature of a task

00:01:53.620 | like SQuAD.

00:01:55.940 | That's already hard enough because you don't get to do any task-specific training.

00:02:00.940 | We are going to move you into a mode that combines the hard aspects of Open QA and Few

00:02:05.180 | Shot QA, and that is Few Shot Open QA.

00:02:07.940 | In this mode, you do not have a gold evidence passage, and you are compelled to use only

00:02:14.820 | frozen language models to do the QA part.

00:02:18.980 | We're going to have a retrieval mechanism for you.

00:02:21.100 | You could do some fine-tuning of it, but we're not going to explore that in this homework.

00:02:25.060 | That could be left for projects.

00:02:27.100 | So really, in the end, what you're left with is a frozen retrieval model, a frozen language

00:02:31.380 | model, and on that basis, you need to figure out how to answer questions effectively.

00:02:36.740 | Just to repeat, your situation is a difficult one.

00:02:40.820 | During development, you will have gold QA pairs, but at test time, all you're going

00:02:46.620 | to have is questions, no gold passages or any other associated data.

00:02:51.500 | You will see this in the bake-off file.

00:02:53.920 | It is simply a list of questions that you need to figure out how to answer.

00:02:58.520 | Very difficult indeed.

00:02:59.520 | I feel like this task would not even have been posable in 2018, and when we first did

00:03:05.040 | it last year, I worried that it might be too difficult, but people did incredible things,

00:03:10.580 | and I think you're going to do incredible things with this seemingly almost impossible

00:03:15.220 | task.

00:03:16.220 | But just to emphasize here, you have to operate throughout this with frozen components.

00:03:21.540 | You cannot train any LLMs.

00:03:24.020 | All you can do is in-context learning with frozen models, but I assure you, you'll get

00:03:29.580 | traction on this problem.

00:03:32.740 | Just as a reminder, for that task that I mentioned, few-shot QA, that is the one that was posed

00:03:37.900 | in the GPT-3 paper.

00:03:39.540 | Here's an example from their appendix.

00:03:41.300 | It's a SQuAD example.

00:03:42.980 | You can see you have this gold passage that you prompt your language model with.

00:03:47.600 | You give it a demonstration QA pair, and then you have your final target question.

00:03:53.500 | The demonstration follows the substring guarantee into the gold evidence passage, and so does

00:03:58.820 | the answer.

00:03:59.820 | And that's how they posed this, and they did pretty well at it.

00:04:02.980 | And just as a check, I tried text-davinci-002 with exactly this example and got the right

00:04:08.300 | answer, so no regression there.

00:04:10.060 | They can still do few-shot QA with SQuAD with these deployed models.

00:04:16.100 | But as I said, that's not your task.

00:04:17.900 | Your task is harder.

00:04:19.300 | In your setting, you're just given a question, and the task is to answer it.

00:04:24.860 | And a standard baseline in this mode is what I've called retrieve-then-read.

00:04:29.500 | So the way this would work is that you'll rely on a retrieval mechanism to find a context

00:04:34.780 | passage that's relevant for this question.

00:04:38.860 | And then you might add in for few-shot retrieve-then-read some demonstrations, and you could get those

00:04:43.780 | from the SQuAD dataset that we provide to you, or you could try to get them from somewhere

00:04:48.140 | else.

00:04:49.180 | You could also get from your train set, or retrieve, an answer to that demonstration question,

00:04:55.380 | and the same thing for the context.

00:04:57.180 | SQuAD provides all of these as gold evidence, but it's conceivable that you would want to

00:05:02.380 | retrieve answers or predict answers and retrieve passages so that your system learns from demonstrations

00:05:10.180 | that are kind of like the actual situation that you have down here where there's no gold

00:05:14.580 | passages and no gold answers, just questions, and everything else has to be found somewhere.
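
To make the retrieve-then-read idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the three-passage `CORPUS`, the word-overlap `retrieve` function (a toy stand-in for a real retriever), and the `Context/Question/Answer` prompt layout. It retrieves a passage for the demonstration question as well as for the target question, so the demonstration resembles the real situation.

```python
import re

# Hypothetical three-passage corpus standing in for a real index.
CORPUS = [
    "Alaska and Hawaii are the only US states that border no other US state.",
    "SQuAD is a reading comprehension dataset built from Wikipedia articles.",
    "Sacramento is the capital of the US state of California.",
]

def tokens(text):
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, k=1):
    """Toy retriever: rank passages by word overlap with the query."""
    q = tokens(query)
    ranked = sorted(CORPUS, key=lambda p: len(q & tokens(p)), reverse=True)
    return ranked[:k]

def build_prompt(question, demos):
    """Assemble a prompt: demonstrations with retrieved context, then the target."""
    blocks = []
    for demo in demos:
        context = retrieve(demo["question"], k=1)[0]
        blocks.append(
            f"Context: {context}\nQuestion: {demo['question']}\nAnswer: {demo['answer']}"
        )
    context = retrieve(question, k=1)[0]
    # The target answer is left open for the frozen language model to complete.
    blocks.append(f"Context: {context}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(blocks)

demos = [{"question": "What is the capital of California?", "answer": "Sacramento"}]
prompt = build_prompt("Which US states border no US states?", demos)
print(prompt)
```

The resulting string is what would be sent to the frozen language model; nothing is trained at any point.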

00:05:22.340 | For the assignment itself, we're pushing to use the demonstrate-search-predict programming

00:05:27.820 | library, and the vision behind this library is that we're going to make prompt engineering

00:05:32.340 | proper software engineering, where you write a little program as opposed to typing out

00:05:37.340 | a prompt from scratch.

00:05:39.140 | And the idea here is that that opens up a whole new design space and really gets us

00:05:43.300 | thinking in new ways about how to design AI systems in this modern mode that are essentially

00:05:49.380 | prompting frozen pre-trained components and getting them to work in concert to do complicated

00:05:54.900 | things that we want done.

00:05:56.860 | So this is a diagram from the DSP paper, and you're going to be writing little programs

00:06:01.060 | that look like this.

00:06:04.300 | For the notebook itself, we begin with some setup, and what you see happening here is

00:06:08.260 | that we're kind of connecting with some large language model vendors who provide powerful

00:06:13.340 | frozen language models that you can use.

00:06:16.020 | Here's the key for OpenAI, and here's the one for Cohere.

00:06:19.660 | These are not supplied to you, so you need to get set up separately with your own API

00:06:24.140 | keys.

00:06:25.140 | With Cohere, you can use the models for free, and for OpenAI, when you open an account, you

00:06:29.180 | get some small number of free credits to use for their models.

00:06:34.900 | And then finally, we set you up with a ColBERT server.

00:06:37.200 | This is an index that we created that will provide you with a very rich retrieval mechanism.

00:06:43.200 | And then in the cell down here, we set up the language model using the DSP library.

00:06:48.300 | Here I'm using text-davinci-001, and I've got my OpenAI key associated with it.

00:06:53.660 | And there's the commented out version for doing this with Cohere models.

00:06:57.780 | And here I set up the retrieval mechanism.

00:06:59.820 | And the final piece here is to just configure the DSP library so that you're using that LM

00:07:05.960 | and that retrieval mechanism.

00:07:08.180 | So that's it by way of setup.

00:07:10.580 | One thing I wanted to pause on here is the appearance of SQuAD in the notebook.

00:07:16.300 | That might surprise you because SQuAD is a closed standard QA formulation of the task

00:07:22.220 | where you're given gold passages and so forth.

00:07:24.880 | So I want to emphasize that the role of SQuAD here is to provide you with some training

00:07:29.140 | and dev examples.

00:07:30.460 | And I put "train" in quotation marks there because, of course, you can't train any systems.

00:07:36.100 | But you can use the train portion of SQuAD to construct demonstrations for your passages

00:07:41.580 | and other things like that.

00:07:43.580 | So in essence, SQuAD is providing train data, gold QA pairs, maybe with gold passages that

00:07:49.580 | you'll make use of that you can use for demonstrations.

00:07:52.540 | And SQuAD also provides dev QA pairs that we can use to simulate your actual situation

00:07:58.580 | so that you can figure out how well your system is going to do at test time, that is, on the

00:08:04.140 | bake-off.

00:08:05.740 | So that's why you get this section, SQuAD train, SQuAD dev, SQuAD dev sample.

00:08:11.020 | That final thing there is just because you should keep in mind that in this mode, evaluations

00:08:16.620 | can be quite expensive, especially if you're paying OpenAI for each one of its API calls.

00:08:21.580 | And so you'll want to do evaluations on small data sets and do them only sparingly.

00:08:26.400 | So I've provided you with a tiny sample of 200 dev examples to kind of use in a very

00:08:31.900 | controlled way, although, honestly, even that can get quite expensive.

00:08:35.820 | And so I would do even those kind of quantitative evaluations only sparingly.
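
As a rough illustration of a small, controlled evaluation, here is a sketch that scores a system on a fixed random sample. The screencast does not say which metric is used; this assumes SQuAD-style exact match, and the `dev` examples and `always_sacramento` system are made up for the example.

```python
import random
import re
import string

def normalize(text):
    """SQuAD-style normalization: lowercase, strip punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """True if the normalized prediction equals any normalized gold answer."""
    return any(normalize(prediction) == normalize(g) for g in gold_answers)

def evaluate(system, examples, sample_size, seed=0):
    """Score `system` on a fixed random sample to limit the number of model calls."""
    rng = random.Random(seed)
    sample = rng.sample(examples, min(sample_size, len(examples)))
    hits = sum(exact_match(system(ex["question"]), ex["answers"]) for ex in sample)
    return hits / len(sample)

# Hypothetical dev examples and a trivial stand-in system.
dev = [
    {"question": "Which US states border no US states?", "answers": ["Alaska, Hawaii"]},
    {"question": "What is the capital of California?", "answers": ["Sacramento"]},
]

def always_sacramento(question):
    return "the Sacramento."

score = evaluate(always_sacramento, dev, sample_size=2)
print(score)
```

Fixing the seed means repeated runs hit the same examples, which makes results comparable across system variants. If the model calls themselves are cached, the same examples will not be billed twice.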

00:08:42.720 | With that background in place, we can begin thinking about using DSP itself.

00:08:47.300 | One nice thing about DSP is that it gives us very easy access to a language model.

00:08:51.900 | So what I'm showing in this cell 13 here is a direct call to the language model with the

00:08:56.900 | string "Which US states border no US states?"

00:08:59.540 | And you can see it's given me a list of responses, with some messiness from all their newlines

00:09:04.940 | here.

00:09:06.000 | You can add in keyword parameters to the underlying language model if that model honors them,

00:09:11.100 | and that will affect the behavior of this function call.

00:09:13.660 | So here I've called it with temperature 0.9, and I'm getting four responses back.

00:09:18.180 | And you can see it's listed them out in a list there.

00:09:22.260 | Another nice thing about DSP is that if you call lm.inspect_history, it will show you the

00:09:27.420 | previous calls to the language model, and it's formatted those quite nicely.

00:09:31.180 | So if you're uncertain about what you're feeding into your model, and that can happen with

00:09:35.540 | DSP, you can call inspect_history and get a look at what you actually did.
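
The pattern described here, a callable language model that forwards keyword parameters and remembers its calls, can be sketched without any API access. This is not the DSP implementation: `RecordingLM` and `fake_complete` are hypothetical stand-ins, and the `n` and `temperature` parameter names are assumptions modeled on common completion APIs.

```python
class RecordingLM:
    """Hypothetical wrapper that logs every prompt sent to a completion function."""

    def __init__(self, complete):
        self.complete = complete  # callable: (prompt, **kwargs) -> list of strings
        self.history = []

    def __call__(self, prompt, **kwargs):
        completions = self.complete(prompt, **kwargs)
        self.history.append(
            {"prompt": prompt, "kwargs": kwargs, "completions": completions}
        )
        return completions

    def inspect_history(self, n=1):
        """Return a readable dump of the last n calls."""
        lines = []
        for call in self.history[-n:]:
            lines.append(call["prompt"])
            lines.extend(f"  -> {c}" for c in call["completions"])
        return "\n".join(lines)

def fake_complete(prompt, n=1, temperature=0.0):
    """Stand-in for a real API call; returns n canned completions."""
    return [f"completion {i}" for i in range(n)]

lm = RecordingLM(fake_complete)
responses = lm("Which US states border no US states?", temperature=0.9, n=4)
print(len(responses))
print(lm.inspect_history(n=1))
```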

00:09:43.140 | Now mostly for DSP, you won't call the language model directly the way we just did.

00:09:48.460 | You will rely on DSP templates to kind of format prompts to the language model and also

00:09:54.980 | extract information from the generated answer to use as kind of the basis for your system.

00:10:01.660 | So here I've set up a very simple template.

00:10:03.700 | This appears in the notebook as the QA template.

00:10:05.740 | It's got question and answer components and includes some instructions.

00:10:10.380 | And then for example, if you create a DSP example from our running case, "Which US states

00:10:15.740 | border no US states?", and you call it with a sample of two SQuAD training instances to

00:10:21.180 | use as demonstrations, you can feed that through your template.

00:10:25.100 | And what you get is something that looks like this, where our target question is at the

00:10:29.820 | bottom waiting to be answered.

00:10:32.140 | And here are those two demonstrations that we sampled from the SQuAD train set.

00:10:37.980 | Here are the instructions and here's some formatting stuff that comes from the template.

00:10:42.620 | And this is a pretty good standard mode for all these modern large language models to

00:10:47.420 | help them do in-context learning and figure out what you want them to do based on information

00:10:52.700 | in the prompt and the demonstrations you've provided.
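
A template of this kind does two jobs: it renders a prompt from instructions, demonstrations, and the target question, and it extracts a field from the generated text. Here is a minimal plain-Python sketch of both. It does not use DSP's template classes; the instruction wording, the `---` separator, and the `Question:`/`Answer:` labels are assumptions for illustration.

```python
# Hypothetical instruction text; the notebook's wording may differ.
INSTRUCTIONS = "Answer questions with short factoid answers."

def format_example(question, answer=None):
    """Render one Question/Answer block; leave the answer open for the target."""
    answer_part = f" {answer}" if answer is not None else ""
    return f"Question: {question}\nAnswer:{answer_part}"

def render_prompt(question, demos):
    """Instructions first, then demonstrations, then the unanswered target."""
    parts = [INSTRUCTIONS, "---"]
    parts += [format_example(d["question"], d["answer"]) for d in demos]
    parts.append(format_example(question))
    return "\n\n".join(parts)

def extract_answer(completion):
    """Keep only the first line of the generated text as the answer field."""
    return completion.strip().split("\n")[0].strip()

demos = [
    {"question": "What is the capital of California?", "answer": "Sacramento"},
    {"question": "What is the capital of France?", "answer": "Paris"},
]
prompt = render_prompt("Which US states border no US states?", demos)
print(prompt)
print(extract_answer(" Alaska, Hawaii\n\nQuestion: What is ..."))
```

Cutting the completion at the first line is one simple way to stop a model that keeps generating further question and answer pairs after its answer.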

00:10:57.500 | And here to kind of put those pieces together, you have in DSP what I've called prompt-based

00:11:01.620 | generation.

00:11:02.620 | So dsp.generate, you feed that through a template and that gives you a kind of generator function

00:11:08.060 | that when called on a DSP example, will give you back some responses.

00:11:13.300 | And here's the answer value from the completions.

00:11:16.820 | "Alaska, Hawaii" is how it has answered the question, "Which US states border no US states?"

00:11:23.180 | And again, if you feel unsure, you can call inspect_history and you'll see exactly what

00:11:26.700 | happened.

00:11:27.700 | It looks like this.

00:11:28.700 | And there's that prompt again with our two sample demonstrations.

00:11:32.680 | And there's the generated response in green.

00:11:37.820 | The other part of this assignment is thinking about retrieval.

00:11:41.880 | And as I said before, for that, we have given you a ColBERT index and a ColBERT retrieval

00:11:46.800 | mechanism that you can use.

00:11:48.800 | And you can mostly just treat that as a very effective retrieval mechanism.

00:11:53.340 | Here's a question that we can use as an example.

00:11:56.380 | We've got retrieve, which, when given a string and some number of passages we want in response,

00:12:01.660 | will give you back a list of passages that you can use for constructing prompts and so

00:12:06.420 | forth.

00:12:07.980 | As with the language model, if you need deeper access to the retrieval mechanism, you can

00:12:12.660 | call it directly with rm called on a string.

00:12:16.260 | And that will allow you to have a bunch of other keyword parameters in here and give

00:12:20.140 | you more information back than just the list of passages, for example, scores and other

00:12:25.060 | things that go along with retrieval.

00:12:27.020 | So that's there if you want to design more advanced systems as part of your original

00:12:31.980 | system.

00:12:34.340 | All of these things come together in the first part of the notebook.

00:12:38.740 | This is a little DSP program that is a complete solution to few-shot open QA.

00:12:45.400 | It's just this tiny program.

00:12:46.860 | Let's walk through it.

00:12:48.380 | First, keep this in mind, use this decorator, dsp.transformation, on all of your programs

00:12:55.860 | so that your programs don't modify the DSP examples that come in.

00:13:00.260 | You'll be augmenting them with demonstrations and maybe changing the fields and you don't

00:13:04.740 | want that to have an in-place impact on, for example, the SQuAD dataset that you have

00:13:09.620 | loaded in.

00:13:10.620 | So as a precaution, always add this decorator to all your DSP functions and your life will

00:13:16.420 | be more sensible.
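
The point of the decorator is that a program works on a copy, so the loaded dataset is never changed in place. The sketch below illustrates that idea only. It is not the library's implementation: the `transformation` decorator here is a hypothetical stand-in that deep-copies its first argument, and examples are plain dictionaries.

```python
import copy
import functools

def transformation(fn):
    """Hypothetical stand-in: run fn on a deep copy of its first argument."""
    @functools.wraps(fn)
    def wrapper(example, *args, **kwargs):
        return fn(copy.deepcopy(example), *args, **kwargs)
    return wrapper

@transformation
def add_demos(example, demos):
    # Mutates only the copy made by the decorator.
    example["demos"] = demos
    return example

original = {"question": "Which US states border no US states?"}
augmented = add_demos(original, demos=[{"question": "q", "answer": "a"}])
print("demos" in original, "demos" in augmented)  # prints: False True
```

Without the copy, every call would leave a `demos` field behind on the shared example, and later programs would silently inherit it.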

00:13:19.900 | Next, programs operate on DSP example instances, individual ones.

00:13:26.700 | That's what's coming in here.

00:13:27.980 | Keep that in mind.

00:13:30.180 | In the first line of this little program, we sample K random demonstrations from the

00:13:34.540 | SQuAD train set by default.

00:13:37.100 | And K is the user-supplied parameter there.

00:13:39.380 | That gives us a demonstrations attribute on the example that came in.

00:13:46.020 | Then we use our QA template as before.

00:13:48.300 | I defined that on slide 12.

00:13:50.820 | That gives us this new modified generator function for the language model.

00:13:57.100 | That's the modified generator.

00:13:58.420 | We call that on an example and we get back completions as well as a copy of the example.

00:14:03.940 | DSP completions is the heart of this.

00:14:06.140 | It will have an answer attribute because of the QA template.

00:14:09.940 | And that's the field that we'll use as the answer that the system has responded to the

00:14:14.580 | question with.
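
The shape of that program (sample demonstrations, build a generator from the template, call it, read off the answer field) can be sketched with plain-Python stand-ins. None of the names below are the DSP API: `sample`, `qa_template`, `generate`, the `TRAIN` list, and `fake_lm` are all hypothetical, and the generator here returns only an annotated copy of the example rather than a separate completions object.

```python
import copy
import random

# Hypothetical stand-in for the train split.
TRAIN = [
    {"question": "What is the capital of California?", "answer": "Sacramento"},
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "How many US states are there?", "answer": "50"},
]

def sample(train, k, seed=0):
    """Pick k random demonstrations."""
    return random.Random(seed).sample(train, k)

def qa_template(example):
    """Render demonstrations followed by the unanswered target question."""
    blocks = [f"Question: {d['question']}\nAnswer: {d['answer']}" for d in example["demos"]]
    blocks.append(f"Question: {example['question']}\nAnswer:")
    return "\n\n".join(blocks)

def generate(template, lm):
    """Return a function that prompts lm via template and fills in the answer field."""
    def generator(example):
        example = copy.deepcopy(example)
        example["answer"] = lm(template(example)).strip()
        return example
    return generator

def few_shot_openqa(example, lm, train=TRAIN, k=2):
    example = copy.deepcopy(example)       # never modify the caller's example
    example["demos"] = sample(train, k)    # step 1: sample k demonstrations
    completion = generate(qa_template, lm)(example)  # step 2: prompt the frozen LM
    return completion["answer"]            # step 3: read the answer field

def fake_lm(prompt):
    """Stand-in for a frozen language model."""
    return " Alaska, Hawaii"

query = {"question": "Which US states border no US states?"}
answer = few_shot_openqa(query, fake_lm)
print(answer)  # prints: Alaska, Hawaii
```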

00:14:16.660 | So that's the complete program for few-shot open QA, and the assignment questions essentially

00:14:22.620 | ask you to write very similar programs.

00:14:25.500 | Both of the questions are DSP programs just like that one.

00:14:29.480 | Question one is few-shot open QA with context.

00:14:32.460 | So a small modification of the program I just showed you where you add in context passages

00:14:37.520 | that you retrieve using ColBERT.

00:14:40.620 | And then question two asks you to use the annotate primitive from DSP which is very

00:14:45.180 | powerful as a mechanism for doing lots of things.

00:14:48.860 | And what we have you use it to do is construct demonstrations that will be especially effective

00:14:54.540 | for guiding the system toward the behaviors that you want it to have in context.
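
One way to picture demonstration construction is as a filter over training examples: apply a function to each one, keep the results it does not reject, and stop after k. The sketch below assumes that reading. It is not DSP's annotate primitive; `annotate`, `add_context`, the `PASSAGES` lookup, and the rule "keep a demonstration only if its gold answer appears in the retrieved passage" are all hypothetical choices for illustration.

```python
# Hypothetical retrieved passages, keyed by question, standing in for a retriever.
PASSAGES = {
    "What is the capital of California?": "Sacramento is the capital of California.",
    "What is the capital of France?": "France is a country in Western Europe.",
    "How many US states are there?": "The United States is made up of 50 states.",
}

def retrieve(question):
    """Toy lookup standing in for a real retriever."""
    return PASSAGES[question]

def add_context(example):
    """Attach a retrieved passage; reject the example if the gold answer is not in it."""
    context = retrieve(example["question"])
    if example["answer"].lower() not in context.lower():
        return None
    return {**example, "context": context}

def annotate(transform, train, k):
    """Apply transform to training examples, keeping the first k it does not reject."""
    kept = []
    for example in train:
        if len(kept) >= k:
            break
        result = transform(dict(example))
        if result is not None:
            kept.append(result)
    return kept

train = [
    {"question": "What is the capital of California?", "answer": "Sacramento"},
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "How many US states are there?", "answer": "50"},
]
demos = annotate(add_context, train, k=2)
print([d["question"] for d in demos])
```

The France example is dropped because its retrieved passage does not contain the gold answer, so every surviving demonstration shows the model a passage that actually supports its answer.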

00:15:01.100 | And then having done those things, you design your original system.

00:15:04.520 | We expect your original system to be a DSP program because we think we've provided you

00:15:09.960 | lots of primitives for writing really powerful DSP programs that will be really interesting

00:15:14.900 | solutions to the problems we've posed.

00:15:17.700 | But it is not required.

00:15:18.860 | As before, original systems can take many forms and you should feel free to use whatever

00:15:23.500 | techniques you would like.

00:15:26.260 | If you would like to explore DSP even further, Omar has created an intro notebook that walks

00:15:32.960 | through additional advanced programs for really hard QA problems.

00:15:37.180 | We're going to talk about a bunch of those techniques when we talk about in-context learning.

00:15:41.340 | A lot of the most powerful ones are exemplified in that intro notebook where we show you how

00:15:47.020 | we have DSP primitives that make it very easy to make use of those concepts.

00:15:53.100 | So that's a powerful next step that you might check out as part of thinking about your original

00:15:57.620 | system.
