
Stanford XCS224U: Natural Language Understanding | Homework 2 | Spring 2023


Chapters

0:00
3:32 GPT-3 paper: Few-shot QA
4:38 Few-shot retrieve-then-read
6:03 Set-up
7:09 SQuAD for "train" and dev
9:42 Templates
10:57 Prompt-based generation
12:33 Few-shot OpenQA
14:17 Assignment questions

Whisper Transcript

00:00:00.000 | Welcome, everyone.
00:00:06.280 | This screencast is an overview of assignment two and its associated bake-off.
00:00:11.340 | The name of this combination is "Few Shot Open QA with DSP," and part of the function
00:00:16.260 | of this screencast is to unpack that complicated-sounding title.
00:00:20.880 | Let's begin with a review of different question-answering tasks, and keep in mind that the task you're
00:00:25.980 | confronted with for this assignment and bake-off is the one in the final row, which is very
00:00:30.660 | difficult indeed.
00:00:32.180 | Let's begin at the top.
00:00:33.580 | QA, standard QA, the way this is formulated in the modern phase, as in datasets like SQuAD,
00:00:39.780 | is that you're given a gold evidence passage, and the name of the game is to train a QA
00:00:44.820 | reader that will learn to find answers to questions in those evidence passages.
00:00:49.820 | And in this mode, we don't have a retriever at all, so I put "NA" here.
00:00:54.300 | We've talked about Open QA.
00:00:55.820 | This is the variant where we're not given a passage, but rather we need to learn to
00:01:00.220 | retrieve relevant passages, and then in the standard mode, we train a QA module to learn
00:01:06.500 | how to find answers to questions in those retrieved passages.
00:01:10.260 | And that is already substantially harder because now we have to retrieve good evidence, and
00:01:14.940 | we don't have a guarantee that the answer will even be findable in the evidence that
00:01:18.980 | we've retrieved.
00:01:21.500 | Few Shot QA is something we haven't discussed yet.
00:01:24.100 | This is the task that was really introduced in the GPT-3 paper, and it's hard along a
00:01:29.100 | different dimension.
00:01:30.800 | In this mode, we are given a gold passage.
00:01:33.100 | We could use SQuAD, for example, as the basis for the task, but we're not allowed to do
00:01:38.240 | any task-specific reader training.
00:01:41.020 | We have to rely on a frozen large language model to learn in context how to do what we
00:01:47.020 | want it to do.
00:01:48.780 | And in this mode, there's no retrieval because we just rely on the closed nature of a task
00:01:53.620 | like SQuAD.
00:01:55.940 | That's already hard enough because you don't get to do any task-specific training.
00:02:00.940 | We are going to move you into a mode that combines the hard aspects of Open QA and Few
00:02:05.180 | Shot QA, and that is Few Shot Open QA.
00:02:07.940 | In this mode, you do not have a gold evidence passage, and you are compelled to use only
00:02:14.820 | frozen language models to do the QA part.
00:02:18.980 | We're going to have a retrieval mechanism for you.
00:02:21.100 | You could do some fine-tuning of it, but we're not going to explore that in this homework.
00:02:25.060 | That could be left for projects.
00:02:27.100 | So really, in the end, what you're left with is a frozen retrieval model, a frozen language
00:02:31.380 | model, and on that basis, you need to figure out how to answer questions effectively.
00:02:36.740 | Just to repeat, your situation is a difficult one.
00:02:40.820 | During development, you will have gold QA pairs, but at test time, all you're going
00:02:46.620 | to have is questions, no gold passages or any other associated data.
00:02:51.500 | You will see this in the bake-off file.
00:02:53.920 | It is simply a list of questions that you need to figure out how to answer.
00:02:58.520 | Very difficult indeed.
00:02:59.520 | I feel like this task would not even have been posable in 2018, and when we first did
00:03:05.040 | it last year, I worried that it might be too difficult, but people did incredible things,
00:03:10.580 | and I think you're going to do incredible things with this seemingly almost impossible
00:03:15.220 | task.
00:03:16.220 | But just to emphasize here, you have to operate throughout this with frozen components.
00:03:21.540 | You cannot train any LLMs.
00:03:24.020 | All you can do is in-context learning with frozen models, but I assure you, you'll get
00:03:29.580 | traction on this problem.
00:03:32.740 | Just as a reminder, for that task that I mentioned, few-shot QA, that is the one that was posed
00:03:37.900 | in the GPT-3 paper.
00:03:39.540 | Here's an example from their appendix.
00:03:41.300 | It's a SQuAD example.
00:03:42.980 | You can see you have this gold passage that you prompt your language model with.
00:03:47.600 | You give it a demonstration QA pair, and then you have your final target question.
00:03:53.500 | The demonstration follows the substring guarantee into the gold evidence passage, and so does
00:03:58.820 | the answer.
00:03:59.820 | And that's how they posed this, and they did pretty well at it.
00:04:02.980 | And just as a check, I tried text-davinci-002 with exactly this example and got the right
00:04:08.300 | answer, so no regression there.
00:04:10.060 | They can still do few-shot QA with SQuAD with these deployed models.
00:04:16.100 | But as I said, that's not your task.
00:04:17.900 | Your task is harder.
00:04:19.300 | In your setting, you're just given a question, and the task is to answer it.
00:04:24.860 | And a standard baseline in this mode is what I've called retrieve-then-read.
00:04:29.500 | So the way this would work is that you'll rely on a retrieval mechanism to find a context
00:04:34.780 | passage that's relevant for this question.
00:04:38.860 | And then you might add in for few-shot retrieve-then-read some demonstrations, and you could get those
00:04:43.780 | from the SQuAD data set that we provide to you, or you could try to get them from somewhere
00:04:48.140 | else.
00:04:49.180 | You could also get the answer to that demonstration question from your train set, or retrieve one,
00:04:55.380 | and the same thing for the context.
00:04:57.180 | SQuAD provides all of these as gold evidence, but it's conceivable that you would want to
00:05:02.380 | retrieve answers or predict answers and retrieve passages so that your system learns from demonstrations
00:05:10.180 | that are kind of like the actual situation that you have down here, where there are no gold
00:05:14.580 | passages and no gold answers, just questions, and everything else has to be found somewhere.
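
To make the retrieve-then-read idea concrete, here is a minimal sketch in plain Python. It is not the assignment's code and does not use the DSP library: the toy corpus, the word-overlap `retrieve` function, and `fake_lm` are all hypothetical stand-ins for the real index, the real retriever, and a frozen language model. It only illustrates how a context passage can be retrieved for each demonstration and for the target question before the prompt is handed to the model.

```python
from typing import Callable, Dict, List

# Toy corpus standing in for a real passage index (hypothetical data).
CORPUS = [
    "Alaska and Hawaii are the only US states that border no other US state.",
    "Sacramento is the capital of California.",
    "The Mississippi River flows into the Gulf of Mexico.",
]


def retrieve(query: str, k: int = 1) -> List[str]:
    """Rank passages by word overlap with the query (a crude stand-in retriever)."""
    query_words = set(query.lower().split())

    def overlap(passage: str) -> int:
        return len(query_words & set(passage.lower().split()))

    return sorted(CORPUS, key=overlap, reverse=True)[:k]


def build_prompt(question: str, demos: List[Dict[str, str]]) -> str:
    """Attach a retrieved context to every demonstration and to the target question."""
    blocks = []
    for demo in demos:
        context = retrieve(demo["question"])[0]
        blocks.append(
            f"Context: {context}\nQuestion: {demo['question']}\nAnswer: {demo['answer']}"
        )
    context = retrieve(question)[0]
    # The target question comes last, with the answer left open for the model.
    blocks.append(f"Context: {context}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(blocks)


def retrieve_then_read(
    question: str, demos: List[Dict[str, str]], lm: Callable[[str], str]
) -> str:
    """Retrieve, build the prompt, and let the (frozen) language model read it."""
    return lm(build_prompt(question, demos)).strip()


def fake_lm(prompt: str) -> str:
    """Placeholder for a frozen language model; always returns the same string."""
    return " Alaska, Hawaii"
```

In a real system the demonstrations would come from the SQuAD train split, and `lm` would wrap an API call to a frozen model.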
00:05:22.340 | For the assignment itself, we're pushing you to use the Demonstrate-Search-Predict programming
00:05:27.820 | library, and the vision behind this library is that we're going to make prompt engineering
00:05:32.340 | proper software engineering, where you write a little program as opposed to typing out
00:05:37.340 | a prompt from scratch.
00:05:39.140 | And the idea here is that that opens up a whole new design space and really gets us
00:05:43.300 | thinking in new ways about how to design AI systems in this modern mode that are essentially
00:05:49.380 | prompting frozen pre-trained components and getting them to work in concert to do complicated
00:05:54.900 | things that we want done.
00:05:56.860 | So this is a diagram from the DSP paper, and you're going to be writing little programs
00:06:01.060 | that look like this.
00:06:04.300 | For the notebook itself, we begin with some setup, and what you see happening here is
00:06:08.260 | that we're kind of connecting with some large language model vendors who provide powerful
00:06:13.340 | frozen language models that you can use.
00:06:16.020 | Here's the key for OpenAI, and here's the one for Cohere.
00:06:19.660 | These are not supplied to you, so you need to get set up separately with your own API
00:06:24.140 | keys.
00:06:25.140 | For Cohere, you can use the models for free, and for OpenAI, when you open an account, you
00:06:29.180 | get some small number of free credits to use for their models.
00:06:34.900 | And then finally, we set you up with a ColBERT server.
00:06:37.200 | This is an index that we created that will provide you with a very rich retrieval mechanism.
00:06:43.200 | And then in the cell down here, we set up the language model using the DSP library.
00:06:48.300 | Here I'm using text-davinci-001, and I've got my OpenAI key associated with it.
00:06:53.660 | And there's the commented out version for doing this with Cohere models.
00:06:57.780 | And here I set up the retrieval mechanism.
00:06:59.820 | And the final piece here is to just configure DSP's settings so that you're using that LM
00:07:05.960 | and that retrieval mechanism.
00:07:08.180 | So that's it by way of setup.
00:07:10.580 | One thing I wanted to pause on here is the appearance of SQuAD in the notebook.
00:07:16.300 | That might surprise you because SQuAD is a closed standard QA formulation of the task
00:07:22.220 | where you're given gold passages and so forth.
00:07:24.880 | So I want to emphasize that the role of SQuAD here is to provide you with some training
00:07:29.140 | and dev examples.
00:07:30.460 | And I put "train" in quotation marks there because, of course, you can't train any systems.
00:07:36.100 | But you can use the train portion of SQuAD to construct demonstrations for your prompts
00:07:41.580 | and other things like that.
00:07:43.580 | So in essence, SQuAD is providing train data: gold QA pairs, maybe with gold passages,
00:07:49.580 | that you can use for demonstrations.
00:07:52.540 | And SQuAD also provides dev QA pairs that we can use to simulate your actual situation
00:07:58.580 | so that you can figure out how well your system is going to do at test time, that is, on the
00:08:04.140 | bake-off.
00:08:05.740 | So that's why you get this section, SQuAD train, SQuAD dev, SQuAD dev sample.
00:08:11.020 | That final thing there is just because you should keep in mind that in this mode, evaluations
00:08:16.620 | can be quite expensive, especially if you're paying OpenAI for each one of its API calls.
00:08:21.580 | And so you'll want to do evaluations on small data sets and do them only sparingly.
00:08:26.400 | So I've provided you with a tiny sample of 200 dev examples to kind of use in a very
00:08:31.900 | controlled way, although, honestly, even that can get quite expensive.
00:08:35.820 | And so I would do even those kind of quantitative evaluations only sparingly.
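
As a rough illustration of evaluating sparingly on a small fixed sample, the sketch below scores a system with SQuAD-style exact match. It is an assumption that exact match is the metric of interest here; the function names, the dictionary layout of the examples, and the seed are hypothetical and not taken from the notebook.

```python
import random
import re
import string
from typing import Callable, Dict, Iterable, List


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: Iterable[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    return normalize(prediction) in {normalize(gold) for gold in gold_answers}


def evaluate(
    system: Callable[[str], str],
    dev_examples: List[Dict],
    sample_size: int = 200,
    seed: int = 1,
) -> float:
    """Score `system` on a fixed random sample so repeated runs stay comparable."""
    rng = random.Random(seed)
    sample = rng.sample(dev_examples, min(sample_size, len(dev_examples)))
    if not sample:
        return 0.0
    # Each call to `system` would be one (possibly paid) language model call.
    hits = sum(exact_match(system(ex["question"]), ex["answers"]) for ex in sample)
    return hits / len(sample)
```

Fixing the seed means every run sees the same examples, so a change in score reflects a change in the system rather than a change in the sample.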
00:08:42.720 | With that background in place, we can begin thinking about using DSP itself.
00:08:47.300 | One nice thing about DSP is that it gives us very easy access to a language model.
00:08:51.900 | So what I'm showing in this cell 13 here is a direct call to the language model with the
00:08:56.900 | string "Which US states border no US states?"
00:08:59.540 | And you can see it's given me a list of responses, with some messiness from all their newlines
00:09:04.940 | here.
00:09:06.000 | You can add in keyword parameters to the underlying language model if that model honors them,
00:09:11.100 | and that will affect the behavior of this function call.
00:09:13.660 | So here I've called it with temperature 0.9, and I'm getting four responses back.
00:09:18.180 | And you can see it's listed them out in a list there.
00:09:22.260 | Another nice thing about DSP is that if you call lm.inspect_history, it will show you the
00:09:27.420 | previous calls to the language model, and it formats those quite nicely.
00:09:31.180 | So if you're uncertain about what you're feeding into your model, and that can happen with
00:09:35.540 | DSP, you can call inspect_history and get a look at what you actually did.
00:09:43.140 | Now mostly for DSP, you won't call the language model directly the way we just did.
00:09:48.460 | You will rely on DSP templates to kind of format prompts to the language model and also
00:09:54.980 | extract information from the generated answer to use as kind of the basis for your system.
00:10:01.660 | So here I've set up a very simple template.
00:10:03.700 | This happens in the notebook QA template.
00:10:05.740 | It's got question and answer components and includes some instructions.
00:10:10.380 | And then for example, if you create a DSP example from our running case, "Which US states
00:10:15.740 | border no US states?", and you call it with a sample of two SQuAD training instances to
00:10:21.180 | use as demonstrations, you can feed that through your template.
00:10:25.100 | And what you get is something that looks like this, where our target question is at the
00:10:29.820 | bottom waiting to be answered.
00:10:32.140 | And here are those two demonstrations that we sampled from the SQuAD train set.
00:10:37.980 | Here are the instructions and here's some formatting stuff that comes from the template.
00:10:42.620 | And this is a pretty good standard mode for all these modern large language models to
00:10:47.420 | help them do in-context learning and figure out what you want them to do based on information
00:10:52.700 | in the prompt and the demonstrations you've provided.
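
The template idea can be sketched without the DSP library. The class below is a hypothetical stand-in, not DSP's own template type: it lays out instructions, demonstrations, and the open target question, and it pulls the answer back out of a raw completion. The field prefixes and the separator are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class QATemplate:
    """Hypothetical prompt template with question and answer fields."""

    instructions: str
    question_prefix: str = "Question:"
    answer_prefix: str = "Answer:"

    def render(self, question: str, demos: List[Dict[str, str]]) -> str:
        """Lay out instructions, then the demonstrations, then the open target question."""
        parts = [self.instructions, "---"]
        for demo in demos:
            parts.append(
                f"{self.question_prefix} {demo['question']}\n"
                f"{self.answer_prefix} {demo['answer']}"
            )
        parts.append(f"{self.question_prefix} {question}\n{self.answer_prefix}")
        return "\n\n".join(parts)

    def extract_answer(self, completion: str) -> str:
        """Keep only the first non-empty line of the model's completion."""
        lines = completion.strip().splitlines()
        return lines[0].strip() if lines else ""
```

Keeping formatting and extraction in one object means the prompt layout and the parsing of the model's output cannot drift apart.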
00:10:57.500 | And here to kind of put those pieces together, you have in DSP what I've called prompt-based
00:11:01.620 | generation.
00:11:02.620 | So with dsp.generate, you pass it a template and that gives you a kind of generator function
00:11:08.060 | that, when called on a DSP example, will give you back some responses.
00:11:13.300 | And here's the answer value from the completions.
00:11:16.820 | "Alaska, Hawaii" is how it has answered the question, "Which US states border no US states?"
00:11:23.180 | And again, if you feel unsure, you can call inspect_history and you'll see exactly what
00:11:26.700 | happened.
00:11:27.700 | It looks like this.
00:11:28.700 | And there's that prompt again with our two sample demonstrations.
00:11:32.680 | And there's the generated response in green.
00:11:37.820 | The other part of this assignment is thinking about retrieval.
00:11:41.880 | And as I said before, for that, we have given you a ColBERT index and a ColBERT retrieval
00:11:46.800 | mechanism that you can use.
00:11:48.800 | And you can mostly just treat that as a very effective retrieval mechanism.
00:11:53.340 | Here's a question that we can use as an example.
00:11:56.380 | We've got dsp.retrieve, which, when given a string and some number of passages we want in response,
00:12:01.660 | will give you back a list of passages that you can use for constructing prompts and so
00:12:06.420 | forth.
00:12:07.980 | As with the language model, if you need deeper access to the retrieval mechanism, you can
00:12:12.660 | call it directly, with rm called on a string.
00:12:16.260 | And that will allow you to pass a bunch of other keyword parameters in here and give
00:12:20.140 | you more information back than just the list of passages, for example, scores and other
00:12:25.060 | things that go along with retrieval.
00:12:27.020 | So that's there if you want to design more advanced systems as part of your original
00:12:31.980 | system.
00:12:34.340 | All of these things come together in the first part of the notebook.
00:12:38.740 | This is a little DSP program that is a complete solution to few shot open QA.
00:12:45.400 | It's just this tiny program.
00:12:46.860 | Let's walk through it.
00:12:48.380 | First, keep this in mind: use this decorator, dsp.transformation, on all of your programs
00:12:55.860 | so that your programs don't modify the DSP examples that come in.
00:13:00.260 | You'll be augmenting them with demonstrations and maybe changing the fields, and you don't
00:13:04.740 | want that to have an in-place impact on, for example, the SQuAD data set that you have
00:13:09.620 | loaded in.
00:13:10.620 | So as a precaution, always add this decorator to all your DSP functions and your life will
00:13:16.420 | be more sensible.
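
The reason for the decorator can be shown with a small plain-Python sketch. This is not how dsp.transformation is implemented; it is a hypothetical decorator with the same intent, namely that the wrapped function works on a copy so the caller's example is left untouched. Examples are modeled here as plain dictionaries, which is an assumption for illustration.

```python
import copy
import functools


def transformation(func):
    """Run `func` on a deep copy of the example so the caller's object is never mutated."""

    @functools.wraps(func)
    def wrapper(example, *args, **kwargs):
        return func(copy.deepcopy(example), *args, **kwargs)

    return wrapper


@transformation
def add_demos(example, demos):
    """Augment the (copied) example with demonstrations and return it."""
    example["demos"] = list(demos)
    return example


# The original example keeps its fields; only the returned copy is augmented.
original = {"question": "Which US states border no US states?"}
augmented = add_demos(original, [{"question": "Q1?", "answer": "A1"}])
```

Without the copy, every call would pile demonstrations onto the loaded dataset itself, and later runs would see examples that earlier runs had already modified.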
00:13:19.900 | Next, programs operate on DSP example instances, individual ones.
00:13:26.700 | That's what's coming in here.
00:13:27.980 | Keep that in mind.
00:13:30.180 | In the first line of this little program, we sample k random demonstrations from the
00:13:34.540 | SQuAD train set by default.
00:13:37.100 | And k is the user-supplied parameter there.
00:13:39.380 | That gives us a demonstrations attribute on the example that came in.
00:13:46.020 | Then we use our QA template as before.
00:13:48.300 | I define that on slide 12.
00:13:50.820 | That gives us this new modified generator function for the language model.
00:13:57.100 | That's the modified generator.
00:13:58.420 | We call that on an example and we get back completions as well as a copy of the example.
00:14:03.940 | DSP completions is the heart of this.
00:14:06.140 | It will have an answer attribute because of the QA template.
00:14:09.940 | And that's the field that we'll use as the answer that the system has responded to the
00:14:14.580 | question with.
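
The three steps just described (sample demonstrations, format a prompt, generate and read off the answer) can be mirrored in a few lines of plain Python. This is a sketch under assumptions, not the notebook's DSP program: the function name, the `lm` callable, the prompt layout, and the `seed` parameter are all hypothetical.

```python
import random
from typing import Callable, Dict, List


def few_shot_openqa(
    question: str,
    train: List[Dict[str, str]],
    lm: Callable[[str], str],
    k: int = 2,
    seed: int = 0,
) -> str:
    """Answer `question` with a frozen LM, using k sampled demonstrations and no retrieval."""
    # Step 1: sample k random demonstrations from the training examples.
    demos = random.Random(seed).sample(train, min(k, len(train)))

    # Step 2: format instructions, demonstrations, and the open target question.
    lines = ["Answer questions with short factoid answers.", ""]
    for demo in demos:
        lines += [f"Question: {demo['question']}", f"Answer: {demo['answer']}", ""]
    lines += [f"Question: {question}", "Answer:"]
    prompt = "\n".join(lines)

    # Step 3: call the language model and keep the first line as the answer.
    completion = lm(prompt).strip()
    return completion.splitlines()[0] if completion else ""
```

Passing the language model in as a callable keeps the function testable with a stub, which matters when each real call costs money.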
00:14:16.660 | So that's the complete program for Few-Shot Open QA, and the assignment questions essentially
00:14:22.620 | ask you to write very similar programs.
00:14:25.500 | Both of the questions are DSP programs just like that one.
00:14:29.480 | Question one is Few-Shot Open QA with context.
00:14:32.460 | So a small modification of the program I just showed you where you add in context passages
00:14:37.520 | that you retrieve using ColBERT.
00:14:40.620 | And then question two asks you to use the annotate primitive from DSP which is very
00:14:45.180 | powerful as a mechanism for doing lots of things.
00:14:48.860 | And what we have you use it to do is construct demonstrations that will be especially effective
00:14:54.540 | for guiding the system toward the behaviors that you want it to have in context.
00:15:01.100 | And then having done those things, you design your original system.
00:15:04.520 | We expect your original system to be a DSP program because we think we've provided you
00:15:09.960 | lots of primitives for writing really powerful DSP programs that will be really interesting
00:15:14.900 | solutions to the problems we've posed.
00:15:17.700 | But it is not required.
00:15:18.860 | As before, original systems can take many forms and you should feel free to use whatever
00:15:23.500 | techniques you would like.
00:15:26.260 | If you would like to explore DSP even further, Omar has created an intro notebook that walks
00:15:32.960 | through additional advanced programs for really hard QA problems.
00:15:37.180 | We're going to talk about a bunch of those techniques when we talk about in-context learning.
00:15:41.340 | A lot of the most powerful ones are exemplified in that intro notebook where we show you how
00:15:47.020 | we have DSP primitives that make it very easy to make use of those concepts.
00:15:53.100 | So that's a powerful next step that you might check out as part of thinking about your original
00:15:57.620 | system.