Stanford XCS224U: Natural Language Understanding | Homework 2 | Spring 2023
Chapters
0:00
3:32 GPT-3 paper: Few-shot QA
4:38 Few-shot retrieve-then-read
6:03 Set-up
7:09 SQuAD for "train" and dev
9:42 Templates
10:57 Prompt-based generation
12:33 Few-shot OpenQA
14:17 Assignment questions
00:00:06.280 |
This screencast is an overview of assignment two and its associated bake-off. 00:00:11.340 |
The name of this combination is "Few-shot OpenQA with DSP," and part of the function 00:00:16.260 |
of this screencast is to unpack that complicated-sounding title. 00:00:20.880 |
Let's begin with a review of different question-answering tasks, and keep in mind that the task you're 00:00:25.980 |
confronted with for this assignment and bake-off is the one in the final row, which is very 00:00:33.580 |
QA, standard QA, the way this is formulated in the modern phase, as in datasets like SQuAD, 00:00:39.780 |
is that you're given a gold evidence passage, and the name of the game is to train a QA 00:00:44.820 |
reader that will learn to find answers to questions in those evidence passages. 00:00:49.820 |
And in this mode, we don't have a retriever at all, so I put "NA" here. 00:00:55.820 |
This is the variant where we're not given a passage, but rather we need to learn to 00:01:00.220 |
retrieve relevant passages, and then in the standard mode, we train a QA module to learn 00:01:06.500 |
how to find answers to questions in those retrieved passages. 00:01:10.260 |
And that is already substantially harder because now we have to retrieve good evidence, and 00:01:14.940 |
we don't have a guarantee that the answer will even be findable in the evidence that 00:01:21.500 |
Few-shot QA is something we haven't discussed yet. 00:01:24.100 |
This is the task that was really introduced in the GPT-3 paper, and it's hard along a 00:01:33.100 |
We could use SQuAD, for example, as the basis for the task, but we're not allowed to do 00:01:41.020 |
We have to rely on a frozen large language model to learn in context how to do what we 00:01:48.780 |
And in this mode, there's no retrieval because we just rely on the closed nature of a task 00:01:55.940 |
That's already hard enough because you don't get to do any task-specific training. 00:02:00.940 |
We are going to move you into a mode that combines the hard aspects of Open QA and Few 00:02:07.940 |
In this mode, you do not have a gold evidence passage, and you are compelled to use only 00:02:18.980 |
We're going to have a retrieval mechanism for you. 00:02:21.100 |
You could do some fine-tuning of it, but we're not going to explore that in this homework. 00:02:27.100 |
So really, in the end, what you're left with is a frozen retrieval model, a frozen language 00:02:31.380 |
model, and on that basis, you need to figure out how to answer questions effectively. 00:02:36.740 |
Just to repeat, your situation is a difficult one. 00:02:40.820 |
During development, you will have gold QA pairs, but at test time, all you're going 00:02:46.620 |
to have is questions, no gold passages or any other associated data. 00:02:53.920 |
It is simply a list of questions that you need to figure out how to answer. 00:02:59.520 |
I feel like this task would not even have been posable in 2018, and when we first did 00:03:05.040 |
it last year, I worried that it might be too difficult, but people did incredible things, 00:03:10.580 |
and I think you're going to do incredible things with this seemingly almost impossible 00:03:16.220 |
But just to emphasize here, you have to operate throughout this with frozen components. 00:03:24.020 |
All you can do is in-context learning with frozen models, but I assure you, you'll get 00:03:32.740 |
Just as a reminder, for that task that I mentioned, few-shot QA, that is the one that was posed 00:03:42.980 |
You can see you have this gold passage that you prompt your language model with. 00:03:47.600 |
You give it a demonstration QA pair, and then you have your final target question. 00:03:53.500 |
The demonstration follows the substring guarantee into the gold evidence passage, and so does 00:03:59.820 |
And that's how they posed this, and they did pretty well at it. 00:04:02.980 |
And just as a check, I tried text-davinci-002 with exactly this example and got the right 00:04:10.060 |
They can still do few-shot QA with SQuAD with these deployed models. 00:04:19.300 |
In your setting, you're just given a question, and the task is to answer it. 00:04:24.860 |
And a standard baseline in this mode is what I've called retrieve-then-read. 00:04:29.500 |
So the way this would work is that you'll rely on a retrieval mechanism to find a context 00:04:38.860 |
And then, for few-shot retrieve-then-read, you might add in some demonstrations, and you could get those 00:04:43.780 |
from the SQuAD dataset that we provide to you, or you could try to get them from somewhere 00:04:49.180 |
You could also get from your train set or retrieve an answer to that demonstration question, 00:04:57.180 |
SQuAD provides all of these as gold evidence, but it's conceivable that you would want to 00:05:02.380 |
retrieve answers or predict answers and retrieve passages so that your system learns from demonstrations 00:05:10.180 |
that are kind of like the actual situation that you have down here where there are no gold 00:05:14.580 |
passages and no gold answers, just questions, and everything else has to be found somewhere. 00:05:22.340 |
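The retrieve-then-read baseline described above can be sketched in plain Python. This is only an illustration under stated assumptions: the word-overlap retriever, the three-passage corpus, and the names `retrieve` and `build_prompt` are hypothetical stand-ins for the real retrieval index and language model prompt used in the assignment. It shows each demonstration and the target question being paired with a retrieved passage rather than a gold one.

```python
import re

# Toy corpus: a hypothetical stand-in for a real passage index.
CORPUS = [
    "Alaska and Hawaii are the only US states that border no other US states.",
    "Stanford University is located in California.",
    "SQuAD is a reading comprehension dataset built from Wikipedia articles.",
]

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, k=1):
    """Rank passages by word overlap with the query (toy retriever)."""
    q = tokenize(query)
    ranked = sorted(CORPUS, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:k]

def build_prompt(question, demos):
    """Few-shot retrieve-then-read: every demonstration and the target
    question gets a retrieved (not gold) context passage."""
    blocks = []
    for demo in demos:
        context = retrieve(demo["question"], k=1)[0]
        blocks.append(
            f"Context: {context}\nQ: {demo['question']}\nA: {demo['answer']}"
        )
    context = retrieve(question, k=1)[0]
    # The final answer slot is left empty for the language model to fill.
    blocks.append(f"Context: {context}\nQ: {question}\nA:")
    return "\n\n".join(blocks)

demos = [{"question": "Where is Stanford University located?", "answer": "California"}]
prompt = build_prompt("Which US states border no US states?", demos)
print(prompt)
```

The resulting string would then be sent to a frozen language model, and the text it generates after the final `A:` would be taken as the answer.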
For the assignment itself, we're pushing you to use the Demonstrate-Search-Predict (DSP) programming 00:05:27.820 |
library, and the vision behind this library is that we're going to make prompt engineering 00:05:32.340 |
proper software engineering, where you write a little program as opposed to typing out 00:05:39.140 |
And the idea here is that that opens up a whole new design space and really gets us 00:05:43.300 |
thinking in new ways about how to design AI systems in this modern mode that are essentially 00:05:49.380 |
prompting frozen pre-trained components and getting them to work in concert to do complicated 00:05:56.860 |
So this is a diagram from the DSP paper, and you're going to be writing little programs 00:06:04.300 |
For the notebook itself, we begin with some setup, and what you see happening here is 00:06:08.260 |
that we're kind of connecting with some large language model vendors who provide powerful 00:06:16.020 |
Here's the key for OpenAI, and here's the one for Cohere. 00:06:19.660 |
These are not supplied to you, so you need to get set up separately with your own API 00:06:25.140 |
With Cohere, you can use the models for free, and for OpenAI, when you open an account, you 00:06:29.180 |
get some small number of free credits to use for their models. 00:06:34.900 |
And then finally, we set you up with a ColBERT server. 00:06:37.200 |
This is an index that we created that will provide you with a very rich retrieval mechanism. 00:06:43.200 |
And then in the cell down here, we set up the language model using the DSP library. 00:06:48.300 |
Here I'm using text-davinci-001, and I've got my OpenAI key associated with it. 00:06:53.660 |
And there's the commented out version for doing this with Cohere models. 00:06:59.820 |
And the final piece here is to just set DSP as a library so that you're using that LM 00:07:10.580 |
One thing I wanted to pause on here is the appearance of SQuAD in the notebook. 00:07:16.300 |
That might surprise you because SQuAD is a closed standard QA formulation of the task 00:07:22.220 |
where you're given gold passages and so forth. 00:07:24.880 |
So I want to emphasize that the role of SQuAD here is to provide you with some training 00:07:30.460 |
And I put "train" in quotation marks there because, of course, you can't train any systems. 00:07:36.100 |
But you can use the train portion of SQuAD to construct demonstrations for your passages 00:07:43.580 |
So in essence, SQuAD is providing train data, gold QA pairs, maybe with gold passages, that 00:07:49.580 |
you can use for demonstrations. 00:07:52.540 |
And SQuAD also provides dev QA pairs that we can use to simulate your actual situation 00:07:58.580 |
so that you can figure out how well your system is going to do at test time, that is, on the 00:08:05.740 |
So that's why you get this section, SQuAD train, SQuAD dev, SQuAD dev sample. 00:08:11.020 |
That final thing there is just because you should keep in mind that in this mode, evaluations 00:08:16.620 |
can be quite expensive, especially if you're paying OpenAI for each one of its API calls. 00:08:21.580 |
And so you'll want to do evaluations on small data sets and do them only sparingly. 00:08:26.400 |
So I've provided you with a tiny sample of 200 dev examples to kind of use in a very 00:08:31.900 |
controlled way, although, honestly, even that can get quite expensive. 00:08:35.820 |
And so I would do even those kind of quantitative evaluations only sparingly. 00:08:42.720 |
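One way to keep such evaluations cheap and repeatable is to score a fixed, seeded sample of the dev set. The sketch below is an assumption-laden illustration, not the notebook's own evaluation code: the exact-match metric, the helper names, and the toy data are all hypothetical.

```python
import random

def normalize(text):
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())

def exact_match(prediction, gold_answers):
    """True if the prediction equals any gold answer after normalization."""
    return normalize(prediction) in {normalize(a) for a in gold_answers}

def evaluate_on_sample(system, dev, n=200, seed=1):
    """Score `system` on at most `n` dev examples drawn with a fixed seed,
    so repeated runs reuse the same examples."""
    rng = random.Random(seed)
    sample = rng.sample(dev, min(n, len(dev)))
    hits = sum(exact_match(system(ex["question"]), ex["answers"]) for ex in sample)
    return hits / len(sample)

# Toy dev set and a stub "system"; both are placeholders for illustration.
toy_dev = [
    {"question": "Which US states border no US states?", "answers": ["Alaska, Hawaii"]},
    {"question": "What is the capital of France?", "answers": ["Paris"]},
    {"question": "How many days are in a week?", "answers": ["seven", "7"]},
]
canned = {
    "Which US states border no US states?": "alaska,  hawaii",
    "What is the capital of France?": "Paris",
    "How many days are in a week?": "six",
}
score = evaluate_on_sample(canned.get, toy_dev, n=200)
print(f"exact match: {score:.2f}")
```

In a real run, `system` would wrap a call to the language model, so each evaluated example costs at least one API call; that is the reason to keep `n` small.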
With that background in place, we can begin thinking about using DSP itself. 00:08:47.300 |
One nice thing about DSP is that it gives us very easy access to a language model. 00:08:51.900 |
So what I'm showing in this cell 13 here is a direct call to the language model with the 00:08:59.540 |
And you can see it's given me a list of responses and kind of messiness with all their newlines 00:09:06.000 |
You can add in keyword parameters to the underlying language model if that model honors them, 00:09:11.100 |
and that will affect the behavior of this function call. 00:09:13.660 |
So here I've called it with temperature 0.9, and I'm getting four responses back. 00:09:18.180 |
And you can see it's listed them out in a list there. 00:09:22.260 |
Another nice thing about DSP is that if you call lm.inspect_history, it will show you the 00:09:27.420 |
previous calls to the language model, and it's formatted those quite nicely. 00:09:31.180 |
So if you're uncertain about what you're feeding into your model, and that can happen with 00:09:35.540 |
DSP, you can call inspect_history and get a look at what you actually did. 00:09:43.140 |
Now mostly for DSP, you won't call the language model directly the way we just did. 00:09:48.460 |
You will rely on DSP templates to kind of format prompts to the language model and also 00:09:54.980 |
extract information from the generated answer to use as kind of the basis for your system. 00:10:05.740 |
It's got question and answer components and includes some instructions. 00:10:10.380 |
And then for example, if you create a DSP example from our running case, "Which US states 00:10:15.740 |
border no US states?", and you call it with a sample of two SQuAD training instances to 00:10:21.180 |
use as demonstrations, you can feed that through your template. 00:10:25.100 |
And what you get is something that looks like this, where our target question is at the 00:10:32.140 |
And here are those two demonstrations that we sampled from the SQuAD train set. 00:10:37.980 |
Here are the instructions and here's some formatting stuff that comes from the template. 00:10:42.620 |
And this is a pretty good standard mode for all these modern large language models to 00:10:47.420 |
help them do in-context learning and figure out what you want them to do based on information 00:10:52.700 |
in the prompt and the demonstrations you've provided. 00:10:57.500 |
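The prompt layout just described (instructions, then demonstrations, then the target question with an empty answer slot) can be sketched in plain Python. This is not the DSP template API; `render_prompt`, `extract_answer`, the separator, and the toy demonstrations are assumptions made for illustration.

```python
def render_prompt(instructions, demos, question):
    """Lay out instructions, then demonstrations, then the target question
    with an empty answer slot for the model to complete."""
    parts = [instructions, "---"]
    for demo in demos:
        parts.append(f"Question: {demo['question']}\nAnswer: {demo['answer']}")
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)

def extract_answer(completion):
    """Return the first non-empty line of the completion as the answer."""
    for line in completion.splitlines():
        if line.strip():
            return line.strip()
    return ""

# Toy demonstrations; in the assignment these would be sampled from the train data.
demos = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "How many days are in a week?", "answer": "Seven"},
]
prompt = render_prompt(
    "Answer questions with short factoid answers.",
    demos,
    "Which US states border no US states?",
)
print(prompt)
print(extract_answer(" Alaska, Hawaii\n\nQuestion: ..."))
```

Keeping formatting and extraction in one place is the design point: the same structure that writes the `Answer:` slot also knows how to read the model's text back out of it.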
And here to kind of put those pieces together, you have in DSP what I've called prompt-based 00:11:02.620 |
So dsp.generate, you feed that through a template and that gives you a kind of generator function 00:11:08.060 |
that when called on a DSP example, will give you back some responses. 00:11:13.300 |
And here's the answer value from the completions. 00:11:16.820 |
"Alaska, Hawaii" is how it has answered the question "Which US states border no US states?" 00:11:23.180 |
And again, if you feel unsure, you can call inspect_history and you'll see exactly what 00:11:28.700 |
And there's that prompt again with our two sample demonstrations. 00:11:37.820 |
The other part of this assignment is thinking about retrieval. 00:11:41.880 |
And as I said before, for that, we have given you a ColBERT index and a ColBERT retrieval 00:11:48.800 |
And you can mostly just treat that as a very effective retrieval mechanism. 00:11:53.340 |
Here's a question that we can use as an example. 00:11:56.380 |
We've got retrieve, which, when given a string and the number of passages we want in response, 00:12:01.660 |
will give you back a list of passages that you can use for constructing prompts and so 00:12:07.980 |
As with the language model, if you need deeper access to the retrieval mechanism, you can 00:12:16.260 |
And that will allow you to have a bunch of other keyword parameters in here and give 00:12:20.140 |
you more information back than just the list of passages, for example, scores and other 00:12:27.020 |
So that's there if you want to design more advanced systems as part of your original 00:12:34.340 |
All of these things come together in the first part of the notebook. 00:12:38.740 |
This is a little DSP program that is a complete solution to few-shot OpenQA. 00:12:48.380 |
First, keep this in mind: use this decorator, dsp.transformation, on all of your programs 00:12:55.860 |
so that your programs don't modify the DSP examples that come in. 00:13:00.260 |
You'll be augmenting them with demonstrations and maybe changing the fields and you don't 00:13:04.740 |
want that to have an in-place impact on, for example, the SQuAD dataset that you have 00:13:10.620 |
So as a precaution, always add this decorator to all your DSP functions and your life will 00:13:19.900 |
Next, programs operate on DSP example instances, individual ones. 00:13:30.180 |
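To see why the decorator matters, here is a minimal sketch of the copy-on-entry idea in plain Python. It is a hypothetical stand-in, not the library's implementation: the `transformation` function below, the use of `copy.deepcopy`, and the dictionary-based example are all assumptions for illustration.

```python
import copy
import functools

def transformation(fn):
    """Hypothetical copy-on-entry decorator: the wrapped program receives
    a deep copy, so the caller's example is never modified in place."""
    @functools.wraps(fn)
    def wrapper(example, *args, **kwargs):
        return fn(copy.deepcopy(example), *args, **kwargs)
    return wrapper

@transformation
def add_demos(example, demos):
    """A tiny program that operates on one example and augments it."""
    example["demos"] = demos
    return example

original = {"question": "Which US states border no US states?"}
toy_demos = [{"question": "What is the capital of France?", "answer": "Paris"}]
augmented = add_demos(original, toy_demos)

print("demos" in original)   # the input example is unchanged
print("demos" in augmented)  # the returned copy carries the demonstrations
```

Without the copy, assigning demonstrations inside the program would also change the example stored in the dataset, and later runs would silently start from already-augmented data.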
In the first line of this little program, we sample K random demonstrations from the 00:13:39.380 |
That gives us a demonstrations attribute on the example that came in. 00:13:50.820 |
That gives us this new modified generator function for the language model. 00:13:58.420 |
We call that on an example and we get back completions as well as a copy of the example. 00:14:06.140 |
It will have an answer attribute because of the QA template. 00:14:09.940 |
And that's the field that we'll use as the answer that the system has responded to the 00:14:16.660 |
So that's the complete program for few-shot OpenQA and the assignment questions essentially 00:14:25.500 |
Both of the questions are DSP programs just like that one. 00:14:32.460 |
So a small modification of the program I just showed you where you add in context passages 00:14:40.620 |
And then question two asks you to use the annotate primitive from DSP which is very 00:14:45.180 |
powerful as a mechanism for doing lots of things. 00:14:48.860 |
And what we have you use it to do is construct demonstrations that will be especially effective 00:14:54.540 |
for guiding the system toward the behaviors that you want it to have in context. 00:15:01.100 |
And then having done those things, you design your original system. 00:15:04.520 |
We expect your original system to be a DSP program because we think we've provided you 00:15:09.960 |
with lots of primitives for writing really powerful DSP programs that will be really interesting 00:15:18.860 |
As before, original systems can take many forms and you should feel free to use whatever 00:15:26.260 |
If you would like to explore DSP even further, Omar has created an intro notebook that walks 00:15:32.960 |
through additional advanced programs for really hard QA problems. 00:15:37.180 |
We're going to talk about a bunch of those techniques when we talk about in-context learning. 00:15:41.340 |
A lot of the most powerful ones are exemplified in that intro notebook where we show you how 00:15:47.020 |
we have DSP primitives that make it very easy to make use of those concepts. 00:15:53.100 |
So that's a powerful next step that you might check out as part of thinking about your original