Stanford XCS224U: NLU | In-context Learning, Part 4: Techniques and Suggested Methods | Spring 2023
Chapters
0:00
3:14 Choosing demonstrations
5:59 Example from Assignment 2
9:11 Chain of Thought
10:24 Generic step-by-step with instructions
13:13 Self-Consistency in DSP
13:52 Self-Ask
15:03 Iterative rewriting
16:05 Some DSP results
17:59 Suggested methods
I'm going to talk about techniques for in-context learning and then suggest some future directions for you in your own research in this space.
As before, I'm nervous about giving this screencast, because it's essentially certain that what turns out to be a really powerful technique for in-context learning will be discovered in a few months. In fact, one of you might go off and discover precisely that technique, which will make this screencast seem incomplete. Still, I feel confident that we've learned important lessons about what works and what doesn't, and those will carry forward no matter what happens in the field.
Let's dive in. Let's start with the core concept of a demonstration. This is an idea that stretches back at least to the GPT-2 paper, and it is incredibly powerful for designing effective in-context learning systems. Let me illustrate in the context of few-shot open-domain question answering, which is our topic for the associated homework and bake-off.
Imagine that we're given the question, who is Bert? We're going to prompt our language model with this question and have it generate an answer. By now, you probably have an intuition that it would be effective to retrieve a context passage to insert into the prompt, some evidence the model can use for answering the question. The idea behind demonstrations is that it might also help to show the model examples of the kind of behavior that we would like to elicit. Maybe we have a train set of QA pairs, and we fetch one of those QA pairs and insert it into the prompt.
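To make that concrete, here is a minimal Python sketch of how such a prompt might be assembled. The retrieve helper and the exact template are placeholder assumptions for illustration, not the format required by the assignment.

    def build_prompt(question, train_qa_pairs, retrieve):
        """Assemble a few-shot open-domain QA prompt.

        retrieve(query) is assumed to return a relevant context passage;
        train_qa_pairs is a list of (question, answer) tuples.
        """
        demo_q, demo_a = train_qa_pairs[0]  # one demonstration, chosen naively here
        lines = [
            "Answer the question using the provided context.",
            "",
            f"Context: {retrieve(demo_q)}",
            f"Question: {demo_q}",
            f"Answer: {demo_a}",
            "",
            f"Context: {retrieve(question)}",
            f"Question: {question}",
            "Answer:",
        ]
        return "\n".join(lines)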
But this is your first real choice point. For the demonstration's answer, you could of course use the gold answer that comes with the QA pair in your train set. But, counter-intuitively, it could be useful to instead use an answer that you retrieve from some data store, or even an answer generated by the very language model that you are prompting right now. That gives you a picture of what your language model is actually capable of doing, versus just relying on the gold QA pairs, which might be disconnected from the behavior of your model.
The same choice arises for the demonstration's context passage. You could just use the gold passage from the train data. But since the passage for our target question down here is a retrieved one, it might be better to retrieve a passage for the demonstration as well, instead of using a gold one, so that the demonstration reflects the kind of evidence the model actually has for our target question, that is, the situation that your model is actually in.
The next question is how to choose demonstrations in the first place. Of course, you could just randomly choose them from the available data, but perhaps you could do better. Maybe you should choose your demonstrations based on relationships that they have to your target example, for example, based on similarity in some sense to the target input. Or, for classification, you might select demonstrations that help the model implicitly determine the target input type.
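Here is a small sketch of that similarity-based selection. I'm assuming a hypothetical embed function that maps a string to a vector; any off-the-shelf sentence encoder would do.

    import numpy as np

    def choose_demonstrations(target_question, train_qa_pairs, embed, k=3):
        """Return the k train QA pairs whose questions are most similar
        to the target question under the given embedding function."""
        target_vec = embed(target_question)
        def cosine(pair):
            vec = embed(pair[0])
            return float(np.dot(target_vec, vec) /
                         (np.linalg.norm(target_vec) * np.linalg.norm(vec)))
        return sorted(train_qa_pairs, key=cosine, reverse=True)[:k]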
You might also select your demonstrations to satisfy specific criteria, for example, requiring that they be examples that actually lead the language model to be able to predict the correct answer. In classification, maybe a straightforward thing would be to make sure that every label is represented in your demonstrations, so that your model has an example of every possible behavior and isn't left to guess about labels it has never seen.
You could even have the language model rewrite these demonstrations that we have available to us. For example, if your instructions ask the model to generate answers in the style of a pirate, it might be useful to have your language model actually rewrite the demonstrations in the style of a pirate, so that they match the behavior you are trying to elicit.
The fundamental thing that you have to get used to is that, for powerful in-context learning systems, pieces of your prompt may themselves be generated by issuing a different prompt to that self-same language model. The final prompt can be the product of multiple calls to your language model. That takes some getting used to, but the end result can be something very powerful in terms of aligning your language model with the results that you want to see.
In that context, let me linger a little bit over an example from Assignment 2, because this is one that people often find hard to think about, but fundamentally I think it's an intuitive and powerful idea.
Let's start with our usual question, who is Bert? We're going to retrieve some context passage, presumably, and then the question is what to do for demonstrations. Suppose the QA pair I fetch from the train set asks, who is Elmo? The train set doesn't have context passages, so what I decide to do is retrieve a context passage for that demonstration question too. The passage I get back says something like: Elmo is an LSTM for contextual representations. That context passage is about a different Elmo than the one represented in this question-answer pair serving as my demonstration.
The question is, could we detect that mismatch automatically? Here is one way to think about it. We build a separate prompt for checking the demonstration: it contains the retrieved context passage, and then, yes, we'll probably insert a demonstration of its own; to keep things simple, let's assume that that demonstration just comes from the train set and consists of the question and the answer. We then add our demonstration question, feed the whole thing to our language model, and we see what comes out. We can observe that the predicted response to this demonstration does not match the gold answer in our dataset. We could use that as a signal that something about this demonstration has gone wrong, and set it aside.
Here, by contrast, is a case where the context, question, and answer look harmonious. Again, we could try to detect that automatically by firing off a prompt to the language model with our demonstration question, who is Ernie? In this case, the model's response matches our gold answer, and that is evidence that this demonstration will lead to good behavior from our model.
The larger point is that we're using the language model, demonstrations, and our gold data to figure out which demonstrations are likely to be effective and which aren't, and we're trying to do that automatically. And yes, for this sub-process, where I just inserted a gold context-question-answer pair, you can imagine recursively doing the same thing, trying to find good demonstrations for the demonstration selection process itself.
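Here is a rough sketch of that filtering step. The generate and retrieve helpers are assumed, and the exact-match comparison is a simplification of how you might score agreement with the gold answer.

    def keep_good_demonstrations(train_qa_pairs, generate, retrieve):
        """Keep only the train QA pairs for which the model, shown a retrieved
        passage and the question, reproduces the gold answer."""
        good = []
        for question, gold_answer in train_qa_pairs:
            prompt = (
                f"Context: {retrieve(question)}\n"
                f"Question: {question}\n"
                "Answer:"
            )
            predicted = generate(prompt).strip()
            if predicted.lower() == gold_answer.lower():
                good.append((question, gold_answer))
        return good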
At some point, that recursive process needs to end.
The next technique is chain of thought. The intuition behind chain of thought is that, for complicated problems, it may be asking too much for the prompt to require the model to simply produce the answer in its initial tokens. What we do with chain of thought is construct demonstrations that encourage the model to generate in a step-by-step fashion, reasoning its way toward the answer. This again shows the power of demonstrations. We illustrate chain of thought for the model with these extensive, elaborate demonstrations. Then, when the model goes to do our target behavior, the demonstration has led it to walk through a similar chain of thought and ultimately produce what we hope is the correct answer, assuming we didn't lead it down the garden path, or it didn't lead itself down the garden path toward the wrong answer, which absolutely can happen with chain of thought.
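As a concrete illustration, here is one chain of thought demonstration adapted from the original chain-of-thought paper, wrapped up as a Python prompt string; the exact wording is illustrative.

    cot_demo = (
        "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
        "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 balls. "
        "5 + 6 = 11. The answer is 11.\n\n"
    )

    def cot_prompt(target_question):
        # The demonstration models the step-by-step style we want the model to imitate.
        return cot_demo + "Q: " + target_question + "\nA:"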
The original chain of thought is quite bespoke. We need to carefully construct these chain of thought demonstration prompts to encourage the model to do particular things. I think there is a more generic version of this that can be quite powerful. I've called this generic step-by-step with instructions. Here we are definitely aligning with the instruct fine-tuning that these models are probably undergoing, and leveraging that in some indirect fashion.
Here is an example. The question is: is it true that if a customer doesn't have any loans, then the customer doesn't have any auto loans? It's a complicated conditional question involving negation, and the model has unfortunately given the wrong answer. The continuation is revealing: the model says that a customer can have a loan without having an auto loan, which is the reverse of the conditional question that I posed.
In the generic step-by-step approach, we instead give the model a prompt that tells it something high level about what we want to do. It says: logic and common sense reasoning exam. It then lays out what the reasoning should look like and what the prompt will look like, using an informal markup language that the model probably acquired via some instruct fine-tuning phase.
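Here is a sketch of what such a prompt might look like in code. The header echoes the slide's "logic and common sense reasoning exam" framing, but the exact wording and markup are my own illustrative assumptions.

    instructions = (
        "Logic and common sense reasoning exam.\n\n"
        "Reason step by step about the question, state each inference you make, "
        "and then give a final answer of Yes or No on its own line.\n\n"
    )

    question = (
        "Is it true that if a customer doesn't have any loans, "
        "then the customer doesn't have any auto loans?"
    )

    prompt = instructions + "Question: " + question + "\nReasoning:"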
What happens is that the model walks through the logical reasoning, in this case arrives at the correct answer, and also does an excellent job of explaining its own reasoning.
I can't know exactly what the model saw during its fine-tuning, but here it looks like this generic step-by-step instruction format is tapping into that training effectively.
The next technique is self-consistency, which has connections to an earlier model called retrieval augmented generation. Let me zoom in on the important piece. Given a prompt, we sample a bunch of different generated responses, which might go through different reasoning paths using something like chain of thought reasoning. We then look across the different generated paths that the model has taken, and we select the answer that was most often produced across all of these different reasoning paths. We're effectively marginalizing out the reasoning paths to arrive at an answer, with the intuition being that the answer arrived at by the most paths, or the most probable answer given all these paths, is likely to be a trustworthy one. It can get expensive, because you sample a lot of generations, but the result can make models more self-consistent.
DSP includes a primitive that actually makes it very easy to do self-consistency. You just set your model up to generate a lot of different responses given your prompt template, and then dsp.majority will figure out which answer was produced most often given all of those reasoning paths. This makes self-consistency essentially a drop-in for any program that you write, assuming you can afford to do all of the sampling. I'd point you to Omar's intro notebook, which walks through this in more detail.
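Here is a minimal sketch of the majority-vote idea in plain Python, not the dsp.majority implementation itself; the generate helper and the "Answer:" convention are assumptions for illustration.

    from collections import Counter

    def self_consistency(generate, prompt, n=20, temperature=0.7):
        """Sample n completions and return the most frequently produced final
        answer. generate(prompt, temperature) is assumed to return a string
        whose last line looks like 'Answer: <answer>'."""
        answers = []
        for _ in range(n):
            completion = generate(prompt, temperature=temperature)
            answers.append(completion.split("Answer:")[-1].strip())
        most_common_answer, _count = Counter(answers).most_common(1)[0]
        return most_common_answer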
The next technique is self-ask. The idea behind self-ask is that we will, via demonstrations, encourage the model to break a complex question down into simpler sub-questions. In that way, the idea is that it will iteratively get to the point where it can find the answer to the overall question. This is especially powerful for questions that might be multi-hop, that is, might involve multiple different resources, and which you can essentially think of as being broken down into a series of smaller questions that need to be resolved in order to get an answer to the final question.
Self-ask also has an intriguing property for our purposes: you can plug in retrieval for answering the intermediate questions. Instead of relying on the model's own generations for those intermediate questions, you'd look to something like a search engine; in the paper they use Google to answer those questions, the answers get inserted into the prompt, and the model continues. That's self-ask, or maybe self-ask and Google answer.
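The demonstrations that elicit this behavior look roughly like the following, paraphrasing the format from the self-ask paper; the exact strings here are illustrative.

    self_ask_demo = "\n".join([
        "Question: Who lived longer, Theodor Haecker or Harry Vaughan Watkins?",
        "Are follow up questions needed here: Yes.",
        "Follow up: How old was Theodor Haecker when he died?",
        "Intermediate answer: Theodor Haecker was 65 years old when he died.",
        "Follow up: How old was Harry Vaughan Watkins when he died?",
        "Intermediate answer: Harry Vaughan Watkins was 69 years old when he died.",
        "So the final answer is: Harry Vaughan Watkins",
    ])

    # At inference time, the model emits its own "Follow up:" questions, and a
    # search engine's answers can be spliced in as the "Intermediate answer:" lines.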
Another very powerful general idea, one that I'm sure will survive no matter what people discover about in-context learning, is that it can be useful to iteratively rewrite parts of your prompt. I've given some code here that shows how this plays out in the context of multi-hop search, where we are gathering evidence passages from a bunch of different sources and then using those as evidence for answering a complicated question. When you are in a very complicated situation in terms of information, it might be helpful to iteratively have your language model rewrite parts of its own prompt, and then prompt the model with that rewritten material, as a way of synthesizing information and getting better results.
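Here is a loose sketch of that loop, not the DSP program from the slide; generate and retrieve are assumed helpers, and the query-rewriting prompt is simplified.

    def multihop_answer(question, generate, retrieve, hops=2):
        """Iteratively rewrite the search query, collecting a passage per hop,
        then answer using all of the collected evidence."""
        passages = []
        query = question
        for _ in range(hops):
            passages.append(retrieve(query))
            context = "\n".join(passages)
            # Ask the model to rewrite the query to target whatever is still missing.
            query = generate(
                f"Context:\n{context}\n\nQuestion: {question}\n"
                "Write a search query for the information still needed:"
            )
        context = "\n".join(passages)
        return generate(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")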
I thought I would just call out some results from the DSP paper, where we evaluate across a bunch of different knowledge-intensive tasks, like question answering or information-seeking dialogue. The high-level takeaway here is that we can write DSP programs that are breakaway winners in these competitions, especially on the most underexplored datasets like MuSiQue and PopQA. Part of the lesson is that DSP is amazing, and Omar and the team did an amazing job. But I think the deeper lesson is that it is very early days for these techniques. The only time you see breakaway results like this in modeling is when something new has happened and people are just figuring out what to do next. I'm sure people will soon find even more powerful in-context learning techniques, and I would just encourage you to think about DSP as a tool for creating prompts that are truly full-on AI systems. We want to bring software engineering to prompt engineering, and really think of this as a first-class way of designing AI systems. I think we're going to see more and more breakaway results, and we will indeed realize the vision of having in-context learning systems surpass the fine-tuned systems that were supervised in the classical mode. It is early days, and there's lots of space in which to be creative right now.
That's a good opportunity for me to queue up this final part of the screencast: some suggested methods for you as you think about working in this space.
First, create dev and test sets for yourself based on the task you want to solve, aiming for a format that can work with a lot of different prompts. That way, you have a fixed target that you're trying to achieve, which will help you get better results and be more scientifically rigorous.
Second, learn what you can about the data your model was trained on, paying particular attention to whether it was instruction fine-tuned and on what kinds of examples. I think we have already seen that the extent to which you can align your prompts with that training is the extent to which you will get good behavior. Often, we don't know what that dataset was like, but we can discover it in a heuristic fashion, at least in part.
Third, try to write systematic, generalizable code for handling the entire workflow, from reading data to extracting responses and analyzing the results. That is a guiding philosophical idea behind DSP, and I think this is an important methodological note. We shouldn't be pecking out prompts and designing systems in that very ad hoc way. We should be thinking about this as the new mode in which we program AI systems, and take it as seriously as we can.
Finally, for the current and perhaps brief moment, prompt designs involving multiple pre-trained components and tools seem to be underexplored relative to their potential value. For this unit, we are exploring how a retrieval model and a language model can work in concert to do powerful things. But we could obviously bring in other pre-trained components, and maybe even other core computational capabilities like calculators. Thinking about how to design prompts that take advantage of all of those different tools is a wonderful new avenue.
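As one toy illustration of the calculator idea, here is a sketch in which the model can request arithmetic and the result is fed back into the prompt. The CALC convention and the generate helper are assumptions made up for this example.

    import re

    def answer_with_calculator(question, generate):
        """Let the model emit CALC(expression) when it needs arithmetic; we
        evaluate the expression and prompt again with the result."""
        prompt = (
            "You may write CALC(expression) whenever you need arithmetic.\n"
            f"Question: {question}\nAnswer:"
        )
        response = generate(prompt)
        match = re.search(r"CALC\((.+?)\)", response)
        if match:
            # Toy evaluation of a purely arithmetic expression; never eval untrusted input in practice.
            result = eval(match.group(1), {"__builtins__": {}})
            response = generate(prompt + " " + response + f"\nCALC result: {result}\nFinal answer:")
        return response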
We're starting to see exploration in this space, and it is sure to grow. Maybe go forth and see what value you can extract out of these ideas.