Stanford XCS224U: NLU | In-context Learning, Part 2: Core Concepts | Spring 2023
This is part two in our series on in-context learning. I think these concepts are a review for you all, but I thought it would be good to get them into our common ground to help us think about them as we work through the unit.
To start, let's just establish some terminology, and I thought I would just try to be clear about what I mean by each of these terms. In in-context learning, a frozen language model performs a task only by conditioning on the prompt text. It's frozen, that is, there are no gradient updates. The only mechanism we have for learning is that we input text that puts the model in a state that is useful for having it generate things that we regard as useful for our task.
Few-shot in-context learning is a special case of that. This is where the prompt includes examples of the intended behavior, and, crucially, no examples of the intended behavior were seen in training. Of course, we are unlikely to be able to verify that second condition.
In this modern era where models are trained on massive amounts of text, we often have no ability to audit those training datasets, so we might not be sure whether we're actually doing few-shot in-context learning. But this is the ideal, and the spirit of it is that if models have seen examples of this type in training, that undermines the point of the exercise. The whole point is to see whether, with just a few instances in the prompt, the model can perform the intended task.
I'll also acknowledge that the term few-shot is used in more traditional supervised learning settings in the sense of training on a few examples with gradient updates. I'm just emphasizing that when I say few-shot in this lecture series, I'm always going to mean few-shot in-context learning with no gradient updates.
Zero-shot in-context learning is another special case. This is where the prompt includes no examples of the intended behavior, but I'll allow that it could contain some instructions, and, again, no examples of the intended behavior were seen in training. As before, we're unlikely to be able to verify that second condition, so we won't know whether this is truly zero-shot, but that is the spirit of the definition. The instructions that you include in the prompt are a gray area, but let's allow them in the zero-shot category.
What I mean by that is that, as you give more elaborate instructions, you might in effect be demonstrating the intended behavior. But the other side of this is that, for machine learning in general, instructions are conceptually very different kinds of things from actual demonstrations. It's interesting to separate out the case where you demonstrate directly from the case where you just describe the intended behavior. We'll allow mere descriptions to still count as zero-shot.
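To make the distinction concrete, here is a small sketch in Python of what the two kinds of prompts might look like for a sentiment task; the task, labels, and wording are illustrative assumptions of mine, not examples from the lecture.

```python
# Hypothetical prompts for a sentiment task (illustrative only).

# Few-shot in-context learning: the prompt demonstrates the intended behavior
# with a few input/output pairs; the model's weights are never updated.
few_shot_prompt = (
    "Review: The plot dragged on forever.\nSentiment: negative\n\n"
    "Review: A stunning, heartfelt film.\nSentiment: positive\n\n"
    "Review: I would happily watch it again.\nSentiment:"
)

# Zero-shot in-context learning: no demonstrations, only a description
# (instructions) of the intended behavior.
zero_shot_prompt = (
    "Classify the sentiment of the review as positive or negative.\n\n"
    "Review: I would happily watch it again.\nSentiment:"
)
```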
Another reminder concerns how GPT and related models work. We covered this in the unit on contextual representations, and I thought I'd just remind us so that this is front of mind as we think about the in-context learning techniques. This starts with the autoregressive loss function that these models use.
Again, the essence of this is that scoring happens on the basis of the token that we want to predict at time step t and the hidden state that the model has created up until the time step preceding t.
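Written out, in my own notation rather than the lecture slide's, that objective has the standard autoregressive form

$$
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P_{\theta}\left(x_t \mid x_1, \ldots, x_{t-1}\right),
$$

where $x_t$ is the token we want to predict at time step $t$ and the conditioning on $x_1, \ldots, x_{t-1}$ is carried by the hidden state the model has built up through time step $t-1$.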
Here's how that plays out for GPT-style models in the context of training. First, I'll show you training with teacher forcing.
This slide is a repeat of one we had in the contextual representations unit. Along the bottom, we have the sequence of tokens that we are using for training. Normally, we represent these as actual sequences of tokens, but I've depicted them here at the mechanical level of how these things actually operate. These are one-hot vectors, and those are used to look up vectors in our embedding layer, which is given in gray here, and the result of that lookup is a vector. But again, what we really have here is a sequence of vectors.
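As a minimal sketch of that mechanical point (my own illustration with toy sizes), multiplying a one-hot vector by the embedding matrix is exactly a row lookup:

```python
import numpy as np

vocab_size, embed_dim = 8, 4                          # toy sizes for illustration
embedding = np.random.randn(vocab_size, embed_dim)    # the embedding layer shown in gray

token_id = 3
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

# The "lookup" is a matrix product with the one-hot vector,
# which selects the corresponding row of the embedding matrix.
looked_up = one_hot @ embedding
assert np.allclose(looked_up, embedding[token_id])
```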
That sequence of vectors is the input to the big transformer model that we're using for language modeling. I've shown a schematic of this, and the one thing I've highlighted is the pattern of attention mechanisms. Recall that when we're doing autoregressive modeling, those dot-product attention mechanisms look only into the past. You see that characteristic pattern for the attention connections, and we do all our processing with all of these transformer blocks.
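That characteristic attention pattern can be written down as a causal mask; here is a minimal sketch (my own illustration, not the lecture's code) of the boolean mask that lets each position attend only to itself and the past:

```python
import torch

# Causal attention mask for a length-5 sequence: position t may attend
# to positions <= t, and all future positions are masked out.
seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
print(causal_mask)
```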
Then, at the output, we're going to use our embedding layer again to make predictions about the next token, the one from the sequence that we have at the bottom here. "The" comes in down here and that is the basis for predicting "rock", and so on along the sequence, until we finally predict the end-of-sequence token.
We're offset by one, using the previous context to predict the next token. Again, I've given these as one-hot vectors because those one-hot vectors are the actual learning signal. Those are compared, for learning, with the vector of scores that the model produces at each time step, and it's the difference between the one-hot vector and the score vector that we use to get gradient updates to improve the model.
Again, I'm emphasizing this because we tend to talk about these models as predicting tokens. But in fact, predicting tokens is something that we make them do. What they actually do is predict score vectors.
What's depicted on the slide here is teacher forcing. There's an interesting thing that happened at this time step, where the score vector actually put the highest score on the final element here, which is different from the one-hot vector that we wanted to predict. In teacher forcing, I still use the gold one-hot vector down here at the next time step to continue my predictions. There are versions of training where I would instead use a one-hot vector with a one in that highest-scoring position at the next time step, and that can be useful for introducing some diversity into the mix.
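Here is a minimal sketch of that training setup; a tiny GRU stands in for the transformer to keep it short, and the shapes and token ids are my own toy assumptions. The point it illustrates is that, under teacher forcing, the gold token is fed in at the next step regardless of which token the model scored highest, and the loss compares the score vectors with the gold one-hot targets.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy autoregressive LM: a small GRU stands in for the transformer (my assumption).
vocab_size, embed_dim, hidden_dim = 10, 16, 32
embed = nn.Embedding(vocab_size, embed_dim)
rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
to_scores = nn.Linear(hidden_dim, vocab_size)

# A training sequence of token ids (made up), e.g. "<s> the rock rolls </s>".
tokens = torch.tensor([[0, 4, 5, 6, 1]])

# Teacher forcing: the inputs are the gold tokens themselves, offset by one,
# regardless of which token the model would have scored highest.
inputs, targets = tokens[:, :-1], tokens[:, 1:]

hidden_states, _ = rnn(embed(inputs))    # states built up through step t-1
scores = to_scores(hidden_states)        # a score vector over the vocab at each step

# The loss compares the score vectors with the gold (one-hot) targets,
# and the gradient updates come from that difference.
loss = F.cross_entropy(scores.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
```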
That is also a reminder that these models don't predict tokens; we could use their predicted score vectors in lots of different ways. We could use something like beam search to choose what gets fed in at future time steps. We could pick the lowest-scoring item if we wanted. The underlying fact is that what these models do is predict score vectors.
Our actual focus is on frozen language models for this unit, and so we're really going to be thinking about generation. Let's imagine that the model has been prompted with a sequence ending in "rock". We use "rock", the one-hot vector there, as the input to the next time step. In this case, we could think of the prediction as "rolls". That prediction is used as the input at the next time step, and we continue our predictions.
Again, I want to emphasize that, at each time step, the model is predicting score vectors over the vocabulary, and we need some rule to decide what token that actually corresponds to. What I've depicted here is something that you might call greedy decoding, where the highest-scoring token at each time step is used as the input at the next time step. But again, that just reveals that there are lots of decision rules that I could use at this point to guide generation.
For example, we could use something like beam search: that would be where we do a rollout, look at all the score distributions that we got for a few time steps, and pick the sequence that seems to be the best scoring across that whole rollout, which could yield very different behaviors from the behavior that we get from greedy decoding.
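Here is a minimal sketch of greedy decoding as one such decision rule, again with a toy stand-in model and made-up token ids of my own rather than anything from the lecture:

```python
import torch
import torch.nn as nn

# Toy stand-in for a frozen LM that returns a score vector over the vocabulary
# for the next token given a prefix (my own assumption, not a real model).
vocab_size, embed_dim, hidden_dim = 10, 16, 32
embed = nn.Embedding(vocab_size, embed_dim)
rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
to_scores = nn.Linear(hidden_dim, vocab_size)

@torch.no_grad()  # frozen: no gradient updates during generation
def next_token_scores(prefix_ids):
    states, _ = rnn(embed(torch.tensor([prefix_ids])))
    return to_scores(states[0, -1])      # scores over the vocab at the final step

def greedy_decode(prompt_ids, eos_id=1, max_len=10):
    """Greedy decoding: at each step, feed in the highest-scoring token."""
    out = list(prompt_ids)
    for _ in range(max_len):
        token = int(next_token_scores(out).argmax())
        out.append(token)
        if token == eos_id:
            break
    return out

print(greedy_decode([0, 4, 5]))          # a made-up prompt ending in "rock"
```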
If you look at the APIs for our really large language models now, you'll see that they have a lot of different parameters that are essentially shaping how generation actually happens. That is again a reminder that generation is not really intrinsic to these models. What's intrinsic to them is predicting score vectors over the vocabulary, and the generation part is something that we make them do via a rule that we decide separately from their internal structure.
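As one concrete example of such parameters, here is a minimal sketch of temperature and top-k sampling applied to a single score vector; the numbers are made up, and the point is only that the same scores can yield different generations depending on the rule we choose:

```python
import torch

scores = torch.tensor([2.0, 1.5, 0.3, -1.0])   # a made-up score vector over a tiny vocab

def sample_next(scores, temperature=1.0, top_k=None):
    """Turn a score vector into a next-token choice via temperature and top-k."""
    logits = scores / temperature
    if top_k is not None:
        cutoff = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < cutoff, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

print(sample_next(scores, temperature=0.1))            # nearly greedy
print(sample_next(scores, temperature=2.0, top_k=2))   # more diverse, limited to the top 2
```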
That cues up a nice question that you could debate with your fellow researchers and friends and loved ones and people out in the world: do autoregressive language models simply predict the next token? You might answer that yes, that's all they do, and that is a reasonable answer. However, we just saw that it's more precise to say that they predict scores over the entire vocabulary at each time step, and that we turn those scores into tokens with a decision rule of our own. That feels more correct at a technical level.
You might reflect also that they actually represent data in their internal and output representations, and often it's those representations that we care about. That just points to the fact that autoregressive LMs do more than simply predict the next token. But on balance, I would say that saying they simply predict the next token might be the best answer in terms of science communication with the public. Whatever debates you have with your fellow researchers about what they're actually doing, it might give people the best mental model if you simply say that they predict the next token based on the ones they have already generated and the ones that you put in. It's an appropriately mechanistic explanation that I think might help people out in the world calibrate to what's actually happening.
We should even remind ourselves, as we see more impressive behaviors from these models, that they are producing high probability continuations. If you prompt the model with "better late than" and it says "never", transparently we can see that that's just a high probability continuation. If you prompt it with a phrase ending in "lunch", it will probably say "dinner", and you might immediately think that that reflects some world knowledge that the model has, but what it has produced is a continuation of the sequence with a high probability token. It's high probability because of regularities in the world, but for the model this is simply a high probability continuation.
Again, when you prompt it with "the president of the US is" and it gives you the name of a person as an answer, that might look like it has stored some knowledge about the world, but what it has offered is a high probability continuation of the sequence. When you get to something like "the key to happiness is" and it offers you an answer that seems insightful, you should again remind yourself that that is just a high probability continuation given all the training experience that the model has had. We really have no ability to audit what those training sequences were like. There might be something interesting happening in terms of what the model has learned, but we should remind ourselves that really it's just high probability continuations in all of these cases.
The final core concept that I want to mention here is this notion of instruction fine-tuning. This is from the blog post that announced ChatGPT; it's a description of how they do instruct fine-tuning for that model.
I think the thing to highlight is that, in step 1, we have what looks like fairly standard supervised learning: human-curated examples of prompts paired with good outputs. Then at step 2, we again have humans coming in, now to look at model outputs that have been generated and rank them according to quality, conditional on the prompt input.
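To make those two human-facing stages concrete, here is a minimal sketch of the losses typically involved; the function names (`lm_scores`, `reward`) are hypothetical placeholders of mine, not OpenAI's implementation. Step 1 fine-tunes on human-written outputs; step 2 trains a reward model so that outputs humans ranked higher receive higher scores.

```python
import torch
import torch.nn.functional as F

# Step 1 (sketch): supervised fine-tuning on human-curated (prompt, good output) pairs.
# `lm_scores(prompt_ids, output_ids)` is a hypothetical callable returning the model's
# score vectors over the vocabulary for each output position.
def sft_loss(lm_scores, prompt_ids, output_ids):
    scores = lm_scores(prompt_ids, output_ids)          # shape: [len(output_ids), vocab_size]
    return F.cross_entropy(scores, torch.tensor(output_ids))

# Step 2 (sketch): a reward model is trained from human rankings, here reduced to a
# single preferred/rejected pair. `reward(prompt_ids, output_ids)` is hypothetical and
# returns a scalar tensor scoring the output conditional on the prompt.
def ranking_loss(reward, prompt_ids, preferred_ids, rejected_ids):
    return -F.logsigmoid(reward(prompt_ids, preferred_ids) - reward(prompt_ids, rejected_ids))
```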
That's two stages at which people are playing a crucial role. We are no longer just doing language model training of the sort I described before. We have now entered back into a mode where a lot of the most interesting behaviors are certainly happening because people are providing direct supervision about what's a good output given an input. It's not magic when these models seem to do very sophisticated things. It is largely because they have been instructed to do very sophisticated things by very sophisticated humans.
That is important in terms of understanding why these models work, and I think it's also important for understanding how various in-context learning techniques behave, because increasingly we're seeing a feedback loop where the kinds of things that we want to do with our prompts are informing the kinds of things that happen in the supervised learning phase, making those techniques more powerful. Again, it's not a mysterious discovery about how large language models work; it's a consequence of the kinds of instruct fine-tuning that are very commonly happening now.