Stanford XCS224U: NLU | Contextual Word Representations, Part 4: GPT | Spring 2023

This is part four in our series on contextual representations. Our focus is GPT, probably the most famous transformer-based architecture. The heart of the model is the autoregressive loss function that's used to train it, and I'm going to unpack that technical piece with a bunch of illustrations.
There are a lot of mathematical details here, and I think the smart thing to do is zoom in on the numerator. What we're saying is that at position t, we look up the embedding representation for the token and take the dot product of that vector with the hidden representation that we have built up from the preceding context. The denominator does that same calculation for every item in the vocabulary, which is what turns these scores into a probability distribution. But again, the thing to keep an eye on is that the scoring is a dot product between the embedding representation for the token we want to predict and the hidden representation that we have built up through the time-step just before it.
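To make that scoring concrete, here is a minimal sketch in NumPy. The names are illustrative rather than the lecture's notation: E stands for the embedding matrix, with one row per vocabulary item, and h for the hidden representation built up from the preceding context.

```python
import numpy as np

# A minimal sketch of the softmax scoring just described. E is the embedding
# matrix (one row per vocabulary item); h is the hidden representation built up
# from the context before position t.
def next_token_distribution(E, h):
    scores = E @ h                        # dot product of each embedding with h
    scores = scores - scores.max()        # stabilize before exponentiating
    exp_scores = np.exp(scores)           # the numerator, for every vocabulary item
    return exp_scores / exp_scores.sum()  # normalize: softmax over the vocabulary

# Toy example: a vocabulary of 5 items and a 3-dimensional hidden state.
E = np.random.randn(5, 3)
h = np.random.randn(3)
p = next_token_distribution(E, h)         # p[i] is the probability of token i
```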
Here's that same thing by way of an illustration. I think it's good to keep track of start and end tokens here. We process the start token and form some hidden representation H1, and that hidden representation gets dot-producted with the embedding representation for "the". Remember, that's the numerator in that scoring function. At time-step 2, we copy over "the" and continue. Here I'm depicting something like a recurrent neural network, traveling left to right, just to keep things simple; I'll talk about how GPT handles this a bit later. We process "the", we get a second hidden representation H2, and the sequence modeling continues in exactly the same way. Remember, what we're doing is getting a score for "rock" that is proportional to the dot product of the embedding for that token with the hidden representation we have built up prior to the point where we're making the prediction. Then we exponentiate that for the sake of the softmax scoring.
With GPT, we're essentially just doing this in the context of a transformer. I've got at the bottom here a traditional absolute encoding scheme for positions. We look up all those static vector representations, and we get our first contextual representations, in green as usual. Then for GPT, we might stack lots and lots of transformer blocks. Eventually, though, we will get some output representations. Those are the ones I've depicted in green here, and they will be the basis for language modeling. We will add on top of those some language-modeling-specific parameters, and those will be the basis for predicting the actual sequence. We get an error signal to the extent that we're making predictions into this space that don't correspond to the actual one-hot encoding vectors for the sequence itself. That's conditional language modeling using the transformer architecture.
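As a hedged sketch of that setup, here is what the language modeling head and the error signal might look like in PyTorch. The sizes and names are illustrative, not GPT's actual configuration.

```python
import torch
import torch.nn as nn

# Language-modeling-specific parameters on top of the transformer's output states,
# trained with cross-entropy against the actual tokens of the sequence.
vocab_size, d_model = 1000, 64
lm_head = nn.Linear(d_model, vocab_size)

output_states = torch.randn(2, 5, d_model)            # [batch, seq_len, d_model]
actual_tokens = torch.randint(0, vocab_size, (2, 5))  # the sequence we want to predict

logits = lm_head(output_states)                       # predictions into vocabulary space
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), actual_tokens.reshape(-1)
)
```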
Maybe the one thing to keep in mind here is that, because of the nature of the attention mechanisms, we need to do some masking to make sure that we don't, in effect, look into the future. We start with position A. At that first position, we can attend only to ourselves, because we can't look into the future. I haven't depicted that, but we could do self-attention there. At position B, we now have the opportunity to look back to position A and get that dot product. At position C, we can look back to the previous two positions. The attention mask is going to have this look, where we go backwards but not forwards, so that we don't end up looking into the future, into tokens that we have not yet generated.
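Here is a minimal sketch of that causal mask, assuming a simple matrix of raw attention scores; the shapes are illustrative.

```python
import torch

# Each position may attend to itself and to earlier positions, never to later ones.
seq_len = 4
allowed = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # lower triangle

scores = torch.randn(seq_len, seq_len)                # raw attention scores
scores = scores.masked_fill(~allowed, float("-inf"))  # block looking into the future
weights = torch.softmax(scores, dim=-1)               # future positions get weight 0
```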
Again, there are a lot of technical details hiding here. What I'm going to depict on this slide is, very specifically, training of a GPT-style model with what's called teacher forcing. This means that no matter what token we predict at each time-step, we will use the actual token of the sequence as the input at the next time-step. I have our input sequence represented as a series of one-hot vectors. These are lookup devices that will give us back the embedding representations for words from our embedding space. We do those lookups, and that gives us a bunch of vectors; as a shorthand, I have depicted the names of those vectors. Then we have a whole bunch of those transformer blocks, which I've summarized in green here. Just a reminder, I've got all those arrows showing you the attention pattern, so that we always look into the past and never into the future. For the language modeling predictions, we're going to use our embedding parameters again; these are the same parameters that I've got depicted down here. Now we are essentially going to compare the scores that we predict with the one-hot vectors that actually correspond to the sequence that we want to predict. This was the start token, and conditional on that, we want to predict "the". This is "the" down here, and conditional on that, we want to predict "rock". You do get this offset, where the input sequence and the output sequence are staggered by one, so that we're always predicting the next token conditional on what we've seen. That's the comparison that we make to get our error signal.
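Here is a hedged sketch of that training step, with illustrative sizes: the embedding matrix E is used both to embed the inputs and to score the output states, and the inputs and targets are staggered by one position.

```python
import torch
import torch.nn.functional as F

vocab_size, d_model = 1000, 64
E = torch.randn(vocab_size, d_model, requires_grad=True)  # shared embedding parameters

token_ids = torch.randint(0, vocab_size, (2, 6))          # e.g. <start> ... <end>
inputs, targets = token_ids[:, :-1], token_ids[:, 1:]     # staggered by one time-step

x = E[inputs]                    # embedding lookups for the input tokens
h = x                            # stand-in for the stack of transformer blocks
logits = h @ E.T                 # score every vocabulary item at every position
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
```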
That's how this actually happens in practice. We always think of language modeling as predicting tokens, but really and truly the model predicts scores over the entire vocabulary, and then we make a choice about which token was actually predicted by, for example, picking the token that had the largest score.
I've mentioned teacher forcing in this context. To see what that means, I think we should imagine that at this time-step here, the model has actually assigned the highest score to whatever word is in the final position here, rather than to the token we wanted. That will give us an error signal, because we have in effect made a mistake that we can learn from. The teacher forcing aspect of this is that I do not use the one-hot vector for that predicted token as the input at the next time-step. We could instead go into a mode where, at least some of the time, we do use the one-hot vector for the model's predicted token as the input at the next time-step. That would effectively be using the model's predictions rather than the actual sequence, and that could introduce some useful diversity into the learning process. But usually, we do something like teacher forcing, where even though we got an error signal here, we use the actual token that we wanted to have predicted as the input at the next time-step.
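The alternative just mentioned, feeding the model's own prediction back in some of the time, is often called scheduled sampling. Here is a hedged sketch of the per-step choice; the helper name and the mixing rate are hypothetical.

```python
import torch

# Decide what to feed in at the next time-step: with probability p_model, use the
# model's own prediction; otherwise use the actual token (teacher forcing).
def next_input_token(actual_id: int, predicted_id: int, p_model: float = 0.25) -> int:
    use_prediction = torch.rand(()).item() < p_model
    return predicted_id if use_prediction else actual_id
```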
What I've depicted on the slide here is something like imagining that the user has prompted the model with the sequence consisting of the start token and "the". The model has predicted "rock" as the next token, and conditional on that, it has continued generating. Now we have generated "the rock rolls" as our sequence. Notice that you might have expected this to be "the rock rules", and the model ended up predicting something different. In this generation mode, there is no teacher forcing, because we are creating new tokens: we have to use the scores that we got up here to infer a next token, and that token becomes the input at the next time-step. But throughout this entire process, again, a reminder: the model predicts scores over the vocabulary, as depicted here, and we do some inferencing to figure out which token that ought to be. There are lots of schemes for doing that. You could pick the highest-scoring token at each time-step, but you could also roll out over a whole bunch of time-steps, look at all the different predictions, and generate the sequence that maximizes the overall probability, and that's very different from the per-step maximum-probability rule that I'm depicting here. In every case, though, what we're doing is applying a decision rule on top of the representations that these models have created. It's not intrinsic to the models themselves that they follow that particular decision rule.
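To make the inference step concrete, here is a minimal sketch of greedy, step-by-step generation. `model` is a hypothetical causal language model that returns scores over the vocabulary, and argmax is just one possible decision rule.

```python
import torch

# Greedy autoregressive generation: at each step, score the vocabulary at the last
# position, apply a decision rule (argmax here), and feed the chosen token back in.
def greedy_generate(model, prompt_ids, max_new_tokens=10):
    ids = prompt_ids                              # [1, prompt_len] token ids
    for _ in range(max_new_tokens):
        logits = model(ids)                       # [1, seq_len, vocab_size]
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)   # the prediction becomes new input
    return ids
```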
Final step here: when we think about fine-tuning GPT, the standard mode is to process a sequence and then use the final output state as the basis for some task-specific parameters that you use to fine-tune on whatever supervised learning task you're focused on. We could also think about using all of the output states that the model has created, maybe by doing some max or mean pooling over them, to gather more information from the sequence than is contained in that final output state alone.
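Here is a hedged sketch of those two options, with illustrative sizes: the same task-specific classifier applied to the final output state, or to a mean-pooled summary of all the output states.

```python
import torch
import torch.nn as nn

d_model, num_classes = 64, 3
classifier = nn.Linear(d_model, num_classes)     # task-specific parameters

output_states = torch.randn(2, 5, d_model)       # [batch, seq_len, d_model]

logits_final = classifier(output_states[:, -1, :])      # final output state only
logits_pooled = classifier(output_states.mean(dim=1))   # mean pooling over all states
```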
To round this out, I thought I'd just show you some GPTs that have been released, along with some information about how they're structured, to the extent that we know how they're structured. The first GPT had 12 layers, a model dimensionality of 768, and a feed-forward dimensionality of 3,072. That's the one point in the model where you expand out before collapsing back down to the model dimensionality inside the feed-forward layers. That gave rise to a model that had 117 million parameters.
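As a reminder of that expand-and-collapse point, here is a minimal sketch of the position-wise feed-forward block at GPT-1 scale, projecting 768 up to 3,072 and back down; the GELU activation matches the original GPT paper, but the block is otherwise simplified.

```python
import torch.nn as nn

# Expand from the model dimensionality to the inner feed-forward dimensionality,
# apply a nonlinearity, then collapse back down.
feed_forward = nn.Sequential(
    nn.Linear(768, 3072),
    nn.GELU(),
    nn.Linear(3072, 768),
)
```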
GPT-2 scaled that up considerably, to 48 layers and a model dimensionality of 1,600, for about 1.5 billion parameters. Then GPT-3 had 96 layers and a massive model dimensionality of 12,288. As far as I know, we don't know the dimensionality of the inner feed-forward layer. That gave rise to a model that had 175 billion parameters, and the GPT-3 paper also reports on models that are intermediate in those sizes.
If you want to think about truly open alternatives, here is a fast summary of the models that I know about. This table is probably already hopelessly out of date by the time you are viewing it, but it does give you a sense for the kinds of things that have happened on the open-source side. I would say that the hopeful aspect of this is that there are a lot of these models now, and some of them are quite competitive in terms of their overall size. For example, the BLOOM model there has 176 billion parameters, and it's truly gargantuan in terms of its dimensionalities and so forth. There are some other, smaller ones here that are obviously very performant: smaller, but very powerful, very interesting artifacts in the GPT mode.