Stanford XCS224U: NLU I Contextual Word Representations, Part 4: GPT I Spring 2023

00:00:06.080 | 
This is part four in our series on contextual representations. 00:00:11.200 | 
the most famous transformer-based architecture, 00:00:19.720 | 
that is the autoregressive loss function that's 00:00:26.180 | 
that technical piece with a bunch of illustrations. 00:00:30.880 | 
There are a lot of mathematical details here. 00:00:32.780 | 
I think the smart thing to do is zoom in on the numerator. 00:00:36.520 | 
What we're saying here is that at position T, 00:00:39.760 | 
we're going to look up the token representation 00:00:44.360 | 
and do a dot product of that vector representation with 00:00:48.120 | 
the hidden representation that we have built up in 00:00:59.540 | 
So we do that same calculation for every item in the vocabulary. 00:01:09.340 | 
But again, the thing to keep an eye on is that the scoring is 00:01:14.560 | 
the embedding representation for the token we want to 00:01:17.620 | 
predict and the hidden representation that we have 00:01:20.420 | 
built up until the time before that time-step. 00:01:24.540 | 
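
The scoring just described can be sketched in a few lines of plain Python. This is a minimal illustration rather than anything from the course materials: the three-word vocabulary, the 2-dimensional embeddings, the hidden state `h_prev`, and the helper name `next_token_probs` are all made up for the example.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def next_token_probs(embeddings, h_prev):
    """Softmax over the dot products of every token embedding with the
    hidden state built up before the current time step."""
    scores = {tok: dot(vec, h_prev) for tok, vec in embeddings.items()}
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - m) for tok, s in scores.items()}
    z = sum(exps.values())    # the denominator: a sum over the vocabulary
    return {tok: e / z for tok, e in exps.items()}

# Toy values (assumptions for illustration only).
embeddings = {"the": [1.0, 0.0], "rock": [0.0, 1.0], "rules": [1.0, 1.0]}
h_prev = [0.2, 0.9]

probs = next_token_probs(embeddings, h_prev)
```

The numerator for each token is the exponentiated dot product; dividing by the sum over the whole vocabulary turns the scores into a probability distribution.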
Here's that same thing by way of an illustration. 00:01:31.900 | 
I think it's good to keep track of start and end tokens here. 00:01:43.660 | 
and then we form some hidden representation H1. 00:01:56.180 | 
dot producted with the embedding representation for the. 00:01:59.820 | 
Remember, that's the numerator in that scoring function. 00:02:03.100 | 
At time-step 2, we now copy over the, and we continue, 00:02:09.060 | 
Here I'm depicting this like a recurrent neural network. 00:02:12.180 | 
So we're traveling left to right just to keep things simple. 00:02:14.860 | 
I'll talk about how GPT handles this a bit later. 00:02:19.820 | 
and we got a second hidden representation H2, 00:02:29.260 | 
Then the sequence modeling continues in exactly that same way 00:02:37.980 | 
remember what we're doing is getting a score for the rock that is 00:02:41.500 | 
proportional to the dot product of the embedding for that token, 00:02:48.780 | 
prior to that point that we're making the prediction at. 00:02:51.580 | 
Then we exponentiate that for the sake of doing that softmax scoring. 00:02:58.180 | 
we're essentially just doing this in the context of a transformer. 00:03:03.060 | 
I've got at the bottom here a traditional absolute encoding scheme for positions. 00:03:07.900 | 
We look up all those static vector representations, 00:03:11.860 | 
and we get our first contextual representations in green as usual. 00:03:16.700 | 
Then for GPT, we might stack lots and lots of 00:03:22.660 | 
Eventually though, we will get some output representations. 00:03:26.180 | 
Those are the ones that I've depicted in green here, 00:03:28.740 | 
and those will be the basis for language modeling. 00:03:31.420 | 
We will add on top of those some language modeling specific parameters, 00:03:40.980 | 
That will be the basis for predicting the actual sequence. 00:03:45.300 | 
We get an error signal to the extent that we're making 00:03:48.060 | 
predictions into this space that don't correspond to 00:03:51.820 | 
the actual one-hot encoding vectors that correspond to the sequence itself. 00:04:00.500 | 
that conditional language modeling using the transformer architecture. 00:04:04.500 | 
Maybe the one thing to keep in mind here is that 00:04:06.980 | 
because of the nature of the attention mechanisms, 00:04:09.700 | 
we need to do some masking to make sure that we don't in effect look into the future. 00:04:16.780 | 
We start with position A. At that first position, 00:04:23.220 | 
ourselves because we can't look into the future. 00:04:25.700 | 
I haven't depicted that, but we could do self-attention. 00:04:30.700 | 
we now have the opportunity to look back into position A and get that dot product. 00:04:41.740 | 
we can look back to the previous two positions. 00:04:44.620 | 
The attention mask is going to have this look where we go backwards, 00:04:48.700 | 
but not forwards so that we don't end up looking into 00:04:51.820 | 
the future into tokens that we have not yet generated. 00:05:00.220 | 
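
The masking pattern described here can be sketched as follows. This is a toy illustration under stated assumptions: the raw attention scores are invented numbers, and `causal_mask` and `masked_softmax` are hypothetical helper names, not functions from any particular library.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j,
    that is, when j is at or before i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax over one row of attention scores; masked-out (future)
    positions receive a weight of exactly zero."""
    kept = [s for s, keep in zip(scores, mask_row) if keep]
    m = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - m) if keep else 0.0
            for s, keep in zip(scores, mask_row)]
    z = sum(exps)
    return [e / z for e in exps]

mask = causal_mask(3)

# Made-up raw attention scores for positions A, B, C.
raw = [[0.5, 2.0, 1.0],
       [1.0, 1.0, 3.0],
       [0.1, 0.2, 0.3]]

weights = [masked_softmax(row, mask_row) for row, mask_row in zip(raw, mask)]
```

Position A can only attend to itself, position B attends to A and B, and position C attends to all three, which is the backwards-but-not-forwards pattern.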
there are a lot of technical details hiding here. 00:05:02.780 | 
What I'm going to depict on this slide is very specifically 00:05:06.540 | 
training of a GPT style model with what's called teacher forcing. 00:05:11.460 | 
This is going to mean that no matter what token we predict at every time step, 00:05:16.340 | 
we will use the actual token at the next time step. 00:05:23.860 | 
I have our input sequence represented as a series of one-hot vectors. 00:05:29.420 | 
These are lookup devices that will give us back 00:05:32.660 | 
the embedding representations for words from our embedding space. 00:05:39.460 | 
We do those lookups and that gives us a bunch of vectors. 00:05:42.500 | 
As a shorthand, I have depicted the names of those vectors, 00:05:52.300 | 
Then we have a whole bunch of those transformer blocks. 00:05:55.900 | 
What I've done is summarize them in green here. 00:05:58.620 | 
Just a reminder, I've got all those arrows showing you 00:06:02.180 | 
the attention pattern so that we always look into the past, 00:06:09.740 | 
we're going to use our embedding parameters again. 00:06:12.900 | 
These are the same parameters that I've got depicted down here. 00:06:17.140 | 
Now we are going to compare essentially the scores that we 00:06:27.100 | 
actually correspond to the sequence that we want to predict. 00:06:30.060 | 
This was the start token and conditional on that, 00:06:34.860 | 
This is the down here and conditional on that, 00:06:39.940 | 
You do get this offset where the input sequence and 00:06:42.900 | 
the output sequence are staggered by one so that we're always 00:06:46.340 | 
conditional on what we've seen predicting the next token. 00:07:00.660 | 
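
A small sketch of the staggered input and output sequences under teacher forcing. The probability table below is a made-up stand-in for a trained model, and it conditions only on the previous token to keep the example short; a real GPT conditions on the entire prefix. The token names `<s>` and `</s>` are assumed start and end symbols.

```python
import math

tokens = ["<s>", "the", "rock", "rules", "</s>"]
inputs = tokens[:-1]   # <s>  the   rock   rules
targets = tokens[1:]   # the  rock  rules  </s>

# Hypothetical next-token distributions, keyed by the previous token.
TABLE = {
    "<s>":   {"the": 0.9, "rock": 0.1},
    "the":   {"rock": 0.8, "rules": 0.2},
    "rock":  {"rolls": 0.6, "rules": 0.4},
    "rules": {"</s>": 1.0},
}

predictions = []
losses = []
for x, y in zip(inputs, targets):
    probs = TABLE[x]
    # The model's own choice at this step (highest score).
    predictions.append(max(probs, key=probs.get))
    # Error signal: negative log probability of the actual next token.
    losses.append(-math.log(probs[y]))
    # Teacher forcing: the next x comes from `inputs` (the actual
    # sequence), never from `predictions`.

mean_loss = sum(losses) / len(losses)
```

At the third step the toy model prefers "rolls" over the actual token "rules", so that step contributes the largest loss, yet the input at the fourth step is still the actual token "rules".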
the comparison that we make to get our error signal. 00:07:12.460 | 
That's how this actually happens in practice. 00:07:14.860 | 
We always think of language modeling as predicting tokens, 00:07:19.260 | 
but really and truly it predicts scores over the entire vocabulary, 00:07:24.100 | 
and then we make a choice about which token was actually predicted by, 00:07:28.220 | 
for example, picking the token that had the largest score. 00:07:33.780 | 
I've mentioned teacher forcing in this context. 00:07:38.620 | 
I think we should imagine that at this time step here, 00:07:44.100 | 
the highest score to whatever word was in the final position here, 00:07:50.780 | 
This will give us an error signal because we have in 00:07:53.180 | 
effect made a mistake that we can learn from. 00:07:56.660 | 
The teacher forcing aspect of this is that I do not use 00:08:00.220 | 
the vector consisting of all zeros and a one down here at this time step, 00:08:10.260 | 
we could go into a mode where at least some of the time, 00:08:13.260 | 
we use the vector consisting of all zeros and 00:08:16.140 | 
a one down here as the input at the next time step. 00:08:19.100 | 
That would effectively be using the model's predictions rather 00:08:25.820 | 
and that could introduce some useful diversity 00:08:30.860 | 
But usually, we do something like teacher forcing 00:08:33.780 | 
where even though we got an error signal here, 00:08:36.100 | 
we use the actual thing that we wanted to have 00:08:49.100 | 
What I've depicted on the slide here is something like imagining that 00:08:52.420 | 
the user has prompted the model with the sequence start token and the. 00:08:58.300 | 
The model has predicted rock as the next token. 00:09:11.940 | 
Now we have generated the rock rolls as our sequence. 00:09:22.340 | 
Notice that you might have expected that this would be the rock rules, 00:09:26.060 | 
and the model ended up predicting something different. 00:09:36.740 | 
teacher forcing because we are creating new tokens. 00:09:40.100 | 
We have to use the scores that we got up here to infer a next token, 00:09:51.300 | 
But throughout this entire process, again, a reminder, 00:09:57.700 | 
The model predicts scores over the vocabulary as depicted here, 00:10:02.500 | 
and we do some inferencing to figure out which token that ought to be. 00:10:08.420 | 
there are lots of schemes for doing that sampling. 00:10:13.740 | 
but you could also roll out over a whole bunch of time steps, 00:10:18.260 | 
looking at all the different predictions and generate 00:10:22.420 | 
the sequence that maximizes that overall probability. 00:10:28.060 | 
and that's very different from the maximum probability step that I'm depicting here. 00:10:37.260 | 
what we're doing is applying a decision rule on 00:10:39.860 | 
top of the representations that these models have created. 00:10:43.420 | 
It's not so intrinsic to the models themselves, 00:10:46.660 | 
that they follow that particular decision rule. 00:10:53.100 | 
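
To make the point about decision rules concrete, here is a toy comparison of three of them applied to the same scores. The two-step "model" is entirely hypothetical (the tokens and probabilities are invented), and the exhaustive search stands in for the idea of rolling out over several time steps; practical systems typically approximate it with something like beam search.

```python
import itertools
import random

# Hypothetical two-step model: first-token probabilities, then
# second-token probabilities conditioned on the first token.
STEP1 = {"a": 0.6, "b": 0.4}
STEP2 = {"a": {"x": 0.55, "y": 0.45}, "b": {"x": 0.9, "y": 0.1}}

def greedy_decode():
    """Pick the highest-probability token independently at each step."""
    first = max(STEP1, key=STEP1.get)
    second = max(STEP2[first], key=STEP2[first].get)
    return (first, second), STEP1[first] * STEP2[first][second]

def best_sequence():
    """Score every two-token sequence and return the most probable one."""
    candidates = itertools.product(STEP1, ["x", "y"])
    scored = {(f, s): STEP1[f] * STEP2[f][s] for f, s in candidates}
    best = max(scored, key=scored.get)
    return best, scored[best]

def sample_decode(rng):
    """Draw each token at random in proportion to its probability."""
    first = rng.choices(list(STEP1), weights=list(STEP1.values()), k=1)[0]
    dist = STEP2[first]
    second = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
    return first, second

greedy_seq, greedy_p = greedy_decode()
best_seq, best_p = best_sequence()
sampled_seq = sample_decode(random.Random(0))
```

Greedy decoding commits to "a" at the first step, while the most probable full sequence starts with "b". The scores are the same in both cases; only the rule applied on top of them differs.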
Final step here, when we think about fine-tuning GPT, 00:10:57.620 | 
the standard mode is to process a sequence and then use 00:11:01.700 | 
the final output state as the basis for some task-specific parameters 00:11:06.340 | 
that you use to fine-tune on whatever supervised learning task you're focused on. 00:11:13.180 | 
We could also think about using all of the output states that the model has created, 00:11:17.740 | 
maybe by doing some max or mean pooling over them to gather 00:11:21.940 | 
more information from the sequence than is just contained in that final output state. 00:11:35.340 | 
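
The options for summarizing the output states can be sketched as below. The output states and the classifier weights are invented numbers, and `linear_head` is a hypothetical stand-in for the task-specific parameters; this is only an illustration of the pooling choices, not a fine-tuning recipe.

```python
def last_state(states):
    """Standard mode: use only the final output state."""
    return states[-1]

def mean_pool(states):
    """Average each dimension across all output states."""
    n = len(states)
    return [sum(col) / n for col in zip(*states)]

def max_pool(states):
    """Take the maximum of each dimension across all output states."""
    return [max(col) for col in zip(*states)]

def linear_head(features, weights, bias):
    """Hypothetical task-specific layer: one score per class."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]

# Made-up output states for a 3-token sequence, 2 dimensions each.
states = [[1.0, -2.0], [3.0, 0.0], [2.0, 4.0]]

# Made-up parameters for a 2-class task.
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]

logits_last = linear_head(last_state(states), W, b)
logits_mean = linear_head(mean_pool(states), W, b)
logits_max = linear_head(max_pool(states), W, b)
```

The three summaries feed the same task-specific layer; the difference is how much of the sequence each summary draws on directly.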
To round this out, I thought I'd just show you some GPTs that have been released, 00:11:45.020 | 
along with some information about how they're 00:11:47.180 | 
structured to the extent that we know how they're structured. 00:11:50.020 | 
The first GPT had 12 layers and a model dimensionality of 768, 00:11:58.980 | 
That's that one point in the model where you can expand out before 00:12:02.300 | 
collapsing back to d_k inside the feed-forward layers. 00:12:06.260 | 
That gave rise to a model that had 117 million parameters. 00:12:11.220 | 
GPT-2 scaled that up considerably to 48 layers, 00:12:25.860 | 
Then GPT-3 had 96 layers and a massive model dimensionality, 00:12:35.580 | 
As far as I know, we don't know the dimensionality of that inside feed-forward layer, 00:12:43.020 | 
That gave rise to a model that had 175 billion parameters. 00:12:51.580 | 
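
As a rough sanity check on these figures, a common back-of-the-envelope estimate relates layer count and model dimensionality to parameter count. The sketch below assumes that each layer has four d-by-d attention projections and a feed-forward layer whose inner size is four times the model dimensionality (the ratio GPT-1 uses; assuming it for GPT-3 is a guess made for this example). It ignores embeddings, biases, and layer norms, so it is an approximation, not an exact count. The function name is hypothetical.

```python
def approx_block_params(n_layers, d_model):
    """Rough parameter count for the transformer blocks alone.

    Per layer: 4 * d^2 for the attention projections (query, key,
    value, output) plus 8 * d^2 for a feed-forward layer that expands
    to 4 * d and collapses back. Embeddings, biases, and layer norms
    are ignored (assumption for this estimate).
    """
    return 12 * n_layers * d_model ** 2

gpt1_blocks = approx_block_params(12, 768)     # 84,934,656
gpt3_blocks = approx_block_params(96, 12288)   # 173,946,175,488
```

For GPT-3 the estimate lands within about one percent of 175 billion. For the first GPT it falls short of 117 million because, at that small scale, the embedding parameters left out of the estimate make up a substantial share of the total.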
on models that are intermediate in those sizes. 00:12:57.260 | 
If you want to think about truly open alternatives, 00:12:59.860 | 
here is a fast summary of the models that I know about. 00:13:03.540 | 
This table is probably already hopelessly out of date by the time you are viewing it, 00:13:08.540 | 
but it does give you a sense for the kinds of things that have happened on the open-source side. 00:13:14.220 | 
I would say that the hopeful aspect of this is that there are a lot of these models now, 00:13:18.460 | 
and some of them are quite competitive in terms of their overall size. 00:13:22.220 | 
For example, the BLOOM model there has 176 billion parameters, 00:13:27.260 | 
and it's truly gargantuan in terms of its dimensionalities and so forth. 00:13:31.540 | 
There are some other smaller ones here that are obviously very performant, 00:13:36.060 | 
but very powerful, very interesting artifacts in the GPT mode.