Stanford XCS224U: NLU | Contextual Word Representations, Part 4: GPT | Spring 2023

This is part four in our series on contextual representations. Our focus is GPT, probably the most famous transformer-based architecture. The heart of the model is the autoregressive loss function that's used to train it, and I'm going to unpack that technical piece with a bunch of illustrations.
There are a lot of mathematical details here, and I think the smart thing to do is zoom in on the numerator. What we're saying is that at position t, we look up the embedding representation for the token and take the dot product of that vector with the hidden representation that we have built up from the preceding context. The denominator does that same calculation for every item in the vocabulary, which is what turns these scores into a probability distribution. But again, the thing to keep an eye on is that the scoring is a dot product between the embedding representation for the token we want to predict and the hidden representation that we have built up through the time-step just before it.
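To make that scoring concrete, here is a minimal sketch in NumPy. The names are illustrative rather than the lecture's notation: E stands for the embedding matrix, with one row per vocabulary item, and h for the hidden representation built up from the preceding context.

```python
import numpy as np

# A minimal sketch of the softmax scoring just described. E is the embedding
# matrix (one row per vocabulary item); h is the hidden representation built up
# from the context before position t.
def next_token_distribution(E, h):
    scores = E @ h                        # dot product of each embedding with h
    scores = scores - scores.max()        # stabilize before exponentiating
    exp_scores = np.exp(scores)           # the numerator, for every vocabulary item
    return exp_scores / exp_scores.sum()  # normalize: softmax over the vocabulary

# Toy example: a vocabulary of 5 items and a 3-dimensional hidden state.
E = np.random.randn(5, 3)
h = np.random.randn(3)
p = next_token_distribution(E, h)         # p[i] is the probability of token i
```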
Here's that same thing by way of an illustration. I think it's good to keep track of start and end tokens here. We process the start token and form some hidden representation H1, and that hidden representation gets dot-producted with the embedding representation for "the". Remember, that's the numerator in that scoring function. At time-step 2, we copy over "the" and continue. Here I'm depicting something like a recurrent neural network, traveling left to right, just to keep things simple; I'll talk about how GPT handles this a bit later. We process "the", we get a second hidden representation H2, and the sequence modeling continues in exactly the same way. Remember, what we're doing is getting a score for "rock" that is proportional to the dot product of the embedding for that token with the hidden representation we have built up prior to the point where we're making the prediction. Then we exponentiate that for the sake of the softmax scoring.
With GPT, we're essentially just doing this in the context of a transformer. I've got at the bottom here a traditional absolute encoding scheme for positions. We look up all those static vector representations, and we get our first contextual representations, in green as usual. Then for GPT, we might stack lots and lots of transformer blocks. Eventually, though, we will get some output representations. Those are the ones I've depicted in green here, and they will be the basis for language modeling. We will add on top of those some language-modeling-specific parameters, and those will be the basis for predicting the actual sequence. We get an error signal to the extent that we're making predictions into this space that don't correspond to the actual one-hot encoding vectors for the sequence itself. That's conditional language modeling using the transformer architecture.
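As a hedged sketch of that setup, here is what the language modeling head and the error signal might look like in PyTorch. The sizes and names are illustrative, not GPT's actual configuration.

```python
import torch
import torch.nn as nn

# Language-modeling-specific parameters on top of the transformer's output states,
# trained with cross-entropy against the actual tokens of the sequence.
vocab_size, d_model = 1000, 64
lm_head = nn.Linear(d_model, vocab_size)

output_states = torch.randn(2, 5, d_model)            # [batch, seq_len, d_model]
actual_tokens = torch.randint(0, vocab_size, (2, 5))  # the sequence we want to predict

logits = lm_head(output_states)                       # predictions into vocabulary space
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), actual_tokens.reshape(-1)
)
```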
Maybe the one thing to keep in mind here is that, because of the nature of the attention mechanisms, we need to do some masking to make sure that we don't, in effect, look into the future. We start with position A. At that first position, we can attend only to ourselves, because we can't look into the future. I haven't depicted that, but we could do self-attention there. At position B, we now have the opportunity to look back to position A and get that dot product. At position C, we can look back to the previous two positions. The attention mask is going to have this look, where we go backwards but not forwards, so that we don't end up looking into the future, into tokens that we have not yet generated.
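Here is a minimal sketch of that causal mask, assuming a simple matrix of raw attention scores; the shapes are illustrative.

```python
import torch

# Each position may attend to itself and to earlier positions, never to later ones.
seq_len = 4
allowed = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # lower triangle

scores = torch.randn(seq_len, seq_len)                # raw attention scores
scores = scores.masked_fill(~allowed, float("-inf"))  # block looking into the future
weights = torch.softmax(scores, dim=-1)               # future positions get weight 0
```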
Again, there are a lot of technical details hiding here. What I'm going to depict on this slide is, very specifically, training of a GPT-style model with what's called teacher forcing. This means that no matter what token we predict at each time-step, we will use the actual token of the sequence as the input at the next time-step. I have our input sequence represented as a series of one-hot vectors. These are lookup devices that will give us back the embedding representations for words from our embedding space. We do those lookups, and that gives us a bunch of vectors; as a shorthand, I have depicted the names of those vectors. Then we have a whole bunch of those transformer blocks, which I've summarized in green here. Just a reminder, I've got all those arrows showing you the attention pattern, so that we always look into the past and never into the future. For the language modeling predictions, we're going to use our embedding parameters again; these are the same parameters that I've got depicted down here. Now we are essentially going to compare the scores that we predict with the one-hot vectors that actually correspond to the sequence that we want to predict. This was the start token, and conditional on that, we want to predict "the". This is "the" down here, and conditional on that, we want to predict "rock". You do get this offset, where the input sequence and the output sequence are staggered by one, so that we're always predicting the next token conditional on what we've seen. That's the comparison that we make to get our error signal.
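Here is a hedged sketch of that training step, with illustrative sizes: the embedding matrix E is used both to embed the inputs and to score the output states, and the inputs and targets are staggered by one position.

```python
import torch
import torch.nn.functional as F

vocab_size, d_model = 1000, 64
E = torch.randn(vocab_size, d_model, requires_grad=True)  # shared embedding parameters

token_ids = torch.randint(0, vocab_size, (2, 6))          # e.g. <start> ... <end>
inputs, targets = token_ids[:, :-1], token_ids[:, 1:]     # staggered by one time-step

x = E[inputs]                    # embedding lookups for the input tokens
h = x                            # stand-in for the stack of transformer blocks
logits = h @ E.T                 # score every vocabulary item at every position
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
```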
That's how this actually happens in practice. We always think of language modeling as predicting tokens, but really and truly the model predicts scores over the entire vocabulary, and then we make a choice about which token was actually predicted by, for example, picking the token that had the largest score.
I've mentioned teacher forcing in this context. To see what that means, I think we should imagine that at this time-step here, the model has actually assigned the highest score to whatever word is in the final position here, rather than to the token we wanted. That will give us an error signal, because we have in effect made a mistake that we can learn from. The teacher forcing aspect of this is that I do not use the one-hot vector for that predicted token as the input at the next time-step. We could instead go into a mode where, at least some of the time, we do use the one-hot vector for the model's predicted token as the input at the next time-step. That would effectively be using the model's predictions rather than the actual sequence, and that could introduce some useful diversity into the learning process. But usually, we do something like teacher forcing, where even though we got an error signal here, we use the actual token that we wanted to have predicted as the input at the next time-step.
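The alternative just mentioned, feeding the model's own prediction back in some of the time, is often called scheduled sampling. Here is a hedged sketch of the per-step choice; the helper name and the mixing rate are hypothetical.

```python
import torch

# Decide what to feed in at the next time-step: with probability p_model, use the
# model's own prediction; otherwise use the actual token (teacher forcing).
def next_input_token(actual_id: int, predicted_id: int, p_model: float = 0.25) -> int:
    use_prediction = torch.rand(()).item() < p_model
    return predicted_id if use_prediction else actual_id
```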
What I've depicted on the slide here is something like imagining that the user has prompted the model with the sequence consisting of the start token and "the". The model has predicted "rock" as the next token, and conditional on that, it has continued generating. Now we have generated "the rock rolls" as our sequence. Notice that you might have expected this to be "the rock rules", and the model ended up predicting something different. In this generation mode, there is no teacher forcing, because we are creating new tokens: we have to use the scores that we got up here to infer a next token, and that token becomes the input at the next time-step. But throughout this entire process, again, a reminder: the model predicts scores over the vocabulary, as depicted here, and we do some inferencing to figure out which token that ought to be. There are lots of schemes for doing that. You could pick the highest-scoring token at each time-step, but you could also roll out over a whole bunch of time-steps, look at all the different predictions, and generate the sequence that maximizes the overall probability, and that's very different from the per-step maximum-probability rule that I'm depicting here. In every case, though, what we're doing is applying a decision rule on top of the representations that these models have created. It's not intrinsic to the models themselves that they follow that particular decision rule.
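To make the inference step concrete, here is a minimal sketch of greedy, step-by-step generation. `model` is a hypothetical causal language model that returns scores over the vocabulary, and argmax is just one possible decision rule.

```python
import torch

# Greedy autoregressive generation: at each step, score the vocabulary at the last
# position, apply a decision rule (argmax here), and feed the chosen token back in.
def greedy_generate(model, prompt_ids, max_new_tokens=10):
    ids = prompt_ids                              # [1, prompt_len] token ids
    for _ in range(max_new_tokens):
        logits = model(ids)                       # [1, seq_len, vocab_size]
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)   # the prediction becomes new input
    return ids
```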
Final step here: when we think about fine-tuning GPT, the standard mode is to process a sequence and then use the final output state as the basis for some task-specific parameters that you use to fine-tune on whatever supervised learning task you're focused on. We could also think about using all of the output states that the model has created, maybe by doing some max or mean pooling over them, to gather more information from the sequence than is contained in that final output state alone.
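Here is a hedged sketch of those two options, with illustrative sizes: the same task-specific classifier applied to the final output state, or to a mean-pooled summary of all the output states.

```python
import torch
import torch.nn as nn

d_model, num_classes = 64, 3
classifier = nn.Linear(d_model, num_classes)     # task-specific parameters

output_states = torch.randn(2, 5, d_model)       # [batch, seq_len, d_model]

logits_final = classifier(output_states[:, -1, :])      # final output state only
logits_pooled = classifier(output_states.mean(dim=1))   # mean pooling over all states
```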
To round this out, I thought I'd just show you some GPTs that have been released, along with some information about how they're structured, to the extent that we know how they're structured. The first GPT had 12 layers, a model dimensionality of 768, and a feed-forward dimensionality of 3,072. That's the one point in the model where you expand out before collapsing back down to the model dimensionality inside the feed-forward layers. That gave rise to a model that had 117 million parameters.
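As a reminder of that expand-and-collapse point, here is a minimal sketch of the position-wise feed-forward block at GPT-1 scale, projecting 768 up to 3,072 and back down; the GELU activation matches the original GPT paper, but the block is otherwise simplified.

```python
import torch.nn as nn

# Expand from the model dimensionality to the inner feed-forward dimensionality,
# apply a nonlinearity, then collapse back down.
feed_forward = nn.Sequential(
    nn.Linear(768, 3072),
    nn.GELU(),
    nn.Linear(3072, 768),
)
```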
GPT-2 scaled that up considerably, to 48 layers and a model dimensionality of 1,600, for about 1.5 billion parameters. Then GPT-3 had 96 layers and a massive model dimensionality of 12,288. As far as I know, we don't know the dimensionality of the inner feed-forward layer. That gave rise to a model that had 175 billion parameters, and the GPT-3 paper also reports on models that are intermediate in those sizes.
If you want to think about truly open alternatives, here is a fast summary of the models that I know about. This table is probably already hopelessly out of date by the time you are viewing it, but it does give you a sense for the kinds of things that have happened on the open-source side. I would say that the hopeful aspect of this is that there are a lot of these models now, and some of them are quite competitive in terms of their overall size. For example, the BLOOM model there has 176 billion parameters, and it's truly gargantuan in terms of its dimensionalities and so forth. There are some other, smaller ones here that are obviously very performant: smaller, but very powerful, very interesting artifacts in the GPT mode.