Stanford CS224N NLP with Deep Learning | 2023 | Lecture 9 - Pretraining
00:00:12.440 |
which is another exciting topic on the road to modern natural language processing. 00:00:42.880 |
so this lecture, the Transformers lecture, and then to a lesser extent, 00:00:47.800 |
Thursday's lecture on natural language generation 00:00:51.680 |
will be sort of the set of lectures relevant to the assignments you have to do. 00:00:56.280 |
So assignment five is coming out on Thursday. 00:01:01.480 |
And the topics covered in this lecture, in the self-attention and transformers lecture, 00:01:06.640 |
and again, a little bit of natural language generation 00:01:10.440 |
And then the rest of the course will go through some really fascinating topics 00:01:14.600 |
in sort of modern natural language processing 00:01:17.640 |
that should be useful for your final projects, and future jobs, 00:01:25.240 |
But I think that today's lecture is significantly less technical in detail 00:01:31.800 |
than last Thursday's on self-attention and transformers, 00:01:35.600 |
but should give you an idea of the sort of world of pre-training 00:01:41.160 |
and sort of how it helps define natural language processing today. 00:01:46.600 |
So a reminder about assignment five, your project proposals 00:01:55.160 |
Try to get them in on time so that we can give you prompt feedback 00:02:09.240 |
The first topic today is a bit of a technical detail on word structure 00:02:16.160 |
and sort of how we model the input sequence of words that we get. 00:02:19.800 |
So when we were teaching Word2Vec and sort of all the methods 00:02:26.360 |
that we've talked about so far, we assumed a finite vocabulary. 00:02:29.840 |
So you had a vocabulary v that you define via whatever. 00:02:33.680 |
You've decided what the words are in that data. 00:02:36.640 |
And so you have some words like hat and learn. 00:02:44.680 |
It's in red because you've learned it properly. 00:02:46.920 |
Actually, let's replace hat and learn with pizza and tasty. 00:02:56.440 |
And you have an embedding that's been learned on your data 00:03:00.800 |
to sort of know what to do when you see those words. 00:03:06.020 |
maybe you see a variation like taaaaasty, and maybe a typo like laern, 00:03:11.640 |
or maybe novel items, like a word that you as a human 00:03:18.160 |
haven't seen before, such as Transformerify. This is derivational morphology: you take this word 00:03:22.240 |
transformer that you know and add -ify, which means take this noun 00:03:31.160 |
and make a verb out of it. To Transformerify NLP might mean to make NLP more like transformers. 00:03:39.000 |
And for each of these, this maybe didn't show up 00:03:48.960 |
And young people are always making new words. 00:03:54.640 |
though, because you've defined this finite vocabulary. 00:03:57.440 |
And there's sort of no mapping in that vocabulary 00:04:02.400 |
Even though their meanings should be relatively well 00:04:05.280 |
defined based on the data you've seen so far, 00:04:08.120 |
it's just that the sort of string of characters that define them 00:04:14.640 |
Well, maybe you map them to this sort of universal unknown token. 00:04:23.240 |
I'm going to say it's always represented by the same token UNK. 00:04:38.640 |
And so this is like a clear problem, especially-- 00:04:44.120 |
In many of the world's languages, it's a substantially larger problem. 00:04:49.000 |
So English has relatively simple word structure. 00:04:53.360 |
There's a couple of conjugations for each verb, like eat, eats, eaten, ate. 00:05:00.360 |
But in a language with much more complex morphology or word structure, 00:05:06.960 |
you'll have a considerably more complex sort of set of things 00:05:12.360 |
So here is a conjugation table for a Swahili verb. 00:05:20.840 |
And if I define the vocabulary to be every unique string of characters 00:05:24.800 |
maps to its own word, then every one of the 300 conjugations 00:05:28.400 |
would get an independent vector under my model, which makes no sense, 00:05:33.280 |
because the 300 conjugations obviously have a lot in common 00:05:41.240 |
You'd have to have a huge vocabulary if I wanted all conjugations to show up. 00:05:46.400 |
And that's a mistake for efficiency reasons and for learning reasons. 00:05:57.160 |
And so what we end up doing is we'll look at subword structure, 00:06:06.640 |
So what we're going to do is we're going to say, 00:06:08.680 |
if I can try to define what the set of all words is, 00:06:12.640 |
I'm going to define my vocabulary to include parts of words. 00:06:17.640 |
So I'm going to split words into sequences of known subwords. 00:06:30.280 |
And so there's a simple sort of algorithm for this, called byte-pair encoding. 00:06:35.480 |
So if I only had a vocabulary of all characters, 00:06:38.240 |
and maybe like an end of word symbol for a finite data set, 00:06:44.320 |
then no matter what word I saw in the future, 00:06:46.480 |
as long as I had seen all possible characters, 00:06:48.560 |
I could take the word and say, I don't know what this word is. 00:06:51.100 |
I'm going to split it into all of its individual characters. 00:06:57.360 |
And then you're going to find common adjacent characters and say, OK, 00:07:01.240 |
A and B co-occur next to each other quite a bit. 00:07:03.920 |
So I'm going to add a new word to my vocabulary. 00:07:07.120 |
Now it's all characters plus this new word A, B, which is a subword. 00:07:13.440 |
And likewise, so now I'm going to replace the character pair 00:07:16.040 |
with the new subword and repeat until you add a lot, a lot, a lot of vocabulary 00:07:20.720 |
items through this process of what things tend to co-occur next to each other. 00:07:24.520 |
And so what you'll end up with is a vocabulary 00:07:28.480 |
of very commonly co-occurring sort of substrings 00:07:33.540 |
And this was originally developed for machine translation, 00:07:36.000 |
but then it's been used considerably in pretty much all modern language models. 00:07:41.200 |
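To make the merge loop concrete, here is a minimal sketch of a byte-pair-encoding-style vocabulary learner. The toy corpus, word counts, and number of merges are made up for illustration, and the naive string replace is a simplification of what real tokenizer implementations do.

```python
from collections import Counter

def get_pair_counts(corpus):
    # Count how often each pair of adjacent symbols co-occurs across the corpus.
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, corpus):
    # Replace every occurrence of the chosen adjacent pair with one merged symbol.
    # (Naive string replace; real implementations track symbol boundaries more carefully.)
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in corpus.items()}

# Toy corpus: each word is pre-split into characters plus an end-of-word marker "_".
corpus = {"t a s t y _": 5, "t a a a a s t y _": 2, "h a t _": 6, "l e a r n _": 4}
vocab = {ch for word in corpus for ch in word.split()}

for _ in range(10):  # in practice, tens of thousands of merges
    pair_counts = get_pair_counts(corpus)
    if not pair_counts:
        break
    best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
    corpus = merge_pair(best, corpus)
    vocab.add("".join(best))  # the merged pair becomes a new subword in the vocabulary

print(sorted(vocab))
```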
So now we have hat and learn again; hat and learn 00:07:46.840 |
showed up enough that they're their own individual words. 00:07:51.800 |
So simple common words show up as a word in your vocabulary, 00:08:07.160 |
whereas taaaaasty gets split into 'taa', then 'aaa', then 'sty', right? 00:08:12.160 |
So I've actually taken one sort of thing that seems like a word, 00:08:15.200 |
and in my vocabulary, it's now split into three subword tokens. 00:08:20.120 |
So when I pass this to my transformer or to my recurrent neural network, 00:08:24.960 |
the recurrent neural network would take 'taa' as just a single element, 00:08:29.960 |
do the RNN update, and then take 'aaa', do the RNN update, and then 'sty'. 00:08:35.200 |
So it could learn to process constructions like this. 00:08:39.720 |
And maybe I can even add more 'aaa's in the middle, 00:08:44.080 |
instead of just seeing the entire word taaaaasty and not knowing what it means. 00:09:11.240 |
and so you can see that you have sort of three learned embeddings instead 00:09:17.760 |
This is just wildly useful and is used pretty much everywhere. 00:09:21.280 |
Variants of this algorithm are used pretty much everywhere in modern NLP. 00:09:28.640 |
If we have three embeddings for tasty, do we just add them together? 00:09:32.840 |
So the question is, if we have three embeddings for tasty, 00:09:39.920 |
so when we're actually processing the sequence, 00:09:42.520 |
I'd see something like I learned about the 'taa', 'aaa', 'sty'. 00:09:52.480 |
But if I wanted to then say, what's my representation of this thing? 00:09:58.800 |
Sometimes you average the contextual representations of the three 00:10:15.720 |
So you know where to split based on the algorithm 00:10:18.720 |
that I specified earlier for learning the vocabulary. 00:10:23.280 |
So you learn this vocabulary by just combining 00:10:25.800 |
commonly co-occurring adjacent strings of letters. 00:10:34.000 |
And then when I'm actually walking through and tokenizing, 00:10:38.560 |
So I split words into the maximal sort of subwords in my vocabulary. 00:10:45.120 |
Yeah, so I'm like, OK, if I want to split this up, 00:10:50.580 |
you try to find approximately the best way to split it into known subwords. 00:10:56.120 |
Does it seem to make sense to use punctuation in the character set? 00:11:00.520 |
So the question is, do people use punctuation in the character set? 00:11:12.760 |
that what text is given to these models is as unprocessed as possible. 00:11:17.680 |
You try to make it sort of clean looking text, where you've removed HTML tags, 00:11:26.240 |
But then beyond that, you process it as little as possible 00:11:29.120 |
so that it reflects as well as possible what people might actually 00:11:35.080 |
So maybe earlier in the course, when we were looking at Word2Vec, 00:11:40.280 |
oh, we don't want Word2Vec vectors of punctuation or something like that. 00:11:48.240 |
to what the text you'd get with people trying to use your system would be. 00:11:52.120 |
So yes, in practice, punctuation and dot, dot, dot 00:11:55.600 |
might be its own word, and maybe a sequence of hyphens, 00:12:11.800 |
Could be multiple embeddings versus a single embedding. 00:12:21.760 |
The question is, does the system treat any differently words 00:12:24.280 |
that are really themselves a whole word versus words that are pieces? 00:12:29.680 |
They're all just indices into your embedding vocabulary matrix. 00:12:37.960 |
What about really long words that are relatively common? 00:12:44.640 |
Because if you're building up from single character all the way up, 00:12:49.440 |
The question is, what happens to very long words 00:12:51.920 |
if you're building up from character pairs and portions of characters? 00:12:57.400 |
In practice, the statistics speak really well for themselves. 00:13:01.080 |
So if a long word is very common, it will end up in the vocabulary. 00:13:07.920 |
There are algorithms that aren't this that do slightly better in various ways. 00:13:13.000 |
But the intuition that you figure out what the common co-occurring 00:13:17.520 |
substrings are, independent of length almost, 00:13:22.040 |
And so you can actually just look at the learned vocabularies 00:13:26.600 |
And you see some long words just because they showed up a lot. 00:13:32.240 |
I'm curious, how does it weigh the frequency? 00:13:43.680 |
In your next slide, it was like 'ify' at the very last one. 00:13:50.120 |
So how does it weigh the frequency of a subword versus the length of it? 00:13:54.320 |
It tries to split it up into the smallest number of subwords. 00:13:56.920 |
But what if it split it up into three, but one of them was super common? 00:14:00.960 |
Yeah, so the question is, if transformer is a subword in my vocabulary, 00:14:05.920 |
and 'if' is a subword, and 'y' is a subword, and 'ify' as a three-letter tuple 00:14:12.840 |
is also a subword, how does it choose which split to take? 00:14:23.840 |
We choose to try to take the smallest number of subwords, 00:14:26.480 |
because that tends to be more of the bottleneck, as opposed 00:14:29.720 |
to having a bunch of very common, very short subwords. 00:14:34.800 |
Sequence length is a big problem in transformers. 00:14:39.360 |
Although trying to split things into multiple options of a sequence 00:14:44.600 |
is the thing that people have done to see which one will work better. 00:14:47.760 |
But yeah, having fewer bigger subwords tends to be the best sort of idea. 00:14:53.320 |
Feel free to ask me more questions about this afterward. 00:14:56.720 |
OK, so let's talk about pre-training from the context of the course so far. 00:15:03.120 |
So at the very beginning of the course, we gave you this quote, which was, 00:15:07.480 |
"You shall know a word by the company it keeps." 00:15:09.640 |
This was the sort of thesis of the distributional hypothesis, 00:15:13.640 |
that the meaning of the word is defined by, or at least reflected by, 00:15:23.960 |
The same person who made that quote had a separate quote, actually earlier, 00:15:29.720 |
that continues this notion of meaning as defined by context, which 00:15:34.800 |
has something along the lines of, well, since the word shows up 00:15:38.920 |
in context when we actually use it, when we speak to each other, 00:15:42.560 |
the meaning of the word should be defined in the context 00:15:47.480 |
And so the complete meaning of a word is always contextual, 00:15:51.360 |
and no study of meaning apart from a complete context 00:15:55.920 |
So the big difference here is, at Word2Vec training time, 00:16:01.240 |
if I have the word record, R-E-C-O-R-D, when I'm training Word2Vec, 00:16:07.920 |
I get one vector or two, but one vector meaning record, the string. 00:16:16.160 |
And it has to learn by what context it shows up in, 00:16:19.960 |
that sometimes it can mean I record, i.e. the verb, or a record, i.e. the noun. 00:16:30.480 |
And so when I use the Word2Vec embedding of record, 00:16:33.320 |
it sort of has this mixture meaning of both of its sort of senses, right? 00:16:38.960 |
It doesn't get to specialize and say, oh, this part means record the verb, and this part means record the noun. 00:16:45.040 |
And so Word2Vec is going to just sort of fail. 00:16:48.320 |
And so I can build better representations of language 00:16:51.360 |
through these contextual representations that 00:16:53.640 |
are going to take things like recurrent neural networks or transformers 00:16:56.640 |
that we used before to build up sort of contextual meaning. 00:17:03.320 |
So what we had before were pre-trained word embeddings. 00:17:07.600 |
And then we had sort of a big box on top of it, 00:17:10.960 |
like a transformer or an LSTM, that was not pre-trained, right? 00:17:15.160 |
So you learn via context your word embeddings here. 00:17:19.320 |
And then you have a task, like sentiment analysis or machine translation 00:17:25.760 |
And you initialize all the parameters of this randomly. 00:17:37.040 |
that we're going to try to pre-train all the parameters. 00:17:41.180 |
And instead of just pre-training my word embeddings with Word2Vec, 00:17:45.500 |
I'm going to train all of the parameters of the network, 00:17:57.600 |
So now the labeled data that I have for, say, machine translation 00:18:05.720 |
I might not need as much of it, because I've already 00:18:08.520 |
trained much more of the network than I otherwise 00:18:10.760 |
would have if I had just gotten Word2Vec embeddings. 00:18:13.480 |
So here, I've pre-trained this entire structure-- 00:18:23.640 |
Everything's been trained via methods that we'll talk about today. 00:18:28.680 |
I mean, it gives you very strong representations of language. 00:18:31.520 |
So the meaning of record and record will be different 00:18:36.120 |
in the sort of contextual representations that 00:18:38.920 |
know where in the sequence it is and what words are co-occurring with it 00:18:42.920 |
in this specific input than Word2Vec, which only has one representation 00:18:50.080 |
It'll also be used as strong parameter initializations for NLP models. 00:18:56.920 |
So every time so far that we've worked with building out a natural language processing system, we've initialized the parameters randomly. 00:19:03.680 |
And we always say, oh, small, normally distributed noise, 00:19:12.280 |
And here, we're going to say, well, just like we 00:19:14.800 |
were going to use the Word2Vec embeddings and those sort of encoded 00:19:18.440 |
structure, I'm going to start maybe my machine translation 00:19:21.400 |
system from a parameter initialization that's come out of pre-training. 00:19:27.380 |
And then also, it's going to give us probability distributions 00:19:29.880 |
over language that we can use to generate and otherwise. 00:19:42.020 |
All of these pre-training methods are going to be centered around this idea of reconstructing the input. 00:19:47.040 |
It's a sequence of text that some human has generated. 00:19:49.840 |
And the sort of hypothesis is that by masking out part of it 00:19:55.960 |
and tasking a neural network with reconstructing the original input, 00:20:00.720 |
that neural network has to learn a lot about language, about the world, 00:20:05.320 |
in order to do a good job of reconstructing the input. 00:20:07.960 |
So this is now a supervised learning problem, 00:20:13.520 |
Taking this sentence that just existed, Stanford University 00:20:16.120 |
is located in, say, Palo Alto, California, or Stanford, California, 00:20:23.240 |
And I have, by removing this part of the sentence, made a label for myself. 00:20:29.560 |
The input is this sort of broken masked sentence. 00:20:36.420 |
So if I give this example to a network and ask 00:20:41.940 |
it to predict the center thing, as it's doing its gradient step 00:20:45.360 |
on this input, it's going to encode information 00:20:47.760 |
about the co-occurrence between this context, Stanford University is located 00:20:53.680 |
So by tasking it with this, it might learn, say, where Stanford is. 00:20:59.320 |
Well, it can learn things about maybe syntax. 00:21:05.480 |
Here, there's only a certain set of words that could go here. 00:21:14.240 |
So the context shows me what kinds of words can appear 00:21:30.080 |
So this sort of co-reference between this entity 00:21:35.320 |
who is being discussed in the world, this woman, and her shoulder. 00:21:44.840 |
It's referring to the same entity in the discourse. 00:21:47.240 |
And so the network might be able to learn things about what 00:21:58.480 |
So if I went to the ocean to see the fish, turtles, seals, and blank, 00:22:02.800 |
then the word that's in the blank should be a member of the class 00:22:06.520 |
that I'm thinking of as a person writing this sentence of stuff 00:22:09.840 |
that I see when I go to the ocean and see these other things as well. 00:22:13.860 |
So in order to do this prediction task, maybe 00:22:15.840 |
I learn about the semantics of aquatic creatures. 00:22:24.580 |
I've got overall, the value I got from the two hours watching it 00:22:31.920 |
What kind of task could I be learning from doing 00:22:38.820 |
So this is just a naturalistic sort of text that I naturally wrote myself. 00:22:48.920 |
learning about sort of the latent sentiment of the person who 00:22:53.200 |
wrote this, what they were feeling about the movie at the time. 00:22:57.080 |
So maybe if I see a new review later on, I can just paste in the review, 00:23:07.200 |
could be implicitly solving the task of sentiment analysis. 00:23:16.760 |
Standing next to Iroh, Zuko pondered his destiny. 00:23:23.160 |
OK, so in this scenario, we've got a world implicitly 00:23:27.120 |
that's been designed by the person who is creating this text. 00:23:31.160 |
I've got physical locations in the discourse, like the kitchen. 00:23:47.120 |
And so in terms of latent notions of embodiment and physical location, 00:23:51.640 |
the way that people talk about people being next to something 00:23:57.800 |
stuff about sort of, yeah, a little bit about how the world works even. 00:24:06.360 |
I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21, blank. 00:24:19.640 |
If you had to model by looking at a bunch of numbers from the Fibonacci 00:24:23.120 |
sequence, learn to, in general, predict the next one, 00:24:27.160 |
it's a question you should be thinking about throughout the lecture. 00:24:34.240 |
of what you might learn from predicting the context? 00:24:44.240 |
So a very simple way to think about pre-training 00:24:49.340 |
So we saw language modeling earlier in the course. 00:24:51.640 |
And now we're just going to say, instead of using my language model just to generate, 00:24:59.600 |
I'm going to actually model the distribution p_theta of the next word given all the words that came before it, and use that to pre-train. 00:25:12.240 |
There's just an amazing amount of data for this in a lot of languages, 00:25:16.800 |
There's very little data for this in actually 00:25:18.680 |
most of the world's languages, which is a separate problem. 00:25:21.920 |
But you can pre-train just through language modeling, right? 00:25:24.340 |
So I'm going to sort of do the teacher forcing thing. 00:25:30.600 |
And I'm going to train my sort of LSTM or my transformer to do this task. 00:25:35.760 |
And then I'm just going to keep all the weights. 00:25:38.400 |
OK, I'm going to save all the network parameters. 00:25:41.000 |
And then once I have these parameters, instead 00:25:48.040 |
of initializing from scratch, I'm just going to use them as an initialization for my parameters. 00:25:52.280 |
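A rough sketch of what "pre-train on language modeling, then keep all the weights" means in code (PyTorch, with a random toy batch standing in for real text; the sizes and hyperparameters here are made up):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64
embed = nn.Embedding(vocab_size, d_model)
lstm = nn.LSTM(d_model, d_model, batch_first=True)   # could equally be a transformer decoder
to_vocab = nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (8, 32))        # a batch of raw text: no labels needed

# Teacher forcing: the input is every word but the last; the target is every word but the first.
hidden, _ = lstm(embed(tokens[:, :-1]))
logits = to_vocab(hidden)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size),
                                   tokens[:, 1:].reshape(-1))
loss.backward()   # one gradient step of pre-training

# After training on lots of text, save every parameter; these become the
# initialization for whatever downstream task you fine-tune on.
torch.save({"embed": embed.state_dict(), "lstm": lstm.state_dict(),
            "to_vocab": to_vocab.state_dict()}, "pretrained.pt")
```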
So I have this pre-training fine-tuning paradigm. 00:25:59.680 |
Let's say a large portion of you this year in your final projects 00:26:02.440 |
will be doing the pre-training fine-tuning sort of paradigm, 00:26:05.160 |
where someone has done the pre-training for you, right? 00:26:09.100 |
You learn very general things about the distribution of words 00:26:13.080 |
and sort of the latent things that that tells you about the world 00:26:17.400 |
And then in step two, you've got some task, maybe sentiment analysis. 00:26:26.720 |
And you adapt the pre-trained model to the task 00:26:29.840 |
that you care about by further doing gradient steps on this task. 00:26:37.960 |
And then you sort of continue to update the parameters 00:26:42.080 |
based on the initialization from the pre-training. 00:26:48.440 |
I mean, unbelievably well-- compared to training from scratch. 00:26:51.800 |
Intuitively, because you've taken a lot of the burden of learning 00:26:54.760 |
about language, learning about the world, off of the data 00:27:00.400 |
And you're sort of giving that task of learning 00:27:02.560 |
all this sort of very general stuff to the much more general task of language 00:27:07.460 |
You said we didn't have much data in other languages. 00:27:16.600 |
The question is, you said we have a lot of data in English, 00:27:22.320 |
What do you mean by data that we don't have a lot of in other languages? 00:27:29.960 |
Because you don't need annotations to do language model pre-training, right? 00:27:33.240 |
The existence of that sequence of words that someone has written 00:27:37.280 |
provides you with all these pairs of input and output. 00:27:44.800 |
Those are all labels sort of that you've constructed from the input just 00:27:49.520 |
But in most languages, even on the entire internet, 00:27:52.840 |
I mean, there's about 7,000-ish languages on Earth. 00:27:55.960 |
And most of them don't have the sort of billions of words 00:28:08.320 |
are you still learning one vector representation per word? 00:28:11.480 |
The question is, if you're pre-training the entire thing, 00:28:13.800 |
do you still learn one vector representation per word? 00:28:23.000 |
You've got your embedding matrix that is vocabulary size by hidden dimension size. 00:28:32.680 |
But then the transformer that you're learning on top of it 00:28:35.520 |
takes in the sequence so far and sort of gives a vector to each of them 00:28:39.440 |
that's dependent on the context in that case. 00:28:41.760 |
But still, at the input, you only have one embedding per word. 00:28:46.500 |
So what sort of metrics would you use to evaluate a pre-trained model? 00:28:58.700 |
use to evaluate pre-trained models since it's 00:29:02.740 |
But there are lots of very specific evaluations you could use. 00:29:07.300 |
We'll get into a lot of that in the rest of the lecture. 00:29:09.940 |
While you're training it, you can use simple metrics 00:29:13.900 |
that are easy to compute but aren't actually what you want, like the quality of the probabilities it assigns. 00:29:18.340 |
So you can evaluate the perplexity of your language model 00:29:21.180 |
just like you would have when you cared about language modeling. 00:29:23.760 |
And it turns out to be the case that better perplexity correlates 00:29:27.460 |
with all the stuff that's much harder to evaluate, 00:29:32.420 |
But also, the natural language processing community 00:29:34.460 |
has built very large sort of benchmark suites of varying tasks 00:29:39.520 |
to try to get at sort of a notion of generality, 00:29:45.540 |
And so when you develop new pre-training methods, what you often do 00:29:48.820 |
is you try to pick a whole bunch of evaluations 00:29:55.660 |
So why should this sort of pre-training, fine-tuning, two-part paradigm help? 00:30:06.740 |
This is still an open area of research, but the intuitions 00:30:10.380 |
are all you're going to take from this course. 00:30:12.500 |
So pre-training provides some sort of starting parameters, theta hat. 00:30:17.500 |
So this is like all the parameters in your network, 00:30:20.140 |
which come from trying to do this minimum of the pre-training loss over all possible settings of your parameters. 00:30:26.900 |
And then the fine-tuning process takes your data for fine-tuning. 00:30:32.580 |
And it tries to approximate the minimum through gradient descent 00:30:36.380 |
of the loss of the fine-tuning task of theta. 00:30:46.420 |
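In symbols, the two stages are roughly the following (writing L for the loss on each stage's data):

```latex
% Pre-training finds a starting point from lots of unlabeled text:
\hat{\theta} \approx \operatorname*{arg\,min}_{\theta} \; \mathcal{L}_{\text{pretrain}}(\theta)

% Fine-tuning then approximates, by gradient descent started at \hat{\theta}:
\min_{\theta} \; \mathcal{L}_{\text{finetune}}(\theta)
```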
And then if you could actually solve this min and wanted to, 00:30:51.900 |
it sort of feels like the starting point shouldn't matter. 00:31:03.900 |
But the process of gradient descent, maybe it 00:31:07.660 |
sticks relatively close to the theta hat during fine-tuning. 00:31:14.620 |
And then you sort of walk downhill with gradient descent 00:31:21.620 |
from there, and you end up somewhere good because it's close to the pre-training parameters, which were already good. 00:31:26.060 |
This is a cool place where sort of practice and theory 00:31:29.300 |
are sort of like meeting, where optimization people want 00:31:34.700 |
to understand why this works, and NLP people sort of just want to build better systems. 00:31:56.180 |
The question is, if stochastic gradient descent 00:31:59.180 |
sticks relatively close, what if we use a different optimizer? 00:32:01.780 |
I mean, if we use sort of any common variant of gradient 00:32:07.100 |
like Adam, which we use in this course, or AdaGrad, 00:32:10.420 |
or they all have this very, very similar properties. 00:32:14.860 |
Other types of optimization we just tend to not use. 00:32:21.700 |
the pre-training plus fine tuning works better than just 00:32:25.060 |
fine tuning, but making the model more powerful, 00:32:27.180 |
like adding more layers, more data, et cetera. 00:32:30.300 |
The question is, why does the pre-trained fine tune paradigm 00:32:33.540 |
work better than just making the model more powerful, 00:32:36.580 |
adding more layers, adding more data to just the fine tuning? 00:32:39.180 |
The simple answer is that you have orders of magnitude 00:32:51.860 |
more data for language modeling than you do for the carefully labeled data on the tasks, 00:32:59.660 |
sentiment or whatever, that you've had someone label carefully. 00:33:03.220 |
So you have something like on the internet at least 5 00:33:13.020 |
and you have maybe a million words of your labeled data 00:33:24.260 |
to do a very, very simple thing like sentiment analysis 00:33:28.180 |
is not going to get you a very generally able agent 00:33:34.940 |
in a wide range of settings compared to language modeling. 00:33:42.180 |
Even if you have a lot of labeled data of movie reviews 00:33:45.020 |
of the kind that people are writing today, maybe tomorrow 00:33:49.260 |
they start writing slightly different kinds of movie 00:33:51.380 |
reviews, and your system doesn't perform as well. 00:33:53.660 |
Whereas if you pre-trained on a really diverse set of text 00:33:58.900 |
it might be more adaptable to seeing stuff that doesn't quite 00:34:05.060 |
even if you showed it a ton of training data. 00:34:10.420 |
is that you get this huge amount of variety of text 00:34:17.100 |
I mean, yeah, you should be very careful about what kind of text 00:34:20.220 |
you're showing it and what kind of text you're not, 00:34:22.460 |
because the internet is full of awful text as well. 00:34:29.660 |
from how hard this problem is and how much data 00:34:34.420 |
--pre-trained model was trained on so much data. 00:34:37.780 |
How do you then train it so that it considers the stuff 00:34:42.140 |
that you're fine-tuning it with as more important, more 00:34:46.660 |
rather than just one in a billion articles of data? 00:34:51.900 |
So the question is, given that the amount of data 00:34:54.380 |
on the pre-training side is orders of magnitude 00:34:56.340 |
more than the amount of data on the fine-tuning side, 00:34:58.540 |
how do you get across to the model that, OK, actually, 00:35:11.900 |
So I've gotten my parameter initialization from this. 00:35:17.620 |
I move to where the parameters are doing well 00:35:25.060 |
about how to do this, because now I'm just asking 00:35:32.940 |
But we're going to keep talking about this in much more detail 00:35:38.180 |
So OK, so let's talk about model pre-training. 00:35:55.140 |
Let's talk about model pre-training three ways. 00:36:01.660 |
we talked about encoders, encoder decoders, and decoders. 00:36:04.980 |
And we'll do decoders last, because actually, 00:36:08.580 |
many of the largest models that are being used today are decoder-only models. 00:36:14.180 |
And so we'll have a bit more to say about them. 00:36:23.700 |
able to see the whole thing, kind of like an encoder 00:36:28.100 |
Encoder decoders have one portion of the network 00:36:34.140 |
So that's like the source sentence of my machine 00:36:37.900 |
And then they're sort of paired with a decoder that 00:36:42.420 |
have this sort of informational masking where 00:36:45.420 |
I can't see the future, so that I can do things 00:36:48.500 |
I can generate the next token of my translation, whatever. 00:36:51.260 |
So you could think of it as I've got my source sentence here, 00:36:54.820 |
and my partial translation here, and I'm sort of decoding 00:36:59.180 |
And then decoders only are things like language models. 00:37:09.100 |
And how you pre-train them and then how you use them 00:37:11.380 |
depends on the properties and the proactivities 00:37:18.740 |
So we've looked at language modeling quite a bit. 00:37:21.460 |
But we can't do language modeling with an encoder, 00:37:26.620 |
So if I'm down here at word i, and I want to predict the next word-- 00:37:33.460 |
it's a trivial task at this level here to predict the next word. 00:37:38.020 |
Because in the middle, I was able to look at the next word. 00:37:43.060 |
There's nothing hard about learning to predict the next word here, 00:37:45.560 |
because I could just look at it, see what it is, and then copy it over. 00:37:49.380 |
So when I'm training an encoder, I need a different pre-training objective. 00:37:57.380 |
In practice, what I do is something like this. 00:38:02.100 |
I mask out words, sort of like I did in the examples 00:38:09.260 |
And then I have the network predict, with its whole bidirectional context-- 00:38:15.340 |
so now this vector representation of the blank 00:38:22.340 |
sees the words around it-- and then I predict the word "went," and then here, the word "store." 00:38:34.460 |
And you can see how this is doing something quite a bit like language 00:38:41.180 |
I've removed the network's information about the words that go in the blanks, 00:38:49.620 |
I only ask it to actually do the prediction, compute the loss, 00:38:52.780 |
backpropagate the gradients for the words that I've masked out. 00:38:56.580 |
And you can think of this as, instead of learning the probability of x, 00:39:03.140 |
this is learning the probability of x, the real document, 00:39:06.300 |
given x tilde, which is this sort of corrupted, masked document: p(x | x tilde). 00:39:17.780 |
So I get a sequence of hidden states, one per word, which is the output of my encoder in blue. 00:39:21.380 |
And then I'd say that for the words that I want to predict, y_i, I draw them 00:39:25.700 |
from a distribution proportional to my embedding matrix times that hidden state -- that's what the '~' means. 00:39:36.500 |
So it's just a linear transformation of that last hidden state here. 00:39:41.860 |
And I do the prediction, and I train the entire network to do this. 00:39:47.020 |
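A minimal sketch of that masked prediction objective in PyTorch (random tokens stand in for real text; the 15% rate, the mask id, and the model sizes are just illustrative choices):

```python
import torch
import torch.nn as nn

vocab_size, d_model, seq_len, batch = 1000, 64, 12, 4

# Stand-in bidirectional encoder: embeddings plus a transformer encoder (no causal mask).
embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
to_vocab = nn.Linear(d_model, vocab_size)   # linear map from each hidden state to the vocabulary

x = torch.randint(1, vocab_size, (batch, seq_len))           # the real document x
mask = torch.rand(batch, seq_len) < 0.15                      # positions we will hide
MASK_ID = 0                                                   # assume index 0 is the [MASK] token
x_tilde = x.masked_fill(mask, MASK_ID)                        # the corrupted document x~

h = encoder(embed(x_tilde))                                   # one hidden state per input word
logits = to_vocab(h)                                          # scores over the vocabulary

# Loss only on the masked positions: this is learning p(x | x~).
loss = nn.functional.cross_entropy(logits[mask], x[mask])
loss.backward()
```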
So the words that we mask out, do we just select them randomly, 00:39:54.260 |
The question is, do we just choose words randomly to mask out, 00:39:59.380 |
We'll talk about a slightly smarter scheme in a couple of slides, 00:40:07.020 |
What was that last part on the bottom, x, the masked version of-- 00:40:11.460 |
like, if it's the first or the very last sentence? 00:40:16.580 |
Yeah, so I'm saying that I'm defining x tilde to be this input part, where 00:40:23.100 |
I've got the masked version of the sentence with these words missing. 00:40:26.820 |
And then I'm defining a probability distribution 00:40:29.060 |
that's the probability of a sequence conditioned 00:40:32.340 |
on the input being the corrupted sequence, the masked sequence. 00:40:35.940 |
So this brings us to a very, very popular NLP model 00:40:49.940 |
And it was the first one to popularize this masked language modeling 00:40:55.300 |
And they released the weights of this pre-trained transformer 00:40:58.420 |
that they pre-trained via something that looks a lot like masked language 00:41:03.780 |
You can use them via code that's released by the company HuggingFace 00:41:10.300 |
Many of you will use a model like BERT in your final project 00:41:13.700 |
because it's such a useful builder of representations 00:41:27.020 |
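For example, loading a pre-trained BERT through the HuggingFace transformers library and pulling out contextual representations looks roughly like this (assuming the `transformers` and `torch` packages are installed; "bert-base-uncased" is the standard released checkpoint):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# The tokenizer applies the subword vocabulary discussed at the start of the lecture.
inputs = tokenizer("I record the record", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per subword token; the two occurrences of "record"
# get different vectors because the encoder sees their different contexts.
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, 6, 768])
```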
So remember, all of our inputs now are subword tokens. 00:41:32.500 |
But just like we saw at the very beginning of class, 00:41:34.620 |
each of these tokens could just be some portion, some subword. 00:41:38.940 |
And I'm going to do a couple of things with it. 00:41:40.900 |
Sometimes I am going to just mask out the word and have it predict what was there. 00:41:48.220 |
Sometimes I'm going to replace the word with some random sample 00:41:53.260 |
of another word from my vocabulary and predict the original word. 00:41:58.780 |
And sometimes I'm going to not change the word at all, but still predict it. 00:42:11.820 |
Because if I only ever build useful representations in the middle of this network for words that are masked out, 00:42:15.940 |
then when I actually use the model at test time 00:42:19.220 |
on some real review to do sentiment analysis on, 00:42:22.820 |
well, there are never going to be any mask tokens in it. 00:42:27.300 |
And the network might not have done anything useful for the real words, because it's like, oh, I have no job to do here, 00:42:29.780 |
because I only need to deal with the mask tokens. 00:42:33.540 |
By giving it sequences of words where sometimes the word is masked, 00:42:38.420 |
sometimes you have to detect if the word is wrong, and sometimes it's left alone, 00:42:48.660 |
the network has to build good representations of all the words in context, because it has this chance 00:42:51.660 |
that it could be asked to predict anything at any time. 00:42:54.120 |
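To make that corruption scheme concrete, here is a small sketch; the 80/10/10 split of the selected positions is the one from the BERT paper, while the token ids and shapes are made up:

```python
import torch

def bert_corrupt(tokens, vocab_size, mask_id, p_select=0.15):
    # Pick ~15% of positions to predict; of those, 80% become [MASK],
    # 10% become a random other word, and 10% are left unchanged --
    # but the model must predict the original word at every selected position.
    tokens = tokens.clone()
    selected = torch.rand(tokens.shape) < p_select
    r = torch.rand(tokens.shape)
    tokens[selected & (r < 0.8)] = mask_id
    swap = selected & (r >= 0.8) & (r < 0.9)
    tokens[swap] = torch.randint(0, vocab_size, tokens.shape)[swap]
    return tokens, selected   # corrupted inputs + which positions to compute the loss on

inputs = torch.randint(5, 1000, (2, 16))             # toy batch of token ids
corrupted, predict_here = bert_corrupt(inputs, vocab_size=1000, mask_id=4)
```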
OK, so the folks at Google who were defining this 00:43:03.980 |
had a separate additional task that is sort of interesting 00:43:10.780 |
So this was their BERT model from their paper. 00:43:14.760 |
just like we saw from our transformers lecture, 00:43:18.180 |
token embeddings just like we saw from the transformers 00:43:21.620 |
But then also they had this thing called a segment embedding 00:43:23.980 |
where they had two possible segments, segment A 00:43:26.380 |
and segment B. And they had this additional task 00:43:31.820 |
where they would get a big chunk of text for segment A 00:43:38.780 |
is segment B a real continuation of segment A? 00:43:49.660 |
And the idea was that this should teach the network 00:43:55.460 |
about the connection between a bunch of text over here 00:44:12.060 |
that we're trying to come up with hard problems 00:44:14.100 |
for the network to solve such that by solving them, 00:44:21.580 |
by making simple transformations or removing information 00:44:33.080 |
The plus signs, do we concatenate the vectors, 00:44:40.020 |
do we concatenate the vectors or do element-wise addition? 00:44:58.420 |
So just saying everything's the same dimension 00:45:00.300 |
and then doing addition just ends up being simpler. 00:45:03.980 |
So why was the next sentence prediction not necessary? 00:45:11.060 |
Yeah, why was the next sentence prediction not necessary? 00:45:16.460 |
is that now the effective context length for a lot 00:45:26.580 |
So one of the things that's useful about pre-training 00:45:28.820 |
seemingly is that you get to build representations 00:45:35.460 |
segment A was going to be something like 250 words, 00:45:42.060 |
And in the paper that let us know that this wasn't necessary, 00:45:50.740 |
this very long context because longer contexts help give you 00:45:55.380 |
more information about the role that each word is playing 00:46:06.900 |
it's much clearer what its role in that context is. 00:46:09.540 |
So yeah, it cuts the effective context size is one answer. 00:46:13.600 |
Another thing is that this is actually much more difficult. 00:46:23.260 |
But it's been shown since then that these models are really, 00:46:25.760 |
really bad at the next sentence prediction task. 00:46:34.860 |
And so it just wasn't useful because the model 00:46:43.100 |
Can you explain again why we need to do a next sentence 00:46:46.500 |
What about just masking and predicting the next? 00:46:57.380 |
You seem to not need to do next sentence prediction. 00:47:07.420 |
you to develop this pairwise, do these two segments of text 00:47:16.300 |
And many NLP tasks are defined on pairs of things. 00:47:33.060 |
There are intuitions as to why it could work. 00:47:42.100 |
so BERT was doing both this next sentence prediction training 00:47:46.540 |
as well as this masking training all at the same time. 00:47:52.220 |
And so you had to have a separate predictor head 00:48:12.420 |
that was going to say, is the next sentence real or not? 00:48:22.500 |
that we had earlier about how do you evaluate these things. 00:48:25.540 |
There's a lot of different NLP tasks out there. 00:48:32.140 |
they would look at a ton of different evaluations 00:48:34.580 |
that had been sort of compiled as a set of things that 00:48:38.860 |
So are you detecting paraphrases between questions? 00:48:41.900 |
Are two Quora questions actually the same question? 00:48:47.500 |
Can you do sentiment analysis on this hard data set? 00:48:51.540 |
Can you tell if sentences are linguistically acceptable? 00:48:59.020 |
Do they mean sort of vaguely the similar thing? 00:49:01.900 |
And we'll talk a bit about natural language inference 00:49:04.100 |
later, but that's the task of deciding, sort of, if I say, 00:49:08.240 |
you know, I saw the dog, that does not necessarily mean I saw the little dog. 00:49:14.440 |
But saying I saw the little dog does mean I saw the dog. 00:49:18.000 |
So that's sort of this natural language inference task. 00:49:20.560 |
And the difference between sort of pre-pre-training days, 00:49:29.320 |
before you had substantial amounts of pre-training 00:49:33.600 |
and BERT came out, the field was taken aback in a way that's hard to overstate. 00:49:39.420 |
You know, very carefully crafted architectures 00:49:45.660 |
and doing things that they thought were sort of clever as 00:49:48.000 |
to how to define all the connections and the weights 00:49:50.300 |
and whatever to do their tasks independently. 00:49:52.400 |
So everyone was doing a different thing for each one 00:49:59.360 |
by just build a big transformer and just teach it 00:50:04.160 |
and then fine tune it on each of these tasks. 00:50:11.920 |
It's a little bit less flashy than ChatGPT, I'll admit. 00:50:14.760 |
But it's really part of the story that gets us to it, 00:50:31.680 |
encoder usually outputs some sort of hidden values. 00:50:48.320 |
How do we actually correlate those values to stuff 00:50:52.640 |
I'm going to go on to the next slide here to bring up 00:50:56.120 |
So the encoder gives us, for each input word token, 00:51:04.360 |
And the question is, how do we get these representations 00:51:07.520 |
and turn them into sort of answers for the tasks 00:51:13.080 |
And the answer comes back to something like this. 00:51:41.040 |
we had the transformer that was giving us our representations. 00:51:49.840 |
that moved us from the encoder's hidden state size 00:51:55.000 |
And we just removed this last prediction layer here. 00:51:58.280 |
And let's say we want to do something that is classifying 00:52:04.600 |
We just pick arbitrarily maybe the last word in the sentence. 00:52:17.160 |
So yeah, the BERT release had two different model sizes, BERT base and BERT large. 00:52:24.840 |
Keep that sort of in the back of your head sort of percolating 00:52:27.480 |
as we talk about models with many, many more parameters 00:52:40.000 |
maybe 25 million words, but on the order of less than a billion 00:52:48.220 |
And it was trained on what was considered at the time 00:52:55.720 |
And we were like, oh, who has that kind of compute? 00:53:01.600 |
But fine tuning is practical and common on a single GPU. 00:53:04.720 |
So you could take the BERT model that they've spent a lot of time 00:53:07.560 |
training and fine tune it yourself on your task 00:53:17.820 |
So one question is like, well, this seems really great. 00:53:35.040 |
What's the structure of the pre-trained model good for? 00:53:38.520 |
BERT is really good for sort of filling in the blanks. 00:53:48.960 |
But you wouldn't want to use it to generate, say, a summary of something, because it's not really built for it. 00:53:53.080 |
It doesn't have a natural notion of predicting the next word given only the words before it. 00:53:58.280 |
So maybe I want to use BERT if I want a good representation of, say, 00:54:01.840 |
a document to classify it, give it a set of topic labels, 00:54:07.960 |
But I wouldn't want to use it to generate a whole sequence. 00:54:15.040 |
So we had a question earlier of whether you just mask out words at random. 00:54:18.920 |
One thing that seems to work better is you mask out 00:54:23.480 |
sort of whole contiguous spans, because if you only mask out one subword of a longer word, the problem 00:54:30.480 |
is much easier than it would otherwise be, 00:54:37.120 |
and you can tell very easily based on the sort of subwords that came before it. 00:54:41.160 |
Whereas if I have a much longer sequence, it's a trade-off. 00:54:47.840 |
And it ends up being better to do this sort of span-based masking 00:54:52.600 |
And that might be because subwords make very simple prediction problems when 00:54:56.680 |
you mask out just one subword of a word versus all the subwords of a word. 00:55:05.360 |
There's also a paper called the Roberta paper, 00:55:07.360 |
which showed that the next sentence prediction wasn't necessary. 00:55:16.840 |
So Roberta is a drop-in replacement for BERT. 00:55:19.560 |
So if you're thinking of using BERT, just use Roberta. 00:55:24.280 |
don't know a whole lot about the best practices for training these things. 00:55:27.400 |
You sort of train it for as long as you're willing to. 00:55:33.600 |
So this is very-- but it's very difficult to do sort of iteration 00:55:42.520 |
Another thing that you should know for your final projects in the world ahead 00:55:45.960 |
is this notion of fine-tuning all parameters of the network 00:55:51.200 |
So what we've talked about so far is you pre-train all the parameters 00:55:59.480 |
An alternative, which is called parameter-efficient or lightweight fine-tuning, 00:56:06.520 |
is where you choose a very smart way of keeping most of the parameters fixed and only training a few of them. 00:56:11.480 |
And the intuition is that these pre-trained parameters were really good. 00:56:16.600 |
And you want to make the minimal change from the pre-trained model 00:56:20.080 |
to the model that does what you want so that you 00:56:22.120 |
keep some of the generality, some of the goodness of the pre-training. 00:56:26.280 |
So one way that this is done is called prefix tuning; 00:56:29.280 |
prompt tuning is very similar. Here, you actually freeze the entire pre-trained network. 00:56:36.920 |
And I never change any of the parameter values. 00:56:39.720 |
Instead, I make a bunch of fake sort of pseudo word vectors 00:56:44.360 |
that I prepend to the very beginning of the sequence. 00:56:50.800 |
It's like these would have been like inputs to the network, 00:56:55.340 |
And I'm training everything to do my sentiment analysis task 00:56:58.640 |
just by changing the values of these sort of fake words. 00:57:03.120 |
And this is nice because I get to keep all the good pre-trained parameters 00:57:08.960 |
and then just specify the sort of diff that ends up generalizing better. 00:57:17.520 |
But this is also cheaper because I don't have to compute the gradients, 00:57:21.480 |
or I don't have to store the gradients and all the optimizer state. 00:57:25.240 |
With respect to all these parameters, I'm only training the fake word vectors. 00:57:33.340 |
Could you put the fake parameters at the end, as in here? 00:57:38.180 |
It doesn't make any difference if you put these 00:57:41.380 |
In a decoder, you have to put them at the beginning 00:57:43.540 |
because otherwise you don't see them before you process the whole sequence. 00:57:49.280 |
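Here is a rough sketch of the idea in PyTorch, simplified to learned pseudo-token embeddings prepended at the input layer (closer to prompt tuning than full prefix tuning, which also inserts learned vectors at every layer); the "pre-trained" encoder here is a random stand-in, and all sizes are illustrative:

```python
import torch
import torch.nn as nn

d_model, n_prefix, vocab = 256, 10, 1000

# Stand-in "pre-trained" embedding table and encoder, frozen below.
embed = nn.Embedding(vocab, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
for p in list(embed.parameters()) + list(encoder.parameters()):
    p.requires_grad = False                      # keep all pre-trained parameters fixed

# The only trainable parameters: a handful of "fake word" vectors plus a small task head.
prefix = nn.Parameter(torch.randn(n_prefix, d_model) * 0.02)
head = nn.Linear(d_model, 2)                     # e.g. positive / negative sentiment

def forward(token_ids):                          # token_ids: (batch, seq_len)
    x = embed(token_ids)                         # (batch, seq_len, d_model)
    pre = prefix.unsqueeze(0).expand(x.size(0), -1, -1)
    h = encoder(torch.cat([pre, x], dim=1))      # run the frozen encoder over prefix + input
    return head(h[:, 0])                         # classify from the first position

opt = torch.optim.Adam([prefix] + list(head.parameters()), lr=1e-3)
logits = forward(torch.randint(0, vocab, (8, 20)))
loss = nn.functional.cross_entropy(logits, torch.randint(0, 2, (8,)))
loss.backward()
opt.step()                                       # only the prefix and the head get updated
```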
Can we just attach a few layers and only train the new layers? 00:57:53.500 |
The question is, can we just attach new layers at the top of this and only train those? 00:58:00.540 |
Another thing that works well-- sorry, we're running out of time-- is called low-rank adaptation. 00:58:06.780 |
So I have a bunch of weight matrices in my transformer. 00:58:09.700 |
And I freeze each weight matrix and learn a very low-rank little diff. 00:58:15.420 |
And I set the weight matrix's value to be sort of the original value 00:58:19.660 |
plus my sort of very low rank diff from the original one. 00:58:24.900 |
And this ends up being a very similarly useful technique. 00:58:29.620 |
And the overall idea here is that, again, I'm 00:58:31.700 |
learning way fewer parameters than I did via pre-training, and freezing the rest. 00:58:41.140 |
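A minimal sketch of that low-rank idea, along the lines of LoRA (the rank, the sizes, and the stand-in "pre-trained" layer are all illustrative):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained weight W plus a trainable low-rank update B @ A."""
    def __init__(self, linear: nn.Linear, rank: int = 8):
        super().__init__()
        self.weight = linear.weight                       # pre-trained W, frozen
        self.bias = linear.bias
        self.weight.requires_grad = False
        if self.bias is not None:
            self.bias.requires_grad = False
        d_out, d_in = self.weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)   # low-rank factors
        self.B = nn.Parameter(torch.zeros(d_out, rank))          # starts at zero: no change at first

    def forward(self, x):
        # Effective weight is W + B A; only A and B receive gradients.
        return nn.functional.linear(x, self.weight + self.B @ self.A, self.bias)

pretrained = nn.Linear(768, 768)     # stand-in for one weight matrix in the transformer
adapted = LoRALinear(pretrained, rank=8)
out = adapted(torch.randn(4, 768))
print(sum(p.numel() for p in adapted.parameters() if p.requires_grad))  # far fewer than 768*768
```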
So for encoder-decoders, we could do something like language modeling. 00:58:45.300 |
I've got my input sequence here, encoder, output sequence here. 00:58:49.700 |
And I could say this part is my prefix for the encoder, 00:58:58.220 |
and the words I'm predicting are sort of in the latter half of the sequence. 00:59:07.140 |
You sort of take a long text, split it into two, 00:59:09.700 |
give half of it to the encoder, and then generate the second half with the decoder. 00:59:13.420 |
But in practice, what works much better is this notion of span corruption. 00:59:20.300 |
Span corruption is going to show up in your assignment 5. 00:59:23.260 |
And the idea here is a lot like BERT, but in a sort of generative sense, 00:59:30.580 |
where I'm going to mask out a bunch of words in the input. 00:59:33.500 |
Thank you, mask token 1, me to your party, mask token 2, week. 00:59:40.860 |
And then at the output, I generate the mask token 00:59:44.660 |
and then what was supposed to be there, but the mask token replaced it. 00:59:48.580 |
So thank you, then predict for inviting at the output, then the second mask token, then last, and so on. 00:59:54.860 |
And what this does is that it allows you to have bidirectional context. 01:00:00.900 |
I get to see the whole sequence, except I can generate 01:00:07.100 |
So this feels a little bit like you mask out parts of the input, 01:00:10.020 |
but you actually generate the output as a sequence 01:00:14.860 |
So this might be good for something like machine translation, 01:00:17.420 |
where I have an input that I want bidirectional context in, 01:00:24.380 |
So this was shown to work better than language modeling at the scales 01:00:27.780 |
that these folks at Google were able to test back in 2018. 01:00:43.780 |
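A toy sketch of how such (input, target) pairs can be built; the sentinel naming follows T5's `<extra_id_N>` convention, while the masking rate and span length here are arbitrary:

```python
import random

def span_corrupt(tokens, mask_prob=0.3, mean_span=2):
    # Replace a few contiguous spans of the input with sentinel tokens; the target
    # reproduces each sentinel followed by the tokens that were removed.
    inp, tgt, i, sentinel = [], [], 0, 0
    while i < len(tokens):
        if random.random() < mask_prob / mean_span:
            span = tokens[i:i + mean_span]
            inp.append(f"<extra_id_{sentinel}>")
            tgt.append(f"<extra_id_{sentinel}>")
            tgt.extend(span)
            sentinel += 1
            i += len(span)
        else:
            inp.append(tokens[i])
            i += 1
    return inp, tgt

words = "thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(words)
print(" ".join(inp))   # varies run to run, e.g. "thank you <extra_id_0> me to your party <extra_id_1> week"
print(" ".join(tgt))   # e.g. "<extra_id_0> for inviting <extra_id_1> last"
```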
There's a fascinating property of these models also. 01:00:46.540 |
So T5 was the model that was originally introduced 01:00:55.820 |
you saw a bunch of things like Franklin D. Roosevelt was born in blank, 01:01:02.580 |
And there's this task called open domain question 01:01:07.620 |
answering, which has a bunch of trivia questions, 01:01:12.820 |
And then you're supposed to generate out the answer as a string, just 01:01:20.580 |
And then you're supposed to generate these answers. 01:01:22.900 |
And what's fascinating is that this salient span masking method 01:01:36.820 |
And then when you tested on new trivia questions, 01:01:40.260 |
the model would implicitly extract from its pre-training data 01:01:44.540 |
somehow the answer to that new question that it never 01:01:49.700 |
So it learned this sort of implicit retrieval-- sometimes, 01:01:53.060 |
sometimes, less than 50% of the time or whatever, 01:02:01.580 |
So you've learned to access this latent knowledge 01:02:07.380 |
And so you just pass it the text, when was Roosevelt born, 01:02:13.020 |
And one thing to know is that the answers always look very fluent, even when they're wrong. 01:02:19.980 |
And that's still true of things like ChatGPT. 01:02:35.980 |
So I get a sequence of hidden states from my decoder. 01:02:38.940 |
The model-- the words can only look at themselves and the words before them, not the future. 01:02:43.220 |
And then I predict the next word in the sentence. 01:02:48.700 |
to do sentiment analysis, maybe take the last state 01:02:50.900 |
for the last word, and then predict happy or sad 01:02:56.340 |
Back-propagate the gradients of the whole network, 01:02:58.420 |
train the whole thing, or do some kind of lightweight 01:03:07.940 |
And I can just pre-train it on language modeling. 01:03:13.460 |
So again, you might want to do this if you are wanting to generate texts, 01:03:22.220 |
You sort of can use this like you use an encoder-decoder. 01:03:25.700 |
But in practice, as we'll see, a lot of the sort of biggest, 01:03:29.580 |
most powerful pre-trained models tend to be decoder-only. 01:03:33.740 |
It's not really clear exactly why, except they 01:03:36.780 |
seem a little bit simpler than encoder-decoders. 01:03:41.140 |
And you get to share all the parameters in one big network for the decoder, 01:03:45.060 |
whereas in an encoder-decoder, you have to split them, 01:03:47.820 |
sort of some into the encoder, some into the decoder. 01:03:50.620 |
So for the rest of this lecture, we'll talk only about decoders. 01:03:55.500 |
In modern things, the biggest networks do tend to be decoders. 01:04:00.780 |
So we're coming all the way back again to 2018. 01:04:03.740 |
And the GPT model from OpenAI was a big success. 01:04:16.660 |
And it had this vocabulary that was 40,000-ish words that 01:04:23.180 |
was defined via a method like what we showed at the beginning of class, 01:04:28.620 |
And actually, the name GPT never actually showed up in the original paper. 01:04:32.860 |
It's unclear what exactly it's supposed to refer to. 01:04:39.180 |
But this model was a precursor to all the things 01:04:55.820 |
So if we wanted to do something like natural language inference, which 01:04:59.900 |
says, take these pairs of sentences-- the man is in the doorway, 01:05:05.460 |
and say that these mean that one entails the other, 01:05:09.100 |
the premise entails the hypothesis, that I can believe the hypothesis 01:05:12.900 |
if I believe the premise, I'd just concatenate them together. 01:05:16.780 |
So give it maybe a start token, pass in one sentence, 01:05:21.180 |
pass in some delimiter token, pass in the other, 01:05:23.920 |
and then predict yes, no, entailment, not entailment, fine tuning. 01:05:57.840 |
And given some sort of prompt, it could generate, at the time, 01:06:01.800 |
a quite surprisingly coherent continuation to the prompt. 01:06:04.680 |
So it's telling this sort of story about scientists and unicorns here. 01:06:11.480 |
And this size of model is still sort of small enough 01:06:15.280 |
that you can use on a small GPU and fine tune and whatever. 01:06:19.620 |
And its capabilities of generating long, coherent text 01:06:28.140 |
It was also trained on more data, although I don't-- 01:06:42.280 |
And then we come with a different way of interacting with the models. 01:06:45.620 |
So we've interacted with pre-trained models in two ways so far. 01:06:49.060 |
We've sort of sampled from the distribution that they define. 01:06:53.180 |
We generated text via a machine translation system or whatever. 01:06:57.380 |
Or we fine-tuned them on a task that we care about. 01:07:03.620 |
But GPT-3 seems to have an interesting new ability. 01:07:11.580 |
And it can do some tasks without any sort of fine-tuning whatsoever. 01:07:20.060 |
So we went from GPT, 100-ish million parameters, 01:07:23.500 |
GPT-2, 1.5 billion, GPT-3, 175 billion, much larger, 01:07:32.100 |
And this sort of notion of in-context learning, 01:07:34.500 |
that it could define or figure out patterns in the training 01:07:40.140 |
and continue the pattern, is called in-context learning. 01:07:46.440 |
And I pass in this little arrow and say, OK, thanks goes to merci. 01:07:57.300 |
And it's learned to sort of continue the pattern 01:08:01.020 |
and say that this is the translation of otter. 01:08:04.660 |
So now, remember, this is a single sort of input that I've given to my model. 01:08:09.860 |
And I haven't said, oh, do translation or fine-tune it on translation 01:08:14.380 |
I've just passed in the input, given it some examples. 01:08:16.980 |
And then it is able to, to some extent, do this seemingly complex task. 01:08:29.820 |
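For concreteness, the kind of single input being described looks roughly like this; only the "thanks => merci" pair and the trailing "otter =>" come from the talk, and the middle pair is illustrative filler:

```python
# One input string handed to the model -- no fine-tuning involved.
prompt = (
    "thanks => merci\n"
    "hello => bonjour\n"
    "otter =>"
)
# The model's continuation of this one sequence is expected to be the French
# word for otter ("loutre"), i.e. it completes the pattern purely in context.
```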
And then it can do some simple addition afterward. 01:08:33.900 |
You give it-- in this case, this is sort of rewriting typos. 01:08:36.900 |
It can figure out how to rewrite typos in context 01:08:41.820 |
And this was the start of this idea that there 01:08:43.860 |
were these emergent properties that showed up in much larger models. 01:08:47.940 |
And it wasn't clear, when looking at the smaller models, 01:08:51.020 |
that you'd get this sort of new, this qualitatively new behavior out of them. 01:08:57.780 |
Like, it's not obvious from just the language modeling signal, right? 01:09:01.140 |
GPT-3 is just trained on that decoder-only objective, just predict the next word, and yet it can 01:09:09.740 |
learn to perform seemingly quite complex things 01:09:19.540 |
This should be quite surprising, I think, right? 01:09:29.060 |
So far, we've talked about good representations, 01:09:31.900 |
contextual representations, meanings of words in context. 01:09:35.060 |
This is some very, very high-level pattern matching, right? 01:09:37.500 |
It's coming up with patterns in just the input data. 01:09:40.660 |
And that one sequence of text that you've passed it so far, 01:09:43.660 |
and it's able to sort of identify how to complete the pattern. 01:09:48.180 |
And you think, what kinds of things can this solve? 01:09:56.100 |
Sort of, what are the kinds of problems that you maybe 01:10:00.020 |
Maybe GPT-3 saw a ton of pairs of words, right? 01:10:03.780 |
It saw a bunch of dictionaries, bilingual dictionaries 01:10:11.420 |
where it's really learning the task in context? 01:10:18.740 |
It seems like it has to be tied to your training data in ways 01:10:26.180 |
to learn new sort of, at least, types of patterns 01:10:31.580 |
So this is a very interesting thing to work on. 01:10:34.740 |
Now, we've talked a lot about the size of these models so far. 01:10:43.220 |
So GPT-3 was trained on 300 billion words of text. 01:10:56.140 |
And it's very unclear whether you're getting the best 01:11:00.640 |
bang for your buck with what people have been doing in terms of the number of parameters. 01:11:06.180 |
The cost of training one of these is roughly the number of parameters 01:11:09.740 |
times the amount of data that you're going to train it on, the number of words. 01:11:16.240 |
Some folks at DeepMind realized through some experimentation 01:11:20.980 |
that actually GPT-3 was just comically oversized. 01:11:34.640 |
And this is an interesting trade-off about how do you 01:11:39.120 |
I mean, you can't do this more than a handful of times, 01:11:48.280 |
Another way of interacting with these networks 01:11:51.320 |
that has come out recently is called chain of thought. 01:11:56.120 |
So the prefix, we saw in the in-context learning slide 01:12:00.200 |
that the prefix can help specify what task you're trying to solve. 01:12:07.680 |
We have a prefix of examples of questions and answers. 01:12:11.440 |
So you have a question and then an example answer. 01:12:14.800 |
So that's your prompt that's specifying the task. 01:12:18.800 |
And you're having the model generate an answer. 01:12:26.560 |
how about in the example, in the demonstration we give, 01:12:30.600 |
And then we give this sort of decomposition of steps 01:12:36.180 |
So I'm actually writing this out as part of the input. 01:12:49.480 |
And the model says, oh, I know what I'm supposed to do. 01:12:51.760 |
I'm supposed to first generate a sequence of steps, 01:13:06.440 |
and it turns out that the model will tend to generate a plausible sequence of steps 01:13:14.880 |
and get the answer right more often, relative to trying to generate the answer by itself. 01:13:28.920 |
As I generate each word of this continuation here, 01:13:33.040 |
I'm able to condition on all the past words so far. 01:13:40.280 |
to sort of decompose the problem into smaller, simpler 01:13:43.160 |
problems, which it's more able to solve each. 01:13:47.800 |
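A chain-of-thought style prompt looks roughly like this; the wording is adapted from the canonical example in the chain-of-thought paper rather than taken from the slide:

```python
prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. "
    "The answer is 11.\n\n"
    "Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. "
    "How many apples do they have?\n"
    "A:"
)
# Because the demonstration includes the reasoning steps, the model imitates that
# format: it generates its own intermediate steps (23 - 20 = 3, 3 + 6 = 9)
# before giving the final answer, 9.
```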
No one's really sure why this works exactly either. 01:13:51.240 |
At this point, with networks that are this large, 01:13:54.120 |
their emergent properties are both very powerful 01:14:03.440 |
Because it's unclear what its capabilities are 01:14:05.560 |
and what its limitations are, where it will fail. 01:14:09.200 |
So what do we think pre-training is teaching? 01:14:14.520 |
beyond what I've written in this slide, which 01:14:19.600 |
So it can teach you trivia, and syntax, and coreference, 01:14:22.480 |
and maybe some lexical semantics, and sentiment, 01:14:27.360 |
than we would have thought even three years ago. 01:14:47.440 |
are not just studied for the sake of using them, 01:14:51.040 |
but studied for the sake of understanding anything 01:14:59.440 |
Has anyone tried benchmarking GPT for programming tasks, 01:15:13.920 |
The question is, has anyone tried benchmarking 01:15:23.120 |
of people using GPT-3 for simple programming things. 01:15:30.920 |
competitive programming bots are all based on ideas 01:15:36.600 |
And I think they're all also based on pre-trained language models. 01:15:50.920 |
And so yeah, I think all of the best systems use this, 01:15:58.840 |
Is that the basis for what GitHub Copilot's trying to do? 01:16:04.120 |
Is what we just mentioned the basis for the GitHub Copilot 01:16:10.320 |
We don't know exactly what it is in terms of details, 01:16:16.080 |
What if you have a situation where you still have 01:16:21.000 |
your pre-training data, and then you also have a large amount of data 01:16:24.880 |
for your task? At what point is it better to train a new model 01:16:27.760 |
just for that fine-tuning task versus use data from both? 01:16:35.560 |
When is it better to do a separate training on just 01:16:43.240 |
If you have a bunch of data for the task that you care about, 01:16:48.400 |
what's frequently done instead is three-part training, 01:16:54.720 |
Then you continue to pre-train using something 01:16:57.320 |
like language modeling on an unlabeled version of the labeled data set. 01:17:03.280 |
You just strip the labels off and just treat it all as text 01:17:09.320 |
and then do the final stage of fine-tuning with the labels 01:17:14.240 |
There's an interesting paper called Don't Stop Pre-Training. 01:17:32.800 |
if there's a lot of instances where a pre-trained model can 01:17:36.840 |
do some task that's not seen before even without fine-tuning? 01:17:41.160 |
Yeah, so are there any instances of where a pre-trained model 01:17:43.700 |
can do a task that it hasn't seen before without fine-tuning? 01:17:47.240 |
The question is, what does hasn't seen before mean? 01:17:50.960 |
These models, especially GPT-3 and similar very large models, 01:17:55.280 |
during pre-training, did it ever see something exactly 01:18:05.080 |
It's clearly able to recombine bits and pieces of tasks 01:18:13.040 |
Language modeling looks a lot like trivia sometimes, 01:18:15.520 |
where you just read the first paragraph of a Wikipedia page, 01:18:19.080 |
and it's kind of like answering a bunch of little trivia 01:18:21.360 |
questions about where someone was born and when. 01:18:24.400 |
But it's never seen something quite like this. 01:18:28.280 |
how much it's able to do things that don't seem like they 01:18:30.640 |
should have shown up all that directly in the pre-training 01:18:33.920 |
Quantifying that extent is an open research problem.