
Stanford CS224N NLP with Deep Learning | 2023 | Lecture 9 - Pretraining


Whisper Transcript | Transcript Only Page

00:00:00.000 | [AUDIO OUT]
00:00:05.480 | Hello.
00:00:06.240 | Welcome to CS224N.
00:00:09.520 | Today we'll be talking about pre-training,
00:00:12.440 | which is another exciting topic on the road to modern natural language
00:00:18.120 | processing.
00:00:21.760 | How is everyone doing?
00:00:23.520 | Thumbs up, thumbs to the side, thumbs down.
00:00:28.240 | No response bias there.
00:00:29.880 | All thumbs up.
00:00:31.200 | Oh, side.
00:00:31.760 | Nice.
00:00:32.000 | I like that honesty.
00:00:32.840 | That's good.
00:00:33.360 | Well, OK.
00:00:35.800 | So we're now-- what is this, week five?
00:00:39.360 | Yes, it's week five.
00:00:40.640 | And we have a couple--
00:00:42.880 | so this lecture, the Transformers lecture, and then to a lesser extent,
00:00:47.800 | Thursday's lecture on natural language generation
00:00:51.680 | will be sort of the sum of lectures for the assignments you have to do.
00:00:56.280 | So assignment five is coming out on Thursday.
00:01:01.480 | And the topics covered in this lecture, and self-attention and transformers,
00:01:06.640 | and again, a little bit of natural language generation
00:01:09.000 | will be tested in assignment five.
00:01:10.440 | And then the rest of the course will go through some really fascinating topics
00:01:14.600 | in sort of modern natural language processing
00:01:17.640 | that should be useful for your final projects, and future jobs,
00:01:20.920 | and interviews, and intellectual curiosity.
00:01:25.240 | But I think that today's lecture is significantly less technical in detail
00:01:31.800 | than last Thursday's on self-attention and transformers,
00:01:35.600 | but should give you an idea of the sort of world of pre-training
00:01:41.160 | and sort of how it helps define natural language processing today.
00:01:46.600 | So a reminder about assignment five, your project proposals
00:01:49.240 | also are due on Tuesday, next Tuesday.
00:01:53.760 | Please do get those in.
00:01:55.160 | Try to get them in on time so that we can give you prompt feedback
00:01:58.860 | about your project proposals.
00:02:01.360 | And yeah, so let's jump into it.
00:02:03.440 | OK, so what we're going to start with today
00:02:09.240 | is a bit of a technical detail on word structure
00:02:16.160 | and sort of how we model the input sequence of words that we get.
00:02:19.800 | So when we were teaching Word2Vec and sort of all the methods
00:02:26.360 | that we've talked about so far, we assumed a finite vocabulary.
00:02:29.840 | So you had a vocabulary v that you define via whatever.
00:02:32.560 | You've looked at some data.
00:02:33.680 | You've decided what the words are in that data.
00:02:36.640 | And so you have some words like hat and learn.
00:02:42.720 | And you have this embedding.
00:02:44.680 | It's in red because you've learned it properly.
00:02:46.920 | Actually, let's replace hat and learn with pizza and tasty.
00:02:49.360 | Those are better.
00:02:51.760 | And so that's all well and good.
00:02:53.760 | You see these words in your model.
00:02:56.440 | And you have an embedding that's been learned on your data
00:03:00.800 | to sort of know what to do when you see those words.
00:03:04.120 | But when you see some sort of variations,
00:03:06.020 | maybe you see a variation like taaaaasty, and maybe a typo like laern,
00:03:11.640 | or maybe novel items where it's like a word that you as a human
00:03:15.760 | can understand as sort of this combination.
00:03:18.160 | This is called derivational morphology: a word like transformerify, built from this word
00:03:22.240 | transformer that you know plus -ify, a suffix which means take this noun
00:03:26.800 | and give me back a verb.
00:03:29.240 | That means to make more like that noun.
00:03:31.160 | To transformerify NLP might mean to make NLP more
00:03:36.200 | like using transformers and such.
00:03:39.000 | And for each of these, this maybe didn't show up
00:03:41.200 | in your training corpus.
00:03:42.400 | And language is always doing this.
00:03:45.760 | People are always coming up with new words.
00:03:47.640 | And there's new domains.
00:03:48.960 | And young people are always making new words.
00:03:52.360 | It's great.
00:03:52.840 | And so it's a problem for your model,
00:03:54.640 | though, because you've defined this finite vocabulary.
00:03:57.440 | And there's sort of no mapping in that vocabulary
00:04:00.880 | for each of these things.
00:04:02.400 | Even though their meanings should be relatively well
00:04:05.280 | defined based on the data you've seen so far,
00:04:08.120 | it's just that the sort of string of characters that define them
00:04:11.560 | aren't quite what you've seen.
00:04:13.760 | And so what do you do?
00:04:14.640 | Well, maybe you map them to this sort of universal unknown token.
00:04:18.440 | This is UNK.
00:04:20.000 | So it's like, oh, I see something.
00:04:21.000 | I don't know what.
00:04:21.960 | I've never seen it before.
00:04:23.240 | I'm going to say it's always represented by the same token UNK.
00:04:26.840 | And so that's been done in the past.
00:04:29.120 | And that's sort of bad, right, because it's
00:04:30.960 | totally losing tons of information.
00:04:34.760 | But you need to map it to something.
00:04:38.640 | And so this is like a clear problem, especially--
00:04:42.480 | I mean, in English, it's a problem.
00:04:44.120 | In many of the world's languages, it's a substantially larger problem.
00:04:49.000 | So English has relatively simple word structure.
00:04:53.360 | There's a couple of conjugations for each verb, like eat, eats, eaten, ate.
00:05:00.360 | But in a language with much more complex morphology or word structure,
00:05:06.960 | you'll have a considerably more complex sort of set of things
00:05:11.040 | that you could see in the world.
00:05:12.360 | So here is a conjugation table for a Swahili verb.
00:05:17.560 | And it has over 300 conjugations.
00:05:20.840 | And if I define the vocabulary to be every unique string of characters
00:05:24.800 | maps to its own word, then every one of the 300 conjugations
00:05:28.400 | would get an independent vector under my model, which makes no sense,
00:05:33.280 | because the 300 conjugations obviously have a lot in common
00:05:37.200 | and differ by sort of meaningful extent.
00:05:39.680 | So you don't want to do this.
00:05:41.240 | You'd have to have a huge vocabulary if I wanted all conjugations to show up.
00:05:46.400 | And that's a mistake for efficiency reasons and for learning reasons.
00:05:51.200 | Any questions so far?
00:05:52.080 | Cool.
00:05:57.160 | And so what we end up doing is we'll look at subword structure,
00:06:05.440 | subword modeling.
00:06:06.640 | So what we're going to do is we're going to say,
00:06:08.680 | if I can try to define what the set of all words is,
00:06:12.640 | I'm going to define my vocabulary to include parts of words.
00:06:17.640 | So I'm going to split words into sequences of known subwords.
00:06:30.280 | And so there's a simple sort of algorithm for this,
00:06:33.200 | where you start with all characters.
00:06:35.480 | So if I only had a vocabulary of all characters,
00:06:38.240 | and maybe like an end of word symbol for a finite data set,
00:06:44.320 | then no matter what word I saw in the future,
00:06:46.480 | as long as I had seen all possible characters,
00:06:48.560 | I could take the word and say, I don't know what this word is.
00:06:51.100 | I'm going to split it into all of its individual characters.
00:06:53.960 | So you won't have this unk problem.
00:06:55.440 | You can sort of represent any word.
00:06:57.360 | And then you're going to find common adjacent characters and say, OK,
00:07:01.240 | A and B co-occur next to each other quite a bit.
00:07:03.920 | So I'm going to add a new word to my vocabulary.
00:07:07.120 | Now it's all characters plus this new word A, B, which is a subword.
00:07:13.440 | And likewise, so now I'm going to replace the character pair
00:07:16.040 | with the new subword and repeat until you add a lot, a lot, a lot of vocabulary
00:07:20.720 | items through this process of what things tend to co-occur next to each other.
00:07:24.520 | And so what you'll end up with is a vocabulary
00:07:28.480 | of very commonly co-occurring sort of substrings
00:07:31.600 | by which you can build up words.
00:07:33.540 | And this was originally developed for machine translation,
00:07:36.000 | but then it's been used considerably in pretty much all modern language models.
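
A minimal sketch of this vocabulary-learning loop (the toy corpus, the end-of-word marker, and the number of merges are illustrative assumptions, not the exact settings used in practice):

```python
from collections import Counter

def learn_bpe_vocab(corpus, num_merges):
    # Start from characters: each word is a tuple of symbols plus an end-of-word marker.
    words = Counter()
    for word in corpus.split():
        words[tuple(word) + ("</w>",)] += 1

    vocab = {ch for w in words for ch in w}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs, weighted by word frequency.
        pairs = Counter()
        for w, freq in words.items():
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        vocab.add("".join(best))           # the merged pair becomes a new subword
        # Rewrite every word with that pair merged into a single symbol, then repeat.
        new_words = Counter()
        for w, freq in words.items():
            merged, i = [], 0
            while i < len(w):
                if i < len(w) - 1 and (w[i], w[i + 1]) == best:
                    merged.append(w[i] + w[i + 1])
                    i += 2
                else:
                    merged.append(w[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return vocab, merges

vocab, merges = learn_bpe_vocab("tasty pizza tasty pizza hat learn learn", num_merges=10)
```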
00:07:41.200 | So now we have a hat and learn, hat and learn.
00:07:44.120 | So in our subword vocabulary, hat and learn
00:07:46.840 | showed up enough that they're their own individual words.
00:07:49.920 | So that's sort of good, right?
00:07:51.800 | So simple common words show up as a word in your vocabulary
00:07:56.560 | just like you'd like them to.
00:07:57.760 | But now taaaaasty maybe gets split into T-A-A.
00:08:01.200 | And then maybe in some cases, this hash hash
00:08:04.040 | means like don't add a space next, right?
00:08:07.160 | So T-A-A and then A-A-A and then S-T-Y, right?
00:08:12.160 | So I've actually taken one sort of thing that seems like a word,
00:08:15.200 | and in my vocabulary, it's now split into three subword tokens.
00:08:20.120 | So when I pass this to my transformer or to my recurrent neural network,
00:08:24.960 | the recurrent neural network would take T-A-A as just a single element,
00:08:29.960 | do the RNN update, and then take A-A-A, do the RNN update, and then S-T-Y.
00:08:35.200 | So it could learn to process constructions like this.
00:08:39.720 | And maybe I can even add more A-A-As in the middle,
00:08:41.920 | and have it do something similar.
00:08:44.080 | Instead of just seeing the entire word taaaaasty and not knowing what it means.
00:08:51.960 | Is that?
00:08:53.240 | That's feedback, yeah.
00:08:58.920 | How loud is that feedback?
00:09:01.320 | We good?
00:09:02.920 | OK, I think we're fixed.
00:09:04.200 | Great.
00:09:06.480 | And so same with transformerify.
00:09:08.040 | Maybe transformer is its own word.
00:09:10.080 | And then if, and then y--
00:09:11.240 | and so you can see that you have sort of three learned embeddings instead
00:09:14.760 | of one sort of useless UNK embedding.
00:09:17.760 | This is just wildly useful and is used pretty much everywhere.
00:09:21.280 | Variants of this algorithm are used pretty much everywhere in modern NLP.
00:09:26.480 | Questions?
00:09:28.640 | If we have three embeddings for tasty, do we just add them together?
00:09:32.840 | So the question is, if we have three embeddings for tasty,
00:09:35.220 | do we just add them together?
00:09:38.080 | If we want to represent--
00:09:39.920 | so when we're actually processing the sequence,
00:09:42.520 | I'd see something like I learned about the T-A-A, A-A-A, S-T-Y.
00:09:50.160 | So it'd actually be totally separate tokens.
00:09:52.480 | But if I wanted to then say, what's my representation of this thing?
00:09:57.520 | Depends on what you want to do.
00:09:58.800 | Sometimes you average the contextual representations of the three
00:10:02.960 | or look at the last one maybe.
00:10:06.400 | At that point, it's unclear what to do.
00:10:08.000 | But everything sort of works OK.
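
A sketch of those two pooling options for getting one vector back out of a word's subword pieces (the random vectors and the dimensionality here are stand-ins for real encoder outputs):

```python
import torch

# Contextual vectors for the three subword pieces of one word, e.g. taa, ##aaa, ##sty.
piece_vectors = torch.randn(3, 768)        # (num_pieces, hidden_size); sizes assumed
word_by_mean = piece_vectors.mean(dim=0)   # option 1: average the pieces
word_by_last = piece_vectors[-1]           # option 2: just take the last piece
```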
00:10:10.920 | How do you know where to split?
00:10:12.800 | How do you what?
00:10:13.520 | How do you know where to split?
00:10:15.200 | Yeah.
00:10:15.720 | So you know where to split based on the algorithm
00:10:18.720 | that I specified earlier for learning the vocabulary.
00:10:23.280 | So you learn this vocabulary by just combining
00:10:25.800 | commonly co-occurring adjacent strings of letters.
00:10:29.080 | So like A, B co-occurred a lot.
00:10:30.920 | So now I've got a new word that's A, B.
00:10:34.000 | And then when I'm actually walking through and tokenizing,
00:10:36.520 | I try to split as little as possible.
00:10:38.560 | So I split words into the maximal sort of subword
00:10:41.600 | that takes up the most characters.
00:10:43.060 | There are algorithms for this.
00:10:45.120 | Yeah, so I'm like, OK, if I want to split this up,
00:10:49.040 | there's many ways I could split it up.
00:10:50.580 | And you try to approximately find what the best way to split it
00:10:54.080 | into the fewest words is.
00:10:55.120 | Yeah.
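
A sketch of that greedy longest-match splitting, assuming a learned vocabulary where "##" marks a piece that continues the previous one:

```python
def tokenize(word, vocab):
    # Greedily take the longest subword in the vocabulary at each position;
    # fall back to a single character, so nothing ever becomes UNK.
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j] if i == 0 else "##" + word[i:j]
            if piece in vocab or j == i + 1:
                pieces.append(piece)
                i = j
                break
    return pieces

vocab = {"hat", "learn", "taa", "##aaa", "##sty", "transformer", "##ify"}
print(tokenize("taaaaasty", vocab))   # ['taa', '##aaa', '##sty']
```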
00:10:56.120 | Does it seem to make sense to use punctuation in the character set?
00:11:00.520 | So the question is, do people use punctuation in the character set?
00:11:04.600 | Do people do it?
00:11:05.240 | Yes, absolutely.
00:11:06.360 | So sort of from this point on, just assume
00:11:12.760 | that what text is given to these models is as unprocessed as possible.
00:11:17.680 | You try to make it sort of clean looking text, where you've removed HTML tags,
00:11:22.680 | maybe if it's from the internet or whatever.
00:11:26.240 | But then beyond that, you process it as little as possible
00:11:29.120 | so that it reflects as well as possible what people might actually
00:11:32.600 | be using this for.
00:11:35.080 | So maybe earlier in the course, when we were looking at Word2Vec,
00:11:38.320 | we might have thought about,
00:11:40.280 | oh, we don't want Word2Vec vectors of punctuation or something like that.
00:11:45.520 | Now everything is just as close as possible
00:11:48.240 | to what the text you'd get with people trying to use your system would be.
00:11:52.120 | So yes, in practice, punctuation and dot, dot, dot
00:11:55.600 | might be its own word, and maybe a sequence of hyphens,
00:12:00.320 | because people make big bars across tables.
00:12:03.200 | Yeah.
00:12:03.700 | How does it impact it if one word is now
00:12:11.800 | multiple embeddings versus a single embedding?
00:12:16.680 | Does the system treat those any differently?
00:12:21.760 | The question is, does the system treat any differently words
00:12:24.280 | that are really themselves a whole word versus words that are pieces?
00:12:28.440 | No, the system has no idea.
00:12:29.680 | They're all just indices into your embedding vocabulary matrix.
00:12:36.320 | So they're all treated equally.
00:12:37.960 | What about really long words that are relatively common?
00:12:44.640 | Because if you're building up from single character all the way up,
00:12:47.880 | what happens then?
00:12:49.440 | The question is, what happens to very long words
00:12:51.920 | if you're building up from character pairs and portions of characters?
00:12:57.400 | In practice, the statistics speak really well for themselves.
00:13:01.080 | So if a long word is very common, it will end up in the vocabulary.
00:13:04.720 | And if it's not very common, it won't.
00:13:07.920 | There are algorithms that aren't this that do slightly better in various ways.
00:13:13.000 | But the intuition that you figure out what the common co-occurring
00:13:17.520 | substrings are, independent of length almost,
00:13:20.480 | is the right intuition to have.
00:13:22.040 | And so you can actually just look at the learned vocabularies
00:13:25.080 | of a lot of these models.
00:13:26.600 | And you see some long words just because they showed up a lot.
00:13:32.240 | I'm curious, how does it weigh the frequency?
00:13:41.280 | So let's say there's ify.
00:13:43.680 | In your next slide, it was like ify at the very last one.
00:13:48.080 | So if could be really common.
00:13:50.120 | So how does it weigh the frequency of a subword versus the length of it?
00:13:54.320 | It tries to split it up into the smallest number.
00:13:56.920 | But what if it split it up into three, but one of them was super common?
00:14:00.960 | Yeah, so the question is, if transformer is a subword in my vocabulary,
00:14:05.920 | and if is a subword, and y is a subword, and ify as a three-letter tuple
00:14:12.840 | is also a subword, how does it choose to take the--
00:14:15.800 | ify, maybe it's not very common, as opposed
00:14:19.920 | to splitting it into more subwords.
00:14:23.000 | It's just a choice.
00:14:23.840 | We choose to try to take the smallest number of subwords,
00:14:26.480 | because that tends to be more of the bottleneck, as opposed
00:14:29.720 | to having a bunch of very common, very short subwords.
00:14:34.800 | Sequence length is a big problem in transformers.
00:14:36.960 | And this seems to be what works.
00:14:39.360 | Although trying to split things into multiple options of a sequence
00:14:42.560 | and running the transformer on all of them
00:14:44.600 | is the thing that people have done to see which one will work better.
00:14:47.760 | But yeah, having fewer bigger subwords tends to be the best sort of idea.
00:14:51.640 | I'm going to start moving on, though.
00:14:53.320 | Feel free to ask me more questions about this afterward.
00:14:56.720 | OK, so let's talk about pre-training from the context of the course so far.
00:15:03.120 | So at the very beginning of the course, we gave you this quote, which was,
00:15:07.480 | "You shall know a word by the company it keeps."
00:15:09.640 | This was the sort of thesis of the distributional hypothesis,
00:15:13.640 | that the meaning of the word is defined by, or at least reflected by,
00:15:17.960 | what words it tends to co-occur around.
00:15:19.800 | And we implemented this via Word2Vec.
00:15:23.960 | The same person who made that quote had a separate quote, actually earlier,
00:15:29.720 | that continues this notion of meaning as defined by context, which
00:15:34.800 | has something along the lines of, well, since the word shows up
00:15:38.920 | in context when we actually use it, when we speak to each other,
00:15:42.560 | the meaning of the word should be defined in the context
00:15:45.760 | that it actually shows up in.
00:15:47.480 | And so the complete meaning of a word is always contextual,
00:15:51.360 | and no study of meaning apart from a complete context
00:15:54.280 | can be taken seriously.
00:15:55.920 | So the big difference here is, at Word2Vec training time,
00:16:01.240 | if I have the word record, R-E-C-O-R-D, when I'm training Word2Vec,
00:16:07.920 | I get one vector or two, but one vector meaning record, the string.
00:16:16.160 | And it has to learn by what context it shows up in,
00:16:19.960 | that sometimes it can mean I record, i.e. the verb, or record, i.e.
00:16:26.720 | the noun.
00:16:28.040 | But I only have one vector to represent it.
00:16:30.480 | And so when I use the Word2Vec embedding of record,
00:16:33.320 | it sort of has this mixture meaning of both of its sort of senses, right?
00:16:38.960 | It doesn't get to specialize and say, oh, this part means record,
00:16:43.040 | and this part means record.
00:16:45.040 | And so Word2Vec is going to just sort of fail.
00:16:48.320 | And so I can build better representations of language
00:16:51.360 | through these contextual representations that
00:16:53.640 | are going to take things like recurrent neural networks or transformers
00:16:56.640 | that we used before to build up sort of contextual meaning.
00:16:59.640 | [AUDIO OUT]
00:17:03.320 | So what we had before were pre-trained word embeddings.
00:17:07.600 | And then we had sort of a big box on top of it,
00:17:10.960 | like a transformer or an LSTM, that was not pre-trained, right?
00:17:15.160 | So you learn via context your word embeddings here.
00:17:19.320 | And then you have a task, like sentiment analysis or machine translation
00:17:23.400 | or parsing or whatever.
00:17:25.760 | And you initialize all the parameters of this randomly.
00:17:29.180 | And then you train to predict your label.
00:17:33.120 | And the big difference in today's work is
00:17:37.040 | that we're going to try to pre-train all the parameters.
00:17:39.600 | So I have my big transformer.
00:17:41.180 | And instead of just pre-training my word embeddings with Word2Vec,
00:17:45.500 | I'm going to train all of the parameters of the network,
00:17:50.800 | trying to teach it much more about language
00:17:54.560 | that I could use in my downstream tasks.
00:17:57.600 | So now the labeled data that I have for, say, machine translation
00:18:03.600 | might need to be smaller.
00:18:05.720 | I might not need as much of it, because I've already
00:18:08.520 | trained much more of the network than I otherwise
00:18:10.760 | would have if I had just gotten Word2Vec embeddings.
00:18:13.480 | So here, I've pre-trained this entire structure--
00:18:20.360 | the word embeddings, the transformer on top.
00:18:23.640 | Everything's been trained via methods that we'll talk about today.
00:18:27.040 | And so what does this give you?
00:18:28.680 | I mean, it gives you very strong representations of language.
00:18:31.520 | So the meaning of record and record will be different
00:18:36.120 | in the sort of contextual representations that
00:18:38.920 | know where in the sequence it is and what words are co-occurring with it
00:18:42.920 | in this specific input than Word2Vec, which only has one representation
00:18:46.800 | for record independent of where it shows up.
00:18:50.080 | It'll also be used as strong parameter initializations for NLP models.
00:18:55.040 | So in all of your homework so far, you've
00:18:56.920 | worked with building out a natural language processing
00:19:00.440 | system sort of from scratch.
00:19:02.040 | How do I initialize this weight matrix?
00:19:03.680 | And we always say, oh, small, normally distributed noise,
00:19:08.080 | like little values close to 0.
00:19:12.280 | And here, we're going to say, well, just like we
00:19:14.800 | were going to use the Word2Vec embeddings and those sort of encoded
00:19:18.440 | structure, I'm going to start maybe my machine translation
00:19:21.400 | system from a parameter initialization that's
00:19:23.760 | given to me via pre-training.
00:19:27.380 | And then also, it's going to give us probability distributions
00:19:29.880 | over language that we can use to generate and otherwise.
00:19:33.440 | And we'll talk about this.
00:19:35.800 | So whole models are going to be pre-trained.
00:19:38.240 | So all of pre-training is effectively going
00:19:42.020 | to be centered around this idea of reconstructing the input.
00:19:45.600 | So you have an input.
00:19:47.040 | It's a sequence of text that some human has generated.
00:19:49.840 | And the sort of hypothesis is that by masking out part of it
00:19:55.960 | and tasking a neural network with reconstructing the original input,
00:20:00.720 | that neural network has to learn a lot about language, about the world,
00:20:05.320 | in order to do a good job of reconstructing the input.
00:20:07.960 | So this is now a supervised learning problem,
00:20:10.880 | just like machine translation.
00:20:13.520 | Taking this sentence that just existed, Stanford University
00:20:16.120 | is located in, say, Palo Alto, California, or Stanford, California,
00:20:20.560 | I guess.
00:20:23.240 | And I have, by removing this part of the sentence, made a label for myself.
00:20:29.560 | The input is this sort of broken masked sentence.
00:20:33.680 | And the label is Stanford or Palo Alto.
00:20:36.420 | So if I give this example to a network and ask
00:20:41.940 | it to predict the center thing, as it's doing its gradient step
00:20:45.360 | on this input, it's going to encode information
00:20:47.760 | about the co-occurrence between this context, Stanford University is located
00:20:51.600 | in, and Palo Alto.
00:20:53.680 | So by tasking it with this, it might learn, say, where Stanford is.
00:20:58.320 | What else might it learn?
00:20:59.320 | Well, it can learn things about maybe syntax.
00:21:01.560 | So I put blank fork down on the table.
00:21:05.480 | Here, there's only a certain set of words that could go here.
00:21:08.200 | I put the fork down on the table.
00:21:09.960 | I put a fork down on the table.
00:21:11.960 | These are syntactic constraints.
00:21:14.240 | So the context shows me what kinds of words can appear
00:21:18.520 | in what kinds of contexts.
00:21:19.720 | The woman walked across the street checking
00:21:24.320 | for traffic over blank shoulder.
00:21:27.000 | Any ideas on what could go here?
00:21:29.520 | Her, right?
00:21:30.080 | So this sort of co-reference between this entity
00:21:35.320 | who is being discussed in the world, this woman, and her shoulder.
00:21:39.040 | Now, when I discuss--
00:21:40.840 | this is sort of a linguistic concept.
00:21:42.380 | Her here is a co-referent to woman.
00:21:44.840 | It's referring to the same entity in the discourse.
00:21:47.240 | And so the network might be able to learn things about what
00:21:51.400 | entities are doing what where.
00:21:52.800 | It can learn things about semantics.
00:21:58.480 | So if I went to the ocean to see the fish, turtles, seals, and blank,
00:22:02.800 | then the word that's in the blank should be a member of the class
00:22:06.520 | that I'm thinking of as a person writing this sentence of stuff
00:22:09.840 | that I see when I go to the ocean and see these other things as well.
00:22:13.860 | So in order to do this prediction task, maybe
00:22:15.840 | I learn about the semantics of aquatic creatures.
00:22:22.840 | OK, so what else could I learn?
00:22:24.580 | I've got overall, the value I got from the two hours watching it
00:22:27.460 | was the sum total of the popcorn and drink.
00:22:29.760 | The movie was blank.
00:22:31.920 | What kind of task could I be learning from doing
00:22:33.980 | this sort of prediction problem?
00:22:37.680 | Sentiment, exactly.
00:22:38.820 | So this is just a naturalistic sort of text that I naturally wrote myself.
00:22:45.800 | But by saying, oh, the movie was bad, I'm
00:22:48.920 | learning about sort of the latent sentiment of the person who
00:22:53.200 | wrote this, what they were feeling about the movie at the time.
00:22:57.080 | So maybe if I see a new review later on, I can just paste in the review,
00:23:01.240 | say the movie was blank.
00:23:04.400 | And if the model generates bad or good, that
00:23:07.200 | could be implicitly solving the task of sentiment analysis.
00:23:10.640 | So here's another one.
00:23:14.720 | Iroh went to the kitchen to make some tea.
00:23:16.760 | Standing next to Iroh, Zuko pondered his destiny.
00:23:19.760 | Zuko left the blank.
00:23:23.160 | OK, so in this scenario, we've got a world implicitly
00:23:27.120 | that's been designed by the person who is creating this text.
00:23:31.160 | I've got physical locations in the discourse, like the kitchen.
00:23:35.280 | And I've got Zuko.
00:23:37.160 | Iroh's in the kitchen.
00:23:38.480 | Zuko's next to Iroh.
00:23:40.680 | So Zuko must be in the kitchen.
00:23:44.080 | So what could Zuko leave but the kitchen?
00:23:47.120 | And so in terms of latent notions of embodiment and physical location,
00:23:51.640 | the way that people talk about people being next to something
00:23:54.760 | and then leaving something could tell you
00:23:57.800 | stuff about sort of, yeah, a little bit about how the world works even.
00:24:04.920 | So here's a sequence.
00:24:06.360 | I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21, blank.
00:24:12.640 | And this is a pretty tough one, right?
00:24:18.000 | This is the Fibonacci sequence, right?
00:24:19.640 | Whether a model, by looking at a bunch of numbers from the Fibonacci
00:24:23.120 | sequence, could learn to, in general, predict the next one
00:24:27.160 | is a question you should be thinking about throughout the lecture.
00:24:31.920 | OK, any questions on these sort of examples
00:24:34.240 | of what you might learn from predicting the context?
00:24:36.360 | OK, OK, cool.
00:24:44.240 | So a very simple way to think about pre-training
00:24:47.800 | is pre-training is language modeling.
00:24:49.340 | So we saw language modeling earlier in the course.
00:24:51.640 | And now we're just going to say, instead of using my language model just
00:24:55.140 | to provide probabilities over the next word,
00:24:57.560 | I am going to train it on that task.
00:24:59.600 | I'm going to actually model the distribution p theta of the word
00:25:06.440 | t given all the words previous.
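
Written out, the language modeling objective being described is roughly (notation assumed):

```latex
p_\theta(w_t \mid w_1, \ldots, w_{t-1}), \qquad
\max_\theta \; \sum_t \log p_\theta(w_t \mid w_1, \ldots, w_{t-1})
```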
00:25:10.080 | And there's a ton of data for this, right?
00:25:12.240 | There's just an amazing amount of data for this in a lot of languages,
00:25:15.720 | especially English.
00:25:16.800 | There's very little data for this in actually
00:25:18.680 | most of the world's languages, which is a separate problem.
00:25:21.920 | But you can pre-train just through language modeling, right?
00:25:24.340 | So I'm going to sort of do the teacher forcing thing.
00:25:27.160 | So I have Iroh.
00:25:28.080 | I predict goes.
00:25:28.880 | I have goes.
00:25:29.400 | I predict to.
00:25:30.600 | And I'm going to train my sort of LSTM or my transformer to do this task.
00:25:35.760 | And then I'm just going to keep all the weights.
00:25:38.400 | OK, I'm going to save all the network parameters.
00:25:41.000 | And then once I have these parameters, instead
00:25:46.340 | of generating from my language model, I'm
00:25:48.040 | just going to use them as an initialization for my parameters.
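
A minimal sketch of that pre-train-then-save step as a next-word prediction loop with teacher forcing (the tiny model, random token ids, and file name are placeholders; real pre-training uses a transformer over a huge corpus):

```python
import torch
import torch.nn as nn

# Toy corpus, already tokenized to integer ids (vocabulary size 10 is an assumption).
vocab_size, d_model = 10, 32
data = torch.randint(0, vocab_size, (64, 12))   # 64 "sentences" of 12 tokens each

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.LSTM(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(self.embed(x))
        return self.out(h)                       # logits for the next token at each position

model = TinyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):                          # pre-training: predict word t from words < t
    inputs, targets = data[:, :-1], data[:, 1:]  # teacher forcing: shift the sequence by one
    logits = model(inputs)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()

torch.save(model.state_dict(), "pretrained_lm.pt")  # keep all the weights for fine-tuning
```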
00:25:52.280 | So I have this pre-training fine-tuning paradigm.
00:25:55.280 | Two steps.
00:25:56.640 | Most of you, I think, in your--
00:25:58.320 | well, maybe not this year.
00:25:59.680 | Let's say a large portion of you this year in your final projects
00:26:02.440 | will be doing the pre-training fine-tuning sort of paradigm,
00:26:05.160 | where someone has done the pre-training for you, right?
00:26:07.460 | So you have a ton of text.
00:26:09.100 | You learn very general things about the distribution of words
00:26:13.080 | and sort of the latent things that that tells you about the world
00:26:15.940 | and about language.
00:26:17.400 | And then in step two, you've got some task, maybe sentiment analysis.
00:26:22.040 | And you have maybe not very many labels.
00:26:24.560 | You have a little bit of labeled data.
00:26:26.720 | And you adapt the pre-trained model to the task
00:26:29.840 | that you care about by further doing gradient steps on this task.
00:26:34.040 | So you give it the movie was.
00:26:35.680 | You predict happy or sad.
00:26:37.960 | And then you sort of continue to update the parameters
00:26:42.080 | based on the initialization from the pre-training.
00:26:46.240 | And this just works exceptionally well--
00:26:48.440 | I mean, unbelievably well-- compared to training from scratch.
00:26:51.800 | Intuitively, because you've taken a lot of the burden of learning
00:26:54.760 | about language, learning about the world, off of the data
00:26:58.280 | that you've labeled for sentiment analysis.
00:27:00.400 | And you're sort of giving that task of learning
00:27:02.560 | all this sort of very general stuff to the much more general task of language
00:27:06.400 | modeling.
00:27:07.460 | You said we didn't have much data in other languages.
00:27:10.880 | What do you mean by data?
00:27:11.920 | Is it just text in that language?
00:27:13.960 | Yeah.
00:27:14.460 | Or is it labeled in some way?
00:27:16.600 | The question is, you said we have a lot of data in English,
00:27:19.720 | but not in other languages.
00:27:22.320 | What do you mean by data that we don't have a lot of in other languages?
00:27:25.280 | Is it just text?
00:27:25.980 | It's literally just text.
00:27:28.320 | No annotations.
00:27:29.960 | Because you don't need annotations to do language model pre-training, right?
00:27:33.240 | The existence of that sequence of words that someone has written
00:27:37.280 | provides you with all these pairs of input and output.
00:27:41.040 | Input Iroh, output goes.
00:27:42.680 | Input Iroh goes, output to.
00:27:44.800 | Those are all labels sort of that you've constructed from the input just
00:27:48.840 | existing.
00:27:49.520 | But in most languages, even on the entire internet,
00:27:52.840 | I mean, there's about 7,000-ish languages on Earth.
00:27:55.960 | And most of them don't have the sort of billions of words
00:28:01.200 | you might want to train these systems on.
00:28:04.760 | Yeah?
00:28:06.680 | If you're pre-training the entire thing,
00:28:08.320 | are you still learning one vector representation per word?
00:28:11.480 | The question is, if you're pre-training the entire thing,
00:28:13.800 | do you still learn one vector representation per word?
00:28:16.120 | You learn one vector representation that
00:28:17.940 | is the non-contextual input vector.
00:28:21.280 | So you have your vocabulary matrix.
00:28:23.000 | You've got your embedding matrix that is vocabulary size
00:28:26.240 | by model dimensionality.
00:28:28.920 | And so yeah, Iroh has one vector.
00:28:30.720 | Goes has one vector.
00:28:32.680 | But then the transformer that you're learning on top of it
00:28:35.520 | takes in the sequence so far and sort of gives a vector to each of them
00:28:39.440 | that's dependent on the context in that case.
00:28:41.760 | But still, at the input, you only have one embedding per word.
00:28:46.000 | Yeah?
00:28:46.500 | So what sort of metrics would you use to evaluate a pre-trained model?
00:28:51.740 | It's supposed to be general.
00:28:53.900 | But there's application-specific metrics.
00:28:55.660 | So which one do you use?
00:28:56.860 | Yeah.
00:28:57.340 | So the question is, what metric do you
00:28:58.700 | use to evaluate pre-trained models since it's
00:29:00.620 | supposed to be so general?
00:29:02.740 | But there are lots of very specific evaluations you could use.
00:29:07.300 | We'll get into a lot of that in the rest of the lecture.
00:29:09.940 | While you're training it, you can use simple metrics
00:29:12.220 | that sort of correlate with what you want
00:29:13.900 | but aren't actually what you want, just like the probability quality.
00:29:18.340 | So you can evaluate the perplexity of your language model
00:29:21.180 | just like you would have when you cared about language modeling.
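
For reference, perplexity here is the usual exponentiated average negative log-likelihood:

```latex
\mathrm{PPL} = \exp\!\Big(-\tfrac{1}{T} \sum_{t=1}^{T} \log p_\theta(w_t \mid w_{<t})\Big)
```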
00:29:23.760 | And it turns out to be the case that better perplexity correlates
00:29:27.460 | with all the stuff that's much harder to evaluate,
00:29:30.080 | like lots and lots of different tasks.
00:29:32.420 | But also, the natural language processing community
00:29:34.460 | has built very large sort of benchmark suites of varying tasks
00:29:39.520 | to try to get at sort of a notion of generality,
00:29:41.780 | although that's very, very difficult.
00:29:43.460 | It's sort of ill-defined, even.
00:29:45.540 | And so when you develop new pre-training methods, what you often do
00:29:48.820 | is you try to pick a whole bunch of evaluations
00:29:51.260 | and show that you do better on all of them.
00:29:53.700 | And that's your argument for generality.
00:29:55.660 | So why should this sort of pre-training, fine-tuning, two-part paradigm help?
00:30:06.740 | This is still an open area of research, but the intuitions
00:30:10.380 | are all you're going to take from this course.
00:30:12.500 | So pre-training provides some sort of starting parameters, theta hat.
00:30:17.500 | So this is like all the parameters in your network,
00:30:20.140 | from trying to do this minimum over all possible settings of your parameters
00:30:24.300 | of the pre-training loss.
00:30:26.900 | And then the fine-tuning process takes your data for fine-tuning.
00:30:31.220 | You've got some labels.
00:30:32.580 | And it tries to approximate the minimum through gradient descent
00:30:36.380 | of the loss of the fine-tuning task of theta.
00:30:39.140 | But you start at theta hat.
00:30:41.340 | So you start gradient descent at theta hat,
00:30:43.820 | which your pre-training process gave you.
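
In symbols, the two-stage recipe being described is roughly (the loss names are just notation):

```latex
\hat{\theta} \approx \arg\min_{\theta} \, \mathcal{L}_{\text{pretrain}}(\theta),
\qquad \text{then approximate} \quad
\min_{\theta} \, \mathcal{L}_{\text{finetune}}(\theta)
\quad \text{by gradient descent started at } \theta = \hat{\theta}.
```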
00:30:46.420 | And then if you could actually solve this min and wanted to,
00:30:51.900 | it sort of feels like the starting point shouldn't matter.
00:30:55.700 | But it really, really, really does.
00:30:58.140 | It really does.
00:31:00.940 | So we'll talk a bit more about this later.
00:31:03.900 | But the process of gradient descent, maybe it
00:31:07.660 | sticks relatively close to the theta hat during fine-tuning.
00:31:11.700 | So you start at theta hat.
00:31:14.620 | And then you sort of walk downhill with gradient descent
00:31:17.500 | until you hit sort of a valley.
00:31:19.380 | And that valley ends up being really good
00:31:21.620 | because it's close to the pre-training parameters, which were
00:31:24.260 | really good for a lot of things.
00:31:26.060 | This is a cool place where sort of practice and theory
00:31:29.300 | are sort of like meeting, where optimization people want
00:31:31.940 | to understand why this is so useful.
00:31:34.700 | NLP people sort of just want to build better systems.
00:31:39.060 | So yeah, maybe the stuff around theta hat
00:31:43.140 | tends to generalize well.
00:31:44.460 | If you want to work on this kind of thing,
00:31:46.220 | you should talk about it.
00:31:47.220 | Yeah?
00:31:48.220 | So if stochastic gradient descent
00:31:50.220 | sticks relatively close, but what
00:31:51.980 | if we were to use a different optimizer?
00:31:53.740 | How would that change our results?
00:31:56.180 | The question is, if stochastic gradient descent
00:31:59.180 | sticks relatively close, what if we use a different optimizer?
00:32:01.780 | I mean, if we use sort of any common variant of gradient
00:32:05.020 | descent, like any first order method,
00:32:07.100 | like Adam, which we use in this course, or AdaGrad,
00:32:10.420 | they all have these very, very similar properties.
00:32:14.860 | Other types of optimization we just tend to not use.
00:32:17.660 | So who knows?
00:32:19.460 | Yeah?
00:32:19.960 | Yeah, I'm still a little unclear on why
00:32:21.700 | the pre-training plus fine tuning works better than just
00:32:25.060 | fine tuning, but making the model more powerful,
00:32:27.180 | like adding more layers, more data, et cetera.
00:32:29.580 | Yeah.
00:32:30.300 | The question is, why does the pre-trained fine tune paradigm
00:32:33.540 | work better than just making the model more powerful,
00:32:36.580 | adding more layers, adding more data to just the fine tuning?
00:32:39.180 | The simple answer is that you have orders of magnitude
00:32:45.860 | more data that's unlabeled.
00:32:48.500 | That's just text that you found.
00:32:51.860 | Than you do carefully labeled data for the tasks
00:32:54.460 | that you care about, right?
00:32:55.540 | Because that's expensive to get.
00:32:57.140 | It has to be examples of your movie reviews
00:32:59.660 | or whatever that you've had someone label carefully.
00:33:03.220 | So you have something like on the internet at least 5
00:33:09.460 | trillion, maybe 10 trillion words of this,
00:33:13.020 | and you have maybe a million words of your labeled data
00:33:16.620 | or whatever over here.
00:33:17.860 | So it's just the scale is way off.
00:33:21.460 | But there's also an intuition that learning
00:33:24.260 | to do a very, very simple thing like sentiment analysis
00:33:28.180 | is not going to get you a very generally able agent
00:33:34.940 | in a wide range of settings compared to language modeling.
00:33:38.940 | So it's hard to get--
00:33:40.900 | how do I put it?
00:33:42.180 | Even if you have a lot of labeled data of movie reviews
00:33:45.020 | of the kind that people are writing today, maybe tomorrow
00:33:49.260 | they start writing slightly different kinds of movie
00:33:51.380 | reviews, and your system doesn't perform as well.
00:33:53.660 | Whereas if you pre-trained on a really diverse set of text
00:33:56.700 | from a wide range of sources and people,
00:33:58.900 | it might be more adaptable to seeing stuff that doesn't quite
00:34:03.260 | look like the training data you showed it,
00:34:05.060 | even if you showed it a ton of training data.
00:34:07.580 | So one of the big takeaways of pre-training
00:34:10.420 | is that you get this huge amount of variety of text
00:34:14.980 | on the internet.
00:34:15.660 | And you have to be very careful.
00:34:17.100 | I mean, yeah, you should be very careful about what kind of text
00:34:20.220 | you're showing it and what kind of text you're not,
00:34:22.460 | because the internet is full of awful text as well.
00:34:27.780 | But some of that generality just comes
00:34:29.660 | from how hard this problem is and how much data
00:34:31.940 | you can show it.
00:34:33.940 | [INAUDIBLE]
00:34:34.420 | --pre-trained model was trained on so much data.
00:34:37.780 | How do you then train it so that it considers the stuff
00:34:42.140 | that you're fine-tuning it with as more important, more
00:34:44.660 | salient to the task it's trying to do,
00:34:46.660 | rather than just one in a billion articles of data?
00:34:50.580 | Yeah, it's a good question.
00:34:51.900 | So the question is, given that the amount of data
00:34:54.380 | on the pre-training side is orders of magnitude
00:34:56.340 | more than the amount of data on the fine-tuning side,
00:34:58.540 | how do you get across to the model that, OK, actually,
00:35:01.220 | the fine-tuning task is what I care about.
00:35:03.140 | So focus on that.
00:35:04.940 | It's about the fact that I did this first,
00:35:07.220 | the pre-training first.
00:35:08.540 | And then I do the fine-tuning second.
00:35:11.900 | So I've gotten my parameter initialization from this.
00:35:14.780 | I've set it somewhere.
00:35:16.100 | And then I fine-tune.
00:35:17.620 | I move to where the parameters are doing well
00:35:20.100 | for this task afterward.
00:35:22.220 | And so, well, it might just forget a lot
00:35:25.060 | about how to do this, because now I'm just asking
00:35:27.540 | it to do this at this point.
00:35:30.820 | I should move on, I think.
00:35:32.940 | But we're going to keep talking about this in much more detail
00:35:36.060 | with more concrete elements.
00:35:38.180 | So OK, so let's talk about model pre-training.
00:35:44.980 | Oh, wait.
00:35:47.140 | That did not advance the slides.
00:35:49.100 | Nice, OK.
00:35:55.140 | Let's talk about model pre-training three ways.
00:35:58.020 | In our Transformers lecture Tuesday,
00:36:01.660 | we talked about encoders, encoder decoders, and decoders.
00:36:04.980 | And we'll do decoders last, because actually,
00:36:08.580 | many of the largest models that are being used today
00:36:12.140 | are all decoders.
00:36:14.180 | And so we'll have a bit more to say about them.
00:36:17.260 | So let's recall these three.
00:36:19.340 | So encoders get bidirectional context.
00:36:21.540 | You have a single sequence, and you're
00:36:23.700 | able to see the whole thing, kind of like an encoder
00:36:25.940 | in machine translation.
00:36:28.100 | Encoder decoders have one portion of the network
00:36:32.340 | that gets bidirectional context.
00:36:34.140 | So that's like the source sentence of my machine
00:36:36.620 | translation system.
00:36:37.900 | And then they're sort of paired with a decoder that
00:36:40.540 | gets unidirectional context, so that I
00:36:42.420 | have this sort of informational masking where
00:36:45.420 | I can't see the future, so that I can do things
00:36:47.460 | like language modeling.
00:36:48.500 | I can generate the next token of my translation, whatever.
00:36:51.260 | So you could think of it as I've got my source sentence here,
00:36:54.820 | and my partial translation here, and I'm sort of decoding
00:36:57.260 | out the translation.
00:36:59.180 | And then decoders only are things like language models.
00:37:02.060 | We've seen a lot of this so far.
00:37:03.540 | And there's pre-training for all three sort
00:37:05.580 | of large classes of models.
00:37:09.100 | And how you pre-train them and then how you use them
00:37:11.380 | depends on the properties and the proactivities
00:37:14.260 | of the specific architecture.
00:37:15.740 | So let's look at encoders first.
00:37:18.740 | So we've looked at language modeling quite a bit.
00:37:21.460 | But we can't do language modeling with an encoder,
00:37:24.100 | because they get bidirectional context.
00:37:26.620 | So if I'm down here at i, and I want to present--
00:37:31.100 | I want to predict the next word, it's
00:37:33.460 | a trivial task at this level here to predict the next word.
00:37:38.020 | Because in the middle, I was able to look at the next word.
00:37:41.900 | And so I should just know.
00:37:43.060 | There's nothing hard about learning to predict the next word here,
00:37:45.560 | because I could just look at it, see what it is, and then copy it over.
00:37:49.380 | So when I'm training an encoder in something for pre-training,
00:37:54.720 | I have to be a little bit more clever.
00:37:57.380 | In practice, what I do is something like this.
00:37:59.900 | I take the input, and I modify it somewhat.
00:38:02.100 | I mask out words, sort of like I did in the examples
00:38:04.620 | I gave at the beginning of class.
00:38:06.020 | So I blank to the blank.
00:38:09.260 | And then I have the network predict with its whole--
00:38:12.980 | I have it build contextual representations.
00:38:15.340 | So now this vector representation of the blank
00:38:18.060 | sees the entire context around it here.
00:38:22.340 | And then I predict the word "went," and then here, the word "store."
00:38:29.340 | Any questions?
00:38:34.460 | And you can see how this is doing something quite a bit like language
00:38:37.940 | modeling, but with bidirectional context.
00:38:41.180 | I've removed the network's information about the words that go in the blanks,
00:38:45.340 | and I'm training it to reconstruct that.
00:38:47.740 | So I only have loss terms, right?
00:38:49.620 | I only ask it to actually do the prediction, compute the loss,
00:38:52.780 | backpropagate the gradients for the words that I've masked out.
00:38:56.580 | And you can think of this as instead of learning probability of x,
00:39:00.580 | where x is like a sentence or a document,
00:39:03.140 | this is learning the probability of x, the real document,
00:39:06.300 | given x tilde, which is this sort of corrupted document,
00:39:11.420 | with some of the information missing.
00:39:14.940 | And so we get the sequence of vectors here,
00:39:17.780 | one per word, which is the output of my encoder in blue.
00:39:21.380 | And then I'd say that for the words that I want to predict, yi, I draw them.
00:39:25.700 | This is the sim means the probability is proportional to my embedding matrix
00:39:32.940 | times my representation of it.
00:39:36.500 | So it's just a linear transformation of that last thing here.
00:39:38.980 | So this a plus b is this red portion here.
00:39:41.860 | And I do the prediction, and I train the entire network to do this.
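
As a formula, the masked-prediction setup just described looks roughly like this, where x tilde is the masked input and the loss is summed only over the masked positions i:

```latex
h_1, \ldots, h_T = \mathrm{Encoder}(\tilde{x}), \qquad
p_\theta(y_i \mid \tilde{x}) = \mathrm{softmax}(A h_i + b)_{y_i}
```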
00:39:47.020 | So the words that we mask out, do we just select them randomly,
00:39:51.900 | or is there some scheme to it?
00:39:54.260 | The question is, do we just choose words randomly to mask out,
00:39:57.100 | or is there a scheme?
00:39:58.380 | Mostly randomly.
00:39:59.380 | We'll talk about a slightly smarter scheme in a couple of slides,
00:40:02.140 | but yeah, just mostly randomly.
00:40:05.500 | Yeah?
00:40:07.020 | What was that last part on the bottom, x, the masked version of--
00:40:11.460 | like, if it's the first or the very last sentence?
00:40:16.580 | Yeah, so I'm saying that I'm defining x tilde to be this input part, where
00:40:23.100 | I've got the masked version of the sentence with these words missing.
00:40:26.820 | And then I'm defining a probability distribution
00:40:29.060 | that's the probability of a sequence conditioned
00:40:32.340 | on the input being the corrupted sequence, the masked sequence.
00:40:35.940 | So this brings us to a very, very popular NLP model
00:40:47.300 | that you need to know about.
00:40:48.460 | It's called BERT.
00:40:49.940 | And it was the first one to popularize this masked language modeling
00:40:53.500 | objective.
00:40:55.300 | And they released the weights of this pre-trained transformer
00:40:58.420 | that they pre-trained via something that looks a lot like masked language
00:41:01.380 | modeling.
00:41:01.900 | And so these you can download.
00:41:03.780 | You can use them via code that's released by the company HuggingFace
00:41:07.660 | that we have continued to bring up.
00:41:10.300 | Many of you will use a model like BERT in your final project
00:41:13.700 | because it's such a useful builder of representations
00:41:16.500 | of language in context.
00:41:18.340 | So let's talk a little bit about the details
00:41:20.140 | of masked language modeling in BERT.
00:41:23.260 | First, we take 15% of the subword tokens.
00:41:27.020 | So remember, all of our inputs now are subword tokens.
00:41:30.460 | I've made them all look like words.
00:41:32.500 | But just like we saw at the very beginning of class,
00:41:34.620 | each of these tokens could just be some portion, some subword.
00:41:38.940 | And I'm going to do a couple of things with it.
00:41:40.900 | Sometimes I am going to just mask out the word
00:41:45.860 | and then predict the true word.
00:41:48.220 | Sometimes I'm going to replace the word with some random sample
00:41:53.260 | of another word from my vocabulary and predict
00:41:56.700 | the real word that was supposed to go there.
00:41:58.780 | And sometimes I'm going to not change the word at all
00:42:02.780 | and still predict it.
00:42:04.300 | The intuition of this is the following.
00:42:07.340 | If I just had to build good representations
00:42:11.820 | in the middle of this network for words that are masked out,
00:42:15.940 | then when I actually use the model at test time
00:42:19.220 | on some real review to do sentiment analysis on,
00:42:22.820 | well, there are never going to be any tokens like this.
00:42:25.340 | So maybe the model won't do a very good job
00:42:27.300 | because it's like, oh, I have no job to do here
00:42:29.780 | because I only need to deal with the mask tokens.
00:42:33.540 | By giving it sequences of words where sometimes it's
00:42:36.660 | the real word that needs to be predicted,
00:42:38.420 | sometimes you have to detect if the word is wrong.
00:42:41.300 | The idea is that now when I give it
00:42:43.100 | a sentence that doesn't have any masks,
00:42:46.660 | it actually does a good job of representing
00:42:48.660 | all the words in context because it has this chance
00:42:51.660 | that it could be asked to predict anything at any time.
00:42:54.120 | OK, so the folks at Google who were defining this
00:43:03.980 | had a separate additional task that is sort of interesting
00:43:09.100 | to think about.
00:43:10.780 | So this was their BERT model from their paper.
00:43:13.340 | They had their position embeddings
00:43:14.760 | just like we saw from our transformers lecture,
00:43:18.180 | token embeddings just like we saw from the transformers
00:43:20.500 | lecture.
00:43:21.620 | But then also they had this thing called a segment embedding
00:43:23.980 | where they had two possible segments, segment A
00:43:26.380 | and segment B. And they had this additional task
00:43:31.820 | where they would get a big chunk of text for segment A
00:43:34.780 | and a big chunk of text for segment B.
00:43:37.220 | And then they would ask the model,
00:43:38.780 | is segment B a real continuation of segment A?
00:43:43.140 | Was it the text that actually came next?
00:43:45.780 | Or did I just pick this big segment randomly
00:43:48.100 | from somewhere else?
00:43:49.660 | And the idea was that this should teach the network
00:43:52.180 | some notion of long distance coherence
00:43:55.460 | about the connection between a bunch of text over here
00:43:58.420 | and a bunch of text over there.
00:44:00.180 | Turns out it's not really necessary,
00:44:01.740 | but it's an interesting idea.
00:44:04.940 | And similar things have continued
00:44:06.880 | to have some sort of influence since then.
00:44:09.980 | But again, you should get this intuition
00:44:12.060 | that we're trying to come up with hard problems
00:44:14.100 | for the network to solve such that by solving them,
00:44:16.780 | it has to learn a lot about language.
00:44:19.460 | And we're defining those problems
00:44:21.580 | by making simple transformations or removing information
00:44:25.060 | from text that just happened to occur.
00:44:26.860 | Questions?
00:44:32.580 | Yeah.
00:44:33.080 | The plus signs, do we concatenate the vectors,
00:44:35.500 | or do we do an element-wise addition?
00:44:38.420 | The question is, for these plus signs,
00:44:40.020 | do we concatenate the vectors or do element-wise addition?
00:44:43.140 | We do element-wise addition.
00:44:45.940 | You could have concatenated them.
00:44:48.180 | However, one of the big conventions
00:44:50.660 | of all of these networks is that you always
00:44:52.420 | have exactly the same number of dimensions
00:44:54.980 | everywhere at every layer of the network.
00:44:56.660 | It just makes everything very simple.
00:44:58.420 | So just saying everything's the same dimension
00:45:00.300 | and then doing addition just ends up being simpler.
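
A sketch of that input representation, with the three embeddings summed element-wise (the sizes follow BERT-base, and the token ids are illustrative):

```python
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 30000, 512, 768   # BERT-base-like sizes (assumed)
tok_emb = nn.Embedding(vocab_size, d_model)
pos_emb = nn.Embedding(max_len, d_model)
seg_emb = nn.Embedding(2, d_model)               # segment A = 0, segment B = 1

def bert_input(token_ids, segment_ids):
    positions = torch.arange(token_ids.size(1)).unsqueeze(0)
    # Same dimensionality everywhere, so the three embeddings are simply added.
    return tok_emb(token_ids) + pos_emb(positions) + seg_emb(segment_ids)

ids = torch.tensor([[101, 7592, 2088, 102]])     # [CLS] hello world [SEP], ids illustrative
segs = torch.zeros_like(ids)                     # everything in segment A
x = bert_input(ids, segs)                        # shape: (1, 4, 768)
```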
00:45:03.980 | So why was the next sentence prediction not necessary?
00:45:09.220 | What's the main question for that?
00:45:11.060 | Yeah, why was the next sentence prediction not necessary?
00:45:14.420 | One thing that it does that's a negative
00:45:16.460 | is that now the effective context length for a lot
00:45:24.300 | of your examples is halved.
00:45:26.580 | So one of the things that's useful about pre-training
00:45:28.820 | seemingly is that you get to build representations
00:45:30.980 | of very long sequences of text.
00:45:33.220 | This is very short, but in practice,
00:45:35.460 | segment A was going to be something like 250 words,
00:45:39.540 | and segment B was going to be 250 words.
00:45:42.060 | And in the paper that let us know that this wasn't necessary,
00:45:45.460 | they always had a long segment of 500 words.
00:45:48.940 | And it seemed to be useful to always have
00:45:50.740 | this very long context because longer contexts help give you
00:45:55.380 | more information about the role that each word is playing
00:45:58.060 | in that specific context.
00:45:59.820 | If I see one word, it's hard to know.
00:46:02.100 | If I just see record, it's hard to know
00:46:03.980 | what it's supposed to mean.
00:46:05.220 | But if I see 1,000 words around it,
00:46:06.900 | it's much clearer what its role in that context is.
00:46:09.540 | So yeah, cutting the effective context size is one answer.
00:46:13.600 | Another thing is that this is actually much more difficult.
00:46:19.860 | This is a much more recent paper that I
00:46:21.980 | don't have in the slides.
00:46:23.260 | But it's been shown since then that these models are really,
00:46:25.760 | really bad at the next sentence prediction task.
00:46:28.860 | So it could be that maybe it just
00:46:31.140 | was too hard at the time.
00:46:34.860 | And so it just wasn't useful because the model
00:46:37.060 | was failing to do it at all.
00:46:39.740 | So I can give the link for that paper later.
00:46:43.100 | Can you explain again why we need to do a next sentence
00:46:45.940 | prediction?
00:46:46.500 | What about just masking and predicting the next?
00:46:49.020 | I missed that jump.
00:46:50.140 | So it's the next sentence.
00:46:52.020 | Yeah.
00:46:52.540 | So the question is, why do we need
00:46:53.620 | to do next sentence prediction?
00:46:54.700 | Why not just do the masking we saw before?
00:46:57.020 | That's the thing.
00:46:57.380 | You seem to not need to do next sentence prediction.
00:46:59.660 | But as history of the research, it
00:47:03.020 | was thought that this was useful.
00:47:05.380 | And the idea was that it required
00:47:07.420 | you to develop this pairwise, do these two segments of text
00:47:12.060 | interact?
00:47:12.560 | How do they interact?
00:47:13.500 | Are they related?
00:47:14.260 | The sort of longer distance notion.
00:47:16.300 | And many NLP tasks are defined on pairs of things.
00:47:19.860 | And they thought that might be useful.
00:47:22.180 | And so they published it with this.
00:47:24.020 | And then someone else came through,
00:47:25.500 | published a new model that didn't do that.
00:47:27.260 | And it sort of did better.
00:47:29.500 | So this is just-- yeah.
00:47:31.820 | So yeah.
00:47:33.060 | There are intuitions as to why it could work.
00:47:34.860 | It just didn't.
00:47:36.260 | So BERT wasn't doing masking or was doing--
00:47:38.700 | It was doing both.
00:47:39.420 | It was doing both.
00:47:40.260 | It was doing both this next sentence--
00:47:42.100 | so BERT was doing both this next sentence prediction training
00:47:46.540 | as well as this masking training all at the same time.
00:47:52.220 | And so you had to have a separate predictor head
00:47:55.380 | on top of BERT, a separate predictor sort
00:47:57.340 | of classification thing.
00:47:59.580 | And so one detail there is that there's
00:48:02.300 | this special word at the beginning of BERT
00:48:04.460 | in every sequence that's CLS.
00:48:07.140 | And you can define a predictor on top
00:48:10.140 | of that sort of fake word embedding that
00:48:12.420 | was going to say, is the next sentence real or fake or not?
00:48:16.140 | Yeah.
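As a rough sketch of how those two training signals can sit on top of one shared encoder (the hidden size, vocabulary size, and module names here are illustrative placeholders, not BERT's actual implementation):

```python
import torch.nn as nn

class BertStylePretrainingHeads(nn.Module):
    """Two predictor heads sharing the same encoder's hidden states."""
    def __init__(self, hidden_size=768, vocab_size=30000):
        super().__init__()
        self.mlm_head = nn.Linear(hidden_size, vocab_size)  # predict each masked token
        self.nsp_head = nn.Linear(hidden_size, 2)            # real next sentence vs. random

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden_size); position 0 is the special [CLS] token
        mlm_logits = self.mlm_head(hidden_states)        # (batch, seq_len, vocab_size)
        nsp_logits = self.nsp_head(hidden_states[:, 0])  # (batch, 2), read off [CLS]
        return mlm_logits, nsp_logits
```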
00:48:17.740 | OK, I'm going to move on.
00:48:20.620 | And so this gets at sort of the question
00:48:22.500 | that we had earlier about how do you evaluate these things.
00:48:25.540 | There's a lot of different NLP tasks out there.
00:48:27.780 | Gosh.
00:48:28.380 | And when people were defining these papers,
00:48:32.140 | they would look at a ton of different evaluations
00:48:34.580 | that had been sort of compiled as a set of things that
00:48:36.880 | are a little hard for today's systems.
00:48:38.860 | So are you detecting paraphrases between questions?
00:48:41.900 | Are two Quora questions actually the same question?
00:48:44.260 | That turns out to be hard.
00:48:47.500 | Can you do sentiment analysis on this hard data set?
00:48:51.540 | Can you tell if sentences are linguistically acceptable?
00:48:54.460 | Are they grammatical or not?
00:48:56.620 | Are two sequences similar semantically?
00:48:59.020 | Do they mean sort of vaguely the similar thing?
00:49:01.900 | And we'll talk a bit about natural language inference
00:49:04.100 | later, but that's the task of defining sort of if I say,
00:49:08.240 | you know, I saw the dog, that does not necessarily
00:49:11.400 | mean I saw the little dog.
00:49:14.440 | But saying I saw the little dog does mean I saw the dog.
00:49:18.000 | So that's sort of this natural language inference task.
00:49:20.560 | And the difference between the sort of pre-pre-training days,
00:49:26.920 | where you had this row here
00:49:29.320 | before you had substantial amounts of pre-training,
00:49:33.600 | and BERT-- the field was taken aback in a way that's
00:49:37.540 | hard to describe.
00:49:39.420 | You know, very carefully crafted architectures
00:49:41.960 | for each individual task, where everyone
00:49:44.040 | was designing their own neural network
00:49:45.660 | and doing things that they thought were sort of clever as
00:49:48.000 | to how to define all the connections and the weights
00:49:50.300 | and whatever to do their tasks independently.
00:49:52.400 | So everyone was doing a different thing for each one
00:49:54.600 | of these tasks, roughly.
00:49:57.360 | All of that was blown out of the water
00:49:59.360 | by just build a big transformer and just teach it
00:50:02.120 | to predict the missing words a whole bunch
00:50:04.160 | and then fine tune it on each of these tasks.
00:50:07.000 | So this was just a sea change in the field.
00:50:09.680 | People were, I mean, amazed.
00:50:11.920 | It's a little bit less flashy than ChatGPT, I'll admit.
00:50:14.760 | But it's really part of the story that gets us to it,
00:50:17.160 | you know?
00:50:18.920 | OK, questions?
00:50:20.800 | So like to get stuff out of the--
00:50:28.200 | during the encoder pre-training stage,
00:50:31.680 | encoder usually outputs some sort of hidden values.
00:50:36.680 | How do we correlate those to words
00:50:39.200 | that we are trying to test against?
00:50:41.720 | So the question is, the encoder output
00:50:44.760 | is a bunch of hidden values.
00:50:48.320 | How do we actually correlate those values to stuff
00:50:51.320 | that we want to predict?
00:50:52.640 | I'm going to go on to the next slide here to bring up
00:50:54.980 | this example here, right?
00:50:56.120 | So the encoder gives us, for each input word token,
00:51:00.200 | a vector of that token that represents
00:51:02.640 | the token in context.
00:51:04.360 | And the question is, how do we get these representations
00:51:07.520 | and turn them into sort of answers for the tasks
00:51:11.560 | that we care about?
00:51:13.080 | And the answer comes back to something like this.
00:51:30.080 | Something like this, maybe?
00:51:32.480 | Sure.
00:51:39.360 | So when we were doing the pre-training,
00:51:41.040 | we had the transformer that was giving us our representations.
00:51:43.840 | And we had this little last layer here,
00:51:46.040 | this little sort of affine transformation
00:51:49.840 | that moved us from the encoder's hidden state size
00:51:52.480 | to the vocabulary to do our prediction.
00:51:55.000 | And we just removed this last prediction layer here.
00:51:58.280 | And let's say we want to do something that is classifying
00:52:03.320 | the sentiment of the sentence.
00:52:04.600 | We just pick arbitrarily maybe the last word in the sentence.
00:52:08.320 | And we stick a linear classifier on top
00:52:11.480 | and map it to positive or negative,
00:52:13.320 | and then fine tune the whole thing.
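Here is a minimal sketch of that recipe in PyTorch; the `pretrained_encoder`, the hidden size, and the choice of which position to classify from are illustrative assumptions rather than the exact setup from the slide:

```python
import torch
import torch.nn as nn

class EncoderForSentiment(nn.Module):
    """Drop the vocabulary prediction layer; add a small classifier and fine-tune everything."""
    def __init__(self, pretrained_encoder, hidden_size=768, num_classes=2):
        super().__init__()
        self.encoder = pretrained_encoder                      # weights come from pretraining
        self.classifier = nn.Linear(hidden_size, num_classes)  # new, randomly initialized

    def forward(self, token_ids):
        hidden = self.encoder(token_ids)   # (batch, seq_len, hidden_size), one vector per token
        pooled = hidden[:, -1]             # pick some position, e.g. the last token
        return self.classifier(pooled)     # (batch, num_classes): positive vs. negative

# Fine-tuning then updates the classifier and the pretrained encoder together, e.g.:
# optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
```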
00:52:17.160 | So yeah, the BERT model had two different models.
00:52:21.160 | One was 110 million parameters.
00:52:22.920 | One was 340 million.
00:52:24.840 | Keep that sort of in the back of your head sort of percolating
00:52:27.480 | as we talk about models with many, many more parameters
00:52:30.160 | later on.
00:52:31.160 | It was trained on 800 million words of BooksCorpus
00:52:38.320 | plus about 2,500 million words of English Wikipedia,
00:52:40.000 | so on the order of a few billion
00:52:44.520 | words of text, quite a bit still.
00:52:48.220 | And it was trained on what was considered at the time
00:52:50.640 | to be a whole lot of compute.
00:52:53.120 | Just it was Google doing this.
00:52:54.840 | And they released it.
00:52:55.720 | And we were like, oh, who has that kind of compute?
00:52:57.400 | But Google-- although nowadays, it's
00:52:59.280 | not considered to be very much.
00:53:01.600 | But fine tuning is practical and common on a single GPU.
00:53:04.720 | So you could take the BERT model that they've spent a lot of time
00:53:07.560 | training and fine tune it yourself on your task
00:53:10.640 | on even sort of a very small GPU.
00:53:17.820 | So one question is like, well, this seems really great.
00:53:24.520 | Why don't we just use this for everything?
00:53:27.080 | Yeah.
00:53:31.400 | And the answer is, well, what is the sort
00:53:33.960 | of pre-training objective?
00:53:35.040 | What's the structure of the pre-trained model good for?
00:53:38.520 | BERT is really good for sort of filling in the blanks.
00:53:41.920 | But it's much less naturally used
00:53:44.320 | for actually generating text.
00:53:46.800 | So I wouldn't want to use BERT to generate
00:53:48.960 | a summary of something because it's not really built for it.
00:53:53.080 | It doesn't have a natural notion of predicting the next word given
00:53:56.520 | all the words that came before it.
00:53:58.280 | So maybe I want to use BERT if I want a good representation of, say,
00:54:01.840 | a document to classify it, give it a set of topic labels,
00:54:05.440 | or say it's toxic or non-toxic or whatever.
00:54:07.960 | But I wouldn't want to use it to generate a whole sequence.
00:54:13.460 | Some extensions of BERT.
00:54:15.040 | So we had a question earlier of whether you just
00:54:17.400 | mask things out randomly.
00:54:18.920 | One thing that seems to work better is you mask out
00:54:23.480 | sort of whole contiguous spans, because if you only mask a single subword,
00:54:30.480 | the problem is much easier than it would otherwise be-- here, the masked piece
00:54:35.720 | is part of "irresistibly,"
00:54:37.120 | and you can tell very easily based on the subwords that came before it.
00:54:41.160 | Whereas if I mask a much longer span, it's a trade-off,
00:54:45.500 | but it might be a harder problem.
00:54:47.840 | And it ends up being better to do this sort of span-based masking
00:54:51.600 | than random masking.
00:54:52.600 | And that might be because subwords make very simple prediction problems when
00:54:56.680 | you mask out just one subword of a word versus all the subwords of a word.
00:55:02.660 | So this ends up doing much better.
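A toy sketch of that span-based masking, just to make the idea concrete; the span length and the tokenization here are made up for illustration:

```python
import random

MASK = "[MASK]"

def mask_contiguous_span(tokens, span_len=4):
    """Mask one contiguous span of subwords instead of independent random positions."""
    start = random.randrange(0, max(1, len(tokens) - span_len + 1))
    corrupted = list(tokens)
    targets = {}
    for i in range(start, min(start + span_len, len(tokens))):
        targets[i] = corrupted[i]   # positions the model must reconstruct
        corrupted[i] = MASK
    return corrupted, targets

subwords = ["ir", "##res", "##ist", "##ibly", "delicious", "pizza"]
print(mask_contiguous_span(subwords))
```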
00:55:05.360 | There's also a paper called the Roberta paper,
00:55:07.360 | which showed that the next sentence prediction wasn't necessary.
00:55:12.180 | They also showed that they really should
00:55:13.880 | have trained it on a lot more text.
00:55:16.840 | So Roberta is a drop-in replacement for BERT.
00:55:19.560 | So if you're thinking of using BERT, just use Roberta.
00:55:21.760 | It's better.
00:55:22.640 | And it gave us this intuition that we really
00:55:24.280 | don't know a whole lot about the best practices for training these things.
00:55:27.400 | You sort of train it for as long as you're willing to.
00:55:29.800 | And things do good stuff and whatever.
00:55:33.600 | So this is very-- but it's very difficult to do sort of iteration
00:55:37.920 | on these models because they're big.
00:55:39.420 | It's expensive to train them.
00:55:42.520 | Another thing that you should know for your final projects in the world ahead
00:55:45.960 | is this notion of fine-tuning all parameters of the network
00:55:49.200 | versus just a couple of them.
00:55:51.200 | So what we've talked about so far is you pre-train all the parameters
00:55:54.840 | and then you fine-tune all of them as well.
00:55:56.640 | So all the parameter values change.
00:55:59.480 | The alternative, which you call parameter-efficient or lightweight fine-tuning, is
00:56:04.000 | you sort of choose little bits of parameters,
00:56:06.520 | or you choose, in a very smart way, to keep most of the parameters fixed
00:56:09.640 | and only fine-tune others.
00:56:11.480 | And the intuition is that these pre-trained parameters were really good.
00:56:16.600 | And you want to make the minimal change from the pre-trained model
00:56:20.080 | to the model that does what you want so that you
00:56:22.120 | keep some of the generality, some of the goodness of the pre-training.
00:56:26.280 | So one way that this is done is called prefix tuning.
00:56:29.280 | Prompt tuning is very similar, where you actually
00:56:31.560 | freeze all the parameters of the network.
00:56:33.280 | So I've pre-trained my network here.
00:56:36.920 | And I never change any of the parameter values.
00:56:39.720 | Instead, I make a bunch of fake sort of pseudo word vectors
00:56:44.360 | that I prepend to the very beginning of the sequence.
00:56:47.360 | And I train just them.
00:56:49.280 | Sort of unintuitive.
00:56:50.800 | It's like these would have been like inputs to the network,
00:56:53.480 | but I'm specifying them as parameters.
00:56:55.340 | And I'm training everything to do my sentiment analysis task
00:56:58.640 | just by changing the values of these sort of fake words.
00:57:03.120 | And this is nice because I get to keep all the good pre-trained parameters
00:57:08.960 | and then just specify the sort of diff that ends up generalizing better.
00:57:15.000 | This is a very open field of research.
00:57:17.520 | But this is also cheaper because I don't have to compute the gradients,
00:57:21.480 | or I don't have to store the gradients and all the optimizer state.
00:57:25.240 | With respect to all these parameters, I'm only training
00:57:28.160 | a very small number of parameters.
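A minimal sketch of prefix/prompt tuning along those lines; the number of prefix vectors, the shapes, and the assumption that the frozen model consumes embeddings directly are all simplifications for illustration:

```python
import torch
import torch.nn as nn

class PrefixTunedClassifier(nn.Module):
    """Freeze the pretrained model; train only a few 'fake word' vectors plus a small head."""
    def __init__(self, frozen_model, frozen_embed, hidden_size=768, num_prefix=10, num_classes=2):
        super().__init__()
        self.model, self.embed = frozen_model, frozen_embed
        for p in list(self.model.parameters()) + list(self.embed.parameters()):
            p.requires_grad = False                     # pretrained weights never change
        # the trainable part: pseudo-word vectors prepended to every input
        self.prefix = nn.Parameter(0.02 * torch.randn(num_prefix, hidden_size))
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, token_ids):
        x = self.embed(token_ids)                                   # (batch, seq, hidden)
        prefix = self.prefix.unsqueeze(0).expand(x.size(0), -1, -1)
        hidden = self.model(torch.cat([prefix, x], dim=1))          # frozen transformer
        return self.classifier(hidden[:, -1])                       # classify from one position
```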
00:57:30.840 | Yeah.
00:57:31.340 | [INAUDIBLE]
00:57:33.340 | It's like fake parameters at the end, as if like here.
00:57:38.180 | It doesn't make any difference if you put these
00:57:40.180 | at the end or the beginning.
00:57:41.380 | In a decoder, you have to put them at the beginning
00:57:43.540 | because otherwise you don't see them before you process the whole sequence.
00:57:49.280 | Can we just attach a few layers and only train the new layers?
00:57:53.500 | The question is, can we just attach a new layers at the top of this
00:57:56.660 | and only train those?
00:57:57.660 | Absolutely.
00:57:58.820 | This works a bit better.
00:58:00.540 | Another thing that works well-- sorry, we're running out of time--
00:58:04.340 | is taking each weight matrix.
00:58:06.780 | So I have a bunch of weight matrices in my transformer.
00:58:09.700 | And I freeze the weight matrix and learn a very low rank little diff.
00:58:15.420 | And I set the weight matrix's value to be sort of the original value
00:58:19.660 | plus my sort of very low rank diff from the original one.
00:58:24.900 | And this ends up being a very similarly useful technique.
00:58:29.620 | And the overall idea here is that, again, I'm
00:58:31.700 | learning way fewer parameters than I did via pre-training and freezing
00:58:36.180 | most of the pre-training parameters.
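A sketch of that low-rank trick (this is the idea behind methods like LoRA); the rank and the initialization here are illustrative choices:

```python
import torch
import torch.nn as nn

class LowRankAdaptedLinear(nn.Module):
    """Keep the pretrained weight W frozen; learn a low-rank diff so the effective weight is W + A @ B."""
    def __init__(self, frozen_linear: nn.Linear, rank=8):
        super().__init__()
        self.frozen = frozen_linear
        for p in self.frozen.parameters():
            p.requires_grad = False
        out_features, in_features = frozen_linear.weight.shape
        self.A = nn.Parameter(0.01 * torch.randn(out_features, rank))  # trainable
        self.B = nn.Parameter(torch.zeros(rank, in_features))          # trainable; diff starts at zero

    def forward(self, x):
        # original (frozen) output plus the low-rank correction
        return self.frozen(x) + x @ (self.A @ self.B).T
```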
00:58:39.300 | OK, encoder-decoders.
00:58:41.140 | So for encoder-decoders, we could do something like language modeling.
00:58:45.300 | I've got my input sequence here, encoder, output sequence here.
00:58:49.700 | And I could say this part is my prefix for sort
00:58:52.980 | of having bidirectional context.
00:58:55.100 | And I could then predict all the words that
00:58:58.220 | are sort of in the latter half of the sequence,
00:59:00.780 | just like a language model.
00:59:01.900 | And that would work fine.
00:59:04.460 | And so this is something that you could do.
00:59:07.140 | You sort of take a long text, split it into two,
00:59:09.700 | give half of it to the encoder, and then generate
00:59:12.080 | the second half with the decoder.
00:59:13.420 | But in practice, what works much better is this notion of span corruption.
00:59:20.300 | Span corruption is going to show up in your assignment 5.
00:59:23.260 | And the idea here is a lot like BERT, but in a sort of generative sense,
00:59:30.580 | where I'm going to mask out a bunch of words in the input.
00:59:33.500 | Thank you, mask token 1, me to your party, mask token 2, week.
00:59:40.860 | And then at the output, I generate the mask token
00:59:44.660 | and then whatever the mask token replaced.
00:59:48.580 | So mask token 1, then predict "for inviting"; mask token 2, then "last"--
00:59:52.500 | reconstructing "Thank you for inviting me to your party last week."
00:59:54.860 | And what this does is that it allows you to have bidirectional context.
01:00:00.900 | I get to see the whole sequence, except I can generate
01:00:05.220 | the parts that were missing.
01:00:07.100 | So this feels a little bit like you mask out parts of the input,
01:00:10.020 | but you actually generate the output as a sequence
01:00:12.960 | like you would in language modeling.
01:00:14.860 | So this might be good for something like machine translation,
01:00:17.420 | where I have an input that I want bidirectional context in,
01:00:20.340 | but then I want to generate an output.
01:00:22.300 | And I want to pre-train the whole thing.
01:00:24.380 | So this was shown to work better than language modeling at the scales
01:00:27.780 | that these folks at Google were able to test back in 2019.
01:00:31.580 | This is still quite popular.
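To make the input/target format concrete, here is a toy version of that corruption step; the sentinel names and the particular spans are chosen for illustration and are not T5's exact preprocessing:

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel in the input;
    the target lists each sentinel followed by the dropped tokens."""
    corrupted, target = [], []
    prev = 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        corrupted += tokens[prev:start] + [sentinel]
        target += [sentinel] + tokens[start:end]
        prev = end
    corrupted += tokens[prev:]
    return corrupted, target

tokens = "Thank you for inviting me to your party last week .".split()
inp, tgt = span_corrupt(tokens, [(2, 4), (8, 9)])
print(" ".join(inp))  # Thank you <extra_id_0> me to your party <extra_id_1> week .
print(" ".join(tgt))  # <extra_id_0> for inviting <extra_id_1> last
```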
01:00:32.860 | Yeah, there's a lot of numbers.
01:00:37.940 | It works better than the other stuff.
01:00:40.100 | I'm not going to worry about it.
01:00:43.780 | There's a fascinating property of these models also.
01:00:46.540 | So T5 was the model that was originally introduced
01:00:51.060 | with salient span masking.
01:00:52.820 | And you can think of at pre-training time,
01:00:55.820 | you saw a bunch of things like Franklin D. Roosevelt was born in blank,
01:01:00.540 | and you generated out the blank.
01:01:02.580 | And there's this task called open domain question
01:01:07.620 | answering, which has a bunch of trivia questions,
01:01:10.220 | like when was Franklin D. Roosevelt born?
01:01:12.820 | And then you're supposed to generate out the answer as a string, just
01:01:16.860 | from your parameters.
01:01:17.940 | So you did a bunch of pre-training.
01:01:19.400 | You saw a bunch of text.
01:01:20.580 | And then you're supposed to generate these answers.
01:01:22.900 | And what's fascinating is that this salient span masking method
01:01:29.900 | allowed you to pre-train and then fine tune
01:01:32.380 | on some examples of trivia questions.
01:01:36.820 | And then when you tested on new trivia questions,
01:01:40.260 | the model would implicitly extract from its pre-training data
01:01:44.540 | somehow the answer to that new question that it never
01:01:47.780 | saw explicitly at fine tuning time.
01:01:49.700 | So it learned this sort of implicit retrieval-- sometimes,
01:01:53.060 | sometimes, less than 50% of the time or whatever,
01:01:55.740 | but much more than random chance.
01:02:00.020 | And that's fascinating.
01:02:01.580 | So you've learned to access this latent knowledge
01:02:05.180 | that you stored up by pre-training.
01:02:07.380 | And so you just pass it the text, when was Roosevelt born,
01:02:10.820 | and it would pass out an answer.
01:02:13.020 | And one thing to know is that the answers always look very fluent.
01:02:15.860 | They always look very reasonable.
01:02:17.820 | But they're frequently wrong.
01:02:19.980 | And that's still true of things like ChatGPT.
01:02:21.860 | Yeah.
01:02:25.980 | OK, so that's encoder-decoder models.
01:02:30.300 | Next up, we've got decoders.
01:02:31.740 | And we'll spend a long time on decoders.
01:02:34.100 | So this is just our normal language model.
01:02:35.980 | So I get a sequence of hidden states from my decoder.
01:02:38.940 | The model-- the words can only look at themselves, not the future.
01:02:43.220 | And then I predict the next word in the sentence.
01:02:46.780 | And then here again, I can--
01:02:48.700 | to do sentiment analysis, maybe take the last state
01:02:50.900 | for the last word, and then predict happy or sad
01:02:53.540 | based on that last embedding.
01:02:56.340 | Back-propagate the gradients of the whole network,
01:02:58.420 | train the whole thing, or do some kind of lightweight
01:03:01.700 | or parameter-efficient fine-tuning,
01:03:03.420 | like we mentioned earlier.
01:03:05.100 | So this is our pre-training a decoder.
01:03:07.940 | And I can just pre-train it on language modeling.
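A small sketch of that language-modeling objective for a decoder, assuming a `decoder` that returns one causally masked hidden state per position and a `to_vocab` projection; the names and shapes are placeholders:

```python
import torch.nn.functional as F

def language_modeling_loss(decoder, to_vocab, token_ids):
    """Teacher forcing: the hidden state at position t predicts the token at position t+1."""
    hidden = decoder(token_ids)          # (batch, seq_len, hidden_size), causally masked
    logits = to_vocab(hidden)            # (batch, seq_len, vocab_size)
    pred = logits[:, :-1].reshape(-1, logits.size(-1))  # states 0..T-2
    gold = token_ids[:, 1:].reshape(-1)                 # tokens 1..T-1
    return F.cross_entropy(pred, gold)
```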
01:03:13.460 | So again, you might want to do this if you are wanting to generate texts,
01:03:19.820 | generate things.
01:03:22.220 | You sort of can use this like you use an encoder-decoder.
01:03:25.700 | But in practice, as we'll see, a lot of the sort of biggest,
01:03:29.580 | most powerful pre-trained models tend to be decoder-only.
01:03:33.740 | It's not really clear exactly why, except they
01:03:36.780 | seem a little bit simpler than encoder-decoders.
01:03:41.140 | And you get to share all the parameters in one big network for the decoder,
01:03:45.060 | whereas in an encoder-decoder, you have to split them,
01:03:47.820 | sort of some into the encoder, some into the decoder.
01:03:50.620 | So for the rest of this lecture, we'll talk only about decoders.
01:03:55.500 | In modern things, the biggest networks do tend to be decoders.
01:04:00.780 | So we're coming all the way back again to 2018.
01:04:03.740 | And the GPT model from OpenAI was a big success.
01:04:09.420 | It had 117 million parameters.
01:04:13.060 | It had 768 dimensional hidden states.
01:04:16.660 | And it had this vocabulary that was 40,000-ish words that
01:04:23.180 | was defined via a method like what we showed at the beginning of class,
01:04:26.780 | trained on BooksCorpus.
01:04:28.620 | And actually, the name GPT never actually showed up in the original paper.
01:04:32.860 | It's unclear what exactly it's supposed to refer to.
01:04:39.180 | But this model was a precursor to all the things
01:04:43.580 | that you're hearing about nowadays.
01:04:46.100 | If you move forward--
01:04:48.700 | oh, yeah.
01:04:49.200 | So if you-- hmm.
01:04:55.820 | So if we wanted to do something like natural language inference, which
01:04:59.900 | says, take these pairs of sentences-- the man is in the doorway,
01:05:03.780 | the person is near the door--
01:05:05.460 | and say that these mean that one entails the other,
01:05:09.100 | the premise entails the hypothesis, that I can believe the hypothesis
01:05:12.900 | if I believe the premise, I'd just concatenate them together.
01:05:16.780 | So give it maybe a start token, pass in one sentence,
01:05:21.180 | pass in some delimiter token, pass in the other,
01:05:23.920 | and then predict yes, no, entailment, not entailment.
01:05:30.220 | Fine-tuning GPT on this worked really well.
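A tiny sketch of packing the sentence pair into one sequence that way; the special-token strings here are placeholders rather than the exact ones from the GPT paper:

```python
def format_nli_example(premise, hypothesis):
    """Concatenate a start token, the premise, a delimiter, and the hypothesis."""
    return f"[START] {premise} [DELIM] {hypothesis} [EXTRACT]"

text = format_nli_example("The man is in the doorway.", "The person is near the door.")
# A classifier on the hidden state at the final ([EXTRACT]) position
# then predicts entailment vs. not entailment.
print(text)
```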
01:05:33.340 | And then BERT came after GPT.
01:05:35.620 | BERT did a bit better.
01:05:36.740 | It had bidirectional context.
01:05:44.180 | But GPT still did an excellent job.
01:05:44.180 | And then came GPT-2, where they focused more
01:05:47.220 | on the generative abilities of the network.
01:05:49.660 | So we looked at now a much larger network.
01:05:54.640 | We've gone from 117 million to 1.5 billion.
01:05:57.840 | And given some sort of prompt, it could generate, at the time,
01:06:01.800 | a quite surprisingly coherent continuation to the prompt.
01:06:04.680 | So it's telling this sort of story about scientists and unicorns here.
01:06:11.480 | And this size of model is still sort of small enough
01:06:15.280 | that you can use on a small GPU and fine tune and whatever.
01:06:19.620 | And its capabilities of generating long, coherent text
01:06:23.060 | was just sort of exceptional at the time.
01:06:28.140 | It was also trained on more data, although I don't--
01:06:32.020 | something like 9 billion words of text.
01:06:35.580 | And then, so after GPT-2, we come to GPT-3,
01:06:40.340 | sort of walking through these models.
01:06:42.280 | And then we come with a different way of interacting with the models.
01:06:45.620 | So we've interacted with pre-trained models in two ways so far.
01:06:49.060 | We've sort of sampled from the distribution that they define.
01:06:53.180 | We generated text via a machine translation system or whatever.
01:06:57.380 | Or we fine-tuned them on a task that we care about.
01:06:59.620 | And then we take their predictions.
01:07:03.620 | But GPT-3 seems to have an interesting new ability.
01:07:10.180 | It's much larger.
01:07:11.580 | And it can do some tasks without any sort of fine-tuning whatsoever.
01:07:17.820 | GPT-3 is much larger than GPT-2.
01:07:20.060 | So we went from GPT, 100-ish million parameters,
01:07:23.500 | GPT-2, 1.5 billion, GPT-3, 175 billion, much larger,
01:07:28.820 | trained on 300 billion words of text.
01:07:32.100 | And this sort of notion of in-context learning,
01:07:34.500 | that it could define or figure out patterns in the training
01:07:37.740 | or in the example that it's currently seeing
01:07:40.140 | and continue the pattern, is called in-context learning.
01:07:44.700 | So you've got the word "thanks."
01:07:46.440 | And I pass in this little arrow and say, OK, thanks goes to merci.
01:07:50.180 | And then hello goes to bonjour.
01:07:51.580 | And then I give it all of these examples
01:07:53.700 | and ask it what otter should go to.
01:07:57.300 | And it's learned to sort of continue the pattern
01:08:01.020 | and say that this is the translation of otter.
01:08:04.660 | So now, remember, this is a single sort of input that I've given to my model.
01:08:09.860 | And I haven't said, oh, do translation or fine-tune it on translation
01:08:13.460 | or whatever.
01:08:14.380 | I've just passed in the input, given it some examples.
01:08:16.980 | And then it is able to, to some extent, do this seemingly complex task.
01:08:22.260 | That's in-context learning.
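The whole "task specification" is just one input string along these lines (the word pairs are made up to mirror the slide, and "loutre" is simply the expected French continuation):

```python
prompt = (
    "thanks => merci\n"
    "hello => bonjour\n"
    "mint => menthe\n"
    "otter =>"
)
# Asked to continue this single sequence, a large enough decoder-only model
# will often complete the pattern with the translation ("loutre"),
# even though it was never fine-tuned for translation.
print(prompt)
```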
01:08:25.620 | And here are more examples.
01:08:27.140 | Maybe you give it examples of addition.
01:08:29.820 | And then it can do some simple addition afterward.
01:08:33.900 | You give it-- in this case, this is sort of rewriting typos.
01:08:36.900 | It can figure out how to rewrite typos in context,
01:08:39.460 | or do in-context learning for machine translation.
01:08:41.820 | And this was the start of this idea that there
01:08:43.860 | were these emergent properties that showed up in much larger models.
01:08:47.940 | And it wasn't clear, when looking at the smaller models,
01:08:51.020 | that you'd get this sort of new, this qualitatively new behavior out of them.
01:08:57.780 | Like, it's not obvious from just the language modeling signal, right?
01:09:01.140 | GPT-3 is just trained on that decoder only, just predict the next word,
01:09:06.420 | that it would, as a result of that training,
01:09:09.740 | learn to perform seemingly quite complex things
01:09:12.620 | as a function of its context.
01:09:13.700 | Yeah, OK.
01:09:17.900 | One or two questions about that.
01:09:19.540 | This should be quite surprising, I think, right?
01:09:29.060 | So far, we've talked about good representations,
01:09:31.900 | contextual representations, meanings of words in context.
01:09:35.060 | This is some very, very high-level pattern matching, right?
01:09:37.500 | It's coming up with patterns in just the input data.
01:09:40.660 | And that one sequence of text that you've passed it so far,
01:09:43.660 | and it's able to sort of identify how to complete the pattern.
01:09:48.180 | And you think, what kinds of things can this solve?
01:09:50.780 | What are its capabilities?
01:09:52.380 | What are its limitations?
01:09:54.220 | This ends up being an open area of research.
01:09:56.100 | Sort of, what are the kinds of problems that you maybe
01:09:58.700 | saw in the training data a lot?
01:10:00.020 | Maybe GPT-3 saw a ton of pairs of words, right?
01:10:03.780 | It saw a bunch of dictionaries, bilingual dictionaries
01:10:06.860 | in its training data.
01:10:07.740 | So it learned to do something like this.
01:10:09.660 | Or is it doing something much more general,
01:10:11.420 | where it's really learning the task in context?
01:10:14.660 | The actual story, we're not totally sure.
01:10:17.460 | Something in the middle.
01:10:18.740 | It seems like it has to be tied to your training data in ways
01:10:22.540 | that we don't quite understand.
01:10:24.140 | But there's also a non-trivial ability
01:10:26.180 | to learn new sort of, at least, types of patterns
01:10:30.140 | just from the context.
01:10:31.580 | So this is a very interesting thing to work on.
01:10:34.740 | Now, we've talked a lot about the size of these models so far.
01:10:37.660 | And as models have gotten larger,
01:10:39.700 | they've always gotten better.
01:10:40.900 | We train them on more data.
01:10:43.220 | So GPT-3 was trained on 300 billion words of text.
01:10:46.940 | And it was 175 billion parameters.
01:10:50.900 | And at that scale, it costs a lot of money
01:10:55.100 | to build these things.
01:10:56.140 | And it's very unclear whether you're getting the best
01:10:58.260 | use out of your money.
01:10:59.220 | Is bigger really what you should
01:11:00.640 | have been doing in terms of the number of parameters?
01:11:03.740 | So the cost of training one of these
01:11:06.180 | is roughly you take the number of parameters,
01:11:08.140 | you multiply it by the number of tokens
01:11:09.740 | that you're going to train it on, the number of words.
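As back-of-the-envelope arithmetic with that rule of thumb (the "budget" here is just the parameter-token product, not actual FLOPs or dollars):

```python
# Rough rule of thumb: training cost scales with (parameters) x (training tokens).
gpt3_params = 175e9
gpt3_tokens = 300e9
budget = gpt3_params * gpt3_tokens           # ~5.25e22 parameter-tokens

# For the same budget, a model less than half the size could see far more text:
smaller_params = 70e9
affordable_tokens = budget / smaller_params  # ~750e9 tokens, 2.5x more data
print(f"{budget:.2e} {affordable_tokens:.2e}")
```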
01:11:12.780 | And some folks at DeepMind--
01:11:14.820 | oh, I forgot the citation on this.
01:11:16.240 | Some folks at DeepMind realized through some experimentation
01:11:20.980 | that actually GPT-3 was just comically oversized.
01:11:25.300 | So Chinchilla, the model they trained,
01:11:27.660 | is less than half the size and works better.
01:11:30.720 | But they just trained it on way more data.
01:11:34.640 | And this is an interesting trade-off about how do you
01:11:38.020 | best spend your compute?
01:11:39.120 | I mean, you can't do this more than a handful of times,
01:11:41.420 | even if you're Google.
01:11:44.100 | So open questions there as well.
01:11:48.280 | Another way of interacting with these networks
01:11:51.320 | that has come out recently is called chain of thought.
01:11:56.120 | So the prefix, we saw in the in-context learning slide
01:12:00.200 | that the prefix can help specify what task you're
01:12:02.600 | trying to solve right now.
01:12:04.360 | And it can do even more.
01:12:06.000 | So here's standard prompting.
01:12:07.680 | We have a prefix of examples of questions and answers.
01:12:11.440 | So you have a question and then an example answer.
01:12:14.800 | So that's your prompt that's specifying the task.
01:12:17.360 | And then you have a new question.
01:12:18.800 | And you're having the model generate an answer.
01:12:20.760 | And it generates it wrong.
01:12:23.160 | And chain of thought prompting says, well,
01:12:26.560 | how about in the example, in the demonstration we give,
01:12:29.280 | we give the question.
01:12:30.600 | And then we give this sort of decomposition of steps
01:12:34.080 | towards how to get an answer.
01:12:36.180 | So I'm actually writing this out as part of the input.
01:12:38.380 | I'm giving annotations as a human to say,
01:12:41.480 | oh, to solve this sort of word problem,
01:12:44.360 | here's how you could think it through-ish.
01:12:47.280 | And then I give it a new question.
01:12:49.480 | And the model says, oh, I know what I'm supposed to do.
01:12:51.760 | I'm supposed to first generate a sequence of steps,
01:12:55.920 | of intermediate steps.
01:12:57.640 | And then next, say the answer is--
01:13:00.160 | and then say what the answer is.
01:13:01.840 | And it turns out-- and this should, again,
01:13:04.040 | be very surprising--
01:13:06.440 | that the model can tend to generate plausible sequences
01:13:09.960 | of steps and then much more frequently
01:13:12.440 | generates the correct answer after doing so,
01:13:14.880 | relative to trying to generate the answer by itself.
01:13:18.160 | So you can think of this as a scratch pad.
01:13:20.600 | You can think of this as increasing
01:13:23.080 | the amount of computation that you're
01:13:24.660 | putting into trying to solve the problem,
01:13:27.000 | sort of writing out your thoughts.
01:13:28.420 | Right?
01:13:28.920 | As I generate each word of this continuation here,
01:13:33.040 | I'm able to condition on all the past words so far.
01:13:36.200 | And so maybe it just allows the network
01:13:40.280 | to sort of decompose the problem into smaller, simpler
01:13:43.160 | problems, which it's more able to solve each.
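The only thing that changes is the demonstration inside the prompt; a toy version in the style of the paper's word-problem examples (written from memory, not the exact prompt):

```python
standard_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: The answer is 11.\n"
    "Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. "
    "How many apples do they have?\n"
    "A:"
)

chain_of_thought_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n"
    "Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. "
    "How many apples do they have?\n"
    "A:"
)
# With the second prompt, the model tends to write out the intermediate steps
# ("23 - 20 = 3, 3 + 6 = 9") before the final answer, and gets it right far more often.
```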
01:13:47.800 | No one's really sure why this works exactly either.
01:13:51.240 | At this point, with networks that are this large,
01:13:54.120 | their emergent properties are both very powerful
01:13:57.720 | and exceptionally hard to understand,
01:13:59.600 | and very hard, you should think, to trust.
01:14:03.440 | Because it's unclear what its capabilities are
01:14:05.560 | and what its limitations are, where it will fail.
01:14:09.200 | So what do we think pre-training is teaching?
01:14:11.720 | Gosh, a wide range of things, even
01:14:14.520 | beyond what I've written in this slide, which
01:14:17.360 | I mostly wrote two years ago.
01:14:19.600 | So it can teach you trivia, and syntax, and coreference,
01:14:22.480 | and maybe some lexical semantics, and sentiment,
01:14:24.920 | and some reasoning, like way more reasoning
01:14:27.360 | than we would have thought even three years ago.
01:14:30.280 | And yet, they also learn and exacerbate
01:14:33.400 | racism and sexism, all manner of biases.
01:14:37.480 | There'll be more on this later.
01:14:38.920 | But the generality of this is really,
01:14:42.800 | I think, what's taken many people aback.
01:14:45.040 | And so increasingly, these objects
01:14:47.440 | are not just studied for the sake of using them,
01:14:51.040 | but studied for the sake of understanding anything
01:14:53.760 | about how they work and how they fail.
01:14:55.440 | Yeah, any questions?
01:14:59.440 | Has anyone tried benchmarking GPT for programming tasks,
01:15:11.240 | like how accurately it does, et cetera?
01:15:13.920 | The question is, has anyone tried benchmarking
01:15:16.320 | GPT for programming tasks?
01:15:18.920 | Anyone seen how well it does?
01:15:21.600 | Yes, so there's definitely examples
01:15:23.120 | of people using GPT-3 for simple programming things.
01:15:28.400 | And then the modern, state-of-the-art,
01:15:30.920 | competitive programming bots are all based on ideas
01:15:34.760 | from language modeling.
01:15:36.600 | And I think they're all also based on pre-trained language
01:15:40.160 | models themselves.
01:15:41.160 | If you just take all of these ideas
01:15:43.360 | and apply them to GitHub, then
01:15:46.960 | some very interesting emergent behaviors
01:15:48.720 | relating to code fall out.
01:15:50.920 | And so yeah, I think all of the best systems use this,
01:15:55.280 | more or less.
01:15:56.160 | So lots of benchmarking there, for sure.
01:15:58.840 | Is that the basis for what GitHub Copilot's trying to do?
01:16:02.680 | The question is, is this the basis?
01:16:04.120 | Is what we just mentioned the basis for the GitHub Copilot
01:16:07.080 | system?
01:16:07.580 | Yes, absolutely.
01:16:10.320 | We don't know exactly what it is in terms of details,
01:16:13.680 | but it's all these ideas.
01:16:16.080 | What if you have a situation where you have still
01:16:18.640 | a large amount of data for general data,
01:16:21.000 | and then you have also a large amount of data
01:16:23.280 | for your fine-tuning task?
01:16:24.880 | At what point is it better to train a new model
01:16:27.760 | for that fine-tuning versus get data from both?
01:16:30.760 | Yeah, the question is, what if you
01:16:32.160 | have a large amount of data for pre-training
01:16:33.760 | and a large amount of data for fine-tuning?
01:16:35.560 | When is it better to do a separate training on just
01:16:39.280 | the fine-tuning data?
01:16:41.880 | Almost never.
01:16:43.240 | If you have a bunch of data for the task that you care about,
01:16:48.400 | what's frequently done instead is three-part training,
01:16:51.840 | where you pre-train on a very broad corpus.
01:16:54.720 | Then you continue to pre-train using something
01:16:57.320 | like language modeling on an unlabeled version of the labeled
01:17:02.200 | data that you have.
01:17:03.280 | You just strip the labels off and just treat it all as text
01:17:05.660 | and do language modeling on that,
01:17:07.560 | adapt the parameters a little bit,
01:17:09.320 | and then do the final stage of fine-tuning with the labels
01:17:12.360 | that you want, and that works even better.
01:17:14.240 | There's an interesting paper called Don't Stop Pre-Training.
01:17:19.280 | Nice.
01:17:20.280 | Final question.
01:17:21.840 | We asked a lot of questions.
01:17:23.920 | Anyone?
01:17:25.420 | Someone new with a question?
01:17:27.220 | Yeah, I was wondering, do you know
01:17:32.800 | if there's a lot of instances where a pre-trained model can
01:17:36.840 | do some task that's not seen before even without fine-tuning?
01:17:41.160 | Yeah, so are there any instances of where a pre-trained model
01:17:43.700 | can do a task that it hasn't seen before without fine-tuning?
01:17:47.240 | The question is, what does hasn't seen before mean?
01:17:50.960 | These models, especially GPT-3 and similar very large models,
01:17:55.280 | during pre-training, did it ever see something exactly
01:17:58.320 | like this sort of word problem arithmetic?
01:18:01.600 | Maybe, maybe not.
01:18:03.080 | It's actually sort of unclear.
01:18:05.080 | It's clearly able to recombine bits and pieces of tasks
01:18:08.920 | that it saw implicitly during pre-training.
01:18:11.360 | We saw the same thing with trivia.
01:18:13.040 | Language modeling looks a lot like trivia sometimes,
01:18:15.520 | where you just read the first paragraph of a Wikipedia page,
01:18:19.080 | and it's kind of like answering a bunch of little trivia
01:18:21.360 | questions about where someone was born and when.
01:18:24.400 | But it's never seen something quite like this.
01:18:26.480 | And it's actually still kind of astounding
01:18:28.280 | how much it's able to do things that don't seem like they
01:18:30.640 | should have shown up all that directly in the pre-training
01:18:33.040 | data.
01:18:33.920 | Quantifying that extent is an open research problem.
01:18:37.480 | OK, that's it.
01:18:38.080 | Let's call it.
01:18:40.360 | Exactly.
01:18:41.920 | [BLANK_AUDIO]