
Stanford CS224N NLP with Deep Learning | 2023 | Lecture 9 - Pretraining


Whisper Transcript | Transcript Only Page

00:00:00.000 | [AUDIO OUT]
00:00:05.480 | Hello.
00:00:06.240 | Welcome to CS224N.
00:00:09.520 | Today we'll be talking about pre-training,
00:00:12.440 | which is another exciting topic on the road to modern natural language
00:00:18.120 | processing.
00:00:21.760 | How is everyone doing?
00:00:23.520 | Thumbs up, thumbs to the side, thumbs down.
00:00:28.240 | No response bias there.
00:00:29.880 | All thumbs up.
00:00:31.200 | Oh, side.
00:00:31.760 | Nice.
00:00:32.000 | I like that honesty.
00:00:32.840 | That's good.
00:00:33.360 | Well, OK.
00:00:35.800 | So we're now-- what is this, week five?
00:00:39.360 | Yes, it's week five.
00:00:40.640 | And we have a couple--
00:00:42.880 | so this lecture, the Transformers lecture, and then to a lesser extent,
00:00:47.800 | Thursday's lecture on natural language generation
00:00:51.680 | will be sort of the sum of lectures for the assignments you have to do.
00:00:56.280 | So assignment five is coming out on Thursday.
00:01:01.480 | And the topics covered in this lecture, and self-attention and transformers,
00:01:06.640 | and again, a little bit of natural language generation
00:01:09.000 | will be tested in assignment five.
00:01:10.440 | And then the rest of the course will go through some really fascinating topics
00:01:14.600 | in sort of modern natural language processing
00:01:17.640 | that should be useful for your final projects, and future jobs,
00:01:20.920 | and interviews, and intellectual curiosity.
00:01:25.240 | But I think that today's lecture is significantly less technical in detail
00:01:31.800 | than last Thursday's on self-attention and transformers,
00:01:35.600 | but should give you an idea of the sort of world of pre-training
00:01:41.160 | and sort of how it helps define natural language processing today.
00:01:46.600 | So a reminder about assignment five, your project proposals
00:01:49.240 | also are due on Tuesday, next Tuesday.
00:01:53.760 | Please do get those in.
00:01:55.160 | Try to get them in on time so that we can give you prompt feedback
00:01:58.860 | about your project proposals.
00:02:01.360 | And yeah, so let's jump into it.
00:02:03.440 | OK, so what we're going to start with today
00:02:09.240 | is a bit of a technical detail on word structure
00:02:16.160 | and sort of how we model the input sequence of words that we get.
00:02:19.800 | So when we were teaching Word2Vec and sort of all the methods
00:02:26.360 | that we've talked about so far, we assumed a finite vocabulary.
00:02:29.840 | So you had a vocabulary v that you define via whatever.
00:02:32.560 | You've looked at some data.
00:02:33.680 | You've decided what the words are in that data.
00:02:36.640 | And so you have some words like hat and learn.
00:02:42.720 | And you have this embedding.
00:02:44.680 | It's in red because you've learned it properly.
00:02:46.920 | Actually, let's replace hat and learn with pizza and tasty.
00:02:49.360 | Those are better.
00:02:51.760 | And so that's all well and good.
00:02:53.760 | You see these words in your model.
00:02:56.440 | And you have an embedding that's been learned on your data
00:03:00.800 | to sort of know what to do when you see those words.
00:03:04.120 | But when you see some sort of variations,
00:03:06.020 | maybe you see a variation like taaaaasty, and maybe a typo like laern,
00:03:11.640 | or maybe novel items where it's like a word that you as a human
00:03:15.760 | can understand as sort of this combination.
00:03:18.160 | This is called derivational morphology: a word like transformerify, built from this word
00:03:22.240 | transformer that you know plus -ify, a suffix which means take this noun
00:03:26.800 | and give me back a verb.
00:03:29.240 | That means to make more like that noun.
00:03:31.160 | To transformerify NLP might mean to make NLP more
00:03:36.200 | like using transformers and such.
00:03:39.000 | And for each of these, this maybe didn't show up
00:03:41.200 | in your training corpus.
00:03:42.400 | And language is always doing this.
00:03:45.760 | People are always coming up with new words.
00:03:47.640 | And there's new domains.
00:03:48.960 | And young people are always making new words.
00:03:52.360 | It's great.
00:03:52.840 | And so it's a problem for your model,
00:03:54.640 | though, because you've defined this finite vocabulary.
00:03:57.440 | And there's sort of no mapping in that vocabulary
00:04:00.880 | for each of these things.
00:04:02.400 | Even though their meanings should be relatively well
00:04:05.280 | defined based on the data you've seen so far,
00:04:08.120 | it's just that the sort of string of characters that define them
00:04:11.560 | aren't quite what you've seen.
00:04:13.760 | And so what do you do?
00:04:14.640 | Well, maybe you map them to this sort of universal unknown token.
00:04:18.440 | This is UNK.
00:04:20.000 | So it's like, oh, I see something.
00:04:21.000 | I don't know what.
00:04:21.960 | I've never seen it before.
00:04:23.240 | I'm going to say it's always represented by the same token UNK.
00:04:26.840 | And so that's been done in the past.
00:04:29.120 | And that's sort of bad, right, because it's
00:04:30.960 | totally losing tons of information.
00:04:34.760 | But you need to map it to something.
00:04:38.640 | And so this is like a clear problem, especially--
00:04:42.480 | I mean, in English, it's a problem.
00:04:44.120 | In many of the world's languages, it's a substantially larger problem.
00:04:49.000 | So English has relatively simple word structure.
00:04:53.360 | There's a couple of conjugations for each verb, like eat, eats, eaten, ate.
00:05:00.360 | But in a language with much more complex morphology or word structure,
00:05:06.960 | you'll have a considerably more complex sort of set of things
00:05:11.040 | that you could see in the world.
00:05:12.360 | So here is a conjugation table for a Swahili verb.
00:05:17.560 | And it has over 300 conjugations.
00:05:20.840 | And if I define the vocabulary to be every unique string of characters
00:05:24.800 | maps to its own word, then every one of the 300 conjugations
00:05:28.400 | would get an independent vector under my model, which makes no sense,
00:05:33.280 | because the 300 conjugations obviously have a lot in common
00:05:37.200 | and differ by sort of meaningful extent.
00:05:39.680 | So you don't want to do this.
00:05:41.240 | You'd have to have a huge vocabulary if I wanted all conjugations to show up.
00:05:46.400 | And that's a mistake for efficiency reasons and for learning reasons.
00:05:51.200 | Any questions so far?
00:05:52.080 | Cool.
00:05:57.160 | And so what we end up doing is we'll look at subword structure,
00:06:05.440 | subword modeling.
00:06:06.640 | So what we're going to do is we're going to say,
00:06:08.680 | if I can try to define what the set of all words is,
00:06:12.640 | I'm going to define my vocabulary to include parts of words.
00:06:17.640 | So I'm going to split words into sequences of known subwords.
00:06:30.280 | And so there's a simple sort of algorithm for this,
00:06:33.200 | where you start with all characters.
00:06:35.480 | So if I only had a vocabulary of all characters,
00:06:38.240 | and maybe like an end of word symbol for a finite data set,
00:06:44.320 | then no matter what word I saw in the future,
00:06:46.480 | as long as I had seen all possible characters,
00:06:48.560 | I could take the word and say, I don't know what this word is.
00:06:51.100 | I'm going to split it into all of its individual characters.
00:06:53.960 | So you won't have this unk problem.
00:06:55.440 | You can sort of represent any word.
00:06:57.360 | And then you're going to find common adjacent characters and say, OK,
00:07:01.240 | A and B co-occur next to each other quite a bit.
00:07:03.920 | So I'm going to add a new word to my vocabulary.
00:07:07.120 | Now it's all characters plus this new word A, B, which is a subword.
00:07:13.440 | And likewise, so now I'm going to replace the character pair
00:07:16.040 | with the new subword and repeat until you add a lot, a lot, a lot of vocabulary
00:07:20.720 | items through this process of what things tend to co-occur next to each other.
00:07:24.520 | And so what you'll end up with is a vocabulary
00:07:28.480 | of very commonly co-occurring sort of substrings
00:07:31.600 | by which you can build up words.
00:07:33.540 | And this was originally developed for machine translation,
00:07:36.000 | but then it's been used considerably in pretty much all modern language models.
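
A minimal sketch of this vocabulary-learning loop (the toy corpus, the end-of-word marker, and the number of merges are illustrative assumptions, not the exact settings used in practice):

```python
from collections import Counter

def learn_bpe_vocab(corpus, num_merges):
    # Start from characters: each word is a tuple of symbols plus an end-of-word marker.
    words = Counter()
    for word in corpus.split():
        words[tuple(word) + ("</w>",)] += 1

    vocab = {ch for w in words for ch in w}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs, weighted by word frequency.
        pairs = Counter()
        for w, freq in words.items():
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        vocab.add("".join(best))           # the merged pair becomes a new subword
        # Rewrite every word with that pair merged into a single symbol, then repeat.
        new_words = Counter()
        for w, freq in words.items():
            merged, i = [], 0
            while i < len(w):
                if i < len(w) - 1 and (w[i], w[i + 1]) == best:
                    merged.append(w[i] + w[i + 1])
                    i += 2
                else:
                    merged.append(w[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return vocab, merges

vocab, merges = learn_bpe_vocab("tasty pizza tasty pizza hat learn learn", num_merges=10)
```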
00:07:41.200 | So now we have a hat and learn, hat and learn.
00:07:44.120 | So in our subword vocabulary, hat and learn
00:07:46.840 | showed up enough that they're their own individual words.
00:07:49.920 | So that's sort of good, right?
00:07:51.800 | So simple common words show up as a word in your vocabulary
00:07:56.560 | just like you'd like them to.
00:07:57.760 | But now taaaaasty maybe gets split into T-A-A.
00:08:01.200 | And then maybe in some cases, this hash hash
00:08:04.040 | means like don't add a space next, right?
00:08:07.160 | So T-A-A and then A-A-A and then S-T-Y, right?
00:08:12.160 | So I've actually taken one sort of thing that seems like a word,
00:08:15.200 | and in my vocabulary, it's now split into three subword tokens.
00:08:20.120 | So when I pass this to my transformer or to my recurrent neural network,
00:08:24.960 | the recurrent neural network would take T-A-A as just a single element,
00:08:29.960 | do the RNN update, and then take A-A-A, do the RNN update, and then S-T-Y.
00:08:35.200 | So it could learn to process constructions like this.
00:08:39.720 | And maybe I can even add more A-A-As in the middle,
00:08:41.920 | and have it do something similar.
00:08:44.080 | Instead of just seeing the entire word taaaaasty and not knowing what it means.
00:08:51.960 | Is that?
00:08:53.240 | That's feedback, yeah.
00:08:58.920 | How loud is that feedback?
00:09:01.320 | We good?
00:09:02.920 | OK, I think we're fixed.
00:09:04.200 | Great.
00:09:06.480 | And so same with transformerify.
00:09:08.040 | Maybe transformer is its own word.
00:09:10.080 | And then if, and then y--
00:09:11.240 | and so you can see that you have sort of three learned embeddings instead
00:09:14.760 | of one sort of useless UNK embedding.
00:09:17.760 | This is just wildly useful and is used pretty much everywhere.
00:09:21.280 | Variants of this algorithm are used pretty much everywhere in modern NLP.
00:09:26.480 | Questions?
00:09:28.640 | If we have three embeddings for tasty, do we just add them together?
00:09:32.840 | So the question is, if we have three embeddings for tasty,
00:09:35.220 | do we just add them together?
00:09:38.080 | If we want to represent--
00:09:39.920 | so when we're actually processing the sequence,
00:09:42.520 | I'd see something like I learned about the T-A-A, A-A-A, S-T-Y.
00:09:50.160 | So it'd actually be totally separate tokens.
00:09:52.480 | But if I wanted to then say, what's my representation of this thing?
00:09:57.520 | Depends on what you want to do.
00:09:58.800 | Sometimes you average the contextual representations of the three
00:10:02.960 | or look at the last one maybe.
00:10:06.400 | At that point, it's unclear what to do.
00:10:08.000 | But everything sort of works OK.
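
A sketch of those two pooling options for getting one vector back out of a word's subword pieces (the random vectors and the dimensionality here are stand-ins for real encoder outputs):

```python
import torch

# Contextual vectors for the three subword pieces of one word, e.g. taa, ##aaa, ##sty.
piece_vectors = torch.randn(3, 768)        # (num_pieces, hidden_size); sizes assumed
word_by_mean = piece_vectors.mean(dim=0)   # option 1: average the pieces
word_by_last = piece_vectors[-1]           # option 2: just take the last piece
```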
00:10:10.920 | How do you know where to split?
00:10:12.800 | How do you what?
00:10:13.520 | How do you know where to split?
00:10:15.200 | Yeah.
00:10:15.720 | So you know where to split based on the algorithm
00:10:18.720 | that I specified earlier for learning the vocabulary.
00:10:23.280 | So you learn this vocabulary by just combining
00:10:25.800 | commonly co-occurring adjacent strings of letters.
00:10:29.080 | So like A, B co-occurred a lot.
00:10:30.920 | So now I've got a new word that's A, B.
00:10:34.000 | And then when I'm actually walking through and tokenizing,
00:10:36.520 | I try to split as little as possible.
00:10:38.560 | So I split words into the maximal sort of subword
00:10:41.600 | that takes up the most characters.
00:10:43.060 | There are algorithms for this.
00:10:45.120 | Yeah, so I'm like, OK, if I want to split this up,
00:10:49.040 | there's many ways I could split it up.
00:10:50.580 | And you try to approximately find what the best way to split it
00:10:54.080 | into the fewest words is.
00:10:55.120 | Yeah.
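
A sketch of that greedy longest-match splitting, assuming a learned vocabulary where "##" marks a piece that continues the previous one:

```python
def tokenize(word, vocab):
    # Greedily take the longest subword in the vocabulary at each position;
    # fall back to a single character, so nothing ever becomes UNK.
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j] if i == 0 else "##" + word[i:j]
            if piece in vocab or j == i + 1:
                pieces.append(piece)
                i = j
                break
    return pieces

vocab = {"hat", "learn", "taa", "##aaa", "##sty", "transformer", "##ify"}
print(tokenize("taaaaasty", vocab))   # ['taa', '##aaa', '##sty']
```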
00:10:56.120 | Does it seem to make sense to use punctuation in the character set?
00:11:00.520 | So the question is, do people use punctuation in the character set?
00:11:04.600 | Do people do it?
00:11:05.240 | Yes, absolutely.
00:11:06.360 | So sort of from this point on, just assume
00:11:12.760 | that what text is given to these models is as unprocessed as possible.
00:11:17.680 | You try to make it sort of clean looking text, where you've removed HTML tags,
00:11:22.680 | maybe if it's from the internet or whatever.
00:11:26.240 | But then beyond that, you process it as little as possible
00:11:29.120 | so that it reflects as well as possible what people might actually
00:11:32.600 | be using this for.
00:11:35.080 | So maybe earlier in the course, when we were looking at Word2Vec,
00:11:38.320 | we might have thought about,
00:11:40.280 | oh, we don't want Word2Vec vectors of punctuation or something like that.
00:11:45.520 | Now everything is just as close as possible
00:11:48.240 | to what the text you'd get with people trying to use your system would be.
00:11:52.120 | So yes, in practice, punctuation and dot, dot, dot
00:11:55.600 | might be its own word, and maybe a sequence of hyphens,
00:12:00.320 | because people make big bars across tables.
00:12:03.200 | Yeah.
00:12:03.700 | How does it impact it if one word is now
00:12:11.800 | multiple embeddings versus a single embedding?
00:12:16.680 | Does the system treat those any differently?
00:12:21.760 | The question is, does the system treat any differently words
00:12:24.280 | that are really themselves a whole word versus words that are pieces?
00:12:28.440 | No, the system has no idea.
00:12:29.680 | They're all just indices into your embedding vocabulary matrix.
00:12:36.320 | So they're all treated equally.
00:12:37.960 | What about really long words that are relatively common?
00:12:44.640 | Because if you're building up from single character all the way up,
00:12:47.880 | what happens then?
00:12:49.440 | The question is, what happens to very long words
00:12:51.920 | if you're building up from character pairs and portions of characters?
00:12:57.400 | In practice, the statistics speak really well for themselves.
00:13:01.080 | So if a long word is very common, it will end up in the vocabulary.
00:13:04.720 | And if it's not very common, it won't.
00:13:07.920 | There are algorithms that aren't this that do slightly better in various ways.
00:13:13.000 | But the intuition that you figure out what the common co-occurring
00:13:17.520 | substrings are, independent of length almost,
00:13:20.480 | is the right intuition to have.
00:13:22.040 | And so you can actually just look at the learned vocabularies
00:13:25.080 | of a lot of these models.
00:13:26.600 | And you see some long words just because they showed up a lot.
00:13:32.240 | I'm curious, how does it weigh the frequency?
00:13:41.280 | So let's say there's ify.
00:13:43.680 | In your next slide, it was like ify at the very last one.
00:13:48.080 | So if could be really common.
00:13:50.120 | So how does it weigh the frequency of a subword versus the length of it?
00:13:54.320 | It tries to split it up into the smallest number.
00:13:56.920 | But what if it split it up into three, but one of them was super common?
00:14:00.960 | Yeah, so the question is, if transformer is a subword in my vocabulary,
00:14:05.920 | and if is a subword, and y is a subword, and ify as a three-letter tuple
00:14:12.840 | is also a subword, how does it choose to take the--
00:14:15.800 | ify, maybe it's not very common, as opposed
00:14:19.920 | to splitting it into more subwords.
00:14:23.000 | It's just a choice.
00:14:23.840 | We choose to try to take the smallest number of subwords,
00:14:26.480 | because that tends to be more of the bottleneck, as opposed
00:14:29.720 | to having a bunch of very common, very short subwords.
00:14:34.800 | Sequence length is a big problem in transformers.
00:14:36.960 | And this seems to be what works.
00:14:39.360 | Although trying to split things into multiple options of a sequence
00:14:42.560 | and running the transformer on all of them
00:14:44.600 | is the thing that people have done to see which one will work better.
00:14:47.760 | But yeah, having fewer bigger subwords tends to be the best sort of idea.
00:14:51.640 | I'm going to start moving on, though.
00:14:53.320 | Feel free to ask me more questions about this afterward.
00:14:56.720 | OK, so let's talk about pre-training from the context of the course so far.
00:15:03.120 | So at the very beginning of the course, we gave you this quote, which was,
00:15:07.480 | "You shall know a word by the company it keeps."
00:15:09.640 | This was the sort of thesis of the distributional hypothesis,
00:15:13.640 | that the meaning of the word is defined by, or at least reflected by,
00:15:17.960 | what words it tends to co-occur around.
00:15:19.800 | And we implemented this via Word2Vec.
00:15:23.960 | The same person who made that quote had a separate quote, actually earlier,
00:15:29.720 | that continues this notion of meaning as defined by context, which
00:15:34.800 | has something along the lines of, well, since the word shows up
00:15:38.920 | in context when we actually use it, when we speak to each other,
00:15:42.560 | the meaning of the word should be defined in the context
00:15:45.760 | that it actually shows up in.
00:15:47.480 | And so the complete meaning of a word is always contextual,
00:15:51.360 | and no study of meaning apart from a complete context
00:15:54.280 | can be taken seriously.
00:15:55.920 | So the big difference here is, at Word2Vec training time,
00:16:01.240 | if I have the word record, R-E-C-O-R-D, when I'm training Word2Vec,
00:16:07.920 | I get one vector or two, but one vector meaning record, the string.
00:16:16.160 | And it has to learn by what context it shows up in,
00:16:19.960 | that sometimes it can mean I record, i.e. the verb, or record, i.e.
00:16:26.720 | the noun.
00:16:28.040 | But I only have one vector to represent it.
00:16:30.480 | And so when I use the Word2Vec embedding of record,
00:16:33.320 | it sort of has this mixture meaning of both of its sort of senses, right?
00:16:38.960 | It doesn't get to specialize and say, oh, this part means record,
00:16:43.040 | and this part means record.
00:16:45.040 | And so Word2Vec is going to just sort of fail.
00:16:48.320 | And so I can build better representations of language
00:16:51.360 | through these contextual representations that
00:16:53.640 | are going to take things like recurrent neural networks or transformers
00:16:56.640 | that we used before to build up sort of contextual meaning.
00:16:59.640 | [AUDIO OUT]
00:17:03.320 | So what we had before were pre-trained word embeddings.
00:17:07.600 | And then we had sort of a big box on top of it,
00:17:10.960 | like a transformer or an LSTM, that was not pre-trained, right?
00:17:15.160 | So you learn via context your word embeddings here.
00:17:19.320 | And then you have a task, like sentiment analysis or machine translation
00:17:23.400 | or parsing or whatever.
00:17:25.760 | And you initialize all the parameters of this randomly.
00:17:29.180 | And then you train to predict your label.
00:17:33.120 | And the big difference in today's work is
00:17:37.040 | that we're going to try to pre-train all the parameters.
00:17:39.600 | So I have my big transformer.
00:17:41.180 | And instead of just pre-training my word embeddings with Word2Vec,
00:17:45.500 | I'm going to train all of the parameters of the network,
00:17:50.800 | trying to teach it much more about language
00:17:54.560 | that I could use in my downstream tasks.
00:17:57.600 | So now the labeled data that I have for, say, machine translation
00:18:03.600 | might need to be smaller.
00:18:05.720 | I might not need as much of it, because I've already
00:18:08.520 | trained much more of the network than I otherwise
00:18:10.760 | would have if I had just gotten Word2Vec embeddings.
00:18:13.480 | So here, I've pre-trained this entire structure--
00:18:20.360 | the word embeddings, the transformer on top.
00:18:23.640 | Everything's been trained via methods that we'll talk about today.
00:18:27.040 | And so what does this give you?
00:18:28.680 | I mean, it gives you very strong representations of language.
00:18:31.520 | So the meaning of record and record will be different
00:18:36.120 | in the sort of contextual representations that
00:18:38.920 | know where in the sequence it is and what words are co-occurring with it
00:18:42.920 | in this specific input than Word2Vec, which only has one representation
00:18:46.800 | for record independent of where it shows up.
00:18:50.080 | It'll also be used as strong parameter initializations for NLP models.
00:18:55.040 | So in all of your homework so far, you've
00:18:56.920 | worked with building out a natural language processing
00:19:00.440 | system sort of from scratch.
00:19:02.040 | How do I initialize this weight matrix?
00:19:03.680 | And we always say, oh, small, normally distributed noise,
00:19:08.080 | like little values close to 0.
00:19:12.280 | And here, we're going to say, well, just like we
00:19:14.800 | were going to use the Word2Vec embeddings and those sort of encoded
00:19:18.440 | structure, I'm going to start maybe my machine translation
00:19:21.400 | system from a parameter initialization that's
00:19:23.760 | given to me via pre-training.
00:19:27.380 | And then also, it's going to give us probability distributions
00:19:29.880 | over language that we can use to generate and otherwise.
00:19:33.440 | And we'll talk about this.
00:19:35.800 | So whole models are going to be pre-trained.
00:19:38.240 | So all of pre-training is effectively going
00:19:42.020 | to be centered around this idea of reconstructing the input.
00:19:45.600 | So you have an input.
00:19:47.040 | It's a sequence of text that some human has generated.
00:19:49.840 | And the sort of hypothesis is that by masking out part of it
00:19:55.960 | and tasking a neural network with reconstructing the original input,
00:20:00.720 | that neural network has to learn a lot about language, about the world,
00:20:05.320 | in order to do a good job of reconstructing the input.
00:20:07.960 | So this is now a supervised learning problem,
00:20:10.880 | just like machine translation.
00:20:13.520 | Taking this sentence that just existed, Stanford University
00:20:16.120 | is located in, say, Palo Alto, California, or Stanford, California,
00:20:20.560 | I guess.
00:20:23.240 | And I have, by removing this part of the sentence, made a label for myself.
00:20:29.560 | The input is this sort of broken masked sentence.
00:20:33.680 | And the label is Stanford or Palo Alto.
00:20:36.420 | So if I give this example to a network and ask
00:20:41.940 | it to predict the center thing, as it's doing its gradient step
00:20:45.360 | on this input, it's going to encode information
00:20:47.760 | about the co-occurrence between this context, Stanford University is located
00:20:51.600 | in, and Palo Alto.
00:20:53.680 | So by tasking it with this, it might learn, say, where Stanford is.
00:20:58.320 | What else might it learn?
00:20:59.320 | Well, it can learn things about maybe syntax.
00:21:01.560 | So I put blank fork down on the table.
00:21:05.480 | Here, there's only a certain set of words that could go here.
00:21:08.200 | I put the fork down on the table.
00:21:09.960 | I put a fork down on the table.
00:21:11.960 | These are syntactic constraints.
00:21:14.240 | So the context shows me what kinds of words can appear
00:21:18.520 | in what kinds of contexts.
00:21:19.720 | The woman walked across the street checking
00:21:24.320 | for traffic over blank shoulder.
00:21:27.000 | Any ideas on what could go here?
00:21:29.520 | Her, right?
00:21:30.080 | So this sort of co-reference between this entity
00:21:35.320 | who is being discussed in the world, this woman, and her shoulder.
00:21:39.040 | Now, when I discuss--
00:21:40.840 | this is sort of a linguistic concept.
00:21:42.380 | Her here is a co-referent to woman.
00:21:44.840 | It's referring to the same entity in the discourse.
00:21:47.240 | And so the network might be able to learn things about what
00:21:51.400 | entities are doing what where.
00:21:52.800 | It can learn things about semantics.
00:21:58.480 | So if I went to the ocean to see the fish, turtles, seals, and blank,
00:22:02.800 | then the word that's in the blank should be a member of the class
00:22:06.520 | that I'm thinking of as a person writing this sentence of stuff
00:22:09.840 | that I see when I go to the ocean and see these other things as well.
00:22:13.860 | So in order to do this prediction task, maybe
00:22:15.840 | I learn about the semantics of aquatic creatures.
00:22:22.840 | OK, so what else could I learn?
00:22:24.580 | I've got overall, the value I got from the two hours watching it
00:22:27.460 | was the sum total of the popcorn and drink.
00:22:29.760 | The movie was blank.
00:22:31.920 | What kind of task could I be learning from doing
00:22:33.980 | this sort of prediction problem?
00:22:37.680 | Sentiment, exactly.
00:22:38.820 | So this is just a naturalistic sort of text that I naturally wrote myself.
00:22:45.800 | But by saying, oh, the movie was bad, I'm
00:22:48.920 | learning about sort of the latent sentiment of the person who
00:22:53.200 | wrote this, what they were feeling about the movie at the time.
00:22:57.080 | So maybe if I see a new review later on, I can just paste in the review,
00:23:01.240 | say the movie was blank.
00:23:04.400 | And if the model generates bad or good, that
00:23:07.200 | could be implicitly solving the task of sentiment analysis.
00:23:10.640 | So here's another one.
00:23:14.720 | Iroh went to the kitchen to make some tea.
00:23:16.760 | Standing next to Iroh, Zuko pondered his destiny.
00:23:19.760 | Zuko left the blank.
00:23:23.160 | OK, so in this scenario, we've got a world implicitly
00:23:27.120 | that's been designed by the person who is creating this text.
00:23:31.160 | I've got physical locations in the discourse, like the kitchen.
00:23:35.280 | And I've got Zuko.
00:23:37.160 | Iroh's in the kitchen.
00:23:38.480 | Zuko's next to Iroh.
00:23:40.680 | So Zuko must be in the kitchen.
00:23:44.080 | So what could Zuko leave but the kitchen?
00:23:47.120 | And so in terms of latent notions of embodiment and physical location,
00:23:51.640 | the way that people talk about people being next to something
00:23:54.760 | and then leaving something could tell you
00:23:57.800 | stuff about sort of, yeah, a little bit about how the world works even.
00:24:04.920 | So here's a sequence.
00:24:06.360 | I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21, blank.
00:24:12.640 | And this is a pretty tough one, right?
00:24:18.000 | This is the Fibonacci sequence, right?
00:24:19.640 | Whether a model, by looking at a bunch of numbers from the Fibonacci
00:24:23.120 | sequence, could learn to, in general, predict the next one
00:24:27.160 | is a question you should be thinking about throughout the lecture.
00:24:31.920 | OK, any questions on these sort of examples
00:24:34.240 | of what you might learn from predicting the context?
00:24:36.360 | OK, OK, cool.
00:24:44.240 | So a very simple way to think about pre-training
00:24:47.800 | is pre-training is language modeling.
00:24:49.340 | So we saw language modeling earlier in the course.
00:24:51.640 | And now we're just going to say, instead of using my language model just
00:24:55.140 | to provide probabilities over the next word,
00:24:57.560 | I am going to train it on that task.
00:24:59.600 | I'm going to actually model the distribution p theta of the word
00:25:06.440 | t given all the words previous.
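
Written out, the language modeling objective being described is roughly (notation assumed):

```latex
p_\theta(w_t \mid w_1, \ldots, w_{t-1}), \qquad
\max_\theta \; \sum_t \log p_\theta(w_t \mid w_1, \ldots, w_{t-1})
```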
00:25:10.080 | And there's a ton of data for this, right?
00:25:12.240 | There's just an amazing amount of data for this in a lot of languages,
00:25:15.720 | especially English.
00:25:16.800 | There's very little data for this in actually
00:25:18.680 | most of the world's languages, which is a separate problem.
00:25:21.920 | But you can pre-train just through language modeling, right?
00:25:24.340 | So I'm going to sort of do the teacher forcing thing.
00:25:27.160 | So I have Iroh.
00:25:28.080 | I predict goes.
00:25:28.880 | I have goes.
00:25:29.400 | I predict to.
00:25:30.600 | And I'm going to train my sort of LSTM or my transformer to do this task.
00:25:35.760 | And then I'm just going to keep all the weights.
00:25:38.400 | OK, I'm going to save all the network parameters.
00:25:41.000 | And then once I have these parameters, instead
00:25:46.340 | of generating from my language model, I'm
00:25:48.040 | just going to use them as an initialization for my parameters.
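
A minimal sketch of that pre-train-then-save step as a next-word prediction loop with teacher forcing (the tiny model, random token ids, and file name are placeholders; real pre-training uses a transformer over a huge corpus):

```python
import torch
import torch.nn as nn

# Toy corpus, already tokenized to integer ids (vocabulary size 10 is an assumption).
vocab_size, d_model = 10, 32
data = torch.randint(0, vocab_size, (64, 12))   # 64 "sentences" of 12 tokens each

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.LSTM(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(self.embed(x))
        return self.out(h)                       # logits for the next token at each position

model = TinyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):                          # pre-training: predict word t from words < t
    inputs, targets = data[:, :-1], data[:, 1:]  # teacher forcing: shift the sequence by one
    logits = model(inputs)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()

torch.save(model.state_dict(), "pretrained_lm.pt")  # keep all the weights for fine-tuning
```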
00:25:52.280 | So I have this pre-training fine-tuning paradigm.
00:25:55.280 | Two steps.
00:25:56.640 | Most of you, I think, in your--
00:25:58.320 | well, maybe not this year.
00:25:59.680 | Let's say a large portion of you this year in your final projects
00:26:02.440 | will be doing the pre-training fine-tuning sort of paradigm,
00:26:05.160 | where someone has done the pre-training for you, right?
00:26:07.460 | So you have a ton of text.
00:26:09.100 | You learn very general things about the distribution of words
00:26:13.080 | and sort of the latent things that that tells you about the world
00:26:15.940 | and about language.
00:26:17.400 | And then in step two, you've got some task, maybe sentiment analysis.
00:26:22.040 | And you have maybe not very many labels.
00:26:24.560 | You have a little bit of labeled data.
00:26:26.720 | And you adapt the pre-trained model to the task
00:26:29.840 | that you care about by further doing gradient steps on this task.
00:26:34.040 | So you give it the movie was.
00:26:35.680 | You predict happy or sad.
00:26:37.960 | And then you sort of continue to update the parameters
00:26:42.080 | based on the initialization from the pre-training.
00:26:46.240 | And this just works exceptionally well--
00:26:48.440 | I mean, unbelievably well-- compared to training from scratch.
00:26:51.800 | Intuitively, because you've taken a lot of the burden of learning
00:26:54.760 | about language, learning about the world, off of the data
00:26:58.280 | that you've labeled for sentiment analysis.
00:27:00.400 | And you're sort of giving that task of learning
00:27:02.560 | all this sort of very general stuff to the much more general task of language
00:27:06.400 | modeling.
00:27:07.460 | You said we didn't have much data in other languages.
00:27:10.880 | What do you mean by data?
00:27:11.920 | Is it just text in that language?
00:27:13.960 | Yeah.
00:27:14.460 | Or is it labeled in some way?
00:27:16.600 | The question is, you said we have a lot of data in English,
00:27:19.720 | but not in other languages.
00:27:22.320 | What do you mean by data that we don't have a lot of in other languages?
00:27:25.280 | Is it just text?
00:27:25.980 | It's literally just text.
00:27:28.320 | No annotations.
00:27:29.960 | Because you don't need annotations to do language model pre-training, right?
00:27:33.240 | The existence of that sequence of words that someone has written
00:27:37.280 | provides you with all these pairs of input and output.
00:27:41.040 | Input Iroh, output goes.
00:27:42.680 | Input Iroh goes, output to.
00:27:44.800 | Those are all labels sort of that you've constructed from the input just
00:27:48.840 | existing.
00:27:49.520 | But in most languages, even on the entire internet,
00:27:52.840 | I mean, there's about 7,000-ish languages on Earth.
00:27:55.960 | And most of them don't have the sort of billions of words
00:28:01.200 | you might want to train these systems on.
00:28:04.760 | Yeah?
00:28:06.680 | If you're pre-training the entire thing,
00:28:08.320 | are you still learning one vector representation per word?
00:28:11.480 | The question is, if you're pre-training the entire thing,
00:28:13.800 | do you still learn one vector representation per word?
00:28:16.120 | You learn one vector representation that
00:28:17.940 | is the non-contextual input vector.
00:28:21.280 | So you have your vocabulary matrix.
00:28:23.000 | You've got your embedding matrix that is vocabulary size
00:28:26.240 | by model dimensionality.
00:28:28.920 | And so yeah, Iroh has one vector.
00:28:30.720 | Goes has one vector.
00:28:32.680 | But then the transformer that you're learning on top of it
00:28:35.520 | takes in the sequence so far and sort of gives a vector to each of them
00:28:39.440 | that's dependent on the context in that case.
00:28:41.760 | But still, at the input, you only have one embedding per word.
00:28:46.000 | Yeah?
00:28:46.500 | So what sort of metrics would you use to evaluate a pre-trained model?
00:28:51.740 | It's supposed to be general.
00:28:53.900 | But there's application-specific metrics.
00:28:55.660 | So which one do you use?
00:28:56.860 | Yeah.
00:28:57.340 | So the question is, what metric do you
00:28:58.700 | use to evaluate pre-trained models since it's
00:29:00.620 | supposed to be so general?
00:29:02.740 | But there are lots of very specific evaluations you could use.
00:29:07.300 | We'll get into a lot of that in the rest of the lecture.
00:29:09.940 | While you're training it, you can use simple metrics
00:29:12.220 | that sort of correlate with what you want
00:29:13.900 | but aren't actually what you want, just like the probability quality.
00:29:18.340 | So you can evaluate the perplexity of your language model
00:29:21.180 | just like you would have when you cared about language modeling.
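
For reference, perplexity here is the usual exponentiated average negative log-likelihood:

```latex
\mathrm{PPL} = \exp\!\Big(-\tfrac{1}{T} \sum_{t=1}^{T} \log p_\theta(w_t \mid w_{<t})\Big)
```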
00:29:23.760 | And it turns out to be the case that better perplexity correlates
00:29:27.460 | with all the stuff that's much harder to evaluate,
00:29:30.080 | like lots and lots of different tasks.
00:29:32.420 | But also, the natural language processing community
00:29:34.460 | has built very large sort of benchmark suites of varying tasks
00:29:39.520 | to try to get at sort of a notion of generality,
00:29:41.780 | although that's very, very difficult.
00:29:43.460 | It's sort of ill-defined, even.
00:29:45.540 | And so when you develop new pre-training methods, what you often do
00:29:48.820 | is you try to pick a whole bunch of evaluations
00:29:51.260 | and show that you do better on all of them.
00:29:53.700 | And that's your argument for generality.
00:29:55.660 | So why should this sort of pre-training, fine-tuning, two-part paradigm help?
00:30:06.740 | This is still an open area of research, but the intuitions
00:30:10.380 | are all you're going to take from this course.
00:30:12.500 | So pre-training provides some sort of starting parameters, theta hat.
00:30:17.500 | So this is like all the parameters in your network,
00:30:20.140 | from trying to do this minimum over all possible settings of your parameters
00:30:24.300 | of the pre-training loss.
00:30:26.900 | And then the fine-tuning process takes your data for fine-tuning.
00:30:31.220 | You've got some labels.
00:30:32.580 | And it tries to approximate the minimum through gradient descent
00:30:36.380 | of the loss of the fine-tuning task of theta.
00:30:39.140 | But you start at theta hat.
00:30:41.340 | So you start gradient descent at theta hat,
00:30:43.820 | which your pre-training process gave you.
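
In symbols, the two-stage recipe being described is roughly (the loss names are just notation):

```latex
\hat{\theta} \approx \arg\min_{\theta} \, \mathcal{L}_{\text{pretrain}}(\theta),
\qquad \text{then approximate} \quad
\min_{\theta} \, \mathcal{L}_{\text{finetune}}(\theta)
\quad \text{by gradient descent started at } \theta = \hat{\theta}.
```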
00:30:46.420 | And then if you could actually solve this min and wanted to,
00:30:51.900 | it sort of feels like the starting point shouldn't matter.
00:30:55.700 | But it really, really, really does.
00:30:58.140 | It really does.
00:31:00.940 | So we'll talk a bit more about this later.
00:31:03.900 | But the process of gradient descent, maybe it
00:31:07.660 | sticks relatively close to the theta hat during fine-tuning.
00:31:11.700 | So you start at theta hat.
00:31:14.620 | And then you sort of walk downhill with gradient descent
00:31:17.500 | until you hit sort of a valley.
00:31:19.380 | And that valley ends up being really good
00:31:21.620 | because it's close to the pre-training parameters, which were
00:31:24.260 | really good for a lot of things.
00:31:26.060 | This is a cool place where sort of practice and theory
00:31:29.300 | are sort of like meeting, where optimization people want
00:31:31.940 | to understand why this is so useful.
00:31:34.700 | NLP people sort of just want to build better systems.
00:31:39.060 | So yeah, maybe the stuff around theta hat
00:31:43.140 | tends to generalize well.
00:31:44.460 | If you want to work on this kind of thing,
00:31:46.220 | you should talk about it.
00:31:47.220 | Yeah?
00:31:48.220 | So if stochastic gradient descent
00:31:50.220 | sticks relatively close, but what
00:31:51.980 | if we were to use a different optimizer?
00:31:53.740 | How would that change our results?
00:31:56.180 | The question is, if stochastic gradient descent
00:31:59.180 | sticks relatively close, what if we use a different optimizer?
00:32:01.780 | I mean, if we use sort of any common variant of gradient
00:32:05.020 | descent, like any first order method,
00:32:07.100 | like Adam, which we use in this course, or AdaGrad,
00:32:10.420 | they all have these very, very similar properties.
00:32:14.860 | Other types of optimization we just tend to not use.
00:32:17.660 | So who knows?
00:32:19.460 | Yeah?
00:32:19.960 | Yeah, I'm still a little unclear on why
00:32:21.700 | the pre-training plus fine tuning works better than just
00:32:25.060 | fine tuning, but making the model more powerful,
00:32:27.180 | like adding more layers, more data, et cetera.
00:32:29.580 | Yeah.
00:32:30.300 | The question is, why does the pre-trained fine tune paradigm
00:32:33.540 | work better than just making the model more powerful,
00:32:36.580 | adding more layers, adding more data to just the fine tuning?
00:32:39.180 | The simple answer is that you have orders of magnitude
00:32:45.860 | more data that's unlabeled.
00:32:48.500 | That's just text that you found.
00:32:51.860 | Than you do carefully labeled data for the tasks
00:32:54.460 | that you care about, right?
00:32:55.540 | Because that's expensive to get.
00:32:57.140 | It has to be examples of your movie reviews
00:32:59.660 | or whatever that you've had someone label carefully.
00:33:03.220 | So you have something like on the internet at least 5
00:33:09.460 | trillion, maybe 10 trillion words of this,
00:33:13.020 | and you have maybe a million words of your labeled data
00:33:16.620 | or whatever over here.
00:33:17.860 | So it's just the scale is way off.
00:33:21.460 | But there's also an intuition that learning
00:33:24.260 | to do a very, very simple thing like sentiment analysis
00:33:28.180 | is not going to get you a very generally able agent
00:33:34.940 | in a wide range of settings compared to language modeling.
00:33:38.940 | So it's hard to get--
00:33:40.900 | how do I put it?
00:33:42.180 | Even if you have a lot of labeled data of movie reviews
00:33:45.020 | of the kind that people are writing today, maybe tomorrow
00:33:49.260 | they start writing slightly different kinds of movie
00:33:51.380 | reviews, and your system doesn't perform as well.
00:33:53.660 | Whereas if you pre-trained on a really diverse set of text
00:33:56.700 | from a wide range of sources and people,
00:33:58.900 | it might be more adaptable to seeing stuff that doesn't quite
00:34:03.260 | look like the training data you showed it,
00:34:05.060 | even if you showed it a ton of training data.
00:34:07.580 | So one of the big takeaways of pre-training
00:34:10.420 | is that you get this huge amount of variety of text
00:34:14.980 | on the internet.
00:34:15.660 | And you have to be very careful.
00:34:17.100 | I mean, yeah, you should be very careful about what kind of text
00:34:20.220 | you're showing it and what kind of text you're not,
00:34:22.460 | because the internet is full of awful text as well.
00:34:27.780 | But some of that generality just comes
00:34:29.660 | from how hard this problem is and how much data
00:34:31.940 | you can show it.
00:34:33.940 | [INAUDIBLE]
00:34:34.420 | --pre-trained model was trained on so much data.
00:34:37.780 | How do you then train it so that it considers the stuff
00:34:42.140 | that you're fine-tuning it with as more important, more
00:34:44.660 | salient to the task it's trying to do,
00:34:46.660 | rather than just one in a billion articles of data?
00:34:50.580 | Yeah, it's a good question.
00:34:51.900 | So the question is, given that the amount of data
00:34:54.380 | on the pre-training side is orders of magnitude
00:34:56.340 | more than the amount of data on the fine-tuning side,
00:34:58.540 | how do you get across to the model that, OK, actually,
00:35:01.220 | the fine-tuning task is what I care about.
00:35:03.140 | So focus on that.
00:35:04.940 | It's about the fact that I did this first,
00:35:07.220 | the pre-training first.
00:35:08.540 | And then I do the fine-tuning second.
00:35:11.900 | So I've gotten my parameter initialization from this.
00:35:14.780 | I've set it somewhere.
00:35:16.100 | And then I fine-tune.
00:35:17.620 | I move to where the parameters are doing well
00:35:20.100 | for this task afterward.
00:35:22.220 | And so, well, it might just forget a lot
00:35:25.060 | about how to do this, because now I'm just asking
00:35:27.540 | it to do this at this point.
00:35:30.820 | I should move on, I think.
00:35:32.940 | But we're going to keep talking about this in much more detail
00:35:36.060 | with more concrete elements.
00:35:38.180 | So OK, so let's talk about model pre-training.
00:35:44.980 | Oh, wait.
00:35:47.140 | That did not advance the slides.
00:35:49.100 | Nice, OK.
00:35:55.140 | Let's talk about model pre-training three ways.
00:35:58.020 | In our Transformers lecture Tuesday,
00:36:01.660 | we talked about encoders, encoder decoders, and decoders.
00:36:04.980 | And we'll do decoders last, because actually,
00:36:08.580 | many of the largest models that are being used today
00:36:12.140 | are all decoders.
00:36:14.180 | And so we'll have a bit more to say about them.
00:36:17.260 | So let's recall these three.
00:36:19.340 | So encoders get bidirectional context.
00:36:21.540 | You have a single sequence, and you're
00:36:23.700 | able to see the whole thing, kind of like an encoder
00:36:25.940 | in machine translation.
00:36:28.100 | Encoder decoders have one portion of the network
00:36:32.340 | that gets bidirectional context.
00:36:34.140 | So that's like the source sentence of my machine
00:36:36.620 | translation system.
00:36:37.900 | And then they're sort of paired with a decoder that
00:36:40.540 | gets unidirectional context, so that I
00:36:42.420 | have this sort of informational masking where
00:36:45.420 | I can't see the future, so that I can do things
00:36:47.460 | like language modeling.
00:36:48.500 | I can generate the next token of my translation, whatever.
00:36:51.260 | So you could think of it as I've got my source sentence here,
00:36:54.820 | and my partial translation here, and I'm sort of decoding
00:36:57.260 | out the translation.
00:36:59.180 | And then decoders only are things like language models.
00:37:02.060 | We've seen a lot of this so far.
00:37:03.540 | And there's pre-training for all three sort
00:37:05.580 | of large classes of models.
00:37:09.100 | And how you pre-train them and then how you use them
00:37:11.380 | depends on the properties and the proactivities
00:37:14.260 | of the specific architecture.
00:37:15.740 | So let's look at encoders first.
00:37:18.740 | So we've looked at language modeling quite a bit.
00:37:21.460 | But we can't do language modeling with an encoder,
00:37:24.100 | because they get bidirectional context.
00:37:26.620 | So if I'm down here at i, and I want to present--
00:37:31.100 | I want to predict the next word, it's
00:37:33.460 | a trivial task at this level here to predict the next word.
00:37:38.020 | Because in the middle, I was able to look at the next word.
00:37:41.900 | And so I should just know.
00:37:43.060 | There's nothing hard about learning to predict the next word here,
00:37:45.560 | because I could just look at it, see what it is, and then copy it over.
00:37:49.380 | So when I'm training an encoder in something for pre-training,
00:37:54.720 | I have to be a little bit more clever.
00:37:57.380 | In practice, what I do is something like this.
00:37:59.900 | I take the input, and I modify it somewhat.
00:38:02.100 | I mask out words, sort of like I did in the examples
00:38:04.620 | I gave at the beginning of class.
00:38:06.020 | So I blank to the blank.
00:38:09.260 | And then I have the network predict with its whole--
00:38:12.980 | I have it build contextual representations.
00:38:15.340 | So now this vector representation of the blank
00:38:18.060 | sees the entire context around it here.
00:38:22.340 | And then I predict the word "went," and then here, the word "store."
00:38:29.340 | Any questions?
00:38:34.460 | And you can see how this is doing something quite a bit like language
00:38:37.940 | modeling, but with bidirectional context.
00:38:41.180 | I've removed the network's information about the words that go in the blanks,
00:38:45.340 | and I'm training it to reconstruct that.
00:38:47.740 | So I only have loss terms, right?
00:38:49.620 | I only ask it to actually do the prediction, compute the loss,
00:38:52.780 | backpropagate the gradients for the words that I've masked out.
00:38:56.580 | And you can think of this as instead of learning probability of x,
00:39:00.580 | where x is like a sentence or a document,
00:39:03.140 | this is learning the probability of x, the real document,
00:39:06.300 | given x tilde, which is this sort of corrupted document,
00:39:11.420 | with some of the information missing.
00:39:14.940 | And so we get the sequence of vectors here,
00:39:17.780 | one per word, which is the output of my encoder in blue.
00:39:21.380 | And then I'd say that for the words that I want to predict, yi, I draw them.
00:39:25.700 | This is the sim means the probability is proportional to my embedding matrix
00:39:32.940 | times my representation of it.
00:39:36.500 | So it's just a linear transformation of that last thing here.
00:39:38.980 | So this a plus b is this red portion here.
00:39:41.860 | And I do the prediction, and I train the entire network to do this.
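
As a formula, the masked-prediction setup just described looks roughly like this, where x tilde is the masked input and the loss is summed only over the masked positions i:

```latex
h_1, \ldots, h_T = \mathrm{Encoder}(\tilde{x}), \qquad
p_\theta(y_i \mid \tilde{x}) = \mathrm{softmax}(A h_i + b)_{y_i}
```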
00:39:47.020 | So the words that we mask out, do we just select them randomly,
00:39:51.900 | or is there some scheme to it?
00:39:54.260 | The question is, do we just choose words randomly to mask out,
00:39:57.100 | or is there a scheme?
00:39:58.380 | Mostly randomly.
00:39:59.380 | We'll talk about a slightly smarter scheme in a couple of slides,
00:40:02.140 | but yeah, just mostly randomly.
00:40:05.500 | Yeah?
00:40:07.020 | What was that last part on the bottom, x, the masked version of--
00:40:11.460 | like, if it's the first or the very last sentence?
00:40:16.580 | Yeah, so I'm saying that I'm defining x tilde to be this input part, where
00:40:23.100 | I've got the masked version of the sentence with these words missing.
00:40:26.820 | And then I'm defining a probability distribution
00:40:29.060 | that's the probability of a sequence conditioned
00:40:32.340 | on the input being the corrupted sequence, the masked sequence.
00:40:35.940 | So this brings us to a very, very popular NLP model
00:40:47.300 | that you need to know about.
00:40:48.460 | It's called BERT.
00:40:49.940 | And it was the first one to popularize this masked language modeling
00:40:53.500 | objective.
00:40:55.300 | And they released the weights of this pre-trained transformer
00:40:58.420 | that they pre-trained via something that looks a lot like masked language
00:41:01.380 | modeling.
00:41:01.900 | And so these you can download.
00:41:03.780 | You can use them via code that's released by the company HuggingFace
00:41:07.660 | that we have continued to bring up.
00:41:10.300 | Many of you will use a model like BERT in your final project
00:41:13.700 | because it's such a useful builder of representations
00:41:16.500 | of language in context.
00:41:18.340 | So let's talk a little bit about the details
00:41:20.140 | of masked language modeling in BERT.
00:41:23.260 | First, we take 15% of the subword tokens.
00:41:27.020 | So remember, all of our inputs now are subword tokens.
00:41:30.460 | I've made them all look like words.
00:41:32.500 | But just like we saw at the very beginning of class,
00:41:34.620 | each of these tokens could just be some portion, some subword.
00:41:38.940 | And I'm going to do a couple of things with it.
00:41:40.900 | Sometimes I am going to just mask out the word
00:41:45.860 | and then predict the true word.
00:41:48.220 | Sometimes I'm going to replace the word with some random sample
00:41:53.260 | of another word from my vocabulary and predict
00:41:56.700 | the real word that was supposed to go there.
00:41:58.780 | And sometimes I'm going to not change the word at all
00:42:02.780 | and still predict it.
00:42:04.300 | The intuition of this is the following.
00:42:07.340 | If I just had to build good representations
00:42:11.820 | in the middle of this network for words that are masked out,
00:42:15.940 | then when I actually use the model at test time
00:42:19.220 | on some real review to do sentiment analysis on,
00:42:22.820 | well, there are never going to be any tokens like this.
00:42:25.340 | So maybe the model won't do a very good job
00:42:27.300 | because it's like, oh, I have no job to do here
00:42:29.780 | because I only need to deal with the mask tokens.
00:42:33.540 | By giving it sequences of words where sometimes it's
00:42:36.660 | the real word that needs to be predicted,
00:42:38.420 | sometimes you have to detect if the word is wrong.
00:42:41.300 | The idea is that now when I give it
00:42:43.100 | a sentence that doesn't have any masks,
00:42:46.660 | it actually does a good job of representing
00:42:48.660 | all the words in context because it has this chance
00:42:51.660 | that it could be asked to predict anything at any time.
00:42:54.120 | OK, so the folks at Google who were defining this
00:43:03.980 | had a separate additional task that is sort of interesting
00:43:09.100 | to think about.
00:43:10.780 | So this was their BERT model from their paper.
00:43:13.340 | They had their position embeddings
00:43:14.760 | just like we saw from our transformers lecture,
00:43:18.180 | token embeddings just like we saw from the transformers
00:43:20.500 | lecture.
00:43:21.620 | But then also they had this thing called a segment embedding
00:43:23.980 | where they had two possible segments, segment A
00:43:26.380 | and segment B. And they had this additional task
00:43:31.820 | where they would get a big chunk of text for segment A
00:43:34.780 | and a big chunk of text for segment B.
00:43:37.220 | And then they would ask the model,
00:43:38.780 | is segment B a real continuation of segment A?
00:43:43.140 | Was it the text that actually came next?
00:43:45.780 | Or did I just pick this big segment randomly
00:43:48.100 | from somewhere else?
00:43:49.660 | And the idea was that this should teach the network
00:43:52.180 | some notion of long distance coherence
00:43:55.460 | about the connection between a bunch of text over here
00:43:58.420 | and a bunch of text over there.
00:44:00.180 | Turns out it's not really necessary,
00:44:01.740 | but it's an interesting idea.
00:44:04.940 | And similar things have continued
00:44:06.880 | to have some sort of influence since then.
00:44:09.980 | But again, you should get this intuition
00:44:12.060 | that we're trying to come up with hard problems
00:44:14.100 | for the network to solve such that by solving them,
00:44:16.780 | it has to learn a lot about language.
00:44:19.460 | And we're defining those problems
00:44:21.580 | by making simple transformations or removing information
00:44:25.060 | from text that just happened to occur.
00:44:26.860 | Questions?
00:44:32.580 | Yeah.
00:44:33.080 | The plus signs, do we concatenate the vectors,
00:44:35.500 | or do we do an element-wise addition?
00:44:38.420 | The question is, for these plus signs,
00:44:40.020 | do we concatenate the vectors or do element-wise addition?
00:44:43.140 | We do element-wise addition.
00:44:45.940 | You could have concatenated them.
00:44:48.180 | However, one of the big conventions
00:44:50.660 | of all of these networks is that you always
00:44:52.420 | have exactly the same number of dimensions
00:44:54.980 | everywhere at every layer of the network.
00:44:56.660 | It just makes everything very simple.
00:44:58.420 | So just saying everything's the same dimension
00:45:00.300 | and then doing addition just ends up being simpler.
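
A sketch of that input representation, with the three embeddings summed element-wise (the sizes follow BERT-base, and the token ids are illustrative):

```python
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 30000, 512, 768   # BERT-base-like sizes (assumed)
tok_emb = nn.Embedding(vocab_size, d_model)
pos_emb = nn.Embedding(max_len, d_model)
seg_emb = nn.Embedding(2, d_model)               # segment A = 0, segment B = 1

def bert_input(token_ids, segment_ids):
    positions = torch.arange(token_ids.size(1)).unsqueeze(0)
    # Same dimensionality everywhere, so the three embeddings are simply added.
    return tok_emb(token_ids) + pos_emb(positions) + seg_emb(segment_ids)

ids = torch.tensor([[101, 7592, 2088, 102]])     # [CLS] hello world [SEP], ids illustrative
segs = torch.zeros_like(ids)                     # everything in segment A
x = bert_input(ids, segs)                        # shape: (1, 4, 768)
```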
00:45:03.980 | So why was the next sentence prediction not necessary?
00:45:09.220 | What's the main question for that?
00:45:11.060 | Yeah, why was the next sentence prediction not necessary?
00:45:14.420 | One thing that it does that's a negative
00:45:16.460 | is that now the effective context length for a lot
00:45:24.300 | of your examples is halved.
00:45:26.580 | So one of the things that's useful about pre-training
00:45:28.820 | seemingly is that you get to build representations
00:45:30.980 | of very long sequences of text.
00:45:33.220 | This is very short, but in practice,
00:45:35.460 | segment A was going to be something like 250 words,
00:45:39.540 | and segment B was going to be 250 words.
00:45:42.060 | And in the paper that let us know that this wasn't necessary,
00:45:45.460 | they always had a long segment of 500 words.
00:45:48.940 | And it seemed to be useful to always have
00:45:50.740 | this very long context because longer contexts help give you
00:45:55.380 | more information about the role that each word is playing
00:45:58.060 | in that specific context.
00:45:59.820 | If I see one word, it's hard to know.
00:46:02.100 | If I just see record, it's hard to know
00:46:03.980 | what it's supposed to mean.
00:46:05.220 | But if I see 1,000 words around it,
00:46:06.900 | it's much clearer what its role in that context is.
00:46:09.540 | So yeah, cutting the effective context size is one answer.
00:46:13.600 | Another thing is that this is actually much more difficult.
00:46:19.860 | This is a much more recent paper that I
00:46:21.980 | don't have in the slides.
00:46:23.260 | But it's been shown since then that these models are really,
00:46:25.760 | really bad at the next sentence prediction task.
00:46:28.860 | So it could be that maybe it just
00:46:31.140 | was too hard at the time.
00:46:34.860 | And so it just wasn't useful because the model
00:46:37.060 | was failing to do it at all.
00:46:39.740 | So I can give the link for that paper later.
00:46:43.100 | Can you explain again why we need to do a next sentence
00:46:45.940 | prediction?
00:46:46.500 | What about just masking and predicting the next?
00:46:49.020 | I missed that jump.
00:46:50.140 | So it's the next sentence.
00:46:52.020 | Yeah.
00:46:52.540 | So the question is, why do we need
00:46:53.620 | to do next sentence prediction?
00:46:54.700 | Why not just do the masking we saw before?
00:46:57.020 | That's the thing.
00:46:57.380 | You seem to not need to do next sentence prediction.
00:46:59.660 | But as history of the research, it
00:47:03.020 | was thought that this was useful.
00:47:05.380 | And the idea was that it required
00:47:07.420 | you to develop this pairwise, do these two segments of text
00:47:12.060 | interact?
00:47:12.560 | How do they interact?
00:47:13.500 | Are they related?
00:47:14.260 | The sort of longer distance notion.
00:47:16.300 | And many NLP tasks are defined on pairs of things.
00:47:19.860 | And they thought that might be useful.
00:47:22.180 | And so they published it with this.
00:47:24.020 | And then someone else came through,
00:47:25.500 | published a new model that didn't do that.
00:47:27.260 | And it sort of did better.
00:47:29.500 | So this is just-- yeah.
00:47:31.820 | So yeah.
00:47:33.060 | There are intuitions as to why it could work.
00:47:34.860 | It just didn't.
00:47:36.260 | So BERT wasn't doing masking or was doing--
00:47:38.700 | It was doing both.
00:47:39.420 | It was doing both.
00:47:40.260 | It was doing both this next sentence--
00:47:42.100 | so BERT was doing both this next sentence prediction training
00:47:46.540 | as well as this masking training all at the same time.
00:47:52.220 | And so you had to have a separate predictor head
00:47:55.380 | on top of BERT, a separate predictor sort
00:47:57.340 | of classification thing.
00:47:59.580 | And so one detail there is that there's
00:48:02.300 | this special word at the beginning of BERT
00:48:04.460 | in every sequence that's CLS.
00:48:07.140 | And you can define a predictor on top
00:48:10.140 | of that sort of fake word embedding that
00:48:12.420 | was going to say, is the next sentence real or fake or not?
00:48:16.140 | Yeah.
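As a rough sketch of how those two training signals can sit on top of one shared encoder (the hidden size, vocabulary size, and module names here are illustrative placeholders, not BERT's actual implementation):

```python
import torch.nn as nn

class BertStylePretrainingHeads(nn.Module):
    """Two predictor heads sharing the same encoder's hidden states."""
    def __init__(self, hidden_size=768, vocab_size=30000):
        super().__init__()
        self.mlm_head = nn.Linear(hidden_size, vocab_size)  # predict each masked token
        self.nsp_head = nn.Linear(hidden_size, 2)            # real next sentence vs. random

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden_size); position 0 is the special [CLS] token
        mlm_logits = self.mlm_head(hidden_states)        # (batch, seq_len, vocab_size)
        nsp_logits = self.nsp_head(hidden_states[:, 0])  # (batch, 2), read off [CLS]
        return mlm_logits, nsp_logits
```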
00:48:17.740 | OK, I'm going to move on.
00:48:20.620 | And so this gets at sort of the question
00:48:22.500 | that we had earlier about how do you evaluate these things.
00:48:25.540 | There's a lot of different NLP tasks out there.
00:48:27.780 | Gosh.
00:48:28.380 | And when people were defining these papers,
00:48:32.140 | they would look at a ton of different evaluations
00:48:34.580 | that had been sort of compiled as a set of things that
00:48:36.880 | are a little hard for today's systems.
00:48:38.860 | So are you detecting paraphrases between questions?
00:48:41.900 | Are two Quora questions actually the same question?
00:48:44.260 | That turns out to be hard.
00:48:47.500 | Can you do sentiment analysis on this hard data set?
00:48:51.540 | Can you tell if sentences are linguistically acceptable?
00:48:54.460 | Are they grammatical or not?
00:48:56.620 | Are two sequences similar semantically?
00:48:59.020 | Do they mean sort of vaguely the similar thing?
00:49:01.900 | And we'll talk a bit about natural language inference
00:49:04.100 | later, but that's the task of defining sort of if I say,
00:49:08.240 | you know, I saw the dog, that does not necessarily
00:49:11.400 | mean I saw the little dog.
00:49:14.440 | But saying I saw the little dog does mean I saw the dog.
00:49:18.000 | So that's sort of this natural language inference task.
00:49:20.560 | And the difference between the sort of pre-pre-training days,
00:49:26.920 | where you had this row here
00:49:29.320 | before you had substantial amounts of pre-training,
00:49:33.600 | and BERT-- the field was taken aback in a way that's
00:49:37.540 | hard to describe.
00:49:39.420 | You know, very carefully crafted architectures
00:49:41.960 | for each individual task, where everyone
00:49:44.040 | was designing their own neural network
00:49:45.660 | and doing things that they thought were sort of clever as
00:49:48.000 | to how to define all the connections and the weights
00:49:50.300 | and whatever to do their tasks independently.
00:49:52.400 | So everyone was doing a different thing for each one
00:49:54.600 | of these tasks, roughly.
00:49:57.360 | All of that was blown out of the water
00:49:59.360 | by just build a big transformer and just teach it
00:50:02.120 | to predict the missing words a whole bunch
00:50:04.160 | and then fine tune it on each of these tasks.
00:50:07.000 | So this was just a sea change in the field.
00:50:09.680 | People were, I mean, amazed.
00:50:11.920 | It's a little bit less flashy than ChatGPT, I'll admit.
00:50:14.760 | But it's really part of the story that gets us to it,
00:50:17.160 | you know?
00:50:18.920 | OK, questions?
00:50:20.800 | So like to get stuff out of the--
00:50:28.200 | during the encoder pre-training stage,
00:50:31.680 | encoder usually outputs some sort of hidden values.
00:50:36.680 | How do we correlate those to words
00:50:39.200 | that we are trying to test against?
00:50:41.720 | So the question is, the encoder output
00:50:44.760 | is a bunch of hidden values.
00:50:48.320 | How do we actually correlate those values to stuff
00:50:51.320 | that we want to predict?
00:50:52.640 | I'm going to go on to the next slide here to bring up
00:50:54.980 | this example here, right?
00:50:56.120 | So the encoder gives us, for each input word token,
00:51:00.200 | a vector of that token that represents
00:51:02.640 | the token in context.
00:51:04.360 | And the question is, how do we get these representations
00:51:07.520 | and turn them into sort of answers for the tasks
00:51:11.560 | that we care about?
00:51:13.080 | And the answer comes back to something like this.
00:51:30.080 | Something like this, maybe?
00:51:32.480 | Sure.
00:51:39.360 | So when we were doing the pre-training,
00:51:41.040 | we had the transformer that was giving us our representations.
00:51:43.840 | And we had this little last layer here,
00:51:46.040 | this little sort of affine transformation
00:51:49.840 | that moved us from the encoder's hidden state size
00:51:52.480 | to the vocabulary to do our prediction.
00:51:55.000 | And we just removed this last prediction layer here.
00:51:58.280 | And let's say we want to do something that is classifying
00:52:03.320 | the sentiment of the sentence.
00:52:04.600 | We just pick arbitrarily maybe the last word in the sentence.
00:52:08.320 | And we stick a linear classifier on top
00:52:11.480 | and map it to positive or negative,
00:52:13.320 | and then fine tune the whole thing.
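Here is a minimal sketch of that recipe in PyTorch; the `pretrained_encoder`, the hidden size, and the choice of which position to classify from are illustrative assumptions rather than the exact setup from the slide:

```python
import torch
import torch.nn as nn

class EncoderForSentiment(nn.Module):
    """Drop the vocabulary prediction layer; add a small classifier and fine-tune everything."""
    def __init__(self, pretrained_encoder, hidden_size=768, num_classes=2):
        super().__init__()
        self.encoder = pretrained_encoder                      # weights come from pretraining
        self.classifier = nn.Linear(hidden_size, num_classes)  # new, randomly initialized

    def forward(self, token_ids):
        hidden = self.encoder(token_ids)   # (batch, seq_len, hidden_size), one vector per token
        pooled = hidden[:, -1]             # pick some position, e.g. the last token
        return self.classifier(pooled)     # (batch, num_classes): positive vs. negative

# Fine-tuning then updates the classifier and the pretrained encoder together, e.g.:
# optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
```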
00:52:17.160 | So yeah, the BERT model had two different models.
00:52:21.160 | One was 110 million parameters.
00:52:22.920 | One was 340 million.
00:52:24.840 | Keep that sort of in the back of your head sort of percolating
00:52:27.480 | as we talk about models with many, many more parameters
00:52:30.160 | later on.
00:52:31.160 | It was trained on 800 million words of BooksCorpus
00:52:38.320 | plus about 2,500 million words of English Wikipedia,
00:52:40.000 | so on the order of a few billion
00:52:44.520 | words of text, quite a bit still.
00:52:48.220 | And it was trained on what was considered at the time
00:52:50.640 | to be a whole lot of compute.
00:52:53.120 | Just it was Google doing this.
00:52:54.840 | And they released it.
00:52:55.720 | And we were like, oh, who has that kind of compute?
00:52:57.400 | But Google-- although nowadays, it's
00:52:59.280 | not considered to be very much.
00:53:01.600 | But fine tuning is practical and common on a single GPU.
00:53:04.720 | So you could take the BERT model that they've spent a lot of time
00:53:07.560 | training and fine tune it yourself on your task
00:53:10.640 | on even sort of a very small GPU.
00:53:17.820 | So one question is like, well, this seems really great.
00:53:24.520 | Why don't we just use this for everything?
00:53:27.080 | Yeah.
00:53:31.400 | And the answer is, well, what is the sort
00:53:33.960 | of pre-training objective?
00:53:35.040 | What's the structure of the pre-trained model good for?
00:53:38.520 | BERT is really good for sort of filling in the blanks.
00:53:41.920 | But it's much less naturally used
00:53:44.320 | for actually generating text.
00:53:46.800 | So I wouldn't want to use BERT to generate
00:53:48.960 | a summary of something because it's not really built for it.
00:53:53.080 | It doesn't have a natural notion of predicting the next word given
00:53:56.520 | all the words that came before it.
00:53:58.280 | So maybe I want to use BERT if I want a good representation of, say,
00:54:01.840 | a document to classify it, give it a set of topic labels,
00:54:05.440 | or say it's toxic or non-toxic or whatever.
00:54:07.960 | But I wouldn't want to use it to generate a whole sequence.
00:54:13.460 | Some extensions of BERT.
00:54:15.040 | So we had a question earlier of whether you just
00:54:17.400 | mask things out randomly.
00:54:18.920 | One thing that seems to work better is you mask out
00:54:23.480 | sort of whole contiguous spans, because if you only mask a single subword,
00:54:30.480 | the problem is much easier than it would otherwise be-- here, the masked piece
00:54:35.720 | is part of "irresistibly,"
00:54:37.120 | and you can tell very easily based on the subwords that came before it.
00:54:41.160 | Whereas if I mask a much longer span, it's a trade-off,
00:54:45.500 | but it might be a harder problem.
00:54:47.840 | And it ends up being better to do this sort of span-based masking
00:54:51.600 | than random masking.
00:54:52.600 | And that might be because subwords make very simple prediction problems when
00:54:56.680 | you mask out just one subword of a word versus all the subwords of a word.
00:55:02.660 | So this ends up doing much better.
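A toy sketch of that span-based masking, just to make the idea concrete; the span length and the tokenization here are made up for illustration:

```python
import random

MASK = "[MASK]"

def mask_contiguous_span(tokens, span_len=4):
    """Mask one contiguous span of subwords instead of independent random positions."""
    start = random.randrange(0, max(1, len(tokens) - span_len + 1))
    corrupted = list(tokens)
    targets = {}
    for i in range(start, min(start + span_len, len(tokens))):
        targets[i] = corrupted[i]   # positions the model must reconstruct
        corrupted[i] = MASK
    return corrupted, targets

subwords = ["ir", "##res", "##ist", "##ibly", "delicious", "pizza"]
print(mask_contiguous_span(subwords))
```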
00:55:05.360 | There's also a paper called the Roberta paper,
00:55:07.360 | which showed that the next sentence prediction wasn't necessary.
00:55:12.180 | They also showed that they really should
00:55:13.880 | have trained it on a lot more text.
00:55:16.840 | So Roberta is a drop-in replacement for BERT.
00:55:19.560 | So if you're thinking of using BERT, just use Roberta.
00:55:21.760 | It's better.
00:55:22.640 | And it gave us this intuition that we really
00:55:24.280 | don't know a whole lot about the best practices for training these things.
00:55:27.400 | You sort of train it for as long as you're willing to.
00:55:29.800 | And things do good stuff and whatever.
00:55:33.600 | So this is very-- but it's very difficult to do sort of iteration
00:55:37.920 | on these models because they're big.
00:55:39.420 | It's expensive to train them.
00:55:42.520 | Another thing that you should know for your final projects in the world ahead
00:55:45.960 | is this notion of fine-tuning all parameters of the network
00:55:49.200 | versus just a couple of them.
00:55:51.200 | So what we've talked about so far is you pre-train all the parameters
00:55:54.840 | and then you fine-tune all of them as well.
00:55:56.640 | So all the parameter values change.
00:55:59.480 | The alternative, which you call parameter-efficient or lightweight fine-tuning, is
00:56:04.000 | you sort of choose little bits of parameters,
00:56:06.520 | or you choose, in a very smart way, to keep most of the parameters fixed
00:56:09.640 | and only fine-tune others.
00:56:11.480 | And the intuition is that these pre-trained parameters were really good.
00:56:16.600 | And you want to make the minimal change from the pre-trained model
00:56:20.080 | to the model that does what you want so that you
00:56:22.120 | keep some of the generality, some of the goodness of the pre-training.
00:56:26.280 | So one way that this is done is called prefix tuning.
00:56:29.280 | Prompt tuning is very similar, where you actually
00:56:31.560 | freeze all the parameters of the network.
00:56:33.280 | So I've pre-trained my network here.
00:56:36.920 | And I never change any of the parameter values.
00:56:39.720 | Instead, I make a bunch of fake sort of pseudo word vectors
00:56:44.360 | that I prepend to the very beginning of the sequence.
00:56:47.360 | And I train just them.
00:56:49.280 | Sort of unintuitive.
00:56:50.800 | It's like these would have been like inputs to the network,
00:56:53.480 | but I'm specifying them as parameters.
00:56:55.340 | And I'm training everything to do my sentiment analysis task
00:56:58.640 | just by changing the values of these sort of fake words.
00:57:03.120 | And this is nice because I get to keep all the good pre-trained parameters
00:57:08.960 | and then just specify the sort of diff that ends up generalizing better.
00:57:15.000 | This is a very open field of research.
00:57:17.520 | But this is also cheaper because I don't have to compute the gradients,
00:57:21.480 | or I don't have to store the gradients and all the optimizer state.
00:57:25.240 | With respect to all these parameters, I'm only training
00:57:28.160 | a very small number of parameters.
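A minimal sketch of prefix/prompt tuning along those lines; the number of prefix vectors, the shapes, and the assumption that the frozen model consumes embeddings directly are all simplifications for illustration:

```python
import torch
import torch.nn as nn

class PrefixTunedClassifier(nn.Module):
    """Freeze the pretrained model; train only a few 'fake word' vectors plus a small head."""
    def __init__(self, frozen_model, frozen_embed, hidden_size=768, num_prefix=10, num_classes=2):
        super().__init__()
        self.model, self.embed = frozen_model, frozen_embed
        for p in list(self.model.parameters()) + list(self.embed.parameters()):
            p.requires_grad = False                     # pretrained weights never change
        # the trainable part: pseudo-word vectors prepended to every input
        self.prefix = nn.Parameter(0.02 * torch.randn(num_prefix, hidden_size))
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, token_ids):
        x = self.embed(token_ids)                                   # (batch, seq, hidden)
        prefix = self.prefix.unsqueeze(0).expand(x.size(0), -1, -1)
        hidden = self.model(torch.cat([prefix, x], dim=1))          # frozen transformer
        return self.classifier(hidden[:, -1])                       # classify from one position
```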
00:57:30.840 | Yeah.
00:57:31.340 | [INAUDIBLE]
00:57:33.340 | It's like fake parameters at the end, as if like here.
00:57:38.180 | It doesn't make any difference if you put these
00:57:40.180 | at the end or the beginning.
00:57:41.380 | In a decoder, you have to put them at the beginning
00:57:43.540 | because otherwise you don't see them before you process the whole sequence.
00:57:49.280 | Can we just attach a few layers and only train the new layers?
00:57:53.500 | The question is, can we just attach a new layers at the top of this
00:57:56.660 | and only train those?
00:57:57.660 | Absolutely.
00:57:58.820 | This works a bit better.
00:58:00.540 | Another thing that works well-- sorry, we're running out of time--
00:58:04.340 | is taking each weight matrix.
00:58:06.780 | So I have a bunch of weight matrices in my transformer.
00:58:09.700 | And I freeze the weight matrix and learn a very low rank little diff.
00:58:15.420 | And I set the weight matrix's value to be sort of the original value
00:58:19.660 | plus my sort of very low rank diff from the original one.
00:58:24.900 | And this ends up being a very similarly useful technique.
00:58:29.620 | And the overall idea here is that, again, I'm
00:58:31.700 | learning way fewer parameters than I did via pre-training and freezing
00:58:36.180 | most of the pre-training parameters.
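A sketch of that low-rank trick (this is the idea behind methods like LoRA); the rank and the initialization here are illustrative choices:

```python
import torch
import torch.nn as nn

class LowRankAdaptedLinear(nn.Module):
    """Keep the pretrained weight W frozen; learn a low-rank diff so the effective weight is W + A @ B."""
    def __init__(self, frozen_linear: nn.Linear, rank=8):
        super().__init__()
        self.frozen = frozen_linear
        for p in self.frozen.parameters():
            p.requires_grad = False
        out_features, in_features = frozen_linear.weight.shape
        self.A = nn.Parameter(0.01 * torch.randn(out_features, rank))  # trainable
        self.B = nn.Parameter(torch.zeros(rank, in_features))          # trainable; diff starts at zero

    def forward(self, x):
        # original (frozen) output plus the low-rank correction
        return self.frozen(x) + x @ (self.A @ self.B).T
```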
00:58:39.300 | OK, encoder-decoders.
00:58:41.140 | So for encoder-decoders, we could do something like language modeling.
00:58:45.300 | I've got my input sequence here, encoder, output sequence here.
00:58:49.700 | And I could say this part is my prefix for sort
00:58:52.980 | of having bidirectional context.
00:58:55.100 | And I could then predict all the words that
00:58:58.220 | are sort of in the latter half of the sequence,
00:59:00.780 | just like a language model.
00:59:01.900 | And that would work fine.
00:59:04.460 | And so this is something that you could do.
00:59:07.140 | You sort of take a long text, split it into two,
00:59:09.700 | give half of it to the encoder, and then generate
00:59:12.080 | the second half with the decoder.
00:59:13.420 | But in practice, what works much better is this notion of span corruption.
00:59:20.300 | Span corruption is going to show up in your assignment 5.
00:59:23.260 | And the idea here is a lot like BERT, but in a sort of generative sense,
00:59:30.580 | where I'm going to mask out a bunch of words in the input.
00:59:33.500 | Thank you, mask token 1, me to your party, mask token 2, week.
00:59:40.860 | And then at the output, I generate the mask token
00:59:44.660 | and then whatever the mask token replaced.
00:59:48.580 | So mask token 1, then predict "for inviting"; mask token 2, then "last"--
00:59:52.500 | reconstructing "Thank you for inviting me to your party last week."
00:59:54.860 | And what this does is that it allows you to have bidirectional context.
01:00:00.900 | I get to see the whole sequence, except I can generate
01:00:05.220 | the parts that were missing.
01:00:07.100 | So this feels a little bit like you mask out parts of the input,
01:00:10.020 | but you actually generate the output as a sequence
01:00:12.960 | like you would in language modeling.
01:00:14.860 | So this might be good for something like machine translation,
01:00:17.420 | where I have an input that I want bidirectional context in,
01:00:20.340 | but then I want to generate an output.
01:00:22.300 | And I want to pre-train the whole thing.
01:00:24.380 | So this was shown to work better than language modeling at the scales
01:00:27.780 | that these folks at Google were able to test back in 2019.
01:00:31.580 | This is still quite popular.
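To make the input/target format concrete, here is a toy version of that corruption step; the sentinel names and the particular spans are chosen for illustration and are not T5's exact preprocessing:

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel in the input;
    the target lists each sentinel followed by the dropped tokens."""
    corrupted, target = [], []
    prev = 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        corrupted += tokens[prev:start] + [sentinel]
        target += [sentinel] + tokens[start:end]
        prev = end
    corrupted += tokens[prev:]
    return corrupted, target

tokens = "Thank you for inviting me to your party last week .".split()
inp, tgt = span_corrupt(tokens, [(2, 4), (8, 9)])
print(" ".join(inp))  # Thank you <extra_id_0> me to your party <extra_id_1> week .
print(" ".join(tgt))  # <extra_id_0> for inviting <extra_id_1> last
```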
01:00:32.860 | Yeah, there's a lot of numbers.
01:00:37.940 | It works better than the other stuff.
01:00:40.100 | I'm not going to worry about it.
01:00:43.780 | There's a fascinating property of these models also.
01:00:46.540 | So T5 was the model that was originally introduced
01:00:51.060 | with salient span masking.
01:00:52.820 | And you can think of at pre-training time,
01:00:55.820 | you saw a bunch of things like Franklin D. Roosevelt was born in blank,
01:01:00.540 | and you generated out the blank.
01:01:02.580 | And there's this task called open domain question
01:01:07.620 | answering, which has a bunch of trivia questions,
01:01:10.220 | like when was Franklin D. Roosevelt born?
01:01:12.820 | And then you're supposed to generate out the answer as a string, just
01:01:16.860 | from your parameters.
01:01:17.940 | So you did a bunch of pre-training.
01:01:19.400 | You saw a bunch of text.
01:01:20.580 | And then you're supposed to generate these answers.
01:01:22.900 | And what's fascinating is that this salient span masking method
01:01:29.900 | allowed you to pre-train and then fine tune
01:01:32.380 | on some examples of trivia questions.
01:01:36.820 | And then when you tested on new trivia questions,
01:01:40.260 | the model would implicitly extract from its pre-training data
01:01:44.540 | somehow the answer to that new question that it never
01:01:47.780 | saw explicitly at fine tuning time.
01:01:49.700 | So it learned this sort of implicit retrieval-- sometimes,
01:01:53.060 | sometimes, less than 50% of the time or whatever,
01:01:55.740 | but much more than random chance.
01:02:00.020 | And that's fascinating.
01:02:01.580 | So you've learned to access this latent knowledge
01:02:05.180 | that you stored up by pre-training.
01:02:07.380 | And so you just pass it the text, when was Roosevelt born,
01:02:10.820 | and it would pass out an answer.
01:02:13.020 | And one thing to know is that the answers always look very fluent.
01:02:15.860 | They always look very reasonable.
01:02:17.820 | But they're frequently wrong.
01:02:19.980 | And that's still true of things like ChatGPT.
01:02:21.860 | Yeah.
01:02:25.980 | OK, so that's encoder-decoder models.
01:02:30.300 | Next up, we've got decoders.
01:02:31.740 | And we'll spend a long time on decoders.
01:02:34.100 | So this is just our normal language model.
01:02:35.980 | So I get a sequence of hidden states from my decoder.
01:02:38.940 | The model-- the words can only look at themselves, not the future.
01:02:43.220 | And then I predict the next word in the sentence.
01:02:46.780 | And then here again, I can--
01:02:48.700 | to do sentiment analysis, maybe take the last state
01:02:50.900 | for the last word, and then predict happy or sad
01:02:53.540 | based on that last embedding.
01:02:56.340 | Back-propagate the gradients of the whole network,
01:02:58.420 | train the whole thing, or do some kind of lightweight
01:03:01.700 | or parameter-efficient fine-tuning,
01:03:03.420 | like we mentioned earlier.
01:03:05.100 | So this is our pre-training a decoder.
01:03:07.940 | And I can just pre-train it on language modeling.
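A small sketch of that language-modeling objective for a decoder, assuming a `decoder` that returns one causally masked hidden state per position and a `to_vocab` projection; the names and shapes are placeholders:

```python
import torch.nn.functional as F

def language_modeling_loss(decoder, to_vocab, token_ids):
    """Teacher forcing: the hidden state at position t predicts the token at position t+1."""
    hidden = decoder(token_ids)          # (batch, seq_len, hidden_size), causally masked
    logits = to_vocab(hidden)            # (batch, seq_len, vocab_size)
    pred = logits[:, :-1].reshape(-1, logits.size(-1))  # states 0..T-2
    gold = token_ids[:, 1:].reshape(-1)                 # tokens 1..T-1
    return F.cross_entropy(pred, gold)
```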
01:03:13.460 | So again, you might want to do this if you are wanting to generate texts,
01:03:19.820 | generate things.
01:03:22.220 | You sort of can use this like you use an encoder-decoder.
01:03:25.700 | But in practice, as we'll see, a lot of the sort of biggest,
01:03:29.580 | most powerful pre-trained models tend to be decoder-only.
01:03:33.740 | It's not really clear exactly why, except they
01:03:36.780 | seem a little bit simpler than encoder-decoders.
01:03:41.140 | And you get to share all the parameters in one big network for the decoder,
01:03:45.060 | whereas in an encoder-decoder, you have to split them,
01:03:47.820 | sort of some into the encoder, some into the decoder.
01:03:50.620 | So for the rest of this lecture, we'll talk only about decoders.
01:03:55.500 | In modern things, the biggest networks do tend to be decoders.
01:04:00.780 | So we're coming all the way back again to 2018.
01:04:03.740 | And the GPT model from OpenAI was a big success.
01:04:09.420 | It had 117 million parameters.
01:04:13.060 | It had 768 dimensional hidden states.
01:04:16.660 | And it had this vocabulary that was 40,000-ish words that
01:04:23.180 | was defined via a method like what we showed at the beginning of class,
01:04:26.780 | trained on BooksCorpus.
01:04:28.620 | And actually, the name GPT never actually showed up in the original paper.
01:04:32.860 | It's unclear what exactly it's supposed to refer to.
01:04:39.180 | But this model was a precursor to all the things
01:04:43.580 | that you're hearing about nowadays.
01:04:46.100 | If you move forward--
01:04:48.700 | oh, yeah.
01:04:49.200 | So if you-- hmm.
01:04:55.820 | So if we wanted to do something like natural language inference, which
01:04:59.900 | says, take these pairs of sentences-- the man is in the doorway,
01:05:03.780 | the person is near the door--
01:05:05.460 | and say that these mean that one entails the other,
01:05:09.100 | the premise entails the hypothesis, that I can believe the hypothesis
01:05:12.900 | if I believe the premise, I'd just concatenate them together.
01:05:16.780 | So give it maybe a start token, pass in one sentence,
01:05:21.180 | pass in some delimiter token, pass in the other,
01:05:23.920 | and then predict yes, no, entailment, not entailment.
01:05:30.220 | Fine-tuning GPT on this worked really well.
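A tiny sketch of packing the sentence pair into one sequence that way; the special-token strings here are placeholders rather than the exact ones from the GPT paper:

```python
def format_nli_example(premise, hypothesis):
    """Concatenate a start token, the premise, a delimiter, and the hypothesis."""
    return f"[START] {premise} [DELIM] {hypothesis} [EXTRACT]"

text = format_nli_example("The man is in the doorway.", "The person is near the door.")
# A classifier on the hidden state at the final ([EXTRACT]) position
# then predicts entailment vs. not entailment.
print(text)
```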
01:05:33.340 | And then BERT came after GPT.
01:05:35.620 | BERT did a bit better.
01:05:36.740 | It had bidirectional context.
01:05:44.180 | But GPT still did an excellent job.
01:05:44.180 | And then came GPT-2, where they focused more
01:05:47.220 | on the generative abilities of the network.
01:05:49.660 | So we looked at now a much larger network.
01:05:54.640 | We've gone from 117 million to 1.5 billion.
01:05:57.840 | And given some sort of prompt, it could generate, at the time,
01:06:01.800 | a quite surprisingly coherent continuation to the prompt.
01:06:04.680 | So it's telling this sort of story about scientists and unicorns here.
01:06:11.480 | And this size of model is still sort of small enough
01:06:15.280 | that you can use on a small GPU and fine tune and whatever.
01:06:19.620 | And its capabilities of generating long, coherent text
01:06:23.060 | was just sort of exceptional at the time.
01:06:28.140 | It was also trained on more data, although I don't--
01:06:32.020 | something like 9 billion words of text.
01:06:35.580 | And then, so after GPT-2, we come to GPT-3,
01:06:40.340 | sort of walking through these models.
01:06:42.280 | And then we come with a different way of interacting with the models.
01:06:45.620 | So we've interacted with pre-trained models in two ways so far.
01:06:49.060 | We've sort of sampled from the distribution that they define.
01:06:53.180 | We generated text via a machine translation system or whatever.
01:06:57.380 | Or we fine-tuned them on a task that we care about.
01:06:59.620 | And then we take their predictions.
01:07:03.620 | But GPT-3 seems to have an interesting new ability.
01:07:10.180 | It's much larger.
01:07:11.580 | And it can do some tasks without any sort of fine-tuning whatsoever.
01:07:17.820 | GPT-3 is much larger than GPT-2.
01:07:20.060 | So we went from GPT, 100-ish million parameters,
01:07:23.500 | GPT-2, 1.5 billion, GPT-3, 175 billion, much larger,
01:07:28.820 | trained on 300 billion words of text.
01:07:32.100 | And this sort of notion of in-context learning,
01:07:34.500 | that it could define or figure out patterns in the training
01:07:37.740 | or in the example that it's currently seeing
01:07:40.140 | and continue the pattern, is called in-context learning.
01:07:44.700 | So you've got the word "thanks."
01:07:46.440 | And I pass in this little arrow and say, OK, thanks goes to merci.
01:07:50.180 | And then hello goes to bonjour.
01:07:51.580 | And then I give it all of these examples
01:07:53.700 | and ask it what otter should go to.
01:07:57.300 | And it's learned to sort of continue the pattern
01:08:01.020 | and say that this is the translation of otter.
01:08:04.660 | So now, remember, this is a single sort of input that I've given to my model.
01:08:09.860 | And I haven't said, oh, do translation or fine-tune it on translation
01:08:13.460 | or whatever.
01:08:14.380 | I've just passed in the input, given it some examples.
01:08:16.980 | And then it is able to, to some extent, do this seemingly complex task.
01:08:22.260 | That's in-context learning.
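The whole "task specification" is just one input string along these lines (the word pairs are made up to mirror the slide, and "loutre" is simply the expected French continuation):

```python
prompt = (
    "thanks => merci\n"
    "hello => bonjour\n"
    "mint => menthe\n"
    "otter =>"
)
# Asked to continue this single sequence, a large enough decoder-only model
# will often complete the pattern with the translation ("loutre"),
# even though it was never fine-tuned for translation.
print(prompt)
```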
01:08:25.620 | And here are more examples.
01:08:27.140 | Maybe you give it examples of addition.
01:08:29.820 | And then it can do some simple addition afterward.
01:08:33.900 | You give it-- in this case, this is sort of rewriting typos.
01:08:36.900 | It can figure out how to rewrite typos in context,
01:08:39.460 | or do in-context learning for machine translation.
01:08:41.820 | And this was the start of this idea that there
01:08:43.860 | were these emergent properties that showed up in much larger models.
01:08:47.940 | And it wasn't clear, when looking at the smaller models,
01:08:51.020 | that you'd get this sort of new, this qualitatively new behavior out of them.
01:08:57.780 | Like, it's not obvious from just the language modeling signal, right?
01:09:01.140 | GPT-3 is just trained on that decoder only, just predict the next word,
01:09:06.420 | that it would, as a result of that training,
01:09:09.740 | learn to perform seemingly quite complex things
01:09:12.620 | as a function of its context.
01:09:13.700 | Yeah, OK.
01:09:17.900 | One or two questions about that.
01:09:19.540 | This should be quite surprising, I think, right?
01:09:29.060 | So far, we've talked about good representations,
01:09:31.900 | contextual representations, meanings of words in context.
01:09:35.060 | This is some very, very high-level pattern matching, right?
01:09:37.500 | It's coming up with patterns in just the input data.
01:09:40.660 | And that one sequence of text that you've passed it so far,
01:09:43.660 | and it's able to sort of identify how to complete the pattern.
01:09:48.180 | And you think, what kinds of things can this solve?
01:09:50.780 | What are its capabilities?
01:09:52.380 | What are its limitations?
01:09:54.220 | This ends up being an open area of research.
01:09:56.100 | Sort of, what are the kinds of problems that you maybe
01:09:58.700 | saw in the training data a lot?
01:10:00.020 | Maybe GPT-3 saw a ton of pairs of words, right?
01:10:03.780 | It saw a bunch of dictionaries, bilingual dictionaries
01:10:06.860 | in its training data.
01:10:07.740 | So it learned to do something like this.
01:10:09.660 | Or is it doing something much more general,
01:10:11.420 | where it's really learning the task in context?
01:10:14.660 | The actual story, we're not totally sure.
01:10:17.460 | Something in the middle.
01:10:18.740 | It seems like it has to be tied to your training data in ways
01:10:22.540 | that we don't quite understand.
01:10:24.140 | But there's also a non-trivial ability
01:10:26.180 | to learn new sort of, at least, types of patterns
01:10:30.140 | just from the context.
01:10:31.580 | So this is a very interesting thing to work on.
01:10:34.740 | Now, we've talked a lot about the size of these models so far.
01:10:37.660 | And as models have gotten larger,
01:10:39.700 | they've always gotten better.
01:10:40.900 | We train them on more data.
01:10:43.220 | So GPT-3 was trained on 300 billion words of text.
01:10:46.940 | And it was 175 billion parameters.
01:10:50.900 | And at that scale, it costs a lot of money
01:10:55.100 | to build these things.
01:10:56.140 | And it's very unclear whether you're getting the best
01:10:58.260 | use out of your money.
01:10:59.220 | Is bigger really what you should
01:11:00.640 | have been doing in terms of the number of parameters?
01:11:03.740 | So the cost of training one of these
01:11:06.180 | is roughly you take the number of parameters,
01:11:08.140 | you multiply it by the number of tokens
01:11:09.740 | that you're going to train it on, the number of words.
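As back-of-the-envelope arithmetic with that rule of thumb (the "budget" here is just the parameter-token product, not actual FLOPs or dollars):

```python
# Rough rule of thumb: training cost scales with (parameters) x (training tokens).
gpt3_params = 175e9
gpt3_tokens = 300e9
budget = gpt3_params * gpt3_tokens           # ~5.25e22 parameter-tokens

# For the same budget, a model less than half the size could see far more text:
smaller_params = 70e9
affordable_tokens = budget / smaller_params  # ~750e9 tokens, 2.5x more data
print(f"{budget:.2e} {affordable_tokens:.2e}")
```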
01:11:12.780 | And some folks at DeepMind--
01:11:14.820 | oh, I forgot the citation on this.
01:11:16.240 | Some folks at DeepMind realized through some experimentation
01:11:20.980 | that actually GPT-3 was just comically oversized.
01:11:25.300 | So Chinchilla, the model they trained,
01:11:27.660 | is less than half the size and works better.
01:11:30.720 | But they just trained it on way more data.
01:11:34.640 | And this is an interesting trade-off about how do you
01:11:38.020 | best spend your compute?
01:11:39.120 | I mean, you can't do this more than a handful of times,
01:11:41.420 | even if you're Google.
01:11:44.100 | So open questions there as well.
01:11:48.280 | Another way of interacting with these networks
01:11:51.320 | that has come out recently is called chain of thought.
01:11:56.120 | So the prefix, we saw in the in-context learning slide
01:12:00.200 | that the prefix can help specify what task you're
01:12:02.600 | trying to solve right now.
01:12:04.360 | And it can do even more.
01:12:06.000 | So here's standard prompting.
01:12:07.680 | We have a prefix of examples of questions and answers.
01:12:11.440 | So you have a question and then an example answer.
01:12:14.800 | So that's your prompt that's specifying the task.
01:12:17.360 | And then you have a new question.
01:12:18.800 | And you're having the model generate an answer.
01:12:20.760 | And it generates it wrong.
01:12:23.160 | And chain of thought prompting says, well,
01:12:26.560 | how about in the example, in the demonstration we give,
01:12:29.280 | we give the question.
01:12:30.600 | And then we give this sort of decomposition of steps
01:12:34.080 | towards how to get an answer.
01:12:36.180 | So I'm actually writing this out as part of the input.
01:12:38.380 | I'm giving annotations as a human to say,
01:12:41.480 | oh, to solve this sort of word problem,
01:12:44.360 | here's how you could think it through-ish.
01:12:47.280 | And then I give it a new question.
01:12:49.480 | And the model says, oh, I know what I'm supposed to do.
01:12:51.760 | I'm supposed to first generate a sequence of steps,
01:12:55.920 | of intermediate steps.
01:12:57.640 | And then next, say the answer is--
01:13:00.160 | and then say what the answer is.
01:13:01.840 | And it turns out-- and this should, again,
01:13:04.040 | be very surprising--
01:13:06.440 | that the model can tend to generate plausible sequences
01:13:09.960 | of steps and then much more frequently
01:13:12.440 | generates the correct answer after doing so,
01:13:14.880 | relative to trying to generate the answer by itself.
01:13:18.160 | So you can think of this as a scratch pad.
01:13:20.600 | You can think of this as increasing
01:13:23.080 | the amount of computation that you're
01:13:24.660 | putting into trying to solve the problem,
01:13:27.000 | sort of writing out your thoughts.
01:13:28.420 | Right?
01:13:28.920 | As I generate each word of this continuation here,
01:13:33.040 | I'm able to condition on all the past words so far.
01:13:36.200 | And so maybe it just allows the network
01:13:40.280 | to sort of decompose the problem into smaller, simpler
01:13:43.160 | problems, which it's more able to solve each.
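The only thing that changes is the demonstration inside the prompt; a toy version in the style of the paper's word-problem examples (written from memory, not the exact prompt):

```python
standard_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: The answer is 11.\n"
    "Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. "
    "How many apples do they have?\n"
    "A:"
)

chain_of_thought_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n"
    "Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. "
    "How many apples do they have?\n"
    "A:"
)
# With the second prompt, the model tends to write out the intermediate steps
# ("23 - 20 = 3, 3 + 6 = 9") before the final answer, and gets it right far more often.
```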
01:13:47.800 | No one's really sure why this works exactly either.
01:13:51.240 | At this point, with networks that are this large,
01:13:54.120 | their emergent properties are both very powerful
01:13:57.720 | and exceptionally hard to understand,
01:13:59.600 | and very hard, you should think, to trust.
01:14:03.440 | Because it's unclear what its capabilities are
01:14:05.560 | and what its limitations are, where it will fail.
01:14:09.200 | So what do we think pre-training is teaching?
01:14:11.720 | Gosh, a wide range of things, even
01:14:14.520 | beyond what I've written in this slide, which
01:14:17.360 | I mostly wrote two years ago.
01:14:19.600 | So it can teach you trivia, and syntax, and coreference,
01:14:22.480 | and maybe some lexical semantics, and sentiment,
01:14:24.920 | and some reasoning, like way more reasoning
01:14:27.360 | than we would have thought even three years ago.
01:14:30.280 | And yet, they also learn and exacerbate
01:14:33.400 | racism and sexism, all manner of biases.
01:14:37.480 | There'll be more on this later.
01:14:38.920 | But the generality of this is really,
01:14:42.800 | I think, what's taken many people aback.
01:14:45.040 | And so increasingly, these objects
01:14:47.440 | are not just studied for the sake of using them,
01:14:51.040 | but studied for the sake of understanding anything
01:14:53.760 | about how they work and how they fail.
01:14:55.440 | Yeah, any questions?
01:14:59.440 | Has anyone tried benchmarking GPT for programming tasks,
01:15:11.240 | like how accurately it does, et cetera?
01:15:13.920 | The question is, has anyone tried benchmarking
01:15:16.320 | GPT for programming tasks?
01:15:18.920 | Anyone seen how well it does?
01:15:21.600 | Yes, so there's definitely examples
01:15:23.120 | of people using GPT-3 for simple programming things.
01:15:28.400 | And then the modern, state-of-the-art,
01:15:30.920 | competitive programming bots are all based on ideas
01:15:34.760 | from language modeling.
01:15:36.600 | And I think they're all also based on pre-trained language
01:15:40.160 | models themselves.
01:15:41.160 | If you just take all of these ideas
01:15:43.360 | and apply them to GitHub, then
01:15:46.960 | some very interesting emergent behaviors
01:15:48.720 | relating to code fall out.
01:15:50.920 | And so yeah, I think all of the best systems use this,
01:15:55.280 | more or less.
01:15:56.160 | So lots of benchmarking there, for sure.
01:15:58.840 | Is that the basis for what GitHub Copilot's trying to do?
01:16:02.680 | The question is, is this the basis?
01:16:04.120 | Is what we just mentioned the basis for the GitHub Copilot
01:16:07.080 | system?
01:16:07.580 | Yes, absolutely.
01:16:10.320 | We don't know exactly what it is in terms of details,
01:16:13.680 | but it's all these ideas.
01:16:16.080 | What if you have a situation where you have still
01:16:18.640 | a large amount of data for general data,
01:16:21.000 | and then you have also a large amount of data
01:16:23.280 | for your fine-tuning task?
01:16:24.880 | At what point is it better to train a new model
01:16:27.760 | for that fine-tuning versus get data from both?
01:16:30.760 | Yeah, the question is, what if you
01:16:32.160 | have a large amount of data for pre-training
01:16:33.760 | and a large amount of data for fine-tuning?
01:16:35.560 | When is it better to do a separate training on just
01:16:39.280 | the fine-tuning data?
01:16:41.880 | Almost never.
01:16:43.240 | If you have a bunch of data for the task that you care about,
01:16:48.400 | what's frequently done instead is three-part training,
01:16:51.840 | where you pre-train on a very broad corpus.
01:16:54.720 | Then you continue to pre-train using something
01:16:57.320 | like language modeling on an unlabeled version of the labeled
01:17:02.200 | data that you have.
01:17:03.280 | You just strip the labels off and just treat it all as text
01:17:05.660 | and do language modeling on that,
01:17:07.560 | adapt the parameters a little bit,
01:17:09.320 | and then do the final stage of fine-tuning with the labels
01:17:12.360 | that you want, and that works even better.
01:17:14.240 | There's an interesting paper called Don't Stop Pre-Training.
01:17:19.280 | Nice.
01:17:20.280 | Final question.
01:17:21.840 | We asked a lot of questions.
01:17:23.920 | Anyone?
01:17:25.420 | Someone new with a question?
01:17:27.220 | Yeah, I was wondering, do you know
01:17:32.800 | if there's a lot of instances where a pre-trained model can
01:17:36.840 | do some task that's not seen before even without fine-tuning?
01:17:41.160 | Yeah, so are there any instances of where a pre-trained model
01:17:43.700 | can do a task that it hasn't seen before without fine-tuning?
01:17:47.240 | The question is, what does hasn't seen before mean?
01:17:50.960 | These models, especially GPT-3 and similar very large models,
01:17:55.280 | during pre-training, did it ever see something exactly
01:17:58.320 | like this sort of word problem arithmetic?
01:18:01.600 | Maybe, maybe not.
01:18:03.080 | It's actually sort of unclear.
01:18:05.080 | It's clearly able to recombine bits and pieces of tasks
01:18:08.920 | that it saw implicitly during pre-training.
01:18:11.360 | We saw the same thing with trivia.
01:18:13.040 | Language modeling looks a lot like trivia sometimes,
01:18:15.520 | where you just read the first paragraph of a Wikipedia page,
01:18:19.080 | and it's kind of like answering a bunch of little trivia
01:18:21.360 | questions about where someone was born and when.
01:18:24.400 | But it's never seen something quite like this.
01:18:26.480 | And it's actually still kind of astounding
01:18:28.280 | how much it's able to do things that don't seem like they
01:18:30.640 | should have shown up all that directly in the pre-training
01:18:33.040 | data.
01:18:33.920 | Quantifying that extent is an open research problem.
01:18:37.480 | OK, that's it.
01:18:38.080 | Let's call it.
01:18:40.360 | Exactly.
01:18:41.920 | [BLANK_AUDIO]