Hello. Welcome to CS224N. Today we'll be talking about pre-training, which is another exciting topic on the road to modern natural language processing. OK. How is everyone doing? Thumbs up, thumbs sideways, thumbs down. Wow. No response bias there. All thumbs up. Oh, sideways. Nice. I like that honesty. That's good.
Well, OK. So we're now-- what is this, week five? Yes, it's week five. And we have a couple-- so this lecture, the Transformers lecture, and then to a lesser extent, Thursday's lecture on natural language generation will be sort of the sum of lectures for the assignments you have to do.
So assignment five is coming out on Thursday. And the topics covered in this lecture, and self-attention transformers, and again, a little bit of natural language generation will be tested in assignment five. And then the rest of the course will go through some really fascinating topics in sort of modern natural language processing that should be useful for your final projects, and future jobs, and interviews, and intellectual curiosity.
But I think that today's lecture is significantly less technical in detail than last Thursday's on self-attention and transformers, but should give you an idea of the sort of world of pre-training and sort of how it helps define natural language processing today. So a reminder about assignment five, your project proposals also are due on Tuesday, next Tuesday.
Please do get those in. Try to get them in on time so that we can give you prompt feedback about your project proposals. And yeah, so let's jump into it. OK, so what we're going to start with today is a bit of a technical detail on word structure and sort of how we model the input sequence of words that we get.
So when we were teaching Word2Vec and sort of all the methods that we've talked about so far, we assumed a finite vocabulary. So you had a vocabulary v that you define via whatever. You've looked at some data. You've decided what the words are in that data. And so you have some words like hat and learn.
And you have this embedding. It's in red because you've learned it properly. Actually, let's replace hat and learn with pizza and tasty. Those are better. And so that's all well and good. You see these words in your model. And you have an embedding that's been learned on your data to sort of know what to do when you see those words.
But when you see some sort of variations, maybe a variation like taaaaasty, or a typo like laern, or maybe novel items like transformerify, a word that you as a human can understand as a combination (this is called derivational morphology) of this word transformer that you know, plus the suffix -ify, which means take this noun and give me back a verb that means to make more like that noun.
So to transformerify NLP might mean to make NLP more like using transformers and such. And for each of these, this maybe didn't show up in your training corpus. And language is always doing this. People are always coming up with new words. And there's new domains.
And young people are always making new words. It's great. And so it's a problem for your model, though, because you've defined this finite vocabulary. And there's sort of no mapping in that vocabulary for each of these things. Even though their meanings should be relatively well defined based on the data you've seen so far, it's just that the sort of string of characters that define them aren't quite what you've seen.
And so what do you do? Well, maybe you map them to this sort of universal unknown token. This is UNK. So it's like, oh, I see something. I don't know what. I've never seen it before. I'm going to say it's always represented by the same token UNK. And so that's been done in the past.
And that's sort of bad, right, because it's totally losing tons of information. But you need to map it to something. And so this is like a clear problem. I mean, in English, it's a problem; in many of the world's languages, it's a substantially larger problem. So English has relatively simple word structure.
There's a couple of conjugations for each verb, like eat, eats, eaten, ate. But in a language with much more complex morphology or word structure, you'll have a considerably more complex sort of set of things that you could see in the world. So here is a conjugation table for a Swahili verb.
And it has over 300 conjugations. And if I define the vocabulary to be every unique string of characters maps to its own word, then every one of the 300 conjugations would get an independent vector under my model, which makes no sense, because the 300 conjugations obviously have a lot in common and differ from each other in meaningful, systematic ways.
So you don't want to do this. You'd have to have a huge vocabulary if I wanted all conjugations to show up. And that's a mistake for efficiency reasons and for learning reasons. Any questions so far? Cool. OK. And so what we end up doing is we'll look at subword structure, subword modeling.
So what we're going to do is we're going to say, instead of trying to define what the set of all words is, I'm going to define my vocabulary to include parts of words. So I'm going to split words into sequences of known subwords. And so there's a simple sort of algorithm for this, where you start with all characters.
So if I only had a vocabulary of all characters, and maybe like an end of word symbol for a finite data set, then no matter what word I saw in the future, as long as I had seen all possible characters, I could take the word and say, I don't know what this word is.
I'm going to split it into all of its individual characters. So you won't have this unk problem. You can sort of represent any word. And then you're going to find common adjacent characters and say, OK, A and B co-occur next to each other quite a bit. So I'm going to add a new word to my vocabulary.
Now it's all characters plus this new word A, B, which is a subword. And likewise, so now I'm going to replace the character pair with the new subword and repeat until you add a lot, a lot, a lot of vocabulary items through this process of what things tend to co-occur next to each other.
And so what you'll end up with is a vocabulary of very commonly co-occurring sort of substrings by which you can build up words. And this was originally developed for machine translation, but then it's been used considerably in pretty much all modern language models.
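To make that merge procedure concrete, here is a minimal sketch of byte-pair-encoding-style vocabulary learning in Python. It's not the exact algorithm any particular tokenizer uses; the toy corpus, the end-of-word marker, and the number of merges are all made up for illustration.

```python
from collections import Counter

def learn_bpe(corpus_words, num_merges):
    # Start with every word spelled out as characters, plus an end-of-word marker.
    vocab = Counter()
    for word, freq in corpus_words.items():
        vocab[tuple(word) + ("</w>",)] += freq

    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols co-occurs.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)

        # Replace every occurrence of the best pair with a single merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

# Toy corpus: word -> frequency. Common words end up as whole vocabulary items.
merges = learn_bpe({"tasty": 10, "taaaaasty": 2, "hat": 50, "learn": 40}, num_merges=30)
print(merges[:5])
```

At tokenization time, you would then apply these learned merges to a new word (or, roughly equivalently, greedily pick the longest matching subwords), which is the splitting question that comes up below.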
So now, coming back to hat and learn. In our subword vocabulary, hat and learn showed up enough that they're their own individual words. So that's sort of good, right? Simple common words show up as a word in your vocabulary just like you'd like them to. But now taaaaasty maybe gets split into taa##, where this hash hash means don't add a space next, right?
So taa## and then aaa## and then sty. So I've actually taken one thing that seems like a word, and in my vocabulary, it's now split into three subword tokens. So when I pass this to my transformer or to my recurrent neural network, the recurrent neural network would take taa## as just a single element, do the RNN update, and then take aaa##, do the RNN update, and then sty.
So it could learn to process constructions like this. And maybe I can even add more aaa##s in the middle, and have it do something similar, instead of just seeing the entire word taaaaasty and not knowing what it means. Is that? That's feedback, yeah. How loud is that feedback? We good?
OK, I think we're fixed. Great. And so same with transformerify. Maybe transformer is its own word, and then ify is another. And so you can see that you have a few learned embeddings instead of one sort of useless UNK embedding. This is just wildly useful and is used pretty much everywhere.
Variants of this algorithm are used pretty much everywhere in modern NLP. Questions? Yes. If we have three embeddings for taaaaasty, do we just add them together? So the question is, if we have three embeddings for taaaaasty, do we just add them together? If we want to represent-- so when we're actually processing the sequence, I'd see something like I learned about the taa## aaa## sty.
So it'd actually be totally separate tokens. But if I wanted to then say, what's my representation of this thing? Depends on what you want to do. Sometimes you average the contextual representations of the three or look at the last one maybe. At that point, it's unclear what to do.
But everything sort of works OK. How do you know where to split? How do you what? How do you know where to split? Yeah. So you know where to split based on the algorithm that I specified earlier for learning the vocabulary. So you learn this vocabulary by just combining commonly co-occurring adjacent strings of letters.
So like A, B co-occurred a lot. So now I've got a new word that's A, B. And then when I'm actually walking through and tokenizing, I try to split as little as possible. So I split words into the maximal sort of subword that takes up the most characters. There are algorithms for this.
Yeah, so I'm like, OK, if I want to split this up, there's many ways I could split it up. And you try to find, approximately, the best way to split it into the fewest subwords. Yeah. Does it seem to make sense to use punctuation in the character set?
So the question is, do people use punctuation in the character set? Do people do it? Yes, absolutely. So sort of from this point on, just assume that what text is given to these models is as unprocessed as possible. You try to make it sort of clean looking text, where you've removed HTML tags, maybe if it's from the internet or whatever.
But then beyond that, you process it as little as possible so that it reflects as well as possible what people might actually be using this for. So maybe earlier in the course, when we were looking at Word2Vec, we might have thought, oh, we don't want Word2Vec vectors of punctuation or something like that.
Now everything is just as close as possible to what the text you'd get with people trying to use your system would be. So yes, in practice, punctuation and dot, dot, dot might be its own word, and maybe a sequence of hyphens, because people make big bars across tables. Yeah.
How does it impact things that one word now could be multiple embeddings versus a single embedding? Does the system treat those any differently? The question is, does the system treat any differently words that are really themselves a whole word versus words that are pieces? No, the system has no idea. They're all just indices into your embedding vocabulary matrix.
So they're all treated equally. What about really long words that are relatively common? Because if you're building up from single character all the way up, what happens then? The question is, what happens to very long words if you're building up from character pairs and portions of characters? In practice, the statistics speak really well for themselves.
So if a long word is very common, it will end up in the vocabulary. And if it's not very common, it won't. There are algorithms that aren't this that do slightly better in various ways. But the intuition that you figure out what the common co-occurring substrings are, independent of length almost, is the right intuition to have.
And so you can actually just look at the learned vocabularies of a lot of these models. And you see some long words just because they showed up a lot. I'm curious, how does it weigh the frequency? So let's say there's ify. On your slide, it was like ify at the very end.
So if could be really common. So how does it weigh the frequency of a subword versus the length of it? It tries to split it up into the smallest number. But what if it split it up into three, but one of them was super common? Yeah, so the question is, if transformer is a subword in my vocabulary, and if is a subword, and y is a subword, and ify as a three-letter subword is also in the vocabulary, how does it choose to take ify, which maybe is not very common, as opposed to splitting it into more, shorter subwords.
It's just a choice. We choose to try to take the smallest number of subwords, because that tends to be more of the bottleneck, as opposed to having a bunch of very common, very short subwords. Sequence length is a big problem in transformers. And this seems to be what works.
Although trying to split things into multiple options of a sequence and running the transformer on all of them is the thing that people have done to see which one will work better. But yeah, having fewer bigger subwords tends to be the best sort of idea. I'm going to start moving on, though.
Feel free to ask me more questions about this afterward. OK, so let's talk about pre-training from the context of the course so far. So at the very beginning of the course, we gave you this quote, which was, "You shall know a word by the company it keeps." This was the sort of thesis of the distributional hypothesis, that the meaning of the word is defined by, or at least reflected by, what words it tends to co-occur around.
And we implemented this via Word2Vec. The same person who made that quote had a separate quote, actually earlier, that continues this notion of meaning as defined by context, which has something along the lines of, well, since the word shows up in context when we actually use it, when we speak to each other, the meaning of the word should be defined in the context that it actually shows up in.
And so the complete meaning of a word is always contextual, and no study of meaning apart from a complete context can be taken seriously. So the big difference here is, at Word2Vec training time, if I have the word record, R-E-C-O-R-D, when I'm training Word2Vec, I get one vector or two, but one vector meaning record, the string.
And it has to learn by what context it shows up in, that sometimes it can mean I record, i.e. the verb, or record, i.e. the noun. But I only have one vector to represent it. And so when I use the Word2Vec embedding of record, it sort of has this mixture meaning of both of its sort of senses, right?
It doesn't get to specialize and say, oh, this part means the verb record, and this part means the noun record. And so Word2Vec is going to just sort of fail. And so I can build better representations of language through these contextual representations that are going to take things like recurrent neural networks or transformers that we used before to build up sort of contextual meaning.
So what we had before were pre-trained word embeddings. And then we had sort of a big box on top of it, like a transformer or an LSTM, that was not pre-trained, right? So you learn via context your word embeddings here. And then you have a task, like sentiment analysis or machine translation or parsing or whatever.
And you initialize all the parameters of this randomly. And then you train to predict your label. And the big difference in today's work is that we're going to try to pre-train all the parameters. So I have my big transformer. And instead of just pre-training my word embeddings with Word2Vec, I'm going to train all of the parameters of the network, trying to teach it much more about language that I could use in my downstream tasks.
So now the labeled data that I have for, say, machine translation might need to be smaller. I might not need as much of it, because I've already trained much more of the network than I otherwise would have if I had just gotten Word2Vec embeddings. So here, I've pre-trained this entire structure-- the word embeddings, the transformer on top.
Everything's been trained via methods that we'll talk about today. And so what does this give you? I mean, it gives you very strong representations of language. So the meaning of record and record will be different in the sort of contextual representations that know where in the sequence it is and what words are co-occurring with it in this specific input than Word2Vec, which only has one representation for record independent of where it shows up.
It'll also be used as strong parameter initializations for NLP models. So in all of your homework so far, you've worked with building out a natural language processing system sort of from scratch. How do I initialize this weight matrix? And we always say, oh, small, normally distributed noise, like little values close to 0.
And here, we're going to say, well, just like we were going to use the Word2Vec embeddings and those sort of encoded structure, I'm going to start maybe my machine translation system from a parameter initialization that's given to me via pre-training. And then also, it's going to give us probability distributions over language that we can use to generate and otherwise.
And we'll talk about this. So whole models are going to be pre-trained. So all of pre-training is effectively going to be centered around this idea of reconstructing the input. So you have an input. It's a sequence of text that some human has generated. And the sort of hypothesis is that by masking out part of it and tasking a neural network with reconstructing the original input, that neural network has to learn a lot about language, about the world, in order to do a good job of reconstructing the input.
So this is now a supervised learning problem, just like machine translation. Taking this sentence that just existed, Stanford University is located in, say, Palo Alto, California, or Stanford, California, I guess. And I have, by removing this part of the sentence, made a label for myself. The input is this sort of broken masked sentence.
And the label is Stanford or Palo Alto. So if I give this example to a network and ask it to predict the center thing, as it's doing its gradient step on this input, it's going to encode information about the co-occurrence between this context, Stanford University is located in, and Palo Alto.
So by tasking it with this, it might learn, say, where Stanford is. What else might it learn? Well, it can learn things about maybe syntax. So I put blank fork down on the table. Here, there's only a certain set of words that could go here. I put the fork down on the table.
I put a fork down on the table. These are syntactic constraints. So the context shows me what kinds of words can appear in what kinds of contexts. The woman walked across the street checking for traffic over blank shoulder. Any ideas on what could go here? Her, right? So this sort of co-reference between this entity who is being discussed in the world, this woman, and her shoulder.
Now, when I discuss-- this is sort of a linguistic concept. Her here is a co-referent to woman. It's referring to the same entity in the discourse. And so the network might be able to learn things about what entities are doing what where. It can learn things about semantics. So if I went to the ocean to see the fish, turtles, seals, and blank, then the word that's in the blank should be a member of the class that I'm thinking of as a person writing this sentence of stuff that I see when I go to the ocean and see these other things as well.
So in order to do this prediction task, maybe I learn about the semantics of aquatic creatures. OK, so what else could I learn? I've got overall, the value I got from the two hours watching it was the sum total of the popcorn and drink. The movie was blank. What kind of task could I be learning from doing this sort of prediction problem?
Sentiment, exactly. So this is just a naturalistic sort of text that I naturally wrote myself. But by saying, oh, the movie was bad, I'm learning about sort of the latent sentiment of the person who wrote this, what they were feeling about the movie at the time. So maybe if I see a new review later on, I can just paste in the review, say the movie was blank.
And if the model generates bad or good, that could be implicitly solving the task of sentiment analysis. So here's another one. Iroh went to the kitchen to make some tea. Standing next to Iroh, Zuko pondered his destiny. Zuko left the blank. OK, so in this scenario, we've got a world implicitly that's been designed by the person who is creating this text.
I've got physical locations in the discourse, like the kitchen. And I've got Zuko. Iroh's in the kitchen. Zuko's next to Iroh. So Zuko must be in the kitchen. So what could Zuko leave but the kitchen? And so in terms of latent notions of embodiment and physical location, the way that people talk about people being next to something and then leaving something could tell you stuff about sort of, yeah, a little bit about how the world works even.
So here's a sequence. I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21, blank. And this is a pretty tough one, right? This is the Fibonacci sequence, right? If you had to model by looking at a bunch of numbers from the Fibonacci sequence, learn to, in general, predict the next one, it's a question you should be thinking about throughout the lecture.
OK, any questions on these sort of examples of what you might learn from predicting the context? OK, OK, cool. So a very simple way to think about pre-training is pre-training is language modeling. So we saw language modeling earlier in the course. And now we're just going to say, instead of using my language model just to provide probabilities over the next word, I am going to train it on that task.
I'm going to actually model the distribution p theta of the word t given all the words previous. And there's a ton of data for this, right? There's just an amazing amount of data for this in a lot of languages, especially English. There's very little data for this in actually most of the world's languages, which is a separate problem.
But you can pre-train just through language modeling, right? So I'm going to sort of do the teacher forcing thing. So I have Iroh. I predict goes. I have Iroh goes. I predict to. And I'm going to train my sort of LSTM or my transformer to do this task. And then I'm just going to keep all the weights.
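As a rough sketch of what that teacher-forcing pre-training step looks like in code, here is a toy PyTorch version. The tiny LSTM stand-in, the vocabulary size, and the random batch are all hypothetical; the point is just the shifted next-token cross-entropy loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 1000, 64

# A stand-in language model: embedding -> LSTM -> vocab logits.
# (In the lecture's setting this would be a Transformer decoder instead.)
class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        h, _ = self.lstm(self.embed(tokens))
        return self.out(h)                      # (batch, seq_len, vocab_size)

model = TinyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

tokens = torch.randint(0, vocab_size, (8, 32))      # a batch of token IDs (fake data)
logits = model(tokens[:, :-1])                      # predict each next token...
loss = F.cross_entropy(logits.reshape(-1, vocab_size),
                       tokens[:, 1:].reshape(-1))   # ...against the shifted targets
loss.backward()
opt.step()
# After many such steps, keep all the weights as an initialization for fine-tuning.
```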
OK, I'm going to save all the network parameters. And then once I have these parameters, instead of generating from my language model, I'm just going to use them as an initialization for my parameters. So I have this pre-training fine-tuning paradigm. Two steps. Most of you, I think, in your-- well, maybe not this year.
Let's say a large portion of you this year in your final projects will be doing the pre-training fine-tuning sort of paradigm, where someone has done the pre-training for you, right? So you have a ton of text. You learn very general things about the distribution of words and sort of the latent things that that tells you about the world and about language.
And then in step two, you've got some task, maybe sentiment analysis. And you have maybe not very many labels. You have a little bit of labeled data. And you adapt the pre-trained model to the task that you care about by further doing gradient steps on this task. So you give it the movie was.
You predict happy or sad. And then you sort of continue to update the parameters based on the initialization from the pre-training. And this just works exceptionally well-- I mean, unbelievably well-- compared to training from scratch. Intuitively, because you've taken a lot of the burden of learning about language, learning about the world, off of the data that you've labeled for sentiment analysis.
And you're sort of giving that task of learning all this sort of very general stuff to the much more general task of language modeling. Yes? You said we didn't have much data in other languages. What do you mean by data? Is it just text in that language? Yeah. Or is it labeled in some way?
The question is, you said we have a lot of data in English, but not in other languages. What do you mean by data that we don't have a lot of in other languages? Is it just text? It's literally just text. No annotations. Because you don't need annotations to do language model pre-training, right?
The existence of that sequence of words that someone has written provides you with all these pairs of input and output. Input Iroh, output goes. Input Iroh goes, output to. Those are all labels sort of that you've constructed from the input just existing. But in most languages, even on the entire internet, I mean, there's about 7,000-ish languages on Earth.
And most of them don't have the sort of billions of words you might want to train these systems on. Yeah? If you're pre-training the entire thing, are you still learning one vector representation per word? The question is, if you're pre-training the entire thing, do you still learn one vector representation per word?
You learn one vector representation that is the non-contextual input vector. So you have your vocabulary matrix. You've got your embedding matrix that is vocabulary size by model dimensionality. And so yeah, Iroh has one vector. Goes has one vector. But then the transformer that you're learning on top of it takes in the sequence so far and sort of gives a vector to each of them that's dependent on the context in that case.
But still, at the input, you only have one embedding per word. Yeah? So what sort of metrics would you use to evaluate a pre-trained model? It's supposed to be general. But there's application-specific metrics. So which one do you use? Yeah. So the question is, what metric do you use to evaluate pre-trained models since it's supposed to be so general?
But there are lots of very specific evaluations you could use. We'll get into a lot of that in the rest of the lecture. While you're training it, you can use simple metrics that sort of correlate with what you want but aren't actually what you want, just like the probability quality.
So you can evaluate the perplexity of your language model just like you would have when you cared about language modeling. And it turns out to be the case that better perplexity correlates with all the stuff that's much harder to evaluate, like lots and lots of different tasks. But also, the natural language processing community has built very large sort of benchmark suites of varying tasks to try to get at sort of a notion of generality, although that's very, very difficult.
It's sort of ill-defined, even. And so when you develop new pre-training methods, what you often do is you try to pick a whole bunch of evaluations and show that you do better on all of them. And that's your argument for generality. So why should this sort of pre-training, fine-tuning, two-part paradigm help?
This is still an open area of research, but the intuitions are all you're going to take from this course. So pre-training provides some sort of starting parameters, theta hat. So this is like all the parameters in your network, obtained by trying to do this minimum over all possible settings of your parameters of the pre-training loss.
And then the fine-tuning process takes your data for fine-tuning. You've got some labels. And it tries to approximate the minimum through gradient descent of the loss of the fine-tuning task of theta. But you start at theta hat. So you start gradient descent at theta hat, which your pre-training process gave you.
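Written out, the two stages described in the last couple of sentences are roughly:

$$\hat{\theta} \approx \arg\min_{\theta} \, \mathcal{L}_{\text{pretrain}}(\theta), \qquad \text{then} \qquad \min_{\theta} \, \mathcal{L}_{\text{finetune}}(\theta) \ \text{approximated by gradient descent initialized at } \hat{\theta}.$$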
And then if you could actually solve this min and wanted to, it sort of feels like the starting point shouldn't matter. But it really, really, really does. It really does. So we'll talk a bit more about this later. But the process of gradient descent, maybe it sticks relatively close to the theta hat during fine-tuning.
So you start at theta hat. And then you sort of walk downhill with gradient descent until you hit sort of a valley. And that valley ends up being really good because it's close to the pre-training parameters, which were really good for a lot of things. This is a cool place where sort of practice and theory are sort of like meeting, where optimization people want to understand why this is so useful.
NLP people sort of just want to build better systems. So yeah, maybe the stuff around theta hat tends to generalize well. If you want to work on this kind of thing, you should talk about it. Yeah? So if stochastic gradient descent sticks relatively close, but what if we were to use a different optimizer?
How would that change our results? The question is, if stochastic gradient descent sticks relatively close, what if we use a different optimizer? I mean, if we use sort of any common variant of gradient descent, like any first order method, like Adam, which we use in this course, or AdaGrad, or they all have this very, very similar properties.
Other types of optimization we just tend to not use. So who knows? Yeah? Yeah, I'm still a little unclear on why the pre-training plus fine tuning works better than just fine tuning, but making the model more powerful, like adding more layers, more data, et cetera. Yeah. The question is, why does the pre-trained fine tune paradigm work better than just making the model more powerful, adding more layers, adding more data to just the fine tuning?
The simple answer is that you have orders of magnitude more unlabeled data, just text that you found, than you do carefully labeled data for the tasks that you care about, right? Because that's expensive to get. It has to be examples of your movie reviews or whatever that you've had someone label carefully.
So you have something like on the internet at least 5 trillion, maybe 10 trillion words of this, and you have maybe a million words of your labeled data or whatever over here. So it's just the scale is way off. But there's also an intuition that learning to do a very, very simple thing like sentiment analysis is not going to get you a very generally able agent in a wide range of settings compared to language modeling.
So it's hard to get-- how do I put it? Even if you have a lot of labeled data of movie reviews of the kind that people are writing today, maybe tomorrow they start writing slightly different kinds of movie reviews, and your system doesn't perform as well. Whereas if you pre-trained on a really diverse set of text from a wide range of sources and people, it might be more adaptable to seeing stuff that doesn't quite look like the training data you showed it, even if you showed it a ton of training data.
So one of the big takeaways of pre-training is that you get this huge amount of variety of text on the internet. And you have to be very careful. I mean, yeah, you should be very careful about what kind of text you're showing it and what kind of text you're not, because the internet is full of awful text as well.
But some of that generality just comes from how hard this problem is and how much data you can show it. --pre-trained model was trained on so much data. How do you then train it so that it considers the stuff that you're fine-tuning it with as more important, more salient to the task it's trying to do, rather than just one in a billion articles of data?
Yeah, it's a good question. So the question is, given that the amount of data on the pre-training side is orders of magnitude more than the amount of data on the fine-tuning side, how do you get across to the model that, OK, actually, the fine-tuning task is what I care about.
So focus on that. It's about the fact that I did this first, the pre-training first. And then I do the fine-tuning second. So I've gotten my parameter initialization from this. I've set it somewhere. And then I fine-tune. I move to where the parameters are doing well for this task afterward.
And so, well, it might just forget a lot about how to do this, because now I'm just asking it to do this at this point. I should move on, I think. But we're going to keep talking about this in much more detail with more concrete elements. So OK, so let's talk about model pre-training.
Oh, wait. That did not advance the slides. Nice, OK. Let's talk about model pre-training three ways. In the Transformers lecture, we talked about encoders, encoder-decoders, and decoders. And we'll do decoders last, because actually, many of the largest models that are being used today are all decoders. And so we'll have a bit more to say about them.
So let's recall these three. So encoders get bidirectional context. You have a single sequence, and you're able to see the whole thing, kind of like an encoder in machine translation. Encoder decoders have one portion of the network that gets bidirectional context. So that's like the source sentence of my machine translation system.
And then they're sort of paired with a decoder that gets unidirectional context, so that I have this sort of informational masking where I can't see the future, so that I can do things like language modeling. I can generate the next token of my translation, whatever. So you could think of it as I've got my source sentence here, and my partial translation here, and I'm sort of decoding out the translation.
And then decoders only are things like language models. We've seen a lot of this so far. And there's pre-training for all three sort of large classes of models. And how you pre-train them and then how you use them depends on the properties and the proclivities of the specific architecture.
So let's look at encoders first. So we've looked at language modeling quite a bit. But we can't do language modeling with an encoder, because they get bidirectional context. So if I'm down here at word i, and I want to predict the next word, it's a trivial task at this level here to predict the next word.
Because in the middle, I was able to look at the next word. And so I should just know. There's nothing hard about learning to predict the next word here, because I could just look at it, see what it is, and then copy it over. So when I'm training an encoder in something for pre-training, I have to be a little bit more clever.
In practice, what I do is something like this. I take the input, and I modify it somewhat. I mask out words, sort of like I did in the examples I gave at the beginning of class. So I blank to the blank. And then I have the network predict with its whole-- I have it build contextual representations.
So now this vector representation of the blank sees the entire context around it here. And then I predict the word "went," and then here, the word "store." Any questions? And you can see how this is doing something quite a bit like language modeling, but with bidirectional context. I've removed the network's information about the words that go in the blanks, and I'm training it to reconstruct that.
So I only have loss terms, right? I only ask it to actually do the prediction, compute the loss, backpropagate the gradients for the words that I've masked out. And you can think of this as instead of learning probability of x, where x is like a sentence or a document, this is learning the probability of x, the real document, given x tilde, which is this sort of corrupted document, with some of the information missing.
And so we get the sequence of vectors here, one per word, which is the output of my encoder in blue. And then I'd say that for the words that I want to predict, y_i, I draw them from this distribution. The tilde there means the probability is proportional to my embedding matrix times my representation of the word.
So it's just a linear transformation, A h plus b, of that last hidden state; that's this red portion here. And I do the prediction, and I train the entire network to do this. Yes? So the words that we mask out, do we just select them randomly, or is there some scheme to it?
The question is, do we just choose words randomly to mask out, or is there a scheme? Mostly randomly. We'll talk about a slightly smarter scheme in a couple of slides, but yeah, just mostly randomly. Yeah? What was that last part on the bottom, x, the masked version of-- like, if it's the first or the very last sentence?
Yeah, so I'm saying that I'm defining x tilde to be this input part, where I've got the masked version of the sentence with these words missing. And then I'm defining a probability distribution that's the probability of a sequence conditioned on the input being the corrupted sequence, the masked sequence.
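Here is a minimal sketch of that corruption-and-reconstruction loss, assuming a generic bidirectional Transformer encoder in PyTorch. The mask rate, the model sizes, and the random batch are illustrative, not the recipe of any specific model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model, MASK_ID = 1000, 64, 0

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
embed = nn.Embedding(vocab_size, d_model)
to_vocab = nn.Linear(d_model, vocab_size)   # the "A h + b" prediction layer

tokens = torch.randint(1, vocab_size, (8, 32))        # x: the original sequence
is_masked = torch.rand(tokens.shape) < 0.15           # choose roughly 15% of positions
corrupted = tokens.masked_fill(is_masked, MASK_ID)    # x~: the corrupted sequence

h = encoder(embed(corrupted))               # bidirectional contextual vectors
logits = to_vocab(h)                        # (batch, seq_len, vocab_size)

# Loss only at the positions we masked out: learning p(x | x~).
loss = F.cross_entropy(logits[is_masked], tokens[is_masked])
loss.backward()
```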
So this brings us to a very, very popular NLP model that you need to know about. It's called BERT. And it was the first one to popularize this masked language modeling objective. And they released the weights of this pre-trained transformer that they pre-trained via something that looks a lot like masked language modeling.
And so these you can download. You can use them via code that's released by the company HuggingFace that we have continued to bring up. Many of you will use a model like BERT in your final project because it's such a useful builder of representations of language and context. So let's talk a little bit about the details of masked language modeling in BERT.
First, we take 15% of the subword tokens. So remember, all of our inputs now are subword tokens. I've made them all look like words. But just like we saw at the very beginning of class, each of these tokens could just be some portion, some subword. And I'm going to do a couple of things with it.
Sometimes I am going to just mask out the word and then predict the true word. Sometimes I'm going to replace the word with some random sample of another word from my vocabulary and predict the real word that was supposed to go there. And sometimes I'm going to not change the word at all and still predict it.
The intuition of this is the following. If I just had to build good representations in the middle of this network for words that are masked out, then when I actually use the model at test time on some real review to do sentiment analysis on, well, there are never going to be any tokens like this.
So maybe the model won't do a very good job because it's like, oh, I have no job to do here because I only need to deal with the mask tokens. By giving it sequences of words where sometimes it's the real word that needs to be predicted, sometimes you have to detect if the word is wrong.
The idea is that now when I give it a sentence that doesn't have any masks, it actually does a good job of representing all the words in context, because it has this chance that it could be asked to predict anything at any time.
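As a hedged sketch, that 80/10/10 corruption rule might look like this in code; the tensor bookkeeping here is just one way to write it, and the special token IDs and shapes are made up.

```python
import torch

def bert_corrupt(tokens, vocab_size, mask_id, select_prob=0.15):
    """Return (corrupted tokens, boolean prediction positions), BERT-style."""
    targets = torch.rand(tokens.shape) < select_prob   # ~15% of subword tokens
    corrupted = tokens.clone()

    roll = torch.rand(tokens.shape)
    replace_mask   = targets & (roll < 0.8)                  # 80%: [MASK] token
    replace_random = targets & (roll >= 0.8) & (roll < 0.9)  # 10%: random token
    # remaining 10%: leave the token unchanged but still predict it

    corrupted[replace_mask] = mask_id
    corrupted[replace_random] = torch.randint(
        0, vocab_size, (int(replace_random.sum()),))
    return corrupted, targets

tokens = torch.randint(1, 1000, (2, 16))
corrupted, targets = bert_corrupt(tokens, vocab_size=1000, mask_id=0)
```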
OK, so the folks at Google who were defining this had a separate additional task that is sort of interesting to think about. So this was their BERT model from their paper. They had their position embeddings just like we saw from our Transformers lecture, token embeddings just like we saw from the Transformers lecture. But then also they had this thing called a segment embedding, where they had two possible segments, segment A and segment B.
And they had this additional task where they would get a big chunk of text for segment A and a big chunk of text for segment B. And then they would ask the model, is segment B a real continuation of segment A? Was it the text that actually came next?
Or did I just pick this big segment randomly from somewhere else? And the idea was that this should teach the network some notion of long distance coherence about the connection between a bunch of text over here and a bunch of text over there. Turns out it's not really necessary, but it's an interesting idea.
And similar things have continued to have some sort of influence since then. But again, you should get this intuition that we're trying to come up with hard problems for the network to solve such that by solving them, it has to learn a lot about language. And we're defining those problems by making simple transformations or removing information from text that just happened to occur.
Questions? Yeah. The plus signs, do we concatenate the vectors, or do we do an element-wise addition? The question is, for these plus signs, do we concatenate the vectors or do element-wise addition? We do element-wise addition. You could have concatenated them. However, one of the big conventions of all of these networks is that you always have exactly the same number of dimensions everywhere at every layer of the network.
It just makes everything very simple. So just saying everything's the same dimension and then doing addition just ends up being simpler. So why was the next sentence prediction not necessary? What's the main question for that? Yeah, why was the next sentence prediction not necessary? One thing that it does that's a negative is that now the effective context length for a lot of your examples is halved.
So one of the things that's useful about pre-training seemingly is that you get to build representations of very long sequences of text. This is very short, but in practice, segment A was going to be something like 250 words, and segment B was going to be 250 words. And in the paper that let us know that this wasn't necessary, they always had a long segment of 500 words.
And it seemed to be useful to always have this very long context because longer contexts help give you more information about the role that each word is playing in that specific context. If I see one word, it's hard to know. If I just see record, it's hard to know what it's supposed to mean.
But if I see 1,000 words around it, it's much clearer what its role in that context is. So yeah, it cuts the effective context size, is one answer. OK. Another thing is that this is actually much more difficult. This is a much more recent paper that I don't have in the slides.
But it's been shown since then that these models are really, really bad at the next sentence prediction task. So it could be that maybe it just was too hard at the time. And so it just wasn't useful because the model was failing to do it at all. So I can give the link for that paper later.
Can you explain again why we need to do a next sentence prediction? What about just masking and predicting the next? I missed that jump. So it's the next sentence. Yeah. So the question is, why do we need to do next sentence prediction? Why not just do the masking we saw before?
That's the thing. You seem to not need to do next sentence prediction. But as history of the research, it was thought that this was useful. And the idea was that it required you to develop this pairwise, do these two segments of text interact? How do they interact? Are they related?
The sort of longer distance notion. And many NLP tasks are defined on pairs of things. And they thought that might be useful. And so they published it with this. And then someone else came through, published a new model that didn't do that. And it sort of did better. So this is just-- yeah.
So yeah. There are intuitions as to why it could work. It just didn't. So BERT wasn't doing masking or was doing-- It was doing both. It was doing both. It was doing both this next sentence-- so BERT was doing both this next sentence prediction training as well as this masking training all at the same time.
And so you had to have a separate predictor head on top of BERT, a separate predictor sort of classification thing. And so one detail there is that there's this special word at the beginning of BERT in every sequence that's CLS. And you can define a predictor on top of that sort of fake word embedding that was going to say, is the next sentence real or fake or not?
Yeah. OK, I'm going to move on. And so this gets at sort of the question that we had earlier about how do you evaluate these things. There's a lot of different NLP tasks out there. Gosh. And when people were defining these papers, they would look at a ton of different evaluations that had been sort of compiled as a set of things that are a little hard for today's systems.
So are you detecting paraphrases between questions? Are two Quora questions actually the same question? That turns out to be hard. Can you do sentiment analysis on this hard data set? Can you tell if sentences are linguistically acceptable? Are they grammatical or not? Are two sequences similar semantically? Do they mean sort of vaguely the similar thing?
And we'll talk a bit about natural language inference later, but that's the task of defining sort of if I say, you know, I saw the dog, that does not necessarily mean I saw the little dog. But saying I saw the little dog does mean I saw the dog. So that's sort of this natural language inference task.
And the difference between sort of the pre-pre-training days, this row here, before you had substantial amounts of pre-training, and BERT was such that the field was taken aback in a way that's hard to describe. You know, people had very carefully crafted architectures for each individual task, where everyone was designing their own neural network and doing things that they thought were sort of clever as to how to define all the connections and the weights and whatever to do their tasks independently.
So everyone was doing a different thing for each one of these tasks, roughly. All of that was blown out of the water by just build a big transformer and just teach it to predict the missing words a whole bunch and then fine tune it on each of these tasks.
So this was just a sea change in the field. People were, I mean, amazed. It's a little bit less flashy than ChatGPT, I'll admit. But it's really part of the story that gets us to it, you know? OK, questions? So, to get stuff out of the encoder: during the encoder pre-training stage, the encoder usually outputs some sort of hidden values.
How do we correlate those to words that we are trying to test against? So the question is, the encoder output is a bunch of hidden values. How do we actually correlate those values to stuff that we want to predict? I'm going to go on to the next slide here to bring up this example here, right?
So the encoder gives us, for each input word token, a vector of that token that represents the token in context. And the question is, how do we get these representations and turn them into sort of answers for the tasks that we care about? And the answer comes back to something like this.
Something like this, maybe? Sure. So when we were doing the pre-training, we had the transformer that was giving us our representations. And we had this little last layer here, this little sort of affine transformation that moved us from the encoder's hidden state size to the vocabulary to do our prediction.
And we just removed this last prediction layer here. And let's say we want to do something that is classifying the sentiment of the sentence. We just pick arbitrarily maybe the last word in the sentence. And we stick a linear classifier on top and map it to positive or negative, and then fine tune the whole thing.
OK. So yeah, the BERT model had two different sizes. One was 110 million parameters. One was 340 million. Keep that sort of in the back of your head, percolating, as we talk about models with many, many more parameters later on. It was trained on about 800 million words of books plus roughly 2.5 billion words of English Wikipedia, so on the order of a few billion words of text, quite a bit, still.
And it was trained on what was considered at the time to be a whole lot of compute. It was Google doing this. And they released it. And we were like, oh, who has that kind of compute? Well, Google. Although nowadays, it's not considered to be very much. But fine-tuning is practical and common on a single GPU.
So you could take the BERT model that they've spent a lot of time training and fine tune it yourself on your task on even sort of a very small GPU. OK. So one question is like, well, this seems really great. Why don't we just use this for everything? Yeah.
And the answer is, well, what is the sort of pre-training objective? What's the structure of the pre-trained model good for? BERT is really good for sort of filling in the blanks. But it's much less naturally used for actually generating text. So I wouldn't want to use BERT to generate a summary of something because it's not really built for it.
It doesn't have a natural notion of predicting the next word given all the words that came before it. So maybe I want to use BERT if I want a good representation of, say, a document to classify it, give it a set of topic labels, or say it's toxic or non-toxic or whatever.
But I wouldn't want to use it to generate a whole sequence. OK. Some extensions of BERT. So we had a question earlier of whether you just mask things out randomly. One thing that seems to work better is you mask out sort of whole contiguous spans because sort of the difficulty of this problem is much easier than it would otherwise be because sort of this is part of irresistibly.
And you can tell very easily based on the sort of subwords that came before it. Whereas if I have a much longer sequence, it's a trade-off. But this might be a harder problem. And it ends up being better to do this sort of span-based masking than random masking. And that might be because subwords make very simple prediction problems when you mask out just one subword of a word versus all the subwords of a word.
OK. So this ends up doing much better. There's also a paper called the RoBERTa paper, which showed that the next sentence prediction wasn't necessary. They also showed that they really should have trained it on a lot more text. So RoBERTa is a drop-in replacement for BERT. So if you're thinking of using BERT, just use RoBERTa.
It's better. And it gave us this intuition that we really don't know a whole lot about the best practices for training these things. You sort of train it for as long as you're willing to. And things do good stuff and whatever. So this is very-- but it's very difficult to do sort of iteration on these models because they're big.
It's expensive to train them. Another thing that you should know for your final projects in the world ahead is this notion of fine-tuning all parameters of the network versus just a couple of them. So what we've talked about so far is you pre-train all the parameters and then you fine-tune all of them as well.
So all the parameter values change. The alternative, which you call parameter-efficient or lightweight fine-tuning, is that you choose a small set of parameters, or some smart way of keeping most of the parameters fixed and only fine-tuning others. And the intuition is that these pre-trained parameters were really good.
And you want to make the minimal change from the pre-trained model to the model that does what you want so that you keep some of the generality, some of the goodness of the pre-training. So one way that this is done is called prefix tuning. Prompt tuning is very similar, where you actually freeze all the parameters of the network.
So I've pre-trained my network here. And I never change any of the parameter values. Instead, I make a bunch of fake sort of pseudo word vectors that I prepend to the very beginning of the sequence. And I train just them. Sort of unintuitive. It's like these would have been like inputs to the network, but I'm specifying them as parameters.
And I'm training everything to do my sentiment analysis task just by changing the values of these sort of fake words. And this is nice because I get to keep all the good pre-trained parameters and then just specify the sort of diff that ends up generalizing better. This is a very open field of research.
But this is also cheaper, because I don't have to store the gradients and all the optimizer state with respect to all of the network's parameters; I'm only training a very small number of parameters. Yeah. Could you put those fake parameters at the end, as opposed to here at the beginning?
It doesn't make any difference if you put these at the end or the beginning. In a decoder, you have to put them at the beginning because otherwise you don't see them before you process the whole sequence. Yes. Can we just attach a few layers and only train the new layers?
The question is, can we just attach a few new layers at the top of this and only train those? Absolutely. This works a bit better. Another thing that works well-- sorry, we're running out of time-- is taking each weight matrix. So I have a bunch of weight matrices in my transformer.
And I freeze the weight matrix and learn a very low rank little diff. And I set the weight matrix's value to be sort of the original value plus my sort of very low rank diff from the original one. And this ends up being a very similarly useful technique. And the overall idea here is that, again, I'm learning way fewer parameters than I did via pre-training and freezing most of the pre-training parameters.
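A minimal sketch of that low-rank idea, in the spirit of the method known as LoRA; the rank, the layer sizes, and the initialization are arbitrary, and a real implementation would also handle the bias term and a scaling factor.

```python
import torch
import torch.nn as nn

class LowRankAdaptedLinear(nn.Module):
    """Frozen pre-trained weight W plus a trainable low-rank diff B @ A."""
    def __init__(self, pretrained_linear, rank=8):
        super().__init__()
        self.weight = pretrained_linear.weight           # (out, in), kept frozen
        self.weight.requires_grad_(False)
        out_dim, in_dim = self.weight.shape
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_dim, rank))  # starts as a zero diff

    def forward(self, x):
        effective_weight = self.weight + self.B @ self.A   # W + low-rank update
        return x @ effective_weight.T

layer = LowRankAdaptedLinear(nn.Linear(512, 512), rank=8)
y = layer(torch.randn(4, 512))   # only A and B receive gradients
```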
OK, encoder-decoders. So for encoder-decoders, we could do something like language modeling. I've got my input sequence here, encoder, output sequence here. And I could say this part is my prefix for sort of having bidirectional context. And I could then predict all the words that are sort of in the latter half of the sequence, just like a language model.
And that would work fine. And so this is something that you could do. You sort of take a long text, split it into two, give half of it to the encoder, and then generate the second half with the decoder. But in practice, what works much better is this notion of span corruption.
Span corruption is going to show up in your assignment 5. And the idea here is a lot like BERT, but in a sort of generative sense, where I'm going to mask out a bunch of spans in the input. So the original text, thank you for inviting me to your party last week, becomes thank you, mask token 1, me to your party, mask token 2, week.
And then at the output, I generate each mask token and then what was supposed to be there, the words that the mask token replaced. So I predict mask token 1, for inviting, then mask token 2, last.
I get to see the whole sequence, except I can generate the parts that were missing. So this feels a little bit like you mask out parts of the input, but you actually generate the output as a sequence like you would in language modeling. So this might be good for something like machine translation, where I have an input that I want bidirectional context in, but then I want to generate an output.
And I want to pre-train the whole thing. So this was shown to work better than language modeling at the scales that these folks at Google were able to test back in 2019. This is still quite popular. Yeah, there's a slide with a lot of numbers; the takeaway is just that it works better than the other stuff, and I'm not going to worry about the details.
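A small sketch of how such input/output pairs might be constructed from raw text; the sentinel naming and the hand-picked spans here are just for illustration, since real implementations sample span positions and lengths randomly.

```python
def span_corrupt(words, spans):
    """Replace each (start, end) span with a sentinel token; the target spells
    out each sentinel followed by the words it replaced."""
    input_words, target_words = [], []
    last = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<X{i}>"
        input_words += words[last:start] + [sentinel]
        target_words += [sentinel] + words[start:end]
        last = end
    input_words += words[last:]
    return " ".join(input_words), " ".join(target_words)

words = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(words, spans=[(2, 4), (8, 9)])
# inp == "Thank you <X0> me to your party <X1> week"
# tgt == "<X0> for inviting <X1> last"
```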
There's a fascinating property of these models also. So T5 was the model that was originally introduced with span corruption, and variants of it were also trained with salient span masking, where the masked-out spans are things like named entities and dates. And you can think of it as, at pre-training time, you saw a bunch of things like Franklin D. Roosevelt was born in blank, and you generated out the blank.
And there's this task called open domain question answering, which has a bunch of trivia questions, like when was Franklin D. Roosevelt born? And then you're supposed to generate out the answer as a string, just from your parameters. So you did a bunch of pre-training. You saw a bunch of text.
And then you're supposed to generate these answers. And what's fascinating is that this salient span masking method allowed you to pre-train and then fine tune on some examples of trivia questions. And then when you tested on new trivia questions, the model would implicitly extract from its pre-training data somehow the answer to that new question that it never saw explicitly at fine tuning time.
So it learned this sort of implicit retrieval-- sometimes, sometimes, less than 50% of the time or whatever, but much more than random chance. And that's fascinating. So you've learned to access this latent knowledge that you stored up by pre-training. And so you just pass it the text, when was Roosevelt born, and it would pass out an answer.
And one thing to know is that the answers always look very fluent. They always look very reasonable. But they're frequently wrong. And that's still true of things like ChatGPT. Yeah. OK, so that's encoder-decoder models. Next up, we've got decoders. And we'll spend a long time on decoders. So this is just our normal language model.
So I get a sequence of hidden states from my decoder. The words can only look at themselves and the past, not the future. And then I predict the next word in the sentence. And then here again, to do sentiment analysis, I can maybe take the last state for the last word, and then predict happy or sad based on that last embedding.
Back-propagate the gradients of the whole network, train the whole thing, or do some kind of lightweight or parameter-efficient fine-tuning, like we mentioned earlier. So this is our pre-training a decoder. And I can just pre-train it on language modeling. So again, you might want to do this if you are wanting to generate texts, generate things.
You sort of can use this like you use an encoder-decoder. But in practice, as we'll see, a lot of the sort of biggest, most powerful pre-trained models tend to be decoder-only. It's not really clear exactly why, except they seem a little bit simpler than encoder-decoders. And you get to share all the parameters in one big network for the decoder, whereas in an encoder-decoder, you have to split them, sort of some into the encoder, some into the decoder.
So for the rest of this lecture, we'll talk only about decoders; the biggest modern networks do tend to be decoders. So we're coming all the way back again to 2018, and the GPT model from OpenAI was a big success. It had 117 million parameters and 768-dimensional hidden states.
And it had a vocabulary of 40,000-ish subwords, defined via a method like the one we showed at the beginning of class, and it was trained on BooksCorpus. And actually, the name GPT never showed up in the original paper; it's unclear exactly what it's supposed to stand for. But this model was a precursor to all the things that you're hearing about nowadays.
So if we wanted to do something like natural language inference, which takes pairs of sentences, like "the man is in the doorway" and "the person is near the door," and asks whether the premise entails the hypothesis, that is, whether I can believe the hypothesis if I believe the premise, I'd just concatenate them together.
So give it maybe a start token, pass in one sentence, pass in some delimiter token, pass in the other, and then predict entailment or not entailment. Fine-tuning GPT this way worked really well.
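Concretely, that input is just one flat sequence of tokens. A minimal sketch, where the bracketed special-token names are illustrative placeholders rather than the exact tokens the original paper used:

```python
# Format a premise/hypothesis pair as a single sequence for a decoder.
# The bracketed special tokens are placeholders for illustration.
def format_nli(premise, hypothesis):
    return f"[START] {premise} [DELIM] {hypothesis} [EXTRACT]"

example = format_nli("The man is in the doorway.",
                     "The person is near the door.")
print(example)
# You'd run this through the pre-trained decoder, take the hidden state
# above the final token, and train a small linear head on it to predict
# entailment vs. not entailment.
```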
And then BERT came after GPT. With its bidirectional context, BERT did a bit better and did an excellent job on these benchmarks. And then came GPT-2, where they focused more on the generative abilities of the network. So we're looking now at a much larger network: we've gone from 117 million to 1.5 billion parameters. And given some sort of prompt, it could generate, at the time, a quite surprisingly coherent continuation to that prompt.
So it's telling this sort of story about scientists and unicorns here. And this size of model is still small enough that you can run it on a small GPU, fine-tune it, and so on. And its ability to generate long, coherent text was just exceptional at the time.
It was also trained on more data, something like 9 billion words of text. And then after GPT-2, sort of walking through these models, we come to GPT-3, and with it a different way of interacting with the models. So far we've interacted with pre-trained models in two ways.
We've sort of sampled from the distribution that they define. We generated text via a machine translation system or whatever. Or we fine-tuned them on a task that we care about. And then we take their predictions. But GPT-3 seems to have an interesting new ability. It's much larger. And it can do some tasks without any sort of fine-tuning whatsoever.
GPT-3 is much larger than GPT-2. We went from GPT at 100-ish million parameters, to GPT-2 at 1.5 billion, to GPT-3 at 175 billion, trained on 300 billion words of text. And this notion that the model can figure out patterns in the example it's currently seeing and continue the pattern is called in-context learning.
So you've got the word "thanks." And I pass in this little arrow and say, OK, thanks goes to merci. And then hello goes to bonjour. And then I give it all of these examples and ask it what otter should go to. And it's learned to sort of continue the pattern and say that this is the translation of otter.
So now, remember, this is a single sort of input that I've given to my model. And I haven't said, oh, do translation or fine-tune it on translation or whatever. I've just passed in the input, given it some examples. And then it is able to, to some extent, do this seemingly complex task.
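To be concrete, that whole interaction is just one string fed to the model. Something like the following, where the `generate` call is a hypothetical API rather than any specific library's:

```python
# The entire "task specification" is this one prompt string.
prompt = (
    "thanks => merci\n"
    "hello => bonjour\n"
    "mint => menthe\n"
    "otter =>"
)

# Hypothetical call: sample a continuation from a pre-trained decoder.
# completion = language_model.generate(prompt)
# With a large enough model, the continuation tends to be the French word
# for otter, even though we never said "do translation" anywhere.
```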
That's in-context learning. And here are more examples. Maybe you give it examples of addition, and then it can do some simple addition afterward. Or you give it examples of correcting typos, and it can figure out how to correct typos; same for in-context machine translation. And this was the start of this idea that there were emergent properties that only showed up in much larger models.
And it wasn't clear, looking at the smaller models, that you'd get this qualitatively new behavior out of them. It's not obvious from just the language modeling signal, right? GPT-3 is trained decoder-only, just predicting the next word, and yet, as a result of that training, it learns to perform seemingly quite complex tasks as a function of its context.
Yeah, OK. One or two questions about that. This should be quite surprising, I think, right? So far, we've talked about good representations, contextual representations, meanings of words in context. This is some very, very high-level pattern matching: it's picking up on patterns in just the one sequence of text that you've passed it so far, and it's able to identify how to complete the pattern.
And you think, what kinds of things can this solve? What are its capabilities? What are its limitations? This ends up being an open area of research. Sort of, what are the kinds of problems that you maybe saw in the training data a lot? Maybe GPT-3 saw a ton of pairs of words, right?
It saw a bunch of dictionaries, bilingual dictionaries in its training data. So it learned to do something like this. Or is it doing something much more general, where it's really learning the task in context? The actual story, we're not totally sure. Something in the middle. It seems like it has to be tied to your training data in ways that we don't quite understand.
But there's also a non-trivial ability to learn at least new types of patterns just from the context. So this is a very interesting thing to work on. Now, we've talked a lot about the size of these models so far. And as models have gotten larger, they've always gotten better.
We train them on more data. So GPT-3 was trained on 300 billion words of text, and it had 175 billion parameters. At that scale, it costs a lot of money to build these things, and it's very unclear whether you're getting the best use out of your money. Is bigger really what you should have been doing, in terms of the number of parameters?
So the cost of training one of these is roughly the number of parameters multiplied by the number of tokens, the number of words, that you're going to train it on. And some folks at DeepMind (I forgot the citation on this) realized through some experimentation that GPT-3 was actually just comically oversized.
So Chinchilla, the model they trained, is less than half the size and works better; they just trained it on way more data. And this is an interesting trade-off: how do you best spend your compute? I mean, you can't run this experiment more than a handful of times, even if you're Google.
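Here's that back-of-envelope calculation written out, using the common rule of thumb that training compute is roughly 6 times parameters times tokens in floating-point operations; the parameter and token counts are the rough published figures for the two models.

```python
def train_flops(n_params, n_tokens):
    # Rule-of-thumb estimate: ~6 FLOPs per parameter per training token.
    return 6 * n_params * n_tokens

gpt3       = train_flops(175e9, 300e9)   # 175B params, 300B tokens
chinchilla = train_flops(70e9, 1.4e12)   # 70B params, ~1.4T tokens

print(f"GPT-3:      {gpt3:.1e} FLOPs")        # ~3.1e+23
print(f"Chinchilla: {chinchilla:.1e} FLOPs")  # ~5.9e+23
# A comparable order of magnitude of compute, but spent on a model less
# than half the size trained on far more data, and it works better.
```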
So open questions there as well. Another way of interacting with these networks that has come out recently is called chain of thought. We saw on the in-context learning slide that the prefix can help specify what task you're trying to solve right now. And it can do even more.
So here's standard prompting. We have a prefix of examples of questions and answers. So you have a question and then an example answer. So that's your prompt that's specifying the task. And then you have a new question. And you're having the model generate an answer. And it generates it wrong.
And chain of thought prompting says, well, how about in the example, in the demonstration we give, we give the question. And then we give this sort of decomposition of steps towards how to get an answer. So I'm actually writing this out as part of the input. I'm giving annotations as a human to say, oh, to solve this sort of word problem, here's how you could think it through-ish.
And then I give it a new question. And the model says, oh, I know what I'm supposed to do: I'm supposed to first generate a sequence of intermediate steps, then say "the answer is," and then say what the answer is. And it turns out, and this should, again, be very surprising, that the model tends to generate plausible sequences of steps, and then much more frequently generates the correct answer after doing so, relative to trying to generate the answer by itself.
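Here's a toy version of the two prompting styles side by side; the arithmetic word problems are illustrative, in the spirit of the examples on the slide.

```python
# Standard prompting: the demonstration maps a question straight to an answer.
standard_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls "
    "each. How many tennis balls does he have now?\n"
    "A: The answer is 11.\n"
    "\n"
    "Q: The cafeteria had 23 apples. They used 20 to make lunch and bought "
    "6 more. How many apples do they have?\n"
    "A:"
)

# Chain-of-thought prompting: the demonstration writes out intermediate
# steps, so the model learns to generate its own steps before the answer.
chain_of_thought_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls "
    "each. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 "
    "tennis balls. 5 + 6 = 11. The answer is 11.\n"
    "\n"
    "Q: The cafeteria had 23 apples. They used 20 to make lunch and bought "
    "6 more. How many apples do they have?\n"
    "A:"
)
# With the first prompt, the model often just guesses a number; with the
# second, it tends to write steps like "23 - 20 = 3, 3 + 6 = 9" and then
# give the correct answer.
```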
So you can think of this as a scratch pad. You can think of this as increasing the amount of computation that you're putting into trying to solve the problem, sort of writing out your thoughts. Right? As I generate each word of this continuation here, I'm able to condition on all the past words so far.
And so maybe it just allows the network to sort of decompose the problem into smaller, simpler problems, which it's more able to solve each. No one's really sure why this works exactly either. At this point, with networks that are this large, their emergent properties are both very powerful and exceptionally hard to understand, and very hard, you should think, to trust.
Because it's unclear what its capabilities are and what its limitations are, where it will fail. So what do we think pre-training is teaching? Gosh, a wide range of things, even beyond what I've written in this slide, which I mostly wrote two years ago. So it can teach you trivia, and syntax, and coreference, and maybe some lexical semantics, and sentiment, and some reasoning, like way more reasoning than we would have thought even three years ago.
And yet, they also learn and exacerbate racism and sexism, all manner of biases. There'll be more on this later. But the generality of this is really, I think, what's taken many people aback. And so increasingly, these objects are not just studied for the sake of using them, but studied for the sake of understanding anything about how they work and how they fail.
Yeah, any questions? Has anyone tried benchmarking GPT for programming tasks, like how accurately it does, et cetera? The question is, has anyone tried benchmarking GPT for programming tasks? Anyone seen how well it does? Yes, so there's definitely examples of people using GPT-3 for simple programming things. And then the modern, state-of-the-art, competitive programming bots are all based on ideas from language modeling.
And I think they're all also based on pre-trained language models themselves. If you just take all of these ideas and apply them to GitHub, then some very interesting emergent behaviors relating to code fall out. And so yeah, I think all of the best systems use this, more or less.
So lots of benchmarking there, for sure. Is that the basis for what GitHub Copilot's trying to do? The question is, is this the basis? Is what we just mentioned the basis for the GitHub Copilot system? Yes, absolutely. We don't know exactly what it is in terms of details, but it's all these ideas.
What if you have a situation where you have still a large amount of data for general data, and then you have also a large amount of data for your fine-tuning task? At what point is it better to train a new model for that fine-tuning versus get data from both?
Yeah, the question is, what if you have a large amount of data for pre-training and a large amount of data for fine-tuning? When is it better to do a separate training on just the fine-tuning data? Almost never. If you have a bunch of data for the task that you care about, what's frequently done instead is three-part training, where you pre-train on a very broad corpus.
Then you continue to pre-train using something like language modeling on an unlabeled version of the label data that you have. You just strip the labels off and just treat it all as text and do language modeling on that, adapt the parameters a little bit, and then do the final stage of fine-tuning with the labels that you want, and that works even better.
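As a sketch, that three-stage recipe looks something like the following; the loss functions, the `update` helper, and the dataset names are all placeholders for whatever framework and data you'd actually use.

```python
def three_stage_training(model, update, lm_loss, task_loss,
                         broad_corpus, unlabeled_task_text, labeled_task_data):
    """Sketch of the three-stage recipe; every argument is a placeholder.

    `update(model, loss)` is assumed to apply one gradient step,
    `lm_loss(model, text_batch)` is a language-modeling loss, and
    `task_loss(model, inputs, labels)` is the supervised loss you care about.
    """
    # 1. Pre-train on a very broad unlabeled corpus with language modeling.
    for text_batch in broad_corpus:
        update(model, lm_loss(model, text_batch))

    # 2. Continued pre-training: strip the labels off your task data and
    #    keep doing language modeling on that in-domain text.
    for text_batch in unlabeled_task_text:
        update(model, lm_loss(model, text_batch))

    # 3. Final stage: fine-tune with the labels you actually want.
    for inputs, labels in labeled_task_data:
        update(model, task_loss(model, inputs, labels))
```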
There's an interesting paper called Don't Stop Pretraining. Nice. Final question. We've had a lot of questions; anyone new with a question? Yes. Yeah, I was wondering, do you know if there are instances where a pre-trained model can do some task that it hasn't seen before, even without fine-tuning?
Yeah, so are there any instances where a pre-trained model can do a task that it hasn't seen before without fine-tuning? The question is, what does "hasn't seen before" mean? These models, especially GPT-3 and similar very large models, during pre-training, did they ever see something exactly like this sort of word-problem arithmetic?
Maybe, maybe not. It's actually sort of unclear. It's clearly able to recombine bits and pieces of tasks that it saw implicitly during pre-training. We saw the same thing with trivia. Language modeling looks a lot like trivia sometimes, where you just read the first paragraph of a Wikipedia page, and it's kind of like answering a bunch of little trivia questions about where someone was born and when.
But it's never seen something quite like this. And it's actually still kind of astounding how much it's able to do things that don't seem like they should have shown up all that directly in the pre-training data. Quantifying that extent is an open research problem. OK, that's it. Let's call it.
Exactly.