Back to Index

Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 18 - Future of NLP + Deep Learning


Chapters

0:00
2:35 General Representation Learning Recipe
4:46 Facts about GPT-3
21:40 Systematicity and Language Grounding
22:18 Principle of Compositionality
22:38 Are Human Languages Really Compositional
24:04 Are Neural Representations Compositional
28:01 The Information Bottleneck Theory
30:19 Producing Compositionally Challenging Splits
30:29 Normalized Frequency Distribution of the Atoms
34:08 Dynamic Benchmarks
37:33 Language Grounding
44:21 Breakpoints
58:37 Attention

Transcript

Good afternoon, folks. Welcome to lecture 18. Today, we'll be talking about some of the latest and greatest developments in neural NLP, where we've come and where we're headed. Chris, just to be sure, are my slides visible from your side? They're visible. Okay, but not my presenter notes, right?

Correct. Okay, thank you. So just as a reminder, note that your guest lecture reactions are due tomorrow at 11.59 PM. Great job with the project milestone reports. You should have received feedback by now. If not, contact the course staff. I think we had some last-minute issues, but if that's not resolved, please contact us.

Finally, the project reports are due very soon, on March 16th, which is next week. There was one question on Ed about the leaderboard: the last day to submit to the leaderboard is March 19th. Okay, so for today, we'll start by talking about extremely large language models and GPT-3, which have recently gained a lot of popularity.

We'll then take a closer look at the compositionality and generalization abilities of these neural models. While transformer models like BERT and GPT get really high performance on all our benchmarks, they still fail in really surprising ways when deployed. How can we improve the way we evaluate these models so that our benchmarks more closely reflect task performance in the real world?

And then we'll end by talking about how we can move beyond this really limited paradigm of teaching models language only through text and look at language grounding. Finally, I'll give some practical tips on how to move forward in your neural NLP research, and this will include some practical tips for the final project as well.

So this meme really captures what's been going on in the field: our ability to harness unlabeled data has vastly increased over the last few years. And this has been made possible due to advances in not just hardware, but also systems and our understanding of self-supervised training, so we can use lots and lots of unlabeled data.

So based on this, here is a general representation learning recipe that just works for basically most modalities. The recipe is really modality agnostic, and it goes as follows. In step 1, you take your data, whether it's images, text, or video, and you convert it into a sequence of integers.

In step 2, you define a loss function to maximize data likelihood, or create a denoising autoencoder loss. Finally, in step 3, you train on lots and lots of data. Different properties emerge only when we scale up model size, and this is really the surprising fact about scale. So to give some examples of this recipe in action, here's GPT-3, which can learn to do a really non-trivial classification problem with just two demonstrations.
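To make the recipe concrete, here is a minimal, hypothetical PyTorch-style sketch of the three steps; the tiny vocabulary, the single sentence, and the two-layer stand-in model are invented for illustration and are not any real system.

```python
# Toy sketch of the three-step recipe, not any specific model.
import torch
import torch.nn as nn

# Step 1: a (hypothetical) tokenizer maps raw data to integer IDs.
vocab = {"<pad>": 0, "the": 1, "cat": 2, "sat": 3}
tokens = torch.tensor([[1, 2, 3]])  # "the cat sat"

# Step 2: a loss that maximizes data likelihood (next-token prediction).
model = nn.Sequential(nn.Embedding(len(vocab), 32),
                      nn.Linear(32, len(vocab)))  # stand-in for a Transformer
logits = model(tokens[:, :-1])                    # predict token t+1 from token t
loss = nn.functional.cross_entropy(logits.reshape(-1, len(vocab)),
                                   tokens[:, 1:].reshape(-1))

# Step 3: train on lots and lots of data (a single optimization step shown).
loss.backward()
```

In practice, step 3 is the hard part: the same loss, run over vastly more data with a vastly larger model.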

And we'll talk more about this soon. Another example, as we saw in lecture 14, is T5, which does really effective closed-book QA by storing knowledge in its parameters. Finally, just so I cover another modality, here's a recent text-to-image generation model with really impressive zero-shot generalization. OK, so now let's talk about GPT-3.

So how big really are these models? This table presents some numbers to put things in perspective. So we have a collection of models starting with medium-sized LSTMs, which were a staple of pre-2016 NLP, all the way to humans, who have on the order of 100 trillion synapses. And somewhere in the middle, we have GPT-2 with over a billion parameters, and GPT-3 with 175 billion parameters.

And this exceeds the number of synaptic connections in a honeybee brain. So obviously, anyone with a little knowledge of neuroscience knows that this is not an apples-to-apples comparison. But the point here is that the scale of these models is really starting to reach astronomical numbers.

So here are some facts about GPT-3. For one, it's a large transformer with 96 layers. It has more or less the same architecture as GPT-2, with the exception that, to scale up the attention computation, it uses these locally banded sparse attention patterns. And I really encourage you to look at the paper to understand the details.

The reason we mention this here is because it kind of highlights that scaling up is not simply a matter of changing hyperparameters, as many might believe. It involves really non-trivial engineering and algorithms to make the computations efficient. Finally, all of this is trained on 500 billion tokens taken from the Common Crawl, the Toronto Books Corpus, and Wikipedia.

So what's new about GPT-3? Let's look at some of the results in the paper first. So obviously, it does better on language modeling and text completion problems. As you can see from this table, it does better than GPT-2 at language modeling on the Penn Treebank, as well as better on the story completion data set called LAMBADA.

To give a flavor of what's to come, let's take a closer look at this LAMBADA story completion data set. So the task here is that we're given a short story, and we are supposed to fill in the last word. Satisfying the constraints of the problem can be hard for a language model, which could generate a multi-word completion.

But with GPT-3, the really new thing is that we can just give a few examples as prompts and sort of communicate a task specification to the model. And now GPT-3 knows that the completion must be a single word. This is a very, very powerful paradigm. And we'll see some more examples of this in-context learning in a couple more slides.

So apart from language modeling, it's really good at these knowledge-intensive tasks, like closed-book QA, as well as reading comprehension. And here, we observe that scaling up parameters results in a massive improvement in performance. So now let's talk about in-context learning. GPT-3 demonstrates some level of fast adaptation to completely new tasks.

This happens via what's called in-context learning. As shown in the figure, the model training can be characterized as having an outer loop that learns a set of parameters that makes the learning of the inner loop as efficient as possible. And with this sort of framework in mind, we can really see how a good language model can also serve as a good few-shot learner.

So in this segment, we will have some fun with GPT-3 and look at some demonstrations of this in-context learning. So to start off, here is an example where someone's trying to create an application that converts a natural language description to bash one-liners. The first three examples are prompts, followed by generated examples from GPT-3.

So it gets "list all running processes" right. This one's easy; it probably just involves looking it up in some kind of hash table. Some of the more challenging ones involve copying over spans from the text. The SCP example is kind of interesting, as well as the grep one, which is harder to parse.

The SCP example comes up a lot during office hours, so GPT-3 knows how to do that. Here's a somewhat more challenging one, where the model is given a description of a database in natural language, and it starts to emulate that behavior. So the text in bold is sort of the prompt given to the model.

The prompt includes somewhat of a function specification of what a database is. So it says that the database begins knowing nothing. The database knows everything that's added to it. The database does not know anything else. And when you ask a question to the database, if the answer is there in the database, the database must return the answer.

Otherwise, it should say it does not know the answer. So this is very new and very powerful. And the prompt also includes some example usages. So when you ask 2+2, the database does not know. When you ask the capital of France, the database does not know. And then you add in a fact that Tom is 20 years old to the database.

And now you can start asking it questions like, where does Tom live? And as expected, it says that the database does not know. But now if you ask it, what's Tom's age? The database says that Tom is 20 years old. And if you ask, what's my age? The database says basically that it does not know, because that's not been added.

So this is really powerful. Here's another one. Now in this example, the model is asked to blend concepts together. And so there's a definition of what it means to blend concepts. So if you take airplane and car, you can blend those to give flying car. Essentially, there's a Wikipedia definition of what concept blending is, along with some examples.

Now let's look at some prompts followed by what GPT-3 answers. So the first one is straightforward, two-dimensional space blended with 3D space gives 2.5-dimensional space. The one that is somewhat interesting is old and new gives recycled. Then a triangle and square gives trapezoid. That's also interesting. The one that's really non-trivial is geology plus neurology.

The answer was "sediment neurology", and I had no idea what this was, but it's apparently correct. So clearly, it's able to do these very flexible things just from a prompt. So here's another class of examples that GPT-3 gets somewhat right. These are the copycat analogy problems, which have been really well studied in cognitive science.

And the way it works is that I'm going to give you some examples and then ask you to induce a function from these examples and apply it to new queries. So if ABC changes to ABD, what does PQR change to? Well, PQR must change to PQS, because the function we've learned is that the last letter must be incremented by 1.

And this function, humans can now apply to examples of varying types. So P repeated twice, Q repeated twice, R repeated twice must change to P repeated twice, Q repeated twice, and S repeated twice. And it seems like GPT-3 is able to get them right, more or less. But the problem is that if you ask it to generalize to examples that have a larger number of repetitions than were seen in the prompt, it's not able to do that.

So in this situation, you ask it to make an analogy where the letters are repeated four times, and it's never seen that before and doesn't know what to do. And so it gets all of these wrong. So there's a point to be made here that maybe these prompts are not enough to convey the function the model should be learning, and maybe with even more examples it could learn it.

But it probably does not have the same kinds of generalization that humans have. And that brings us to the limitations of these models and some open questions. So just looking at the paper and going through the results, it seems like the model is bad at logical and mathematical reasoning, anything that involves doing multiple steps of reasoning.

And that explains why it's bad at arithmetic, why it's bad at word problems, why it's not great at analogy making, and even at traditional textual entailment data sets that seem to require logical reasoning, like RTE. A second, more subtle point is that it's unclear how we can make permanent updates to the model.

Maybe if I want to teach the model a new concept, it's possible to do that while I'm interacting with the system. But once the interaction is over, the model resets and does not retain that knowledge. And it's not that this is something that the model cannot do in principle, but just something that's not really been explored.

It doesn't seem to exhibit human-like generalization, which is often called systematicity. And I'll talk a lot more about that. And finally, language is situated. And GPT-3 is just learning from text. And there's no exposure to other modalities. There's no interaction. So maybe the aspects of meaning that it acquires are somewhat limited.

And maybe we should explore how we can bring in other modalities. So we'll talk a lot more about these last two limitations in the rest of the lecture. But maybe I can take some questions now if there are any. I don't think there's a big outstanding question. But I mean, I think some people aren't really clear on the few-shot setting and prompting versus learning.

And I think it might actually be good to explain that a bit more. OK. Yeah. So maybe let's-- let me pick a simple example. Let me pick this example here. So prompting just means that-- so GPT-3, if you go back to first principles, GPT-3 is basically just a language model.

And what that means is, given a context, it'll tell you the probability of the next word. So if I give it a context, w1 through wk, GPT-3 will tell me the probability of w(k+1) over the vocabulary. So that's what a language model is. A prompt is essentially a context that gets prepended before GPT-3 starts generating.

And what's happening with in-context learning is that the context that you prepend to GPT-3 is basically a set of (x, y) examples. So that's the prompt. And the reason why it's equivalent to few-shot learning is because you prepend a small number of (x, y) examples. So in this case, if I just prepend this one example that's highlighted in purple, then that's essentially one-shot learning, because I just give it a single example as context.

And now, given this query, which is also appended to the model, it has to make a prediction. So the input-output format is the same as how a few-shot learner would receive. But since it's a language model, the training data set is essentially presented as a context. So someone is still asking, can you be more specific about the in-context learning setups?

What is the task? Right. So let's see. Maybe I can go to-- yeah, so maybe I can go to this slide. So the task is just language modeling: the model gets a context, which is just a sequence of tokens, and it has to generate a continuation given that sequence.

And the way you can convert that into an actual machine learning classification problem is that, for this example, maybe you give it 5 plus 8 equals 13, 7 plus 2 equals 9, and then 1 plus 0 equals.

And now, GPT-3 can fill in a number there. So that's how you convert it into a classification problem. The context here would be these two examples of arithmetic, like 5 plus 8 equals 13 and 7 plus 2 equals 9. And then the query is 1 plus 0 equals. And then the model, since it's just a language model, has to fill in 1 plus 0 equals question mark.
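As a concrete sketch of that prompt construction, here is a hypothetical few-shot prompting example using the Hugging Face transformers library, with GPT-2 standing in for GPT-3 (which is only accessible through an API); the prompt and generation settings are just for illustration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# The "training set" is just (x, y) pairs concatenated into the context,
# followed by the query; no gradient updates happen anywhere.
prompt = "5 + 8 = 13\n7 + 2 = 9\n1 + 0 ="
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=2)

# Decode only the newly generated tokens (the model's answer to the query).
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:]))
```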

So it fills in something there. It doesn't have to fill in numbers; it could fill in anything. But if it fills in a 1, it has done the right job. So that's how you can take a language model and do few-shot learning with it. I'll keep going with these questions. How is in-context learning different from transfer learning?

So I guess in-context learning-- I mean, you can think of in-context learning as being a kind of transfer learning. But transfer learning does not specify the mechanism through which the transfer is going to happen. With in-context learning, the mechanism is that the training examples are prepended to the context of the model, which is a language model, just in order.

So let's say you have x1, y1, x2, y2. And these are just fed directly to the model as context. And now it makes predictions on some queries that are drawn from this data set. So yes, it is a subcategory of transfer learning. But transfer learning does not specify exactly how this transfer is achieved.

But in-context learning is very specific and says that for language models, you can essentially concatenate the training data set and then present that to the language model. People still aren't sufficiently clear on what is or isn't happening with learning and prompting. So another question is, so in-context learning still needs fine-tuning, question mark?

We need to train GPT-3 to do in-context learning? Question mark. Right. So there are two parts to this question. So the answer is yes and no. So of course, the model is a language model. So it needs to be trained. So you start with some random parameters. And you need to train them.

But the model is trained as a language model. And once the model is trained, you can now use it to do transfer learning. And in in-context learning, the model parameters are fixed. You do not update the model parameters. All you do is give this small training set to the model, which is just appended to the model as context.

And now the model can start generating from that point on. So in this example, 5 plus 8 equals 13 and 7 plus 2 equals 9 are two (x, y) examples. In vanilla transfer learning, what you would do is take some gradient steps, update your model parameters, and then make a prediction on 1 plus 0 equals what.

But in in-context learning, all you're doing is concatenating 5 plus 8 equals 13 and 7 plus 2 equals 9 into the model's context window, and then making it predict what 1 plus 0 should be equal to. Maybe we should end for now with one other bigger picture question, which is, do you know of any research combining these models with reinforcement learning for the more complicated reasoning tasks?

So that is an excellent question. There is some recent work on kind of trying to align language models with human preferences, where yes, there is some amount of fine tuning with reinforcement learning based on these preferences from humans. So maybe you want to do a summarization problem in GPT-3.

The model produces multiple summaries. And for each summary, maybe you have a reward that is essentially a human preference. Maybe I want to include some facts, and I don't want to include some other non-important facts. So I can construct a reward out of that, and I can fine tune the parameters of my language model basically using reinforcement learning based on this reward, which is essentially human preferences.

So there's some very recent work that tries to do this. But I'm not sure-- yeah, I'm not aware of any work that tries to use reinforcement learning to teach reasoning to these models. But I think it's an interesting future direction to explore. OK. Maybe you should go on at this point.

OK. OK, so we'll talk a bit more about these last two points, so systematicity and language grounding. So just to start off, how do you define systematicity? So really, the definition is that there is a definite and predictable pattern among the sentences that native speakers of a language understand.

And so there's a systematic pattern among the sentences that we understand. What that means is, let's say there's a sentence like, John loves Mary. And if a native speaker understands the sentence, then they should also be able to understand the sentence, Mary loves John. And closely related to this idea of systematicity is the principle of compositionality.

And for now, I'm going to ignore the definition by Montague and just look at the rough definition. And then we can come back to this other, more concrete definition. The rough definition is essentially that the meaning of an expression is a function of the meaning of its parts. So that brings us to the question, are human languages really compositional?

And here are some examples that make us think that maybe, yes. So if you look at the meaning of the noun phrase brown cow, it is composed of the meaning of the adjective brown and the noun cow. So take all things that are brown and all things that are cows, take the intersection, and you get brown cows.

Similarly, red rabbits: all things that are red, all things that are rabbits, combine them and you get red rabbits. And then kick the ball, this verb phrase can be understood as: you have some agent that's performing a kicking operation on the ball. But it's not always the case that you can get the meaning of the whole by combining the meanings of the parts.

So here, we have some counter examples that people often use. So red herring does not mean all things that are red and all things that are herring. And kick the bucket definitely does not mean that there's an agent that's kicking the bucket. So while these examples are supposed to be provocative, we think that language is mostly compositional.

There are lots of exceptions, but for a vast majority of sentences that we've never heard before, we're able to understand what they mean by piecing together the words that the sentence is composed of. And so what that means is that maybe compositionality of representations is a helpful prior that could lead to systematicity in behavior.

And that brings us to the questions that we ask in this segment: are neural representations compositional? And second, if so, do they generalize systematically? So how do you even measure whether the representations that a neural network learns exhibit compositionality? So let's go back to this definition from Montague, which says that compositionality is about the existence of a homomorphism from syntax to semantics.

And to look at that, we have this example, which is Lisa does not skateboard. And we have a syntax tree corresponding to this example. And the meaning of the sentence can be composed according to the structure that's decided by the syntax. So the meaning of Lisa does not skateboard is a function of the meaning of Lisa and does not skateboard.

The meaning of does not skateboard is a function of does and not skateboard. The meaning of not skateboard is a function of not and skateboard. So that's good. And so this gives us one way of formalizing how we can measure compositionality in neural representations. And so compositionality of representations could be thought of as how well the representation approximates an explicitly homomorphic function in a learned representation space.

So what we're going to do is essentially measure, if we were to construct a neural network whose computations are based exactly on these parse trees, how far are the representations of a learned model from this explicitly compositional representation? And that'll give us some understanding of how compositional the neural network's representations actually are.

So to unpack that a little bit: instead of having denotations, we have representations at the nodes. And to be more concrete about that, we first start by choosing a distance function that tells us how far away two representations are. And then we also need a way to compose together two constituents to give us the meaning of the whole.

Once we have that, we can create an explicitly compositional function. So what we do is we have these representations at the leaves that are initialized randomly, and the composition function that's also initialized randomly. And then a forward pass according to this syntax is used to compute the representation of Lisa does not skateboard.

And now once you have this representation, you can create a loss function. And this loss function measures how far the representations of my neural network are from those of this second, proxy neural network that I've created. And then I can basically optimize both the composition function and the embeddings of the leaves.
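Here is a rough, hypothetical sketch of that proxy construction, assuming binary parse trees, a learned linear composition function, and a squared-error distance; the actual work may make different choices for the distance and the composition.

```python
import torch
import torch.nn as nn

class CompositionalProxy(nn.Module):
    """Explicitly compositional network: leaf embeddings plus a composition fn."""
    def __init__(self, num_leaves, dim):
        super().__init__()
        self.leaf = nn.Embedding(num_leaves, dim)  # randomly initialized leaves
        self.compose = nn.Linear(2 * dim, dim)     # learned composition function

    def forward(self, tree):
        # tree is either a leaf id (int) or a pair (left_subtree, right_subtree)
        if isinstance(tree, int):
            return self.leaf(torch.tensor(tree))
        left, right = tree
        return self.compose(torch.cat([self.forward(left), self.forward(right)]))

# e.g. "Lisa (does (not skateboard))" with leaf ids 0..3
tree = (0, (1, (2, 3)))
proxy = CompositionalProxy(num_leaves=4, dim=16)
model_repr = torch.randn(16)  # stand-in for the trained network's representation

# Distance between the model's representation and the compositional reconstruction;
# fitting this over a data set and measuring held-out error gives the score.
loss = nn.functional.mse_loss(proxy(tree), model_repr)
loss.backward()
```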

And then once the optimization is finished, I can measure how far the representations of my neural net are from this explicitly compositional network on a held-out set. And that then tells me whether the representations my neural net learned were actually compositional or not. So to see how well this works, let's look at a plot.

And this is relatively complex. But just to unpack it a little bit, it plots the mutual information between the input that the neural network receives and its representation, against this tree reconstruction error that we were talking about. And to give some more background about what's to come, there is a theory called the information bottleneck theory, which says that as a neural network trains, it first tries to maximize the mutual information between the representation and the input in an attempt to memorize the entire data set.

And that is a memorization phase. And then once memorization is done, there is a learning or a compression phase where this mutual information starts to decrease. And the model is essentially trying to compress the data or consolidate the knowledge in the data into its parameters. And what we are seeing here is that as a model learns, which is characterized by decreasing mutual information, we see that the representations themselves are becoming more and more compositional.

And overall, we observe that learning is correlated with increased compositionality as measured by the tree reconstruction error. So that's really encouraging. So now that we have a method of measuring compositionality of representations in these neural nets, how do we start to create benchmarks that see if they are generalizing systematically or not?

So to do that, here is a method for taking any data set and splitting it into a train test split that explicitly tests for this kind of generalization. So to do that, we use this principle called maximizing the compound divergence. And to illustrate how this principle works, we look at this toy example.

So in this toy example, we have a training data set that consists of just two examples and test data set of just two examples. The atoms are defined as the primitive elements, so entity words, predicates, question types. So in this toy example, Goldfinger, Christopher Nolan, these are all the primitive elements.

And the compounds are compositions of these primitive elements. So "who directed [entity]" would be the composition of a question type, like "did x [predicate] y", and the predicate "direct". So here's the basic machinery for producing compositionally challenging splits. So let's start by introducing two distributions. The first distribution is the normalized frequency distribution of the atoms.

So given any data set, if we know what the notion of atoms are, we can basically compute the frequency of all of the atoms and then normalize that by the total count. And that's going to give us one distribution. And we can repeat the same thing for the compounds.

And that will give us a second frequency distribution. So note that these are just two probability distributions. And once we have these two distributions, we can essentially define the atom and compound divergence simply as this quantity here, where C is the Chernoff coefficient between two categorical distributions. The Chernoff coefficient basically measures how close two categorical distributions are.

So just to get a bit more intuition about this: if we set p equal to q, then the Chernoff coefficient is 1, which means the two distributions are maximally similar. And if p is non-zero only in places where q is 0, that is, the two distributions have disjoint support, then the Chernoff coefficient is exactly 0, which means that the two distributions are maximally far apart.
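As a small sketch of that computation, here is a hypothetical implementation of the Chernoff coefficient C_alpha(P||Q) = sum_k p_k^alpha * q_k^(1 - alpha) and the corresponding divergence (one minus the coefficient); the specific alpha values used for atoms versus compounds should be checked against the paper.

```python
import numpy as np

def chernoff_coefficient(p, q, alpha=0.5):
    """C_alpha(P || Q) = sum_k p_k^alpha * q_k^(1 - alpha).
    Equals 1 when P == Q and 0 when the supports are disjoint."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p ** alpha * q ** (1.0 - alpha)))

def divergence(train_freqs, test_freqs, alpha=0.5):
    """Divergence between two normalized frequency distributions."""
    return 1.0 - chernoff_coefficient(train_freqs, test_freqs, alpha)

# Identical distributions -> divergence 0; disjoint supports -> divergence 1.
p = [0.5, 0.5, 0.0]
q = [0.0, 0.0, 1.0]
print(divergence(p, p))  # 0.0
print(divergence(p, q))  # 1.0
```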

And the overall objective is just that we are going to maximize the compound divergence while minimizing the atom divergence. So what's the intuition behind doing such a thing? What we want is to ensure that the unigram distribution, in some sense, is the same between the train and test splits, so that the model does not encounter any new words.

But we want the compound divergence to be very high, which means that these same words that the model has seen many times must appear in new combinations, which means that we are testing for systematicity. And so if you follow this procedure for, let's say, a semantic parsing data set, what we see is that as you increase the scale, the model does better and better at compositional generalization.

But just pulling out a quote from this paper, "pre-training helps for compositional generalization but doesn't fully solve it." And what that means is that maybe as you keep scaling up these models, you'll see better and better performance, or maybe it starts to saturate at some point. In any case, we should probably be thinking more about this problem instead of just trying to brute force it.

So this segment tells us that depending on how we split a data set, we can measure different behaviors of the model. And that tells us that maybe we should be thinking more critically about how we're evaluating models in NLP in general. So there has basically been a revolution over the last few years in the field, where we're seeing all of these large transformer models beat all of our benchmarks.

At the same time, there is still not complete confidence that once we deploy these systems in the real world, they're going to maintain their performance. And so it's unclear if these gains are coming from spurious correlations or some real task understanding. And so how do we design benchmarks that accurately tell us how well this model is going to do in the real world?

And so I'm going to give one example of works that try to do this. And that's the idea of dynamic benchmarks. And the idea of dynamic benchmarks is basically saying that instead of testing our models on static test sets, we should be evaluating them on an ever-changing dynamic benchmark.

And there are many recent examples of this. And the idea dates back to a 2017 workshop at EMNLP. And so the overall schematic looks something like this: we start with a training data set and a test data set, which is the standard static setup. We train a model on that.

And then once the model is trained, we deploy it and have humans create new examples that the model fails to classify. And crucially, we're looking for examples that the model does not get right, but that humans have no issue figuring out the answer to. So by playing this game of whack-a-mole, where humans figure out what the holes in the model's understanding are, and then add those examples back into the training data, re-train the model, deploy it again, and have humans create new examples, we can essentially construct this never-ending test set, which can hopefully be a better proxy for estimating real-world performance.
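Schematically, the loop just described looks something like the sketch below; `train`, `deploy`, and `collect_fooling_examples` are hypothetical callables standing in for model training, deployment, and the human-in-the-loop collection step.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (input text, gold label)

def dynamic_benchmark(train_set: List[Example],
                      rounds: int,
                      train: Callable[[List[Example]], object],
                      deploy: Callable[[object], None],
                      collect_fooling_examples: Callable[[object], List[Example]]):
    model = None
    for _ in range(rounds):
        model = train(train_set)          # re-train on the current data
        deploy(model)
        # humans write examples the model gets wrong but people find easy,
        # and those examples are folded back into the training set
        train_set = train_set + collect_fooling_examples(model)
    return model, train_set
```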

So this is some really cutting-edge research. And one of the main challenges of this class of works is that it's unclear how much this can scale up, because maybe after multiple iterations of this whack-a-mole, humans are just fundamentally limited by creativity. So figuring out how to deal with that is really an open problem.

And current approaches just use examples from other data sets to prompt humans to think more creatively. But maybe we can come up with better, more automated methods of doing this. So this brings us to the final segment. Or actually, let me stop for questions at this point and see if people have questions.

Here's a question. With dynamic benchmark, doesn't this mean that the model creator will also need to continually test/evaluate the models on the new benchmarks, new data sets? Wait a second. Sorry. Yeah, so with dynamic benchmarks, yes, it's absolutely true that you will have to continuously keep training your model.

And that's just to ensure that the reason your model is not doing well on the test set doesn't have to do with domain mismatch. And what we're really trying to do is come up with a better estimate of the model's performance on the overall task by getting more and more data.

So yes, to answer your question, yes, we need to keep training the model again and again. But this can be automated. So I'll move on to language grounding. So in this final segment, I'll talk about how we can move beyond just training models on text alone. So many have articulated the need to use modalities other than text if we someday want to get at real language understanding.

And ever since we've had these big language models, there has been a rekindling of this debate. And recently, there was multiple papers on this. And so at ACL last year, there was this paper that argues through multiple thought experiments that it's actually impossible to acquire meaning from form alone, where meaning refers to the communicative intent of a speaker, and form refers to text or speech signals.

A more modern version of this was put forward by the second paper, where they say that training on only web-scale data limits the world scope of models and limits the aspects of meanings that the model can actually acquire. And so here is a diagram that I borrowed from the paper.

And what they say is that in the era where we were training models on supervised data sets, models were limited to world scope one. And now that we've moved on to exploiting unlabeled data, we're in world scope two, where models just have strictly more signal to acquire more aspects of meaning from.

If you mix in additional modalities into this-- so maybe you mix in videos, and maybe you mix in images-- then that expands out the world scope of the model further. And now maybe it can acquire more aspects of meaning, such that now it knows that the lexical item red refers to red images.

And then if you go beyond that, you can have a model that is embodied, and it's actually living in an environment where it can interact with the world, conduct interventions and experiments. And then if you go even beyond that, you can have models that live in a social world where they can interact with other models.

Because after all, the purpose of language is to communicate. And so if you can have a social world where models can communicate with other models, that expands out aspects of meaning. And so GPT-3 is in world scope two. So there are a lot of open questions in this space.

But given that there are all of these good arguments about how we need to move beyond text, what is the best way to do this at scale? We know that babies cannot learn language from watching TV alone, for example. So there has to be some interventions, and there has to be interactions with the environment that need to happen.

But at the same time, the question is, how far can models go by just training on static data as long as we have additional modalities, especially when we combine this with scale? And if interactions with the environment are really necessary, how do we collect data and design systems that interact minimally or in a cost-effective way?

And then finally, could pre-training on text still be useful if any of these other research directions become more sample efficient? So if you're interested in learning more about this topic, I highly encourage you to take CS224U, which is offered in the spring. They have multiple lectures on just language learning.

So in this final segment, I'm going to talk a little bit more about how you can get involved with NLP and deep learning research and how you can make more progress. So here are some general principles for how to make progress in NLP research. I think the most important thing is to just read broadly, which means not just reading the latest and greatest papers on arXiv, but also reading pre-2010 statistical NLP.

Learn about the mathematical foundations of machine learning to understand how generalization works, so take CS229M. Learn more about language, which means taking classes in the linguistics department. In particular, I would recommend maybe this 138A. And also take CS224U. And finally, if you want inspiration from how babies learn, then definitely read the child language acquisition literature.

It's fascinating. Finally, learn your software tools, which includes scripting tools, version control, data wrangling, and learning how to visualize quickly with Jupyter notebooks. And deep learning often involves running multiple experiments with different hyperparameters and different ideas, all in parallel. And sometimes it can get really hard to keep track of everything.

So learn how to use experiment management tools like Weights & Biases. And finally, I'll talk about some really quick final project tips. So first, let's just start by saying that if your approach doesn't seem to be working, please do not panic. Put assert statements everywhere and check if the computations that you're doing are correct.

Use breakpoints extensively, and I'll talk a bit more about this. Check if the loss function that you've implemented is correct. And one way of debugging that is to check that the initial values are correct. So if you're doing a k-way classification problem, then the initial loss should be the natural log of k.
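As a quick sketch of that check (with a made-up stand-in model), the initial cross-entropy of a randomly initialized k-way classifier should come out near ln(k):

```python
import math
import torch
import torch.nn as nn

k = 10
model = nn.Linear(128, k)                    # stand-in for your real model
x = torch.randn(32, 128)
y = torch.randint(0, k, (32,))
loss = nn.functional.cross_entropy(model(x), y)
print(loss.item(), "should be close to", math.log(k))  # about 2.3 for k = 10
```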

Always, always, always start by creating a small training data set, which has like 5 to 10 examples, and see if your model can completely overfit on that. If not, there's a problem with your training loop. Check for saturating activations and dead values. And often, this can be fixed by-- maybe there are some problems with the gradients, or maybe there are some problems with the initialization.

Which brings me to the next point. Check your gradient values. See if they're too small, which means that maybe you should be using residual connections or LSTMs. Or if they're too large, then you should use gradient clipping. In fact, always use gradient clipping. Overall, be methodical. If your approach doesn't work, come up with hypotheses for why this might be the case.

Design oracle experiments to debug it. Look at your data. Look at the errors that it's making. And just try to be systematic about everything. So I'll just say a little bit more about breakpoints. So there's this great library called pdb. It's like GDB, but for Python; that's why it's called pdb.

To create a breakpoint, just add the line import pdb; pdb.set_trace() before the line you want to inspect. So earlier today, I was trying to play around with the Transformers library. So I was trying to do question answering. So I have a really small training corpus. And the context is, one morning, I shot an elephant in my pajamas.

How he got into my pajamas, I don't know. And the question is, what did I shoot? And to solve this problem, I basically imported a tokenizer and a BERT model. And I initialized my tokenizer, initialized my model, tokenized my input. I set my model into the eval mode. And I tried to look at the output.

But I get this error. And I'm very sad. It's not clear what's causing this error. And so the best way to look at what's causing this error is to actually put a breakpoint. So right after model.eval, I put a breakpoint. Because I know that that's where the problem is.

So the problem is in line 21. So I put a breakpoint at line 21. And now once I put this breakpoint, I can just run my script again. And it stops before executing line 21. And at this point, I can examine all of my variables. So I can look at the tokenized input, because maybe that's where the problem is.

And lo and behold, I see that it's actually a list. So it's a dictionary of lists, whereas models typically expect a torch tensor. So now I know what the problem is. And that means I can quickly go ahead and fix it. And everything just works. So this just shows that you should use breakpoints everywhere if your code is not working.
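Here is a hypothetical reconstruction of the kind of script being described, with the fix already applied; the checkpoint name is an assumption (any BERT question-answering checkpoint would illustrate the same point).

```python
import torch
from transformers import BertForQuestionAnswering, BertTokenizer

name = "bert-large-uncased-whole-word-masking-finetuned-squad"  # assumed checkpoint
tokenizer = BertTokenizer.from_pretrained(name)
model = BertForQuestionAnswering.from_pretrained(name)

context = ("One morning I shot an elephant in my pajamas. "
           "How he got into my pajamas, I don't know.")
question = "What did I shoot?"

# The original bug: without return_tensors="pt", the tokenizer returns a dict of
# Python lists, which the model can't consume. Dropping in a breakpoint right
# before the forward pass -- import pdb; pdb.set_trace() -- is how you'd spot it.
inputs = tokenizer(question, context, return_tensors="pt")

model.eval()
with torch.no_grad():
    outputs = model(**inputs)
start = int(torch.argmax(outputs.start_logits))
end = int(torch.argmax(outputs.end_logits))
print(tokenizer.decode(inputs["input_ids"][0][start:end + 1]))
```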

And it can just help you debug really quickly. So finally, I'd say that if you want to get involved with NLP and deep learning research, and if you really liked the final project, we have the CLIPS program at Stanford. This is a program for undergrads, master's students, and PhD students who are interested in deep learning and NLP research and want to get involved with the NLP group.

So we highly encourage you to apply to CLIPS. And so I'll conclude today's class by saying that we've made a lot of progress in the last decade. And that's mostly due to clever understanding of neural networks, data, hardware, all of that combined with scale. We have some really amazing technologies that can do really exciting things.

And we saw some examples of that today. In the short term, I expect that we'll see more scaling because it just seems to help. So perhaps even larger models. But this is not trivial. So I said that before, and I'll just say it again. Scaling requires really non-trivial engineering efforts, and sometimes even clever algorithms.

And so there's a lot of interesting systems work to be done here. But in the long term, we really need to be thinking more about these bigger problems of systematicity and generalization. How can we make our models learn a new concept really quickly, so that we get fast adaptation? And then we also need to create benchmarks that we can actually trust.

If my model gets some performance number on a sentiment analysis data set and is deployed in the real world, that real-world performance should be reflected in the number that I get from the benchmark. So we need to make progress in the way we evaluate models. And then also figuring out a way to move beyond text in a more tractable way.

This is also really essential. So yeah, that's it. Good luck with your final projects. I can take more questions at this point. So I answered a question earlier that actually I think you could also opine on. It was the question of whether you have a large model that's pre-trained on language, if it will actually help you in other domains, like you apply it to vision stuff.

Yeah. Yeah. So I guess the answer is actually yes. So there was a paper that came out really, really recently, like just two days ago. I think it was GPT-2, I'm not sure; it's one large transformer model that's pre-trained on text. And then they apply it to other modalities; they definitely apply it to images.

And I think they apply it to math problems and some more modalities, and show that it's actually really effective at transfer. So if you pre-train on text and then you move to a different modality, that helps. I think part of the reason for that is just that across modalities, there is a lot of autoregressive structure that is shared.

And I think one reason for that is that language is really referring to the world around it. And so you might expect that there is some correspondence that's just beyond the autoregressive structure. So there's also works that show that if you have just text-only representations and image-only representations, you can actually learn a simple linear classifier that can learn to align both of these representations.

And all of these works are just showing that there's actually a lot more common between modalities than we thought in the beginning. So yeah, I think it's possible to pre-train on text and then fine-tune on your modality of interest. And it should probably be effective, of course, based on what the modality is.

But for images and videos, it's certainly effective. Any questions? A couple of questions have turned up. One is, what's the difference between CS224U and this class in terms of the topics covered and focus? Do you want to answer that one, Shikhar, or should I have a go at answering it?

Maybe you should answer this one. OK. So next quarter, CS224U, Natural Language Understanding, is co-taught by Chris Potts and Bill MacCartney. So in essence, it's meant to be different in that natural language understanding focuses on what its name says: how to build computer systems that understand the sentences of natural language.

Now, in truth, the boundary is kind of complex because we do some natural language understanding in this class as well. And certainly for the people who are doing the default final project, question answering, well, that's absolutely a natural language understanding task. But the distinction is meant to be that at least a lot of what we do in this class, things like the assignment three dependency parser or building the machine translation system in assignment four, that they are in some sense natural language processing tasks where processing can mean anything but commonly means you're doing useful intelligent stuff with human language input, but you're not necessarily deeply understanding it.

So there is some overlap in the classes. If you do CS224U, you'll certainly see word vectors and transformers again. But the emphasis is on doing a lot more with natural language understanding tasks. And so that includes things like building semantic parsers. So they're the kind of devices that will, you know, respond to questions and commands such as an Alexa or Google assistant will do.

Building relation extraction systems, which pull particular facts out of a piece of text, like: oh, this person took on this position at this company. Looking at grounded language learning and grounded language understanding, where you're not only using the language but also the world context to get information, and other tasks of that sort.

I mean, I guess you're going to look at the website to get more details of it. I mean, you know, relevant to this class, I mean, a lot of people also find it an opportunity to just get further in doing a project in the area of natural language processing that sort of by the nature of the structure of the class, since, you know, it more assumes that people know how to build deep learning natural language systems at the beginning that rather than a large percentage of the class going into, okay, you have to do all of these assignments, although there are little assignments earlier on that there's sort of more time to work on a project for the quarter.

Okay. Here's one more question that maybe Shikhar could do. Do you know of attempts to crowdsource dynamic benchmarks, e.g. users uploading adversarial examples for evaluation or online learning? Yeah, so actually, the main idea there is to use crowdsourcing, right? So in fact, there is this platform that was created by FAIR called Dynabench.

And the objective is just that to construct this like dynamically evolving benchmark, we are just going to offload it to users of this platform. And you can, you know, it essentially gives you utilities for like, deploying your model and then having, you know, humans kind of try to fool the model.

Yeah, so this is basically how the dynamic benchmark collection actually works. So we deploy a model on some platform, and then we get humans to try to fool the system. Yeah. Here's a question. Can you address the problem of NLP models not being able to remember really long contexts, and techniques to do inference on really long inputs?

Yeah, so I guess there have been a few works recently that try to scale up transformers to really long context lengths. One of them is the Reformer. And there's also the Transformer-XL, which was the first one to try and do that.

I think what is unclear is whether you can combine that with the scale of these GPT like models. And if you see like qualitatively different things, once you do that, like, and part of it is just that all of this is just like so recent, right? But yeah, I think the open question there is that, you know, can you take these like really long context transformers that can operate over long context, combine that with scale of GPT-3, and then get models that can actually reason over these like really large contexts?

Because I guess the hypothesis of scale is that once you train language models at scale, it can start to do these things. And so to do that for long context, we actually need to like have long context transformers that are trained at scale. And I don't think people have done that yet.

So I'm seeing this other question about language acquisition. Chris, do you have some thoughts on this? Or maybe I can just say something. Yeah, so the question is, what do you think we can learn from baby language acquisition? Can we build a language model in a more interactive way, like reinforcement learning?

Do you know any of these attempts? That's a big, huge question. And you know, I think the short non-helpful answer is that there are kind of no answers at the moment. I know people have certainly tried to do things at various scales, but you know, we just have no technology that is the least bit convincing for being able to replicate the language learning ability of a human child.

But after that prologue, what I could say is, I mean, yeah, there are definitely ideas to have in your head. So you know, there are sort of clear results, which is that little kids don't learn by watching videos. So it seems like interaction is completely key. Little kids don't learn from language alone.

They're in a very rich environment where people are sort of both learning stuff from the environment in general, and in particular, you know, they're learning a lot from what language acquisition researchers refer to as attention, which is different to what we mean by attention. But it means that the caregiver will be looking at the object that's the focus of interest and you know, commonly other things as well, like sort of, you know, picking it up and bringing it near the kid and all those kinds of things.

And you know, babies and young kids get to experiment a lot, right? So regardless of whether it's learning what happens when you have some blocks that you stack up and play with them, or you're learning language, you sort of experiment by trying some things and see what kind of response you get.

And again, that's essentially building on the interactivity of it, that you're getting some kind of response to any offerings you make. And you know, this is something that's sort of been hotly debated in the language acquisition literature. So a traditional Chomskyan position is that, you know, human beings don't get effective feedback, you know, supervised labels, when they talk.

And you know, in some very narrow sense, well, that's true, right? It's just not the case that after a baby tries to say something, they get feedback of, you know, syntax error in English at word four, or they get given, here's the semantic form I took away from your utterance.

But in a more indirect way, they clearly get enormous feedback, they can see what kind of response they get from their caregiver at every corner. And so like in your question, you were suggesting that, well, somehow we should be making use of reinforcement learning because we have something like a reward signal there.

And you know, in a big picture way, I'd say, oh, yeah, I agree. In terms of a much more specific way as to, well, how can we possibly get that to work to learn something with the richness of human language? You know, I think we don't have much idea, but you know, there has started to be some work.

So people have been sort of building virtual environments, which, you know, you have your avatar in and that can manipulate in the virtual environment and there's linguistic input, and it can succeed in getting rewards for sort of doing a command where the command can be something like, you know, pick up the orange block or something like that.

And you know, to a small extent, people have been able to build things that work. I mean, as you might be picking up, I mean, I guess so far, at least I've just been kind of underwhelmed because it seems like the complexity of what people have achieved is sort of, you know, just so primitive compared to the full complexity of language, right?

You know, the kind of languages that people have been able to get systems to learn are ones that can, yeah, do pick up commands where they can learn, you know, blue cube versus orange sphere. And that's sort of about how far people have gotten. And that's sort of such a teeny small corner of what's involved in learning a human language.

One thing I'll just add to that is I think there are some principles of how kids learn that people have tried to apply to deep learning. And one example that comes to mind is curriculum learning, where there's a lot of literature that shows that, you know, babies tend to pay attention to things that are just slightly challenging for them.

And they don't pay attention to things that are extremely challenging, and also don't pay attention to things that they know how to solve. And many researchers have really tried to get curriculum learning to work. And the verdict on that is that it seems to kind of work when you're in like reinforcement learning settings.

But it's unclear if it's going to work on like supervised learning settings. But I still think that it's like under explored. And maybe, you know, there should be like more attempts to kind of see if we can like add in curriculum learning and if that improves anything. Yeah, I agree.

Curriculum learning is an important idea, which we haven't really talked about. But it seems like it's certainly essential to human learning. And there's been some minor successes with it in the machine learning world. But it sort of seems like it's an idea you should be able to do a lot more with in the future as you move from models that are just doing one narrow task to trying to do a more general language acquisition process.

Should I attempt the next question as well? Okay, the next question is, is the reason humans learn languages better just because we are pre trained over millions of years of physics simulation? Maybe we should pre train a model the same way. So I mean, I presume what you're saying is physics simulation, you're evoking evolution when you're talking about millions of years.

So you know, this is a controversial, debated, big question. So you know, again, if I invoke Chomsky again, so Noam Chomsky is sort of the most famous linguist in the world. And you know, essentially, Noam Chomsky's career starting in the 1950s is built around the idea that little children get such dubious linguistic input because you know, they hear a random bunch of stuff, they don't get much feedback on what they say, etc.

But language could not be learned empirically just from the data observed. And so the only possible assumption to work from is that significant parts of human language are innate, in the human genome; babies are born with that. And that explains the miracle by which very little humans learn amazingly fast how human languages work.

Now, to give some credit to that idea, for those of you who have not been around little children, I mean, I think one does just have to acknowledge, you know, human language acquisition by little kids, I mean, it does just seem to be miraculous, right? As you go through this sort of slow phase for a couple of years where, you know, the kid sort of goos and gahs some syllables, and then there's a fairly long period where they've picked up a few words, and they can say "juice, juice" when they want to drink some juice and nothing else.

And then it just sort of seems like there's this phase change, where the kids suddenly realize, wait, this is a productive generative sentence system, I can say whole sentences. And then in an incredibly short period, they sort of seem to transition from saying one and two word utterances to suddenly they can say, you know, "Daddy come home in garage, putting bike in garage." And you go, wow, how did they suddenly discover language?

So, you know, so it is kind of amazing. But personally, for me, at least, you know, I've just never believed the strong versions of the hypothesis that human beings have much in the way of language specific knowledge or structure in their brains that comes from genetic inheritance. Like clearly, humans do have these very clever brains.

And if we're at the level of saying, being able to think, or being able to interpret the visual world, that's things that have developed over tens of millions of years. And evolution can be a large part of the explanation. And humans are clearly born with lots of vision specific hardware in their brains, as are a lot of other creatures.

But when you come to language, you know, no one knows when language was in a sort of a modern like form first became available, because, you know, there aren't any fossils of people saying, you know, the word spear or something like that. But, you know, to the extent that there are estimates based on sort of what you can see of the sort of spread of proto humans and their sort of apparent social structures from sort of what you can find in fossils, you know, most people guess that language is at most a million years old.

And you know, that's just too short a time for evolution to sort of build any significant structure inside human brains that's specific to language. So I kind of think that the working assumption has to be that there's just about nothing specific to language in human brains.

And the most plausible hypothesis, not that I know very much about neuroscience when it comes down to it, is that humans were able to repurpose hardware that was originally built for other purposes, like visual scene interpretation and memory, and that gave a basis of clever hardware that could then be used for language.

It's kind of like how GPUs were invented for playing computer games, and we were able to repurpose that hardware to do deep learning. Okay, we've got a lot of questions that have come out at the end. Okay, so this one is to be answered live. Let's see. Yeah, if you could name, and I guess this is for either of you, one main bottleneck: if we could provide feedback efficiently to our systems, the way babies are given feedback, what's the bottleneck that remains in trying to get more human-like language acquisition?

I mean, I can opine on this. Were you saying something, Shikhar? Yeah, I was just going to say that I think it's a bit of everything. In terms of models, one thing I would say is that we know there are more feedback connections than feedforward connections in the brain.

And we haven't really figured out a way to exploit that. Of course, we have RNNs, which you can think of as implementing a feedback loop, but we still haven't really figured out how to take the knowledge that the brain has a lot of feedback connections and apply it to practical systems. On the modeling end, maybe that's one problem.
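To make the "RNN as a feedback loop" remark concrete, here is a minimal sketch of a vanilla recurrent cell, where the hidden state produced at one step is fed back in as input to the next step. This is a generic toy example in NumPy, not any particular model from the course; the dimensions and random inputs are arbitrary.

```python
import numpy as np

# A minimal vanilla RNN cell: the hidden state h is the "feedback" signal,
# fed back into the cell at every time step alongside the new input x_t.
rng = np.random.default_rng(0)
d_in, d_hid = 4, 8
W_xh = rng.normal(scale=0.1, size=(d_hid, d_in))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(d_hid, d_hid))  # hidden -> hidden (the feedback loop)
b_h = np.zeros(d_hid)

def rnn_step(x_t, h_prev):
    """One step of the recurrence: the new state depends on the input AND the previous state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Unroll over a short sequence of random "inputs".
h = np.zeros(d_hid)
for t in range(5):
    x_t = rng.normal(size=d_in)
    h = rnn_step(x_t, h)   # the output of step t becomes an input to step t+1
print(h.shape)  # (8,)
```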

There's also, yeah, I think curriculum learning is maybe one of them, but the one that's probably going to give the most bang for the buck is really figuring out how we can move beyond text. I think there's just so much more information available that we're not using.

And so I think that's where most of the progress might come from: figuring out what's most practical for going beyond text. That's what I think. Okay. Let's see. What are some important NLP topics that we have not covered in this class? I'll take that. Well, one answer is a lot of the topics that are covered in CS224U, because we do make a bit of an effort to keep the two classes disjoint, though not fully.

Right. So there are lots of topics in language understanding that we haven't covered. If you want to make a voice assistant like Alexa, Siri, or Google Assistant, you need to be able to interface with systems and APIs that can do things like delete your mail or buy you concert tickets.
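That interfacing step, going from an utterance to something a system can execute, is essentially semantic parsing. Here is a toy, purely hypothetical sketch of mapping a command to an intent-plus-slots structure; the intent names, slot schema, and keyword rules are made up for illustration, and real assistants use learned semantic parsers rather than patterns like these.

```python
import re

# Toy illustration only: a rule-based mapping from an utterance to a
# structured semantic form (intent + slots) that an API could act on.
def parse_command(utterance: str) -> dict:
    text = utterance.lower()
    if "delete" in text and "mail" in text:
        return {"intent": "DeleteEmail", "slots": {"which": "latest"}}
    m = re.search(r"buy (\d+) tickets? (?:to|for) (.+)", text)
    if m:
        return {"intent": "BuyTickets",
                "slots": {"quantity": int(m.group(1)), "event": m.group(2)}}
    return {"intent": "Unknown", "slots": {}}

print(parse_command("Please buy 2 tickets to the symphony"))
# {'intent': 'BuyTickets', 'slots': {'quantity': 2, 'event': 'the symphony'}}
```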

And so you need to be able to convert from language into an explicit semantic form that can interact with the systems of the world. We haven't talked about that at all. So there's lots of language understanding stuff. There's also lots on language generation. Effectively, for language generation, all we have done is neural language models.

They are great: run them and they will generate language. And in one sense that's true; it's awesome the kind of generation you can do with things like GPT-2 or GPT-3. But what's missing is that this really only gives you the ability to produce fluent text, where it rabbits on producing fluent text, whereas if you actually wanted a good natural language generation system, you would also need higher-level planning of what you're going to talk about and how you are going to express it.

So in most natural language generation situations, you think: OK, I want to explain to people something about why it's important to take math classes in college. Let me think about how to organize this. Maybe I should talk about some of the different applications where math turns up and how it gives a really good grounding.

Somehow you plan out how you can present the ideas. And that kind of natural language generation we haven't done any of. So that's saying more understanding and more generation, which covers most of NLP. And obviously there are then particular tasks we could talk about that we either have or have not explicitly addressed.
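As a rough illustration of that higher-level planning point, below is a highly simplified sketch of a plan-then-realize pipeline: a content planner decides which points to make and in what order, and a surface realizer turns each point into text. The goal string, the points, and the templates here are all hand-written assumptions; in a real system both stages would be learned models rather than hard-coded stubs.

```python
# Toy two-stage generation pipeline: content planning, then surface realization.

def plan(goal: str) -> list[dict]:
    """Content planner: decide WHAT to say and in what order."""
    if goal == "why take math in college":
        return [
            {"point": "applications", "detail": "math shows up across many fields"},
            {"point": "grounding", "detail": "it gives a rigorous foundation for other courses"},
        ]
    return []

def realize(point: dict) -> str:
    """Surface realizer: decide HOW to say a single planned point."""
    templates = {
        "applications": "One reason is that {detail}.",
        "grounding": "Another is that {detail}.",
    }
    return templates[point["point"]].format(detail=point["detail"])

outline = plan("why take math in college")
print(" ".join(realize(p) for p in outline))
```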

OK. Has there been any work on putting language models into an environment in which they can communicate to achieve a task? And do you think this would help with unsupervised learning? So I guess there's been a lot of work on emergent communication and also self-play, where you have different models, initialized as language models, that attempt to communicate with each other to solve some task.

Then you have a reward at the end, depending on whether they were able to finish the task or not, and based on that reward you attempt to learn the communication strategy. This started out as emergent communication and self-play, and then there was recent work, I think last year or the year before, showing that if you initialize these models with language-model pre-training, you basically prevent this problem of language drift, where the communication protocol your models end up learning has nothing to do with actual language.
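To give a feel for that self-play setup, here is a deliberately tiny REINFORCE-style sketch of a referential game: a speaker emits a symbol, a listener acts on it, and both policies are updated from the shared end-of-task reward. This is a generic toy, not the specific work mentioned above, and it leaves out the language-model pretraining that was used to prevent language drift; the resulting protocol is whatever arbitrary code the two agents converge on.

```python
import numpy as np

# Minimal emergent-communication toy: speaker and listener play a referential
# game and are trained with REINFORCE from the shared end-of-episode reward.
rng = np.random.default_rng(0)
n_concepts, n_symbols = 2, 2
speaker = np.zeros((n_concepts, n_symbols))   # logits: concept -> message
listener = np.zeros((n_symbols, n_concepts))  # logits: message -> guess
lr = 0.5

def sample(logits):
    p = np.exp(logits - logits.max()); p /= p.sum()
    return rng.choice(len(p), p=p), p

for step in range(2000):
    concept = rng.integers(n_concepts)
    msg, p_s = sample(speaker[concept])        # speaker "talks"
    guess, p_l = sample(listener[msg])         # listener "acts"
    reward = 1.0 if guess == concept else 0.0  # shared task reward

    # REINFORCE: increase log-prob of the sampled actions, scaled by reward.
    grad_s = -p_s; grad_s[msg] += 1.0
    grad_l = -p_l; grad_l[guess] += 1.0
    speaker[concept] += lr * reward * grad_s
    listener[msg] += lr * reward * grad_l

# After training, the agents typically agree on some (arbitrary) protocol.
print(speaker.argmax(axis=1), listener.argmax(axis=1))
```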

And so, yeah, in that sense there has been some work, but it's very limited. I think there are some groups that try to study this, but not much beyond that. OK, the last two questions: there's one about whether we learn language more from social cues or from a pure reward-based system.

I don't know if either of you has opinions about this, but if you do. Yeah, I mean, I don't have anything very deep to say about this question; it's on the importance of social cues as opposed to pure reward-based systems. Well, in some sense a social cue can also be regarded as a reward: people like it when what they say puts a smile on someone else's face.

But I do think, generally, when people ask what we have not covered, another thing that we've barely covered is the social side of language. A hugely interesting thing about language is that it has this very big dynamic range. On the one hand, you can talk about very precise things in language.

So you can talk about math formulas and steps in a proof and things like that, so there's a lot of precision in language. On the other hand, you can just emphatically mumble whatever words at all, and you're not really communicating anything in the way of propositional content.

What you're really trying to communicate is: I'm thinking about you right now, I'm concerned with how you're feeling, or whatever it is in the circumstances. So a huge part of language use is forms of social communication between human beings.

And that's another big part of actually building successful natural language systems. If you think about what's negative about the virtual assistants I've been falling back on a lot, it's that they have virtually no ability as social language users.

So we're now training a generation of little kids that what you should do is bark out commands as if you were serving in the German army in World War II or something, and there's none of the social part of how to use language to communicate satisfactorily with human beings and maintain a social system.

And that's a huge part of human language use that kids have to learn and learn to use successfully. A lot of being successful in the world is knowing, when you want someone to do something for you, that there are good ways to ask for it.

Some of it is the choice of how to present the arguments, but some of it is building social rapport, asking nicely and reasonably, and making it seem like you're a sweet person that other people should do something for. And human beings are very good at that.

And being good at that is a really important skill for being able to navigate the world well.