Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 18 - Future of NLP + Deep Learning
Chapters
0:00
2:35 General Representation Learning Recipe
4:46 Facts about GPT-3
21:40 Systematicity and Language Grounding
22:18 Principle of Compositionality
22:38 Are Human Languages Really Compositional
24:04 Are Neural Representations Compositional
28:01 The Information Bottleneck Theory
30:19 Producing Compositionally Challenging Splits
30:29 Normalized Frequency Distribution of the Atoms
34:08 Dynamic Benchmarks
37:33 Language Grounding
44:21 Breakpoints
58:37 Attention
00:00:00.000 |
Good afternoon, folks. Welcome to lecture 18. Today, we'll be talking about some of 00:00:10.240 |
the latest and greatest developments in neural NLP, where we've come and where we're headed. 00:00:16.720 |
Chris, just to be sure, are my presenter notes visible from this part? 00:00:32.000 |
So just as a reminder, note that your guest lecture reactions are due tomorrow at 11:59 pm. 00:00:37.960 |
Great job with the project milestone reports. 00:00:44.480 |
I think we had some last minute issues, but if that's not resolved, please contact us. 00:00:51.120 |
Finally, the project reports are due very soon, on March 16th, which is next week. 00:00:57.160 |
There's one question on Ed about the leaderboard, and the last day to submit on the leaderboard 00:01:07.840 |
Okay, so for today, we'll start by talking about extremely large language models and 00:01:13.360 |
GPT-3 that have recently gained a lot of popularity. 00:01:18.520 |
We'll then take a closer look at compositionality and generalization of these neural models. 00:01:26.400 |
While transformer models like BERT and GPT have really high performance on all benchmarks, 00:01:30.520 |
they still fail in really surprising ways when deployed. 00:01:33.640 |
How can we strengthen the way we evaluate these models so that benchmark numbers more closely reflect real-world performance? 00:01:44.280 |
And then we end by talking about how we can move beyond this really limited paradigm of 00:01:48.640 |
teaching models language only through text and look at language grounding. 00:01:53.760 |
Finally, I'll give some practical tips on how to move forward in your neural NLP research, 00:01:59.600 |
and this will include some practical tips for the final project as well. 00:02:07.040 |
So this meme really kind of captures what's been going on in the field. 00:02:14.680 |
And it's just that our ability to harness unlabeled data has vastly increased over the last few years. 00:02:20.240 |
And this has been made possible due to advances in not just hardware, but also systems and 00:02:26.120 |
our understanding of self-supervised training so we can use lots and lots of unlabeled data. 00:02:34.640 |
So based on this, here is a general representation learning recipe that just works for basically any modality. 00:02:47.460 |
So step 1: convert your data, and this is really modality agnostic. 00:02:56.560 |
You take your data, whether it's images, text, or videos, and you convert it into a sequence of integers. 00:03:02.280 |
And in step 2, you define a loss function to maximize the data likelihood, or create a denoising autoencoder-style objective. 00:03:08.960 |
Finally, in step 3, train on lots and lots of data. 00:03:14.760 |
Different properties emerge only when we scale up model size. 00:03:17.280 |
And this is really the surprising fact about scale. 00:03:20.660 |
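To make the three steps concrete, here is a minimal sketch of the recipe in PyTorch. The toy character-level tokenizer and the tiny GRU are purely illustrative stand-ins (a real system would use a learned tokenizer and a large transformer), so treat this as a sketch of the shape of the recipe, not of any particular model.

```python
# A minimal sketch of the three-step recipe, assuming PyTorch.
import torch
import torch.nn as nn

# Step 1: convert your data into a sequence of integers (toy character-level tokenizer).
text = "language models predict the next token"
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
ids = torch.tensor([stoi[ch] for ch in text])

# A tiny stand-in model; in practice this would be a large transformer.
class TinyLM(nn.Module):
    def __init__(self, vocab_size, d=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d)
        self.rnn = nn.GRU(d, d, batch_first=True)
        self.out = nn.Linear(d, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.out(h)  # logits over the next token at every position

model = TinyLM(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()  # Step 2: a loss that maximizes data likelihood

# Step 3: train on lots and lots of data (here, a tiny illustrative loop).
for step in range(100):
    x, y = ids[None, :-1], ids[None, 1:]  # predict token t+1 from tokens up to t
    logits = model(x)
    loss = loss_fn(logits.reshape(-1, len(vocab)), y.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
```

On real data the same skeleton holds; only the tokenizer, the architecture, and the amount of data change.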
So to give some examples of this recipe in action, here's GPT-3, which can learn to do 00:03:26.620 |
a really non-trivial classification problem with just two demonstrations. 00:03:34.660 |
Another example, as we saw in lecture 14, is T5, which does really effective closed-book question answering. 00:03:43.380 |
Finally, just so I cover another modality, here's a recent text-to-image generation model 00:03:51.460 |
with really impressive zero-shot generalization. 00:04:02.020 |
This table presents some numbers to put things in perspective. 00:04:07.660 |
So we have a collection of models starting with medium-sized LSTMs, which was a staple 00:04:12.380 |
in pre-2016 NLP, all the way to humans who have 100 trillion synapses. 00:04:18.580 |
And somewhere in the middle, we have GPT-2 with over a billion parameters, and GPT-3 with 175 billion parameters. 00:04:26.580 |
And this exceeds the number of synaptic connections in a honeybee brain. 00:04:31.860 |
So obviously, anyone with a little knowledge of neuroscience knows that this is not an 00:04:36.180 |
apples-to-apples comparison. 00:04:40.220 |
But the point here is that the scale of these models is really starting to reach astronomical levels. 00:04:48.500 |
So what do we know about GPT-3? For one, it's a large transformer with 96 layers. 00:04:53.980 |
It has more or less the same architecture as GPT-2, with the exception that to scale 00:04:59.420 |
up attention computation, it uses these locally-banded sparse attention patterns. 00:05:04.060 |
And I really encourage you to look at the paper to understand the details. 00:05:07.340 |
The reason we mention this here is because it kind of highlights that scaling up is not simply 00:05:11.100 |
a matter of changing hyperparameters, as many might believe. 00:05:14.220 |
And it involves really non-trivial engineering and algorithms to make computations efficient. 00:05:19.420 |
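To give a rough sense of what a locally banded attention pattern is, here is a simplified sketch of such a mask. The actual GPT-3 layers alternate dense and sparse patterns following the Sparse Transformer work, so this is an illustration of the idea, not the exact pattern.

```python
# Sketch of a causal, locally banded attention mask: each position may attend only to
# itself and the previous (window - 1) positions, instead of to every earlier position.
import torch

def local_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    causal = j <= i                         # never attend to the future
    local = (i - j) < window                # only a local band of past keys
    return causal & local                   # True where attention is allowed

print(local_causal_mask(seq_len=8, window=3).int())
# Per-query cost drops from O(seq_len) to O(window), which is what makes
# attention cheaper to scale to long sequences.
```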
Finally, all of this is trained on 500 billion tokens taken from the Common Crawl, the Toronto Books Corpus, Wikipedia, and other sources. 00:05:32.140 |
So let's look at some of the results from the paper first. 00:05:35.140 |
So obviously, it does better on language modeling and text completion problems. 00:05:39.420 |
As you can see from this table, it does better than GPT-2 at language modeling on the Penn 00:05:44.220 |
Treebank, as well as better on the story completion data set called LAMBADA. 00:05:50.340 |
To give a flavor of what's to come, let's take a closer look at this LAMBADA story completion task. 00:05:56.200 |
So the task here is that we're given a short story, and we are supposed to fill in the final word. 00:06:00.660 |
Satisfying the constraints of the problem can be hard for a language model, which could otherwise generate a continuation longer than a single word. 00:06:08.620 |
But with GPT-3, the really new thing is that we can just give a few examples as prompts 00:06:13.260 |
and sort of communicate a task specification to the model. 00:06:15.660 |
And now, GPT-3 knows that the completion must be a single word. 00:06:21.300 |
And we give some more examples of this in-context learning in a couple more slides. 00:06:27.580 |
So apart from language modeling, it's really good at these knowledge-intensive tasks, like 00:06:32.900 |
closed-book QA, as well as reading comprehension. 00:06:36.340 |
And here, we observe that scaling up parameters results in a massive improvement in performance. 00:06:44.780 |
GPD 3 demonstrates some level of fast adaptation to completely new tasks. 00:06:50.380 |
This happens via what's called in-context learning. 00:06:53.020 |
As shown in the figure, the model training can be characterized as having an outer loop 00:06:58.300 |
that learns a set of parameters that makes the learning of the inner loop as efficient 00:07:04.780 |
And with this sort of framework in mind, we can really see how a good language model can 00:07:14.020 |
So in this segment, we will have some fun with GPT-3 and look at some demonstrations that people have built. 00:07:22.540 |
So to start off, here is an example where someone's trying to create an application 00:07:26.860 |
that converts a language description to bash one-liners. 00:07:32.380 |
The first three examples are prompts, followed by generated examples from GPT-3. 00:07:41.860 |
It probably just involves looking at your hash table. 00:07:44.100 |
Some of the more challenging ones involve copying over spans from the text; the 00:07:51.060 |
scp example is kind of interesting, as well as the harder-to-parse grep one. 00:07:56.300 |
The scp example comes up a lot during office hours, so GPT-3 knows how to do that. 00:08:03.980 |
Here's a somewhat more challenging one, where the model is given a description of a database 00:08:07.580 |
in natural language, and it starts to emulate that behavior. 00:08:12.580 |
So the text in bold is sort of the prompt given to the model. 00:08:16.380 |
The prompt includes somewhat of a function specification of what a database is. 00:08:22.980 |
So it says that the database begins knowing nothing. 00:08:25.660 |
The database knows everything that's added to it. 00:08:30.340 |
And when you ask a question to the database, if the answer is there in the database, the database should return it. 00:08:34.980 |
Otherwise, it should say it does not know the answer. 00:08:42.580 |
And the prompt also includes some example usages. 00:08:45.540 |
So when you ask 2+2, the database does not know. 00:08:48.180 |
When you ask the capital of France, the database does not know. 00:08:51.300 |
And then you add in a fact that Tom is 20 years old to the database. 00:08:55.340 |
And now you can start asking it questions like, where does Tom live? 00:08:59.300 |
And as expected, it says that the database does not know. 00:09:09.860 |
The database says basically that it does not know, because that's not been added. 00:09:18.140 |
Now in this example, the model is asked to blend concepts together. 00:09:22.380 |
And so there's a definition of what does it mean to blend concepts. 00:09:25.820 |
So if you take airplane and car, you can blend that to give flying car. 00:09:31.140 |
That's essentially like there's a Wikipedia definition of what concept blending is, along with a few examples. 00:09:38.340 |
Now let's look at some prompts followed by what GPT-3 answers. 00:09:43.780 |
So the first one is straightforward, two-dimensional space blended with 3D space gives 2.5-dimensional space. 00:09:51.140 |
The one that is somewhat interesting is old and new gives recycled. 00:10:02.020 |
The one that's really non-trivial is geology plus neurology. 00:10:06.180 |
It's just sediment neurology, and I had no idea what this was. 00:10:11.780 |
So clearly, it's able to do these very flexible things just from a prompt. 00:10:18.620 |
So here's another class of examples that GPT-3 gets somewhat right. 00:10:25.620 |
And these are these copycat analogy problems, which have been really well studied in cognitive science. 00:10:32.180 |
And the way it works is that I'm going to give you some examples and then ask you to 00:10:37.860 |
induce a function from these examples and apply it to new queries. 00:10:41.860 |
So if ABC changes to ABD, what does PQR change to? 00:10:45.300 |
Well, PQR must change to PQS, because the function we've learned is that the last letter is replaced by its successor in the alphabet. 00:10:52.220 |
And this function, humans can now apply to examples of varying types. 00:10:57.220 |
So like P repeated twice, Q repeated twice, R repeated twice must change to P repeated 00:11:02.180 |
twice, Q repeated twice, and S repeated twice. 00:11:06.060 |
And it seems like GPT-3 is able to get them right, more or less. 00:11:10.980 |
But the problem is that if you ask it to generalize to examples that have a larger number of 00:11:19.380 |
repetitions than were seen in the prompt, it's not able to do that. 00:11:23.180 |
So in this situation, you ask it to make an analogy where the letters are repeated four 00:11:31.340 |
times, and it's never seen that before and doesn't know what to do. 00:11:36.740 |
So there's a point to be made here that maybe these prompts are just not enough to convey 00:11:44.020 |
the function the model should be learning, and maybe with even more examples it could learn it. 00:11:48.380 |
But it probably does not have the same kinds of generalization that humans have. 00:11:55.620 |
And that brings us to the limitations of these models and some open questions. 00:12:01.180 |
So just looking at the paper and passing through the results, it seems like the model is bad 00:12:06.500 |
at logical and mathematical reasoning, anything that involves doing multiple steps of reasoning. 00:12:14.020 |
And that explains why it's bad at arithmetic, why it's bad at word problems, why it's 00:12:18.300 |
not great at analogy making, and even at traditional textual entailment data sets that seem to require this kind of reasoning. 00:12:27.780 |
The second, more subtle point is that it's unclear how we can make permanent updates to the model. 00:12:33.540 |
Maybe if I want to teach the model a new concept, it's possible to do that while I'm interacting with it. 00:12:39.580 |
But once the interaction is over, it restarts and does not retain that knowledge. 00:12:44.500 |
And it's not that this is something that the model cannot do in principle, but just something we don't currently have a mechanism for. 00:12:52.020 |
Third, it doesn't seem to exhibit human-like generalization, which is often called systematicity. 00:13:06.080 |
And finally, maybe the aspects of meaning that it acquires from text alone are somewhat limited. 00:13:09.820 |
And maybe we should explore how we can bring in other modalities. 00:13:13.780 |
So we'll talk a lot more about these last two limitations in the rest of the lecture. 00:13:20.540 |
But maybe I can take some questions now if there are any. 00:13:34.140 |
I don't think there's a big outstanding question. 00:13:37.620 |
But I mean, I think some people aren't really clear on the few-shot setting and prompting versus fine-tuning. 00:13:46.900 |
And I think it might actually be good to explain that a bit more. 00:13:52.180 |
So maybe let's-- let me pick a simple example. 00:14:04.660 |
So prompting just means that-- so GPT-3, if you go back to first principles, GPT-3 is a language model. 00:14:13.100 |
And what that means is given a context, it'll tell you what's the probability of the next word. 00:14:20.700 |
So if I give it a context, w1 through wk, GPT-3 will tell me what's the probability of the next word, wk+1, given that context. 00:14:35.820 |
A prompt is essentially a context that gets prepended before GPT-3 can start generating. 00:14:43.060 |
And what's happening with in-context learning is that the context that 00:14:48.820 |
you prepend to GPT-3 is basically a set of x-y examples. 00:14:57.220 |
And the reason why it's equivalent to few-shot learning is because you prepend only a handful of examples. 00:15:05.660 |
So in this case, if I just prepend this one example that's highlighted in purple, then 00:15:10.900 |
that's essentially one-shot learning because I just give it a single example as context. 00:15:16.820 |
And now, given this query, which is also appended to the context, it has to make a prediction. 00:15:25.940 |
So the input-output format is the same as how a few-shot learner would receive. 00:15:31.820 |
But since it's a language model, the training data set is essentially presented as a context. 00:15:40.220 |
So someone is still asking, can you be more specific about the in-context learning setups? 00:15:54.620 |
Maybe I can go to-- yeah, so maybe I can go to this slide. 00:16:03.900 |
So the task is just that it's a language model. 00:16:08.620 |
So it gets a context, which is just a sequence of tokens. 00:16:13.580 |
And the task is just that you have a sequence of tokens, 00:16:18.760 |
and then the model has to generate a continuation given that sequence of tokens. 00:16:23.320 |
And the way you can convert that into an actual machine learning classification problem is 00:16:27.940 |
that-- so for this example, maybe you give it 5 plus 8 equals 13, 7 plus 2 equals 9, and then 1 plus 0 equals, and let it fill in the blank. 00:16:41.700 |
So that's how you convert it into a classification problem. 00:16:44.860 |
The context here would be these two examples of arithmetic, like 5 plus 8 equals 13 and 7 plus 2 equals 9. 00:16:55.420 |
And then the model, since it's just a language model, has to fill in 1 plus 0 equals question mark. 00:17:05.020 |
And if it fills in a 1, it has done the right job. 00:17:09.640 |
So that's how you can take a language model and do few-shot learning with it. 00:17:16.780 |
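Here is a small sketch of what that looks like mechanically. The generate function is a placeholder for whatever interface you use to sample a continuation (an API call, model.generate in Hugging Face, and so on), so treat it as an assumption rather than a specific API.

```python
# Sketch of few-shot / in-context learning: the "training set" is just concatenated
# into the prompt, and the model's continuation is read off as the prediction.
# No gradient updates happen anywhere.
train_examples = [("5 + 8 =", "13"), ("7 + 2 =", "9")]  # the few "shots"
query = "1 + 0 ="

prompt = "".join(f"{x} {y}\n" for x, y in train_examples) + query
print(prompt)
# 5 + 8 = 13
# 7 + 2 = 9
# 1 + 0 =

def generate(prompt, max_tokens=1):
    # Placeholder: plug in your favorite language model here.
    raise NotImplementedError

# prediction = generate(prompt)  # a good model should continue with "1"
```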
How is in-context learning different from transfer learning? 00:17:21.140 |
So I guess in-context learning-- I mean, you can think of in-context learning as being a special case of transfer learning. 00:17:33.220 |
But transfer learning does not specify the mechanism through which the transfer is going 00:17:39.140 |
With in-context learning, the mechanism is that the training examples are sort of appended 00:17:45.620 |
to the model, which is a language model, just in order. 00:17:55.900 |
And these are just appended directly to the model. 00:17:58.600 |
And now it makes prediction on some queries that are drawn from this data set. 00:18:05.220 |
So yes, it is a subcategory of transfer learning. 00:18:09.100 |
But transfer learning does not specify exactly how this transfer learning is achieved. 00:18:14.180 |
But in-context learning is very specific and says that for language models, you can essentially 00:18:19.680 |
concatenate the training data set and then present that to the language model. 00:18:25.780 |
People still aren't sufficiently clear on what is or isn't happening with learning and fine-tuning here. 00:18:33.980 |
So another question is, so in-context learning still needs fine-tuning, question mark? 00:18:39.420 |
We need to train GPT-3 to do in-context learning? 00:19:02.180 |
No, there's no fine-tuning involved. The model is trained as a language model. 00:19:05.780 |
And once the model is trained, you can now use it to do transfer learning. 00:19:11.460 |
And in in-context learning, the model parameters are fixed. 00:19:17.740 |
All you do is that you give this small training set to the model, which is just appended to its context. 00:19:26.060 |
And now the model can start generating from that point on. 00:19:29.740 |
So in this example, 5 plus 8 equals 13 and 7 plus 2 equals 9 are the two x-y examples. 00:19:38.540 |
In vanilla transfer learning, what you would do is that you would take some gradient steps, 00:19:43.100 |
update your model parameters, and then make a prediction on 1 plus 0 equals what. 00:19:47.620 |
But in in-context learning, all you're doing is you just concatenate 5 plus 8 equals 13 00:19:54.420 |
and 7 plus 2 equals 9 to the model's context window, and then make it predict what 1 plus 0 equals. 00:20:04.660 |
Maybe we should end for now with one other bigger picture question, which is, do you 00:20:13.740 |
know of any research combining these models with reinforcement learning for the more complicated reasoning tasks? 00:20:22.700 |
There is some recent work on kind of trying to align language models with human preferences, 00:20:30.340 |
where yes, there is some amount of fine-tuning with reinforcement learning based on these preferences. 00:20:38.820 |
So maybe you want to do a summarization problem in GPT-3. 00:20:45.300 |
And for each summary, maybe you have a reward that is essentially a human preference. 00:20:50.260 |
Maybe I want to include some facts, and I don't want to include some other non-important details. 00:20:56.060 |
So I can construct a reward out of that, and I can fine-tune the parameters of my language 00:21:00.820 |
model basically using reinforcement learning based on this reward, which essentially encodes human preferences. 00:21:08.260 |
So there's some very recent work that tries to do this. 00:21:11.180 |
But I'm not sure-- yeah, I'm not aware of any work that tries to use reinforcement learning for these more complicated reasoning tasks. 00:21:17.940 |
But I think it's an interesting future direction to explore. 00:21:34.100 |
OK, so we'll talk a bit more about these last two points, so systematicity and language grounding. 00:21:48.220 |
So just to start off, how do you define systematicity? 00:21:51.820 |
So really, the definition is that there is a definite and predictable pattern among the 00:21:56.540 |
sentences that native speakers of a language understand. 00:22:00.540 |
And so there's a systematic pattern among the sentences that we understand. 00:22:04.500 |
What that means is, let's say there's a sentence like, John loves Mary. 00:22:09.020 |
And if a native speaker understands that sentence, then they should also be able to understand the sentence Mary loves John. 00:22:17.020 |
And closely related to this idea of systematicity is the principle of compositionality. 00:22:21.780 |
And for now, I'm going to ignore the definition by Montague and just look at the rough definition. 00:22:26.660 |
And then we can come back to this other more concrete definition. 00:22:30.620 |
The rough definition is essentially that the meaning of an expression is a function of the meanings of its parts and how they are combined. 00:22:37.860 |
So that brings us to the question, are human languages really compositional? 00:22:42.340 |
And here are some examples that make us think that maybe, yes. 00:22:47.900 |
So if you look at what is the meaning of the noun phrase brown cow, it is composed of 00:22:53.180 |
the meaning of the adjective brown and the noun cow. 00:22:58.660 |
So take all things that are brown and all things that are cows, take the intersection, and you get the brown cows. 00:23:03.300 |
Similarly for red rabbits: all things that are red and all things that are rabbits, combine them and you get the red rabbits. 00:23:07.820 |
And then kick the ball, this verb phrase can be understood as there being some agent that's kicking the ball. 00:23:16.180 |
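In symbols, using the double-bracket "meaning of" notation that is standard in semantics (not something introduced in this lecture), the intersective examples and the general principle look like this:

```latex
% Intersective composition for the simple cases (assuming the amsmath package for \text):
\[
  [\![\text{brown cow}]\!] \,=\, [\![\text{brown}]\!] \cap [\![\text{cow}]\!],
  \qquad
  [\![\text{red rabbit}]\!] \,=\, [\![\text{red}]\!] \cap [\![\text{rabbit}]\!]
\]
% More generally, compositionality says the meaning of a phrase is some function
% of the meanings of its parts and the way they are combined:
\[
  [\![\,x \; y\,]\!] \,=\, f\big([\![x]\!],\, [\![y]\!]\big)
\]
```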
But it is not always the case that you can get the meaning of the whole by combining the meanings of the parts. 00:23:23.700 |
So here, we have some counter examples that people often use. 00:23:26.700 |
So red herring does not mean all things that are red and all things that are herring. 00:23:31.260 |
And kick the bucket definitely does not mean that there's an agent that's kicking the bucket. 00:23:35.700 |
So while these examples are supposed to be provocative, we think that language is mostly compositional. 00:23:42.900 |
There's lots of exceptions, but for a vast majority of sentences that we've never heard 00:23:47.540 |
before, we're able to understand what they mean by piecing together the meanings of the words that the sentence is made of. 00:23:54.100 |
And so what that means is that maybe compositionality of representations is a helpful prior that we would like our models to have. 00:24:02.740 |
And that brings us to the questions that we ask in the segment, are neural representations 00:24:08.100 |
And the second question is, if so, do they generalize systematically? 00:24:12.420 |
So how do you even measure if representations that a neural network learns exhibit compositionality? 00:24:23.500 |
So let's go back to this definition from Montague, which says that compositionality is about 00:24:29.780 |
the existence of a homomorphism from syntax to semantics. 00:24:34.700 |
And to look at that, we have this example, which is Lisa does not skateboard. 00:24:40.660 |
And we have a syntax tree corresponding to this example. 00:24:44.700 |
And the meaning of the sentence can be composed according to the structure that's given by this syntax tree. 00:24:52.900 |
So the meaning of Lisa does not skateboard is a function of the meaning of Lisa and the meaning of does not skateboard. 00:24:58.740 |
The meaning of does not skateboard is a function of does and not skateboard. 00:25:01.820 |
The meaning of not skateboard is a function of not and skateboard. 00:25:06.820 |
And so this gives us one way of formalizing how we can measure compositionality in neural networks. 00:25:14.260 |
And so compositionality of representations could be thought of as how well the representation 00:25:19.700 |
approximates an explicitly homomorphic function in a learned representation space. 00:25:26.340 |
So what we're going to do is essentially measure if we were to construct a neural network whose 00:25:32.340 |
computations are based exactly according to these parse trees, how far are the representations 00:25:37.660 |
of a learned model from this explicitly compositional representation? 00:25:44.020 |
And that'll give us some understanding of how compositional the neural network's representations are. 00:25:50.500 |
So to unpack that a little bit, instead of having denotations, we now have vectors in a learned representation space. 00:26:03.700 |
And to be more concrete about that, we first start by choosing a distance function that 00:26:09.780 |
tells us how far away two representations are. 00:26:12.660 |
And then we also need a way to compose together two constituents to give us the meaning of the larger constituent. 00:26:21.740 |
But once we have that, we can start by-- we can create an explicitly compositional function, 00:26:28.460 |
So what we do is we have these representations at the leaves that are initialized randomly 00:26:37.660 |
and the composition function that's also initialized randomly. 00:26:40.660 |
And then a forward pass according to this syntax tree is used to compute the representation of the whole sentence. 00:26:48.060 |
And now once you have this representation, you can create a loss function. 00:26:51.700 |
And this loss function measures how far the representations of my neural network are from 00:26:57.500 |
those of this second, explicitly compositional proxy network that I've created. 00:27:02.180 |
And then I can basically optimize both the composition function and the embeddings of the leaves to minimize this loss. 00:27:10.220 |
And then once the optimization is finished, I can measure how far the representations 00:27:15.860 |
of my neural net are from this explicitly compositional network on a held-out set. 00:27:22.380 |
And that then tells me whether the representations my neural net learned were actually compositional or not. 00:27:28.780 |
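Here is a heavily simplified sketch of that procedure, in the spirit of tree reconstruction error. The composition function (a linear layer plus tanh over the concatenated children) and the cosine distance are choices I am assuming for illustration; the original formulation lets you pick both.

```python
# Sketch: fit an explicitly compositional "proxy" whose representations are built strictly
# bottom-up along the parse tree, then measure how far the probed model's representations
# are from it. A tree is either an int (word id) or a (left, right) pair of subtrees.
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 64                                   # assumed to match the probed model's hidden size
leaf_emb = nn.Embedding(1000, d)         # randomly initialized leaf (word) embeddings
compose = nn.Linear(2 * d, d)            # randomly initialized composition function

def distance(a, b):
    return 1 - F.cosine_similarity(a, b, dim=-1)   # chosen distance between representations

def tree_rep(tree):
    if isinstance(tree, int):
        return leaf_emb(torch.tensor(tree))
    left, right = tree
    return torch.tanh(compose(torch.cat([tree_rep(left), tree_rep(right)], dim=-1)))

def fit_proxy(trees, model_reps, steps=200):
    """Optimize leaves + composition so the tree-structured representations match the
    probed model's representations; the distance that remains on held-out sentences
    is the (approximate) tree reconstruction error."""
    params = list(leaf_emb.parameters()) + list(compose.parameters())
    opt = torch.optim.Adam(params, lr=1e-2)
    for _ in range(steps):
        loss = torch.stack([distance(tree_rep(t), r) for t, r in zip(trees, model_reps)]).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```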
So to see how well this works, let's look at a plot. 00:27:36.860 |
But just to unpack this a little bit, it plots the mutual information between the input that 00:27:45.840 |
the neural network receives and its representation, against the tree reconstruction error that we just described. 00:27:55.740 |
And to give some more background about what's to come, there is a theory which is called 00:28:02.420 |
the information bottleneck theory, which says that as a neural network trains, it first 00:28:09.980 |
tries to maximize the mutual information between the representation and the input in an attempt to memorize the data. 00:28:22.840 |
And then once memorization is done, there is a learning or a compression phase where this mutual information decreases. 00:28:31.060 |
And the model is essentially trying to compress the data or consolidate the knowledge in its representations. 00:28:38.140 |
And what we are seeing here is that as a model learns, which is characterized by decreasing 00:28:43.380 |
mutual information, we see that the representations themselves are becoming more and more compositional. 00:28:50.620 |
And overall, we observe that learning is correlated with increased compositionality as measured by this tree reconstruction metric. 00:29:01.900 |
So now that we have a method of measuring compositionality of representations in these 00:29:08.060 |
neural nets, how do we start to create benchmarks that see if they are generalizing systematically 00:29:18.000 |
So to do that, here is a method for taking any data set and splitting it into a train 00:29:23.580 |
test split that explicitly tests for this kind of generalization. 00:29:31.820 |
So to do that, we use this principle called maximizing the compound divergence. 00:29:38.140 |
And to illustrate how this principle works, we look at this toy example. 00:29:43.700 |
So in this toy example, we have a training data set that consists of just two examples 00:29:51.780 |
The atoms are defined as the primitive elements, so entity words, predicates, question types. 00:29:58.940 |
So in this toy example, Goldfinger, Christopher Nolan, these are all the primitive elements. 00:30:05.500 |
And the compounds are compositions of these primitive elements. 00:30:08.460 |
So 'who directed entity' would be a compound: the composition of the question type 'who directed' with an entity. 00:30:19.140 |
So here's a basic machinery for producing compositionally challenging splits. 00:30:23.600 |
So let's start by introducing two distributions. 00:30:27.700 |
The first distribution is the normalized frequency distribution of the atoms. 00:30:33.020 |
So given any data set, if we know what the notion of atoms are, we can basically compute 00:30:38.740 |
the frequency of all of the atoms and then normalize that by the total count. 00:30:43.460 |
And that's going to give us one distribution. 00:30:47.260 |
And we can repeat the same thing for the compounds. 00:30:49.900 |
And that will give us a second frequency distribution. 00:30:53.720 |
So note that these are just two probability distributions. 00:30:57.860 |
And once we have these two distributions, we can essentially define the atom and compound divergences between the train and test splits. 00:31:08.860 |
And we do that using the Chernoff coefficient between two categorical distributions. 00:31:15.060 |
The Chernoff coefficient basically measures how far two categorical distributions are. 00:31:20.820 |
So just to get a bit more intuition about this, if we set p equal to q, then the Chernoff 00:31:26.440 |
coefficient is 1, which means these distributions are maximally similar. 00:31:32.440 |
And if p and q have disjoint supports, that is, p is 0 everywhere q is non-zero and 00:31:40.560 |
vice versa, then the Chernoff coefficient is exactly 0, which means that these two distributions are maximally different. 00:31:49.240 |
And the overall objective is just that 00:31:58.080 |
we are going to maximize the compound divergence while minimizing the atom divergence. 00:32:03.780 |
And so what is the intuition behind doing such a thing? 00:32:06.500 |
So what we want is to ensure that the unigram distribution, in some sense, is constant between 00:32:12.380 |
the train and test split so that the model does not encounter any new words. 00:32:18.820 |
But we want the compound divergence to be very high, which means that these same words 00:32:24.660 |
that the model has seen many times must appear in new combinations, which means that we are explicitly testing for compositional generalization. 00:32:33.380 |
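As a rough sketch, the Chernoff coefficient and the resulting divergences might be computed as follows. The alpha values in the usage comment are the ones I recall the paper using for atoms and compounds, so treat them as placeholders rather than a definitive setting.

```python
# Chernoff coefficient between two normalized frequency distributions p and q:
#   C_alpha(p, q) = sum_k p_k^alpha * q_k^(1 - alpha)
# It is 1 when p == q and 0 when p and q have disjoint support; divergence = 1 - C_alpha.
from collections import Counter

def chernoff(p: dict, q: dict, alpha: float) -> float:
    keys = set(p) | set(q)
    return sum(p.get(k, 0.0) ** alpha * q.get(k, 0.0) ** (1 - alpha) for k in keys)

def divergence(train_items, test_items, alpha: float) -> float:
    def normalize(items):
        counts = Counter(items)
        total = sum(counts.values())
        return {k: v / total for k, v in counts.items()}
    return 1.0 - chernoff(normalize(train_items), normalize(test_items), alpha)

# Hypothetical usage, where atoms(...) / compounds(...) extract the primitive elements
# and their combinations from each example:
# atom_div     = divergence(train_atoms, test_atoms, alpha=0.5)
# compound_div = divergence(train_compounds, test_compounds, alpha=0.1)
# A maximum-compound-divergence split keeps atom_div small and compound_div large.
```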
And so if you follow this procedure for a semantic parsing data set, let's say, what 00:32:40.060 |
we see is that as you increase the scale, this model just does better and better. 00:32:50.020 |
But just pulling out a quote from this paper, "pre-training helps for compositional generalization ..." 00:32:57.180 |
And what that means is that maybe as you keep scaling up these models, you'll see better 00:33:00.820 |
and better performance, or maybe it starts to saturate at some point. 00:33:06.940 |
In any case, we should probably be thinking more about this problem instead of just trying to scale up. 00:33:14.260 |
So now this segment tells us that depending on the way we split a data set, we can measure very different kinds of generalization. 00:33:25.020 |
And that tells us that maybe we should be thinking more critically about how we're evaluating our models. 00:33:31.600 |
So there has been a revolution basically over the last few years in the field where we're 00:33:37.260 |
seeing all of these large transformer models beat all of our benchmarks. 00:33:40.500 |
At the same time, there is still not complete confidence that once we deploy these systems 00:33:46.460 |
in the real world, they're going to maintain their performance. 00:33:51.940 |
And so it's unclear if these gains are coming from spurious correlations or some real task understanding. 00:33:56.780 |
And so how do we design benchmarks that accurately tell us how well this model is going to do in the real world? 00:34:03.100 |
And so I'm going to give one example of works that try to do this. 00:34:11.180 |
And the idea of dynamic benchmarks is basically saying that instead of testing our models 00:34:17.140 |
on static test sets, we should be evaluating them on an ever-changing dynamic benchmark. 00:34:27.420 |
And the idea dates back to a 2017 workshop at EMNLP. 00:34:33.300 |
And so the overall schematic looks something like this, that we start with a training data 00:34:38.060 |
set and a test data set, which is the standard static setup. 00:34:44.020 |
And then once the model is trained, we deploy that and then have humans create new examples 00:34:54.940 |
that the model does not get right, but that humans have no issue figuring out the answer to. 00:34:59.860 |
So by playing this game of whack-a-mole, where humans figure out what are the holes 00:35:06.540 |
in the model's understanding, and then add that back into the training data, re-train 00:35:11.420 |
the model, deploy it again, have humans create new examples, we can essentially construct 00:35:16.060 |
this never-ending data set, this never-ending test set, which can hopefully be a better reflection of real-world performance. 00:35:28.500 |
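As pseudocode, the loop looks roughly like this; both helper functions are placeholders for real infrastructure (your training pipeline and a human-in-the-loop annotation platform), not an actual API.

```python
# Sketch of the dynamic benchmarking loop described above.
def train(dataset):
    raise NotImplementedError("stand-in for your model training code")

def collect_fooling_examples(model):
    raise NotImplementedError("stand-in for humans writing examples the model gets wrong")

def dynamic_benchmark(train_set, rounds=3):
    test_rounds = []
    model = train(train_set)
    for _ in range(rounds):
        # Humans probe the deployed model and write examples it answers incorrectly
        # but that people find easy; these form a new, harder test round.
        fooled = collect_fooling_examples(model)
        test_rounds.append(fooled)
        # Fold the new examples back into training, retrain, and deploy again.
        train_set = train_set + fooled
        model = train(train_set)
    return model, test_rounds
```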
So this is some really cutting-edge research. 00:35:33.700 |
And one of the main challenges of this class of works is that it's unclear how much this 00:35:39.180 |
can scale up, because maybe after multiple iterations of this whack-a-mole, humans are no longer able to come up with examples that fool the model. 00:35:49.380 |
So figuring out how to deal with that is really an open problem. 00:35:55.060 |
And current approaches just use examples from other data sets to prompt humans to think of new examples. 00:36:01.360 |
But maybe we can come up with better, more automated methods of doing this. 00:36:11.940 |
Or actually, let me stop for questions at this point and see if people have questions. 00:36:29.980 |
With dynamic benchmark, doesn't this mean that the model creator will also need to continually 00:36:36.060 |
test/evaluate the models on the new benchmarks, new data sets? 00:36:48.740 |
Yeah, so with dynamic benchmarks, yes, it's absolutely true that you will have to continuously retrain and re-evaluate your models. 00:36:58.940 |
And that's just to ensure that the reason your model is not doing well on the test set 00:37:06.340 |
doesn't have to do with this domain mismatch. 00:37:09.820 |
And what we're really trying to do is just come up with a better estimate 00:37:17.180 |
of the model's performance on the overall task, and get more and more accurate estimates of that. 00:37:23.420 |
So yes, to answer your question, yes, we need to keep training the model again and again. 00:37:35.860 |
So in this final segment, I'll talk about how we can move beyond just training models on text. 00:37:45.600 |
So many have articulated the need to use modalities other than text if we someday want to get to models that really understand language. 00:37:55.380 |
And ever since we've had these big language models, there has been a rekindling of this debate. 00:38:03.420 |
And recently, there was multiple papers on this. 00:38:06.300 |
And so at ACL last year, there was this paper that argues through multiple thought experiments 00:38:11.900 |
that it's actually impossible to acquire meaning from form alone, where meaning refers to the 00:38:17.700 |
communicative intent of a speaker, and form refers to text or speech signals. 00:38:24.300 |
A more modern version of this was put forward by the second paper, where they say that training 00:38:31.180 |
on only web-scale text data limits the world scope of models and limits the aspects of meaning they can acquire. 00:38:41.360 |
And so here is a diagram that I borrowed from the paper. 00:38:44.640 |
And what they say is that in the era where we were training models on supervised data sets, models were in world scope one. 00:38:54.020 |
And now that we've moved on to exploiting unlabeled data, we're now in world scope two, 00:38:59.720 |
where models just have strictly more signal to get more aspects of meaning in. 00:39:05.400 |
If you mix in additional modalities into this-- so maybe you mix in videos, and maybe you 00:39:10.000 |
mix in images-- then that expands out the world scope of the model further. 00:39:15.960 |
And now maybe it can acquire more aspects of meaning. 00:39:27.160 |
And then if you go beyond that, you can have a model that is embodied, and it's actually 00:39:32.320 |
living in an environment where it can interact with its world, conduct interventions and experiments. 00:39:39.620 |
And then if you go even beyond that, you can have models that live in a social world where they communicate with other agents. 00:39:46.240 |
Because after all, the purpose of language is to communicate. 00:39:49.480 |
And so if you can have a social world where models can communicate with other models, then maybe they can acquire even more aspects of meaning. 00:40:04.280 |
So there are a lot of open questions in this space. 00:40:07.600 |
But given that there are all of these good arguments about how we need to move beyond 00:40:11.600 |
text, what is the best way to do this at scale? 00:40:16.320 |
We know that babies cannot learn language from watching TV alone, for example. 00:40:21.960 |
So there have to be some interventions, and there have to be interactions with the environment. 00:40:28.360 |
But at the same time, the question is, how far can models go by just training on static 00:40:34.360 |
data as long as we have additional modalities, especially when we combine this with scale? 00:40:40.880 |
And if interactions with the environment are really necessary, how do we collect data and 00:40:45.400 |
design systems that interact minimally or in a cost-effective way? 00:40:50.280 |
And then finally, could pre-training on text still be useful if any of these other research directions pan out? 00:41:03.900 |
So if you're interested in learning more about this topic, I highly encourage you to take 00:41:11.160 |
They have multiple lectures on just language learning. 00:41:19.680 |
So in this final segment, I'm going to talk a little bit more about how you can get involved 00:41:24.880 |
with NLP and deep learning research and how you can make more progress. 00:41:32.240 |
So here are some general principles for how to make progress in NLP research. 00:41:38.220 |
So I think the most important thing is to just read broadly, which means not just read 00:41:42.680 |
the latest and greatest papers on arXiv, but also read pre-2010 statistical NLP. 00:41:50.360 |
Learn about the mathematical foundations of machine learning to understand how generalization works. 00:41:58.840 |
Learn more about language, which means taking classes in the linguistics department. 00:42:03.200 |
In particular, I would recommend maybe this 138A. 00:42:09.680 |
And finally, if you want inspiration from how babies learn, then definitely read about language acquisition in children. 00:42:19.200 |
Finally, learn your software tools, which involves scripting tools, version control, 00:42:29.240 |
data wrangling, learning how to visualize quickly with Jupyter Notebooks. 00:42:34.480 |
And deep learning often involves running multiple experiments with different hyperparameters 00:42:41.720 |
And sometimes it can get really hard to keep track of everything. 00:42:44.440 |
So learn how to use experiment management tools like Weights & Biases. 00:42:50.920 |
And finally, I'll talk about some really quick final project tips. 00:42:57.400 |
So first, let's just start by saying that if your approach doesn't seem to be working, here are some things to check. 00:43:03.400 |
Put assert statements everywhere and check if the computations that you're doing are what you expect them to be. 00:43:09.160 |
Use breakpoints extensively, and I'll talk a bit more about this. 00:43:13.200 |
Check if the loss function that you've implemented is correct. 00:43:16.760 |
And one way of debugging that is to see that the initial values are correct. 00:43:21.360 |
So if you're doing a k-way classification problem, then the initial loss should be about log(k). 00:43:25.840 |
Always, always, always start by creating a small training data set which has like 5 to 00:43:31.520 |
10 examples and see if your model can completely overfit that. 00:43:35.000 |
If not, there's a problem with your training loop. 00:43:38.720 |
Check for saturating activations and dead units. 00:43:41.800 |
And often, this can be traced back to problems with the gradients, or maybe 00:43:46.320 |
problems with the initialization. 00:43:51.160 |
Check your gradient values: see if they're too small, which means that maybe you should be using residual connections or a better initialization. 00:43:55.960 |
Or if they're too large, then you should use gradient clipping. 00:44:04.000 |
If your approach doesn't work, come up with hypotheses for why this might be the case. 00:44:14.320 |
And just try to be systematic about everything. 00:44:17.760 |
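Here is a small sketch of two of the checks above (the initial loss of a k-way classifier and overfitting a tiny training set), assuming PyTorch and a stand-in linear model:

```python
# Two quick sanity checks for a k-way classifier, assuming PyTorch.
import math
import torch
import torch.nn as nn

k = 10
model = nn.Linear(100, k)            # stand-in for your real model
loss_fn = nn.CrossEntropyLoss()

# Check 1: with random inputs and an untrained model, the loss should be close to ln(k).
x, y = torch.randn(32, 100), torch.randint(0, k, (32,))
print(loss_fn(model(x), y).item(), "should be close to", math.log(k))

# Check 2: a correct training loop should be able to completely overfit 5-10 examples.
tiny_x, tiny_y = torch.randn(8, 100), torch.randint(0, k, (8,))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for step in range(500):
    loss = loss_fn(model(tiny_x), tiny_y)
    opt.zero_grad(); loss.backward(); opt.step()
print("loss on the tiny set:", loss.item())  # should be near zero; if not, suspect a bug
```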
So I'll just say a little bit more about breakpoints. 00:44:29.280 |
To create a breakpoint, just add the line import pdb; pdb.set_trace() before the line you want to inspect. 00:44:37.360 |
So earlier today, I was trying to play around with the Transformers library. 00:44:47.560 |
And the context is the sentence, "One morning I shot an elephant in my pajamas." 00:44:57.760 |
And to solve this problem, I basically imported a tokenizer and a BERT model. 00:45:04.240 |
And I initialized my tokenizer, initialized my model, tokenized my input, and then hit an error when I ran the model. 00:45:19.400 |
And so the best way to look at what's causing this error is to actually put a breakpoint. 00:45:24.920 |
So right after model.eval, I put a breakpoint. 00:45:27.960 |
Because I know that that's where the problem is. 00:45:35.600 |
And now once I put this breakpoint, I can just run my script again. 00:45:43.320 |
And at this point, I can examine all of my variables. 00:45:46.360 |
So I can look at the token as input, because maybe that's where the problem is. 00:45:50.400 |
And lo and behold, I see that it's actually a list. 00:45:54.460 |
So it's a dictionary of lists, whereas the model typically expects a torch tensor. 00:46:01.560 |
And that means I can quickly go ahead and fix it. 00:46:06.280 |
So this just shows that you should use breakpoints everywhere if your code is not working. 00:46:10.960 |
And it can just help you debug really quickly. 00:46:16.160 |
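For reference, here is a sketch of roughly that debugging session with the Hugging Face Transformers library. The model name and the sentence are just the ones from the example above, and the fix shown (asking the tokenizer for PyTorch tensors with return_tensors="pt") is the standard one.

```python
# Sketch of the debugging session described above, using Hugging Face Transformers.
import pdb
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()
# pdb.set_trace()   # drop into the debugger here to inspect variables interactively

text = "One morning I shot an elephant in my pajamas."

inputs = tokenizer(text)                        # a dict of plain Python lists -> model errors out
inputs = tokenizer(text, return_tensors="pt")   # the fix: ask for PyTorch tensors

with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)          # (1, sequence_length, hidden_size)
```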
So finally, I'd say that if you want to get involved with NLP and deep learning research, 00:46:21.640 |
and if you really like the final project, we have the CLIPS program at Stanford. 00:46:26.160 |
And this is a way for undergrads, master's students, and PhD students who are interested in deep 00:46:31.520 |
learning and NLP research to get involved with the NLP group. 00:46:36.480 |
So we highly encourage you to apply to CLIPS. 00:46:40.560 |
And so I'll conclude today's class by saying that we've made a lot of progress in the last few years. 00:46:49.240 |
And that's mostly due to a clever understanding of neural networks, data, and hardware, all of these coming together. 00:46:55.520 |
We have some really amazing technologies that can do really exciting things. 00:47:03.200 |
In the short term, I expect that we'll see more scaling because it just seems to help. 00:47:12.480 |
So I said that before, and I'll just say it again. 00:47:16.320 |
Scaling requires really non-trivial engineering efforts, and sometimes even clever algorithms. 00:47:21.740 |
And so there's a lot of interesting systems work to be done here. 00:47:25.040 |
But in the long term, we really need to be thinking more about these bigger problems 00:47:31.120 |
How can we make our models learn a new concept really quickly so that it's fast adaptation? 00:47:39.040 |
And then we also need to create benchmarks that we can actually trust. 00:47:42.560 |
If my model has some performance on some sentiment analysis data set and deployed in the real 00:47:48.360 |
world, that should be reflected in the number that I get from the benchmark. 00:47:52.000 |
So we need to make progress in the way we evaluate models. 00:47:56.080 |
And then also figuring out how to move beyond text in a more tractable way. 00:48:13.000 |
So I answered a question earlier that actually I think you could also opine on. 00:48:19.960 |
It was the question of whether you have a large model that's pre-trained on language, 00:48:24.040 |
if it will actually help you in other domains, like you apply it to vision stuff. 00:48:37.000 |
So there was a paper that came out really, really recently, like just two days ago, that looks at exactly this. 00:48:45.240 |
It's like one large transformer model that's pre-trained on text. 00:48:48.960 |
And then they apply it to other modalities; they definitely apply it to images. 00:48:53.600 |
And I think they apply it to math problems and some more modalities, and show that it's actually quite effective. 00:49:02.280 |
So if you pre-train on text and then you move to a different modality, that helps. 00:49:05.600 |
I think part of the reason for that is just that across modalities, there is a lot of shared structure. 00:49:13.680 |
And I think one reason for that is that language is really referring to the world around it. 00:49:19.800 |
And so you might expect that there is some correspondence there that goes beyond just the autoregressive objective. 00:49:28.360 |
So there's also works that show that if you have just text-only representations and image-only 00:49:34.080 |
representations, you can actually learn a simple linear classifier that can learn to map between them. 00:49:39.320 |
And all of these works are just showing that there's actually a lot more in common between these modalities than we might think. 00:49:47.120 |
So yeah, I think it's possible to pre-train on text and then fine-tune on your modality of interest. 00:49:55.160 |
And it should probably be effective, of course, based on what the modality is. 00:49:59.960 |
But for images and videos, it's certainly effective. 00:50:24.640 |
One is, what's the difference between CS224U and this class in terms of the topics covered? 00:50:31.960 |
Do you want to answer that one, Shikhar, or should I have a go at answering it? 00:50:39.600 |
So next quarter, CS224U, Natural Language Understanding, is co-taught by Chris Potts and Bill MacCartney. 00:50:50.840 |
So in essence, it's meant to be different in that natural language understanding focuses 00:51:00.360 |
on what its name says: how to build computer systems that understand the meaning of sentences. 00:51:08.960 |
Now, in truth, the boundary is kind of complex because we do some natural language understanding in this class too. 00:51:19.920 |
And certainly for the people who are doing the default final project, question answering, 00:51:24.560 |
well, that's absolutely a natural language understanding task. 00:51:29.260 |
But the distinction is meant to be that at least a lot of what we do in this class, things 00:51:37.680 |
like the assignment three dependency parser or building the machine translation system 00:51:45.960 |
in assignment four, that they are in some sense natural language processing tasks where 00:51:53.600 |
processing can mean anything but commonly means you're doing useful intelligent stuff 00:52:00.460 |
with human language input, but you're not necessarily deeply understanding it. 00:52:10.300 |
If you do CS224U, you'll certainly see word vectors and transformers again. 00:52:16.740 |
But the emphasis is on doing a lot more with natural language understanding tasks. 00:52:22.880 |
And so that includes things like building semantic parsers. 00:52:28.020 |
So they're the kind of devices that will, you know, respond to questions and commands 00:52:34.320 |
such as an Alexa or Google assistant will do. 00:52:40.460 |
Building relation extraction systems, which pull particular facts out of a piece of 00:52:45.480 |
text, like, oh, this person took on this position at this company. 00:52:52.940 |
Looking at grounded language learning and grounded language understanding where you're 00:52:57.780 |
not only using the language, but the world context to get information, and other tasks like that. 00:53:06.500 |
I mean, I guess you're going to look at the website to get more details of it. 00:53:10.580 |
I mean, you know, relevant to this class, I mean, a lot of people also find it an opportunity 00:53:17.200 |
to just get further in doing a project in the area of natural language processing, because, 00:53:24.880 |
sort of by the nature of the structure of the class, it more assumes 00:53:30.320 |
that people know how to build deep learning natural language systems at the beginning, 00:53:35.400 |
rather than having a large percentage of the class go into, okay, you have to do all 00:53:41.400 |
of these assignments (although there are little assignments earlier on), so there's sort 00:53:46.080 |
of more time to work on a project for the quarter. 00:53:53.080 |
Here's one more question that maybe Shikhar could do. 00:53:57.680 |
Do you know of attempts to crowdsource dynamic benchmarks, e.g. users uploading adversarial examples? 00:54:08.160 |
Yeah, so actually, like, the main idea there is to use crowdsourcing, right? 00:54:19.140 |
So there is this platform that was created by FAIR, it's called Dynabench. 00:54:25.080 |
And the objective is just that to construct this like dynamically evolving benchmark, 00:54:31.800 |
we are just going to offload it to users of this platform. 00:54:36.560 |
And you can, you know, it essentially gives you utilities for like, deploying your model 00:54:41.200 |
and then having, you know, humans kind of try to fool the model. 00:54:46.360 |
Yeah, so this is basically how the dynamic benchmark collection actually happens. 00:54:56.160 |
So we deploy a model on some platform, and then we get humans to like fool the system. 00:55:19.720 |
Can you address the problem of NLP models not being able to remember really long contexts, 00:55:25.240 |
and techniques to do inference on really long inputs? 00:55:29.320 |
Yeah, so I guess like, there have been like a few works recently that kind of try to scale 00:55:38.640 |
up transformers to like really large context lengths. 00:55:45.600 |
And there's also the Transformer-XL that was the first one to try and do that. 00:55:51.720 |
I think what is unclear is whether you can combine that with the scale of these GPT-like models. 00:56:02.240 |
And whether you see qualitatively different things once you do that. And part of 00:56:07.920 |
it is just that all of this is just so recent, right? 00:56:10.720 |
But yeah, I think the open question there is that, you know, can you take these 00:56:15.320 |
really long context transformers that can operate over long contexts, combine that with the 00:56:21.400 |
scale of GPT-3, and then get models that can actually reason over these really long contexts. 00:56:28.080 |
Because I guess the hypothesis of scale is that once you train language models at scale, interesting new capabilities emerge. 00:56:36.320 |
And so to do that for long context, we actually need to like have long context transformers 00:56:55.720 |
So I'm seeing this other question about language acquisition. 00:57:07.680 |
Yeah, so the question is, what do you think we can learn from baby language acquisition? 00:57:17.000 |
Can we build a language model in a more interactive way, like reinforcement learning? 00:57:28.600 |
And you know, I think the short non-helpful answer is that there are kind of no answers 00:57:35.680 |
I know people have certainly tried to do things at various scales, but you know, we just have 00:57:41.840 |
no technology that is the least bit convincing for being able to replicate the language learning of a human child. 00:57:53.800 |
But after that prologue, what I could say is, I mean, yeah, there are definitely ideas 00:58:02.840 |
So you know, there are sort of clear results, which is that little kids don't learn by watching TV. 00:58:11.240 |
So it seems like interaction is completely key. 00:58:21.440 |
They're in a very rich environment where people are sort of both learning stuff from the environment 00:58:28.040 |
in general, and in particular, you know, they're learning a lot from what language acquisition 00:58:35.240 |
researchers refer to as joint attention, which is different from what we mean by attention. 00:58:42.360 |
But it means that the caregiver will be looking at the object that's the focus of interest 00:58:48.680 |
and you know, commonly other things as well, like sort of, you know, picking it up and 00:58:52.440 |
bringing it near the kid and all those kinds of things. 00:58:57.720 |
And you know, babies and young kids get to experiment a lot, right? 00:59:03.440 |
So regardless of whether it's learning what happens when you have some blocks that you 00:59:09.880 |
stack up and play with them, or you're learning language, you sort of experiment by trying 00:59:17.280 |
some things and see what kind of response you get. 00:59:20.920 |
And again, that's essentially building on the interactivity of it that you're getting 00:59:27.240 |
some kind of response to any offerings you make. 00:59:30.800 |
And you know, this is something that's sort of been hotly debated in the language acquisition 00:59:36.680 |
So a traditional Chomskyan position is that, you know, human beings don't get effective 00:59:47.040 |
feedback, you know, supervised labels, when they talk. 00:59:52.880 |
And you know, in some very narrow sense, well, that's true, right? 00:59:56.240 |
It's just not the case that after a baby tries to say something that they get feedback of, 01:00:01.280 |
you know, syntax error in English on word four, or they get given, here's the semantic form you should have produced. 01:00:12.520 |
But in a more indirect way, they clearly get enormous feedback, they can see what kind 01:00:18.040 |
of response they get from their caregiver at every corner. 01:00:24.720 |
And so like in your question, you were suggesting that, well, somehow we should be making use 01:00:32.920 |
of reinforcement learning because we have something like a reward signal there. 01:00:37.960 |
And you know, in a big picture way, I'd say, oh, yeah, I agree. 01:00:42.840 |
In terms of a much more specific way as to, well, how can we possibly get that to work 01:00:48.160 |
to learn something with the richness of human language? 01:00:52.800 |
You know, I think we don't have much idea, but you know, there has started to be some work in that direction. 01:01:00.560 |
So people have been sort of building virtual environments, which, you know, you have your 01:01:07.600 |
avatar in and that can manipulate in the virtual environment and there's linguistic input, 01:01:14.720 |
and it can succeed in getting rewards for sort of doing a command where the command 01:01:19.480 |
can be something like, you know, pick up the orange block or something like that. 01:01:24.920 |
And you know, to a small extent, people have been able to build things that work. 01:01:31.880 |
I mean, as you might be picking up, I mean, I guess so far, at least I've just been kind 01:01:39.760 |
of underwhelmed because it seems like the complexity of what people have achieved is 01:01:45.440 |
sort of, you know, just so primitive compared to the full complexity of language, right? 01:01:51.680 |
You know, the kind of languages that people have been able to get systems to learn are 01:01:57.080 |
ones that can, yeah, do pick up commands where they can learn, you know, blue cube versus red sphere or something like that. 01:02:04.720 |
And that's sort of about how far people have gotten. 01:02:07.720 |
And that's sort of such a teeny small corner of what's involved in learning a human language. 01:02:14.840 |
One thing I'll just add to that is I think there are some principles of how kids learn 01:02:24.480 |
that people have tried to apply to deep learning. 01:02:27.520 |
And one example that comes to mind is curriculum learning, where there's like a lot of literature 01:02:33.200 |
that shows that, you know, babies tend to pay attention to things that are just at the right level of difficulty for them. 01:06:41.520 |
And they don't pay attention to things that are extremely challenging, and also don't 01:02:45.000 |
pay attention to things that they know how to solve. 01:02:47.680 |
And many researchers have really tried to get curriculum learning to work. 01:02:53.480 |
And the verdict on that is that it seems to kind of work when you're in reinforcement learning settings. 01:05:59.280 |
But it's unclear if it's going to work on like supervised learning settings. 01:03:03.280 |
But I still think that it's like under explored. 01:03:05.640 |
And maybe, you know, there should be like more attempts to kind of see if we can like 01:03:11.920 |
add in curriculum learning and if that improves anything. 01:03:18.840 |
Curriculum learning is an important idea, which we haven't really talked about. 01:03:23.480 |
But it seems like it's certainly essential to human learning. 01:03:27.820 |
And there's been some minor successes with it in the machine learning world. 01:03:31.880 |
But it sort of seems like it's an idea you should be able to do a lot more with in the 01:03:36.560 |
future as you move from models that are just doing one narrow task to trying to do a more 01:03:51.880 |
Okay, the next question is, is the reason humans learn languages better just because 01:03:57.240 |
we are pre-trained over millions of years of physics simulation? 01:04:01.680 |
Maybe we should pre train a model the same way. 01:04:05.680 |
So I mean, I presume what you're saying is physics simulation, you're invoking evolution as a kind of pre-training. 01:04:15.000 |
So you know, this is a controversial, debated, big question. 01:04:24.180 |
So you know, again, if I invoke Chomsky again, so Noam Chomsky is sort of the most famous linguist there is. 01:04:36.480 |
And you know, essentially, Noam Chomsky's career starting in the 1950s is built around 01:04:42.520 |
the idea that little children get such dubious linguistic input because you know, they hear 01:04:52.280 |
a random bunch of stuff, they don't get much feedback on what they say, etc. 01:04:57.720 |
But language could not be learned empirically just from the data observed. 01:05:04.300 |
And the only possible assumption to work from is that significant parts of human language are 01:05:15.000 |
innate, in the sort of human genome; babies are born with that. 01:05:19.400 |
And that explains the miracle by which very little humans learn amazingly fast how human language works. 01:05:28.560 |
Now, to give some credit to that idea, for those of you who have not been around little 01:05:36.560 |
children, I mean, I think one does just have to acknowledge that human language acquisition is pretty astonishing. 01:05:46.680 |
I mean, it does just seem to be miraculous, right? 01:05:49.880 |
As you go through this sort of slow phase for a couple of years where, you know, the 01:05:57.000 |
kid sort of goos and gahs some syllables, and then there's a fairly long period where 01:06:02.280 |
they've picked up a few words, and they can say "juice, juice" when they want to drink some juice. 01:06:10.360 |
And then it just sort of seems like there's this phase change, where the kids suddenly 01:06:15.720 |
realize, wait, this is a productive generative sentence system, I can say whole sentences. 01:06:21.680 |
And then in an incredibly short period, they sort of seem to transition from saying one 01:06:27.380 |
and two-word utterances to suddenly they can say, you know, "Daddy come home in garage." 01:06:39.160 |
And you go, wow, how did they suddenly discover language? 01:06:47.920 |
But personally, for me, at least, you know, I've just never believed the strong versions 01:06:56.360 |
of the hypothesis that human beings have much in the way of language specific knowledge 01:07:05.140 |
or structure in their brains that comes from genetic inheritance. 01:07:09.920 |
Like clearly, humans do have these very clever brains. 01:07:14.600 |
And if we're talking about things like being able to think, or being able to interpret 01:07:21.160 |
the visual world, those are things that have developed over tens of millions of years. 01:07:29.640 |
And evolution can be a large part of the explanation. 01:07:35.000 |
And humans are clearly born with lots of vision-specific hardware in their brains, as are other animals. 01:07:44.720 |
But when you come to language, no one knows when language in something like its 01:07:53.160 |
modern form first became available, because there aren't any fossils of people 01:07:59.840 |
saying the word "spear" or something like that. 01:08:04.400 |
But to the extent that there are estimates, based on what you can see 01:08:09.920 |
of the spread of proto-humans and their apparent social structures from 01:08:19.280 |
what you can find in fossils, most people guess that language is at most a couple of hundred thousand years old. 01:08:27.880 |
And you know, that's just too short a time for evolution to 01:08:35.040 |
build any significant structure inside human brains that's specific to language. 01:08:40.120 |
So I kind of think that the working assumption has to be that sort of there's just about 01:08:47.880 |
nothing specific to language in human brains. 01:08:52.160 |
And the most plausible hypothesis, not that I know very much about neuroscience 01:08:57.960 |
when it comes down to it, is that humans were able to repurpose hardware that was 01:09:04.680 |
originally built for other purposes, like visual scene interpretation and memory, and 01:09:11.760 |
that that gave a basis of having all this clever hardware that you could then use for language. 01:09:18.280 |
You know, it's kind of like GPUs were invented for playing computer games, and we were able 01:09:23.360 |
to repurpose that hardware to do deep learning. 01:09:27.280 |
Okay, we've got a lot of questions that have come out at the end. 01:09:43.120 |
Yeah, if you could name, I guess this is for either of you, one main bottleneck: 01:09:48.560 |
if we could provide feedback efficiently to our systems, like babies are given feedback, 01:09:56.240 |
what's the bottleneck that remains in trying to have more human-like language acquisition? 01:10:24.680 |
Yeah, I was just going to say that I think it's a bit of everything, right? 01:10:30.840 |
Like, I think in terms of models, one thing I would say is that we know that there are more 01:10:37.040 |
feedback connections than feedforward connections in the brain. 01:10:42.160 |
And we haven't really figured out a way of using that. Of course, we have RNNs, 01:10:49.360 |
which you can think of as sort of implementing a feedback loop, 01:10:52.880 |
but we still haven't really figured out how to 01:10:58.080 |
use that knowledge, that the brain has a lot of feedback connections, and then 01:11:01.800 |
apply that to practical systems. So on the modeling end, maybe that's one problem. 01:11:10.840 |
Yeah, I think curriculum learning is maybe one of them, but I think the one 01:11:16.360 |
that's probably going to have the most bang for the buck is really figuring out how we can move beyond learning from text alone. 01:11:21.680 |
And I think there's just so much more information that's available that we're just not using. 01:11:28.840 |
And so I think that's where most of the progress might come from, like figuring out what's out there beyond text and how to use it. 01:11:49.320 |
What are some important NLP topics that we have not covered in this class? 01:12:01.320 |
You know, well, sort of one answer is a lot of the topics that are covered in CS224U because, 01:12:07.320 |
you know, we do make a bit of an effort to keep the two classes disjoint, though not fully. 01:12:13.320 |
So there's sort of lots of topics in language understanding that we haven't covered. 01:12:21.320 |
So if you want to make a voice assistant like Alexa, Siri, or Google Assistant, well, you 01:12:31.320 |
need to be able to interface with systems, APIs that can do things like delete a calendar appointment. 01:12:40.320 |
And so you need to be able to convert from language into an explicit semantic form that those systems can act on. 01:12:51.320 |
So there's lots of language understanding stuff. 01:12:54.320 |
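As a toy illustration of what converting language into an explicit semantic form can look like, here is a hypothetical frame for a calendar command plus a dispatcher that hands it to a backend. The intent name, slot names, and backend interface are invented for this example and are not any real assistant API.

# Hypothetical semantic frame for the utterance "delete my 9am meeting on Friday".
semantic_frame = {
    "intent": "calendar.delete_event",
    "slots": {"event_time": "09:00", "event_day": "Friday"},
}

def execute(frame, backend):
    # The language understanding model only has to produce the structured frame;
    # a backend (here, a dict mapping intents to handler functions) carries it out.
    handler = backend[frame["intent"]]
    return handler(**frame["slots"])

The point is the division of labor: the hard NLP problem is mapping free-form language onto a frame like this, while executing the frame is ordinary software engineering.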
There's also lots of language generation things. 01:12:59.320 |
So, you know, effectively for language generation, all we have done is neural language models. 01:13:14.320 |
It's awesome the kind of generation you can do with things like GPT-2 or 3. 01:13:22.320 |
But, you know, what's missing there is that this is really only giving you the ability to produce 01:13:32.320 |
fluent text; it rabbits on, producing fluent text. Whereas if you actually wanted to 01:13:39.320 |
have a good natural language generation system, you'd also have to have higher-level planning 01:13:46.320 |
of what you're going to talk about and how you are going to express it. 01:13:53.320 |
So then in most situations in natural language, you think, OK, well, I want to explain to 01:14:00.320 |
people something about why it's important to do math classes at college. 01:14:08.320 |
Maybe I should talk about some of the different applications where math turns up and how it's useful. 01:14:15.320 |
And so you kind of plan out, here's how I can present these ideas. 01:14:20.320 |
And that kind of natural language generation, we haven't done any of. 01:14:28.320 |
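To illustrate the plan-then-realize idea in rough code, here is a hypothetical two-stage sketch: a content plan listing the points to make, followed by a surface-realization step that asks a language model to render each point. The lm_generate function and the prompt wording are assumptions made up for this example.

# Hypothetical content plan for "explain why math classes matter in college".
content_plan = [
    "state why it is important to take math classes in college",
    "give a few different applications where math turns up",
    "connect those applications back to coursework",
]

def realize(plan, lm_generate):
    # Surface realization: an assumed language model interface (lm_generate takes a
    # prompt string and returns text) turns each planned point into fluent prose.
    paragraphs = [lm_generate("Write a short paragraph to " + point + ".")
                  for point in plan]
    return "\n\n".join(paragraphs)

The planning step here is just a hand-written list; building systems that produce such plans automatically is the part that, as noted above, the course has not covered.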
So that's sort of saying more understanding and more generation, which is most of NLP, you know. 01:14:38.320 |
I mean, obviously, there are then sort of particular tasks that we can talk about that 01:14:43.320 |
we either have or have not explicitly addressed. 01:14:48.320 |
Has there been any work on putting language models into an environment in which they can interact and get some kind of reward? 01:15:03.320 |
And do you think this would help with unsupervised learning? 01:15:12.320 |
So I guess there's been a lot of work on emergent communication and also self-play, where you 01:15:19.320 |
have these different models, which are initialized as language models, that attempt to communicate with each other to solve a task. 01:15:30.320 |
And then you have a reward at the end, whether they were able to finish the task or not. 01:15:35.320 |
And then based on that reward, you attempt to learn the communication strategy. 01:15:40.320 |
And this started out as emergent communication and self-play. 01:15:45.320 |
I think it was last year or the year before that, where they showed that if you initialize 01:15:50.320 |
these models with language model pre-training, you basically prevent this problem of language 01:15:58.320 |
drift, where the language, or the communication protocol, that your models end up learning 01:16:08.320 |
drifts away from natural language. And so, yeah, I mean, in that sense, there has been some work. 01:16:13.320 |
I think there are some groups that try to study this, but not much beyond that. 01:16:23.320 |
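Here is a minimal sketch of the kind of self-play setup described above, assuming a REINFORCE-style update, a PyTorch-style optimizer and differentiable log-probability, and hypothetical speaker, listener, and task interfaces; none of these names come from a specific paper or library.

def self_play_episode(speaker, listener, task, optimizer):
    # Both agents are assumed to be initialized from pretrained language models,
    # which is what reportedly helps prevent language drift.
    message, log_prob = speaker.generate(task.observation())  # log_prob: differentiable scalar
    action = listener.act(message)
    reward = task.reward(action)  # e.g. 1.0 if the task was completed, else 0.0

    # REINFORCE-style update: raise the probability of messages that led to success.
    loss = -reward * log_prob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward

In practice, approaches along these lines often also mix in a language-modeling loss or a penalty toward the pretrained model to keep the learned protocol close to natural language; the exact recipe varies, and this sketch does not follow any particular paper.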
OK, I mean, the last two questions are about how we learn. 01:16:31.320 |
There's one question about whether we learn some correlations from social cues or from a reward. 01:16:38.320 |
I don't know if either of you have opinions about this, but if you do. 01:16:45.320 |
Yeah, I mean, I don't have anything very deep to say about this question. 01:16:50.320 |
It's on the importance of social cues as opposed to pure reward-based systems. 01:16:56.320 |
Well, I mean, in some sense, you can also regard a social cue as a reward, in that people 01:17:04.320 |
like to have other people put a smile on their face when they say something. 01:17:10.320 |
But I do think generally, when people are saying, what have we not covered? 01:17:18.320 |
Another thing that we've barely covered is the social side of language. 01:17:23.320 |
So, you know, a huge interesting thing about language is it has this very wide dynamic range. 01:17:33.320 |
So on the one hand, you can talk about very precise things in language. 01:17:38.320 |
So you can sort of talk about math formulas and steps in a proof and things like that, 01:17:43.320 |
so that there's a lot of precision in language. 01:17:46.320 |
You know, on the other hand, you can just sort of mumble whatever 01:17:51.320 |
words at all, and you're not really communicating anything in the way of propositional content. 01:17:58.320 |
What you're really trying to communicate is, you know, oh, I'm thinking about you. 01:18:04.320 |
And, oh, I'm concerned with how you're feeling or whatever it is in the circumstances, right? 01:18:10.320 |
So a huge part of language use is forms of social communication between people. 01:18:20.320 |
And, you know, that's another big part of actually building successful natural language systems. 01:18:30.320 |
So if you, you know, if you think negatively about something like the virtual assistants 01:18:35.320 |
I've been falling back on a lot, they have virtually no ability as social agents. 01:18:44.320 |
So we're now training a generation of little kids that what you should do is sort of bark 01:18:52.320 |
out commands as if you were, you know, serving in the German army in World War II or something, 01:18:59.320 |
and that there's none of the kind of social part of how to use language to 01:19:08.320 |
communicate satisfactorily with human beings and to maintain a social system. 01:19:15.320 |
And that's a huge part of human language use that kids have to learn. 01:19:23.320 |
You know, a lot of being successful in the world is, when you want 01:19:29.320 |
someone to do something for you, knowing that there are good ways to ask them for it. 01:19:35.320 |
Some of it is the choice of how to present the arguments, but some of it is 01:19:41.320 |
by building social rapport and asking nicely and reasonably and making it seem like you're 01:19:48.320 |
a sweet person that other people should do something for. 01:19:51.320 |
And, you know, human beings are very good at that. 01:19:54.320 |
And being good at that is a really important skill for being able to navigate the world well.