Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 18 - Future of NLP + Deep Learning
Chapters
0:00
2:35 General Representation Learning Recipe
4:46 Facts about GPT-3
21:40 Systematicity and Language Grounding
22:18 Principle of Compositionality
22:38 Are Human Languages Really Compositional
24:04 Are Neural Representations Compositional
28:01 The Information Bottleneck Theory
30:19 Producing Compositionally Challenging Splits
30:29 Normalized Frequency Distribution of the Atoms
34:08 Dynamic Benchmarks
37:33 Language Grounding
44:21 Breakpoints
58:37 Attention
00:00:00.000 |
Good afternoon, folks. Welcome to lecture 18. Today, we'll be talking about some of 00:00:10.240 |
the latest and greatest developments in neural NLP, where we've come and where we're headed. 00:00:16.720 |
Chris, just to be sure, are my presenter notes visible from this part? 00:00:32.000 |
So just as a reminder, note that your guest lecture reactions are due tomorrow at 11:59 pm. 00:00:37.960 |
Great job with the project milestone reports. 00:00:44.480 |
I think we had some last minute issues, but if that's not resolved, please contact us. 00:00:51.120 |
Finally, the project reports are due very soon, on March 16th, which is next week. 00:00:57.160 |
There's one question on Ed about the leaderboard, and the last day to submit on the leaderboard 00:01:07.840 |
Okay, so for today, we'll start by talking about extremely large language models and 00:01:13.360 |
GPT-3 that have recently gained a lot of popularity. 00:01:18.520 |
We'll then take a closer look at compositionality and generalization of these neural models. 00:01:26.400 |
While transformer models like BERT and GPT have really high performance on all benchmarks, 00:01:30.520 |
they still fail in really surprising ways when deployed. 00:01:33.640 |
How can we strengthen the way we evaluate these models so that benchmark numbers more closely reflect real-world performance? 00:01:44.280 |
And then we end by talking about how we can move beyond this really limited paradigm of 00:01:48.640 |
teaching models language only through text and look at language grounding. 00:01:53.760 |
Finally, I'll give some practical tips on how to move forward in your neural NLP research, 00:01:59.600 |
and this will include some practical tips for the final project as well. 00:02:07.040 |
So this meme really kind of captures what's been going on in the field. 00:02:14.680 |
And it's just that our ability to harness unlabeled data has vastly increased over the last few years. 00:02:20.240 |
And this has been made possible due to advances in not just hardware, but also systems and 00:02:26.120 |
our understanding of self-supervised training so we can use lots and lots of unlabeled data. 00:02:34.640 |
So based on this, here is a general representation learning recipe that just works for basically any modality. 00:02:47.460 |
So step 1: convert your data, and this is really modality agnostic. 00:02:56.560 |
You take your data, whether it's images, text, or videos, and you convert it into a sequence of integers. 00:03:02.280 |
And in step 2, you define a loss function to maximize the data likelihood, or create a denoising autoencoder-style objective. 00:03:08.960 |
Finally, in step 3, train on lots and lots of data. 00:03:14.760 |
Different properties emerge only when we scale up model size. 00:03:17.280 |
And this is really the surprising fact about scale. 00:03:20.660 |
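To make the three steps concrete, here is a minimal sketch of the recipe in PyTorch. The toy character-level tokenizer and the tiny GRU are purely illustrative stand-ins (a real system would use a learned tokenizer and a large transformer), so treat this as a sketch of the shape of the recipe, not of any particular model.

```python
# A minimal sketch of the three-step recipe, assuming PyTorch.
import torch
import torch.nn as nn

# Step 1: convert your data into a sequence of integers (toy character-level tokenizer).
text = "language models predict the next token"
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
ids = torch.tensor([stoi[ch] for ch in text])

# A tiny stand-in model; in practice this would be a large transformer.
class TinyLM(nn.Module):
    def __init__(self, vocab_size, d=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d)
        self.rnn = nn.GRU(d, d, batch_first=True)
        self.out = nn.Linear(d, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.out(h)  # logits over the next token at every position

model = TinyLM(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()  # Step 2: a loss that maximizes data likelihood

# Step 3: train on lots and lots of data (here, a tiny illustrative loop).
for step in range(100):
    x, y = ids[None, :-1], ids[None, 1:]  # predict token t+1 from tokens up to t
    logits = model(x)
    loss = loss_fn(logits.reshape(-1, len(vocab)), y.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
```

On real data the same skeleton holds; only the tokenizer, the architecture, and the amount of data change.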
So to give some examples of this recipe in action, here's GPT-3, which can learn to do 00:03:26.620 |
a really non-trivial classification problem with just two demonstrations. 00:03:34.660 |
Another example, as we saw in lecture 14, is T5, which does really effective closed-book question answering. 00:03:43.380 |
Finally, just so I cover another modality, here's a recent text-to-image generation model 00:03:51.460 |
with really impressive zero-shot generalization. 00:04:02.020 |
This table presents some numbers to put things in perspective. 00:04:07.660 |
So we have a collection of models starting with medium-sized LSTMs, which was a staple 00:04:12.380 |
in pre-2016 NLP, all the way to humans who have 100 trillion synapses. 00:04:18.580 |
And somewhere in the middle, we have GPT-2 with over a billion parameters, and GPT-3 with 175 billion parameters. 00:04:26.580 |
And this exceeds the number of synaptic connections in a honeybee brain. 00:04:31.860 |
So obviously, anyone with a little knowledge of neuroscience knows that this is not an 00:04:36.180 |
apples-to-apples comparison. 00:04:40.220 |
But the point here is that the scale of these models is really starting to reach astronomical levels. 00:04:48.500 |
So what do we know about GPT-3? For one, it's a large transformer with 96 layers. 00:04:53.980 |
It has more or less the same architecture as GPT-2, with the exception that to scale 00:04:59.420 |
up attention computation, it uses these locally-banded sparse attention patterns. 00:05:04.060 |
And I really encourage you to look at the paper to understand the details. 00:05:07.340 |
The reason we mention this here is because it kind of highlights that scaling up is not simply 00:05:11.100 |
a matter of changing hyperparameters, as many might believe. 00:05:14.220 |
And it involves really non-trivial engineering and algorithms to make computations efficient. 00:05:19.420 |
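To give a rough sense of what a locally banded attention pattern is, here is a simplified sketch of such a mask. The actual GPT-3 layers alternate dense and sparse patterns following the Sparse Transformer work, so this is an illustration of the idea, not the exact pattern.

```python
# Sketch of a causal, locally banded attention mask: each position may attend only to
# itself and the previous (window - 1) positions, instead of to every earlier position.
import torch

def local_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    causal = j <= i                         # never attend to the future
    local = (i - j) < window                # only a local band of past keys
    return causal & local                   # True where attention is allowed

print(local_causal_mask(seq_len=8, window=3).int())
# Per-query cost drops from O(seq_len) to O(window), which is what makes
# attention cheaper to scale to long sequences.
```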
Finally, all of this is trained on 500 billion tokens taken from the Common Crawl, the Toronto Books Corpus, Wikipedia, and other sources. 00:05:32.140 |
So let's look at some of the results from the paper first. 00:05:35.140 |
So obviously, it does better on language modeling and text completion problems. 00:05:39.420 |
As you can see from this table, it does better than GPT-2 at language modeling on the Penn 00:05:44.220 |
Treebank, as well as better on the story completion data set called LAMBADA. 00:05:50.340 |
To give a flavor of what's to come, let's take a closer look at this LAMBADA story completion task. 00:05:56.200 |
So the task here is that we're given a short story, and we are supposed to fill in the final word. 00:06:00.660 |
Satisfying the constraints of the problem can be hard for a language model, which could otherwise generate a continuation longer than a single word. 00:06:08.620 |
But with GPT-3, the really new thing is that we can just give a few examples as prompts 00:06:13.260 |
and sort of communicate a task specification to the model. 00:06:15.660 |
And now, GPT-3 knows that the completion must be a single word. 00:06:21.300 |
And we give some more examples of this in-context learning in a couple more slides. 00:06:27.580 |
So apart from language modeling, it's really good at these knowledge-intensive tasks, like 00:06:32.900 |
closed-book QA, as well as reading comprehension. 00:06:36.340 |
And here, we observe that scaling up parameters results in a massive improvement in performance. 00:06:44.780 |
GPD 3 demonstrates some level of fast adaptation to completely new tasks. 00:06:50.380 |
This happens via what's called in-context learning. 00:06:53.020 |
As shown in the figure, the model training can be characterized as having an outer loop 00:06:58.300 |
that learns a set of parameters that makes the learning of the inner loop as efficient 00:07:04.780 |
And with this sort of framework in mind, we can really see how a good language model can 00:07:14.020 |
So in this segment, we will have some fun with GPT-3 and look at some demonstrations that people have built. 00:07:22.540 |
So to start off, here is an example where someone's trying to create an application 00:07:26.860 |
that converts a language description to bash one-liners. 00:07:32.380 |
The first three examples are prompts, followed by generated examples from GPT-3. 00:07:41.860 |
It probably just involves looking at your hash table. 00:07:44.100 |
Some of the more challenging ones involve copying over spans from the text; the 00:07:51.060 |
scp example is kind of interesting, as well as the harder-to-parse grep one. 00:07:56.300 |
The scp example comes up a lot during office hours, so GPT-3 knows how to do that. 00:08:03.980 |
Here's a somewhat more challenging one, where the model is given a description of a database 00:08:07.580 |
in natural language, and it starts to emulate that behavior. 00:08:12.580 |
So the text in bold is sort of the prompt given to the model. 00:08:16.380 |
The prompt includes somewhat of a function specification of what a database is. 00:08:22.980 |
So it says that the database begins knowing nothing. 00:08:25.660 |
The database knows everything that's added to it. 00:08:30.340 |
And when you ask a question to the database, if the answer is there in the database, the database should return it. 00:08:34.980 |
Otherwise, it should say it does not know the answer. 00:08:42.580 |
And the prompt also includes some example usages. 00:08:45.540 |
So when you ask 2+2, the database does not know. 00:08:48.180 |
When you ask the capital of France, the database does not know. 00:08:51.300 |
And then you add in a fact that Tom is 20 years old to the database. 00:08:55.340 |
And now you can start asking it questions like, where does Tom live? 00:08:59.300 |
And as expected, it says that the database does not know. 00:09:09.860 |
The database says basically that it does not know, because that's not been added. 00:09:18.140 |
Now in this example, the model is asked to blend concepts together. 00:09:22.380 |
And so there's a definition of what does it mean to blend concepts. 00:09:25.820 |
So if you take airplane and car, you can blend that to give flying car. 00:09:31.140 |
That's essentially like there's a Wikipedia definition of what concept blending is, along with a few examples. 00:09:38.340 |
Now let's look at some prompts followed by what GPT-3 answers. 00:09:43.780 |
So the first one is straightforward, two-dimensional space blended with 3D space gives 2.5-dimensional space. 00:09:51.140 |
The one that is somewhat interesting is old and new gives recycled. 00:10:02.020 |
The one that's really non-trivial is geology plus neurology. 00:10:06.180 |
It's just sediment neurology, and I had no idea what this was. 00:10:11.780 |
So clearly, it's able to do these very flexible things just from a prompt. 00:10:18.620 |
So here's another class of examples that GPT-3 gets somewhat right. 00:10:25.620 |
And these are these copycat analogy problems, which have been really well studied in cognitive science. 00:10:32.180 |
And the way it works is that I'm going to give you some examples and then ask you to 00:10:37.860 |
induce a function from these examples and apply it to new queries. 00:10:41.860 |
So if ABC changes to ABD, what does PQR change to? 00:10:45.300 |
Well, PQR must change to PQS, because the function we've learned is that the last letter is replaced by its successor in the alphabet. 00:10:52.220 |
And this function, humans can now apply to examples of varying types. 00:10:57.220 |
So like P repeated twice, Q repeated twice, R repeated twice must change to P repeated 00:11:02.180 |
twice, Q repeated twice, and S repeated twice. 00:11:06.060 |
And it seems like GPT-3 is able to get them right, more or less. 00:11:10.980 |
But the problem is that if you ask it to generalize to examples that have a larger number of 00:11:19.380 |
repetitions than were seen in the prompt, it's not able to do that. 00:11:23.180 |
So in this situation, you ask it to make an analogy where the letters are repeated four 00:11:31.340 |
times, and it's never seen that before and doesn't know what to do. 00:11:36.740 |
So there's a point to be made here that maybe these prompts are just not enough to convey 00:11:44.020 |
the function the model should be learning, and maybe with even more examples it could learn it. 00:11:48.380 |
But it probably does not have the same kinds of generalization that humans have. 00:11:55.620 |
And that brings us to the limitations of these models and some open questions. 00:12:01.180 |
So just looking at the paper and passing through the results, it seems like the model is bad 00:12:06.500 |
at logical and mathematical reasoning, anything that involves doing multiple steps of reasoning. 00:12:14.020 |
And that explains why it's bad at arithmetic, why it's bad at word problems, why it's 00:12:18.300 |
not great at analogy making, and even at traditional textual entailment data sets that seem to require this kind of reasoning. 00:12:27.780 |
The second, more subtle point is that it's unclear how we can make permanent updates to the model. 00:12:33.540 |
Maybe if I want to teach the model a new concept, it's possible to do that while I'm interacting with it. 00:12:39.580 |
But once the interaction is over, it restarts and does not retain that knowledge. 00:12:44.500 |
And it's not that this is something that the model cannot do in principle, but just something we don't currently have a mechanism for. 00:12:52.020 |
Third, it doesn't seem to exhibit human-like generalization, which is often called systematicity. 00:13:06.080 |
And finally, maybe the aspects of meaning that it acquires from text alone are somewhat limited. 00:13:09.820 |
And maybe we should explore how we can bring in other modalities. 00:13:13.780 |
So we'll talk a lot more about these last two limitations in the rest of the lecture. 00:13:20.540 |
But maybe I can take some questions now if there are any. 00:13:34.140 |
I don't think there's a big outstanding question. 00:13:37.620 |
But I mean, I think some people aren't really clear on the few-shot setting and prompting versus fine-tuning. 00:13:46.900 |
And I think it might actually be good to explain that a bit more. 00:13:52.180 |
So maybe let's-- let me pick a simple example. 00:14:04.660 |
So prompting just means that-- so GPT-3, if you go back to first principles, GPT-3 is a language model. 00:14:13.100 |
And what that means is given a context, it'll tell you what's the probability of the next word. 00:14:20.700 |
So if I give it a context, w1 through wk, GPT-3 will tell me what's the probability of the next word, wk+1, given that context. 00:14:35.820 |
A prompt is essentially a context that gets prepended before GPT-3 can start generating. 00:14:43.060 |
And what's happening with in-context learning is that the context that 00:14:48.820 |
you prepend to GPT-3 is basically a set of x-y examples. 00:14:57.220 |
And the reason why it's equivalent to few-shot learning is because you prepend only a handful of examples. 00:15:05.660 |
So in this case, if I just prepend this one example that's highlighted in purple, then 00:15:10.900 |
that's essentially one-shot learning because I just give it a single example as context. 00:15:16.820 |
And now, given this query, which is also appended to the context, it has to make a prediction. 00:15:25.940 |
So the input-output format is the same as how a few-shot learner would receive. 00:15:31.820 |
But since it's a language model, the training data set is essentially presented as a context. 00:15:40.220 |
So someone is still asking, can you be more specific about the in-context learning setups? 00:15:54.620 |
Maybe I can go to-- yeah, so maybe I can go to this slide. 00:16:03.900 |
So the task is just that it's a language model. 00:16:08.620 |
So it gets a context, which is just a sequence of tokens. 00:16:13.580 |
And the task is just that you have a sequence of tokens, 00:16:18.760 |
and then the model has to generate a continuation given that sequence of tokens. 00:16:23.320 |
And the way you can convert that into an actual machine learning classification problem is 00:16:27.940 |
that-- so for this example, maybe you give it 5 plus 8 equals 13, 7 plus 2 equals 9, and then 1 plus 0 equals, and let it fill in the blank. 00:16:41.700 |
So that's how you convert it into a classification problem. 00:16:44.860 |
The context here would be these two examples of arithmetic, like 5 plus 8 equals 13 and 7 plus 2 equals 9. 00:16:55.420 |
And then the model, since it's just a language model, has to fill in 1 plus 0 equals question mark. 00:17:05.020 |
And if it fills in a 1, it has done the right job. 00:17:09.640 |
So that's how you can take a language model and do few-shot learning with it. 00:17:16.780 |
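Here is a small sketch of what that looks like mechanically. The generate function is a placeholder for whatever interface you use to sample a continuation (an API call, model.generate in Hugging Face, and so on), so treat it as an assumption rather than a specific API.

```python
# Sketch of few-shot / in-context learning: the "training set" is just concatenated
# into the prompt, and the model's continuation is read off as the prediction.
# No gradient updates happen anywhere.
train_examples = [("5 + 8 =", "13"), ("7 + 2 =", "9")]  # the few "shots"
query = "1 + 0 ="

prompt = "".join(f"{x} {y}\n" for x, y in train_examples) + query
print(prompt)
# 5 + 8 = 13
# 7 + 2 = 9
# 1 + 0 =

def generate(prompt, max_tokens=1):
    # Placeholder: plug in your favorite language model here.
    raise NotImplementedError

# prediction = generate(prompt)  # a good model should continue with "1"
```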
How is in-context learning different from transfer learning? 00:17:21.140 |
So I guess in-context learning-- I mean, you can think of in-context learning as being a special case of transfer learning. 00:17:33.220 |
But transfer learning does not specify the mechanism through which the transfer is going 00:17:39.140 |
With in-context learning, the mechanism is that the training examples are sort of appended 00:17:45.620 |
to the model, which is a language model, just in order. 00:17:55.900 |
And these are just appended directly to the model. 00:17:58.600 |
And now it makes prediction on some queries that are drawn from this data set. 00:18:05.220 |
So yes, it is a subcategory of transfer learning. 00:18:09.100 |
But transfer learning does not specify exactly how this transfer learning is achieved. 00:18:14.180 |
But in-context learning is very specific and says that for language models, you can essentially 00:18:19.680 |
concatenate the training data set and then present that to the language model. 00:18:25.780 |
People still aren't sufficiently clear on what is or isn't happening with learning and fine-tuning here. 00:18:33.980 |
So another question is, so in-context learning still needs fine-tuning, question mark? 00:18:39.420 |
We need to train GPT-3 to do in-context learning? 00:19:02.180 |
No, there's no fine-tuning involved. The model is trained as a language model. 00:19:05.780 |
And once the model is trained, you can now use it to do transfer learning. 00:19:11.460 |
And in in-context learning, the model parameters are fixed. 00:19:17.740 |
All you do is that you give this small training set to the model, which is just appended to its context. 00:19:26.060 |
And now the model can start generating from that point on. 00:19:29.740 |
So in this example, 5 plus 8 equals 13 and 7 plus 2 equals 9 are the two x-y examples. 00:19:38.540 |
In vanilla transfer learning, what you would do is that you would take some gradient steps, 00:19:43.100 |
update your model parameters, and then make a prediction on 1 plus 0 equals what. 00:19:47.620 |
But in in-context learning, all you're doing is you just concatenate 5 plus 8 equals 13 00:19:54.420 |
and 7 plus 2 equals 9 to the model's context window, and then make it predict what 1 plus 0 equals. 00:20:04.660 |
Maybe we should end for now with one other bigger picture question, which is, do you 00:20:13.740 |
know of any research combining these models with reinforcement learning for the more complicated reasoning tasks? 00:20:22.700 |
There is some recent work on kind of trying to align language models with human preferences, 00:20:30.340 |
where yes, there is some amount of fine-tuning with reinforcement learning based on these preferences. 00:20:38.820 |
So maybe you want to do a summarization problem in GPT-3. 00:20:45.300 |
And for each summary, maybe you have a reward that is essentially a human preference. 00:20:50.260 |
Maybe I want to include some facts, and I don't want to include some other non-important details. 00:20:56.060 |
So I can construct a reward out of that, and I can fine-tune the parameters of my language 00:21:00.820 |
model basically using reinforcement learning based on this reward, which essentially encodes human preferences. 00:21:08.260 |
So there's some very recent work that tries to do this. 00:21:11.180 |
But I'm not sure-- yeah, I'm not aware of any work that tries to use reinforcement learning for these more complicated reasoning tasks. 00:21:17.940 |
But I think it's an interesting future direction to explore. 00:21:34.100 |
OK, so we'll talk a bit more about these last two points, so systematicity and language grounding. 00:21:48.220 |
So just to start off, how do you define systematicity? 00:21:51.820 |
So really, the definition is that there is a definite and predictable pattern among the 00:21:56.540 |
sentences that native speakers of a language understand. 00:22:00.540 |
And so there's a systematic pattern among the sentences that we understand. 00:22:04.500 |
What that means is, let's say there's a sentence like, John loves Mary. 00:22:09.020 |
And if a native speaker understands that sentence, then they should also be able to understand the sentence Mary loves John. 00:22:17.020 |
And closely related to this idea of systematicity is the principle of compositionality. 00:22:21.780 |
And for now, I'm going to ignore the definition by Montague and just look at the rough definition. 00:22:26.660 |
And then we can come back to this other more concrete definition. 00:22:30.620 |
The rough definition is essentially that the meaning of an expression is a function of the meanings of its parts and how they are combined. 00:22:37.860 |
So that brings us to the question, are human languages really compositional? 00:22:42.340 |
And here are some examples that make us think that maybe, yes. 00:22:47.900 |
So if you look at what is the meaning of the noun phrase brown cow, it is composed of 00:22:53.180 |
the meaning of the adjective brown and the noun cow. 00:22:58.660 |
So take all things that are brown and all things that are cows, take the intersection, and you get the brown cows. 00:23:03.300 |
Similarly for red rabbits: all things that are red and all things that are rabbits, combine them and you get the red rabbits. 00:23:07.820 |
And then kick the ball, this verb phrase can be understood as there being some agent that's kicking the ball. 00:23:16.180 |
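In symbols, using the double-bracket "meaning of" notation that is standard in semantics (not something introduced in this lecture), the intersective examples and the general principle look like this:

```latex
% Intersective composition for the simple cases (assuming the amsmath package for \text):
\[
  [\![\text{brown cow}]\!] \,=\, [\![\text{brown}]\!] \cap [\![\text{cow}]\!],
  \qquad
  [\![\text{red rabbit}]\!] \,=\, [\![\text{red}]\!] \cap [\![\text{rabbit}]\!]
\]
% More generally, compositionality says the meaning of a phrase is some function
% of the meanings of its parts and the way they are combined:
\[
  [\![\,x \; y\,]\!] \,=\, f\big([\![x]\!],\, [\![y]\!]\big)
\]
```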
But it is not always the case that you can get the meaning of the whole by combining the meanings of the parts. 00:23:23.700 |
So here, we have some counter examples that people often use. 00:23:26.700 |
So red herring does not mean all things that are red and all things that are herring. 00:23:31.260 |
And kick the bucket definitely does not mean that there's an agent that's kicking the bucket. 00:23:35.700 |
So while these examples are supposed to be provocative, we think that language is mostly compositional. 00:23:42.900 |
There's lots of exceptions, but for a vast majority of sentences that we've never heard 00:23:47.540 |
before, we're able to understand what they mean by piecing together the meanings of the words that the sentence is made of. 00:23:54.100 |
And so what that means is that maybe compositionality of representations is a helpful prior that we would like our models to have. 00:24:02.740 |
And that brings us to the questions that we ask in the segment, are neural representations 00:24:08.100 |
And the second question is, if so, do they generalize systematically? 00:24:12.420 |
So how do you even measure if representations that a neural network learns exhibit compositionality? 00:24:23.500 |
So let's go back to this definition from Montague, which says that compositionality is about 00:24:29.780 |
the existence of a homomorphism from syntax to semantics. 00:24:34.700 |
And to look at that, we have this example, which is Lisa does not skateboard. 00:24:40.660 |
And we have a syntax tree corresponding to this example. 00:24:44.700 |
And the meaning of the sentence can be composed according to the structure that's given by this syntax tree. 00:24:52.900 |
So the meaning of Lisa does not skateboard is a function of the meaning of Lisa and the meaning of does not skateboard. 00:24:58.740 |
The meaning of does not skateboard is a function of does and not skateboard. 00:25:01.820 |
The meaning of not skateboard is a function of not and skateboard. 00:25:06.820 |
And so this gives us one way of formalizing how we can measure compositionality in neural networks. 00:25:14.260 |
And so compositionality of representations could be thought of as how well the representation 00:25:19.700 |
approximates an explicitly homomorphic function in a learned representation space. 00:25:26.340 |
So what we're going to do is essentially measure if we were to construct a neural network whose 00:25:32.340 |
computations are based exactly according to these parse trees, how far are the representations 00:25:37.660 |
of a learned model from this explicitly compositional representation? 00:25:44.020 |
And that'll give us some understanding of how compositional the neural network's representations are. 00:25:50.500 |
So to unpack that a little bit, instead of having denotations, we now have vectors in a learned representation space. 00:26:03.700 |
And to be more concrete about that, we first start by choosing a distance function that 00:26:09.780 |
tells us how far away two representations are. 00:26:12.660 |
And then we also need a way to compose together two constituents to give us the meaning of the larger constituent. 00:26:21.740 |
But once we have that, we can start by-- we can create an explicitly compositional function, 00:26:28.460 |
So what we do is we have these representations at the leaves that are initialized randomly 00:26:37.660 |
and the composition function that's also initialized randomly. 00:26:40.660 |
And then a forward pass according to this syntax tree is used to compute the representation of the whole sentence. 00:26:48.060 |
And now once you have this representation, you can create a loss function. 00:26:51.700 |
And this loss function measures how far the representations of my neural network are from 00:26:57.500 |
those of this second, explicitly compositional proxy network that I've created. 00:27:02.180 |
And then I can basically optimize both the composition function and the embeddings of the leaves to minimize this loss. 00:27:10.220 |
And then once the optimization is finished, I can measure how far the representations 00:27:15.860 |
of my neural net are from this explicitly compositional network on a held-out set. 00:27:22.380 |
And that then tells me whether the representations my neural net learned were actually compositional or not. 00:27:28.780 |
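Here is a heavily simplified sketch of that procedure, in the spirit of tree reconstruction error. The composition function (a linear layer plus tanh over the concatenated children) and the cosine distance are choices I am assuming for illustration; the original formulation lets you pick both.

```python
# Sketch: fit an explicitly compositional "proxy" whose representations are built strictly
# bottom-up along the parse tree, then measure how far the probed model's representations
# are from it. A tree is either an int (word id) or a (left, right) pair of subtrees.
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 64                                   # assumed to match the probed model's hidden size
leaf_emb = nn.Embedding(1000, d)         # randomly initialized leaf (word) embeddings
compose = nn.Linear(2 * d, d)            # randomly initialized composition function

def distance(a, b):
    return 1 - F.cosine_similarity(a, b, dim=-1)   # chosen distance between representations

def tree_rep(tree):
    if isinstance(tree, int):
        return leaf_emb(torch.tensor(tree))
    left, right = tree
    return torch.tanh(compose(torch.cat([tree_rep(left), tree_rep(right)], dim=-1)))

def fit_proxy(trees, model_reps, steps=200):
    """Optimize leaves + composition so the tree-structured representations match the
    probed model's representations; the distance that remains on held-out sentences
    is the (approximate) tree reconstruction error."""
    params = list(leaf_emb.parameters()) + list(compose.parameters())
    opt = torch.optim.Adam(params, lr=1e-2)
    for _ in range(steps):
        loss = torch.stack([distance(tree_rep(t), r) for t, r in zip(trees, model_reps)]).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```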
So to see how well this works, let's look at a plot. 00:27:36.860 |
But just to unpack this a little bit, it plots the mutual information between the input that 00:27:45.840 |
the neural network receives and its representation, against the tree reconstruction error that we just described. 00:27:55.740 |
And to give some more background about what's to come, there is a theory which is called 00:28:02.420 |
the information bottleneck theory, which says that as a neural network trains, it first 00:28:09.980 |
tries to maximize the mutual information between the representation and the input in an attempt to memorize the data. 00:28:22.840 |
And then once memorization is done, there is a learning or a compression phase where this mutual information decreases. 00:28:31.060 |
And the model is essentially trying to compress the data or consolidate the knowledge in its representations. 00:28:38.140 |
And what we are seeing here is that as a model learns, which is characterized by decreasing 00:28:43.380 |
mutual information, we see that the representations themselves are becoming more and more compositional. 00:28:50.620 |
And overall, we observe that learning is correlated with increased compositionality as measured by this tree reconstruction metric. 00:29:01.900 |
So now that we have a method of measuring compositionality of representations in these 00:29:08.060 |
neural nets, how do we start to create benchmarks that see if they are generalizing systematically 00:29:18.000 |
So to do that, here is a method for taking any data set and splitting it into a train 00:29:23.580 |
test split that explicitly tests for this kind of generalization. 00:29:31.820 |
So to do that, we use this principle called maximizing the compound divergence. 00:29:38.140 |
And to illustrate how this principle works, we look at this toy example. 00:29:43.700 |
So in this toy example, we have a training data set that consists of just two examples 00:29:51.780 |
The atoms are defined as the primitive elements, so entity words, predicates, question types. 00:29:58.940 |
So in this toy example, Goldfinger, Christopher Nolan, these are all the primitive elements. 00:30:05.500 |
And the compounds are compositions of these primitive elements. 00:30:08.460 |
So 'who directed entity' would be a compound: the composition of the question type 'who directed' with an entity. 00:30:19.140 |
So here's a basic machinery for producing compositionally challenging splits. 00:30:23.600 |
So let's start by introducing two distributions. 00:30:27.700 |
The first distribution is the normalized frequency distribution of the atoms. 00:30:33.020 |
So given any data set, if we know what the notion of atoms are, we can basically compute 00:30:38.740 |
the frequency of all of the atoms and then normalize that by the total count. 00:30:43.460 |
And that's going to give us one distribution. 00:30:47.260 |
And we can repeat the same thing for the compounds. 00:30:49.900 |
And that will give us a second frequency distribution. 00:30:53.720 |
So note that these are just two probability distributions. 00:30:57.860 |
And once we have these two distributions, we can essentially define the atom and compound divergences between the train and test splits. 00:31:08.860 |
And we do that using the Chernoff coefficient between two categorical distributions. 00:31:15.060 |
The Chernoff coefficient basically measures how far two categorical distributions are. 00:31:20.820 |
So just to get a bit more intuition about this, if we set p equal to q, then the Chernoff 00:31:26.440 |
coefficient is 1, which means these distributions are maximally similar. 00:31:32.440 |
And if p and q have disjoint supports, that is, p is 0 everywhere q is non-zero and 00:31:40.560 |
vice versa, then the Chernoff coefficient is exactly 0, which means that these two distributions are maximally different. 00:31:49.240 |
And the overall objective is just that 00:31:58.080 |
we are going to maximize the compound divergence while minimizing the atom divergence. 00:32:03.780 |
And so what is the intuition behind doing such a thing? 00:32:06.500 |
So what we want is to ensure that the unigram distribution, in some sense, is constant between 00:32:12.380 |
the train and test split so that the model does not encounter any new words. 00:32:18.820 |
But we want the compound divergence to be very high, which means that these same words 00:32:24.660 |
that the model has seen many times must appear in new combinations, which means that we are explicitly testing for compositional generalization. 00:32:33.380 |
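As a rough sketch, the Chernoff coefficient and the resulting divergences might be computed as follows. The alpha values in the usage comment are the ones I recall the paper using for atoms and compounds, so treat them as placeholders rather than a definitive setting.

```python
# Chernoff coefficient between two normalized frequency distributions p and q:
#   C_alpha(p, q) = sum_k p_k^alpha * q_k^(1 - alpha)
# It is 1 when p == q and 0 when p and q have disjoint support; divergence = 1 - C_alpha.
from collections import Counter

def chernoff(p: dict, q: dict, alpha: float) -> float:
    keys = set(p) | set(q)
    return sum(p.get(k, 0.0) ** alpha * q.get(k, 0.0) ** (1 - alpha) for k in keys)

def divergence(train_items, test_items, alpha: float) -> float:
    def normalize(items):
        counts = Counter(items)
        total = sum(counts.values())
        return {k: v / total for k, v in counts.items()}
    return 1.0 - chernoff(normalize(train_items), normalize(test_items), alpha)

# Hypothetical usage, where atoms(...) / compounds(...) extract the primitive elements
# and their combinations from each example:
# atom_div     = divergence(train_atoms, test_atoms, alpha=0.5)
# compound_div = divergence(train_compounds, test_compounds, alpha=0.1)
# A maximum-compound-divergence split keeps atom_div small and compound_div large.
```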
And so if you follow this procedure for a semantic parsing data set, let's say, what 00:32:40.060 |
we see is that as you increase the scale, this model just does better and better. 00:32:50.020 |
But just pulling out a quote from this paper, "pre-training helps for compositional generalization ..." 00:32:57.180 |
And what that means is that maybe as you keep scaling up these models, you'll see better 00:33:00.820 |
and better performance, or maybe it starts to saturate at some point. 00:33:06.940 |
In any case, we should probably be thinking more about this problem instead of just trying to scale up. 00:33:14.260 |
So now this segment tells us that depending on the way we split a data set, we can measure very different kinds of generalization. 00:33:25.020 |
And that tells us that maybe we should be thinking more critically about how we're evaluating our models. 00:33:31.600 |
So there has been a revolution basically over the last few years in the field where we're 00:33:37.260 |
seeing all of these large transformer models beat all of our benchmarks. 00:33:40.500 |
At the same time, there is still not complete confidence that once we deploy these systems 00:33:46.460 |
in the real world, they're going to maintain their performance. 00:33:51.940 |
And so it's unclear if these gains are coming from spurious correlations or some real task understanding. 00:33:56.780 |
And so how do we design benchmarks that accurately tell us how well this model is going to do in the real world? 00:34:03.100 |
And so I'm going to give one example of works that try to do this. 00:34:11.180 |
And the idea of dynamic benchmarks is basically saying that instead of testing our models 00:34:17.140 |
on static test sets, we should be evaluating them on an ever-changing dynamic benchmark. 00:34:27.420 |
And the idea dates back to a 2017 workshop at EMNLP. 00:34:33.300 |
And so the overall schematic looks something like this, that we start with a training data 00:34:38.060 |
set and a test data set, which is the standard static setup. 00:34:44.020 |
And then once the model is trained, we deploy that and then have humans create new examples 00:34:54.940 |
that the model does not get right, but that humans have no issue figuring out the answer to. 00:34:59.860 |
So by playing this game of whack-a-mole, where humans figure out what are the holes 00:35:06.540 |
in the model's understanding, and then add that back into the training data, re-train 00:35:11.420 |
the model, deploy it again, have humans create new examples, we can essentially construct 00:35:16.060 |
this never-ending data set, this never-ending test set, which can hopefully be a better reflection of real-world performance. 00:35:28.500 |
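As pseudocode, the loop looks roughly like this; both helper functions are placeholders for real infrastructure (your training pipeline and a human-in-the-loop annotation platform), not an actual API.

```python
# Sketch of the dynamic benchmarking loop described above.
def train(dataset):
    raise NotImplementedError("stand-in for your model training code")

def collect_fooling_examples(model):
    raise NotImplementedError("stand-in for humans writing examples the model gets wrong")

def dynamic_benchmark(train_set, rounds=3):
    test_rounds = []
    model = train(train_set)
    for _ in range(rounds):
        # Humans probe the deployed model and write examples it answers incorrectly
        # but that people find easy; these form a new, harder test round.
        fooled = collect_fooling_examples(model)
        test_rounds.append(fooled)
        # Fold the new examples back into training, retrain, and deploy again.
        train_set = train_set + fooled
        model = train(train_set)
    return model, test_rounds
```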
So this is some really cutting-edge research. 00:35:33.700 |
And one of the main challenges of this class of works is that it's unclear how much this 00:35:39.180 |
can scale up, because maybe after multiple iterations of this whack-a-mole, humans are no longer able to come up with examples that fool the model. 00:35:49.380 |
So figuring out how to deal with that is really an open problem. 00:35:55.060 |
And current approaches just use examples from other data sets to prompt humans to think of new examples. 00:36:01.360 |
But maybe we can come up with better, more automated methods of doing this. 00:36:11.940 |
Or actually, let me stop for questions at this point and see if people have questions. 00:36:29.980 |
With dynamic benchmark, doesn't this mean that the model creator will also need to continually 00:36:36.060 |
test/evaluate the models on the new benchmarks, new data sets? 00:36:48.740 |
Yeah, so with dynamic benchmarks, yes, it's absolutely true that you will have to continuously retrain and re-evaluate your models. 00:36:58.940 |
And that's just to ensure that the reason your model is not doing well on the test set 00:37:06.340 |
doesn't have to do with this domain mismatch. 00:37:09.820 |
And what we're really trying to do is just come up with a better estimate 00:37:17.180 |
of the model's performance on the overall task, and get more and more accurate estimates of that. 00:37:23.420 |
So yes, to answer your question, yes, we need to keep training the model again and again. 00:37:35.860 |
So in this final segment, I'll talk about how we can move beyond just training models on text. 00:37:45.600 |
So many have articulated the need to use modalities other than text if we someday want to get to models that really understand language. 00:37:55.380 |
And ever since we've had these big language models, there has been a rekindling of this debate. 00:38:03.420 |
And recently, there was multiple papers on this. 00:38:06.300 |
And so at ACL last year, there was this paper that argues through multiple thought experiments 00:38:11.900 |
that it's actually impossible to acquire meaning from form alone, where meaning refers to the 00:38:17.700 |
communicative intent of a speaker, and form refers to text or speech signals. 00:38:24.300 |
A more modern version of this was put forward by the second paper, where they say that training 00:38:31.180 |
on only web-scale text data limits the world scope of models and limits the aspects of meaning they can acquire. 00:38:41.360 |
And so here is a diagram that I borrowed from the paper. 00:38:44.640 |
And what they say is that in the era where we were training models on supervised data sets, models were in world scope one. 00:38:54.020 |
And now that we've moved on to exploiting unlabeled data, we're now in world scope two, 00:38:59.720 |
where models just have strictly more signal to get more aspects of meaning in. 00:39:05.400 |
If you mix in additional modalities into this-- so maybe you mix in videos, and maybe you 00:39:10.000 |
mix in images-- then that expands out the world scope of the model further. 00:39:15.960 |
And now maybe it can acquire more aspects of meaning. 00:39:27.160 |
And then if you go beyond that, you can have a model that is embodied, and it's actually 00:39:32.320 |
living in an environment where it can interact with its world, conduct interventions and experiments. 00:39:39.620 |
And then if you go even beyond that, you can have models that live in a social world where they communicate with other agents. 00:39:46.240 |
Because after all, the purpose of language is to communicate. 00:39:49.480 |
And so if you can have a social world where models can communicate with other models, then maybe they can acquire even more aspects of meaning. 00:40:04.280 |
So there are a lot of open questions in this space. 00:40:07.600 |
But given that there are all of these good arguments about how we need to move beyond 00:40:11.600 |
text, what is the best way to do this at scale? 00:40:16.320 |
We know that babies cannot learn language from watching TV alone, for example. 00:40:21.960 |
So there have to be some interventions, and there have to be interactions with the environment. 00:40:28.360 |
But at the same time, the question is, how far can models go by just training on static 00:40:34.360 |
data as long as we have additional modalities, especially when we combine this with scale? 00:40:40.880 |
And if interactions with the environment are really necessary, how do we collect data and 00:40:45.400 |
design systems that interact minimally or in a cost-effective way? 00:40:50.280 |
And then finally, could pre-training on text still be useful if any of these other research directions pan out? 00:41:03.900 |
So if you're interested in learning more about this topic, I highly encourage you to take 00:41:11.160 |
They have multiple lectures on just language learning. 00:41:19.680 |
So in this final segment, I'm going to talk a little bit more about how you can get involved 00:41:24.880 |
with NLP and deep learning research and how you can make more progress. 00:41:32.240 |
So here are some general principles for how to make progress in NLP research. 00:41:38.220 |
So I think the most important thing is to just read broadly, which means not just read 00:41:42.680 |
the latest and greatest papers on arXiv, but also read pre-2010 statistical NLP. 00:41:50.360 |
Learn about the mathematical foundations of machine learning to understand how generalization works. 00:41:58.840 |
Learn more about language, which means taking classes in the linguistics department. 00:42:03.200 |
In particular, I would recommend maybe this 138A. 00:42:09.680 |
And finally, if you want inspiration from how babies learn, then definitely read about language acquisition in children. 00:42:19.200 |
Finally, learn your software tools, which involves scripting tools, version control, 00:42:29.240 |
data wrangling, learning how to visualize quickly with Jupyter Notebooks. 00:42:34.480 |
And deep learning often involves running multiple experiments with different hyperparameters 00:42:41.720 |
And sometimes it can get really hard to keep track of everything. 00:42:44.440 |
So learn how to use experiment management tools like Weights & Biases. 00:42:50.920 |
And finally, I'll talk about some really quick final project tips. 00:42:57.400 |
So first, let's just start by saying that if your approach doesn't seem to be working, here are some things to check. 00:43:03.400 |
Put assert statements everywhere and check if the computations that you're doing are what you expect them to be. 00:43:09.160 |
Use breakpoints extensively, and I'll talk a bit more about this. 00:43:13.200 |
Check if the loss function that you've implemented is correct. 00:43:16.760 |
And one way of debugging that is to see that the initial values are correct. 00:43:21.360 |
So if you're doing a k-way classification problem, then the initial loss should be about log(k). 00:43:25.840 |
Always, always, always start by creating a small training data set which has like 5 to 00:43:31.520 |
10 examples and see if your model can completely overfit that. 00:43:35.000 |
If not, there's a problem with your training loop. 00:43:38.720 |
Check for saturating activations and dead units. 00:43:41.800 |
And often, this can be traced back to problems with the gradients, or maybe 00:43:46.320 |
problems with the initialization. 00:43:51.160 |
Check your gradient values: see if they're too small, which means that maybe you should be using residual connections or a better initialization. 00:43:55.960 |
Or if they're too large, then you should use gradient clipping. 00:44:04.000 |
If your approach doesn't work, come up with hypotheses for why this might be the case. 00:44:14.320 |
And just try to be systematic about everything. 00:44:17.760 |
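Here is a small sketch of two of the checks above (the initial loss of a k-way classifier and overfitting a tiny training set), assuming PyTorch and a stand-in linear model:

```python
# Two quick sanity checks for a k-way classifier, assuming PyTorch.
import math
import torch
import torch.nn as nn

k = 10
model = nn.Linear(100, k)            # stand-in for your real model
loss_fn = nn.CrossEntropyLoss()

# Check 1: with random inputs and an untrained model, the loss should be close to ln(k).
x, y = torch.randn(32, 100), torch.randint(0, k, (32,))
print(loss_fn(model(x), y).item(), "should be close to", math.log(k))

# Check 2: a correct training loop should be able to completely overfit 5-10 examples.
tiny_x, tiny_y = torch.randn(8, 100), torch.randint(0, k, (8,))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for step in range(500):
    loss = loss_fn(model(tiny_x), tiny_y)
    opt.zero_grad(); loss.backward(); opt.step()
print("loss on the tiny set:", loss.item())  # should be near zero; if not, suspect a bug
```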
So I'll just say a little bit more about breakpoints. 00:44:29.280 |
To create a breakpoint, just add the line import pdb; pdb.set_trace() before the line you want to inspect. 00:44:37.360 |
So earlier today, I was trying to play around with the Transformers library. 00:44:47.560 |
And the context is the sentence, "One morning I shot an elephant in my pajamas." 00:44:57.760 |
And to solve this problem, I basically imported a tokenizer and a BERT model. 00:45:04.240 |
And I initialized my tokenizer, initialized my model, tokenized my input, and then hit an error when I ran the model. 00:45:19.400 |
And so the best way to look at what's causing this error is to actually put a breakpoint. 00:45:24.920 |
So right after model.eval, I put a breakpoint. 00:45:27.960 |
Because I know that that's where the problem is. 00:45:35.600 |
And now once I put this breakpoint, I can just run my script again. 00:45:43.320 |
And at this point, I can examine all of my variables. 00:45:46.360 |
So I can look at the token as input, because maybe that's where the problem is. 00:45:50.400 |
And lo and behold, I see that it's actually a list. 00:45:54.460 |
So it's a dictionary of lists, whereas the model typically expects a torch tensor. 00:46:01.560 |
And that means I can quickly go ahead and fix it. 00:46:06.280 |
So this just shows that you should use breakpoints everywhere if your code is not working. 00:46:10.960 |
And it can just help you debug really quickly. 00:46:16.160 |
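For reference, here is a sketch of roughly that debugging session with the Hugging Face Transformers library. The model name and the sentence are just the ones from the example above, and the fix shown (asking the tokenizer for PyTorch tensors with return_tensors="pt") is the standard one.

```python
# Sketch of the debugging session described above, using Hugging Face Transformers.
import pdb
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()
# pdb.set_trace()   # drop into the debugger here to inspect variables interactively

text = "One morning I shot an elephant in my pajamas."

inputs = tokenizer(text)                        # a dict of plain Python lists -> model errors out
inputs = tokenizer(text, return_tensors="pt")   # the fix: ask for PyTorch tensors

with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)          # (1, sequence_length, hidden_size)
```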
So finally, I'd say that if you want to get involved with NLP and deep learning research, 00:46:21.640 |
and if you really like the final project, we have the CLIPS program at Stanford. 00:46:26.160 |
And this is a way for undergrads, master's students, and PhD students who are interested in deep 00:46:31.520 |
learning and NLP research to get involved with the NLP group. 00:46:36.480 |
So we highly encourage you to apply to CLIPS. 00:46:40.560 |
And so I'll conclude today's class by saying that we've made a lot of progress in the last few years. 00:46:49.240 |
And that's mostly due to a clever understanding of neural networks, data, and hardware, all of these coming together. 00:46:55.520 |
We have some really amazing technologies that can do really exciting things. 00:47:03.200 |
In the short term, I expect that we'll see more scaling because it just seems to help. 00:47:12.480 |
So I said that before, and I'll just say it again. 00:47:16.320 |
Scaling requires really non-trivial engineering efforts, and sometimes even clever algorithms. 00:47:21.740 |
And so there's a lot of interesting systems work to be done here. 00:47:25.040 |
But in the long term, we really need to be thinking more about these bigger problems 00:47:31.120 |
How can we make our models learn a new concept really quickly so that it's fast adaptation? 00:47:39.040 |
And then we also need to create benchmarks that we can actually trust. 00:47:42.560 |
If my model has some performance on some sentiment analysis data set and deployed in the real 00:47:48.360 |
world, that should be reflected in the number that I get from the benchmark. 00:47:52.000 |
So we need to make progress in the way we evaluate models. 00:47:56.080 |
And then also figuring out how to move beyond text in a more tractable way. 00:48:13.000 |
So I answered a question earlier that actually I think you could also opine on. 00:48:19.960 |
It was the question of whether you have a large model that's pre-trained on language, 00:48:24.040 |
if it will actually help you in other domains, like you apply it to vision stuff. 00:48:37.000 |
So there was a paper that came out really, really recently, like just two days ago, that looks at exactly this. 00:48:45.240 |
It's like one large transformer model that's pre-trained on text. 00:48:48.960 |
And then they apply it to other modalities; they definitely apply it to images. 00:48:53.600 |
And I think they apply it to math problems and some more modalities, and show that it's actually quite effective. 00:49:02.280 |
So if you pre-train on text and then you move to a different modality, that helps. 00:49:05.600 |
I think part of the reason for that is just that across modalities, there is a lot of shared structure. 00:49:13.680 |
And I think one reason for that is that language is really referring to the world around it. 00:49:19.800 |
And so you might expect that there is some correspondence there that goes beyond just the autoregressive objective. 00:49:28.360 |
So there's also works that show that if you have just text-only representations and image-only 00:49:34.080 |
representations, you can actually learn a simple linear classifier that can learn to map between them. 00:49:39.320 |
And all of these works are just showing that there's actually a lot more in common between these modalities than we might think. 00:49:47.120 |
So yeah, I think it's possible to pre-train on text and then fine-tune on your modality of interest. 00:49:55.160 |
And it should probably be effective, of course, based on what the modality is. 00:49:59.960 |
But for images and videos, it's certainly effective. 00:50:24.640 |
One is, what's the difference between CS224U and this class in terms of the topics covered? 00:50:31.960 |
Do you want to answer that one, Shikhar, or should I have a go at answering it? 00:50:39.600 |
So next quarter, CS224U, Natural Language Understanding, is co-taught by Chris Potts and Bill MacCartney. 00:50:50.840 |
So in essence, it's meant to be different in that natural language understanding focuses 00:51:00.360 |
on what its name says: how to build computer systems that understand the meaning of sentences. 00:51:08.960 |
Now, in truth, the boundary is kind of complex because we do some natural language understanding in this class too. 00:51:19.920 |
And certainly for the people who are doing the default final project, question answering, 00:51:24.560 |
well, that's absolutely a natural language understanding task. 00:51:29.260 |
But the distinction is meant to be that at least a lot of what we do in this class, things 00:51:37.680 |
like the assignment three dependency parser or building the machine translation system 00:51:45.960 |
in assignment four, that they are in some sense natural language processing tasks where 00:51:53.600 |
processing can mean anything but commonly means you're doing useful intelligent stuff 00:52:00.460 |
with human language input, but you're not necessarily deeply understanding it. 00:52:10.300 |
If you do CS224U, you'll certainly see word vectors and transformers again. 00:52:16.740 |
But the emphasis is on doing a lot more with natural language understanding tasks. 00:52:22.880 |
And so that includes things like building semantic parsers. 00:52:28.020 |
So they're the kind of devices that will, you know, respond to questions and commands 00:52:34.320 |
such as an Alexa or Google assistant will do. 00:52:40.460 |
Building relation extraction systems, which pull particular facts out of a piece of 00:52:45.480 |
text, like, oh, this person took on this position at this company. 00:52:52.940 |
Looking at grounded language learning and grounded language understanding where you're 00:52:57.780 |
not only using the language, but the world context to get information, and other tasks like that. 00:53:06.500 |
I mean, I guess you're going to look at the website to get more details of it. 00:53:10.580 |
I mean, you know, relevant to this class, I mean, a lot of people also find it an opportunity 00:53:17.200 |
to just get further in doing a project in the area of natural language processing, because, 00:53:24.880 |
sort of by the nature of the structure of the class, it more assumes 00:53:30.320 |
that people know how to build deep learning natural language systems at the beginning, 00:53:35.400 |
rather than having a large percentage of the class go into, okay, you have to do all 00:53:41.400 |
of these assignments (although there are little assignments earlier on), so there's sort 00:53:46.080 |
of more time to work on a project for the quarter. 00:53:53.080 |
Here's one more question that maybe Shikhar could do. 00:53:57.680 |
Do you know of attempts to crowdsource dynamic benchmarks, e.g. users uploading adversarial examples? 00:54:08.160 |
Yeah, so actually, like, the main idea there is to use crowdsourcing, right? 00:54:19.140 |
So there is this platform that was created by FAIR, it's called Dynabench. 00:54:25.080 |
And the objective is just that to construct this like dynamically evolving benchmark, 00:54:31.800 |
we are just going to offload it to users of this platform. 00:54:36.560 |
And you can, you know, it essentially gives you utilities for like, deploying your model 00:54:41.200 |
and then having, you know, humans kind of try to fool the model. 00:54:46.360 |
Yeah, so this is basically how the dynamic benchmark collection actually happens. 00:54:56.160 |
So we deploy a model on some platform, and then we get humans to like fool the system. 00:55:19.720 |
Can you address the problem of NLP models not being able to remember really long contexts, 00:55:25.240 |
and techniques to do inference on really long inputs? 00:55:29.320 |
Yeah, so I guess like, there have been like a few works recently that kind of try to scale 00:55:38.640 |
up transformers to like really large context lengths. 00:55:45.600 |
And there's also the Transformer-XL that was the first one to try and do that. 00:55:51.720 |
I think what is unclear is whether you can combine that with the scale of these GPT-like models. 00:56:02.240 |
And whether you see qualitatively different things once you do that. And part of 00:56:07.920 |
it is just that all of this is just so recent, right? 00:56:10.720 |
But yeah, I think the open question there is that, you know, can you take these 00:56:15.320 |
really long context transformers that can operate over long contexts, combine that with the 00:56:21.400 |
scale of GPT-3, and then get models that can actually reason over these really long contexts. 00:56:28.080 |
Because I guess the hypothesis of scale is that once you train language models at scale, interesting new capabilities emerge. 00:56:36.320 |
And so to do that for long context, we actually need to like have long context transformers 00:56:55.720 |
So I'm seeing this other question about language acquisition. 00:57:07.680 |
Yeah, so the question is, what do you think we can learn from baby language acquisition? 00:57:17.000 |
Can we build a language model in a more interactive way, like reinforcement learning? 00:57:28.600 |
And you know, I think the short non-helpful answer is that there are kind of no answers 00:57:35.680 |
I know people have certainly tried to do things at various scales, but you know, we just have 00:57:41.840 |
no technology that is the least bit convincing for being able to replicate the language learning of a human child. 00:57:53.800 |
But after that prologue, what I could say is, I mean, yeah, there are definitely ideas 00:58:02.840 |
So you know, there are sort of clear results, which is that little kids don't learn by watching TV. 00:58:11.240 |
So it seems like interaction is completely key. 00:58:21.440 |
They're in a very rich environment where people are sort of both learning stuff from the environment 00:58:28.040 |
in general, and in particular, you know, they're learning a lot from what language acquisition 00:58:35.240 |
researchers refer to as joint attention, which is different from what we mean by attention. 00:58:42.360 |
But it means that the caregiver will be looking at the object that's the focus of interest 00:58:48.680 |
and you know, commonly other things as well, like sort of, you know, picking it up and 00:58:52.440 |
bringing it near the kid and all those kinds of things. 00:58:57.720 |
And you know, babies and young kids get to experiment a lot, right? 00:59:03.440 |
So regardless of whether it's learning what happens when you have some blocks that you 00:59:09.880 |
stack up and play with them, or you're learning language, you sort of experiment by trying 00:59:17.280 |
some things and see what kind of response you get. 00:59:20.920 |
And again, that's essentially building on the interactivity of it that you're getting 00:59:27.240 |
some kind of response to any offerings you make. 00:59:30.800 |
And you know, this is something that's sort of been hotly debated in the language acquisition 00:59:36.680 |
So a traditional Chomskyan position is that, you know, human beings don't get effective 00:59:47.040 |
feedback, you know, supervised labels, when they talk. 00:59:52.880 |
And you know, in some very narrow sense, well, that's true, right? 00:59:56.240 |
It's just not the case that after a baby tries to say something that they get feedback of, 01:00:01.280 |
you know, syntax error in English on word four, or they get given, here's the semantic form you should have produced. 01:00:12.520 |
But in a more indirect way, they clearly get enormous feedback, they can see what kind 01:00:18.040 |
of response they get from their caregiver at every corner. 01:00:24.720 |
And so like in your question, you were suggesting that, well, somehow we should be making use 01:00:32.920 |
of reinforcement learning because we have something like a reward signal there. 01:00:37.960 |
And you know, in a big picture way, I'd say, oh, yeah, I agree. 01:00:42.840 |
In terms of a much more specific way as to, well, how can we possibly get that to work 01:00:48.160 |
to learn something with the richness of human language? 01:00:52.800 |
You know, I think we don't have much idea, but you know, there has started to be some work in that direction. 01:01:00.560 |
So people have been sort of building virtual environments, which, you know, you have your 01:01:07.600 |
avatar in and that can manipulate in the virtual environment and there's linguistic input, 01:01:14.720 |
and it can succeed in getting rewards for sort of doing a command where the command 01:01:19.480 |
can be something like, you know, pick up the orange block or something like that. 01:01:24.920 |
And you know, to a small extent, people have been able to build things that work. 01:01:31.880 |
I mean, as you might be picking up, I mean, I guess so far, at least I've just been kind 01:01:39.760 |
of underwhelmed because it seems like the complexity of what people have achieved is 01:01:45.440 |
sort of, you know, just so primitive compared to the full complexity of language, right? 01:01:51.680 |
You know, the kind of languages that people have been able to get systems to learn are 01:01:57.080 |
ones that can, yeah, do pick up commands where they can learn, you know, blue cube versus red sphere or something like that. 01:02:04.720 |
And that's sort of about how far people have gotten. 01:02:07.720 |
And that's sort of such a teeny small corner of what's involved in learning a human language. 01:02:14.840 |
One thing I'll just add to that is I think there are some principles of how kids learn 01:02:24.480 |
that people have tried to apply to deep learning. 01:02:27.520 |
And one example that comes to mind is curriculum learning, where there's like a lot of literature 01:02:33.200 |
that shows that, you know, babies tend to pay attention to things that are just at the right level of difficulty for them. 01:06:41.520 |
And they don't pay attention to things that are extremely challenging, and also don't 01:02:45.000 |
pay attention to things that they know how to solve. 01:02:47.680 |
And many researchers have really tried to get curriculum learning to work. 01:02:53.480 |
And the verdict on that is that it seems to kind of work when you're in reinforcement learning settings. 01:05:59.280 |
But it's unclear if it's going to work on like supervised learning settings. 01:03:03.280 |
But I still think that it's like under explored. 01:03:05.640 |
And maybe, you know, there should be like more attempts to kind of see if we can like 01:03:11.920 |
add in curriculum learning and if that improves anything. 01:03:18.840 |
Curriculum learning is an important idea, which we haven't really talked about. 01:03:23.480 |
But it seems like it's certainly essential to human learning. 01:03:27.820 |
And there's been some minor successes with it in the machine learning world. 01:03:31.880 |
But it sort of seems like it's an idea you should be able to do a lot more with in the 01:03:36.560 |
future as you move from models that are just doing one narrow task to trying to do a more 01:03:51.880 |
Okay, the next question is, is the reason humans learn languages better just because 01:03:57.240 |
we are pre-trained over millions of years of physics simulation? 01:04:01.680 |
Maybe we should pre train a model the same way. 01:04:05.680 |
So I mean, I presume what you're saying is physics simulation, you're invoking evolution as a kind of pre-training. 01:04:15.000 |
So you know, this is a controversial, debated, big question. 01:04:24.180 |
So you know, again, if I invoke Chomsky again, so Noam Chomsky is sort of the most famous linguist there is. 01:04:36.480 |
And you know, essentially, Noam Chomsky's career starting in the 1950s is built around 01:04:42.520 |
the idea that little children get such dubious linguistic input because you know, they hear 01:04:52.280 |
a random bunch of stuff, they don't get much feedback on what they say, etc. 01:04:57.720 |
But language could not be learned empirically just from the data observed. 01:05:04.300 |
And the only possible assumption to work from is that significant parts of human language are 01:05:15.000 |
innate, in the sort of human genome; babies are born with that. 01:05:19.400 |
And that explains the miracle by which very little humans learn amazingly fast how human language works. 01:05:28.560 |
Now, to give some credit to that idea, for those of you who have not been around little 01:05:36.560 |
children, I mean, I think one does just have to acknowledge that human language acquisition is pretty astonishing. 01:05:46.680 |
I mean, it does just seem to be miraculous, right? 01:05:49.880 |
As you go through this sort of slow phase for a couple of years where, you know, the 01:05:57.000 |
kid sort of goos and gahs some syllables, and then there's a fairly long period where 01:06:02.280 |
they've picked up a few words, and they can say "juice, juice" when they want to drink some juice. 01:06:10.360 |
And then it just sort of seems like there's this phase change, where the kids suddenly 01:06:15.720 |
realize, wait, this is a productive generative sentence system, I can say whole sentences. 01:06:21.680 |
And then in an incredibly short period, they sort of seem to transition from saying one 01:06:27.380 |
and two-word utterances to suddenly they can say, you know, "Daddy come home in garage." 01:06:39.160 |
And you go, wow, how did they suddenly discover language? 01:06:47.920 |
But personally, for me, at least, you know, I've just never believed the strong versions 01:06:56.360 |
of the hypothesis that human beings have much in the way of language specific knowledge 01:07:05.140 |
or structure in their brains that comes from genetic inheritance. 01:07:09.920 |
Like clearly, humans do have these very clever brains. 01:07:14.600 |
And if we're talking about things like being able to think, or being able to interpret 01:07:21.160 |
the visual world, those are things that have developed over tens of millions of years. 01:07:29.640 |
And evolution can be a large part of the explanation. 01:07:35.000 |
And humans are clearly born with lots of vision-specific hardware in their brains, as are other animals. 01:07:44.720 |
But when you come to language, no one knows when language in something like its 01:07:53.160 |
modern form first became available, because there aren't any fossils of people 01:07:59.840 |
saying the word "spear" or something like that. 01:08:04.400 |
But to the extent that there are estimates, based on what you can see 01:08:09.920 |
of the spread of proto-humans and their apparent social structures from 01:08:19.280 |
what you can find in fossils, most people guess that language is at most a couple of hundred thousand years old. 01:08:27.880 |
And you know, that's just too short a time for evolution to 01:08:35.040 |
build any significant structure inside human brains that's specific to language. 01:08:40.120 |
So I kind of think that the working assumption has to be that sort of there's just about 01:08:47.880 |
nothing specific to language in human brains. 01:08:52.160 |
And the most plausible hypothesis, not that I know very much about neuroscience 01:08:57.960 |
when it comes down to it, is that humans were able to repurpose hardware that was 01:09:04.680 |
originally built for other purposes, like visual scene interpretation and memory, and 01:09:11.760 |
that that gave a basis of having all this clever hardware that you could then use for language. 01:09:18.280 |
You know, it's kind of like GPUs were invented for playing computer games, and we were able 01:09:23.360 |
to repurpose that hardware to do deep learning. 01:09:27.280 |
Okay, we've got a lot of questions that have come out at the end. 01:09:43.120 |
Yeah, if you could name, I guess this is for either of you, one main bottleneck: 01:09:48.560 |
if we could provide feedback efficiently to our systems, like babies are given feedback, 01:09:56.240 |
what's the bottleneck that remains in trying to have more human-like language acquisition? 01:10:24.680 |
Yeah, I was just going to say that I think it's a bit of everything, right? 01:10:30.840 |
Like, I think in terms of models, one thing I would say is that we know that there are more 01:10:37.040 |
feedback connections than feedforward connections in the brain. 01:10:42.160 |
And we haven't really figured out a way of using that. Of course, we have RNNs, 01:10:49.360 |
which you can think of as sort of implementing a feedback loop, 01:10:52.880 |
but we still haven't really figured out how to 01:10:58.080 |
use that knowledge, that the brain has a lot of feedback connections, and then 01:11:01.800 |
apply that to practical systems. So on the modeling end, maybe that's one problem. 01:11:10.840 |
Yeah, I think curriculum learning is maybe one of them, but I think the one 01:11:16.360 |
that's probably going to have the most bang for the buck is really figuring out how we can move beyond learning from text alone. 01:11:21.680 |
And I think there's just so much more information that's available that we're just not using. 01:11:28.840 |
And so I think that's where most of the progress might come from, like figuring out what's out there beyond text and how to use it. 01:11:49.320 |
What are some important NLP topics that we have not covered in this class? 01:12:01.320 |
You know, well, sort of one answer is a lot of the topics that are covered in CS224U because, 01:12:07.320 |
you know, we do make a bit of an effort to keep the two classes disjoint, though not fully. 01:12:13.320 |
So there's sort of lots of topics in language understanding that we haven't covered. 01:12:21.320 |
So if you want to make a voice assistant like Alexa, Siri, or Google Assistant, well, you 01:12:31.320 |
need to be able to interface with systems, APIs that can do things like delete a calendar appointment. 01:12:40.320 |
And so you need to be able to convert from language into an explicit semantic form that those systems can act on. 01:12:51.320 |
So there's lots of language understanding stuff. 01:12:54.320 |
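As a toy illustration of what converting language into an explicit semantic form can look like, here is a hypothetical frame for a calendar command plus a dispatcher that hands it to a backend. The intent name, slot names, and backend interface are invented for this example and are not any real assistant API.

# Hypothetical semantic frame for the utterance "delete my 9am meeting on Friday".
semantic_frame = {
    "intent": "calendar.delete_event",
    "slots": {"event_time": "09:00", "event_day": "Friday"},
}

def execute(frame, backend):
    # The language understanding model only has to produce the structured frame;
    # a backend (here, a dict mapping intents to handler functions) carries it out.
    handler = backend[frame["intent"]]
    return handler(**frame["slots"])

The point is the division of labor: the hard NLP problem is mapping free-form language onto a frame like this, while executing the frame is ordinary software engineering.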
There's also lots of language generation things. 01:12:59.320 |
So, you know, effectively for language generation, all we have done is neural language models. 01:13:14.320 |
It's awesome the kind of generation you can do with things like GPT-2 or 3. 01:13:22.320 |
But, you know, what's missing there is that this is really only giving you the ability to produce 01:13:32.320 |
fluent text; it rabbits on, producing fluent text. Whereas if you actually wanted to 01:13:39.320 |
have a good natural language generation system, you'd also have to have higher-level planning 01:13:46.320 |
of what you're going to talk about and how you are going to express it. 01:13:53.320 |
So then in most situations in natural language, you think, OK, well, I want to explain to 01:14:00.320 |
people something about why it's important to do math classes at college. 01:14:08.320 |
Maybe I should talk about some of the different applications where math turns up and how it's useful. 01:14:15.320 |
And so you kind of plan out, here's how I can present these ideas. 01:14:20.320 |
And that kind of natural language generation, we haven't done any of. 01:14:28.320 |
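To illustrate the plan-then-realize idea in rough code, here is a hypothetical two-stage sketch: a content plan listing the points to make, followed by a surface-realization step that asks a language model to render each point. The lm_generate function and the prompt wording are assumptions made up for this example.

# Hypothetical content plan for "explain why math classes matter in college".
content_plan = [
    "state why it is important to take math classes in college",
    "give a few different applications where math turns up",
    "connect those applications back to coursework",
]

def realize(plan, lm_generate):
    # Surface realization: an assumed language model interface (lm_generate takes a
    # prompt string and returns text) turns each planned point into fluent prose.
    paragraphs = [lm_generate("Write a short paragraph to " + point + ".")
                  for point in plan]
    return "\n\n".join(paragraphs)

The planning step here is just a hand-written list; building systems that produce such plans automatically is the part that, as noted above, the course has not covered.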
So that's sort of saying more understanding and more generation, which is most of NLP, you know. 01:14:38.320 |
I mean, obviously, there are then sort of particular tasks that we can talk about that 01:14:43.320 |
we either have or have not explicitly addressed. 01:14:48.320 |
Has there been any work on putting language models into an environment in which they can interact and get some kind of reward? 01:15:03.320 |
And do you think this would help with unsupervised learning? 01:15:12.320 |
So I guess there's been a lot of work on emergent communication and also self-play, where you 01:15:19.320 |
have these different models, which are initialized as language models, that attempt to communicate with each other to solve a task. 01:15:30.320 |
And then you have a reward at the end, whether they were able to finish the task or not. 01:15:35.320 |
And then based on that reward, you attempt to learn the communication strategy. 01:15:40.320 |
And this started out as emergent communication and self-play. 01:15:45.320 |
I think it was last year or the year before that, where they showed that if you initialize 01:15:50.320 |
these models with language model pre-training, you basically prevent this problem of language 01:15:58.320 |
drift, where the language, or the communication protocol, that your models end up learning 01:16:08.320 |
drifts away from natural language. And so, yeah, I mean, in that sense, there has been some work. 01:16:13.320 |
I think there are some groups that try to study this, but not much beyond that. 01:16:23.320 |
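Here is a minimal sketch of the kind of self-play setup described above, assuming a REINFORCE-style update, a PyTorch-style optimizer and differentiable log-probability, and hypothetical speaker, listener, and task interfaces; none of these names come from a specific paper or library.

def self_play_episode(speaker, listener, task, optimizer):
    # Both agents are assumed to be initialized from pretrained language models,
    # which is what reportedly helps prevent language drift.
    message, log_prob = speaker.generate(task.observation())  # log_prob: differentiable scalar
    action = listener.act(message)
    reward = task.reward(action)  # e.g. 1.0 if the task was completed, else 0.0

    # REINFORCE-style update: raise the probability of messages that led to success.
    loss = -reward * log_prob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward

In practice, approaches along these lines often also mix in a language-modeling loss or a penalty toward the pretrained model to keep the learned protocol close to natural language; the exact recipe varies, and this sketch does not follow any particular paper.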
OK, I mean, the last two questions are about how we learn. 01:16:31.320 |
There's one question about whether we learn some correlations from social cues or from a reward. 01:16:38.320 |
I don't know if either of you have opinions about this, but if you do. 01:16:45.320 |
Yeah, I mean, I don't have anything very deep to say about this question. 01:16:50.320 |
It's on the importance of social cues as opposed to pure reward-based systems. 01:16:56.320 |
Well, I mean, in some sense, you can also regard a social cue as a reward, in that people 01:17:04.320 |
like to have other people put a smile on their face when they say something. 01:17:10.320 |
But I do think generally, when people are saying, what have we not covered? 01:17:18.320 |
Another thing that we've barely covered is the social side of language. 01:17:23.320 |
So, you know, a huge interesting thing about language is it has this very wide dynamic range. 01:17:33.320 |
So on the one hand, you can talk about very precise things in language. 01:17:38.320 |
So you can sort of talk about math formulas and steps in a proof and things like that, 01:17:43.320 |
so that there's a lot of precision in language. 01:17:46.320 |
You know, on the other hand, you can just sort of mumble whatever 01:17:51.320 |
words at all, and you're not really communicating anything in the way of propositional content. 01:17:58.320 |
What you're really trying to communicate is, you know, oh, I'm thinking about you. 01:18:04.320 |
And, oh, I'm concerned with how you're feeling or whatever it is in the circumstances, right? 01:18:10.320 |
So a huge part of language use is forms of social communication between people. 01:18:20.320 |
And, you know, that's another big part of actually building successful natural language systems. 01:18:30.320 |
So if you, you know, if you think negatively about something like the virtual assistants 01:18:35.320 |
I've been falling back on a lot, they have virtually no ability as social agents. 01:18:44.320 |
So we're now training a generation of little kids that what you should do is sort of bark 01:18:52.320 |
out commands as if you were, you know, serving in the German army in World War II or something, 01:18:59.320 |
and that there's none of the kind of social part of how to use language to 01:19:08.320 |
communicate satisfactorily with human beings and to maintain a social system. 01:19:15.320 |
And that's a huge part of human language use that kids have to learn. 01:19:23.320 |
You know, a lot of being successful in the world is, when you want 01:19:29.320 |
someone to do something for you, knowing that there are good ways to ask them for it. 01:19:35.320 |
Some of it is the choice of how to present the arguments, but some of it is 01:19:41.320 |
by building social rapport and asking nicely and reasonably and making it seem like you're 01:19:48.320 |
a sweet person that other people should do something for. 01:19:51.320 |
And, you know, human beings are very good at that. 01:19:54.320 |
And being good at that is a really important skill for being able to navigate the world well.