
Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 18 - Future of NLP + Deep Learning


Chapters

0:00
2:35 General Representation Learning Recipe
4:46 Facts about GPT-3
21:40 Systematicity and Language Grounding
22:18 Principle of Compositionality
22:38 Are Human Languages Really Compositional
24:04 Are Neural Representations Compositional
28:01 The Information Bottleneck Theory
30:19 Producing Compositionally Challenging Splits
30:29 Normalized Frequency Distribution of the Atoms
34:08 Dynamic Benchmarks
37:33 Language Grounding
44:21 Breakpoints
58:37 Attention

Whisper Transcript | Transcript Only Page

00:00:00.000 | Good afternoon, folks. Welcome to lecture 18. Today, we'll be talking about some of
00:00:10.240 | the latest and greatest developments in neural NLP, where we've come and where we're headed.
00:00:16.720 | Chris, just to be sure, are my presenter notes visible from this part?
00:00:22.440 | They're visible.
00:00:24.440 | Okay, but not my presenter notes, right?
00:00:28.640 | Correct.
00:00:29.640 | Okay, thank you.
00:00:32.000 | So just as a reminder, note that your guest lecture reactions are due tomorrow at 11:59 PM.
00:00:37.960 | Great job with the project milestone reports.
00:00:40.720 | You should have received feedback by now.
00:00:42.720 | If not, contact the course staff.
00:00:44.480 | I think we had some last minute issues, but if that's not resolved, please contact us.
00:00:51.120 | Finally, the project reports are due very soon, on March 16th, which is next week.
00:00:57.160 | There's one question on Ed about the leaderboard, and the last day to submit on the leaderboard
00:01:03.400 | is March 19th as well.
00:01:07.840 | Okay, so for today, we'll start by talking about extremely large language models and
00:01:13.360 | GPT-3 that have recently gained a lot of popularity.
00:01:18.520 | We'll then take a closer look at compositionality and generalization of these neural models.
00:01:26.400 | While transformer models like BERT and GPT have really high performance on all benchmarks,
00:01:30.520 | they still fail in really surprising ways when deployed.
00:01:33.640 | How can we strengthen the way we evaluate these models so the results more closely
00:01:38.800 | reflect task performance in the real world?
00:01:44.280 | And then we end by talking about how we can move beyond this really limited paradigm of
00:01:48.640 | teaching models language only through text and look at language grounding.
00:01:53.760 | Finally, I'll give some practical tips on how to move forward in your neural NLP research,
00:01:59.600 | and this will include some practical tips for the final project as well.
00:02:07.040 | So this meme really kind of captures what's been going on in the field.
00:02:14.680 | And it's just that our ability to harness unlabeled data has vastly increased over the
00:02:19.000 | last few years.
00:02:20.240 | And this has been made possible due to advances in not just hardware, but also systems and
00:02:26.120 | our understanding of self-supervised training so we can use lots and lots of unlabeled data.
00:02:34.640 | So based on this, here is a general representation learning recipe that just works for basically
00:02:41.520 | most modalities.
00:02:43.080 | So the recipe is basically as follows.
00:02:47.460 | It's really modality agnostic.
00:02:56.560 | So in step 1, you take your data, whether it's images, text, or videos, and you convert it into a sequence
00:03:00.160 | of integers.
00:03:02.280 | And in step 2, you define a loss function to maximize data likelihood or create a denoising
00:03:06.960 | autoencoder loss.
00:03:08.960 | Finally, in step 3, train on lots and lots of data.
00:03:14.760 | Different properties emerge only when we scale up model size.
00:03:17.280 | And this is really the surprising fact about scale.
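To make this recipe concrete, here is a minimal PyTorch sketch of steps 1 and 2 on a toy text corpus; the tiny LSTM stands in for the Transformer actually used at scale, and every name, size, and the toy sentence are illustrative assumptions rather than anything from the lecture. Step 3 would simply run this loss over vast amounts of data.

```python
import torch
import torch.nn as nn

# Step 1: convert the data into a sequence of integers (tokenization).
text = "the cat sat on the mat"
vocab = {w: i for i, w in enumerate(sorted(set(text.split())))}
ids = torch.tensor([[vocab[w] for w in text.split()]])          # shape (1, seq_len)

# Step 2: a loss that maximizes data likelihood -- here, next-token cross-entropy.
embed = nn.Embedding(len(vocab), 32)
encoder = nn.LSTM(32, 32, batch_first=True)   # stand-in; GPT-style models use a Transformer
head = nn.Linear(32, len(vocab))

hidden, _ = encoder(embed(ids[:, :-1]))       # predict token t+1 from tokens up to t
logits = head(hidden)
loss = nn.functional.cross_entropy(logits.reshape(-1, len(vocab)), ids[:, 1:].reshape(-1))

# Step 3: train on lots and lots of data (one illustrative backward step shown).
loss.backward()
```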
00:03:20.660 | So to give some examples of this recipe in action, here's GPT-3, which can learn to do
00:03:26.620 | a really non-trivial classification problem with just two demonstrations.
00:03:31.160 | And we'll talk more about this soon.
00:03:34.660 | Another example, as we saw in lecture 14, is T5, which does really effective closed-book
00:03:39.300 | QA by storing knowledge in its parameters.
00:03:43.380 | Finally, just so I cover another modality, here's a recent text-to-image generation model
00:03:51.460 | with really impressive zero-shot generalization.
00:03:54.660 | OK, so now let's talk about GPT-3.
00:03:59.700 | So how big really are these models?
00:04:02.020 | This table presents some numbers to put things in perspective.
00:04:07.660 | So we have a collection of models starting with medium-sized LSTMs, which was a staple
00:04:12.380 | in pre-2016 NLP, all the way to humans who have 100 trillion synapses.
00:04:18.580 | And somewhere in the middle, we have GPT-2 with over a billion parameters, and GPT-3 with
00:04:23.020 | 175 billion parameters.
00:04:26.580 | And this exceeds the number of synaptic connections in a honeybee brain.
00:04:31.860 | So obviously, anyone with a little knowledge of neuroscience knows that this is not
00:04:36.180 | an apples-to-apples comparison.
00:04:40.220 | But the point here is that the scale of these models is really starting to reach astronomical
00:04:44.020 | numbers.
00:04:45.020 | So here are some facts about GPT-3.
00:04:48.500 | For one, it's a large transformer with 96 layers.
00:04:53.980 | It has more or less the same architecture as GPT-2, with the exception that to scale
00:04:59.420 | up attention computation, it uses these locally-banded sparse attention patterns.
00:05:04.060 | And I really encourage you to look at the paper to understand the details.
00:05:07.340 | The reason we mention this here is because it kind of highlights that scaling up is simply
00:05:11.100 | not just changing hyperparameters, as many might believe.
00:05:14.220 | And it involves really non-trivial engineering and algorithms to make computations efficient.
00:05:19.420 | Finally, all of this is trained on 500 billion tokens taken from the Common Crawl, the Toronto
00:05:26.140 | Books Corpus, and Wikipedia.
00:05:30.180 | So what's new about GPT-3?
00:05:32.140 | So let's look at some of the results on the paper first.
00:05:35.140 | So obviously, it does better on language modeling and text completion problems.
00:05:39.420 | As you can see from this table, it does better than GPT-2 at language modeling on the Penn
00:05:44.220 | Treebank, as well as better on the story completion data set called LAMBADA.
00:05:50.340 | To give a flavor of what's to come, let's take a closer look at this LAMBADA story completion
00:05:54.940 | data set.
00:05:56.200 | So the task here is that we're given a short story, and we are supposed to fill in the
00:05:59.660 | last word.
00:06:00.660 | Satisfying the constraints of the problem can be hard for a language model, which could
00:06:06.700 | generate a multi-word completion.
00:06:08.620 | But with GPT-3, the really new thing is that we can just give a few examples as prompts
00:06:13.260 | and sort of communicate a task specification to the model.
00:06:15.660 | And now, GPT-3 knows that the completion must be a single word.
00:06:18.780 | This is a very, very powerful paradigm.
00:06:21.300 | And we give some more examples of this in-context learning in a couple more slides.
00:06:27.580 | So apart from language modeling, it's really good at these knowledge-intensive tasks, like
00:06:32.900 | closed-book QA, as well as reading comprehension.
00:06:36.340 | And here, we observe that scaling up parameters results in a massive improvement in performance.
00:06:42.440 | So now let's talk about in-context learning.
00:06:44.780 | GPT-3 demonstrates some level of fast adaptation to completely new tasks.
00:06:50.380 | This happens via what's called in-context learning.
00:06:53.020 | As shown in the figure, the model training can be characterized as having an outer loop
00:06:58.300 | that learns a set of parameters that makes the learning of the inner loop as efficient
00:07:03.460 | as possible.
00:07:04.780 | And with this sort of framework in mind, we can really see how a good language model can
00:07:09.100 | also serve as a good few-shot learner.
00:07:14.020 | So in this segment, we will have some fun with GPT-3 and look at some demonstrations
00:07:18.620 | of this in-context learning.
00:07:22.540 | So to start off, here is an example where someone's trying to create an application
00:07:26.860 | that converts a language description to bash one-liners.
00:07:32.380 | The first three examples are prompts, followed by generated examples from GPT-3.
00:07:38.020 | So it gets a list of running processes.
00:07:40.860 | This one's easy.
00:07:41.860 | It probably just involves looking at your hash table.
00:07:44.100 | Some of the more challenging ones involve copying over some spans from the text, like
00:07:51.060 | the scp example, which is kind of interesting, as well as the harder-to-parse grep one.
00:07:56.300 | The scp example comes up a lot during office hours, so GPT-3 knows how to do that.
00:08:03.980 | Here's a somewhat more challenging one, where the model is given a description of a database
00:08:07.580 | in natural language, and it starts to emulate that behavior.
00:08:12.580 | So the text in bold is sort of the prompt given to the model.
00:08:16.380 | The prompt includes somewhat of a function specification of what a database is.
00:08:22.980 | So it says that the database begins knowing nothing.
00:08:25.660 | The database knows everything that's added to it.
00:08:27.940 | The database does not know anything else.
00:08:30.340 | And when you ask a question to the database, if the answer is there in the database, the
00:08:33.980 | database must return the answer.
00:08:34.980 | Otherwise, it should say it does not know the answer.
00:08:38.380 | So this is very new and very powerful.
00:08:42.580 | And the prompt also includes some example usages.
00:08:45.540 | So when you ask 2+2, the database does not know.
00:08:48.180 | When you ask the capital of France, the database does not know.
00:08:51.300 | And then you add in a fact that Tom is 20 years old to the database.
00:08:55.340 | And now you can start asking it questions like, where does Tom live?
00:08:59.300 | And as expected, it says that the database does not know.
00:09:02.900 | But now if you ask it, what's Tom's age?
00:09:05.900 | The database says that Tom is 20 years old.
00:09:08.420 | And if you ask, what's my age?
00:09:09.860 | The database says basically that it does not know, because that's not been added.
00:09:13.580 | So this is really powerful.
00:09:16.180 | Here's another one.
00:09:18.140 | Now in this example, the model is asked to blend concepts together.
00:09:22.380 | And so there's a definition of what does it mean to blend concepts.
00:09:25.820 | So if you take airplane and car, you can blend that to give flying car.
00:09:31.140 | That's essentially like there's a Wikipedia definition of what concept blending is, along
00:09:36.180 | with some examples.
00:09:38.340 | Now let's look at some prompts followed by what GPT-3 answers.
00:09:43.780 | So the first one is straightforward, two-dimensional space blended with 3D space gives 2.5-dimensional
00:09:49.780 | space.
00:09:51.140 | The one that is somewhat interesting is old and new gives recycled.
00:09:57.820 | Then a triangle and square gives trapezoid.
00:09:59.940 | That's also interesting.
00:10:02.020 | The one that's really non-trivial is geology plus neurology.
00:10:06.180 | It's just sediment neurology, and I had no idea what this was.
00:10:09.700 | It's apparently correct.
00:10:11.780 | So clearly, it's able to do these very flexible things just from a prompt.
00:10:18.620 | So here's another class of examples that GPT-3 gets somewhat right.
00:10:25.620 | And these are these copycat analogy problems, which have been really well studied in cognitive
00:10:30.500 | science.
00:10:32.180 | And the way it works is that I'm going to give you some examples and then ask you to
00:10:37.860 | induce a function from these examples and apply it to new queries.
00:10:41.860 | So if ABC changes to ABD, what does PQR change to?
00:10:45.300 | Well, PQR must change to PQS, because the function we've learned is that the last letter
00:10:49.480 | must be incremented by 1.
00:10:52.220 | And this function, humans can now apply to examples of varying types.
00:10:57.220 | So like P repeated twice, Q repeated twice, R repeated twice must change to P repeated
00:11:02.180 | twice, Q repeated twice, and S repeated twice.
00:11:06.060 | And it seems like GPT-3 is able to get them right, more or less.
00:11:10.980 | But the problem is that if you ask it to generalize to examples that have increasing number of
00:11:19.380 | repetitions than were seen in the prompt, it's not able to do that.
00:11:23.180 | So in this situation, you ask it to make an analogy where the letters are repeated four
00:11:31.340 | times, and it's never seen that before and doesn't know what to do.
00:11:34.700 | And so it gets all of these wrong.
00:11:36.740 | So there's a point to be made here that maybe these prompts are not enough to convey
00:11:44.020 | the function the model should be learning, and maybe with even more examples it could learn it.
00:11:48.380 | But it probably does not have the same kinds of generalization that humans have.
00:11:55.620 | And that brings us to the limitations of these models and some open questions.
00:12:01.180 | So just looking at the paper and passing through the results, it seems like the model is bad
00:12:06.500 | at logical and mathematical reasoning, anything that involves doing multiple steps of reasoning.
00:12:14.020 | And that explains why it's bad at arithmetic, why it's bad at word problems, why it's
00:12:18.300 | not great at analogy making, and even traditional textual entailment data sets that seem to
00:12:23.580 | require logical reasoning like RTE.
00:12:27.780 | The second, more subtle point is that it's unclear how we can make permanent updates to the model.
00:12:33.540 | Maybe if I want to teach the model a new concept, it's possible to do that while I'm interacting
00:12:38.580 | with the system.
00:12:39.580 | But once the interaction is over, it restarts and does not retain that knowledge.
00:12:44.500 | And it's not that this is something that the model cannot do in principle, but just something
00:12:48.540 | that's not really been explored.
00:12:52.020 | It doesn't seem to exhibit human-like generalization, which is often called systematicity.
00:12:56.460 | And I'll talk a lot more about that.
00:12:58.780 | And finally, language is situated.
00:13:00.700 | And GPT-3 is just learning from text.
00:13:03.140 | And there's no exposure to other modalities.
00:13:04.620 | There's no interaction.
00:13:06.080 | So maybe the aspects of meaning that it acquires are somewhat limited.
00:13:09.820 | And maybe we should explore how we can bring in other modalities.
00:13:13.780 | So we'll talk a lot more about these last two limitations in the rest of the lecture.
00:13:20.540 | But maybe I can field some questions now if there are any.
00:13:34.140 | I don't think there's a big outstanding question.
00:13:37.620 | But I mean, I think some people aren't really clear on few-shot setting and prompting versus
00:13:45.900 | learning.
00:13:46.900 | And I think it might actually be good to explain that a bit more.
00:13:51.180 | Yeah.
00:13:52.180 | So maybe let's-- let me pick a simple example.
00:14:02.540 | Let me pick this example here.
00:14:04.660 | So prompting just means that-- so GPT-3, if you go back to first principles, GPT-3 is
00:14:11.060 | basically just a language model.
00:14:13.100 | And what that means is given a context, it'll tell you what's the probability of the next
00:14:19.500 | word.
00:14:20.700 | So if I give it a context, w1 through wk, GPT-3 will tell me what's the probability
00:14:27.980 | of wk plus 1 for [INAUDIBLE] the vocabulary.
00:14:33.420 | So that's what a language model is.
00:14:35.820 | A prompt is essentially a context that gets prepended before GPT-3 can start generating.
00:14:43.060 | And what's happening with in-context learning is that the context that you append-- that
00:14:48.820 | you prepend to GPT-3 are basically xy examples.
00:14:54.900 | So that's the prompt.
00:14:57.220 | And the reason why it's also-- it's equivalent to few-shot learning is because you prepend
00:15:03.340 | a small number of xy examples.
00:15:05.660 | So in this case, if I just prepend this one example that's highlighted in purple, then
00:15:10.900 | that's essentially one-shot learning because I just give it a single example as context.
00:15:16.820 | And now, given this query, which is also appended to the model, it has to make a prediction.
00:15:25.940 | So the input-output format is the same as how a few-shot learner would receive.
00:15:31.820 | But since it's a language model, the training data set is essentially presented as a context.
00:15:40.220 | So someone is still asking, can you be more specific about the in-context learning setups?
00:15:48.900 | What is the task?
00:15:50.500 | Right.
00:15:51.500 | So let's see.
00:15:54.620 | Maybe I can go to-- yeah, so maybe I can go to this slide.
00:16:03.900 | So the task is just that it's a language model.
00:16:08.620 | So it gets a context, which is just a sequence of tokens.
00:16:13.580 | And the task is just to-- so you have a sequence of tokens.
00:16:18.760 | And then the model has to generate given a sequence of tokens.
00:16:23.320 | And the way you can convert that into an actual machine learning classification problem is
00:16:27.940 | that-- so for this example, maybe you give it 5 plus 8 equals 13, 7 plus 2 equals 9,
00:16:35.540 | and then 1 plus 0 equals.
00:16:37.780 | And now, GPT-3 can fill in a number there.
00:16:41.700 | So that's how you convert it into a classification problem.
00:16:44.860 | The context here would be these two examples of arithmetic, like 5 plus 8 equals 13 and
00:16:51.020 | 7 plus 2 equals 9.
00:16:52.620 | And then the query is 1 plus 0 equals.
00:16:55.420 | And then the model, since it's just a language model, has to fill in 1 plus 0 equals question
00:16:59.780 | mark.
00:17:00.780 | So it fills in something there.
00:17:01.780 | It doesn't have to fill in numbers.
00:17:02.780 | It could fill in anything.
00:17:05.020 | But if it fills in a 1, it does the right job.
00:17:09.640 | So that's how you can take a language model and do few-shot learning with it.
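To make the mechanics concrete, here is a minimal sketch of that arithmetic example using the Hugging Face transformers library. GPT-3 itself sits behind an API, so GPT-2 is used here purely to show how the few-shot "training set" is just text prepended to the query with no parameter updates; the checkpoint and decoding settings are assumptions, and a small model like this may well get the arithmetic wrong.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# The few-shot "training set" is just prepended text; no gradient steps are taken.
prompt = "5 + 8 = 13\n7 + 2 = 9\n1 + 0 ="

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=2, do_sample=False)

# Decode only the continuation, i.e., whatever the model fills in after "1 + 0 =".
print(tokenizer.decode(output_ids[0, inputs["input_ids"].shape[1]:]))
```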
00:17:14.340 | I'll keep on these questions.
00:17:16.780 | How is in-context learning different from transfer learning?
00:17:21.140 | So I guess in-context learning-- I mean, you can think of in-context learning as being
00:17:31.380 | kind of transfer learning.
00:17:33.220 | But transfer learning does not specify the mechanism through which the transfer is going
00:17:37.780 | to happen.
00:17:39.140 | With in-context learning, the mechanism is that the training examples are sort of appended
00:17:45.620 | to the model, which is a language model, just in order.
00:17:51.900 | So let's say you have x1, y1, x2, y2.
00:17:55.900 | And these are just appended directly to the model.
00:17:58.600 | And now it makes prediction on some queries that are drawn from this data set.
00:18:05.220 | So yes, it is a subcategory of transfer learning.
00:18:09.100 | But transfer learning does not specify exactly how this transfer learning is achieved.
00:18:14.180 | But in-context learning is very specific and says that for language models, you can essentially
00:18:19.680 | concatenate the training data set and then present that to the language model.
00:18:25.780 | People still aren't sufficiently clear on what is or isn't happening with learning and
00:18:32.620 | prompting.
00:18:33.980 | So another question is, so in-context learning still needs fine-tuning, question mark?
00:18:39.420 | We need to train GPT-3 to do in-context learning?
00:18:43.500 | Question mark.
00:18:44.500 | Right.
00:18:45.500 | So there are two parts to this question.
00:18:51.500 | So the answer is yes and no.
00:18:53.060 | So of course, the model is a language model.
00:18:56.460 | So it needs to be trained.
00:18:57.460 | So you start with some random parameters.
00:19:00.180 | And you need to train them.
00:19:02.180 | But the model is trained as a language model.
00:19:05.780 | And once the model is trained, you can now use it to do transfer learning.
00:19:11.460 | And the model parameters in-context learning are fixed.
00:19:14.660 | You do not update the model parameters.
00:19:17.740 | All you do is give this small training set to the model, which is just appended
00:19:24.140 | to the model as context.
00:19:26.060 | And now the model can start generating from that point on.
00:19:29.740 | So in this example, 5 plus 8 equals 13 and 7 plus 2 equals 9 are two x-y examples.
00:19:38.540 | In vanilla transfer learning, what you would do is that you would take some gradient steps,
00:19:43.100 | update your model parameters, and then make a prediction on 1 plus 0 equals what.
00:19:47.620 | But in context learning, all you're doing is you just concatenate 5 plus 8 equals 13
00:19:54.420 | and 7 plus 2 equals 9 to the model's context window, and then make it predict what 1 plus
00:20:01.260 | 0 should be equal to.
00:20:04.660 | Maybe we should end for now with one other bigger picture question, which is, do you
00:20:13.740 | know of any research combining these models with reinforcement learning for the more complicated
00:20:18.540 | reasoning tasks?
00:20:20.440 | So that is an excellent question.
00:20:22.700 | There is some recent work on kind of trying to align language models with human preferences,
00:20:30.340 | where yes, there is some amount of fine tuning with reinforcement learning based on these
00:20:37.180 | preferences from humans.
00:20:38.820 | So maybe you want to do a summarization problem in GPT-3.
00:20:42.780 | The model produces multiple summaries.
00:20:45.300 | And for each summary, maybe you have a reward that is essentially a human preference.
00:20:50.260 | Maybe I want to include some facts, and I don't want to include some other non-important
00:20:55.060 | facts.
00:20:56.060 | So I can construct a reward out of that, and I can fine tune the parameters of my language
00:21:00.820 | model basically using reinforcement learning based on this reward, which is essentially
00:21:07.260 | human preferences.
00:21:08.260 | So there's some very recent work that tries to do this.
00:21:11.180 | But I'm not sure-- yeah, I'm not aware of any work that tries to use reinforcement learning
00:21:15.860 | to teach reasoning to these models.
00:21:17.940 | But I think it's an interesting future direction to explore.
00:21:29.100 | Maybe you should go on at this point.
00:21:34.100 | OK, so we'll talk a bit more about these last two points, so systematicity and language
00:21:44.660 | grounding.
00:21:48.220 | So just to start off, how do you define systematicity?
00:21:51.820 | So really, the definition is that there is a definite and predictable pattern among the
00:21:56.540 | sentences that native speakers of a language understand.
00:22:00.540 | And so there's a systematic pattern among the sentences that we understand.
00:22:04.500 | What that means is, let's say there's a sentence like, John loves Mary.
00:22:09.020 | And if a native speaker understands the sentence, then they should also be able to understand
00:22:12.820 | the sentence, Mary loves John.
00:22:17.020 | And closely related to this idea of systematicity is the principle of compositionality.
00:22:21.780 | And for now, I'm going to ignore the definition by Montague and just look at the rough definition.
00:22:26.660 | And then we can come back to this other more concrete definition.
00:22:30.620 | The rough definition is essentially that the meaning of an expression is a function of
00:22:34.500 | the meaning of its parts.
00:22:37.860 | So that brings us to the question, are human languages really compositional?
00:22:42.340 | And here are some examples that make us think that maybe, yes.
00:22:47.900 | So if you look at what is the meaning of the noun phrase brown cow, so it is composed of
00:22:53.180 | the meaning of the adjective brown and the noun cow.
00:22:58.660 | So all things that are brown and all things that are cow take the intersection and get
00:23:02.300 | brown cow.
00:23:03.300 | Similarly, red rabbits, so all things that are red, all things that are rabbit, combine
00:23:06.380 | them and get red rabbits.
00:23:07.820 | And then kick the ball, this verb phrase can be understood as you have some agent that's
00:23:12.780 | performing a kicking operation on the ball.
00:23:16.180 | But this is not always the case that you can get the meaning of the whole by combining
00:23:22.700 | meanings of parts.
00:23:23.700 | So here, we have some counter examples that people often use.
00:23:26.700 | So red herring does not mean all things that are red and all things that are herring.
00:23:31.260 | And kick the bucket definitely does not mean that there's an agent that's kicking the bucket.
00:23:35.700 | So while these examples are supposed to be provocative, we think that language is mostly
00:23:41.900 | compositional.
00:23:42.900 | There's lots of exceptions, but for a vast majority of sentences that we've never heard
00:23:47.540 | before, we're able to understand what they mean by piecing together the words that the
00:23:52.060 | sentence is composed of.
00:23:54.100 | And so what that means is that maybe compositionality of representations is a helpful prior that
00:23:58.540 | could lead to systematicity in behavior.
00:24:02.740 | And that brings us to the questions that we ask in the segment, are neural representations
00:24:06.700 | compositional?
00:24:08.100 | And the second question is, if so, do they generalize systematically?
00:24:12.420 | So how do you even measure if representations that a neural network learns exhibit compositionality?
00:24:23.500 | So let's go back to this definition from Montague, which says that compositionality is about
00:24:29.780 | the existence of a homomorphism from syntax to semantics.
00:24:34.700 | And to look at that, we have this example, which is Lisa does not skateboard.
00:24:40.660 | And we have a syntax tree corresponding to this example.
00:24:44.700 | And the meaning of the sentence can be composed according to the structure that's decided
00:24:51.900 | by the syntax.
00:24:52.900 | So the meaning of Lisa does not skateboard is a function of the meaning of Lisa and does
00:24:57.740 | not skateboard.
00:24:58.740 | The meaning of does not skateboard is a function of does and not skateboard.
00:25:01.820 | The meaning of not skateboard is a function of not and skateboard.
00:25:05.740 | So that's good.
00:25:06.820 | And so this gives us one way of formalizing how we can measure compositionality in neural
00:25:13.260 | representations.
00:25:14.260 | And so compositionality of representations could be thought of as how well the representation
00:25:19.700 | approximates an explicitly homomorphic function in a learned representation space.
00:25:26.340 | So what we're going to do is essentially measure if we were to construct a neural network whose
00:25:32.340 | computations are based exactly according to these parse trees, how far are the representations
00:25:37.660 | of a learned model from this explicitly compositional representation?
00:25:44.020 | And that'll give us some understanding of how compositional the neural networks representations
00:25:48.420 | should be.
00:25:50.500 | So to unpack that a little bit, instead of having denotations,
00:25:58.100 | we have representations at the nodes.
00:26:03.700 | And to be more concrete about that, we first start by choosing a distance function that
00:26:09.780 | tells us how far away two representations are.
00:26:12.660 | And then we also need a way to compose together two constituents to give us the meaning of
00:26:20.180 | the whole.
00:26:21.740 | But once we have that, we can start by-- we can create an explicitly compositional function,
00:26:27.460 | right?
00:26:28.460 | So what we do is we have these representations at the leaves that are initialized randomly
00:26:37.660 | and the composition function that's also initialized randomly.
00:26:40.660 | And then a forward pass according to this syntax is used to compute the representation
00:26:45.980 | of Lisa does not skateboard.
00:26:48.060 | And now once you have this representation, you can create a loss function.
00:26:51.700 | And this loss function measures how far are the representations of my neural network from
00:26:57.500 | this second proxy neural network that I've created.
00:27:02.180 | And then I can basically optimize both the composition function and the embeddings of
00:27:08.660 | the leaves.
00:27:10.220 | And then once the optimization is finished, I can measure how far was the representation
00:27:15.860 | of my neural net from this explicitly compositional network on a held-out set.
00:27:22.380 | And that then tells me whether the representations my neural net learned were actually compositional
00:27:26.540 | or not.
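Here is a toy sketch of that procedure (in the spirit of the tree reconstruction error, not the paper's actual code): randomly initialized leaf vectors and a learned composition function are fit so that composing bottom-up along the parse tree matches the trained model's representation, and the leftover error on held-out sentences is the compositionality measure. The vector size, the cosine distance, and the random target standing in for the trained model's sentence representation are all assumptions made just for illustration.

```python
import torch
import torch.nn as nn

dim = 64
vocab = {"Lisa": 0, "does": 1, "not": 2, "skateboard": 3}
leaf_emb = nn.Parameter(torch.randn(len(vocab), dim))            # learned leaf "meanings"
compose = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh())      # learned composition function

def tree_repr(tree):
    """Compose a representation bottom-up along a (nested-tuple) parse tree."""
    if isinstance(tree, str):
        return leaf_emb[vocab[tree]]
    left, right = tree
    return compose(torch.cat([tree_repr(left), tree_repr(right)]))

# Parse of "Lisa does not skateboard".
tree = ("Lisa", ("does", ("not", "skateboard")))
target = torch.randn(dim)   # stand-in for the trained model's representation of the sentence

opt = torch.optim.Adam([leaf_emb] + list(compose.parameters()), lr=1e-2)
for _ in range(200):
    loss = 1 - torch.cosine_similarity(tree_repr(tree), target, dim=0)   # the distance d(., .)
    opt.zero_grad(); loss.backward(); opt.step()

# After fitting on training sentences, the same distance measured on held-out sentences is the
# tree reconstruction error: the lower it is, the more compositional the representations.
print(loss.item())
```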
00:27:28.780 | So to see how well this works, let's look at a plot.
00:27:33.620 | And this is relatively complex.
00:27:36.860 | But just to unpack this a little bit, it plots the mutual information between the input that
00:27:45.840 | the neural network receives and its representation, against the tree reconstruction error that
00:27:53.460 | we were talking about.
00:27:55.740 | And to give some more background about what's to come, there is a theory which is called
00:28:02.420 | the information bottleneck theory, which says that as a neural network trains, it first
00:28:09.980 | tries to maximize the mutual information between the representation and the input in an attempt
00:28:15.820 | to memorize the entire data set.
00:28:18.820 | And that is a memorization phase.
00:28:22.840 | And then once memorization is done, there is a learning or a compression phase where
00:28:28.020 | this mutual information starts to decrease.
00:28:31.060 | And the model is essentially trying to compress the data or consolidate the knowledge in the
00:28:35.540 | data into its parameters.
00:28:38.140 | And what we are seeing here is that as a model learns, which is characterized by decreasing
00:28:43.380 | mutual information, we see that the representations themselves are becoming more and more compositional.
00:28:50.620 | And overall, we observe that learning is correlated with increased compositionality as measured
00:28:55.780 | by the tree reconstruction error.
00:28:58.020 | So that's really encouraging.
00:29:01.900 | So now that we have a method of measuring compositionality of representations in these
00:29:08.060 | neural nets, how do we start to create benchmarks that see if they are generalizing systematically
00:29:15.620 | or not?
00:29:18.000 | So to do that, here is a method for taking any data set and splitting it into a train
00:29:23.580 | test split that explicitly tests for this kind of generalization.
00:29:31.820 | So to do that, we use this principle called maximizing the compound divergence.
00:29:38.140 | And to illustrate how this principle works, we look at this toy example.
00:29:43.700 | So in this toy example, we have a training data set that consists of just two examples
00:29:47.980 | and test data set of just two examples.
00:29:51.780 | The atoms are defined as the primitive elements, so entity words, predicates, question types.
00:29:58.940 | So in this toy example, Goldfinger, Christopher Nolan, these are all the primitive elements.
00:30:05.500 | And the compounds are compositions of these primitive elements.
00:30:08.460 | So "who directed [entity]" would be the composition of the question type,
00:30:13.140 | "did x [predicate] y?",
00:30:14.780 | and the predicate "direct".
00:30:19.140 | So here's a basic machinery for producing compositionally challenging splits.
00:30:23.600 | So let's start by introducing two distributions.
00:30:27.700 | The first distribution is the normalized frequency distribution of the atoms.
00:30:33.020 | So given any data set, if we know what the notion of atoms are, we can basically compute
00:30:38.740 | the frequency of all of the atoms and then normalize that by the total count.
00:30:43.460 | And that's going to give us one distribution.
00:30:47.260 | And we can repeat the same thing for the compounds.
00:30:49.900 | And that will give us a second frequency distribution.
00:30:53.720 | So note that these are just two probability distributions.
00:30:57.860 | And once we have these two distributions, we can essentially define the atom and compound
00:31:04.060 | divergence simply as this quantity here.
00:31:08.860 | And here C is the Chernoff coefficient between two categorical distributions.
00:31:15.060 | The Chernoff coefficient basically measures how far two categorical distributions are.
00:31:20.820 | So just to get a bit more intuition about this, if we set p to q, then the Chernoff
00:31:26.440 | coefficient is 1, which means these representations are maximally similar.
00:31:32.440 | And then if p is non-zero everywhere q is 0, or if p is 0 in all the places where q
00:31:40.560 | is 0, then the Chernoff coefficient is exactly 0, which means that these two distributions
00:31:46.920 | are maximally far away.
00:31:49.240 | And the overall goal by describing this objective is that-- this loss objective is just that
00:31:58.080 | we are going to maximize the compound divergence and minimize the atom divergence.
00:32:03.780 | And so what is the intuition behind doing such a thing?
00:32:06.500 | So what we want is to ensure that the unigram distribution, in some sense, is constant between
00:32:12.380 | the train and test split so that the model does not encounter any new words.
00:32:18.820 | But we want the compound divergence to be very high, which means that these same words
00:32:24.660 | that the model has seen many times must appear in new combinations, which means that we are
00:32:29.420 | testing for systematicity.
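As a small sketch of the machinery just described, here is the Chernoff coefficient C_alpha(P, Q) = sum_k p_k^alpha * q_k^(1 - alpha) and the resulting divergence 1 - C_alpha, computed over toy atoms and compounds. The particular alpha values (0.5 for atoms, 0.1 for compounds) are the ones I recall from the MCD paper, and the toy items are made up, so treat both as assumptions.

```python
from collections import Counter

def freq_dist(items):
    """Normalized frequency distribution of atoms or compounds."""
    counts = Counter(items)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def chernoff(p, q, alpha):
    """sum_k p_k^alpha * q_k^(1-alpha): 1 when p == q, 0 when the supports are disjoint."""
    keys = set(p) | set(q)
    return sum(p.get(k, 0.0) ** alpha * q.get(k, 0.0) ** (1 - alpha) for k in keys)

def divergence(train_items, test_items, alpha):
    return 1.0 - chernoff(freq_dist(train_items), freq_dist(test_items), alpha)

# Toy usage: atoms are primitive elements, compounds are (question type, predicate) pairs.
train_atoms = ["who", "directed", "Goldfinger", "Christopher Nolan", "produce"]
test_atoms = ["who", "directed", "Goldfinger", "Christopher Nolan", "produce"]
train_compounds = [("who [verb]ed [entity]", "direct"), ("did [entity] [verb] [entity]", "produce")]
test_compounds = [("who [verb]ed [entity]", "produce"), ("did [entity] [verb] [entity]", "direct")]

print("atom divergence:", divergence(train_atoms, test_atoms, alpha=0.5))              # want small
print("compound divergence:", divergence(train_compounds, test_compounds, alpha=0.1))  # want large
```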
00:32:33.380 | And so if you follow this procedure for a semantic parsing data set, let's say, what
00:32:40.060 | we see is that as you increase the scale, we see that this model just does better and
00:32:46.540 | better at compositional generalization.
00:32:50.020 | But just pulling out a quote from this paper, "pre-training helps for compositional generalization
00:32:55.380 | but doesn't fully solve it."
00:32:57.180 | And what that means is that maybe as you keep scaling up these models, you'll see better
00:33:00.820 | and better performance, or maybe it starts to saturate at some point.
00:33:06.940 | In any case, we should probably be thinking more about this problem instead of just trying
00:33:10.700 | to brute force it.
00:33:14.260 | So now this segment tells us that the way we split a data set, we can measure different
00:33:24.020 | behaviors of the model.
00:33:25.020 | And that tells us that maybe we should be thinking more critically about how we're evaluating
00:33:29.060 | models in NLP in general.
00:33:31.600 | So there has been a revolution basically over the last few years in the field where we're
00:33:37.260 | seeing all of these large transform models beat all of our benchmarks.
00:33:40.500 | At the same time, there is still not complete confidence that once we deploy these systems
00:33:46.460 | in the real world, they're going to maintain their performance.
00:33:51.940 | And so it's unclear if these gains are coming from spurious correlations or some real task
00:33:55.780 | understanding.
00:33:56.780 | And so how do we design benchmarks that accurately tell us how well this model is going to do
00:34:02.100 | in the real world?
00:34:03.100 | And so I'm going to give one example of works that try to do this.
00:34:08.020 | And that's the idea of dynamic benchmarks.
00:34:11.180 | And the idea of dynamic benchmarks is basically saying that instead of testing our models
00:34:17.140 | on static test sets, we should be evaluating them on an ever-changing dynamic benchmark.
00:34:24.380 | And there's many recent examples of this.
00:34:27.420 | And the idea dates back to a 2017 workshop at EMNLP.
00:34:33.300 | And so the overall schematic looks something like this, that we start with a training data
00:34:38.060 | set and a test data set, which is the usual static setup.
00:34:42.380 | We train a model on that.
00:34:44.020 | And then once the model is trained, we deploy that and then have humans create new examples
00:34:49.500 | that the model fails to classify.
00:34:52.180 | And crucially, we're looking for examples
00:34:54.940 | the model does not get right, but humans have no issue figuring out the answer to.
00:34:59.860 | So by playing this game of whack-a-mole, where humans figure out what are the holes
00:35:06.540 | in the model's understanding, and then add that back into the training data, re-train
00:35:11.420 | the model, deploy it again, have humans create new examples, we can essentially construct
00:35:16.060 | this never-ending data set, this never-ending test set, which can hopefully be a better
00:35:24.420 | proxy of estimating real-world performance.
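Here is a schematic, self-contained sketch of that whack-a-mole loop. Everything in it (the memorizing "model", the stand-in "annotators", the toy examples) is a hypothetical placeholder for what a platform like Dynabench does with real models and crowdworkers, so read it as pseudo-code in Python syntax rather than a real pipeline.

```python
def train_model(train_set):
    """Stand-in for training: this 'model' just memorizes the labels it has seen."""
    return dict(train_set)

def humans_write_hard_examples(model, labeled_pool):
    """Stand-in for annotators: keep examples the model gets wrong but humans label easily."""
    return [(x, y) for x, y in labeled_pool if model.get(x) != y]

train_set = [("great movie", "pos"), ("terrible plot", "neg")]
pool = [("not bad at all", "pos"), ("great movie", "pos"), ("hardly a masterpiece", "neg")]

for round_idx in range(3):
    model = train_model(train_set)                           # (re)train on the current data
    new_examples = humans_write_hard_examples(model, pool)   # humans try to fool the model
    print(f"round {round_idx}: {len(new_examples)} adversarial examples collected")
    train_set += new_examples                                # fold them back in and repeat
```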
00:35:28.500 | So this is some really cutting-edge research.
00:35:33.700 | And one of the main challenges of this class of works is that it's unclear how much this
00:35:39.180 | can scale up, because maybe after multiple iterations of this whack-a-mole, humans are
00:35:47.060 | just fundamentally limited by creativity.
00:35:49.380 | So figuring out how to deal with that is really an open problem.
00:35:55.060 | And current approaches just use examples from other data sets to prompt humans to think
00:35:59.860 | more creatively.
00:36:01.360 | But maybe we can come up with better, more automated methods of doing this.
00:36:08.180 | So this brings us to the final segment.
00:36:11.940 | Or actually, let me stop for questions at this point and see if people have questions.
00:36:23.700 | Here's a question.
00:36:29.980 | With dynamic benchmark, doesn't this mean that the model creator will also need to continually
00:36:36.060 | test/evaluate the models on the new benchmarks, new data sets?
00:36:42.500 | Wait a second.
00:36:47.220 | Sorry.
00:36:48.740 | Yeah, so with dynamic benchmarks, yes, it's absolutely true that you will have to continuously
00:36:57.340 | keep training your model.
00:36:58.940 | And that's just to ensure that the reason your model is not doing well on the test set
00:37:06.340 | doesn't have to do with this domain mismatch.
00:37:09.820 | And what we're really trying to do is measure how-- just come up with a better estimate
00:37:17.180 | of the model's performance in the overall task and just trying to get more and more
00:37:22.420 | data.
00:37:23.420 | So yes, to answer your question, yes, we need to keep training the model again and again.
00:37:27.380 | But this can be automated.
00:37:29.740 | So I'll move on to language grounding.
00:37:35.860 | So in this final segment, I'll talk about how we can move beyond just training models
00:37:42.500 | on text alone.
00:37:45.600 | So many have articulated the need to use modalities other than text if we someday want to get
00:37:52.380 | at real language understanding.
00:37:55.380 | And ever since we've had these big language models, there has been a rekindling of this
00:38:02.420 | debate.
00:38:03.420 | And recently, there was multiple papers on this.
00:38:06.300 | And so at ACL last year, there was this paper that argues through multiple thought experiments
00:38:11.900 | that it's actually impossible to acquire meaning from form alone, where meaning refers to the
00:38:17.700 | communicative intent of a speaker, and form refers to text or speech signals.
00:38:24.300 | A more modern version of this was put forward by the second paper, where they say that training
00:38:31.180 | on only web-scale data limits the world scope of models and limits the aspects of meanings
00:38:37.640 | that the model can actually acquire.
00:38:41.360 | And so here is a diagram that I borrowed from the paper.
00:38:44.640 | And what they say is the era where we were training models on supervised data sets, models
00:38:51.640 | were limited in world scope one.
00:38:54.020 | And now that we've moved on to exploiting unlabeled data, we're now in world scope two,
00:38:59.720 | where models just have strictly more signal to get more aspects of meaning in.
00:39:05.400 | If you mix in additional modalities into this-- so maybe you mix in videos, and maybe you
00:39:10.000 | mix in images-- then that expands out the world scope of the model further.
00:39:15.960 | And now maybe it can acquire more aspects of meaning, such that now it knows that the
00:39:21.500 | lexical item red refers to red images.
00:39:27.160 | And then if you go beyond that, you can have a model that is embodied, and it's actually
00:39:32.320 | living in an environment where it can interact with its data, conduct interventions and experiments.
00:39:39.620 | And then if you go even beyond that, you can have models that live in a social world where
00:39:44.760 | they can interact with other models.
00:39:46.240 | Because after all, the purpose of language is to communicate.
00:39:49.480 | And so if you can have a social world where models can communicate with other models,
00:39:55.600 | that expands out aspects of meaning.
00:40:00.280 | And so GPT-3 is in world scope two.
00:40:04.280 | So there are a lot of open questions in this space.
00:40:07.600 | But given that there are all of these good arguments about how we need to move beyond
00:40:11.600 | text, what is the best way to do this at scale?
00:40:16.320 | We know that babies cannot learn language from watching TV alone, for example.
00:40:21.960 | So there has to be some interventions, and there has to be interactions with the environment
00:40:26.560 | that need to happen.
00:40:28.360 | But at the same time, the question is, how far can models go by just training on static
00:40:34.360 | data as long as we have additional modalities, especially when we combine this with scale?
00:40:40.880 | And if interactions with the environment are really necessary, how do we collect data and
00:40:45.400 | design systems that interact minimally or in a cost-effective way?
00:40:50.280 | And then finally, could pre-training on text still be useful if any of these other research
00:40:59.760 | directions become more sample efficient?
00:41:03.900 | So if you're interested in learning more about this topic, I highly encourage you to take
00:41:08.560 | CS224U, which is offered in the spring.
00:41:11.160 | They have multiple lectures on just language learning.
00:41:19.680 | So in this final segment, I'm going to talk a little bit more about how you can get involved
00:41:24.880 | with NLP and deep learning research and how you can make more progress.
00:41:32.240 | So here are some general principles for how to make progress in NLP research.
00:41:38.220 | So I think the most important thing is to just read broadly, which means not just read
00:41:42.680 | the latest and greatest papers on arXiv, but also read pre-2010 statistical NLP.
00:41:50.360 | Learn about the mathematical foundations of machine learning to understand how generalization
00:41:55.420 | works, so take CS229M.
00:41:58.840 | Learn more about language, which means taking classes in the linguistics department.
00:42:03.200 | In particular, I would recommend maybe this 138A.
00:42:06.680 | And also take CS224U.
00:42:09.680 | And finally, if you wanted inspiration from how babies learn, then definitely read about
00:42:16.200 | child language acquisition literature.
00:42:18.200 | It's fascinating.
00:42:19.200 | Finally, learn your software tools, which involves scripting tools, version control,
00:42:29.240 | data wrangling, learning how to visualize quickly with Jupyter Notebooks.
00:42:34.480 | And deep learning often involves running multiple experiments with different hyperparameters
00:42:39.720 | and different ideas all in parallel.
00:42:41.720 | And sometimes it can get really hard to keep track of everything.
00:42:44.440 | So learn how to use experiment management tools like Weights & Biases.
00:42:50.920 | And finally, I'll talk about some really quick final project tips.
00:42:57.400 | So first, let's just start by saying that if your approach doesn't seem to be working,
00:43:01.880 | please do not panic.
00:43:03.400 | Put assert statements everywhere and check if the computations that you're doing are
00:43:08.160 | correct.
00:43:09.160 | Use breakpoints extensively, and I'll talk a bit more about this.
00:43:13.200 | Check if the loss function that you've implemented is correct.
00:43:16.760 | And one way of debugging that is to see that the initial values are correct.
00:43:21.360 | So if you're doing a k-way classification problem, then the initial loss should be about the
00:43:24.840 | natural log of k.
00:43:25.840 | Always, always, always start by creating a small training data set which has like 5 to
00:43:31.520 | 10 examples and see if your model can completely overfit that.
00:43:35.000 | If not, there's a problem with your training loop.
00:43:38.720 | Check for saturating activations and dead values.
00:43:41.800 | And often, this can be fixed by-- maybe there's some problems with the gradients, or maybe
00:43:46.320 | there's some problems with the initialization.
00:43:48.520 | Which brings me to the next point.
00:43:50.160 | Check your gradient values.
00:43:51.160 | See if they're too small, which means that maybe you should be using residual connections
00:43:54.960 | or LSTMs.
00:43:55.960 | Or if they're too large, then you should use gradient clipping.
00:43:59.000 | In fact, always use gradient clipping.
00:44:01.520 | Overall, be methodical.
00:44:04.000 | If your approach doesn't work, come up with hypotheses for why this might be the case.
00:44:08.800 | Design Oracle experiments to debug it.
00:44:10.880 | Look at your data.
00:44:11.880 | Look at the errors that it's making.
00:44:14.320 | And just try to be systematic about everything.
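A couple of those checks, written out as a minimal, self-contained sketch (the tiny linear model and the random batch are placeholders for your real model and data): the initial k-way cross-entropy loss should be roughly ln(k), and the model should be able to drive the loss toward zero on a handful of examples, with gradient clipping in the loop.

```python
import math
import torch
import torch.nn as nn

k = 10                                         # number of classes
model = nn.Linear(32, k)                       # placeholder for your real model
x_batch = torch.randn(8, 32)                   # a tiny batch of 8 "examples"
y_batch = torch.randint(0, k, (8,))

# Check 1: at initialization, k-way cross-entropy should be close to ln(k).
init_loss = nn.functional.cross_entropy(model(x_batch), y_batch).item()
print(f"initial loss {init_loss:.3f}, expected about {math.log(k):.3f}")

# Check 2: overfit the tiny batch; if the loss won't go toward zero, the training loop is broken.
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(500):
    loss = nn.functional.cross_entropy(model(x_batch), y_batch)
    opt.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # always clip
    opt.step()
print(f"loss after overfitting the tiny batch: {loss.item():.4f}")
```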
00:44:17.760 | So I'll just say a little bit more about breakpoints.
00:44:23.320 | So there's this great library called PDB.
00:44:25.440 | It's like GDB, but it's for Python.
00:44:27.080 | So that's why PDB.
00:44:29.280 | To create a breakpoint, just add the line import pdb; pdb.set_trace() before the line you
00:44:36.000 | want to inspect.
00:44:37.360 | So earlier today, I was trying to play around with the Transformers library.
00:44:42.920 | So I was trying to do question answering.
00:44:45.480 | So I have a really small training corpus.
00:44:47.560 | And the context is, one morning, I shot an elephant in my pajamas.
00:44:52.040 | How he got into my pajamas, I don't know.
00:44:55.080 | And the question is, what did I shoot?
00:44:57.760 | And to solve this problem, I basically imported a tokenizer and a BERT model.
00:45:04.240 | And I initialized my tokenizer, initialized my model, tokenized my input.
00:45:08.320 | I set my model into the eval mode.
00:45:11.080 | And I tried to look at the output.
00:45:13.920 | But I get this error.
00:45:16.280 | And I'm very sad.
00:45:17.280 | It's not clear what's causing this error.
00:45:19.400 | And so the best way to look at what's causing this error is to actually put a breakpoint.
00:45:24.920 | So right after model.eval, I put a breakpoint.
00:45:27.960 | Because I know that that's where the problem is.
00:45:30.120 | So the problem is in line 21.
00:45:32.080 | So I put a breakpoint at line 21.
00:45:35.600 | And now once I put this breakpoint, I can just run my script again.
00:45:40.720 | And it stops before executing line 21.
00:45:43.320 | And at this point, I can examine all of my variables.
00:45:46.360 | So I can look at the token as input, because maybe that's where the problem is.
00:45:50.400 | And lo and behold, I see that it's actually a list.
00:45:59.160 | So it's a dictionary of lists, whereas the model expects torch tensors.
00:45:59.160 | So now I know what the problem is.
00:46:01.560 | And that means I can quickly go ahead and fix it.
00:46:04.320 | And everything just works.
00:46:06.280 | So this just shows that you should use breakpoints everywhere if your code is not working.
00:46:10.960 | And it can just help you debug really quickly.
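Here is a minimal sketch of that debugging session with the Hugging Face transformers library; the specific checkpoint name is an assumption (the lecture doesn't say which one was used), and the point is just the pdb breakpoint plus the return_tensors="pt" fix.

```python
import pdb
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

context = ("One morning I shot an elephant in my pajamas. "
           "How he got into my pajamas, I don't know.")
question = "What did I shoot?"

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")               # checkpoint assumed
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer(question, context)    # bug: this returns a dict of plain Python lists
pdb.set_trace()                          # pause here and inspect `inputs` to see that

inputs = tokenizer(question, context, return_tensors="pt")   # fix: ask for torch tensors
with torch.no_grad():
    outputs = model(**inputs)            # the forward pass now runs without the earlier error
print(outputs.start_logits.shape, outputs.end_logits.shape)
```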
00:46:16.160 | So finally, I'd say that if you want to get involved with NLP and deep learning research,
00:46:21.640 | and if you really like the final project, we have the CLIPS program at Stanford.
00:46:26.160 | And this is a way for undergrads, master's students, and PhD students who are interested in deep
00:46:31.520 | learning and NLP research to get involved with the NLP group.
00:46:36.480 | So we highly encourage you to apply to CLIPS.
00:46:40.560 | And so I'll conclude today's class by saying that we've made a lot of progress in the last
00:46:48.240 | decade.
00:46:49.240 | And that's mostly due to clever understanding of neural networks, data, hardware, all of
00:46:54.520 | that combined with scale.
00:46:55.520 | We have some really amazing technologies that can do really exciting things.
00:46:59.280 | And we saw some examples of that today.
00:47:03.200 | In the short term, I expect that we'll see more scaling because it just seems to help.
00:47:08.440 | So perhaps even larger models.
00:47:11.080 | But this is not trivial.
00:47:12.480 | So I said that before, and I'll just say it again.
00:47:16.320 | Scaling requires really non-trivial engineering efforts, and sometimes even clever algorithms.
00:47:21.740 | And so there's a lot of interesting systems work to be done here.
00:47:25.040 | But in the long term, we really need to be thinking more about these bigger problems
00:47:30.120 | of systematicity, generalization.
00:47:31.120 | How can we make our models learn a new concept really quickly so that it's fast adaptation?
00:47:39.040 | And then we also need to create benchmarks that we can actually trust.
00:47:42.560 | If my model has some performance on some sentiment analysis data set and is deployed in the real
00:47:48.360 | world, that should be reflected in the number that I get from the benchmark.
00:47:52.000 | So we need to make progress in the way we evaluate models.
00:47:56.080 | And then also figuring out a way to move beyond text in a more tractable way.
00:48:01.440 | This is also really essential.
00:48:03.880 | So yeah, that's it.
00:48:05.560 | Good luck with your final projects.
00:48:07.760 | I can take more questions at this point.
00:48:13.000 | So I answered a question earlier that actually I think you could also opine on.
00:48:19.960 | It was the question of whether you have a large model that's pre-trained on language,
00:48:24.040 | if it will actually help you in other domains, like you apply it to vision stuff.
00:48:29.800 | Yeah.
00:48:30.800 | Yeah.
00:48:31.800 | So I guess the answer is actually, yes.
00:48:37.000 | So there was a paper that came out really, really recently, like just two days ago, that
00:48:40.440 | it just takes-- I think it was GPT-2.
00:48:43.480 | I'm not sure.
00:48:45.240 | It's like one large transformer model that's pre-trained on text.
00:48:48.960 | And like other modalities, so they definitely apply to images.
00:48:53.600 | And I think they apply to math problems and some more modalities and show that it's actually
00:48:59.600 | really effective at transfers.
00:49:02.280 | So if you pre-train on text and then you move to a different modality, that helps.
00:49:05.600 | I think part of the reason for that is just that across modalities, there is a lot of
00:49:10.080 | autoregressive structure that is shared.
00:49:13.680 | And I think one reason for that is that language is really referring to the world around it.
00:49:19.800 | And so you might expect that there is some correspondence that's just beyond the autoregressive
00:49:27.360 | structure.
00:49:28.360 | So there's also works that show that if you have just text-only representations and image-only
00:49:34.080 | representations, you can actually learn a simple linear classifier that can learn to
00:49:38.320 | align both of these representations.
00:49:39.320 | And all of these works are just showing that there's actually a lot more common between
00:49:43.400 | modalities than we thought in the beginning.
00:49:47.120 | So yeah, I think it's possible to pre-train on text and then fine-tune on your modality
00:49:54.160 | of interest.
00:49:55.160 | And it should probably be effective, of course, based on what the modality is.
00:49:59.960 | But for images and videos, it's certainly effective.
00:50:03.920 | Any questions?
00:50:10.920 | A couple of questions have turned up.
00:50:24.640 | One is, what's the difference between CS224U and this class in terms of the topics covered
00:50:30.960 | and focus?
00:50:31.960 | Do you want to answer that one, Shikhar, or should I have a go at answering it?
00:50:36.280 | Maybe you should answer this one.
00:50:39.600 | So next quarter, CS224U, Natural Language Understanding, is co-taught by Chris Potts
00:50:48.440 | and Bill MacCartney.
00:50:50.840 | So in essence, it's meant to be different that natural language understanding focuses
00:51:00.360 | on what its name is, sort of how to build computer systems that understand the sentences
00:51:07.100 | of natural language.
00:51:08.960 | Now, in truth, the boundary is kind of complex because we do some natural language understanding
00:51:17.680 | in this class as well.
00:51:19.920 | And certainly for the people who are doing the default final project, question answering,
00:51:24.560 | well, that's absolutely a natural language understanding task.
00:51:29.260 | But the distinction is meant to be that at least a lot of what we do in this class, things
00:51:37.680 | like the assignment three dependency parser or building the machine translation system
00:51:45.960 | in assignment four, that they are in some sense natural language processing tasks where
00:51:53.600 | processing can mean anything but commonly means you're doing useful intelligent stuff
00:52:00.460 | with human language input, but you're not necessarily deeply understanding it.
00:52:06.820 | So there is some overlap in the classes.
00:52:10.300 | If you do CS224U, you'll certainly see word vectors and transformers again.
00:52:16.740 | But the emphasis is on doing a lot more with natural language understanding tasks.
00:52:22.880 | And so that includes things like building semantic parsers.
00:52:28.020 | So they're the kind of devices that will, you know, respond to questions and commands
00:52:34.320 | such as an Alexa or Google assistant will do.
00:52:40.460 | Building relation extraction systems, which get out particular facts out of a piece of
00:52:45.480 | text of, oh, this person took on this position at this company.
00:52:52.940 | Looking at grounded language learning and grounded language understanding where you're
00:52:57.780 | not only using the language, but the world context to get information and other tasks
00:53:05.500 | that sort.
00:53:06.500 | I mean, I guess you're going to look at the website to get more details of it.
00:53:10.580 | I mean, you know, relevant to this class, I mean, a lot of people also find it an opportunity
00:53:17.200 | to just get further in doing a project in the area of natural language processing that
00:53:24.880 | sort of by the nature of the structure of the class, since, you know, it more assumes
00:53:30.320 | that people know how to build deep learning natural language systems at the beginning
00:53:35.400 | that rather than a large percentage of the class going into, okay, you have to do all
00:53:41.400 | of these assignments, although there are little assignments earlier on that there's sort
00:53:46.080 | of more time to work on a project for the quarter.
00:53:51.280 | Okay.
00:53:53.080 | Here's one more question that maybe Shikhar could do.
00:53:57.680 | Do you know of attempts to crowdsource dynamic benchmarks, e.g. users uploading adversarial
00:54:04.280 | examples for evaluation or online learning?
00:54:08.160 | Yeah, so actually, like, the main idea there is to use crowdsourcing, right?
00:54:16.520 | So in fact, there is this platform
00:54:19.140 | that was created by FAIR, and it's called DynaBench.
00:54:25.080 | And the objective is just that to construct this like dynamically evolving benchmark,
00:54:31.800 | we are just going to offload it to users of this platform.
00:54:36.560 | And you can, you know, it essentially gives you utilities for like, deploying your model
00:54:41.200 | and then having, you know, humans kind of try to fool the model.
00:54:46.360 | Yeah, so this is like, it's basically how the dynamic benchmark collection actually
00:54:55.160 | works.
00:54:56.160 | So we deploy a model on some platform, and then we get humans to like fool the system.
00:55:09.160 | Yeah.
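To make that dynamic benchmark collection loop concrete, here is a minimal sketch of the human-in-the-loop idea, assuming hypothetical `model`, `get_human_attempt`, and `get_human_label` callables that stand in for the deployed model and the crowdsourcing interface; this is not DynaBench's actual API.

```python
def collect_adversarial_round(model, get_human_attempt, get_human_label,
                              num_attempts=100):
    """Toy human-in-the-loop collection round in the spirit of DynaBench.

    `model`, `get_human_attempt`, and `get_human_label` are hypothetical
    stand-ins for the deployed model and the crowdsourcing interface.
    """
    fooled, not_fooled = [], []
    for _ in range(num_attempts):
        text = get_human_attempt()       # a person writes an example meant to fool the model
        gold = get_human_label(text)     # ...and says what the correct answer is
        pred = model(text)               # the deployed model's prediction
        if pred != gold:
            fooled.append((text, gold))  # candidate for the next benchmark round
        else:
            not_fooled.append((text, gold))
    # In practice, fooling examples are verified by other annotators before
    # being added to the evolving benchmark, and the model is then retrained.
    return fooled, not_fooled
```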
00:55:11.600 | Here's a question.
00:55:19.720 | Can you address the problems of NLP models, not able to remember really long contexts
00:55:25.240 | and techniques to infer on really large input length?
00:55:29.320 | Yeah, so I guess like, there have been like a few works recently that kind of try to scale
00:55:38.640 | up transformers to like really large context lengths.
00:55:42.720 | One of them is like the Reformer.
00:55:45.600 | And there's also like the Transformer-XL that was the first one to try and do that.
00:55:51.720 | I think what is unclear is whether you can combine that with the scale of these GPT-like
00:56:00.360 | models.
00:56:02.240 | And whether you see qualitatively different things once you do that. And part of
00:56:07.920 | it is just that all of this is just, like, so recent, right?
00:56:10.720 | But yeah, I think the open question there is that, you know, can you take these
00:56:15.320 | transformers that can operate over really long contexts, combine that with the
00:56:21.400 | scale of GPT-3, and then get models that can actually reason over these really large
00:56:27.080 | contexts?
00:56:28.080 | Because I guess the hypothesis of scale is that once you train language models at scale,
00:56:34.360 | it can start to do these things.
00:56:36.320 | And so to do that for long context, we actually need to like have long context transformers
00:56:42.680 | that are trained at scale.
00:56:43.680 | And I don't think people have done that yet.
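For intuition about what operating over long contexts looks like mechanically, here is a rough NumPy sketch of the segment-level recurrence idea behind Transformer-XL: process the sequence in segments and let each segment attend over a cached memory of the previous segment's hidden states. It is a single-layer, single-head toy that ignores causal masking, relative position encodings, and training details; all function names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    # Scaled dot-product attention over whatever keys/values we pass in.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values

def segment_recurrence(hidden_states, segment_len=512, mem_len=512):
    """Toy version of Transformer-XL-style segment-level recurrence.

    Rather than attending over the whole (possibly huge) sequence at once,
    we walk over it in segments; each segment's queries attend over both the
    current segment and a cached memory of the previous segment, so
    information can propagate beyond a single segment length.
    """
    dim = hidden_states.shape[-1]
    memory = np.zeros((0, dim))
    outputs = []
    for start in range(0, len(hidden_states), segment_len):
        segment = hidden_states[start:start + segment_len]
        context = np.concatenate([memory, segment], axis=0)  # memory + current
        outputs.append(attend(segment, context, context))
        memory = segment[-mem_len:]  # cache the newest states for the next step
    return np.concatenate(outputs, axis=0)

# e.g. segment_recurrence(np.random.randn(4096, 64), segment_len=512)
```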
00:56:55.720 | So I'm seeing this other question about language acquisition.
00:56:58.720 | Chris, do you have some thoughts on this?
00:57:01.800 | Or maybe I can just say something.
00:57:07.680 | Yeah, so the question is, what do you think we can learn from baby language acquisition?
00:57:17.000 | Can we build a language model in a more interactive way, like reinforcement learning?
00:57:22.160 | Do you know any of these attempts?
00:57:26.680 | That's a big, huge question.
00:57:28.600 | And you know, I think the short non-helpful answer is that there are kind of no answers
00:57:34.680 | at the moment.
00:57:35.680 | I know people have certainly tried to do things at various scales, but you know, we just have
00:57:41.840 | no technology that is the least bit convincing for being able to replicate the language learning
00:57:50.760 | ability of a human child.
00:57:53.800 | But after that prologue, what I could say is, I mean, yeah, there are definitely ideas
00:58:00.440 | to have in your head.
00:58:02.840 | So you know, there are sort of clear results, which is that little kids don't learn by watching
00:58:10.200 | videos.
00:58:11.240 | So it seems like interaction is completely key.
00:58:17.680 | Little kids don't learn from language alone.
00:58:21.440 | They're in a very rich environment where people are sort of both learning stuff from the environment
00:58:28.040 | in general, and in particular, you know, they're learning a lot from what language acquisition
00:58:35.240 | researchers refer to as attention, which is different to what we mean by attention.
00:58:42.360 | But it means that the caregiver will be looking at the object that's the focus of interest
00:58:48.680 | and you know, commonly other things as well, like sort of, you know, picking it up and
00:58:52.440 | bringing it near the kid and all those kinds of things.
00:58:57.720 | And you know, babies and young kids get to experiment a lot, right?
00:59:03.440 | So regardless of whether it's learning what happens when you have some blocks that you
00:59:09.880 | stack up and play with them, or you're learning language, you sort of experiment by trying
00:59:17.280 | some things and see what kind of response you get.
00:59:20.920 | And again, that's essentially building on the interactivity of it that you're getting
00:59:27.240 | some kind of response to any offerings you make.
00:59:30.800 | And you know, this is something that's sort of been hotly debated in the language acquisition
00:59:35.680 | literature.
00:59:36.680 | So a traditional Chomskyan position is that, you know, human beings don't get effective
00:59:47.040 | feedback, you know, supervised labels when they talk.
00:59:52.880 | And you know, in some very narrow sense, well, that's true, right?
00:59:56.240 | It's just not the case that after a baby tries to say something that they get feedback of,
01:00:01.280 | you know, syntax error in English on word four, or they get given, here's the semantic
01:00:09.400 | form I took away from your utterance.
01:00:12.520 | But in a more indirect way, they clearly get enormous feedback, they can see what kind
01:00:18.040 | of response they get from their caregiver at every corner.
01:00:24.720 | And so like in your question, you were suggesting that, well, somehow we should be making use
01:00:32.920 | of reinforcement learning because we have something like a reward signal there.
01:00:37.960 | And you know, in a big picture way, I'd say, oh, yeah, I agree.
01:00:42.840 | In terms of a much more specific way as to, well, how can we possibly get that to work
01:00:48.160 | to learn something with the richness of human language?
01:00:52.800 | You know, I think we don't have much idea, but you know, there has started to be some
01:00:59.560 | work.
01:01:00.560 | So people have been sort of building virtual environments, which, you know, you have your
01:01:07.600 | avatar in, and it can manipulate things in the virtual environment, and there's linguistic input,
01:01:14.720 | and it can succeed in getting rewards for carrying out a command, where the command
01:01:19.480 | can be something like, you know, pick up the orange block or something like that.
01:01:24.920 | And you know, to a small extent, people have been able to build things that work.
01:01:31.880 | I mean, as you might be picking up, I mean, I guess so far, at least I've just been kind
01:01:39.760 | of underwhelmed because it seems like the complexity of what people have achieved is
01:01:45.440 | sort of, you know, just so primitive compared to the full complexity of language, right?
01:01:51.680 | You know, the kind of languages that people have been able to get systems to learn are
01:01:57.080 | ones that can, yeah, do pick up commands where they can learn, you know, blue cube versus
01:02:02.880 | orange sphere.
01:02:04.720 | And that's sort of about how far people have gotten.
01:02:07.720 | And that's sort of such a teeny small corner of what's involved in learning a human language.
01:02:14.840 | One thing I'll just add to that is I think there are some principles of how kids learn
01:02:24.480 | that people have tried to apply to deep learning.
01:02:27.520 | And one example that comes to mind is curriculum learning, where there's like a lot of literature
01:02:33.200 | that shows that, you know, babies tend to pay attention to things
01:02:39.080 | that are just slightly challenging for them.
01:02:41.520 | And they don't pay attention to things that are extremely challenging, and also don't
01:02:45.000 | pay attention to things that they know how to solve.
01:02:47.680 | And many researchers have really tried to get curriculum learning to work.
01:02:53.480 | And the verdict on that is that it seems to kind of work when you're in like reinforcement
01:02:58.280 | learning settings.
01:02:59.280 | But it's unclear if it's going to work on like supervised learning settings.
01:03:03.280 | But I still think that it's, like, underexplored.
01:03:05.640 | And maybe, you know, there should be like more attempts to kind of see if we can like
01:03:11.920 | add in curriculum learning and if that improves anything.
01:03:16.360 | Yeah, I agree.
01:03:18.840 | Curriculum learning is an important idea, which we haven't really talked about.
01:03:23.480 | But it seems like it's certainly essential to human learning.
01:03:27.820 | And there have been some minor successes with it in the machine learning world.
01:03:31.880 | But it sort of seems like it's an idea you should be able to do a lot more with in the
01:03:36.560 | future as you move from models that are just doing one narrow task to trying to do a more
01:03:44.480 | general language acquisition process.
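As a concrete, simplified illustration of the curriculum learning idea discussed above, here is a sketch of a competence-based curriculum: rank examples by some difficulty proxy and widen the sampling pool over training. The `difficulty_fn` (say, sentence length or a smaller model's loss) is an assumption chosen for illustration, not a standard API.

```python
import random

def curriculum_batches(examples, difficulty_fn, num_epochs, batch_size=32):
    """Toy competence-based curriculum.

    `difficulty_fn` is whatever proxy for difficulty you pick; early epochs
    sample only from the easiest examples, and the pool widens until the
    whole dataset is in play.
    """
    ranked = sorted(examples, key=difficulty_fn)  # easiest first
    for epoch in range(num_epochs):
        # "Competence" grows linearly from ~10% of the data to 100%.
        competence = min(1.0, 0.1 + 0.9 * epoch / max(1, num_epochs - 1))
        pool = ranked[: max(batch_size, int(competence * len(ranked)))]
        random.shuffle(pool)
        for i in range(0, len(pool), batch_size):
            yield epoch, pool[i:i + batch_size]

# e.g. for epoch, batch in curriculum_batches(train_sentences, len, num_epochs=10):
#          train_step(batch)
```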
01:03:46.680 | Should I attempt the next question as well?
01:03:51.880 | Okay, the next question is, is the reason humans learn languages better just because
01:03:57.240 | we are pre-trained over millions of years of physics simulation?
01:04:01.680 | Maybe we should pre-train a model the same way.
01:04:05.680 | So I mean, I presume that by physics simulation, you're evoking evolution
01:04:11.840 | when you're talking about millions of years.
01:04:15.000 | So you know, this is a controversial, debated, big question.
01:04:24.180 | So you know, again, if I invoke Chomsky again, so Noam Chomsky is sort of the most famous
01:04:31.560 | linguist in the world.
01:04:36.480 | And you know, essentially, Noam Chomsky's career starting in the 1950s is built around
01:04:42.520 | the idea that little children get such dubious linguistic input because you know, they hear
01:04:52.280 | a random bunch of stuff, they don't get much feedback on what they say, etc.
01:04:57.720 | But language could not be learned empirically just from the data observed.
01:05:04.300 | And the only possible assumption to work from is that significant parts of human language are
01:05:15.000 | innate, or in the sort of human genome; babies are born with that.
01:05:19.400 | And that explains the miracle by which very little humans learn amazingly fast how human
01:05:26.960 | languages work.
01:05:28.560 | Now, to give some credit to that idea, for those of you who have not been around little
01:05:36.560 | children, I mean, I think one does just have to acknowledge, you know, human language acquisition
01:05:43.740 | by live little kids.
01:05:46.680 | I mean, it does just seem to be miraculous, right?
01:05:49.880 | As you go through this sort of slow phase for a couple of years where, you know, the
01:05:57.000 | kid sort of goos and gahs some syllables, and then there's a fairly long period where
01:06:02.280 | they've picked up a few words, and they can say "juice, juice" when they want to drink some
01:06:08.320 | juice and nothing else.
01:06:10.360 | And then it just sort of seems like there's this phase change, where the kids suddenly
01:06:15.720 | realize, wait, this is a productive generative sentence system, I can say whole sentences.
01:06:21.680 | And then in an incredibly short period, they sort of seem to transition from saying one
01:06:27.380 | and two word utterances to suddenly they can say, you know, "Daddy come home in garage,
01:06:36.800 | putting bike in garage."
01:06:39.160 | And you go, wow, how did they suddenly discover language?
01:06:42.960 | So, you know, so it is kind of amazing.
01:06:47.920 | But personally, for me, at least, you know, I've just never believed the strong versions
01:06:56.360 | of the hypothesis that human beings have much in the way of language specific knowledge
01:07:05.140 | or structure in their brains that comes from genetic inheritance.
01:07:09.920 | Like clearly, humans do have these very clever brains.
01:07:14.600 | And if we're at the level of saying, being able to think, or being able to interpret
01:07:21.160 | the visual world, that's things that have developed over tens of millions of years.
01:07:29.640 | And evolution can be a large part of the explanation.
01:07:35.000 | And humans are clearly born with lots of vision specific hardware in their brains, as are
01:07:42.680 | a lot of other creatures.
01:07:44.720 | But when you come to language, you know, no one knows when language in a sort of
01:07:53.160 | modern-like form first became available, because, you know, there aren't any fossils of people
01:07:59.840 | saying, you know, the word spear or something like that.
01:08:04.400 | But, you know, to the extent that there are estimates based on sort of what you can see
01:08:09.920 | of the sort of spread of proto humans and their sort of apparent social structures from
01:08:19.280 | sort of what you can find in fossils, you know, most people guess that language is at
01:08:24.840 | most a million years old.
01:08:27.880 | And you know, that's just too short a time for evolution to sort
01:08:35.040 | of build any significant structure inside human brains that's specific to language.
01:08:40.120 | So I kind of think that the working assumption has to be that sort of there's just about
01:08:47.880 | nothing specific to language in human brains.
01:08:52.160 | And you know, the most plausible hypothesis, not that I know very much about neuroscience
01:08:57.960 | when it comes down to it, is that humans were able to repurpose hardware that was
01:09:04.680 | originally built for other purposes, like visual scene interpretation and memory, and
01:09:11.760 | that that gave a basis of sort of having all this clever hardware that you could then use
01:09:17.280 | for language.
01:09:18.280 | You know, it's kind of like GPUs were invented for playing computer games, and we were able
01:09:23.360 | to repurpose that hardware to do deep learning.
01:09:27.280 | Okay, we've got a lot of questions that have come out at the end.
01:09:38.280 | Okay, so this one is to be answered live.
01:09:42.120 | Let's see.
01:09:43.120 | Yeah, if you could name, I guess this is for either of you, one main bottleneck:
01:09:48.560 | if we could provide feedback efficiently to our systems like babies are given feedback,
01:09:56.240 | what's the bottleneck that remains in trying to have more human-like language acquisition?
01:10:09.120 | I mean, I sort of, I can opine on this.
01:10:23.680 | Were you saying something, Shikhar?
01:10:24.680 | Yeah, I was just going to say that I think it's a bit of everything, right?
01:10:30.840 | Like, I think in terms of models, one thing I would say is that we know that there are more
01:10:37.040 | feedback connections than feedforward connections in the brain.
01:10:42.160 | And we haven't really figured out a way of using that. So, you know, of course, we had RNNs,
01:10:49.360 | which, you know, you can loop through, and that
01:10:52.880 | sort of implements a feedback loop, but we still haven't really figured out how to, you
01:10:58.080 | know, use that knowledge, that the brain has a lot of feedback connections, and then
01:11:01.800 | apply that to practical systems. I think on the modeling end, maybe that's one problem.
01:11:10.840 | There is like, yeah, I think curriculum learning is maybe one of them, but I think the one
01:11:16.360 | that's probably going to have most bang for buck is really figuring out how we can move
01:11:20.680 | beyond text.
01:11:21.680 | And I think there's just like so much more information that's available that we're just
01:11:27.320 | not using.
01:11:28.840 | And so I think that's where most of the progress might come from, like figuring out what's
01:11:32.960 | the most practical way of going beyond text.
01:11:37.320 | This is what I think.
01:11:39.320 | Okay.
01:11:48.320 | Let's see.
01:11:49.320 | What are some important NLP topics that we have not covered in this class?
01:11:55.320 | I'll do that.
01:12:01.320 | You know, well, sort of one answer is a lot of the topics that are covered in CS224U because,
01:12:07.320 | you know, we do make a bit of an effort to keep them disjoint, though not fully.
01:12:12.320 | Right.
01:12:13.320 | So there's sort of lots of topics in language understanding that we haven't covered.
01:12:20.320 | Right.
01:12:21.320 | So if you want to make a voice assistant like Alexa Siri or Google Assistant, well, you
01:12:31.320 | need to sort of be able to interface with systems, APIs that can do things like delete
01:12:37.320 | your mail or buy you concert tickets.
01:12:40.320 | And so you need to be able to convert from language into explicit semantic form that
01:12:45.320 | can interact with the systems of the world.
01:12:48.320 | We haven't talked about that at all.
01:12:51.320 | So there's lots of language understanding stuff.
01:12:54.320 | There's also lots of language generation things.
01:12:59.320 | So, you know, effectively for language generation, all we have done is neural language models.
01:13:06.320 | They are great.
01:13:07.320 | Run them and they will generate language.
01:13:10.320 | And, you know, in one sense, that's true.
01:13:13.320 | Right.
01:13:14.320 | It's awesome the kind of generation you can do with things like GPT-2 or 3.
01:13:22.320 | But, you know, where that's missing is that's really only giving you the ability to produce
01:13:32.320 | fluent text, where it sort of rabbits on producing fluent text, but if you actually wanted to
01:13:39.320 | have a good natural language generation system, you also have to have higher level planning
01:13:46.320 | of what you're going to talk about and how you are going to express it.
01:13:52.320 | Right.
01:13:53.320 | So then in most situations in natural language, you think, OK, well, I want to explain to
01:14:00.320 | people something about why it's important to do math classes at college.
01:14:05.320 | Let me think how to organize this.
01:14:08.320 | Maybe I should talk about some of the different applications where math turns up and how it's
01:14:13.320 | a really good grounding.
01:14:15.320 | Whatever you kind of plan out, here's how I can present some ideas.
01:14:19.320 | Right.
01:14:20.320 | And that kind of natural language generation, we haven't done any of.
01:14:27.320 | Yeah.
01:14:28.320 | So that's sort of saying more understanding, more generation, which is most of NLP, you know.
01:14:38.320 | I mean, obviously, there are then sort of particular tasks that we can talk about that
01:14:43.320 | we either have or have not explicitly addressed.
01:14:48.320 | Is there has there been any work in putting language models into an environment in which
01:15:00.320 | they can communicate to achieve a task?
01:15:03.320 | And do you think this would help with unsupervised learning?
01:15:12.320 | So I guess there's been a lot of work on emergent communication and also self play, where you
01:15:19.320 | have these different models which are initialized as language models that attempt to communicate
01:15:27.320 | with each other to solve some task.
01:15:30.320 | And then you have a reward at the end, whether they were able to finish the task or not.
01:15:35.320 | And then based on that reward, you attempt to learn the communication strategy.
01:15:40.320 | And this started out as emergent communication and self play.
01:15:44.320 | And then there was recent work.
01:15:45.320 | I think it was last year or the year before that, where they showed that if you initialize
01:15:50.320 | these models with language model pre-training, you are basically able to prevent this problem of language
01:15:58.320 | drift, where the language or the communication protocol that your models end up learning
01:16:04.320 | has nothing to do with actual language.
01:16:08.320 | And so, yeah, I mean, from that sense, there has been some work.
01:16:12.320 | But it's very limited.
01:16:13.320 | I think there's some groups that try to study this, but not beyond that.
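As a minimal sketch of the reward-driven learning described here, this is roughly what a REINFORCE-style objective for the speaker agent looks like; the argument names and the constant baseline are illustrative assumptions, and real systems add entropy bonuses, learned baselines, and the language-model pre-training mentioned above.

```python
def reinforce_speaker_loss(utterance_logprob, task_reward, baseline=0.0):
    """REINFORCE-style objective for a speaker agent in communication self-play.

    utterance_logprob: log-probability the speaker assigned to the message it
        sent (in practice a differentiable tensor from your framework).
    task_reward: e.g. 1.0 if the listener completed the task, else 0.0.
    """
    advantage = task_reward - baseline       # how much better than expected
    # Minimizing this loss pushes up the probability of messages that led to
    # task success, which is how the communication strategy gets learned.
    return -advantage * utterance_logprob
```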
01:16:23.320 | OK, I mean, the last couple of questions.
01:16:31.320 | There's one question about whether kids learn language more from social cues or from a reward-
01:16:37.320 | based system.
01:16:38.320 | I don't know if either of you have opinions about this, but if you do.
01:16:45.320 | Yeah, I mean, I don't have anything very deep to say about this question.
01:16:50.320 | It's on the importance of social cues as opposed to pure reward-based systems.
01:16:56.320 | Well, I mean, in some sense, a social cue you can also regard as a reward, in that people
01:17:04.320 | like to have other people put a smile on their face when you say something.
01:17:10.320 | But I do think generally, when people are saying, what have we not covered?
01:17:18.320 | Another thing that we've barely covered is the social side of language.
01:17:23.320 | So, you know, a huge, a huge interesting thing about language is it has this very dynamic,
01:17:31.320 | big dynamic range.
01:17:33.320 | So on the one hand, you can talk about very precise things in language.
01:17:38.320 | So you can sort of talk about math formulas and steps in a proof and things like that,
01:17:43.320 | so that there's a lot of precision and language.
01:17:46.320 | You know, on the other hand, you can just sort of emphatically mumble, mumble whatever
01:17:51.320 | words at all, and you're not really sort of communicating anything in the way of a propositional
01:17:57.320 | content.
01:17:58.320 | What you're really trying to communicate is, you know, I'm, oh, I'm thinking about you
01:18:03.320 | right now.
01:18:04.320 | And, oh, I'm concerned with how you're feeling or whatever it is in the circumstances, right?
01:18:10.320 | So that a huge part of language use is in forms of sort of social communication between
01:18:19.320 | human beings.
01:18:20.320 | And, you know, that's another big part of actually building successful natural language
01:18:29.320 | systems, right?
01:18:30.320 | So if you, you know, if you think negatively about something like the virtual assistants
01:18:35.320 | I've been falling back on a lot, it's, you know, that they have virtually no ability as social
01:18:43.320 | language users, right?
01:18:44.320 | So we're now training a generation of little kids that what you should do is sort of bark
01:18:52.320 | out commands as if you were, you know, serving in the German army in World War II or something,
01:18:59.320 | and that there's none of the kind of social part of how to, you know, use language to
01:19:08.320 | communicate satisfactorily with human beings and to maintain a social system.
01:19:15.320 | And that, you know, that's a huge part of human language use that kids have to learn
01:19:20.320 | and learn to use successfully, right?
01:19:23.320 | You know, a lot of being successful in the world is, you know, when you want
01:19:29.320 | someone to do something for you, knowing that there are good ways to ask them for it.
01:19:35.320 | You know, some of its choice of how to present the arguments, but, you know, some of it is
01:19:41.320 | by building social rapport and asking nicely and reasonably and making it seem like you're
01:19:48.320 | a sweet person that other people should do something for.
01:19:51.320 | And, you know, human beings are very good at that.
01:19:54.320 | And being good at that is a really important skill for being able to navigate the world well.