Stanford XCS224U: NLU | Contextual Word Representations, Part 10: Wrap-up | Spring 2023
Chapters
0:00
0:30 Other noteworthy architectures
2:07 BERT: Known limitations
3:25 Pretraining data
4:17 Current trends
First, let's just take stock of what we did a little bit. Then, finally, I'd like to look to the future.
Let me start by trying to make amends a little bit for some noteworthy architectures that I didn't have time for. The first is Transformer-XL, a very innovative attempt to bring in long contexts, in part by caching states from earlier segments and then recreating some recurrent connections into the computation for the current set of states.
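As a rough illustration of that recurrence idea, here is a minimal sketch of my own (not code from the lecture) of attending over a cached segment: queries come only from the current segment, while keys and values also cover the cached states, which are detached so gradients do not flow into earlier segments. The real Transformer-XL additionally uses relative positional encodings.

```python
import torch

def attend_with_memory(h_current, memory, W_q, W_k, W_v):
    """One attention step over the current segment plus cached states.

    h_current: [cur_len, d]  hidden states for the current segment
    memory:    [mem_len, d]  cached hidden states from the previous segment
    """
    context = torch.cat([memory.detach(), h_current], dim=0)  # extended context
    q = h_current @ W_q                    # queries only for the current segment
    k, v = context @ W_k, context @ W_v    # keys/values include the cached states
    scores = (q @ k.T) / k.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v   # [cur_len, d]
```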
Ideas from Transformer-XL were carried forward into XLNet, which combines bidirectional context with having an autoregressive language modeling loss. They do this in this really interesting way of sampling different sequence orders, so that you process enough sequence orders that you essentially have bidirectional context.
Another architecture worth mentioning is DeBERTa, which is interesting from the perspective of our discussion of positional encoding. In that screencast, I expressed a concern that the positional encoding representations were exerting too much influence over the word representations. DeBERTa addresses that concern. It does that by decoupling those core representations and then having distinct attention mechanisms to those two parts. My guiding intuition here is that DeBERTa will allow words to have more of their wordhood, separate from where they might appear in the input sequence.
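Here is a schematic sketch of that decoupling (my own simplification, not the paper's exact formulation): content and position get separate projections, and the attention score sums a content-to-content, a content-to-position, and a position-to-content term. The actual DeBERTa uses shared relative-position embeddings with distance bucketing rather than the dense per-pair tensor assumed here.

```python
import torch

def disentangled_scores(content, rel_pos, Wq_c, Wk_c, Wq_r, Wk_r):
    """Attention scores built from decoupled content and position streams.

    content: [n, d]     content (word) representations
    rel_pos: [n, n, d]  relative-position embeddings for each (i, j) pair
    """
    q_c, k_c = content @ Wq_c, content @ Wk_c
    c2c = q_c @ k_c.T                                       # content-to-content
    c2p = torch.einsum("id,ijd->ij", q_c, rel_pos @ Wk_r)   # content-to-position
    p2c = torch.einsum("ijd,jd->ij", rel_pos @ Wq_r, k_c)   # position-to-content
    return (c2c + c2p + p2c) / (3 * content.shape[-1]) ** 0.5
```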
Recall also the list of known limitations for BERT that we discussed. I gave credit to RoBERTa for addressing the first one, and I gave credit to ELECTRA for addressing items 2 and 3, where 2 was about the artificial nature of the [MASK] token, and 3 was about the inefficiency of MLM training in the BERT context. The remaining item is the one that XLNet targets. The concern is just that BERT assumes the predicted tokens are independent of each other given the unmasked tokens, which is oversimplified, since long-range dependency is prevalent in natural language.
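To make that concern concrete, here is the approximation the MLM objective implicitly makes, next to the exact chain-rule factorization that an autoregressive loss uses (the notation is mine, not the lecture's): for masked positions \(\mathcal{M}\) in a sequence \(x\), and any factorization order \(z\),

\[
\log p\big(x_{\mathcal{M}} \mid x_{\setminus\mathcal{M}}\big) \;\approx\; \sum_{t \in \mathcal{M}} \log p\big(x_t \mid x_{\setminus\mathcal{M}}\big)
\qquad\text{vs.}\qquad
\log p(x) \;=\; \sum_{t=1}^{T} \log p\big(x_{z_t} \mid x_{z_{<t}}\big).
\]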
The guiding idea behind XLNet is that, in having an autoregressive language modeling loss, we bring in some of the conditional probabilities that help us overcome this artificial statistical nature of the MLM objective. But remember, the interesting aspect of XLNet is that we still have bidirectional context, and this comes from sampling all of those permutation orders of the input string. I find that a lovely insight about the nature of BERT itself.
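Here is a minimal sketch of that permutation idea, as a conceptual illustration rather than XLNet's actual two-stream implementation; `cond_logprob` is a hypothetical stand-in for a model's conditional log-probability.

```python
import random

def permutation_lm_nll(tokens, cond_logprob):
    """Negative log-likelihood of `tokens` under one sampled factorization order."""
    order = list(range(len(tokens)))
    random.shuffle(order)                    # sample a factorization order z
    nll = 0.0
    for step, position in enumerate(order):
        # Condition on the tokens that come earlier in the sampled order.
        # Each term is a true conditional probability (no independence
        # assumption), and across many sampled orders every position gets
        # conditioned on tokens to its left and right: bidirectional context.
        context = {order[s]: tokens[order[s]] for s in range(step)}
        nll -= cond_logprob(tokens[position], context)
    return nll
```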
I didn't get to discuss pre-training data really at all in this series, and I feel guilty about that, because I think we can now see that pre-training data plays a central role in shaping the behaviors of these large language models. I have listed out here some core pre-training resources. The point is not to encourage you to go off and train your own large language model, but rather to think about auditing these datasets as a way of more deeply understanding the artifacts that we do have, and coming to an understanding of where they're likely to be successful and where they might actually be very problematic. A lot of that is going to trace to the nature of the input data.
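In that spirit, here is a minimal auditing sketch of my own, only a starting point. It assumes the corpus is available locally as JSON-lines files in which each record has a "text" field and possibly a "url" field, which is roughly the layout of web-crawl corpora like C4; adjust the field names for the dataset you are inspecting.

```python
import glob
import json
from collections import Counter
from urllib.parse import urlparse

def audit_corpus(path_pattern, max_docs=100_000):
    """Report simple statistics for a sample of a pretraining corpus."""
    lengths, domains, n = [], Counter(), 0
    for path in glob.glob(path_pattern):
        with open(path, encoding="utf-8") as f:
            for line in f:
                doc = json.loads(line)
                lengths.append(len(doc["text"].split()))   # rough word count
                if "url" in doc:                           # where did this text come from?
                    domains[urlparse(doc["url"]).netloc] += 1
                n += 1
                if n >= max_docs:
                    break
        if n >= max_docs:
            break
    print(f"{n} documents; mean length {sum(lengths) / len(lengths):.1f} words")
    print("Most common domains:", domains.most_common(10))
```

Even simple summaries like these can reveal which sources dominate a corpus, and therefore which behaviors we should expect from a model trained on it.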
Then finally, let's look ahead to the future and some current trends, to the best of my estimation. This is likely the situation we're in and what we're going to see going forward. First, it seems like autoregressive architectures have taken over, but this may be simply because the field is so focused on generation right now. I would still maintain that if you simply want to represent examples, say for the sake of understanding how different representations compare to each other, bidirectional models like BERT might still have the edge over models like GPT.
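For that representation use case, a minimal sketch with the Hugging Face transformers library might look like this; mean pooling over the final layer is just one simple choice of sentence representation, not something prescribed in the lecture.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def represent(sentences):
    """Mean-pooled final-layer representations for a list of sentences."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # [batch, seq_len, dim]
    mask = batch["attention_mask"].unsqueeze(-1)           # zero out padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

reps = represent(["a chilly morning", "a cold day", "a stock market rally"])
print(torch.nn.functional.cosine_similarity(reps[0], reps[1:], dim=-1))
```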
Second, sequence-to-sequence models are still a dominant choice for tasks that have that structure. It seems like they might have an edge in terms of an architectural bias that helps them understand the tasks themselves. Although, point 1 here is important: as we get these really large pure language models, we might find ourselves moving more toward autoregressive formulations even of tasks that have a sequence-to-sequence structure.
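To illustrate the difference in formulation, here is a toy example of my own, using a T5-style task prefix for the seq2seq input; the same translation pair can be framed either way.

```python
# Sequence-to-sequence formulation: distinct encoder input and decoder target.
seq2seq_example = {
    "source": "translate English to German: The house is small.",
    "target": "Das Haus ist klein.",
}

# Autoregressive formulation: a single string; the model simply continues
# the prefix, so the task structure lives entirely in the text itself.
autoregressive_example = (
    "English: The house is small.\n"
    "German: Das Haus ist klein."
)
```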
Third, and maybe this is the most interesting point of all, people are still obsessed with scaling up to ever-larger language models. But happily, we are seeing a counter-movement towards smaller models. I've put "smaller" in quotes here because we're still talking about artifacts that have on the order of 10 billion parameters, but that is substantially smaller than these really massive language models. There are a lot of incentives that are going to encourage these smaller models to become very good, and we might have more control of them in the end for the things that we want to do.

This is a moment of intense innovation and a lot of change in this space. I have no idea what these small models are going to be able to do a year from now, but I would exhort all of you to think about how you might participate in this exciting moment and help us reach the point where relatively small and efficient models are nonetheless incredibly performant and useful to us.