
Stanford XCS224U: NLU | Contextual Word Representations, Part 10: Wrap-up | Spring 2023


Chapters

0:00
0:30 Other noteworthy architectures
2:07 BERT: Known limitations
3:25 Pretraining data
4:17 Current trends


00:00:00.000 | Welcome back everyone.
00:00:06.080 | This is the 10th and final screencast in
00:00:08.220 | our series on contextual representation.
00:00:10.520 | I'd like to just briefly wrap up.
00:00:12.460 | In doing that, I'd like to do three things.
00:00:14.520 | First, just take stock of what we did a little bit.
00:00:17.280 | Second, I'd like to make amends for
00:00:19.460 | really interesting architectures and
00:00:21.320 | innovations that I didn't have time
00:00:22.880 | to mention in the core series.
00:00:24.640 | Then finally, I'd like to look to the future,
00:00:26.960 | both for the course and also for the field.
00:00:30.600 | Let me start by trying to make amends a little bit for
00:00:34.000 | some noteworthy architectures that I didn't have time for.
00:00:37.500 | Transformer-XL is an early and
00:00:40.280 | very innovative attempt to bring in long contexts.
00:00:43.680 | It does this by essentially caching
00:00:46.400 | earlier parts of a long sequence,
00:00:48.840 | and then recreating some recurrent connections
00:00:51.640 | across those cached states into
00:00:53.760 | the computation for the current set of states.
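
To make the caching idea concrete, here is a minimal single-head sketch of segment-level recurrence: queries come from the current segment, while keys and values also range over cached, gradient-detached states from the previous segment. The names, shapes, and the absence of masking are simplifications of mine, not Transformer-XL's actual implementation.

```python
import torch

def attend_with_memory(h_curr, h_mem, W_q, W_k, W_v):
    """Single-head attention in which keys and values range over both the
    cached states from the previous segment and the current segment."""
    # Reuse the cache, but do not backpropagate into it.
    context = torch.cat([h_mem.detach(), h_curr], dim=0)  # (mem_len + curr_len, d)

    q = h_curr @ W_q      # queries come only from the current segment
    k = context @ W_k     # keys span cached + current states
    v = context @ W_v     # values span cached + current states

    scores = (q @ k.T) / (q.shape[-1] ** 0.5)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v    # updated states for the current segment

# Toy usage: an 8-state cache from the previous segment, 4 current states.
d = 16
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
h_mem, h_curr = torch.randn(8, d), torch.randn(4, d)
out = attend_with_memory(h_curr, h_mem, W_q, W_k, W_v)  # shape (4, 16)
```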
00:00:56.360 | Very innovative. The ideas for
00:00:58.680 | Transformer-XL were carried forward into XLNet.
00:01:01.960 | The core of XLNet is the goal of having
00:01:04.760 | bidirectional context while nonetheless
00:01:07.280 | having an autoregressive language modeling loss.
00:01:10.600 | They do this in this really interesting way of
00:01:13.920 | sampling different sequence orders so that you process
00:01:17.400 | left to right while nonetheless sampling
00:01:19.820 | enough sequence orders that you essentially have
00:01:22.080 | the power of bidirectional context.
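
In the notation of the XLNet paper, the permutation language modeling objective maximizes the expected autoregressive log-likelihood over factorization orders z sampled from the set of permutations of the positions 1 through T:

```latex
\max_{\theta}\;\; \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
\left[\, \sum_{t=1}^{T} \log p_{\theta}\!\left(x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}}\right) \right]
```

Each sampled order is processed left to right, but because every position eventually appears late in some sampled order, in expectation each token is conditioned on every other token.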
00:01:25.480 | Then DeBERTa is really interesting from
00:01:28.720 | the perspective of our discussion of positional encoding.
00:01:31.800 | In that screencast, I expressed a concern that
00:01:34.400 | the positional encoding representations were exerting,
00:01:37.840 | in some cases, too much influence
00:01:40.120 | on the representations of words.
00:01:42.600 | DeBERTa can be seen as an attempt to
00:01:45.000 | decouple word from position somewhat.
00:01:47.520 | It does that by decoupling those core representations and then
00:01:50.920 | having distinct attention mechanisms to those two parts.
00:01:55.200 | My guiding intuition here is that DeBERTa will allow words to have more of
00:01:59.440 | their wordhood separate from where they might
00:02:02.000 | have appeared in the input string.
00:02:04.840 | That seems very healthy to me.
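
A rough sketch of that disentangled attention idea, simplified to one head with illustrative names (the real DeBERTa adds relative-position bucketing, multiple heads, and an enhanced mask decoder): the attention score between positions i and j is a sum of content-to-content, content-to-position, and position-to-content terms, with the position-to-position term dropped.

```python
import torch

def disentangled_scores(H, P, Wq_c, Wk_c, Wq_r, Wk_r, rel_idx):
    """Simplified one-head DeBERTa-style attention scores.

    H:       (n, d)   content representations
    P:       (2k, d)  relative-position embedding table
    rel_idx: (n, n)   rel_idx[i, j] = index in P of the clipped distance i - j
    """
    Qc, Kc = H @ Wq_c, H @ Wk_c   # content queries and keys
    Qr, Kr = P @ Wq_r, P @ Wk_r   # position queries and keys

    c2c = Qc @ Kc.T                               # content-to-content
    c2p = torch.gather(Qc @ Kr.T, 1, rel_idx)     # content-to-position
    p2c = torch.gather(Kc @ Qr.T, 1, rel_idx).T   # position-to-content
    # The position-to-position term is dropped entirely.

    d = H.shape[-1]
    return (c2c + c2p + p2c) / (3 * d) ** 0.5

# Toy usage with 6 tokens, hidden size 16, max relative distance 4.
n, d, k = 6, 16, 4
H, P = torch.randn(n, d), torch.randn(2 * k, d)
Wq_c, Wk_c, Wq_r, Wk_r = (torch.randn(d, d) for _ in range(4))
pos = torch.arange(n)
rel_idx = (pos[:, None] - pos[None, :]).clamp(-k, k - 1) + k  # values in [0, 2k)
scores = disentangled_scores(H, P, Wq_c, Wk_c, Wq_r, Wk_r, rel_idx)
```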
00:02:07.320 | When I talked about BERT,
00:02:09.400 | I listed out some known limitations.
00:02:11.440 | There were four of them.
00:02:13.040 | I gave credit to RoBERTa for addressing the first one,
00:02:16.280 | which was around design decisions.
00:02:18.600 | I gave credit to ELECTRA for addressing items 2 and 3,
00:02:22.640 | where 2 was about the artificial nature of the mask token,
00:02:26.120 | and 3 was about the inefficiency of MLM training in the BERT context.
00:02:31.360 | I haven't yet touched on the fourth item.
00:02:33.760 | The fourth item is from Yang et al.,
00:02:35.440 | which is the XLNet paper.
00:02:37.280 | XLNet indeed addresses this concern.
00:02:39.880 | The concern is just that BERT assumes the predicted tokens are independent of
00:02:44.200 | each other given the unmasked tokens,
00:02:46.760 | which is an oversimplification, since high-order,
00:02:49.200 | long-range dependencies are prevalent in natural language.
00:02:52.880 | The guiding idea behind XLNet is that, by having an autoregressive language modeling loss,
00:02:59.000 | we bring in some of the conditional probabilities that help us
00:03:02.760 | overcome this artificial statistical nature of the MLM objective.
00:03:07.280 | But remember, the interesting aspect of XLNet is that we still have
00:03:11.520 | bidirectional context and this comes from sampling
00:03:14.720 | all of those permutation orders of the input string.
00:03:18.240 | Really interesting to think about and also
00:03:21.160 | a lovely insight about the nature of BERT itself.
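
To make the contrast explicit, in roughly the XLNet paper's notation: BERT approximates the log-likelihood of the masked tokens given the corrupted input as a sum of independent per-position terms, whereas an autoregressive model chains exact conditionals:

```latex
\text{MLM:}\quad
\log p_{\theta}(\bar{\mathbf{x}} \mid \hat{\mathbf{x}})
\;\approx\; \sum_{t=1}^{T} m_t \,\log p_{\theta}(x_t \mid \hat{\mathbf{x}}),
\qquad
\text{AR:}\quad
\log p_{\theta}(\mathbf{x})
\;=\; \sum_{t=1}^{T} \log p_{\theta}(x_t \mid \mathbf{x}_{<t})
```

Here m_t = 1 exactly when x_t is masked. So if both "New" and "York" are masked in "New York is a city", the MLM objective scores the two predictions independently, even though they clearly depend on each other; the autoregressive factorization captures that dependency.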
00:03:25.440 | I didn't get to discuss pre-training data really at all in this series,
00:03:30.920 | and I feel guilty about that because I think we can now see that pre-training data is
00:03:36.280 | an incredibly important ingredient in
00:03:38.880 | shaping the behaviors of these large language models.
00:03:42.000 | I have listed out here some core pre-training resources,
00:03:45.840 | OpenBook Corpus, the Pile,
00:03:48.440 | the BigScience data, Wikipedia, and Reddit.
00:03:51.840 | I have listed these here not really to
00:03:54.520 | encourage you to go off and train your own large language model,
00:03:57.680 | but rather to think about auditing these datasets as a way of more deeply
00:04:02.360 | understanding the artifacts that we do have and coming to
00:04:06.040 | an understanding of where they're likely to be successful
00:04:08.840 | and where they might be actually very problematic.
00:04:11.720 | A lot of that is going to trace to the nature of the input data.
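
As a small, hedged sketch of what such an audit might look like in practice (the dataset identifier, field name, and statistics below are just illustrative choices of mine, not a prescribed recipe):

```python
from collections import Counter
from datasets import load_dataset  # Hugging Face `datasets` library

# Stream so the full corpus is never materialized locally. The dataset name
# here is just an illustrative, publicly available choice.
stream = load_dataset("wikitext", "wikitext-103-raw-v1", split="train", streaming=True)

length_buckets = Counter()
token_counts = Counter()

for i, record in enumerate(stream):
    tokens = record["text"].split()                  # crude whitespace tokenization
    length_buckets[100 * (len(tokens) // 100)] += 1  # bucket documents by length
    token_counts.update(t.lower() for t in tokens)
    if i >= 50_000:                                  # audit a sample, not everything
        break

print("Document-length buckets:", length_buckets.most_common(10))
print("Most frequent tokens:", token_counts.most_common(20))
```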
00:04:16.480 | Then finally, let's look ahead to the future,
00:04:19.080 | some current trends to the best of my estimation.
00:04:22.920 | This is likely the situation we're in and what we're going to see going forward.
00:04:27.160 | First, it seems like autoregressive architectures have taken over.
00:04:31.720 | That's the rise of GPT.
00:04:34.000 | But this may be simply because the field is so focused on generation right now.
00:04:40.200 | I would still maintain that if you simply want to represent examples for the sake of
00:04:45.840 | having a sentence embedding or
00:04:47.800 | understanding how different representations compare to each other,
00:04:51.080 | it seems to me that bidirectional models like BERT
00:04:53.880 | might still have the edge over models like GPT.
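
For example, here is a minimal sketch of using a bidirectional encoder to produce fixed sentence embeddings via mean pooling (the checkpoint and the pooling strategy are just common, illustrative defaults, not the only reasonable choices):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentences):
    """Mean-pool the final-layer hidden states, ignoring padding positions."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (batch, seq, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

embeddings = embed(["a small example sentence", "another sentence to compare"])
print(embeddings.shape)  # torch.Size([2, 768])
```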
00:04:57.800 | Sequence-to-sequence models are still
00:05:00.520 | a dominant choice for tasks that have that structure.
00:05:03.520 | It seems like they might have an edge in terms of
00:05:05.800 | an architectural bias that helps them understand the tasks themselves.
00:05:10.280 | Although, given item 1, as we get these really large pure language models,
00:05:15.960 | we might find ourselves moving more toward autoregressive formulations
00:05:20.320 | even of tasks that have a sequence-to-sequence structure.
00:05:23.360 | We shall see. Then finally,
00:05:26.180 | and maybe this is the most interesting point of all,
00:05:28.960 | people are still obsessed with scaling up to ever larger language models.
00:05:33.540 | But happily, we are seeing a counter-movement towards smaller models.
00:05:38.120 | I've put smaller in quotes here because we're still talking about
00:05:41.400 | artifacts that have on the order of 10 billion parameters,
00:05:45.040 | but that is substantially smaller than these really massive language models.
00:05:50.240 | There are a lot of incentives that are going to
00:05:53.640 | encourage these smaller models to become very good.
00:05:56.760 | We can deploy them in more places,
00:05:58.880 | we can train them more efficiently,
00:06:00.760 | we can train more of them,
00:06:02.320 | and we might have more control of them in the end for the things that we want to do.
00:06:07.280 | All the incentives are there.
00:06:08.760 | This is a moment of intense innovation and a lot of change in this space.
00:06:13.260 | I have no idea what these small models are going to be able to do a year from now,
00:06:17.840 | but I would exhort all of you to think about how you might participate in
00:06:22.360 | this exciting moment and help us reach the point where relatively small and
00:06:26.680 | efficient models are nonetheless incredibly performant and useful to us.