Stanford XCS224U: NLU | Contextual Word Representations, Part 10: Wrap-up | Spring 2023
Chapters
0:00
0:30 Other noteworthy architectures
2:07 BERT: Known limitations
3:25 Pretraining data
4:17 Current trends
First, let's just take stock of what we did a little bit. Then, finally, I'd like to look to the future.
Let me start by trying to make amends a little bit for some noteworthy architectures that I didn't have time for. The first is Transformer-XL, a very innovative attempt to bring in long contexts, in part by caching states from earlier segments and then recreating some recurrent connections into the computation for the current set of states.
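As a rough illustration of that recurrence idea, here is a minimal sketch of my own (not code from the lecture) of attending over a cached segment: queries come only from the current segment, while keys and values also cover the cached states, which are detached so gradients do not flow into earlier segments. The real Transformer-XL additionally uses relative positional encodings.

```python
import torch

def attend_with_memory(h_current, memory, W_q, W_k, W_v):
    """One attention step over the current segment plus cached states.

    h_current: [cur_len, d]  hidden states for the current segment
    memory:    [mem_len, d]  cached hidden states from the previous segment
    """
    context = torch.cat([memory.detach(), h_current], dim=0)  # extended context
    q = h_current @ W_q                    # queries only for the current segment
    k, v = context @ W_k, context @ W_v    # keys/values include the cached states
    scores = (q @ k.T) / k.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v   # [cur_len, d]
```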
Ideas from Transformer-XL were carried forward into XLNet, which combines bidirectional context with having an autoregressive language modeling loss. They do this in this really interesting way of sampling different sequence orders, so that you process enough sequence orders that you essentially have bidirectional context.
Another architecture worth mentioning is DeBERTa, which is interesting from the perspective of our discussion of positional encoding. In that screencast, I expressed a concern that the positional encoding representations were exerting too much influence over the word representations. DeBERTa addresses that concern. It does that by decoupling those core representations and then having distinct attention mechanisms to those two parts. My guiding intuition here is that DeBERTa will allow words to have more of their wordhood, separate from where they might appear in the input sequence.
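Here is a schematic sketch of that decoupling (my own simplification, not the paper's exact formulation): content and position get separate projections, and the attention score sums a content-to-content, a content-to-position, and a position-to-content term. The actual DeBERTa uses shared relative-position embeddings with distance bucketing rather than the dense per-pair tensor assumed here.

```python
import torch

def disentangled_scores(content, rel_pos, Wq_c, Wk_c, Wq_r, Wk_r):
    """Attention scores built from decoupled content and position streams.

    content: [n, d]     content (word) representations
    rel_pos: [n, n, d]  relative-position embeddings for each (i, j) pair
    """
    q_c, k_c = content @ Wq_c, content @ Wk_c
    c2c = q_c @ k_c.T                                       # content-to-content
    c2p = torch.einsum("id,ijd->ij", q_c, rel_pos @ Wk_r)   # content-to-position
    p2c = torch.einsum("ijd,jd->ij", rel_pos @ Wq_r, k_c)   # position-to-content
    return (c2c + c2p + p2c) / (3 * content.shape[-1]) ** 0.5
```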
Recall also the list of known limitations for BERT that we discussed. I gave credit to RoBERTa for addressing the first one, and I gave credit to ELECTRA for addressing items 2 and 3, where 2 was about the artificial nature of the [MASK] token, and 3 was about the inefficiency of MLM training in the BERT context. The remaining item is the one that XLNet targets. The concern is just that BERT assumes the predicted tokens are independent of each other given the unmasked tokens, which is oversimplified, since long-range dependency is prevalent in natural language.
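To make that concern concrete, here is the approximation the MLM objective implicitly makes, next to the exact chain-rule factorization that an autoregressive loss uses (the notation is mine, not the lecture's): for masked positions \(\mathcal{M}\) in a sequence \(x\), and any factorization order \(z\),

\[
\log p\big(x_{\mathcal{M}} \mid x_{\setminus\mathcal{M}}\big) \;\approx\; \sum_{t \in \mathcal{M}} \log p\big(x_t \mid x_{\setminus\mathcal{M}}\big)
\qquad\text{vs.}\qquad
\log p(x) \;=\; \sum_{t=1}^{T} \log p\big(x_{z_t} \mid x_{z_{<t}}\big).
\]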
The guiding idea behind XLNet is that, in having an autoregressive language modeling loss, we bring in some of the conditional probabilities that help us overcome this artificial statistical nature of the MLM objective. But remember, the interesting aspect of XLNet is that we still have bidirectional context, and this comes from sampling all of those permutation orders of the input string. I find that a lovely insight about the nature of BERT itself.
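Here is a minimal sketch of that permutation idea, as a conceptual illustration rather than XLNet's actual two-stream implementation; `cond_logprob` is a hypothetical stand-in for a model's conditional log-probability.

```python
import random

def permutation_lm_nll(tokens, cond_logprob):
    """Negative log-likelihood of `tokens` under one sampled factorization order."""
    order = list(range(len(tokens)))
    random.shuffle(order)                    # sample a factorization order z
    nll = 0.0
    for step, position in enumerate(order):
        # Condition on the tokens that come earlier in the sampled order.
        # Each term is a true conditional probability (no independence
        # assumption), and across many sampled orders every position gets
        # conditioned on tokens to its left and right: bidirectional context.
        context = {order[s]: tokens[order[s]] for s in range(step)}
        nll -= cond_logprob(tokens[position], context)
    return nll
```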
I didn't get to discuss pre-training data really at all in this series, and I feel guilty about that, because I think we can now see that pre-training data plays a central role in shaping the behaviors of these large language models. I have listed out here some core pre-training resources. The point is not to encourage you to go off and train your own large language model, but rather to think about auditing these datasets as a way of more deeply understanding the artifacts that we do have, and coming to an understanding of where they're likely to be successful and where they might actually be very problematic. A lot of that is going to trace to the nature of the input data.
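In that spirit, here is a minimal auditing sketch of my own, only a starting point. It assumes the corpus is available locally as JSON-lines files in which each record has a "text" field and possibly a "url" field, which is roughly the layout of web-crawl corpora like C4; adjust the field names for the dataset you are inspecting.

```python
import glob
import json
from collections import Counter
from urllib.parse import urlparse

def audit_corpus(path_pattern, max_docs=100_000):
    """Report simple statistics for a sample of a pretraining corpus."""
    lengths, domains, n = [], Counter(), 0
    for path in glob.glob(path_pattern):
        with open(path, encoding="utf-8") as f:
            for line in f:
                doc = json.loads(line)
                lengths.append(len(doc["text"].split()))   # rough word count
                if "url" in doc:                           # where did this text come from?
                    domains[urlparse(doc["url"]).netloc] += 1
                n += 1
                if n >= max_docs:
                    break
        if n >= max_docs:
            break
    print(f"{n} documents; mean length {sum(lengths) / len(lengths):.1f} words")
    print("Most common domains:", domains.most_common(10))
```

Even simple summaries like these can reveal which sources dominate a corpus, and therefore which behaviors we should expect from a model trained on it.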
Then finally, let's look ahead to the future and some current trends, to the best of my estimation. This is likely the situation we're in and what we're going to see going forward. First, it seems like autoregressive architectures have taken over, but this may be simply because the field is so focused on generation right now. I would still maintain that if you simply want to represent examples, say for the sake of understanding how different representations compare to each other, bidirectional models like BERT might still have the edge over models like GPT.
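For that representation use case, a minimal sketch with the Hugging Face transformers library might look like this; mean pooling over the final layer is just one simple choice of sentence representation, not something prescribed in the lecture.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def represent(sentences):
    """Mean-pooled final-layer representations for a list of sentences."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # [batch, seq_len, dim]
    mask = batch["attention_mask"].unsqueeze(-1)           # zero out padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

reps = represent(["a chilly morning", "a cold day", "a stock market rally"])
print(torch.nn.functional.cosine_similarity(reps[0], reps[1:], dim=-1))
```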
Second, sequence-to-sequence models are still a dominant choice for tasks that have that structure. It seems like they might have an edge in terms of an architectural bias that helps them understand the tasks themselves. Although, point 1 here is important: as we get these really large pure language models, we might find ourselves moving more toward autoregressive formulations even of tasks that have a sequence-to-sequence structure.
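To illustrate the difference in formulation, here is a toy example of my own, using a T5-style task prefix for the seq2seq input; the same translation pair can be framed either way.

```python
# Sequence-to-sequence formulation: distinct encoder input and decoder target.
seq2seq_example = {
    "source": "translate English to German: The house is small.",
    "target": "Das Haus ist klein.",
}

# Autoregressive formulation: a single string; the model simply continues
# the prefix, so the task structure lives entirely in the text itself.
autoregressive_example = (
    "English: The house is small.\n"
    "German: Das Haus ist klein."
)
```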
Third, and maybe this is the most interesting point of all, people are still obsessed with scaling up to ever-larger language models. But happily, we are seeing a counter-movement towards smaller models. I've put "smaller" in quotes here because we're still talking about artifacts that have on the order of 10 billion parameters, but that is substantially smaller than these really massive language models. There are a lot of incentives that are going to encourage these smaller models to become very good, and we might have more control of them in the end for the things that we want to do.

This is a moment of intense innovation and a lot of change in this space. I have no idea what these small models are going to be able to do a year from now, but I would exhort all of you to think about how you might participate in this exciting moment and help us reach the point where relatively small and efficient models are nonetheless incredibly performant and useful to us.