
Stanford XCS224U: NLU | Contextual Word Representations, Part 10: Wrap-up | Spring 2023


Chapters

0:00
0:30 Other noteworthy architectures
2:07 BERT: Known limitations
3:25 Pretraining data
4:17 Current trends

Transcript

Welcome back everyone. This is the 10th and final screencast in our series on contextual representations. I'd like to just briefly wrap up. In doing that, I'd like to do three things. First, just take stock of what we did a little bit. Second, I'd like to make amends for some really interesting architectures and innovations that I didn't have time to mention in the core series.

Then finally, I'd like to look to the future, both for the course and also for the field. Let me start by trying to make amends a little bit for some noteworthy architectures that I didn't have time for. Transformer-XL is an early and very innovative attempt to bring in long contexts.

It does this by essentially caching earlier parts of a long sequence, and then recreating some recurrent connections across those cached states into the computation for the current set of states. Very innovative. The ideas for Transformer-XL were carried forward into XLNet. The core of XLNet is the goal of having bidirectional context while nonetheless having an autoregressive language modeling loss.
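To make the caching idea a bit more concrete, here is a minimal sketch of segment-level recurrence in the spirit of Transformer-XL. It is a simplification under my own assumptions: a single attention head, no relative positional encodings, no causal mask, and function and variable names that are mine rather than the original implementation's.

```python
# Minimal sketch of segment-level recurrence in the spirit of Transformer-XL.
# Single head, no relative positional encodings, no causal mask; names and
# shapes are illustrative assumptions, not the original implementation.
import torch
import torch.nn.functional as F

def attend_with_cache(h_curr, h_cache, W_q, W_k, W_v):
    """Let the current segment attend over itself plus cached earlier states.

    h_curr:  (curr_len, d)  hidden states for the current segment
    h_cache: (cache_len, d) states cached from the previous segment
    """
    context = torch.cat([h_cache.detach(), h_curr], dim=0)  # stop-gradient through the cache
    q = h_curr @ W_q          # queries come only from the current segment
    k = context @ W_k         # keys and values also see the cached history
    v = context @ W_v
    scores = (q @ k.T) / (k.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v

d = 16
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
segment1 = torch.randn(8, d)   # processed earlier, then cached
segment2 = torch.randn(8, d)   # current segment
out = attend_with_cache(segment2, segment1, W_q, W_k, W_v)
```

The key move is that queries come only from the current segment, while keys and values range over the cached history as well, with gradients stopped at the cache.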

They do this in this really interesting way of sampling different sequence orders so that you process left to right while nonetheless sampling enough sequence orders that you essentially have the power of bidirectional context. Then DeBERTa is really interesting from the perspective of our discussion of positional encoding. In that screencast, I expressed a concern that the positional encoding representations were exerting, in some cases, too much influence on the representations of words.

DeBERTa can be seen as an attempt to decouple word from position somewhat. It does that by decoupling those core representations and then having distinct attention mechanisms to those two parts. My guiding intuition here is that DeBERTa will allow words to have more of their wordhood separate from where they might have appeared in the input string.
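Here is a rough sketch of what disentangled attention scores look like in the spirit of DeBERTa, with content and relative-position representations kept separate and combined through distinct attention terms. The function name, shapes, and the exact combination are simplifications of my own, not the paper's implementation.

```python
# Rough sketch of disentangled (content vs. position) attention scores in the
# spirit of DeBERTa. Simplified: single head, dense relative-position
# embeddings, and an approximate combination of the three terms.
import torch
import torch.nn.functional as F

def disentangled_scores(content, rel_pos, Wq_c, Wk_c, Wq_p, Wk_p):
    """content: (n, d) word-content states; rel_pos: (n, n, d) relative-position embeddings."""
    q_c, k_c = content @ Wq_c, content @ Wk_c
    c2c = q_c @ k_c.T                                        # content-to-content
    c2p = torch.einsum("id,ijd->ij", q_c, rel_pos @ Wk_p)    # content-to-position
    p2c = torch.einsum("ijd,jd->ij", rel_pos @ Wq_p, k_c)    # position-to-content
    return (c2c + c2p + p2c) / (3 * content.shape[-1]) ** 0.5

n, d = 6, 16
weights = F.softmax(
    disentangled_scores(torch.randn(n, d), torch.randn(n, n, d),
                        *(torch.randn(d, d) for _ in range(4))),
    dim=-1,
)
```

Because the word-content vectors never have position added into them, a word's contribution to attention stays separate from where it happened to appear.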

That seems very healthy to me. When I talked about BERT, I listed out some known limitations. There were four of them. I gave credit to RoBERTa for addressing the first one, which was around design decisions. I gave credit to ELECTRA for addressing items 2 and 3, where 2 was about the artificial nature of the [MASK] token, and 3 was about the inefficiency of MLM training in the BERT context.

I haven't yet touched on the fourth item. The fourth item is from Yang et al., which is the XLNet paper. XLNet indeed addresses this concern. The concern is just that BERT assumes the predicted tokens are independent of each other given the unmasked tokens, which is oversimplified, as high-order, long-range dependency is prevalent in natural language.
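Schematically, the contrast is roughly between the BERT-style MLM objective, which treats the masked positions $\mathcal{M}$ as conditionally independent given the unmasked tokens, and XLNet's permutation language modeling objective, which factorizes autoregressively over a sampled order $z$:

$$\text{MLM:}\quad \sum_{t \in \mathcal{M}} \log p_\theta\!\left(x_t \mid \mathbf{x}_{\setminus \mathcal{M}}\right) \qquad\quad \text{Permutation LM:}\quad \mathbb{E}_{z \sim \mathcal{Z}_T}\!\left[\sum_{t=1}^{T} \log p_\theta\!\left(x_{z_t} \mid \mathbf{x}_{z_{<t}}\right)\right]$$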

The guiding idea behind XLNet is that in having an autoregressive language modeling loss, we bring in some of the conditional probabilities that help us overcome this artificial statistical nature of the MLM objective. But remember, the interesting aspect of XLNet is that we still have bidirectional context, and this comes from sampling all of those permutation orders of the input string.
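As a toy illustration of that sampling idea (my own simplification in plain Python, not the paper's two-stream attention implementation), one sampled factorization order determines, for each target position, which other positions it may condition on; across many sampled orders, every position eventually conditions on context from both sides:

```python
# Toy sketch of the permutation-LM idea behind XLNet (a simplification, not the
# paper's two-stream attention implementation).
import random

def permutation_lm_factorization(tokens):
    """Return (target position, conditioning positions) pairs for one sampled order.

    Each target is predicted from the tokens that precede it *in the sampled
    order*, so over many orders each token conditions on context from both
    sides -- unlike MLM, where masked tokens are predicted independently given
    the unmasked ones.
    """
    order = list(range(len(tokens)))
    random.shuffle(order)                     # sample one factorization order z
    return [(pos, sorted(order[:i])) for i, pos in enumerate(order)]

for target, context in permutation_lm_factorization(["the", "chef", "cooked", "the", "meal"]):
    print(f"predict position {target} given positions {context}")
```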

Really interesting to think about and also a lovely insight about the nature of BERT itself. I didn't get to discuss pre-training data really at all in this series, and I feel guilty about that because I think we can now see that pre-training data is an incredibly important ingredient in shaping the behaviors of these large language models.

I have listed out here some core pre-training resources, OpenBook Corpus, the Pile, Big Science Data, Wikipedia, and Reddit. I have listed these here not really to encourage you to go off and train your own large language model, but rather to think about auditing these datasets as a way of more deeply understanding the artifacts that we do have and coming to an understanding of where they're likely to be successful and where they might be actually very problematic.

A lot of that is going to trace to the nature of the input data. Then finally, let's look ahead to the future, some current trends to the best of my estimation. This is likely the situation we're in and what we're going to see going forward. First, it seems like autoregressive architectures have taken over.

That's the rise of GPT. But this may be simply because the field is so focused on generation right now. I would still maintain that if you simply want to represent examples for the sake of having a sentence embedding or understanding how different representations compare to each other, it seems to me that bidirectional models like BERT might still have the edge over models like GPT.

Sequence-to-sequence models are still a dominant choice for tasks that have that structure. It seems like they might have an edge in terms of an architectural bias that helps them understand the tasks themselves. Although, as item 1 here suggests, as we get these really large pure language models, we might find ourselves moving more toward autoregressive formulations even of tasks that have a sequence-to-sequence structure.

We shall see. Then finally, and maybe this is the most interesting point of all, people are still obsessed with scaling up to ever larger language models. But happily, we are seeing a counter-movement towards smaller models. I've put smaller in quotes here because we're still talking about artifacts that have on the order of 10 billion parameters, but that is substantially smaller than these really massive language models.

There are a lot of incentives that are going to encourage these smaller models to become very good. We can deploy them in more places, we can train them more efficiently, we can train more of them, and we might have more control of them in the end for the things that we want to do.

All the incentives are there. This is a moment of intense innovation and a lot of change in this space. I have no idea what these small models are going to be able to do a year from now, but I would exhort all of you to think about how you might participate in this exciting moment and help us reach the point where relatively small and efficient models are nonetheless incredibly performant and useful to us.