
Stanford XCS224U: NLU | Contextual Word Representations, Part 1: Guiding Ideas | Spring 2023


Transcript

Hello everyone, I'm Chris Potts. Welcome to our unit on contextual word representations. I thought I'd kick this off with some high-level guiding ideas. The first thing I wanted to say is that in previous iterations of this course, we spent about two weeks focused on static vector representations of words and in some cases phrases and full sentences.

For this iteration of the course, we're going to go directly to contextual word representations which have proven so powerful for today's NLP research and technologies. But I did want to offer a brief overview of the history that leads to contextual representations. Let's rewind the clock back to what I've called feature-based, classical lexical representations and the hallmark of these representations is that they are sparse.

Typically, what we're thinking of is very long vectors where the column dimensions represent the output of usually hand-built feature functions that we've written, which might capture things like the binary sentiment of a word, or which part of speech it tends to have most dominantly, or whether or not it ends in a suffix like -ing for English, and so forth and so on.

We write a lot of these feature functions and, as a result, we have vectors that are mostly zeros because most words don't have these properties, but the few 1s carry important information. In the next phase, we have count-based methods. This is the introduction of the distributional hypothesis. I'm thinking of methods like pointwise mutual information (PMI) and TF-IDF, term frequency-inverse document frequency, a classic from information retrieval that we'll visit a bit later in the course.

For these methods, we begin from some count matrix. Typically, for PMI, it would be a word-by-word matrix where the cells in that matrix capture the number of times that each word co-occurs with the other words in our corpus. For TF-IDF, it would probably be a word-by-document matrix, and now we're capturing which words appear, and with what frequency, in all the documents in our corpus.
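To make the count matrix idea concrete, here is a minimal numpy sketch of positive PMI reweighting applied to a tiny, made-up word-by-word co-occurrence matrix, followed by a truncated SVD as a preview of the compression idea that comes up in the next phases. This is just my illustration of the general recipe, not code from the course notebooks.

```python
import numpy as np

# Toy word-by-word co-occurrence counts (rows and columns are words).
# The numbers are made up purely for illustration.
X = np.array([
    [10.0, 2.0, 0.0],
    [ 2.0, 8.0, 1.0],
    [ 0.0, 1.0, 5.0],
])

def pmi(X, positive=True):
    """Reweight a count matrix with (positive) pointwise mutual information."""
    total = X.sum()
    joint = X / total                              # P(w_i, w_j)
    p_rows = joint.sum(axis=1, keepdims=True)      # P(w_i)
    p_cols = joint.sum(axis=0, keepdims=True)      # P(w_j)
    with np.errstate(divide="ignore", invalid="ignore"):
        M = np.log(joint / (p_rows * p_cols))
    M[np.isinf(M) | np.isnan(M)] = 0.0             # zero counts get PMI 0
    return np.maximum(M, 0.0) if positive else M

reweighted = pmi(X)

# Classical dimensionality reduction: compress the reweighted matrix
# with a truncated SVD to get short, dense word vectors.
U, s, Vt = np.linalg.svd(reweighted)
dense_vectors = U[:, :2] * s[:2]
```

A TF-IDF analogue plays the same reweighting role for a word-by-document matrix.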

The idea behind these methods is that we take PMI or TF-IDF and we massage those counts in a way that leads to better representations. That's coming purely from distributional information. We typically don't write hand-built feature functions in this mode, but these vectors tend to be pretty sparse. The next phase is what I've called classical dimensionality reduction.

This is the introduction of dense representations. I have in mind methods like PCA, principal components analysis; SVD, singular value decomposition; and LDA, latent Dirichlet allocation. There's a whole family of these. Typically, what we're doing at this phase is taking maybe the output of representations built in the mode of step 2 there and compressing them.

In compressing them, we typically get denser, more informative representations that can also capture higher-order notions of distributional similarity and co-occurrence and so forth. That proved incredibly powerful. Then the fourth phase, which might be the final phase for this static vector representation approach, we have what I've called learned dimensionality reduction approaches.

Here again, we have dense representations. These could be the output of an autoencoder or a classical method like word2vec or GloVe. What we do at this phase is really essentially combine the count-based methods from step 2 in this history with the dimensionality reduction that we see from methods like SVD and PCA.

That leads us to very powerful, typically learned representations that have a tremendous capacity to capture higher-order notions of co-occurrence. That's a very fast overview. If you would like to get hands-on with these methods and really deeply understand them then check out this page that I've linked to here from the course website.

It links to a lot of notebooks and some older videos that could help you with a refresher on all these methods, or just get you up to speed so that you're ready to think directly about contextual representations. The representations that we get from these contextual models, which will be the focus for this unit and really for the entire course, really resonate with me as a linguist.

I thought I would just pause here and run through some examples that I think lead from the linguistics angle to the conclusion that word representations are highly sensitive to the context in which they're used. Let's start with a simple case like the vase broke and our focus will be on that English verb break.

Here, the sense of break is something like shatter. For dawn broke though, we have a superficially similar looking sentence, subject verb. But now the sense of break is more like begin. That's presumably conditioned by what we know about the subject and how that operates on the sense of the verb break for a stereotypical reading.

The news broke again, superficially similar sentence. The news is the subject, but now it seems to be conditioning a reading that is more like was announced or appeared or was published. Very different sense yet again. For Sandy broke the record, we have our first transitive case. Now the sense of break is something like surpass the previous world record.

But for Sandy broke the law, a different sense yet again. This is something like a transgression, again conditioned in this case by the direct object. In the burglar broke into the house, we now have break appearing with the particle into, and in this case it means something like enter forcibly without permission.

But the newscaster broke into the movie broadcast means something more like interrupt, a related but interestingly distinct meaning that is coming from the same break-plus-particle construction. Then in we broke even, just for another surprise, this is an entirely new sense: break plus even means something like we lost the same amount as we gained.

This is just a sample of the many, many senses that break can take on in English. What we see in these patterns is that the sense it has in context is driven transparently by the surrounding words, but maybe also by other things that are happening in the context. Similar things arise for adjectives like flat, as in flat beer, flat note, flat tire, flat surface.

That's the same adjective flat, but depending on what noun it modifies, you get very different senses. It can happen with a verb: throw a party, throw a fight, throw a ball, throw a fit. There might be some metaphorical sense in which all of these senses are related, but they are also clearly distinct, and they're being driven by something about how the direct object interacts with the verb.

We should extend this beyond just the simple morphosyntactic context. Let's think about ambiguity resolution coming from the things we say to each other. For the sentence a crane caught a fish, you probably infer that the sense of crane there is the bird sense. It's not determined by that, but that's the most likely inference.

Whereas for a crane picked up the steel beam, you probably infer in that case that the crane is a machine. Then, correspondingly, for I saw a crane, you might not, absent more context, know whether I mean a bird or a machine. But if we do embed this sentence in a larger context, you'll begin to make inferences about whether it was a bird or a machine.

That shows you that even things that are happening ambiently in the context of utterance can impact what sense crane has in these sentences. Now consider are there any typos? I didn't see any. The second sentence is clearly elliptical, and in this case, we probably infer that I didn't see any means I didn't see any typos.

Contrast that with are there any bookstores downtown? I didn't see any. It's an identical second sentence, but now the inference is that any means something like any bookstores. That is again showing that the senses these words take on in context can be driven by everything that is happening in the surrounding sentence, and also in the surrounding discourse, extending out into world knowledge and so forth.

That, for me, really shows that the static word vector representation approach was never really going to work out, because it insists that broke in all of those examples corresponds to a single vector, that flat has to be a single vector, and so forth and so on. What we actually see in how language works is much more malleability for individual word meanings, and that's precisely what contextual word representations allow us to capture.

I thought I would pause here just to offer a brief history of where these ideas come from, because it's a very recent history. Things are moving fast, but it's also interesting to track. I think a classic paper, maybe the starting point for this, is Dai and Le 2015. They really showed the value of language-model-style pre-training for downstream tasks, and the paper is fascinating to look at.

It's a pre-transformer paper. A lot of the things that they do look complicated in retrospect, and some of the simpler ideas that they offer we can now see, in retrospect, are incredibly powerful. We fast-forward a little bit to August 2017, McCann et al. This is the CoVe paper, and what they showed is that bidirectional LSTMs pre-trained for machine translation could offer us sequence representations that were a useful starting point for many other downstream tasks outside of MT.

That begins this move toward pre-training for contextual representations. It really takes off with ELMo in February 2018. That team was really the first to show that very large-scale pre-training of bidirectional LSTMs could lead to rich, multipurpose representations that were easily adapted to lots of downstream tasks via fine-tuning. In June 2018, we get GPT.

Then in October 2018, the BERT era truly begins. Devlin et al. 2019 is when the paper was published, but the work appeared before that and had already had tremendous influence by the time it was actually officially published. That BERT model is really the cornerstone of so much that we'll discuss in this unit and beyond.

There's a parallel dimension to all this, a parallel journey that I thought I would talk a little bit about as another guiding idea. I've put this under the heading of model structure and linguistic structure, and this is related also to this idea of what kinds of structural biases we build into our model architectures.

In the upper left, I have a simple model that's reviewed in those background materials, where you can imagine that each word in the sentence the rock rules is looked up in something like a GloVe space or a word2vec space. Then what this model does is simply add those representations together to get a representation for the entire sentence.

This is a very high bias model, in particular because we have to decide ahead of time that the way those representations will combine is via addition. You have to think that even if that's approximately correct, it's only going to be approximate; it would be a minor miracle if addition turned out to actually be the optimal way to combine those word meanings.

You could see that as giving rise to the models that are on the right here; this is a recurrent neural network. Again, you could imagine that we look up each one of those words in a static vector representation space like GloVe. But now we feed them into this RNN process that actually processes them with a bunch of new neural network parameters.

In that way, it could be trained, it could be taught to combine those word meanings in an optimal way. It could be the case that the optimal way to combine word meanings is with addition, and the models that we have over here on the right are certainly powerful enough to learn addition of those vectors if it's correct, but that is very unlikely to happen.

Probably, the model will learn a much more complicated function that might be much more nuanced. In that way, we've released some of the biases that we introduced over here, and we're going to allow ourselves to learn in a more free-form way from data about how to optimally combine words.
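To make this contrast concrete, here is a minimal PyTorch sketch of the two combination strategies: fixing the combination function to be addition versus letting a small RNN learn the combination from data. The dimensions and token ids are made up for illustration; this is not the course's own code.

```python
import torch
import torch.nn as nn

embed_dim, hidden_dim, vocab_size = 50, 50, 1000

# Stand-in for a static lookup space like GloVe or word2vec.
embedding = nn.Embedding(vocab_size, embed_dim)

# Token ids for a three-word sentence like "the rock rules" (made-up ids).
token_ids = torch.tensor([[4, 17, 42]])
word_vectors = embedding(token_ids)            # shape: (1, 3, embed_dim)

# High-bias model: the combination function is fixed to be addition.
sentence_rep_sum = word_vectors.sum(dim=1)     # shape: (1, embed_dim)

# Lower-bias model: an RNN with trainable parameters learns how to combine
# the word vectors; it could learn addition, but probably something richer.
rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
outputs, h_n = rnn(word_vectors)
sentence_rep_rnn = h_n[-1]                     # final hidden state, (1, hidden_dim)
```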

There's another dimension to this. If you go down on the left here, this is a tree-structured neural network. In this case, we decide ahead of time that we know the constituent structure, and then we might have a bunch of neural network parameters that combine the child nodes to create representations for the parents in a recursive fashion as we move up the tree.

That is very powerful in the sense that we could have lots of different functions that get learned for how to combine child nodes into their parents. But this model is high bias because it decides ahead of time about how the constituent structure should look. You have to believe or know a priori that the rock forms a constituent in a way that rock rules simply does not.

Whereas the models up here take a more free-form approach. Down in the right-hand corner, I have the models that we saw in the lead-up to the transformer. This could be something like a bidirectional RNN, where again we could look up the words in a static vector representation space, but now we have information flowing left to right in these hidden representations, and we might have added a bunch of attention mechanisms on top that essentially connect every hidden unit to every other hidden unit in this representation.

Whereas this had a presumption that we would process left to right, and this one had an assumption that we would process by constituent structure, this model down here says essentially anything goes. I think it's fair to say that a lesson of the transformer era is that anything goes, given sufficient data, is the most powerful mode to be in.

That really is a kind of insight behind the transformer architecture. The attention mechanisms that I mentioned there are really important, and this is also part of the journey that leads us to the transformer and might harbor most of its power. Let me give you a simple example of how attention worked, especially in the lead up to transformers.

Here I have a model that you might think of as an RNN; maybe we're processing, left to right, the sequence really not so good. We look up those words in some static vector representation space, and then we have our left-to-right process that leads to these hidden representations. Suppose now that I want to train a classifier on top of this final output state.

Well, I might worry, in doing that, that there'll be a lot of information about the words that are late in the sequence, but not enough information in this representation here about the words that were earlier in the sequence. Attention emerges as a way to remind ourselves, in late states, of what was in earlier ones.

We could do that with a scoring function, and here what I've depicted is a simple dot product scoring function, exactly the sort that we get from the transformer in essence. What it's doing is taking the dot product of our target representation with all of the previous hidden states. We softmax normalize those, and then we bring those into the representation that we are targeting, to get a context vector here.

We could take the average of all of them. Then finally, we get this attention combination H here, and that could be a neural network parameterized function that takes in this representation plus the attention representation that we created here, and feeds that through some parameters and a non-linearity. That finally gives us the representation that we feed into the classifier that we wanted to fit originally.

The idea here is that now our classification decision is based indeed on this representation at the end but now infused with a lot of information about how similar that representation is to the ones that preceded it. That is the essential idea behind dot product attention, which will be the beating heart, so to speak, of the transformer.
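Here is a minimal PyTorch sketch of that dot-product attention recipe, with made-up dimensions; it follows the steps just described (score, softmax, context vector, attention combination) rather than any specific published formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_dim, seq_len = 8, 4
torch.manual_seed(0)

# Pretend these came out of a left-to-right RNN over "really not so good".
hidden_states = torch.randn(seq_len, hidden_dim)   # h_1 ... h_T
h_target = hidden_states[-1]                       # final state h_T

# 1. Dot-product scores between the target state and every hidden state.
scores = hidden_states @ h_target                  # shape: (seq_len,)

# 2. Softmax-normalize the scores into attention weights.
alphas = F.softmax(scores, dim=0)

# 3. Context vector: weighted average of the hidden states.
context = (alphas.unsqueeze(1) * hidden_states).sum(dim=0)

# 4. Attention combination: feed [h_T; context] through parameters and a
#    non-linearity to get the representation handed to the classifier.
W = nn.Linear(2 * hidden_dim, hidden_dim)
h_tilde = torch.tanh(W(torch.cat([h_target, context], dim=0)))
```

The scoring-plus-softmax-plus-weighted-combination pattern is the piece that the transformer generalizes into its attention layers.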

Another idea that has proved so powerful is a notion of sub-word modeling. I thought I would take you on a brief journey of how we arrived in the current phase for this sub-word modeling, beginning with ELMo, because what ELMo did is truly fascinating. The ELMo word representation space begins with character level representations, and then it has a bunch of filters on top of those, and then it has a bunch of different convolutional layers that we then do max pooling over to get a representation for the entire sequence.

We do that at different layers here, and so those get concatenated up into these max pooling representations at the top, and those form the basis for word representations. The idea is that this gives us whole-word vectors that nonetheless have lots of information about the sub-word parts, all the way down to characters, including all of these convolutions of different lengths that capture different notions of sub-word within that space.
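As a rough sketch of that character-level idea (greatly simplified, with made-up sizes, and not ELMo's actual configuration), the pattern is: embed the characters of a word, run convolutions of several widths over them, max-pool each convolution over the character positions, and concatenate the results into a word vector.

```python
import torch
import torch.nn as nn

char_vocab_size, char_dim = 262, 16
filter_widths, n_filters = [1, 2, 3, 4], 32

char_embedding = nn.Embedding(char_vocab_size, char_dim)
convs = nn.ModuleList(
    nn.Conv1d(char_dim, n_filters, kernel_size=w) for w in filter_widths
)

# Character ids for one word, padded to a fixed length (made-up ids).
char_ids = torch.tensor([[5, 12, 7, 9, 3, 0, 0, 0]])        # (batch=1, n_chars=8)
chars = char_embedding(char_ids).transpose(1, 2)             # (1, char_dim, n_chars)

# One convolution per filter width, max-pooled over character positions,
# then concatenated into a single word vector.
pooled = [conv(chars).max(dim=2).values for conv in convs]   # each (1, n_filters)
word_vector = torch.cat(pooled, dim=1)                       # (1, 128)
```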

Incredibly visionary, I would say. One thing I should note, though, is that the ELMo vocabulary has about 100,000 words in it, which is an enormous vocabulary. Even still, if you deal with real text, you will find that you are mostly encountering words that are not in that vocabulary. Even if you double it to 200,000 or 300,000, now you're getting a really large embedding space, and you will still mostly encounter words that are unked out, that is, unknown to your model. That's an incredibly limiting factor for this whole-word approach, but we see in here the essence of the idea that we should model sub-words.

A big change happened for the transformer when we got, in parallel, this notion of word-piece tokenization. Here I'm going to give you a feel for that by looking at the BERT tokenizer. That gets loaded in cell 2 here, and then, when we call the tokenizer on the sentence This isn't too surprising, we get things that look mostly like whole words, especially if you're used to NLP, where the suffix of isn't has been broken apart and so has the punctuation.

But when we call the tokenizer on the sequence Encode me with an exclamation mark, notice that the word encode has been split apart into two pieces, en and then ##code, with the ## marker there indicating that it is a word-internal piece. If you tokenize a word like snuffleupagus, you get a sequence of 1, 2, 3, 4, 5, 6 pieces for that single word that came in.
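Here is the kind of call being described, using the Hugging Face transformers library; the exact pieces can vary with the tokenizer version, so treat the outputs in the comments as indicative rather than definitive.

```python
from transformers import BertTokenizer

# Load the BERT tokenizer (the step described as cell 2).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("This isn't too surprising."))
# Mostly whole words, with isn't and the punctuation broken apart.

print(tokenizer.tokenize("Encode me!"))
# encode is split into word pieces, something like ['en', '##code', 'me', '!'].

print(tokenizer.tokenize("snuffleupagus"))
# A single unseen word analyzed into several sub-word pieces.
```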

The effect here is that we can have a vanishingly small vocabulary. There are only around 30,000 words in this BERT tokenization space, so a very small embedding space. But nonetheless, when we encounter words that are outside of that space, they don't get unked out; rather, we analyze them into sub-word pieces that we do have embedding representations for.

Incredibly powerful, and in the context of a contextual model, we might have some hope that for cases like encode being split into two tokens, the model will learn internally that, in some sense, those form a coherent piece, a word. But we don't need that directly reflected in the tokenizer's vocabulary.

Incredible idea. A related idea for the transformer that is so foundational and that I think the field is still figuring out is positional encoding. When we talk about the transformer, you will see that it has almost no way of keeping track of the order of words in a sequence.

It's mostly a bunch of columns of different things that happen independently with some attention mechanisms bringing them together. To capture word order, we typically have something like a positional encoding. The most heavy-handed way to do that is to simply have, in addition to your word embedding space, a positional embedding space that simply records where individual words appear in the sequence.

Here, the is paired with one because it's at the start of the sequence. But if the was in position 4, it would be this fixed vector here combined with the embedding for position 4. Those get added together into what you might think of as the basis for contextual representation.
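A minimal sketch of that heavy-handed scheme, with made-up sizes: a learned positional embedding table sits alongside the word embedding table, and the two vectors are added position by position.

```python
import torch
import torch.nn as nn

vocab_size, max_len, dim = 30000, 512, 64

word_embedding = nn.Embedding(vocab_size, dim)
position_embedding = nn.Embedding(max_len, dim)   # one vector per position

# Token ids for a short sequence (made-up ids); positions are just 0, 1, 2, ...
token_ids = torch.tensor([[101, 1996, 2600, 102]])
positions = torch.arange(token_ids.size(1)).unsqueeze(0)

# The same word gets the same word vector wherever it appears, but the added
# positional vector differs depending on its position in the sequence.
inputs = word_embedding(token_ids) + position_embedding(positions)
```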

This has proved effective, but it has many limitations that we're going to talk about later in the unit, and we're going to explore ways to capture what's good about positional encoding while also overcoming some of the problems that it introduces. Then of course, one of the major guiding ideas behind all of this is simply massive scale pre-training.

This is an idea that was unlocked by the distributional hypothesis, which had the insight that we don't need to write hand-built feature functions but rather we can just rely on unlabeled corpora and keep track of which words are appearing with which other words. It really comes into its own in the neural era with models like Word2Vec and GloVe, as I mentioned.

Following that, we get the ELMo paper, and I mentioned before that that was the eye-opening moment when we saw that we could learn contextual representations at scale and have that transfer into tasks we wanted to fine-tune for. Of course, you get the GPT paper, and then BERT launches the BERT era.

Then you get, at the end of this little history here, the GPT-3 paper which applied this massive scale idea at a level that was previously unimagined and unimaginable, and that really did introduce a phase change in research as we started to deal with these truly massive models, and ask them to learn in context, that is just from prompts we offer them, how to perform tasks and so forth.

Then related to this, of course, is the idea of fine-tuning and its corresponding notion of pre-training. Here's a brief review of these ideas. In 2016-2018, the notion of pre-training that we had was essentially that we would feed static word representations into variants of RNNs, and then the model would be fine-tuned and it would learn a bunch of stuff.

When we fast-forward to the BERT era in 2018, we start to get fine-tuning of contextual models. Here is just a bit of code of a sort that you will write in this course, where we read in BERT representations and fine-tune them, essentially, to be a classifier. That started in 2018, and I think it continues.
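As a hedged sketch of what code like that looks like with the Hugging Face transformers library (the class name and dimensions here are illustrative, not the course's own implementation): load a pretrained BERT encoder, put a small classifier head on top, and train the whole thing on labeled data.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertClassifier(nn.Module):
    def __init__(self, n_classes, weights_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(weights_name)
        hidden = self.bert.config.hidden_size
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Use the final-layer representation of the [CLS] token as the
        # sequence representation, then classify it.
        cls_rep = outputs.last_hidden_state[:, 0]
        return self.classifier(cls_rep)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertClassifier(n_classes=2)
batch = tokenizer(["so good", "not so good"], padding=True, return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
# From here, an ordinary cross-entropy loss and optimizer step fine-tune
# both the classifier head and the underlying BERT parameters.
```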

We might be headed into an era in which most of the fine-tuning that happens is on these massive language models that we mostly don't have access to. We can't write code as in the BERT code there; rather, we just call an API, and some partially understood, partially known-to-us fine-tuning process fine-tunes some of the parameters of one of these large models to do what we want to do.

I'm hoping that we still see a lot more of this custom code being written; it's very powerful analytically and technologically. But the API mode is surely part of your future as an NLPer as well.