Stanford XCS224U: NLU | Contextual Word Representations, Part 1: Guiding Ideas | Spring 2023
00:00:06.280 |
Welcome to our unit on contextual word representations. 00:00:09.360 |
I thought I'd kick this off with some high-level guiding ideas. 00:00:12.780 |
The first thing I wanted to say is that in previous iterations of this course, 00:00:16.080 |
we spent about two weeks focused on static vector representations 00:00:20.480 |
of words and in some cases phrases and full sentences. 00:00:25.460 |
In this iteration, we're going to go directly to contextual word representations, which have 00:00:29.000 |
proven so powerful for today's NLP research and technologies. 00:00:35.720 |
Still, I thought I would briefly review the history that leads to contextual representations. 00:00:38.920 |
Let's rewind the clock back to what I've called feature-based, 00:00:42.400 |
classical lexical representations and the hallmark 00:00:45.520 |
of these representations is that they are sparse. 00:00:48.400 |
Typically, what we're thinking of is very long vectors where 00:00:52.120 |
the column dimensions represent the output of 00:00:54.680 |
usually hand-built feature functions that we've written that might 00:00:57.520 |
capture things like the binary sentiment classification of a word, 00:01:01.720 |
or which part of speech it tends to have most dominantly, 00:01:05.600 |
or whether or not it ends in a suffix like -ing in English, and so forth. 00:01:11.000 |
We write a lot of these feature functions and as a result, 00:01:13.560 |
we have vectors that are mostly zeros because most words don't have these properties, 00:01:18.000 |
but the few non-zero entries carry important information. 00:01:21.320 |
In the next phase, we have count-based methods. 00:01:24.400 |
This is the introduction of the distributional hypothesis. 00:01:27.640 |
I'm thinking of methods like point-wise mutual information and TF-IDF, 00:01:34.400 |
a classic from information retrieval that we'll visit a bit later in the course. 00:01:39.400 |
For these methods, we begin from some count matrix. 00:01:44.040 |
For PMI, it would be a word-by-word matrix where the cells in that matrix capture 00:01:47.800 |
the number of times each word co-occurs with the other words in our corpus. 00:01:52.680 |
For TF-IDF, it would probably be a word-by-document matrix, 00:01:56.200 |
and now we're capturing which words appear and with 00:01:59.240 |
what frequency in all the documents in our corpus. 00:02:02.760 |
The idea behind these methods is that we take PMI or 00:02:06.360 |
TF-IDF and we massage those counts in a way that leads to better representations. 00:02:12.000 |
That's coming purely from distributional information. 00:02:14.560 |
We typically don't write hand-built feature functions in this mode; we rely instead on patterns of co-occurrence in the data. 00:02:21.400 |
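To make the count-based idea concrete, here is a minimal sketch, not from the lecture, of reweighting a tiny word-by-word co-occurrence matrix with positive PMI; the vocabulary and counts are invented purely for illustration.

```python
import numpy as np

# Toy word-by-word co-occurrence counts (rows and columns share one vocabulary).
vocab = ["the", "vase", "broke", "news"]
counts = np.array([
    [0., 20., 10., 15.],
    [20., 0., 5., 0.],
    [10., 5., 0., 3.],
    [15., 0., 3., 0.],
])

def ppmi(X, smooth=1e-8):
    """Reweight a count matrix with positive pointwise mutual information."""
    total = X.sum()
    p_xy = X / total                                  # joint probabilities
    p_x = X.sum(axis=1, keepdims=True) / total        # row marginals
    p_y = X.sum(axis=0, keepdims=True) / total        # column marginals
    pmi = np.log((p_xy + smooth) / (p_x @ p_y + smooth))
    return np.maximum(pmi, 0.0)                       # clip negatives: PPMI

print(np.round(ppmi(counts), 2))
```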
The next phase is what I've called classical dimensionality reduction. 00:02:25.360 |
This is the introduction of dense representations. 00:02:39.160 |
Typically, what we're doing at this phase is taking the output of 00:02:42.720 |
representations built in the mode of step 2 and compressing them with a method like SVD or PCA. 00:02:47.760 |
In compressing them, we typically get denser, 00:02:51.000 |
more informative representations that can also capture 00:02:54.240 |
higher-order notions of distributional similarity and co-occurrence and so forth. 00:03:01.960 |
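As a rough illustration of this compression step (my own sketch, with a random matrix standing in for a reweighted count matrix), truncated SVD in the style of latent semantic analysis yields dense, lower-dimensional word vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
# Pretend this is a (vocab_size x vocab_size) PPMI-reweighted count matrix.
X = np.abs(rng.normal(size=(1000, 1000)))

k = 100                                    # target dimensionality
U, s, Vt = np.linalg.svd(X, full_matrices=False)
word_vectors = U[:, :k] * s[:k]            # dense k-dimensional word vectors

print(word_vectors.shape)                  # (1000, 100)
```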
Then the fourth phase, which might be the final phase for this history of static representations, 00:03:07.440 |
is what I've called learned dimensionality reduction approaches, 00:03:15.880 |
with methods like word2vec or GloVe. 00:03:19.240 |
What we do at this phase is essentially combine 00:03:23.040 |
the count-based methods from step 2 in this history with 00:03:27.240 |
the dimensionality reduction that we see from methods like SVD and PCA. 00:03:37.920 |
The result is dense representations with a tremendous capacity to capture higher-order notions of co-occurrence. 00:03:44.760 |
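If you want a feel for this phase in code, here is a small sketch using the gensim library; the choice of toolkit is my assumption (the lecture doesn't name one), and the toy corpus is far too small for real training.

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (real training needs far more text).
corpus = [
    ["the", "vase", "broke"],
    ["the", "news", "broke"],
    ["the", "day", "broke"],
]

# Skip-gram word2vec producing 50-dimensional static vectors.
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["broke"].shape)         # (50,)
print(model.wv.most_similar("broke"))  # nearest neighbors in the learned space
```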
If you would like to get hands-on with these methods and really 00:03:51.360 |
dig into the details, check out this page that I've linked to here from the course website. 00:03:54.240 |
It links to a lot of notebooks and some older videos that 00:03:57.320 |
could help you with a refresher on all these methods, or just get you up to 00:04:01.240 |
speed so that you're ready to think directly about contextual representations. 00:04:09.540 |
I have to say, these contextual models that will be the focus for this unit, and really for 00:04:12.860 |
the entire course, really resonate with me as a linguist. 00:04:16.840 |
I thought I would just pause here and run through some examples that I think 00:04:21.040 |
lead from the linguistics angle to the conclusion that 00:04:24.200 |
word representations are highly sensitive to the context in which they're used. 00:04:29.040 |
Let's start with a simple case like the vase broke and 00:04:31.880 |
our focus will be on that English verb break. 00:04:35.000 |
Here, the sense of break is something like shatter. 00:04:40.600 |
Compare that with a superficially similar-looking sentence, subject plus verb. 00:04:44.180 |
But now the sense of break is more like begin. 00:04:47.960 |
That's presumably conditioned by what we know about the subject and how 00:04:52.440 |
that operates on the sense of the verb break for a stereotypical reading. 00:04:56.760 |
The news broke: again, a superficially similar sentence, 00:05:01.420 |
but now it seems to be conditioning a reading that is more 00:05:04.220 |
like was announced or appeared or was published. 00:05:13.640 |
In other examples, the sense of break is something like surpass the previous world record, 00:05:23.540 |
again conditioned in this case by the direct object. 00:05:28.800 |
Next, we have break appearing with the particle into, 00:05:31.640 |
and in this case it means something like enter forcibly without permission. 00:05:35.560 |
But the newscaster broke into the movie broadcast means something more like interrupt, 00:05:40.040 |
a related but interestingly distinct meaning that is coming 00:05:43.360 |
from the same break plus particle construction. 00:05:50.960 |
With break even, this is an entirely new sense: break plus even means 00:05:53.640 |
something like we lost the same amount as we gained. 00:06:00.560 |
And these are just a few of the many senses that break can take on in English. 00:06:03.640 |
What we see in these patterns is that the sense break does have in context is 00:06:08.280 |
driven transparently by the surrounding words, 00:06:11.200 |
but maybe also by other things that are happening in the context. 00:06:15.000 |
Similar things arise for adjectives like flat, whose sense shifts with the noun it modifies. 00:06:34.880 |
There might be some metaphorical sense in which all of these senses are related, 00:06:39.200 |
but they are also clearly distinct, and they're being driven 00:06:42.800 |
by something about how the surrounding words interact with the adjective or verb in question. 00:06:47.400 |
We should extend this beyond just the simple morphosyntactic context. 00:06:52.920 |
Let's think about ambiguity resolution coming from the things we say to each other. 00:07:00.120 |
If the crane in my sentence is doing something only a bird could do, you probably infer that there the sense of crane is a bird. 00:07:11.160 |
If it is doing something only a machine could do, you probably infer in that case that a crane is a machine. 00:07:18.080 |
And if I just say something like I saw a crane, you might not, absent more context, know whether I mean a bird or a machine. 00:07:23.120 |
But if we do embed this sentence in larger context, 00:07:26.120 |
you'll begin to make inferences about whether it was a bird or a machine. 00:07:30.680 |
That shows you that even things that are happening ambiently in the context of utterance 00:07:35.120 |
can impact what sense crane has in these sentences. 00:07:45.840 |
Consider also the word any: if I have been talking about typos and say I didn't see any, we probably infer that I didn't see any means I didn't see any typos. 00:07:50.760 |
Contrast that with are there any bookstores downtown? 00:07:57.240 |
Here the inference is that any means something like any bookstores. 00:08:01.680 |
That is again showing that the senses these words take on in context can be 00:08:06.520 |
driven by everything that is happening in the surrounding sentence, 00:08:12.900 |
extending out into things about world knowledge and so forth. 00:08:17.200 |
That for me really shows that the static word vector representation approach was 00:08:22.200 |
never really going to work out, because it insists that broke in all those examples 00:08:28.520 |
corresponds to a single vector, that flat, into, and any each have to be a single vector, and so forth and so on. 00:08:36.400 |
What we actually see in how language works is 00:08:39.400 |
much more malleability for individual word meanings, 00:08:42.920 |
and that's precisely what contextual word representations allow us to capture. 00:08:52.480 |
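To see this malleability directly, here is a brief sketch, my own rather than the lecture's, that uses the Hugging Face transformers library to pull out the contextual vector for broke in two of the example sentences; a static embedding would be forced to assign the same vector in both.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Two of the example sentences: same surface form "broke", different senses.
sentences = ["The vase broke.", "The news broke."]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

vectors = []
for sent in sentences:
    enc = tokenizer(sent, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]        # (seq_len, 768)
    # Find the position of "broke" and grab its contextual vector.
    idx = enc.input_ids[0].tolist().index(tokenizer.convert_tokens_to_ids("broke"))
    vectors.append(hidden[idx])

# A static embedding would make these identical; contextually they differ.
sim = torch.cosine_similarity(vectors[0], vectors[1], dim=0)
print(f"cosine similarity between the two 'broke' vectors: {sim.item():.3f}")
```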
Let me also offer a brief history of where these ideas come from, because it's a very recent history. 00:08:56.760 |
Things are moving fast, but it's also interesting to track. 00:09:01.200 |
Maybe the starting point for this is Dai and Le 2015. 00:09:05.040 |
They really showed the value of language model style pre-training for 00:09:08.920 |
downstream tasks and the paper is fascinating to look at. 00:09:14.160 |
A lot of the things that they do look complicated in retrospect, 00:09:17.540 |
and some of the simpler ideas that they offer we can now see in retrospect are incredibly powerful. 00:09:23.560 |
We fast forward a little bit to August 2017 and McCann et al., 00:09:29.240 |
who showed that pre-trained bidirectional LSTMs for 00:09:33.640 |
machine translation could offer us sequence representations that were 00:09:38.480 |
a useful starting point for many other downstream tasks outside of MT. 00:09:44.360 |
That begins this move toward pre-training for contextual representations. 00:09:49.440 |
It really takes off with ELMo in February 2018. 00:09:53.720 |
That team was really the first to show that very large-scale pre-training of 00:09:58.400 |
bidirectional LSTMs could lead to rich multipurpose representations that were 00:10:03.320 |
easily adapted to lots of downstream tasks via fine-tuning. 00:10:16.240 |
Then, of course, BERT. Devlin et al. 2019 is when the paper was published, but the model 00:10:21.960 |
appeared earlier and already had tremendous influence by the time it was actually officially published. 00:10:26.200 |
That BERT model is really the cornerstone of so much that we'll discuss in this unit and beyond. 00:10:36.640 |
The field has also been on a parallel journey that I thought I would talk a little bit about as another guiding idea. 00:10:41.520 |
I've put this under the heading of model structure and linguistic structure, 00:10:45.600 |
and this is related also to this idea of what kinds 00:10:49.600 |
of structural biases we build into our model architectures. 00:10:54.200 |
In the upper left, I have a simple model that's reviewed in those background materials, 00:10:58.800 |
where you can imagine that each word in the sentence 00:11:01.880 |
the Rock rules is looked up in something like a GloVe space or a word2vec space. 00:11:07.240 |
Then what this model does is simply add those representations 00:11:10.400 |
together to get a representation for the entire sentence. 00:11:17.200 |
This is a high-bias model, in particular because we have to decide ahead of time that 00:11:20.920 |
the way those representations will combine will be via addition. 00:11:25.520 |
You have to think, even if that's approximately correct, 00:11:30.360 |
it would be a minor miracle if addition turned out to 00:11:33.760 |
be actually the optimal way to combine those word meanings. 00:11:38.040 |
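Here is a minimal sketch of that additive model; the embedding table is random, standing in for a pre-trained static space like GloVe.

```python
import torch

# Stand-in for a static lookup table (e.g., GloVe); the vectors are random here.
vocab = {"the": 0, "rock": 1, "rules": 2}
embeddings = torch.nn.Embedding(len(vocab), 50)

def sentence_vector(tokens):
    """Additive model: the sentence representation is just the sum of word vectors."""
    ids = torch.tensor([vocab[t] for t in tokens])
    return embeddings(ids).sum(dim=0)

print(sentence_vector(["the", "rock", "rules"]).shape)  # torch.Size([50])
```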
You could see that as giving rise to the models that are on the right here, recurrent neural networks. 00:11:44.080 |
Again, you could imagine that we look up each one of those words in 00:11:47.600 |
a static vector representation space like GloVe. 00:11:51.400 |
But now we feed them into this RNN process that actually 00:11:55.240 |
processes them with a bunch of new neural network parameters. 00:12:03.240 |
The hope is that, with enough data, it could be taught to combine those word meanings in an optimal way. 00:12:07.560 |
It could be the case that the optimal way to combine word meanings is with addition, 00:12:12.880 |
and the models that we have over here on the right are certainly powerful 00:12:16.520 |
enough to learn addition of those vectors if it's correct, but they could also learn 00:12:24.080 |
a much more complicated function that might be much more nuanced. 00:12:27.800 |
In that way, we've released some of the biases that we introduced over here, 00:12:33.080 |
and we're going to allow ourselves to learn in a more free-form way 00:12:36.320 |
from data about how to optimally combine words. 00:12:45.040 |
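And here is the corresponding lower-bias variant sketched in PyTorch: the same static lookup, but an RNN with learnable parameters decides how the word vectors get combined (dimensions are invented).

```python
import torch
import torch.nn as nn

# Static word vectors (stand-in for GloVe), plus an RNN that learns how to combine them.
vocab = {"the": 0, "rock": 1, "rules": 2}
embeddings = nn.Embedding(len(vocab), 50)
rnn = nn.RNN(input_size=50, hidden_size=64, batch_first=True)

def sentence_vector(tokens):
    """Run the looked-up vectors through the RNN; use the final hidden state."""
    ids = torch.tensor([[vocab[t] for t in tokens]])       # shape (1, seq_len)
    outputs, h_n = rnn(embeddings(ids))                     # h_n: (1, 1, 64)
    return h_n.squeeze()

print(sentence_vector(["the", "rock", "rules"]).shape)      # torch.Size([64])
```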
A different option is a tree-structured recursive model. In this case, we decide ahead of time that we know the constituent structure, 00:12:49.120 |
and then we might have a bunch of neural network parameters that combine 00:12:52.840 |
the child nodes to create representations for 00:12:55.440 |
the parents in a recursive fashion as we move up the tree. 00:12:59.200 |
That is very powerful in the sense that we could have 00:13:04.880 |
learned parameters for how to combine child nodes into their parents. 00:13:08.240 |
But this model is high bias because it decides ahead of 00:13:11.680 |
time about how the constituent structure should look. 00:13:18.120 |
For example, it assumes that the rock forms a constituent in a way that rock rules simply does not. 00:13:22.600 |
Whereas the models up here take a more free-form approach. 00:13:29.120 |
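A tree-structured model of the sort just described might be sketched like this, with one learned composition function applied bottom-up over an assumed constituent structure; all names and dimensions here are my own illustration.

```python
import torch
import torch.nn as nn

# One composition function, applied recursively: parent = tanh(W [left; right] + b).
dim = 50
compose = nn.Linear(2 * dim, dim)

def parent(left, right):
    return torch.tanh(compose(torch.cat([left, right])))

# Hypothetical leaf vectors for "the", "rock", "rules" (random stand-ins for GloVe).
the, rock, rules = (torch.randn(dim) for _ in range(3))

# The tree [[the rock] rules]: combine "the" and "rock" first, then add "rules".
np_node = parent(the, rock)
sentence = parent(np_node, rules)
print(sentence.shape)  # torch.Size([50])
```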
Down at the bottom here, I have the models that we saw in the lead-up to the transformer, 00:13:36.760 |
where again, we could look up the words in a static vector representation space, 00:13:40.840 |
but now we have information flowing left to right in these hidden representations, 00:13:45.200 |
and we might have added a bunch of attention mechanisms on top that 00:13:49.680 |
essentially connect each hidden unit to 00:13:52.080 |
every other hidden unit in this representation. 00:13:55.240 |
Whereas this had a presumption that we would process left to right, 00:13:58.520 |
and this one had an assumption that we would process by constituent structure, 00:14:02.640 |
this model down here says essentially anything goes. 00:14:08.480 |
One lesson of the transformer era is that anything goes, given sufficient data. 00:14:15.480 |
That really is a kind of insight behind the transformer architecture. 00:14:21.040 |
The attention mechanisms that I mentioned there are really important, 00:14:25.320 |
and this is also part of the journey that leads us to 00:14:27.640 |
the transformer and might harbor most of its power. 00:14:31.240 |
Let me give you a simple example of how attention worked in these earlier models. 00:14:37.880 |
Here I have a model that you might think of as an RNN, 00:14:40.320 |
where maybe we're processing the sequence really not so good left to right. 00:14:44.080 |
We look up those words in some static vector representation space, 00:14:48.860 |
and then we have our left-to-right process that builds up hidden representations across the sequence. 00:14:53.280 |
Suppose now that I want to train a classifier on top of this final output state. 00:15:02.040 |
The concern is that there'll be a lot of information about the words that are late in the sequence, 00:15:06.280 |
but not enough information in this representation 00:15:09.080 |
here about the words that were earlier in the sequence. 00:15:12.640 |
Attention emerges as a way to remind ourselves, in late states, of what came earlier in the sequence, 00:15:22.120 |
and here what I've depicted is a simple dot product scoring function, 00:15:25.780 |
exactly the sort that we get from the transformer in essence. 00:15:31.560 |
We take the dot product of our target representation with all of the previous hidden states. 00:15:39.080 |
Those scores tell us how similar each earlier state is to the target, and then we bring them into the representation that we are targeting. 00:15:48.220 |
Then finally, we get this attention combination H here, 00:15:52.360 |
and that could be a neural network parameterized function that takes in 00:15:56.600 |
this representation plus the attention representation that we created here, 00:16:01.760 |
and feeds that through some parameters and a non-linearity. 00:16:06.160 |
That finally gives us the representation that we feed 00:16:09.640 |
into the classifier that we wanted to fit originally. 00:16:13.400 |
The idea here is that now our classification decision is based indeed on this representation 00:16:19.680 |
at the end but now infused with a lot of information about how 00:16:23.880 |
similar that representation is to the ones that preceded it. 00:16:27.600 |
That is the essential idea behind dot product attention, and it carries over directly into the transformer. 00:16:36.480 |
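Here is a small sketch of that dot-product attention step, my own illustration with invented dimensions; I've used a softmax to normalize the scores, which is one standard choice, before combining the attention summary with the target state.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim, seq_len = 8, 4

# Hidden states for "really not so good"; the last one is our target state.
hiddens = torch.randn(seq_len, dim)
target = hiddens[-1]
previous = hiddens[:-1]

# 1. Score the target against every previous state with a dot product.
scores = previous @ target                     # shape (seq_len - 1,)

# 2. Normalize the scores and form a weighted average of the previous states.
weights = F.softmax(scores, dim=0)
context = weights @ previous                   # shape (dim,)

# 3. Combine the target with the attention summary via parameters + nonlinearity.
W = torch.nn.Linear(2 * dim, dim)
h_tilde = torch.tanh(W(torch.cat([target, context])))
print(h_tilde.shape)                           # torch.Size([8]); fed to the classifier
```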
Another idea that has proved so powerful is a notion of sub-word modeling. 00:16:42.240 |
I thought I would take you on a brief journey of how we 00:16:44.520 |
arrived at the current phase of sub-word modeling, starting with ELMo. 00:16:53.080 |
The ELMo word representation space begins with character level representations, 00:16:58.960 |
and then it has a bunch of filters on top of those, 00:17:04.240 |
convolutional layers with different filter sizes that we then do max pooling 00:17:08.120 |
over to get a representation for the entire character sequence. 00:17:16.080 |
We gather up these max pooling representations at the top, 00:17:19.180 |
and those form the basis for word representations, 00:17:22.560 |
and the idea is that this gives us whole word vectors that nonetheless have 00:17:28.080 |
lots of information about the sub-word parts, all the way down to characters, because the filters have 00:17:35.440 |
different lengths that capture different notions of sub-word structure within that space. 00:17:42.640 |
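In the spirit of that character-level design (a simplified sketch of my own, not ELMo's actual configuration), a word vector can be built from character embeddings, convolution filters of several widths, and max pooling:

```python
import torch
import torch.nn as nn

# Character embeddings plus convolution filters of several widths, max-pooled,
# roughly in the spirit of the ELMo character CNN (sizes invented for illustration).
char_vocab_size, char_dim = 262, 16
char_emb = nn.Embedding(char_vocab_size, char_dim)
convs = nn.ModuleList([nn.Conv1d(char_dim, 32, kernel_size=k) for k in (2, 3, 4)])

def word_vector(char_ids):
    x = char_emb(torch.tensor(char_ids)).T.unsqueeze(0)      # (1, char_dim, n_chars)
    pooled = [conv(x).max(dim=2).values for conv in convs]   # max pool over positions
    return torch.cat(pooled, dim=1).squeeze(0)               # (3 * 32,) word vector

print(word_vector([10, 42, 97, 5, 33]).shape)                 # torch.Size([96])
```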
One thing I should note though is that the ELMo vocabulary has about 100,000 words in it, 00:17:52.840 |
but as you process new text, you will find that you are mostly encountering words that are not in that vocabulary, 00:17:57.980 |
and even if you double or triple it to 200,000 or 300,000, 00:18:02.340 |
now you're getting a really large embedding space, 00:18:04.600 |
you will still mostly encounter words that are unked out, 00:18:11.320 |
and that's an incredibly limiting factor for this whole word approach, 00:18:15.600 |
but we see in here the essence of the idea that we should model sub-words. 00:18:20.600 |
A big change happened for the transformer when we 00:18:24.320 |
got in parallel this notion of word piece tokenization. 00:18:28.280 |
Here I'm going to give you a feel for that by looking at the BERT tokenizer. 00:18:35.400 |
We load the tokenizer, and then when we call it on a sentence, 00:18:39.520 |
we get things that look mostly like whole words, 00:18:42.420 |
or at least familiar if you're used to NLP: the suffix for 00:18:45.980 |
isn't has been broken apart, and so has the punctuation. 00:18:49.500 |
But when we call the tokenizer on the sequence encode me with an exclamation mark, 00:18:54.860 |
notice that the word encode has been split apart into two pieces, 00:18:59.640 |
en and then ##code, with those markers indicating that ##code is a word-internal piece. 00:19:16.900 |
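You can reproduce that behavior with the Hugging Face transformers library; the first input sentence here is my own stand-in, but encode me! is the example from the lecture.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Mostly whole words, with "isn't" and the punctuation split apart.
print(tokenizer.tokenize("This isn't hard."))

# "encode" is not in the vocabulary, so it is analyzed into pieces;
# "##" marks a word-internal piece.
print(tokenizer.tokenize("encode me!"))   # ['en', '##code', 'me', '!']

print(tokenizer.vocab_size)               # roughly 30,000 word pieces
```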
The effect here is that we can have a vanishingly small vocabulary. 00:19:21.540 |
There are only around 30,000 word pieces in this BERT tokenization space, 00:19:27.780 |
But nonetheless, when we encounter words that are outside of that space, 00:19:33.560 |
we don't simply mark them as UNK, but rather we analyze them into sub-word pieces that we do have in the vocabulary. 00:19:40.460 |
This is incredibly powerful, and in the context of a contextual model, 00:19:45.420 |
we might have some hope that for cases like encode being split into two tokens, 00:19:50.740 |
the model will learn internally that, in some sense, those two pieces constitute a single word. 00:19:56.620 |
But we don't need that directly reflected in the tokenizer's vocabulary. 00:20:06.320 |
Another piece of the transformer that is so foundational, and that I think 00:20:08.860 |
the field is still figuring out, is positional encoding. 00:20:16.700 |
On its own, the transformer has hardly any way of keeping track of the order of words in a sequence. 00:20:19.300 |
It's mostly a bunch of columns of different things that happen 00:20:22.860 |
independently with some attention mechanisms bringing them together. 00:20:29.240 |
To give the model a notion of word order, we typically have something like a positional encoding. 00:20:32.420 |
The most heavy-handed way to do that is to simply have 00:20:41.060 |
a second embedding space that records where individual words appear in the sequence. 00:20:44.940 |
Here, the word the is paired with position 1 because it's at the start of the sequence. 00:20:50.900 |
If it appeared later, say in position 4, it would be this fixed vector here combined with the embedding for position 4. 00:20:56.560 |
Those get added together into what you might think of as a position-sensitive representation of the word. 00:21:05.060 |
That simple scheme works, but it has many limitations that we're going to talk about later in the unit, 00:21:08.620 |
and we're going to explore ways to capture what's good about it while avoiding those limitations. 00:21:17.060 |
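Here is a minimal sketch of that heavy-handed scheme, with a learned embedding per absolute position added to the word embedding; all of the sizes are invented.

```python
import torch
import torch.nn as nn

# Absolute positional embeddings added to word embeddings (the "heavy-handed" scheme).
vocab_size, max_len, dim = 30_000, 512, 64
word_emb = nn.Embedding(vocab_size, dim)
pos_emb = nn.Embedding(max_len, dim)       # one learned vector per position

def embed(token_ids):
    positions = torch.arange(token_ids.size(0))        # 0, 1, 2, ...
    return word_emb(token_ids) + pos_emb(positions)    # same word, different vector per position

print(embed(torch.tensor([101, 7592, 2088, 102])).shape)   # torch.Size([4, 64])
```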
Then of course, one of the major guiding ideas behind 00:21:21.620 |
all of this is simply massive scale pre-training. 00:21:25.500 |
This is an idea that was unlocked by the distributional hypothesis, 00:21:29.980 |
which had the insight that we don't need to write 00:21:32.100 |
hand-built feature functions but rather we can just rely on 00:21:38.140 |
which words are appearing with which other words. 00:21:40.980 |
It really comes into its own in the neural era with 00:21:44.520 |
models like Word2Vec and GloVe, as I mentioned. 00:21:50.660 |
Then we got ELMo, and I mentioned before that that was the eye-opening moment when we 00:21:53.540 |
saw that we could learn contextual representations at 00:21:57.100 |
scale and have that transfer into tasks we wanted to fine-tune for. 00:22:08.460 |
Then you get, at the end of this little history here, 00:22:11.780 |
the GPT-3 paper which applied this massive scale idea at 00:22:16.300 |
a level that was previously unimagined and unimaginable, 00:22:20.180 |
and that really did introduce a phase change in research as we 00:22:23.780 |
started to deal with these truly massive models. 00:22:39.300 |
The last guiding idea I want to trace is our changing notion of fine-tuning and its corresponding notion of pre-training. 00:22:44.340 |
In 2016-2018, the notion of pre-training that we had was 00:22:49.300 |
essentially that we would feed static word representations 00:22:56.520 |
into a model like an RNN; that model would then be fine-tuned and it would learn a bunch of stuff. 00:22:59.100 |
When we fast-forward to the BERT era in 2018, 00:23:02.800 |
we start to get fine-tuning of contextual models. 00:23:05.980 |
Here is just a bit of code of the sort that you will write in this course, 00:23:13.140 |
where we load pre-trained BERT parameters and fine-tune them essentially to be a classifier. 00:23:16.820 |
That started in 2018, and I think it continues. 00:23:19.940 |
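The slide's code isn't preserved in the transcript, but a sketch of that kind of BERT-era fine-tuning, using the Hugging Face transformers library as an assumed toolkit, might look like this:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load pre-trained BERT parameters and put a fresh classifier head on top.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# One toy batch; in practice you would loop over a labeled dataset.
batch = tokenizer(["The vase broke.", "We broke even."], padding=True, return_tensors="pt")
labels = torch.tensor([0, 1])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss   # cross-entropy over the classifier head
loss.backward()
optimizer.step()
```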
We might be headed into an era in which most of what we do involves interacting with 00:23:24.980 |
these massive language models that we mostly don't have access to. 00:23:28.460 |
We can't write code as in the BERT case there, 00:23:31.780 |
but rather we just call an API, and through some process that is only partially understood, we coax 00:23:41.500 |
one of these large models to do what we want to do. 00:23:47.760 |
It's very powerful analytically and technologically, 00:23:51.060 |
and this is surely part of your future as an NLP-er as well.