Stanford XCS224U: NLU | Contextual Word Representations, Part 3: Positional Encoding | Spring 2023
This is part 3 in our series on contextual representations. Before we turn to the architectures that we're going to talk about a bit later, I thought it would be good to pause and just reflect a little bit on this important notion of positional encoding. I certainly took it for granted for too long, and I think we now see that this is a crucial factor in shaping the performance of transformer-based models.
Let's remind ourselves of why we need positional encoding in the context of the transformer. The transformer itself has only a very limited capacity to keep track of word order. The attention mechanisms are themselves not directional, and there are no other interactions between the columns. We are in grave danger of losing track of the fact that the input sequence ABC is different from the input sequence CBA. Positional encodings will ensure that we retain a difference between those two sequences no matter what we do with the representations that come from the model.
There is a second role that positional encodings can play, which is hierarchical. They've been used to keep track of things like premise and hypothesis in natural language inference. That was an important feature of the BERT model that we'll talk about a bit later in the series.
I think there are a lot of perspectives that you could take on positional encoding. To keep things simple, I thought I would center our discussion around two crucial questions. The first is, does the set of positions need to be decided ahead of time? The second is, does the positional encoding scheme hinder generalization to new positions? I think those are good questions to guide us.
One other rule that I wanted to introduce is the following. Modern transformer architectures might impose a maximum sequence length for many reasons related to how they were designed and optimized. I would like to set all of that aside and just ask whether the positional encoding scheme itself is imposing anything about length generalization, separately from all that other stuff that might be happening.
Let's start with absolute positional encoding. This is the scheme that we have talked about so far. On this scheme, we have word representations, and we also have positional representations that we have learned, corresponding to some fixed number of dimensions. To get our position-sensitive word representation, we simply add together the word vector with the position vector.
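To make that concrete, here is a minimal sketch of what learned absolute positional encoding might look like, assuming a PyTorch-style setup; the class name, sizes, and the max_len cap are illustrative assumptions rather than details of any particular model.

```python
import torch
import torch.nn as nn

class AbsolutePositionalEmbedding(nn.Module):
    """Learned absolute positional encoding: one learned vector per position,
    added elementwise to the word embedding (illustrative sketch)."""
    def __init__(self, vocab_size=10000, max_len=512, d_model=64):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        # The set of positions is fixed ahead of time: max_len of them.
        self.pos_emb = nn.Embedding(max_len, d_model)

    def forward(self, token_ids):
        # token_ids: (batch_size, seq_len). If seq_len exceeds max_len,
        # there is simply no positional vector for the extra positions.
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.word_emb(token_ids) + self.pos_emb(positions)
```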
How is this scheme doing for our two crucial questions? Not especially well, I think. First, the set of positions does need to be decided ahead of time: if a sequence comes in that is longer than the set of positions we learned representations for, we will not have a positional representation for that position. Second, the scheme can hinder generalization to new positions. Just consider the fact that the rock as a phrase early in a sequence will be represented quite differently from the rock if it appears later in the sequence. There will be some shared features across these two as a result of the fact that we have two word vectors involved in both places. But we add in those positional representations as equal partners, and I think the result is very heavy-handed when it comes to learning representations that are heavily position-dependent. That could make it hard for the model to see that, in some sense, the same phrase has appeared at the start of the sequence or the middle or the end.
The second scheme we could consider goes all the way back to the original Transformer paper. I've called this frequency-based positional encoding. There are different ways to set this up, but the essential idea here is that we'll define a mathematical function that, given a position, returns a vector that encodes information about that position semantically in its structure. In the Transformer paper, they picked a scheme that's based in frequency oscillation, essentially based in sine and cosine frequencies for these vectors, where higher positions oscillate more frequently. I think there are lots of other schemes that we could use. The essential feature of this is the argument pos here: because the encoding is a function of pos, every input position gets a vector that encodes information about the relative position of that input.
We have definitely overcome the first limitation: the set of positions does not need to be decided ahead of time, in the sense that the function can give us a new vector for any position that you give us.
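As a rough illustration of why any position can be encoded, here is one common way of writing down the sine and cosine scheme from the Transformer paper; the function name and the dimensionality are assumptions for the example.

```python
import math
import torch

def sinusoidal_encoding(pos, d_model=64):
    """Frequency-based positional encoding: sines and cosines at different
    frequencies, computed from the position rather than looked up in a
    learned table (d_model assumed even in this sketch)."""
    pe = torch.zeros(d_model)
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe[i] = math.sin(angle)
        pe[i + 1] = math.cos(angle)
    return pe

# Because the encoding is a function of pos, even a position never seen
# during training still gets a vector.
vec = sinusoidal_encoding(pos=5000)
```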
But I think our second question remains pressing. Just as before, this scheme can hinder generalization to new positions, even for familiar phenomena, in virtue of the fact that we are taking those word representations and adding in these positional ones for different positions as equal partners, as I said, and I think that makes it hard for models to see that the same phrase could appear in multiple places.
That takes us to relative positional encoding, which I think is the most promising of the three schemes that we're going to discuss.
This is a picture of the attention layer of the transformer. We have our three position-sensitive inputs here. Remember, it's crucial that they be position sensitive because of how much symmetry there is in these dot product attention mechanisms.
What I've depicted at the bottom of the slide here is a new set of position vectors that we're going to learn representations for. They get added into the keys in this multiplied attention mechanism, plus the things we're attending to, the values. Those are the new crucial parameters that we're adding in here.
In fact, with all the position sensitivity that's going to be encoded in these vectors, we don't need these green representations here anymore to have positional information in them, because that positional information is now being introduced in the attention layer, because we're going to have potentially new vectors for every combination of positions that we attend from and to.
I think the really powerful thing about this method is the notion of having a positional encoding window. I've repeated the core calculation at the top here as a reminder. Here's the input sequence that we'll use as an example, and underneath it, just integers corresponding to the positions. Those aren't directly ingredients into the model, but they will help us keep track of where we are in the calculations.
If we follow the letter of the definitions that I've offered so far for the keys here, we're going to have a vector A_44 corresponding to us attending from position 4 to position 4. As part of creating this more limited window-based version of the model, we're actually going to map that into a single vector W_0 for the keys. If instead we attend from position 4 to position 3, we would have a vector A_43 for the keys, but what we're going to do is map that into a single vector W_-1. Attending from position 4 to position 2 gives A_42, but now we're going to map that to vector W_-2. With our window size of 2, when we get all the way to that leftmost position, we don't introduce a new vector; that also flattens out to W_-2. Then a parallel thing happens when we travel to the right, and when we get to the third position from our starting point, that again just flattens out to W_2 because of our window size.
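If it helps to see the window idea in code, here is a tiny sketch of the clipping that maps pairs of positions onto a small set of relative offsets; the function name and the window size of 2 are just assumptions for this example.

```python
def relative_bucket(query_pos, key_pos, window=2):
    """Clip the relative offset between two positions to a fixed window.
    With window=2 the only possible offsets are -2, -1, 0, 1, 2, so only
    five key vectors (and five value vectors) need to be learned."""
    offset = key_pos - query_pos
    return max(-window, min(window, offset))

# Attending from position 4 to positions 0..7: the offsets flatten out
# at the edges of the window.
print([relative_bucket(4, j) for j in range(8)])
# -> [-2, -2, -2, -1, 0, 1, 2, 2]
```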
The result is that, as opposed to all the distinctions that are made with those A vectors, we're collapsing those down into a smaller number of vectors. In the 4, 4 position we have W_0 for the keys, which is the same vector that we have up here in that 4, 4 position. A similar collapsing is going to happen down here, with W_-1 being the same vector as we had up here just to the right, and W_-2 corresponding to the same vector that we had above.
That would continue, and we have a parallel calculation for the value parameters that you see in purple up here, using the same notions of relative position and window size.
We actually learn a relatively small number of position vectors. We have a small, window-relative notion of position that's going to slide around and give us a lot of ability to generalize to new positions, based on combinations that we've seen before even when the absolute positions are new.
A final thing I'll say is that this is actually embedded in that full theory of attention, which might have a lot of learned parameters and might even be multi-headed. I've given the full calculation just to really give you all the details. But again, the cognitive shortcut is that the positional information is being introduced in the attention layer, not in the embedding layer.
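As a sketch of how those relative vectors enter the full calculation, here is a minimal single-head version in the spirit of the relative position representations of Shaw et al. (2018), which this kind of scheme follows; every name here is mine, and the real multi-headed calculation has more moving parts.

```python
import torch
import torch.nn.functional as F

def relative_attention(X, Wq, Wk, Wv, a_key, a_val, window=2):
    """Single-head attention with window-based relative position vectors
    added to the keys and values. a_key and a_val each hold 2*window+1
    learned vectors of size d, indexed by clipped relative offset."""
    n = X.size(0)
    q, k, v = X @ Wq, X @ Wk, X @ Wv
    # offsets[i, j] = clip(j - i, -window, window), shifted to 0..2*window
    pos = torch.arange(n)
    offsets = torch.clamp(pos[None, :] - pos[:, None], -window, window) + window
    rel_k = a_key[offsets]   # (n, n, d): relative key vector for each (i, j)
    rel_v = a_val[offsets]   # (n, n, d): relative value vector for each (i, j)
    d = q.size(-1)
    # Score for i attending to j: q_i . (k_j + rel_k[i, j]) / sqrt(d)
    scores = (q[:, None, :] * (k[None, :, :] + rel_k)).sum(-1) / d ** 0.5
    alpha = F.softmax(scores, dim=-1)
    # Output for i: sum_j alpha_ij * (v_j + rel_v[i, j])
    return (alpha[:, :, None] * (v[None, :, :] + rel_v)).sum(dim=1)
```

The point of the sketch is just where the positional information lives: it enters through rel_k and rel_v inside the attention layer, so the input X itself does not need to carry positional information.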
How is this scheme doing on our two crucial questions? First, we don't need to decide the set of positions ahead of time, because these are relative notions of position. Even for a potentially extremely long string, we can use this small set of positional vectors to keep track of relative position.
I think we have also largely overcome the concern that positional embeddings might hinder generalization to new positions. After all, if you consider a phrase like the rock, the core position vectors that are involved there are the ones for relative positions 0, 1, and -1, no matter where the phrase occurs. Depending on the surrounding context, there will be other positional things that are different around it, but we do have this sense of constancy that will allow the model to see the rock as the same phrase essentially wherever it appears in the string.
My hypothesis is that, because we have overcome these two crucial limitations, relative positional encoding is a very good bet for how to do positional encoding in general in the transformer, and I think that is being borne out by results across the field for the transformer.