Stanford XCS224U: NLU | Contextual Word Representations, Part 8: Seq2seq Architectures | Spring 2023
This is part eight in our series on contextual representations. 00:00:09.440 |
We're going to talk briefly about sequence to sequence architectures. 00:00:13.680 |
To kick it off, I thought I would begin with tasks. 00:00:17.040 |
These are going to be tasks that have natural sequence to sequence structure, 00:00:21.280 |
and I'm trying to leave open for now whether we would actually 00:00:24.140 |
model them with sequence to sequence architectures. 00:00:28.460 |
Seq2Seq tasks include machine translation, of course. 00:00:31.800 |
This is a classic one where a text in one language comes in, 00:00:35.000 |
and we would like to produce text in another language as the output. 00:00:38.660 |
Summarization, also a classic Seq2Seq problem. 00:00:42.300 |
A long text comes in and a presumably shorter one comes out summarizing the input. 00:00:47.620 |
Free-form question answering, where we're trying to generate answers: 00:00:53.460 |
a question maybe with some contextual information comes in, 00:00:56.560 |
and the task in decoding is to generate an answer. 00:01:00.300 |
Dialogue, of course, is a classic Seq2Seq problem. 00:01:05.400 |
Semantic parsing could also be thought of as a Seq2Seq task: sentences come in, 00:01:11.400 |
and we try to map them to their logical forms capturing aspects of their meaning. 00:01:20.260 |
Similarly for code generation, where we try to produce a program that the sentence is describing. 00:01:24.200 |
That is just a small sample of the many things that we could call Seq2Seq tasks. 00:01:30.200 |
Even these are just special cases of the more general class of things that we might call 00:01:37.760 |
encoder-decoder problems, where we're agnostic about whether the encoding and decoding involve sequences; 00:01:41.440 |
they could involve images, video, speech, and so forth. 00:01:46.160 |
I've been offering historical notes throughout this series of lectures, 00:01:50.960 |
and I think this is a nice point to emphasize that the RNN era really 00:01:55.240 |
primed us to think about Seq2Seq problems in the context of transformers. 00:02:00.020 |
On the left here, I have a classic RNN formulation of 00:02:03.720 |
a Seq2Seq problem where we have the input sequence A, B, 00:02:06.560 |
C, D, and then we begin decoding with this special symbol, 00:02:10.420 |
decode X, Y, Z, and then we produce our end token. 00:02:16.280 |
The historical note here is that these architectures evolved from 00:02:19.960 |
standard RNNs into RNNs with lots of attention mechanisms on the top here, 00:02:25.400 |
designed specifically to help the decoding steps remember what was in 00:02:32.760 |
the input, by connecting these attention mechanisms back into that encoding phase. 00:02:36.640 |
What we see again in the transformer paper is a full embrace of 00:02:41.400 |
attention as the primary mechanism and a dropping 00:02:44.200 |
away of all of these recurrent mechanisms here. 00:02:51.560 |
In the transformer era, we have a variety of ways that we could think about Seq2Seq problems. 00:03:05.400 |
On the left, you have encoder-decoder, as I said, 00:03:07.960 |
where we fully encode the input in the encoder side with some set of parameters, 00:03:12.800 |
and then possibly different parameters do the decoding, 00:03:17.480 |
where we attend fully back to all the steps from the encoder. 00:03:21.320 |
But we needn't have this encoder-decoder structure. 00:03:27.080 |
Another option would be to simply process these sequences with a standard language model. 00:03:31.440 |
In the middle here, you have a transformer-based language model, 00:03:35.480 |
and you can see that characteristic attention mask where we don't get to look into 00:03:39.180 |
the future but rather can only attend to the past, 00:03:42.600 |
even for the part that we're thinking of as the encoded part. 00:03:46.760 |
An obvious variation of that would be to take our language model, 00:03:52.440 |
and do a full set of attention connections across everything that we're thinking of as the encoded part. 00:03:57.800 |
That's what you can see reflected here where when we're doing encoding, 00:04:03.240 |
we can have every element attend to every other element. 00:04:09.400 |
It's only in the decoding part that the mask restricts us to looking into the past and not the future. 00:04:14.320 |
That's a nice framework for thinking about this. 00:04:17.080 |
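To make the contrast concrete, here is a minimal sketch, assuming only NumPy, of the three attention patterns just described: the fully causal language-model mask, the prefix-LM mask with full attention over the encoded part, and the separate self- and cross-attention masks of an encoder-decoder. The function names and the boolean-array representation are illustrative choices, not anything from the lecture or the T5 paper.

```python
import numpy as np

def causal_mask(n):
    """Language model: position i may attend only to positions j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def prefix_lm_mask(n, prefix_len):
    """Prefix LM: full attention within the first prefix_len (encoded) positions,
    causal attention for the remaining (decoded) positions."""
    mask = causal_mask(n)
    mask[:prefix_len, :prefix_len] = True
    return mask

def encoder_decoder_masks(n_src, n_tgt):
    """Encoder-decoder: the encoder self-attends fully, the decoder self-attends
    causally, and every decoder position cross-attends to every encoder position."""
    enc_self = np.ones((n_src, n_src), dtype=bool)
    dec_self = causal_mask(n_tgt)
    cross = np.ones((n_tgt, n_src), dtype=bool)
    return enc_self, dec_self, cross

# For a 6-token sequence whose first 3 tokens play the role of the "encoded" part:
print(causal_mask(6).astype(int))                    # middle option: plain language model
print(prefix_lm_mask(6, prefix_len=3).astype(int))   # right-hand option: prefix LM
```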
The middle and right-hand options have become increasingly prominent as people have 00:04:22.640 |
explored ever larger variants of the GPT architecture. 00:04:28.800 |
But I'm going to focus now on two encoder-decoder releases that I think are very powerful, 00:04:34.600 |
beginning with T5, which was the source of that nice framework from the previous slide. 00:04:39.440 |
T5 is an encoder-decoder variant that had extensive, multitask, 00:04:44.920 |
supervised, and unsupervised training across lots of different tasks. 00:04:49.880 |
Then one very innovative thing that they did in the T5 paper, 00:04:53.480 |
which really gives us a glimpse of what was about to happen with in-context learning, 00:04:57.960 |
is that they offered task prefixes like "translate English to German". 00:05:05.560 |
That instruction on the left is telling the model what we want to do in decoding, 00:05:11.000 |
and it guides the model in this case to do translation, 00:05:13.840 |
but the same part after the colon could be performing 00:05:17.720 |
a sentiment task given a different description of the task before the colon. 00:05:22.920 |
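To make that concrete, here is a minimal sketch assuming the Hugging Face transformers library and the public t5-small checkpoint. The prefixes shown ("translate English to German:", "sst2 sentence:") are ones the released T5 models were trained with; the surrounding code is simply one way to call the model, not part of the lecture.

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The natural-language prefix before the colon tells the model which task to perform.
inputs = tokenizer("translate English to German: That is good.", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

# The very same model does sentiment classification when given a different prefix:
inputs = tokenizer("sst2 sentence: this movie was wonderful", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))  # e.g. "positive"
```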
A wonderfully insightful thing where we express all these tasks in natural language, 00:05:31.360 |
essentially treating the task instructions themselves as 00:05:34.280 |
just more textual input to the model. 00:05:38.320 |
For T5, we have lots of model releases as well. 00:05:43.600 |
This is a sample of the models that are available on Hugging Face, 00:05:46.920 |
and you can see that they range from very manageable 60 million parameter models, 00:05:51.320 |
on up to really large 11 billion parameter releases. 00:05:55.760 |
Relatedly, the Flan T5 models are variants of 00:06:00.080 |
the T5 architecture that were specifically instruction tuned, 00:06:04.240 |
and that's a set of methods that we'll talk about in the next unit of the course. 00:06:10.800 |
That's T5. The other architecture that I thought I would highlight here is BART. 00:06:16.360 |
BART has some similarities and some real differences with T5. 00:06:20.400 |
The essence of BART is that on the encoding side, 00:06:23.720 |
we're going to have a standard BERT-like architecture, 00:06:28.680 |
and on the decoding side, we're going to have a standard GPT-like architecture. 00:06:34.000 |
What's interesting about BART is the way pre-training happens. 00:06:41.680 |
The model is pre-trained by taking corrupted sequences as input and figuring out how to uncorrupt them. 00:06:45.920 |
What they did on the corrupting side is, for example, 00:06:49.320 |
text infilling where whole parts of the input are masked out or removed, 00:06:54.080 |
sentence shuffling where we reorganize parts of the input, 00:06:58.080 |
token masking, token deletion, and document rotation. 00:07:03.360 |
What they found is that the most effective pre-training regime was 00:07:07.600 |
a combination of that text infilling step and the sentence shuffling step. 00:07:12.000 |
Remember, the idea here is that in pre-training, 00:07:15.240 |
we're feeding in these corrupted sequences with these two techniques by and large, 00:07:19.200 |
and having the model learn to uncorrupt those sequences. 00:07:22.320 |
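To make that objective concrete, here is a schematic sketch, not the authors' implementation, of the two corruptions just described: text infilling (spans replaced by a single mask token; the paper draws span lengths from a Poisson distribution, approximated crudely here) and sentence shuffling. Tokenization and sentence splitting are deliberately simplified.

```python
import random

MASK = "<mask>"

def text_infilling(tokens, mask_ratio=0.3, lam=3.0):
    """Replace random spans with a single <mask> token each, until roughly
    mask_ratio of the original tokens have been covered."""
    tokens = list(tokens)
    budget = int(len(tokens) * mask_ratio)
    while budget > 0 and len(tokens) > 1:
        # crude stand-in for the Poisson(lam) span lengths used in the paper
        span = max(1, int(random.expovariate(1.0 / lam)))
        span = min(span, budget, len(tokens))
        start = random.randrange(0, len(tokens) - span + 1)
        tokens[start:start + span] = [MASK]
        budget -= span
    return tokens

def sentence_shuffling(sentences):
    """Permute the sentences of a document at random."""
    sentences = list(sentences)
    random.shuffle(sentences)
    return sentences

doc = ["the cat sat on the mat .", "it was a sunny day .", "birds sang outside ."]
corrupted = [" ".join(text_infilling(s.split())) for s in sentence_shuffling(doc)]
print(corrupted)  # the model must learn to reconstruct the original document from this
```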
The idea there, which is similar to the insight that we had from 00:07:28.160 |
masked language modeling, is that uncorrupting forces the model to understand what good sequences look like. 00:07:33.720 |
If you download parameters from Hugging Face, 00:07:36.120 |
they're likely to be pre-trained in this uncorrupting fashion. 00:07:41.120 |
For fine-tuning, the protocol is a little bit different. 00:07:46.940 |
For classification tasks, we feed uncorrupted copies of the input into the encoder and the decoder, 00:07:51.680 |
and then maybe we fine-tune the final decoder state as 00:07:55.360 |
we would with GPT against our classification task. 00:08:01.760 |
For sequence-to-sequence tasks, we simply feed in the input and the output to 00:08:04.880 |
the model and then fine-tune it on that basis with no corruption. 00:08:08.560 |
The corruption is by and large confined to the pre-training phase. 00:08:12.800 |
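Here is a minimal sketch of those two fine-tuning modes, assuming the Hugging Face transformers library and the public facebook/bart-base checkpoint; the example texts and label values are made up, and real fine-tuning would loop over a dataset with an optimizer.

```python
import torch
from transformers import (BartTokenizer, BartForSequenceClassification,
                          BartForConditionalGeneration)

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

# Classification: the same uncorrupted text is fed to encoder and decoder, and a
# classifier head on top of the final decoder state is fine-tuned (handled internally).
clf = BartForSequenceClassification.from_pretrained("facebook/bart-base", num_labels=2)
batch = tokenizer(["a wonderful, heartfelt film"], return_tensors="pt")
loss = clf(**batch, labels=torch.tensor([1])).loss
loss.backward()  # gradient for one fine-tuning step

# Sequence-to-sequence (e.g. summarization): uncorrupted input and target, no noising.
gen = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
enc = tokenizer(["A long article about seq2seq transformers ..."], return_tensors="pt")
labels = tokenizer(["A short summary."], return_tensors="pt").input_ids
loss = gen(**enc, labels=labels).loss
loss.backward()
```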
The evidence that is offered in the paper is that that objective puts models in 00:08:18.760 |
a good pre-trained state where they perform well 00:08:21.100 |
when fine-tuned across a lot of different tasks. 00:08:28.120 |
T5 and BART are just two examples of the wide range of different Seq2Seq architectures that are out there. 00:08:31.560 |
But I think they're both very powerful as pre-trained artifacts that you can make 00:08:35.480 |
use of and also highlight some of the innovation that is 00:08:38.760 |
happening with transformers in the Seq2Seq space.