
Stanford XCS224U: NLU | Contextual Word Representations, Part 8: Seq2seq Architectures | Spring 2023



00:00:00.000 | Welcome back everyone.
00:00:06.000 | This is part eight in our series on contextual representations.
00:00:09.440 | We're going to talk briefly about sequence to sequence architectures.
00:00:13.680 | To kick it off, I thought I would begin with tasks.
00:00:17.040 | These are going to be tasks that have natural sequence to sequence structure,
00:00:21.280 | and I'm trying to leave open for now whether we would actually
00:00:24.140 | model them with sequence to sequence architectures.
00:00:26.740 | That's a separate question.
00:00:28.460 | Seq2Seq tasks include machine translation, of course.
00:00:31.800 | This is a classic one where a text in one language comes in,
00:00:35.000 | and we would like to produce text in another language as the output.
00:00:38.660 | Summarization, also a classic Seq2Seq problem.
00:00:42.300 | A long text comes in and a presumably shorter one comes out summarizing the input.
00:00:47.620 | Free-form question answering where we're trying to generate answers.
00:00:51.280 | This could also be a Seq2Seq problem where
00:00:53.460 | a question maybe with some contextual information comes in,
00:00:56.560 | and the task in decoding is to generate an answer.
00:01:00.300 | Dialogue, of course, classic Seq2Seq problem,
00:01:03.360 | utterances to utterances.
00:01:05.400 | Semantic parsing could also be thought of as a Seq2Seq task here,
00:01:09.280 | natural language sentences come in,
00:01:11.400 | and we try to map them to their logical forms capturing aspects of their meaning.
00:01:16.120 | Related task would be code generation here,
00:01:18.440 | a natural language sentence comes in,
00:01:20.260 | and we try to produce a program that the sentence is describing.
00:01:24.200 | That is just a small sample of the many things that we could call Seq2Seq tasks.
00:01:30.200 | Even these are just special cases of the more general class of things that we might call
00:01:35.160 | encoder-decoder problems which would be
00:01:37.760 | agnostic about whether the encoding and decoding involve sequences,
00:01:41.440 | they could involve images, video, speech, and so forth.
00:01:46.160 | I've been offering historical notes throughout this series of lectures,
00:01:50.960 | and I think this is a nice point to emphasize that the RNN era really
00:01:55.240 | primed us to think about Seq2Seq problems in the context of transformers.
00:02:00.020 | On the left here, I have a classic RNN formulation of
00:02:03.720 | a Seq2Seq problem where we have the input sequence A, B,
00:02:06.560 | C, D, and then we begin decoding with this special symbol,
00:02:10.420 | decode X, Y, Z, and then we produce our end token,
00:02:13.840 | and that is the job of decoding.
00:02:16.280 | The historical note here is that those models evolved from
00:02:19.960 | standard RNNs into RNNs with lots of attention mechanisms on the top here,
00:02:25.400 | designed specifically to help the decoding steps remember what was in
00:02:30.360 | the encoding part by offering all of
00:02:32.760 | these attention mechanisms back into that encoding phase.
00:02:36.640 | What we see again in the transformer paper is a full embrace of
00:02:41.400 | attention as the primary mechanism and a dropping
00:02:44.200 | away of all of these recurrent mechanisms here.
00:02:48.280 | In the context of transformers,
00:02:51.560 | we have a variety of ways that we could think about Seq2Seq problems,
00:02:56.520 | one of them being encoder-decoder,
00:02:59.160 | but other options present themselves.
00:03:01.200 | This is a nice figure from the T5 paper,
00:03:03.440 | which we'll talk about in a second.
00:03:05.400 | On the left, you have encoder-decoder, as I said,
00:03:07.960 | where we fully encode the input in the encoder side with some set of parameters,
00:03:12.800 | and then possibly different parameters do decoding,
00:03:15.980 | where in the decoding steps,
00:03:17.480 | we attend fully back to all the steps from the encoder.
00:03:21.320 | But we needn't have this encoder-decoder structure.
00:03:25.380 | Another option, for example,
00:03:27.080 | would be to simply process these sequences with a standard language model.
00:03:31.440 | In the middle here, you have a transformer-based language model,
00:03:35.480 | and you can see that characteristic attention mask where we don't get to look into
00:03:39.180 | the future but rather can only attend to the past,
00:03:42.600 | even for the part that we're thinking of as the encoded part.
00:03:46.760 | An obvious variation of that would be to take our language model,
00:03:50.700 | and when we do encoding,
00:03:52.440 | allow full attention connections across all the positions that we are encoding.
00:03:57.800 | That's what you can see reflected here where when we're doing encoding,
00:04:01.320 | just as in the encoder-decoder structure,
00:04:03.240 | we can have every element attend to every other element.
00:04:06.980 | Then here, when we start to do decoding,
00:04:09.400 | that's where the mask can only look into the past and not the future.
00:04:14.320 | That's a nice framework for thinking about this.
00:04:17.080 | The middle and right-hand options have become increasingly prominent as people have
00:04:22.640 | explored ever larger variants of the GPT architecture,
00:04:26.200 | which is a standard language model.
00:04:28.800 | But I'm going to focus now on two encoder-decoder releases that I think are very powerful,
00:04:34.600 | beginning with T5, which was the source of that nice previous framework there.
00:04:39.440 | T5 is an encoder-decoder variant that had extensive, multitask,
00:04:44.920 | supervised, and unsupervised training across lots of different tasks.
00:04:49.880 | Then one very innovative thing that they did in the T5 paper,
00:04:53.480 | which really gives us a glimpse of what was about to happen with in-context learning,
00:04:57.960 | is that they offered task prefixes like "translate English to German:",
00:05:03.000 | and then you get the true input.
00:05:05.560 | That instruction on the left is telling the model what we want to do in decoding,
00:05:11.000 | and it guides the model in this case to do translation,
00:05:13.840 | but the same part after the colon could be performing
00:05:17.720 | a sentiment task given a different description of the task before the colon.
00:05:22.920 | Wonderfully insightful thing where we express all these tasks as natural language,
00:05:27.200 | which we simply encode,
00:05:28.720 | and that guides the model's behavior,
00:05:31.360 | essentially as though those task instructions were
00:05:34.280 | themselves structured information as inputs to the model.
00:05:38.320 | For T5, we have lots of model releases as well,
00:05:41.920 | which has been tremendously empowering.
00:05:43.600 | This is a sample of the models that are available on Hugging Face,
00:05:46.920 | and you can see that they range from very manageable 60 million parameter models,
00:05:51.320 | on up to really large 11 billion parameter releases.
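As a hedged sketch of how the task-prefix idea looks with those Hugging Face releases: the code below uses the standard `transformers` classes and the `t5-small` checkpoint, with a prefix of the kind described above. The prompt text is just an illustrative example.

```python
# Minimal sketch: prompting a pre-trained T5 checkpoint with a task prefix.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The prefix before the colon tells the model which task to perform
# on the text that follows it.
prompt = "translate English to German: That is good."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Swapping the prefix (for example, to a sentiment instruction) changes what the model does with the same text after the colon, which is the point being made in the lecture.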
00:05:55.760 | Relatedly, the Flan-T5 models are variants of
00:06:00.080 | the T5 architecture that were specifically instruction-tuned,
00:06:04.240 | and that's a set of methods that we'll talk about in the next unit of the course.
00:06:08.600 | Those are also very powerful.
00:06:10.800 | That's T5. The other architecture that I thought I would highlight here is BART.
00:06:16.360 | BART has some similarities and some real differences with T5.
00:06:20.400 | The essence of BART is that on the encoding side,
00:06:23.720 | we're going to have a standard BERT-like architecture,
00:06:26.760 | and on the decoding side,
00:06:28.680 | we're going to have a standard GPT-like architecture.
00:06:31.840 | That's fairly straightforward.
00:06:34.000 | What's interesting about BART is the way pre-training happens.
00:06:37.560 | This is essentially oriented around taking
00:06:41.680 | corrupted sequences as input and figuring out how to uncorrupt them.
00:06:45.920 | What they did on the corrupting side is, for example,
00:06:49.320 | text infilling where whole parts of the input are masked out or removed,
00:06:54.080 | sentence shuffling where we reorganize parts of the input,
00:06:58.080 | token masking, token deletion, and document rotation.
00:07:03.360 | What they found is that the most effective pre-training regime was
00:07:07.600 | a combination of that text infilling step and the sentence shuffling step.
00:07:12.000 | Remember, the idea here is that in pre-training,
00:07:15.240 | we're feeding in these corrupted sequences with these two techniques by and large,
00:07:19.200 | and having the model learn to uncorrupt those sequences.
00:07:22.320 | The idea there, which is similar to the insight that we had from
00:07:25.400 | ELECTRA, is that that kind of task can lead
00:07:28.160 | the model to understand what good sequences look like.
00:07:32.040 | That's the pre-training phase.
00:07:33.720 | If you download parameters from Hugging Face,
00:07:36.120 | they're likely to be pre-trained in this uncorrupting fashion.
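Here is a small sketch (my own illustration, not the BART authors' code) of two of the noising functions described above, applied to whitespace-tokenized text. The mask token, span length, and example sentences are assumptions for illustration; BART itself samples span lengths rather than fixing them.

```python
# Illustrative sketch of BART-style corruption: text infilling + sentence shuffling.
import random

MASK = "<mask>"

def text_infill(tokens, span_len=3):
    # Replace one contiguous span with a single mask token; the model must
    # reconstruct both the content and the length of the missing span.
    if len(tokens) <= span_len:
        return [MASK]
    start = random.randrange(len(tokens) - span_len)
    return tokens[:start] + [MASK] + tokens[start + span_len:]

def shuffle_sentences(text):
    # Sentence shuffling: split on periods and permute the sentence order.
    sents = [s.strip() for s in text.split(".") if s.strip()]
    random.shuffle(sents)
    return ". ".join(sents) + "."

original = "The cat sat on the mat. It was a sunny day. Everyone was happy."
corrupted = shuffle_sentences(" ".join(text_infill(original.split())))
# In pre-training, `corrupted` is the encoder input and `original` is the
# sequence the decoder learns to reconstruct.
```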
00:07:41.120 | For fine-tuning, the protocol is a little bit different.
00:07:44.640 | If we're doing classification tasks,
00:07:46.940 | we feed uncorrupted copies of the input into the encoder and the decoder,
00:07:51.680 | and then maybe we fine-tune the final decoder state as
00:07:55.360 | we would with GPT against our classification task.
00:07:59.400 | For standard Seq2Seq problems,
00:08:01.760 | we simply feed in the input and the output to
00:08:04.880 | the model and then fine-tune it on that basis with no corruption.
00:08:08.560 | The corruption is by and large confined to the pre-training phase.
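As a rough sketch of what those two fine-tuning set-ups look like with the Hugging Face classes: the checkpoint name, label count, and example texts below are placeholders, but `BartForSequenceClassification` and `BartForConditionalGeneration` are the standard `transformers` classes for the two cases.

```python
# Minimal sketch of the two BART fine-tuning modes described above.
from transformers import (
    AutoTokenizer,
    BartForSequenceClassification,
    BartForConditionalGeneration,
)

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")

# Classification: the uncorrupted text is fed to both encoder and decoder
# internally, and a head on the final decoder state predicts the label.
clf = BartForSequenceClassification.from_pretrained(
    "facebook/bart-base", num_labels=2  # placeholder label count
)
batch = tokenizer("A wonderfully insightful movie.", return_tensors="pt")
logits = clf(**batch).logits

# Seq2seq: the uncorrupted input goes to the encoder, the target sequence
# to the decoder, and we fine-tune on the usual generation loss.
seq2seq = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
inputs = tokenizer("A long article to summarize ...", return_tensors="pt")
labels = tokenizer("A short summary.", return_tensors="pt").input_ids
loss = seq2seq(**inputs, labels=labels).loss
```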
00:08:12.800 | The evidence that is offered in the paper is that that objective puts models in
00:08:18.760 | a good pre-trained state where they're really good at
00:08:21.100 | these fine-tuning tasks across a lot of different tasks.
00:08:24.760 | That's T5 and BART.
00:08:26.400 | Those are just two samples from
00:08:28.120 | the wide range of different Seq2Seq architectures that are out there.
00:08:31.560 | But I think they're both very powerful as pre-trained artifacts that you can make
00:08:35.480 | use of, and they also highlight some of the innovation that is
00:08:38.760 | happening with transformers in the Seq2Seq space.