
Stanford XCS224U: NLU | Contextual Word Representations, Part 8: Seq2seq Architectures | Spring 2023


Transcript

Welcome back everyone. This is part eight in our series on contextual representations. We're going to talk briefly about sequence to sequence architectures. To kick it off, I thought I would begin with tasks. These are going to be tasks that have natural sequence to sequence structure, and I'm trying to leave open for now whether we would actually model them with sequence to sequence architectures.

That's a separate question. Seq2Seq tasks include machine translation, of course. This is a classic one where a text in one language comes in, and we would like to produce text in another language as the output. Summarization, also a classic Seq2Seq problem. A long text comes in and a presumably shorter one comes out summarizing the input.

Free-form question answering where we're trying to generate answers. This could also be a Seq2Seq problem where a question, maybe with some contextual information, comes in, and the task in decoding is to generate an answer. Dialogue, of course, classic Seq2Seq problem, utterances to utterances. Semantic parsing could also be thought of as a Seq2Seq task: here, natural language sentences come in, and we try to map them to their logical forms, capturing aspects of their meaning.

A related task would be code generation: here, a natural language sentence comes in, and we try to produce a program that the sentence is describing. That is just a small sample of the many things that we could call Seq2Seq tasks. Even these are just special cases of the more general class of things that we might call encoder-decoder problems, which would be agnostic about whether the encoding and decoding involve sequences; they could involve images, video, speech, and so forth.

I've been offering historical notes throughout this series of lectures, and I think this is a nice point to emphasize that the RNN era really primed us to think about Seq2Seq problems in the context of transformers. On the left here, I have a classic RNN formulation of a Seq2Seq problem where we have the input sequence A, B, C, D, and then we begin decoding with this special symbol, decode X, Y, Z, and then we produce our end token, and that is the job of decoding.

The historical note here is that these architectures evolved from standard RNNs into RNNs with lots of attention mechanisms on top, designed specifically to help the decoding steps remember what was in the encoding part by offering all of these attention connections back into that encoding phase. What we see in the transformer paper is a full embrace of attention as the primary mechanism and a dropping away of all of these recurrent mechanisms.

In the context of transformers, we have a variety of ways that we could think about Seq2Seq problems, one of them being encoder-decoder, but other options present themselves. This is a nice figure from the T5 paper, which we'll talk about in a second. On the left, you have encoder-decoder, as I said, where we fully encode the input in the encoder side with some set of parameters, and then possibly different parameters do decoding, where in the decoding steps, we attend fully back to all the steps from the encoder.

But we needn't have this encoder-decoder structure. Another option, for example, would be to simply process these sequences with a standard language model. In the middle here, you have a transformer-based language model, and you can see that characteristic attention mask where we don't get to look into the future but rather can only attend to the past, even for the part that we're thinking of as the encoded part.

An obvious variation of that would be to take our language model and, when we do encoding, allow a full set of attention connections across all the positions we're encoding. That's what you can see reflected here: when we're doing encoding, just as in the encoder-decoder structure, we can have every element attend to every other element.

Then here, when we start to do decoding, that's where the mask can only look into the past and not the future. That's a nice framework for thinking about this. The middle and right-hand options have become increasingly prominent as people have explored ever larger variants of the GPT architecture, which is a standard language model.
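To make those masking patterns concrete, here is a minimal sketch in NumPy. It is my own illustration rather than code from the lecture or the T5 paper: it builds the fully causal mask a standard language model uses and the prefix-LM variant in which the encoded prefix attends fully to itself while the rest remains causal.

```python
# Minimal sketch of the attention patterns in the T5 figure (my own illustration).
# Entry [i, j] is True when position i may attend to position j.
import numpy as np

def causal_mask(n):
    """Standard language model: each position attends only to itself and the past."""
    return np.tril(np.ones((n, n), dtype=bool))

def prefix_lm_mask(n, prefix_len):
    """Prefix LM: full attention within the first `prefix_len` positions
    (the 'encoded' part), causal attention everywhere after that."""
    mask = causal_mask(n)
    mask[:prefix_len, :prefix_len] = True  # every prefix token sees every other prefix token
    return mask

# Encoder-decoder is the same idea split across two stacks: the encoder uses an
# all-True mask over its own positions, and each decoder position combines a
# causal self-attention mask with full cross-attention back to the encoder.

print(causal_mask(4).astype(int))
print(prefix_lm_mask(6, prefix_len=3).astype(int))
```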

But I'm going to focus now on two encoder-decoder releases that I think are very powerful, beginning with T5, which was the source of that nice framework on the previous slide. T5 is an encoder-decoder variant that had extensive multitask, supervised, and unsupervised training across lots of different tasks. Then one very innovative thing that they did in the T5 paper, which really gives us a glimpse of what was about to happen with in-context learning, is that they offered task prefixes like "translate English to German:" followed by the actual input.

That instruction on the left tells the model what we want it to do in decoding, and it guides the model in this case to do translation, but the same text after the colon could instead be run through a sentiment task given a different description of the task before the colon. It's a wonderfully insightful move: we express all these tasks as natural language, which we simply encode, and that guides the model's behavior, essentially as though those task instructions were themselves structured inputs to the model.
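As a concrete illustration of those prefixes, here is a small sketch using the Hugging Face transformers library and the t5-small checkpoint. The library and checkpoint are my choices for illustration; the lecture doesn't prescribe them. The point is that the same model performs different tasks depending only on the natural-language prefix.

```python
# Sketch of T5's task prefixes with Hugging Face transformers (my own example).
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The same checkpoint handles different tasks; only the prefix changes.
prompts = [
    "translate English to German: That is good.",
    "summarize: The T5 paper frames every NLP task as mapping an input string "
    "to an output string, trained with a mix of supervised and unsupervised objectives.",
]

for prompt in prompts:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```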

For T5, we have lots of model releases as well, which has been tremendously empowering. This is a sample of the models that are available on Hugging Face, and you can see that they range from very manageable 60 million parameter models, on up to really large 11 billion parameter releases.

Relatedly, the Flan-T5 models are variants of the T5 architecture that were specifically instruction-tuned, and that's a set of methods that we'll talk about in the next unit of the course. Those are also very powerful. That's T5. The other architecture that I thought I would highlight here is BART.

BART has some similarities and some real differences with T5. The essence of BART is that on the encoding side, we're going to have a standard BERT-like architecture, and on the decoding side, we're going to have a standard GPT-like architecture. That's fairly straightforward. What's interesting about BART is the way pre-training happens.

This is essentially oriented around taking corrupted sequences as input and figuring out how to uncorrupt them. What they did on the corrupting side is, for example, text infilling where whole parts of the input are masked out or removed, sentence shuffling where we reorganize parts of the input, token masking, token deletion, and document rotation.

What they found is that the most effective pre-training regime was a combination of that text infilling step and the sentence shuffling step. Remember, the idea here is that in pre-training, we're feeding in these corrupted sequences with these two techniques by and large, and having the model learn to uncorrupt those sequences.
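Here is a toy sketch of those two corruption strategies, just to make the idea concrete. It's my own simplification, not the BART authors' implementation; for instance, the paper samples infilling span lengths from a Poisson distribution, whereas this sketch uses a fixed span length.

```python
# Toy illustration of BART-style corruption (my own simplified sketch).
import random

MASK = "<mask>"

def text_infill(tokens, span_len=2, n_spans=1):
    """Replace each sampled span of tokens with a single mask token."""
    tokens = list(tokens)
    for _ in range(n_spans):
        start = random.randrange(0, max(1, len(tokens) - span_len))
        tokens[start:start + span_len] = [MASK]
    return tokens

def shuffle_sentences(text):
    """Split a document into sentences and permute their order."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    random.shuffle(sentences)
    return ". ".join(sentences) + "."

doc = ("BART corrupts its input during pre-training. "
       "The decoder learns to reconstruct the original text.")
print(shuffle_sentences(doc))
print(" ".join(text_infill(doc.split(), span_len=3)))
```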

The idea there, which is similar to the insight we had from ELECTRA, is that that kind of task can lead the model to understand what good sequences look like. That's the pre-training phase. If you download parameters from Hugging Face, they're likely to have been pre-trained in this uncorrupting fashion.

For fine-tuning, the protocol is a little bit different. If we're doing classification tasks, we feed uncorrupted copies of the input into the encoder and the decoder, and then maybe we fine-tune on the final decoder state, as we would with GPT, against our classification task. For standard Seq2Seq problems, we simply feed the input and the output to the model and then fine-tune it on that basis with no corruption.
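For a sense of what that Seq2Seq path looks like in practice, here is a sketch that loads a BART checkpoint already fine-tuned for summarization through the Hugging Face transformers library. The library and the facebook/bart-large-cnn checkpoint are my choices for illustration, not something named in the lecture.

```python
# Sketch of using a BART checkpoint fine-tuned as a Seq2Seq summarizer (my own example).
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

article = (
    "BART is pre-trained by corrupting documents with span infilling and "
    "sentence shuffling, then learning to reconstruct the original text. "
    "For downstream Seq2Seq tasks such as summarization, the uncorrupted "
    "input goes to the encoder and the target sequence is generated by the decoder."
)

input_ids = tokenizer(article, return_tensors="pt").input_ids
summary_ids = model.generate(input_ids, max_new_tokens=40, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```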

The corruption is by and large confined to the pre-training phase. The evidence offered in the paper is that this objective puts models in a good pre-trained state where they perform really well when fine-tuned across a lot of different tasks. That's T5 and BART. That's just two samples from the wide range of different Seq2Seq architectures that are out there.

But I think they're both very powerful as pre-trained artifacts that you can make use of, and they also highlight some of the innovation that is happening with transformers in the Seq2Seq space.