Back to Index

Stanford CS224N NLP with Deep Learning | 2023 | Lecture 8 - Self-Attention and Transformers


Transcript

Hi, everyone. Welcome to CS224N. We're about two minutes in, so let's get started. So today, we've got what I think is quite an exciting lecture topic. We're going to talk about self-attention and transformers. So these are some ideas that are sort of the foundation of most of the modern advances in natural language processing.

And actually, AI systems in a broad range of fields. So it's a very, very fun topic. Before we get into that-- OK, before we get into that, we're going to have a couple of reminders. So there are brand new lecture notes. Woo! Nice, thank you. Yeah. I'm very excited about them.

They go into-- they pretty much follow along with what I'll be talking about today, but go into considerably more detail. Assignment four is due a week from today. Yeah, so the issues with Azure continue. Thankfully-- woo! Thankfully, our TAs have tested that this works on Colab, and the amount of training is such that a Colab session will allow you to train your machine translation system.

So if you don't have a GPU, use Colab. We're continuing to work on getting access to more GPUs for assignment five in the final project. We'll continue to update you as we're able to. But the usual systems this year are no longer holding because companies are changing their minds about things.

OK, so the final project proposal: you write a proposal of what you want to work on for your final project. We will give you feedback on whether we think it's a feasible idea or how to change it. So this is very important because we want you to work on something that we think has a good chance of success for the rest of the quarter.

That's going to be out tonight. We'll have an announcement when it is out. And we want to get you feedback on that pretty quickly because you'll be working on this after assignment five is done. Really, the major core component of the course after that is the final project.

OK, any questions? Cool. OK. So let's take a look back into what we've done so far in this course and see what we were doing in natural language processing. What was our strategy? If you had a natural language processing problem and you wanted to take your best effort attempt at it without doing anything too fancy, you would have said, OK, I'm going to have a bidirectional LSTM instead of a simple RNN.

I'm going to use an LSTM to encode my sentences, so I get bidirectional context. And if I have an output that I'm trying to generate, I'll have a unidirectional LSTM that generates it one by one. So you have a translation or a parse or whatever. And so maybe I've encoded the source sentence in a bidirectional LSTM, and I'm sort of one by one decoding out the target with my unidirectional LSTM.

And then also, I was going to use something like attention to give flexible access to memory if I felt like I needed to do this sort of look back and see where I want to translate from. And this was just working exceptionally well. And we motivated attention through wanting to do machine translation.

And you have this bottleneck where you don't want to have to encode the whole source sentence in a single vector. And in this lecture, we have the same goal. So we're going to be looking at a lot of the same problems that we did previously. But we're going to use different building blocks.

We're going to say, whereas from 2014 to 2017-ish we were using recurrence, through lots of trial and error in the years since, we've ended up with these brand new building blocks that we can plug in as a direct replacement for LSTMs. And they're going to allow for just a huge range of much more successful applications.

And so what are the issues with the recurrent neural networks we used to use? And what are the new systems that we're going to use from this point moving forward? So one of the issues with a recurrent neural network is what we're going to call linear interaction distance. So as we know, RNNs are unrolled left to right or right to left, depending on the language and the direction.

But it encodes the notion of linear locality, which is useful. Because if two words occur right next to each other, sometimes they're actually quite related. So tasty pizza. They're nearby. And in the recurrent neural network, you encode tasty. And then you walk one step, and you encode pizza. So nearby words do often affect each other's meanings.

But you have this problem where very long distance dependencies can take a very long time to interact. So if I have the sentence, the chef-- so those are nearby. Those interact with each other. And then who, and then a bunch of stuff. Like the chef who went to the stores and picked up the ingredients and loves garlic.

And then was. Like I actually have an RNN step, this sort of application of the recurrent weight matrix and some element-wise nonlinearities, once, twice, three times-- potentially as many times as the length of the sequence between chef and was. And it's the chef who was. So this is a long distance dependency.

Should feel kind of related to the stuff that we did in dependency syntax. But it's quite difficult to learn potentially that these words should be related. So if you have a lot of steps between words, it can be difficult to learn the dependencies between them. We talked about all these gradient problems.

LSTMs do a lot better at modeling the gradients across long distances than simple recurrent neural networks. But it's not perfect. And we already know that this linear order isn't sort of the right way to think about sentences. So if I wanted to learn that it's the chef who was, then I might have a hard time doing it because the gradients have to propagate from was to chef.

And really, I'd like more direct connection between words that might be related in the sentence. Or in a document even, if these are going to get much longer. So this is this linear interaction distance problem. We would like words that might be related to be able to interact with each other in the neural networks computation graph more easily than being linearly far away so that we can learn these long distance dependencies better.

And there's a related problem too that again comes back to the recurrent neural network's dependence on the index-- on the index into the sequence, often called a dependence on time. So in a recurrent neural network, the forward and backward passes have O(sequence length) many-- that means, in this case, just roughly sequence length many-- unparallelizable operations.

So we know GPUs are great. They can do a lot of operations at once, as long as there's no dependency between the operations in terms of time, where you have to compute one and then compute the other. But in a recurrent neural network, you can't actually compute the RNN hidden state for time step 5 before you compute the RNN hidden state for time step 4 or time step 3.

And so you get this graph that looks very similar, where if I want to compute this hidden state, so I've got some word, I have zero operations I need to do before I can compute this state. I have one operation I need to do before I can compute this state.

And as my sequence length grows, I've got-- OK, here I've got three operations I need to do before I can compute the state with the number 3, because I need to compute this and this and that. So there are three unparallelizable operations, where I'm glomming all the matrix multiplies and stuff into a single operation each.

So 1, 2, 3. And of course, this grows with the sequence length as well. So down over here, as the sequence length grows, I can't parallelize-- I can't just have a big GPU just kachanka with the matrix multiply to compute this state, because I need to compute all the previous states beforehand.

OK, any questions about that? So these are these two related problems, both with the dependence on time. Yeah. Yeah, so I have a question on the linear interaction issue. I thought that was the whole point of attention-- that during training you can attend to the cells that depend more on each other.

Can't we do something like the attention and then work our way around that? So the question is, with the linear interaction distance, wasn't this the point of attention that gets around that? Can't we use something with attention to help, or does that just help? So it won't solve the parallelizability problem.

And in fact, everything we do in the rest of this lecture will be attention-based. But we'll get rid of the recurrence and just do attention, more or less. So well, yeah, it's a great intuition. Any other questions? OK, cool. So if not recurrence, what about attention? See, I'm just a slide back.

And so we're going to get deep into attention today. But just for the second, attention treats each word's representation as a query to access and incorporate information from a set of values. So previously, we were in a decoder. We were decoding out a translation of a sentence. And we attended to the encoder so that we didn't have to store the entire representation of the source sentence into a single vector.

And here, today, we'll think about attention within a single sentence. So I've got this sentence written out here with a word 1 through word t, in this case. And in these boxes, I'm writing out, as integers, the number of unparallelizable operations that you need to do before you can compute each of these.

So for each word, you can independently compute its embedding without doing anything else previously, because the embedding just depends on the word identity. And then with attention, if I wanted to build an attention representation of this word by looking at all the other words in the sequence, that's one big operation.

And I can do them in parallel for all the words. So the attention for this word, I can do in parallel with the attention for that word. I don't need to walk left to right like I did for an RNN. Again, we'll get much deeper into this. But you should have the intuition that it solves the linear interaction problem and the non-parallelizability problem.

Because now, no matter how far away words are from each other, I am potentially interacting. I might just attend to you, even if you're very, very far away, sort of independent of how far away you are. And I also don't need to sort of walk along the sequence linearly.

So I'm treating the whole sequence at once. All right. So the intuition is that attention allows you to look very far away at once. And it doesn't have this dependence on the sequence index that keeps us from parallelizing operations. And so now, the rest of the lecture will talk in great depth about attention.

So maybe let's just move on. OK. So let's think more deeply about attention. One thing that you might think of with attention is that it's sort of performing kind of a fuzzy lookup in a key value store. So you have a bunch of keys, a bunch of values, and it's going to help you sort of access that.

So in an actual lookup table, just like a dictionary in Python, for example, very simple. You have a table of keys that each key maps to a value. And then you give it a query. And the query matches one of the keys. And then you return the value. So I've got a bunch of keys here.

And my query matches the key. So I return the value. Simple, fair, easy. OK. Good. And in attention, so just like we saw before, the query matches all keys softly. There's no exact match. You sort of compute some sort of similarity between the key and all of the-- sorry, the query and all of the keys.

And then you sort of weight the results. So you've got a query again. You've got a bunch of keys. The query, to different extents, is similar to each of the keys. And you will sort of measure that similarity between 0 and 1 through a softmax. And then you get the values out.

So you average the values via the weights, the similarity between the query and each of the keys. You do a weighted sum with those weights. And you get an output. So it really is quite a bit like a lookup table, but in this sort of soft vector space, mushy sort of sense.

So I'm really doing some kind of accessing into this information that's stored in the key value store. But I'm sort of softly looking at all of the results. OK, any questions there? Cool. So what might this look like? So if I was trying to represent this sentence, I went to Stanford CS224n and learned.

So I'm trying to build a representation of learned. I have a key for each word. So this is this self-attention thing that we'll get into. I have a key for each word, a value for each word. I've got the query for learned. And I've got these sort of tealish bars up top, which sort of might say how much you're going to try to access each of the words.

Like, oh, maybe 224n is not that important. CS, maybe that determines what I learned. You know, Stanford. And then learned, maybe that's important to representing itself. So you sort of look across at the whole sentence and build up this sort of soft accessing of information across the sentence in order to represent learned in context.

So this is just a toy diagram. So let's get into the math. So we're going to look at a sequence of words. So that's w1 to n, a sequence of words in a vocabulary. So this is like, you know, Zuko made his uncle tea. That's a good sequence. And for each word, we're going to embed it with this embedding matrix, just like we've been doing in this class.

So I have this embedding matrix that goes from the vocabulary size to the dimensionality d. So each word has a non-contextual, only dependent on itself, word embedding. And now I'm going to transform each word with one of three different weight matrices. So this is often called key query value self-attention.

So I have a matrix Q, which is in R d by d. So this maps xi, which is a vector of dimensionality d, to another vector of dimensionality d. And that's going to be a query vector. So it takes an xi and it sort of rotates it, shuffles it around, stretches it, squishes it.

Makes it different. And now it's a query. And now for a different learnable parameter, k-- so that's another matrix. I'm going to come up with my keys. And with a different learnable parameter, v, I'm going to come up with my values. So I'm taking each of the non-contextual word embeddings, each of these xi's, and I'm transforming each of them to come up with my query for that word, my key for that word, and my value for that word.

So every word is doing each of these roles. Next, I'm going to compute all pairs of similarities between the keys and queries. So in the toy example we saw, I was computing the similarity between a single query for the word learned and all of the keys for the entire sentence.

In this context, I'm computing all pairs of similarities between all keys and all values because I want to represent all of these sums. So I've got this sort of dot-- I'm just going to take the dot product between these two vectors. So I've got qi. So this is saying the query for word i dotted with the key for word j.

And I get this score, which is a real value. Might be very large negative, might be zero, might be very large and positive. And so that's like, how much should I look at j in this lookup table? And then I do the softmax. So I softmax. So I say that the actual weight that I'm going to look at j from i is softmax of this over all of the possible indices.

So it's like the affinity between i and j normalized by the affinity between i and all of the possible j prime in the sequence. And then my output is just the weighted sum of values. So I've got this output for word i. So maybe i is like 1 for Zuko.

And I'm representing it as the sum of these weights for all j. So Zuko and made and his and uncle and tea. And the value vector for that word j. I'm looking from i to j as much as alpha ij. What's the dimension of Wi? Oh, Wi, you can either think of it as a symbol in vocab v.

So that's like, you could think of it as a one-hot vector. And yeah, in this case, we are, I guess, thinking of it as-- so a one-hot vector of dimensionality size of vocab. So in the matrix E, you see that it's R d by, with bars around V-- that's the size of the vocabulary.

So when I do E multiplied by Wi, that's taking E, which is d by v, multiplying it by w, which is v, and returning a vector that's dimensionality d. So w in that first line, like w1n, that's a matrix where it has maybe like a column for every word in that sentence.

And each column is a length v. Yeah, usually, I guess we think of it as having a-- I mean, if I'm putting the sequence length index first, you might think of it as having a row for each word. But similarly, yeah, it's n, which is the sequence length. And then the second dimension would be v, which is the vocabulary size.

And then that gets mapped to this thing, which is sequence length by d. Why do we learn two different matrices, q and k, when q transpose-- qi transpose kj is really just one matrix in the middle? That's a great question. It ends up being because this will end up being a low-rank approximation to that matrix.

So it is for computational efficiency reasons. Although it also, I think, feels kind of nice in the presentation. But yeah, what we'll end up doing is having a very low-rank approximation to qk transpose. And so you actually do do it like this. It's a good question. Is e i i, so the query with its own key, anything specific?

Sorry, could you repeat that for me? This eii, so the query of the word dotted with the key by itself, does it look like an identity, or does it look like anything in particular? That's a good question. OK, let me remember to repeat questions. So does eii, for j equal to i, so looking at itself, look like anything in particular?

Does it look like the identity? Is that the question? OK, so right, it's unclear, actually. This question of should you look at yourself for representing yourself, well, it's going to be encoded by the matrices q and k. If I didn't have q and k in there, if those were the identity matrices, if q is identity, k is identity, then this would be sort of dot product with yourself, which is going to be high on average, like you're pointing in the same direction as yourself.

But it could be that qxi and kxi might be sort of arbitrarily different from each other, because q could be the identity, and k could map you to the negative of yourself, for example, so that you don't look at yourself. So this is all learned in practice. So you end up-- it can sort of decide by learning whether you should be looking at yourself or not.

And that's some of the flexibility that parametrizing it as q and k gives you that wouldn't be there if I just used xis everywhere in this equation. I'm going to try to move on, I'm afraid, because there's a lot to get on. But we'll keep talking about self-attention. And so as more questions come up, I can also potentially return back.
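To make the key-query-value equations above concrete, here's a rough sketch in PyTorch that follows the per-word definitions just given. The variable names, sizes, and random weights are purely illustrative stand-ins, not anything from the lecture or assignment.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d = 5, 16                              # sequence length, model dimensionality
x = [torch.randn(d) for _ in range(n)]    # non-contextual word embeddings x_1 ... x_n

# Learnable maps from R^d to R^d (random stand-ins for trained parameters).
Q, K, V = torch.randn(d, d), torch.randn(d, d), torch.randn(d, d)

q = [Q @ xi for xi in x]                  # query for each word
k = [K @ xi for xi in x]                  # key for each word
v = [V @ xi for xi in x]                  # value for each word

outputs = []
for i in range(n):
    # e_ij = q_i . k_j : unnormalized affinity between positions i and j
    e_i = torch.stack([q[i] @ k[j] for j in range(n)])
    # alpha_ij = softmax over j of e_ij : weights between 0 and 1 that sum to 1
    alpha_i = F.softmax(e_i, dim=0)
    # output_i = sum_j alpha_ij * v_j : weighted average of the value vectors
    outputs.append(sum(alpha_i[j] * v[j] for j in range(n)))
```

The loops here are only for clarity; later in the lecture the same computation is restated as a few big matrix multiplies.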

OK, so this is our basic building block. But there are a bunch of barriers to using it as a replacement for LSTMs. And so what we're going to do for this portion of the lecture is talk about the minimal components that we need in order to use self-attention as sort of this very fundamental building block.

So we can't use it as it stands as I've presented it, because there are a couple of things that we need to sort of solve or fix. One of them is that there's no notion of sequence order in self-attention. So what does this mean? If I have a sentence like-- I'm going to move over here to the whiteboard briefly, and hopefully I'll write quite large.

If I have a sentence like, Zuko made his uncle. And let's say, his uncle made Zuko. If I were to embed each of these words using its embedding matrix, the embedding matrix isn't dependent on the index of the word. So this is the word index 1, 2, 3, 4, versus now his is over here, and uncle.

And so when I compute the self-attention-- and there's a lot more on this in the lecture notes that goes through a full example-- the actual self-attention operation will give you exactly the same representations for this sequence, Zuko made his uncle, as for this sequence, his uncle made Zuko. And that's bad, because they're sentences that mean different things.

And so it's this idea that self-attention is an operation on sets. You have a set of vectors that you're going to perform self-attention on, and nowhere does the exact position of the words come into play directly. So we're going to encode the position of words through the keys, queries, and values that we have.

So consider now representing each sequence index-- our sequences are going from 1 to n-- as a vector. So don't worry so far about how it's being made, but you can imagine representing the number 1, the position 1, the position 2, the position 3, as a vector in the dimensionality d, just like we're representing our keys, queries, and values.

And so these are position vectors. If you were to want to incorporate the information represented by these positions into our self-attention, you could just add these vectors, these p i vectors, to the inputs. So if I have this xi embedding of a word, which is the word at position i, but really just represents, oh, the word zuko is here, now I can say, oh, it's the word zuko, and it's at position 5, because this vector represents position 5.

So how do we do this? And we might only have to do this once. So we can do it once at the very input to the network, and then that is sufficient. We don't have to do it at every layer, because it knows from the input. So one way in which people have done this is look at these sinusoidal position representations.

So this looks a little bit like this, where you have-- so this is a vector p i, which is in dimensionality d. And for each one of the dimensions, you take the value i, you modify it by some constant, and you pass it to the sine or cosine function, and you get these values that vary periodically, with differing periods depending on which of the d dimensions you're in.

So I've got this sort of a representation of a matrix, where d is the vertical dimension, and then n is the horizontal. And you can see that there's sort of like, oh, as I walk along, you see the period of the sine function going up and down, and each of the dimensions d has a different period.

And so together, you can represent a bunch of different sort of position indices. And it gives this intuition that, oh, maybe sort of the absolute position of a word isn't as important. You've got the sort of periodicity of the sines and cosines. And maybe that allows you to extrapolate to longer sequences.

But in practice, that doesn't work. But this is sort of like an early notion that is still sometimes used for how to represent position in transformers and self-attention networks in general. So that's one idea. You might think it's a little bit complicated, a little bit unintuitive. Here's something that feels a little bit more deep learning.

So we're just going to say, oh, I've got a maximum sequence length of n. And I'm just going to learn a matrix that's dimensionality d by n. And that's going to represent my positions. And I'm going to learn it as a parameter, just like I learn every other parameter.

And what do they mean? Oh, I have no idea. But it represents position. So you just sort of add this matrix to the xi's, your input embeddings. And it learns to fit to data. So whatever representation of position that's linear, sort of index-based that you want, you can learn.
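As a rough sketch of both options in code: the sinusoidal formula below is the one from the original Transformer paper, which is what's being described here, and the learned version is just a parameter matrix. All names and sizes are illustrative assumptions, not anything specified in the lecture.

```python
import torch
import torch.nn as nn

def sinusoidal_positions(n, d):
    """Fixed position vectors p_1..p_n, each in R^d (d assumed even).
    Even dimensions get sine, odd dimensions get cosine, with periods
    that differ across the d dimensions."""
    pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)     # positions 0 .. n-1
    j = torch.arange(0, d, 2, dtype=torch.float32)              # even dimension indices
    angles = pos / (10000 ** (j / d))                           # shape (n, d/2)
    p = torch.zeros(n, d)
    p[:, 0::2] = torch.sin(angles)
    p[:, 1::2] = torch.cos(angles)
    return p

class LearnedPositions(nn.Module):
    """Learned alternative: one d-vector per position, up to a max length."""
    def __init__(self, max_n, d):
        super().__init__()
        self.pos = nn.Parameter(torch.randn(max_n, d) * 0.02)
    def forward(self, seq_len):
        # Only positions up to max_n exist; a longer input simply can't be handled.
        return self.pos[:seq_len]

# Either way, the position vectors are added to the word embeddings once, at the input.
n, d = 8, 16
word_embeddings = torch.randn(n, d)
inputs = word_embeddings + sinusoidal_positions(n, d)
```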

And the cons are that, well, you definitely now can't represent anything that's longer than n words long, right? You can't handle any sequence longer than n because, well, you've only learned a matrix with that many positions. And so in practice, you'll get a model error if you pass a self-attention model something longer than length n.

It will just sort of crash and say, I can't do this. And so this is sort of what most systems nowadays use. There are more flexible representations of position, including a couple in the lecture notes. You might want to look at the relative linear position, or words before or after each other, but not their absolute position.

There's also some sort of representations that harken back to our dependency syntax. Because, oh, maybe words that are close in the dependency parse tree should be the things that are sort of close in the self-attention operation. OK, questions? In practice, do we typically just make n large enough that we don't run into the issue of having something that can be input longer than n?

So the question is, in practice, do we just make n long enough that we don't run into the problem where we're going to look at a text longer than n? No, in practice, it's actually quite a problem. Even today, even in the largest language models-- can I fit this prompt into ChatGPT or whatever?

It's a thing that you might see on Twitter. These continue to be issues. And part of it is because the self-attention operation-- and we'll get into this later in the lecture-- it's quadratic complexity in the sequence length. So you're going to spend n squared memory budget in order to make sequence lengths longer.

So in practice, this might be on a large model, say, 4,000 or so. n is 4,000, so you can fit 4,000 words, which feels like a lot, but it's not going to fit a novel. It's not going to fit a Wikipedia page. And there are models that do longer sequences, for sure.

And again, we'll talk a bit about it, but no, this actually is an issue. How do you know that the p you learned represents position and not anything else? A priori, I don't. Yeah. So how do you know that the p that you've learned, this matrix that you've learned, is representing position as opposed to anything else?

And the reason is the only thing it correlates with is position. So when I see these vectors, I'm adding this p matrix to my x matrix, the word embeddings. I'm adding them together. And the words that show up at each index will vary depending on what word actually showed up there in the example.

But the p matrix never differs. It's always exactly the same at every index. And so it's the only thing in the data that it correlates with. So you're learning it implicitly. This vector at index 1 is always at index 1 for every example, for every gradient update. And nothing else co-occurs like that.

Yeah. So what do you end up learning? I don't know. It's unclear. But it definitely allows you to know, oh, this word is at this index. Yeah. OK. Yeah. Just quickly, when you say quadratic complexity in space, what is a sequence defined as right now? Is it a sequence of words?

Or, I'm trying to figure out what unit it's using. OK. So the question is, when this is quadratic in the sequence, is that a sequence of words? Yeah. Think of it as a sequence of words. Sometimes there'll be pieces that are smaller than words, which we'll go into in the next lecture.

But yeah. Think of this as a sequence of words, but not necessarily just for a sentence, maybe for an entire paragraph, or an entire document, or something like that. OK. But the attention is word to word. Yeah, the attention is between words. OK. Cool. I'm going to move on.

OK. Right. So we have another problem. Another is that, based on the presentation of self-attention that we've done, there are really no nonlinearities for deep learning magic. We're just computing weighted averages of stuff. So if I apply self-attention, and then apply self-attention again, and then again, and again, and again-- you should look at the lecture notes if you're interested in this.

It's actually quite cool. But what you end up doing is you're just re-averaging value vectors together. So you're computing averages of value vectors, and it ends up looking like one big self-attention. But there's an easy fix to this if you want the traditional deep learning magic. And you can just add a feed-forward network to post-process each output vector.

So I've got a word here. That's the output of self-attention. And I'm going to pass it through-- in this case, I'm calling it a multilayer perceptron, MLP. So it takes in as input a vector in R d and outputs a vector in R d. And you do the usual multilayer perceptron thing, where you take the input, multiply it by a matrix, pass it through a nonlinearity, multiply it by another matrix.

And so what this looks like in self-attention is that I've got this sentence, the chef who-- da, da, da, da, da-- food. And I've got my embeddings for it. I pass it through this whole big self-attention block, which looks at the whole sequence and incorporates context and all that.

And then I pass each one individually through a feed-forward layer. So this embedding, that's the output of the self-attention for the word "the," is passed independently through a multilayer perceptron here. And you can think of it as combining together or processing the result of attention. So there's a number of reasons why we do this.

One of them also is that you can actually stack a ton of computation into these feed-forward networks very, very efficiently, very parallelizable, very good for GPUs. But this is what's done in practice. So you do self-attention, and then you can pass it through this position-wise feed-forward layer. Every word is processed independently by this feed-forward network to process the result.
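Here's a minimal sketch of that position-wise feed-forward layer. The hidden size of 4 times the model dimensionality is a common convention, not something specified here, and the names are mine.

```python
import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    """Applied to every position independently: W2 * relu(W1 * h_i + b1) + b2."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden)
        self.w2 = nn.Linear(d_hidden, d_model)

    def forward(self, h):                   # h: (n, d_model), one row per word
        return self.w2(torch.relu(self.w1(h)))

ffn = PositionwiseFeedForward(d_model=16, d_hidden=64)   # d_hidden = 4 * d_model, a common choice
out = ffn(torch.randn(5, 16))               # each of the 5 positions is processed independently
```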

So that's adding our classical deep learning nonlinearities for self-attention. And that's an easy fix for this no nonlinearities problem in self-attention. And then we have a last issue before we have our final minimal self-attention building block with which we can replace RNNs. And that's that-- well, when I've been writing out all of these examples of self-attention, you can look at the entire sequence.

And in practice, for some tasks, such as machine translation or language modeling, whenever you want to define a probability distribution over a sequence, you can't cheat and look at the future. So at every time step, I could define the set of keys and queries and values to only include past words.

But this is inefficient. Bear with me. It's inefficient because you can't parallelize it so well. So instead, we compute the entire n by n matrix, just like I showed in the slide discussing self-attention. And then I mask out words in the future. So for this score, eij-- and I computed eij for all n by n pairs of words-- is equal to whatever it was before if the word that you're looking at, index j, is an index that is less than or equal to where you are, index i.

And it's equal to negative infinity-ish otherwise, if it's in the future. And when you softmax the eij, negative infinity gets mapped to 0. So now my attention is weighted 0. My weighted average is 0 on the future. So I can't look at it. What does this look like? So in order to encode these words, the chef who-- maybe the start symbol there-- I can look at these words.

That's all pairs of words. And then I just gray out-- I negative infinity out the words I can't look at. So when encoding the start symbol, I can just look at the start symbol. When encoding the, I can look at the start symbol and the. When encoding chef, I can look at start the chef.

But I can't look at who. And so with this representation of chef that is only looking at start the chef, I can define a probability distribution using this vector that allows me to predict who without having cheated by already looking ahead and seeing that, well, who is the next word.
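Here's roughly what that masking looks like in code: compute the full n by n score matrix, then fill the future positions with negative infinity before the softmax. The tensors and names here are illustrative, not from the lecture.

```python
import torch
import torch.nn.functional as F

n, d = 5, 16
q, k, v = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)

scores = q @ k.T                                        # e_ij for all n x n pairs at once

# Position i may only look at positions j <= i.
future = torch.triu(torch.ones(n, n), diagonal=1).bool()
scores = scores.masked_fill(future, float('-inf'))      # -inf becomes weight 0 after the softmax

alpha = F.softmax(scores, dim=-1)                       # each row sums to 1 over the allowed past
output = alpha @ v
```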

Questions? So it says for using it in decoders. Do we do this for both the encoding layer and the decoding layer? Or for the encoding layer, are we allowing ourselves to look for-- The question is, it says here that we're using this in a decoder. Do we also use it in the encoder?

So this is the distinction between a bidirectional LSTM and a unidirectional LSTM. So wherever you don't need this constraint, you probably don't use it. So if you're using an encoder on the source sentence of your machine translation problem, you probably don't do this masking because it's probably good to let everything look at each other.

And then whenever you do need to use it because you have this autoregressive probability of word one, probability of two given one, three given two and one, then you would use this. So traditionally, yes, in decoders, you will use it. In encoders, you will not. Yes. My question is a little bit philosophical.

Don't humans actually generate sentences by having some notion of the probability of future words before they say the words-- or before they choose the words-- that they are currently speaking or writing? Good question. So the question is, isn't looking ahead a little bit and predicting or getting an idea of the words that you might say in the future sort of how humans generate language, instead of the strict constraint of not seeing into the future?

Is that what you're-- OK. So right. Trying to plan ahead to see what I should do is definitely an interesting idea. But when I am training the network, I can't-- if I'm teaching it to try to predict the next word, and if I give it the answer, it's not going to learn anything useful.

So in practice, when I'm generating text, maybe it would be a good idea to make some guesses far into the future or have a high-level plan or something. But in training the network, I can't encode that intuition about how humans build-- like, generate sequences of language by just giving it the answer of the future directly, at least, because then it's just too easy.

There's nothing to learn. Yeah. But there might be interesting ideas about maybe giving the network a hint as to what kind of thing could come next, for example. But that's out of scope for this. Yeah. Yeah, question over here. So I understand why we want to mask the future for stuff like language models, but how does it apply to machine translation?

Like, why would we use it there? Yeah. So in machine translation-- I'm going to come over to this board and hopefully get a better marker. Nice. In machine translation, I have a sentence like, "I like pizza." And I want to be able to translate it-- "J'aime la pizza." Nice.

And so when I'm looking at "I like pizza," I get this as the input. And so I want self-attention without masking, because I want "I" to look at "like" and "I" to look at "pizza" and "like" to look at "pizza." And then when I'm generating this, if my tokens are like "J'aime la pizza," then in encoding this first word, I want to be able to look only at myself.

And we'll talk about encoder-decoder architectures in this later in the lecture. But I want to be able to look at myself, none of the future, and all of this. And so what I'm talking about right now in this masking case is masking out with negative infinity all of these words.

So the attention score from "J'aime" to all of the later target words should be negative infinity. Yeah. Does that answer your question? Great. OK, let's move ahead. OK, so that was our last big building block issue with self-attention. So this is what I would call-- and this is my personal opinion-- a minimal self-attention building block.

You have self-attention, the basis of the method. So that's sort of here in the red. And maybe we had the inputs to the sequence here. And then you embed it with that embedding matrix E. And then you add position embeddings. And then these three arrows represent using the key, the value, and the query that's sort of stylized there.

This is often how you see these diagrams. And so you pass it to self-attention with the position representation. So that specifies the sequence order, because otherwise you'd have no idea what order the words showed up in. You have the nonlinearities in sort of the feed-forward network there to provide that squashing and deep learning expressivity.

And then you have masking in order to have parallelizable operations that don't look at the future. So this is sort of our minimal architecture. And then up at the top above here, so you have this thing-- maybe you repeat this sort of self-attention and feedforward many times. So self-attention, feedforward, self-attention, feedforward, self-attention, feedforward.

That's what I'm calling this block. And then maybe at the end of it, you predict something. I don't know. We haven't really talked about that. But you have these representations. And then you predict the next word, or you predict the sentiment, or you predict whatever. So this is like a self-attention architecture.

OK, we're going to move on to the transformer next. So if there are any questions-- yeah? Other way around. We will use masking for decoders, where I want to decode out a sequence where I have an informational constraint, where to represent this word properly, I cannot have the information of the future.

Yeah, OK. OK, great. So now let's talk about the transformer. So what I've pitched to you is what I call a minimal self-attention architecture. And I quite like pitching it that way. But really, no one uses the architecture that was just up on the slide, the previous slide. It doesn't work quite as well as it could.

And there's a bunch of important details that we'll talk about now that goes into the transformer. What I would hope, though, to have you take away from that is that the transformer architecture, as I'll present it now, is not necessarily the end point of our search for better and better ways of representing language, even though it's now ubiquitous and has been for a couple of years.

So think about these sort of ideas of the problems of using self-attention and maybe ways of fixing some of the issues with transformers. OK, so a transformer decoder is how we'll build systems like language models. And so we've discussed this. It's like our decoder with our self-attention-only sort of minimal architecture.

It's got a couple of extra components, some of which I've grayed out here, that we'll go over one by one. The first that's actually different is that we'll replace our self-attention with masking with masked multi-head self-attention. This ends up being crucial. It's probably the most important distinction between the transformer and this sort of minimal architecture that I've presented.

So let's come back to our toy example of attention, where we've been trying to represent the word learned in the context of the sequence, I went to Stanford CS224N and learned. And I was sort of giving these teal bars to say, oh, maybe intuitively you look at various things to build up your representation of learned.

But really, there are varying ways in which I want to look back at the sequence to see varying sort of aspects of information that I want to incorporate into my representation. So maybe in this way, I sort of want to look at Stanford CS224N, because, oh, it's like entities.

You learn different stuff at Stanford CS224N than you do at other courses or other universities or whatever. And so maybe I want to look here for this reason. And maybe in another sense, I actually want to look at the word learned. And I want to look at I. I went and learned.

And I want to see maybe syntactically relevant words. It's very different reasons for which I might want to look at different things in the sequence. And so trying to average it all out with a single operation of self-attention ends up being maybe somewhat too difficult, in a way that we'll make precise in assignment 5.

Nice, we'll do a little bit more math. OK, so any questions about this intuition? Yeah, so each head should be an application of attention just as I've presented it. So each one independently defines the keys, defines the queries, defines the values. I'll define it more precisely here. But think of it as I do attention once, and then I do it again with different parameters, being able to look at different things, et cetera.

How do we ensure that they look at different things? We do not-- OK, so the question is, if we have two separate sets of weights trying to learn, say, to do this and to do that, how do we ensure that they learn different things? We do not ensure that they learn different things.

And in practice, they do, although not perfectly. So it ends up being the case that you have some redundancy, and you can cut out some of these. But that's out of scope for this. But we hope, just like we hope that different dimensions in our feedforward layers will learn different things because of lack of symmetry and whatever, that we hope that the heads will start to specialize.

And that will mean they'll specialize even more. And yeah. OK. All right, so in order to discuss multi-head self attention well, we really need to talk about the matrices, how we're going to implement this in GPUs efficiently. We're going to talk about the sequence-stacked form of attention. So we've been talking about each word sort of individually as a vector in dimensionality D.

But really, we're going to be working on these as big matrices that are stacked. So I take all of my word embeddings, x1 to xn, and I stack them together. And now I have a big matrix that is in dimensionality Rn by D. OK, and now with my matrices K, Q, and V, I can just multiply them on this side of x.

So x is Rn by D. K is Rd by D. So n by D times d by D gives you n by D again. So I can just compute a big matrix multiply on my whole sequence to multiply each one of the words of my key query and value matrices very efficiently.

So this is sort of this vectorization idea. I don't want to for loop over the sequence. I represent the sequence as a big matrix, and I just do one big matrix multiply. Then the output is defined as this sort of inscrutable bit of math, which I'm going to go over visually.

So first, we're going to take the key query dot products in one matrix. So we've got xq, which is Rn by D. And I've got xk transpose, which is Rd by n. So n by D, d by n. This is computing all of the eij's, these scores for self-attention.

So this is all pairs of attention scores computed in one big matrix multiply. OK? So this is this big matrix here. Next, I use the softmax. So I softmax this over the second dimension, the second n dimension. And I get my sort of normalized scores, and then I multiply with xv.

So this is an n by n matrix multiplied by an n by d matrix. And what do I get? Well, this is just doing the weighted average. So this is one big weighted average computation on the whole matrix, giving me my whole self-attention output in R n by d.
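In code, that whole sequence-stacked computation is just a few matrix multiplies. This is a sketch with illustrative names and random weights, not the assignment's implementation.

```python
import torch
import torch.nn.functional as F

n, d = 5, 16
X = torch.randn(n, d)                     # all word embeddings stacked, one row per word
Q, K, V = torch.randn(d, d), torch.randn(d, d), torch.randn(d, d)

XQ, XK, XV = X @ Q, X @ K, X @ V          # (n, d) each: one big multiply, no for-loop over words

scores = XQ @ XK.T                        # (n, n): all pairwise e_ij in one matrix multiply
alpha = F.softmax(scores, dim=-1)         # normalize over the second (key) dimension
output = alpha @ XV                       # (n, d): all the weighted averages at once
```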

So I've just restated identically the self-attention operations, but computed in terms of matrices so that you could do this efficiently on a GPU. OK. So multi-headed attention. This is going to give us-- and it's going to be important to compute this in terms of the matrices, which we'll see.

This is going to give us the ability to look in multiple places at once for different reasons. So self-attention looks where this dot product here is high-- this xi with the Q matrix, dotted with xj with the K matrix. But maybe we want to look in different places for different reasons. So we actually define multiple query, key, and value matrices.

So I'm going to have a bunch of heads. I'm going to have h self-attention heads. And for each head, I'm going to define an independent query, key, and value matrix. And I'm going to say that its shape is going to map from the model dimensionality to the model dimensionality over h.

So each one of these is doing projection down to a lower dimensional space. This is going to be for computational efficiency. And I'll just apply self-attention independently for each output. So this equation here is identical to the one we saw for single-headed self-attention, except I've got these sort of l indices everywhere.

So I've got this lower dimensional thing. I'm mapping to a lower dimensional space. And then I do have my lower dimensional value vector there. So my output is in R d over h. But really, you're doing exactly the same kind of operation. I'm just doing it h different times. And then you combine the outputs.

So I've done sort of look in different places with the different key, query, and value matrices. And then I get each of their outputs. And then I concatenate them together. So each one is dimensionality d over h. And I concatenate them together and then sort of mix them together with the final linear transformation.

And so each head gets to look at different things and construct their value vectors differently. And then I sort of combine the result all together at once. Let's go through this visually, because it's at least helpful for me. It's actually not more costly to do this, really, than it is to compute a single head of self-attention.

And we'll see through the pictures. So in single-headed self-attention, we computed xq. And in multi-headed self-attention, we'll also compute xq the same way. So xq is rn by d. And then we can reshape it into rn, that's sequence length, times the number of heads, times the model dimensionality over the number of heads.

So I've just reshaped it to say, now I've got a big three-axis tensor. The first axis is the sequence length. The second one is the number of heads. The third is this reduced model dimensionality. And that costs nothing. And you do the same thing for xk and xv. And then I transpose so that I've got the head axis as the first axis.

And now I can compute all my other operations with the head axis, kind of like a batch. So what does this look like in practice? Instead of having one big xq matrix of model dimensionality d, I've got, in this case, three xq matrices of dimensionality d over 3 each.

Same thing with the key matrix here. So everything looks almost identical. It's just the reshaping of the tensors. And now, at the output of this, I've got three sets of attention scores just by doing this reshape. And the cost is that, well, each of my attention heads has only a d over h dimensional vector to work with instead of a d-dimensional vector to work with.

So I get the output. I get these three sets of pairs of scores. I compute the softmax independently for each of the three. And then I have three value matrices there as well, each of them lower dimensional. And then finally, I get my three different output vectors. And I have a final linear transformation to mush them together.
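Here's a sketch of that reshape trick, with made-up sizes; the point is that the head axis just acts like a batch dimension, so the cost is essentially the same as a single head.

```python
import torch
import torch.nn.functional as F

n, d, h = 6, 16, 4                        # sequence length, model dimensionality, number of heads
d_head = d // h

X = torch.randn(n, d)
Q, K, V = torch.randn(d, d), torch.randn(d, d), torch.randn(d, d)
W_out = torch.randn(d, d)                 # final linear transformation that mixes the heads

def split_heads(M):                       # (n, d) -> (h, n, d/h): just a reshape, no extra compute
    return M.view(n, h, d_head).transpose(0, 1)

q, k, v = split_heads(X @ Q), split_heads(X @ K), split_heads(X @ V)

scores = q @ k.transpose(-2, -1)          # (h, n, n): one set of attention scores per head
alpha = F.softmax(scores, dim=-1)
per_head = alpha @ v                      # (h, n, d/h): each head's weighted averages

# Concatenate the heads back into (n, d) and mix them with the output transformation.
combined = per_head.transpose(0, 1).contiguous().view(n, d)
output = combined @ W_out
```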

And I get an output. And in summary, what this allows you to do is exactly what I gave in the toy example, which was I can have each of these heads look at different parts of a sequence for different reasons. So this is at a given block, right? All of these attention heads are for a given transformer block.

A next block could also have three attention heads. The question is, are all of these for a given block? And we'll talk about a block again. But this block was this sort of pair of self-attention and feed-forward network. So you do self-attention, feed-forward. That's one block. Another block is another self-attention, another feed-forward.

And the question is, are the parameters shared between the blocks or not? Generally, they are not shared. You'll have independent parameters at every block, although there are some exceptions. Building on that, is it typically the case that you have the same number of heads at each block? Or do you vary the number of heads across blocks?

You have this-- you definitely could vary it. People haven't found reason to vary-- so the question is, do you have different numbers of heads across the different blocks? Or do you have the same number of heads across all blocks? The simplest thing is to just have it be the same everywhere, which is what people have done.

I haven't yet found a good reason to vary it, but it could be interesting. It's definitely the case that after training these networks, you can actually just totally zero out, remove some of the attention heads. And I'd be curious to know if you could remove more or less, depending on the layer index, which might then say, oh, we should just have fewer.

But again, it's not actually more expensive to have a bunch. So people tend to instead set the number of heads to be roughly so that you have a reasonable number of dimensions per head, given the total model dimensionality d that you want. So for example, I might want at least 64 dimensions per head, which if d is 128, that tells me how many heads I'm going to have, roughly.

So people tend to scale the number of heads up with the model dimensionality. Yeah, with that xq, by slicing it into different columns, you're reducing the rank of the final matrix, right? Yeah. But does that not really have any effect on the results? So the question is, by having these reduced xq and xk matrices, this is a very low rank approximation.

This little sliver and this little sliver defining this whole big matrix, it's very low rank. Is that not bad? In practice, no. I mean, again, it's the reason why we limit the number of heads depending on the model dimensionality, because you want intuitively at least some number of dimensions.

So 64 is sometimes done, 128, something like that. But if you're not giving each head too much to do, and it's got sort of a simple job, you've got a lot of heads, it ends up sort of being OK. All we really know is that empirically, it's way better to have more heads than one.

Yes. I'm wondering, have there been studies to see if the information in one of the sets of attention scores-- like the information that one of the heads learns-- is consistent, or how they are related to each other? So the question is, have there been studies to see if there's consistent information encoded by the attention heads?

And yes. Actually, there's been quite a lot of study and interpretability and analysis of these models to try to figure out what roles, what sort of mechanistic roles each of these heads takes on. And there's quite a bit of exciting results there around some attention heads learning to pick out the syntactic dependencies, or maybe doing a global averaging of context.

The question is quite nuanced, though, because in a deep network, it's unclear-- and we should talk about this more offline-- it's unclear if you look at a word 10 layers deep in a network what you're really looking at, because it's already incorporated context from everyone else, and it's a little bit unclear.

Active area of research. But I think I should move on now to keep discussing transformers. But yeah, if you want to talk more about it, I'm happy to. OK. So another sort of hack that I'm going to toss in here-- I mean, maybe they wouldn't call it hack, but it's a nice little method to improve things.

It's called scaled dot product attention. So one of the issues with this sort of key query value self-attention is that when the model dimensionality becomes large, the dot products between vectors, even random vectors, tend to become large. And when that happens, the inputs to the softmax function can be very large, making the gradient small.

So intuitively, if you have two random vectors and model dimensionality d, and you just dot product them together, as d grows, their dot product grows in expectation to be very large. And so you sort of want to start out with everyone's attention being very uniform, very flat, sort of look everywhere.

But if some dot products are very large, then learning will be inhibited. And so what you end up doing is you just sort of-- for each of your heads, you just sort of divide all the scores by this constant that's determined by the model dimensionality. So as the vectors grow very large, their dot products don't, at least at initialization time.

So this is sort of a nice little detail-- maybe not fundamental, but it's important to know. And so that's called scaled dot product attention. From here on out, we'll just assume that we do this. It's quite easy to implement. You just do a little division in all of your computations.
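Concretely, it's just something like this sketch, where d_head stands for the per-head dimensionality d/h and the tensors are random placeholders.

```python
import math
import torch
import torch.nn.functional as F

n, d_head = 5, 64
q, k, v = torch.randn(n, d_head), torch.randn(n, d_head), torch.randn(n, d_head)

scores = (q @ k.T) / math.sqrt(d_head)    # divide by sqrt of the per-head dimensionality
alpha = F.softmax(scores, dim=-1)         # stays closer to uniform at initialization
output = alpha @ v
```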

OK, so now in the transformer decoder, we've got a couple of other things that I have unfaded out here. We have two big optimization tricks-- or optimization methods, I should say, really, because these are quite important-- that end up being very important. We've got residual connections and layer normalization.

And in transformer diagrams that you see sort of around the web, they're often written together as this add and norm box. And in practice, in the transformer decoder, I'm going to apply masked multi-head attention and then do this sort of optimization, add and norm. Then I'll do a feed-forward application and then another add and norm.

So this is quite important. So let's go over these two individual components. The first is residual connections. I mean, I think we've talked about residual connections before, right? So it's worth doing it again. But it's really a good trick to help models train better. So just to recap, we're going to take-- instead of having this sort of-- you have a layer, layer i minus 1, and you pass it through a thing.

Maybe it's self-attention. Maybe it's a feed forward network. Now you've got layer i. I'm going to add the result of layer i to its input here. So now I'm saying I'm just going to compute the layer, and I'm going to add in the input to the layer so that I only have to learn the residual from the previous layer.

So I've got this sort of connection here. It's often written as this. It's sort of like, boop, connection. It goes around. And you should think that the gradient is just really great through the residual connection. Like, ah, if I've got vanishing or exploding gradient-- vanishing gradients through this layer, well, I can at least learn everything behind it because I've got this residual connection where the gradient is 1 because it's the identity.

This is really nice. And it also maybe is like a-- at least at initialization, everything looks a little bit like the identity function now, right? Because if the contribution of the layer is somewhat small because all of your weights are small, and I have the addition from the input, maybe the whole thing looks a little bit like the identity, which might be a good sort of place to start.

And there are really nice visualizations. I just love this visualization. So this is your loss landscape. So you're doing gradient descent, and you're trying to traverse the mountains of the loss landscape. This is like the parameter space. And down is better in your loss function. And it's really hard. So you get stuck in some local optima, and you can't sort of find your way to get out.

And then this is with residual connections. I mean, come on. You just sort of walk down. I mean, that's not actually, I guess, really how it works all the time. But I really love this. It's great. OK. So yeah, we've seen residual connections. We should move on to layer normalization.

So layer norm is another thing to help your model train faster. And the intuitions around layer normalization and sort of the empiricism of it working very well maybe aren't perfectly, let's say, connected. But you should imagine, I suppose, that we want to cut down on this variation within each layer. Things can get very big.

Things can get very small. That's not actually informative because of variations between maybe the gradients. Or I've got sort of weird things going on in my layers that I can't totally control. I haven't been able to sort of make everything behave sort of nicely where everything stays roughly the same norm.

Maybe some things explode. Maybe some things shrink. And I want to cut down on sort of uninformative variation between layers. So I'm going to let x in R d be an individual word vector in the model. So this is like I have a single index, one vector. And what I'm going to try to do is just normalize it.

Normalize it in the sense of-- it's got a bunch of variation, and I'm going to cut down on that. I'm going to normalize it to zero mean and unit standard deviation. So I'm going to estimate the mean here across-- so for all of the dimensions in the vector, so j equals 1 to the model dimensionality, I'm going to sum up the values.

So I've got this one big word vector. And I sum up all the values. Division by d here, that's the mean. I'm going to have my estimate of the standard deviation. Again, these should say estimates. This is my simple estimate of the standard deviation of the values within this one vector.

And then possibly, I guess, I can have learned parameters to try to scale things back out, multiplicatively and additively, here. That's optional. We're going to compute this standardization. I'm going to take my vector x, subtract out the mean, divide by the standard deviation, plus this epsilon constant.

If there's not a lot of variation, I don't want things to explode, so I'm going to have this epsilon there that's close to 0. So this part here-- x minus mu, divided by sigma plus epsilon-- is saying take all the variation and normalize it to zero mean and unit standard deviation.

And then maybe I want to scale it, stretch it back out, and then maybe add an offset beta that I've learned. Although in practice-- and we discuss this in the lecture notes-- this part maybe isn't actually that important. But with layer normalization, you can think of it this way: when I get the output of layer normalization, it's going to look nice and consistent to the next layer, independent of what's gone on before, because it's going to have zero mean and unit standard deviation.
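As a rough from-scratch sketch of what was just described (my own illustration; PyTorch's built-in nn.LayerNorm does the same thing), with the optional learned scale and offset:

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(d_model))   # optional learned scale
        self.beta = nn.Parameter(torch.zeros(d_model))   # optional learned offset

    def forward(self, x):
        # x has shape (..., d_model); statistics are estimated over the last
        # dimension only, i.e. within each individual word vector.
        mu = x.mean(dim=-1, keepdim=True)
        sigma = x.std(dim=-1, keepdim=True, unbiased=False)
        x_hat = (x - mu) / (sigma + self.eps)            # zero mean, unit std
        return self.gamma * x_hat + self.beta
```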

So maybe that makes for a better thing to learn off of for the next layer. OK, any questions on residual connections or layer norm? Yes. What would it mean to subtract the scalar mu from the vector x? Yeah, it's a good question. When I subtract the scalar mu from the vector x, I broadcast mu to dimensionality d and subtract mu from all d components.

Yeah, good point. Thank you. That was unclear. Yeah. In the fourth bullet, maybe I'm confused-- is it divided by d? Sorry, can you repeat that? In the fourth bullet point, when you're calculating the mean, is it divided by d, or-- maybe I'm just confused.

I think it is divided by d. Yeah. Oh, these are-- so this is the average deviation from the mean of all of the values. Yeah. Yes. So if you have five words in a sentence that you layer norm, do you normalize based on the statistics of these five words, or do you do it one word at a time?

So the question is, if I have five words in the sequence, do I normalize by aggregating the statistics to estimate mu and sigma across all the five words-- share their statistics-- or do it independently for each word? This is a great question, which I think in all the papers that discuss transformers is underspecified.

You do not share across the five words, which is somewhat confusing to me. So each of the five words is done completely independently. You could have shared across the five words and said that my estimate of the statistics are just based on all five, but you do not. I can't pretend I understand totally why.

For example, per batch or per output of the same position? So similar question. The question is, if you have a batch of sequences, so just like we were doing batch-based training, do you for a single word-- now, we don't share across the sequence index for sharing the statistics, but do you share across the batch?

And the answer is no. You also do not share across the batch. In fact, layer normalization was sort of invented as a replacement for batch normalization, which did just that. And the issue with batch normalization is that now your forward pass sort of depends in a way that you don't like on examples that should be not related to your example.
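To make those two answers concrete, here's a small check (my own illustration) that the statistics are computed per word vector-- not shared across positions in the sequence, and not shared across the batch-- which is also what torch.nn.LayerNorm does when you give it only the model dimension:

```python
import torch
import torch.nn as nn

batch, seq_len, d_model = 3, 5, 8
x = torch.randn(batch, seq_len, d_model)

ln = nn.LayerNorm(d_model)          # normalizes over the last dimension only
y = ln(x)

# Every (example, position) slot gets its own mean and standard deviation:
print(y.mean(dim=-1)[0])                 # roughly 0 for each of the 5 positions
print(y.std(dim=-1, unbiased=False)[0])  # roughly 1 for each of the 5 positions
```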

And so, yeah, you don't share statistics across the batch. OK. Cool. OK, so now we have our full transformer decoder, and we have our blocks. So in this sort of slightly grayed out thing here that says repeat for a number of decoder blocks, each block consists of-- I pass it through self-attention, and then my add and norm.

So I've got this residual connection here that goes around-- add-- I've got the layer normalization there, and then a feed-forward layer, and then another add and norm. And that set of four operations is called a single block, and I apply it for some number of blocks.
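As a rough sketch of one such block (my own simplified post-norm illustration in PyTorch, not the lecture's code; nn.MultiheadAttention handles the heads, and the causal mask is passed in):

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Self-attention -> add & norm -> feed-forward -> add & norm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))  # one hidden layer
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, causal_mask):
        # Masked multi-head self-attention, then residual add and layer norm.
        attn_out, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + attn_out)
        # Feed-forward sublayer, then another residual add and layer norm.
        return self.norm2(x + self.ff(x))

# A causal mask: True above the diagonal means "may not attend to the future."
n = 10
mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
out = DecoderBlock()(torch.randn(2, n, 512), mask)
```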

And that's it. That's the transformer decoder as it is. Cool, so that's a whole architecture right there. We've solved things like needing to represent position. We've solved things like not being able to look into the future. We've solved a lot of different optimization problems. You had a question? Yes.

Yes. Yes, masked multi-head attention, yeah. With the dot product scaled by the square root of d over h as well, yeah. So the question is, how do these models handle variable-length inputs? Yeah, so the input to the forward pass on the GPU is going to be a constant length.

So you're going to maybe pad to a constant length. And in order to not attend to that padding, you can mask out the pad tokens, just like the masking that we showed for not looking at the future. You can just set all of the attention weights to 0-- or the scores to negative infinity-- for all of the pad tokens.
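As a rough sketch (my own illustration with made-up shapes), masking the future and masking the pad tokens both come down to filling scores with negative infinity before the softmax:

```python
import torch

n = 6
scores = torch.randn(n, n)                 # raw attention scores, one row per query

# Causal mask: position i may not look at positions j > i.
future = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)

# Pad mask: pretend the last two tokens of this sequence are padding.
is_pad = torch.tensor([False, False, False, False, True, True])

scores = scores.masked_fill(future, float("-inf"))
scores = scores.masked_fill(is_pad.unsqueeze(0), float("-inf"))  # nobody attends to pads

weights = scores.softmax(dim=-1)           # masked positions get attention weight 0
```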

Yeah, exactly. So you can set everything to this maximum length. Now, in practice-- so the question was, do you set this length so that everything has that maximum length? I mean, yes, often, although you can save computation by setting it to something smaller. And the math all still works out.

You just have to code it properly so it can handle that-- instead of setting everything to n, you set it all to, say, 5 if everything is shorter than length 5, and you save a lot of computation. All of the self-attention operations just work. So yeah. How many layers are in the feedforward normally?

There's one hidden layer in the feedforward usually. Oh, just one? Yeah. OK, I should move on. We've got a couple more things and not very much time. OK. But I'll be here after the class as well. So in the encoder-- so the transformer encoder is almost identical. But again, we want bidirectional context.

And so we just don't do the masking. So in my multi-head attention here, I've got no masking. And it's that easy to make the model bidirectional. So that's easy. That's called the transformer encoder. It's almost identical, but with no masking. And then finally, we've got the transformer encoder-decoder, which is actually how the transformer was originally presented in the paper "Attention Is All You Need." And this is when we want a bidirectional encoder together with a unidirectional decoder.

Here's the encoder. It takes in, say, my source sentence for machine translation. Its multi-head attention is not masked. And I have a decoder to decode out my sentence. But you'll see that this is slightly more complicated. I have my masked multi-head self-attention, just like I had before in my decoder.

But now I have an extra operation, which is called cross-attention, where I am going to use my decoder vectors as my queries. Then I'll take the output of the encoder as my keys and values. So now for every word in the decoder, I'm looking at all the possible words in the output of the encoder.

Yes? How do we get the keys and values separated from the output? Because didn't we collapse those into a single output? So we-- sorry. How will we get the keys and values out? Because when we have the output, didn't we collapse the keys and values into a single output?

So the output-- Yeah, the question is, how do you get the keys and values and queries out of this single collapsed output? Now, remember, the output for each word is just this weighted average of the value vectors for the previous words. And then from that output for the next layer, we apply a new key, query, and value transformation to each of them for the next layer of self-attention.

So it's not actually that you're-- Yeah, you apply the key matrix, the query matrix, to the output of whatever came before it. Yeah. And so just in a little bit of math, we have these vectors, h1 through hn, I'm going to call them the output of the encoder. And then I've got vectors that are the output of the decoder.

So I've got these z's I'm calling the output of the decoder. And then I simply define my keys and my values from the encoder vectors, these h's. So I take the h's, I apply a key matrix and a value matrix, and then I define the queries from my decoder.
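In a rough sketch (my own single-head illustration with made-up dimensions and no output projection), that math looks like:

```python
import torch

d = 64
n_src, n_tgt = 7, 5
H = torch.randn(n_src, d)    # h_1 ... h_n: the encoder outputs
Z = torch.randn(n_tgt, d)    # z_1 ... z_m: the decoder states

Wk, Wv, Wq = (torch.randn(d, d) for _ in range(3))

K = H @ Wk                   # keys from the encoder
V = H @ Wv                   # values from the encoder
Q = Z @ Wq                   # queries from the decoder

scores = Q @ K.T / d**0.5    # (n_tgt, n_src): each decoder word scores every encoder word
weights = scores.softmax(dim=-1)
out = weights @ V            # weighted average of encoder value vectors
```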

So my queries here-- so this is why two of the arrows come from the encoder, and one of the arrows comes from the decoder. I've got my z's here, my queries, my keys and values from the encoder. So that is it. I've got a couple of minutes. I want to discuss some of the results of transformers, and I'm happy to answer more questions about transformers after class.

So really, with the original results of transformers, they had this big pitch: look, you can do way more computation because of parallelization. They got great results in machine translation. So you had transformers doing quite well, although not astoundingly better than existing machine translation systems. But they were significantly more efficient to train.

Because you don't have the recurrence holding back parallelization, you could compute on much more data much faster, and you could make much better use of fast GPUs. After that, there were things like document generation, where the old standard was sequence-to-sequence models with LSTMs. And eventually, everything became transformers all the way down.

Transformers also enabled this revolution into pre-training, which we'll go over next class. And the efficiency, the parallelizability, allows you to compute on tons and tons of data. And so after a certain point, on standard large benchmarks, everything became transformer-based. This ability to make use of lots and lots of data and lots and lots of compute just put transformers head and shoulders above LSTMs in, let's say, almost every modern advancement in natural language processing.

There are many drawbacks and variants to transformers. The clearest one that people have tried to work on quite a bit is this quadratic compute problem. So this all pairs of interactions means that our total computation for each block grows quadratically with the sequence length. And in a student's question, we heard that, well, as the sequence length becomes long, if I want to process a whole Wikipedia article, a whole novel, that becomes quite unfeasible.

And actually, that's a step backwards in some sense, because for recurrent neural networks, it only grew linearly with the sequence length. Other things people have tried to work on are better position representations, because the absolute index of a word is not really the best way maybe to represent its position in a sequence.

And just to give you an intuition of the quadratic cost in sequence length, remember that we had this big matrix multiply here that resulted in an n by n matrix. And computing this is a big cost-- it costs a lot of memory. And so if you think of the model dimensionality as something like 1,000-- although today it gets much larger-- then for a short sequence of n roughly 30, if you're computing n squared times d, 30 isn't so bad.

But if you had something like 50,000, then n squared becomes huge and totally infeasible. So people have tried to map things down to a lower-dimensional space to get rid of the quadratic computation. But in practice, as people have gone to things like GPT-3 and ChatGPT, most of the computation doesn't show up in the self-attention.
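Just to put rough numbers on that n squared times d point (my own back-of-the-envelope arithmetic, not from the lecture):

```python
d = 1_000                      # model dimensionality, as in the example above
for n in (30, 50_000):         # short sequence vs. a very long document
    print(f"n = {n:>6}: n^2 * d = {n**2 * d:.1e} multiply-adds for the score matrix")
# n =     30: 9.0e+05
# n =  50000: 2.5e+12
```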

So people are wondering, is it even necessary to get rid of the self-attention operation's quadratic cost? It's an open area of research whether this is necessary. And then finally, there have been a ton of modifications to the transformer over the last four or five years. And it turns out that the original transformer, plus maybe a couple of modifications, is still pretty much the best thing there is.

There have been a couple of things that end up being important. Changing out the nonlinearities in the feedforward network ends up being important. But it's had lasting power so far. But I think it's ripe for people to come through and think about how to sort of improve it in various ways.

So pre-training is on Tuesday. Good luck on assignment four. And then we'll have the project proposal documents out tonight for you to talk about.