
Stanford CS25: V3 | How I Learned to Stop Worrying and Love the Transformer


Transcript

Do folks recognize what this is? There's a Stanford connection, a Stanford location. You know which one? Well, first, what is this? What was going on here? >> It's the first Dartmouth AI conference. >> That's right, yeah. And then what's the association to Stanford? >> I believe this is McCarthy, who started SAIL, if I understand correctly. Is that right, he started SAIL?

Yeah, I think he did. But anyways, what's interesting is that it's amusing to actually look at what they wrote in their, I don't know, brochure, what they wrote as their goals, right? The font is a bit small. Okay: "The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it." Right, fantastic.

So a single machine, and you want to simulate all of human intelligence, okay. And a carefully selected group of scientists. Actually, the paragraph right before the second set of red underlines says we think that a significant advance can be made in one or more of these problems if a carefully selected group of scientists work on it together for a summer. Okay. I don't think they knew of AI winters then; they didn't know of them yet. And the third amusing thing is: the major obstacle is not lack of machine capacity, but our inability to write programs taking full advantage of what we have.

>> So, while the goals are noble, it's surprising how wrong you can be even with some of the smartest people in the room, right? Selfridge, a neural network OG, the original Pandemonium was his; I think he basically set the path for problems in black-box optimization. Then Minsky, of course, Shannon, and Solomonoff, I think it was Solomonoff, MDL.

In many ways, you can argue that's the underpinning of self-supervised learning today. But it's really amusing to see that first claim, because, I mean, at least I don't know if we'll ever be able to characterize or write down all the rules for intelligence. So you can imagine that the approaches they were taking were all these rule-based systems, right?

And they couldn't have been more wrong on machine capacity. Today's transformers don't fit on a single machine; they're data centers, right? And I guess they needed a really, really long summer to solve this one. But yeah, that was 1955, so about 60 years ago. No, not even, it's getting close to 70 years, right?

And we're basically talking about the same problems again, except now some things work and some things don't. This talk is about one of the pieces that has made this larger enterprise work, and about how we're getting closer to the original goals of the Dartmouth conference, yeah.

Okay, so here are the big gaps. What eventually happened in the field was that their goal of having a single system that was able to mimic our cognitive abilities, which would certainly include image understanding and language processing, right?

That idea of a single model, a single approach to do all these things, was shattered into thousands of different research projects. There was no consolidation. But here's another one, and this is going to be a harder one.

Can you tell what this is? This is 2009, and this is not a single system. This is a complicated machine translation system. When I started my PhD, our machine translation systems used to be a bit more complicated than this, actually. These were pipelines of many systems: you first had to do word alignments, which actually looked like attention.

You can think of it as hard attention. Then, based on that, we extracted larger phrases aligned with other phrases. Then you had to teach the model, there was some machine learning there, how to score those phrases connecting with each other.

So can you tell, does anybody know where the neural network is in this? All right. So this is a machine translation system from 2009, and the CSLM is a continuous space language model. That's used for rescoring, right? The world was so discrete then that they had to call these models continuous space language models.

And it was largely inspired by the neural probabilistic language model by, oh, it doesn't appear. Huh. Sorry. Ah, there. The neural probabilistic language model by Bengio, I think it was in 2003. And even in 2013, when I published a paper on neural network language models, the role for neural network language models was still, you know, rescoring.

And it's incredible if you think about it, just in terms of consolidation, how all of these complicated systems have now been replaced by just neurons that talk to each other, and you learn the rules from data automatically. It's interesting to see.

So this is what the EMNLP 2013 conference was like. You see these different areas, you could call it verticalized NLP: morphology, dialogue and discourse. I don't even know if people still talk about dialogue and discourse; you just talk to models now.

I don't know if there's a research track for it anymore. Then there's machine translation, and opinion mining and sentiment analysis; now the models are the ones making you angry or upset. So you can see that even in 2013, research was divided into these smaller tracks, and everybody was bringing their own specific domain information.

You had to specialize in a domain in order to solve some tasks, and we did solve tasks to some degree. In machine translation, probably because of a lot of government funding as well, we had made a lot of progress, and we were building practical translation systems that were being deployed in the wild.

Google Translate was a great example of that, right? And since then, first we all agreed that we need distributed word representations. People probably don't even know this funky embedding algebra of king minus man plus woman equals queen from word2vec anymore.

And there was a big industry of models that just learned word representations, and those word representations turned out to be useful in downstream tasks. And then came another step in this process.

We started saying, okay, these representations are there, but they're only helpful if they're learned in context, right? So "king" should change based on context: the king of Persia, or the king has no clothes, or the emperor has no clothes, right?

And so we saw approaches like sequence-to-sequence learning, where we started to create these general formulations of how to solve any task in NLP, right? With the sequence-to-sequence formulation, you can formulate many language tasks as sequence to sequence: question answering, machine translation, dialogue.

And then, of course, we developed attention, right? Which was a very effective content-based way to summarize information. Typically you have these encoder-decoder architectures; everybody is probably, I'm guessing, familiar with encoder-decoder architectures, right?

So in an encoder-decoder architecture, a position on the decoder side would summarize, based on its content, all the information in the source sentence, right? And this was a really effective content-based way of summarizing information. And what started happening was that these general paradigms started coming up.

Sequence-to-sequence learning can solve most language problems, because most language problems come down to learning representations of variable-length sequences. If you do that successfully, you can then potentially solve the problem. And attention was an excellent, content-based way to summarize information from some neighborhood.

And the major workhorse until then were these recurrent models, LSTMs, right? The method was typically the same: you had a sentence, and you crushed the sentence into a set of vectors, a set of representations, typically one for each position, right?

The way LSTMs did it was that they walked along the sentence, ate up a word, and summarized the entire history into one fixed bottleneck, and that bottleneck was then updated based on the next word. And if you were successfully able to learn these representations, then you could solve these tasks: translation, summarization, dialogue.

So it was an important movement. I guess the sequence-to-sequence learning paper was NeurIPS 2014, and then we saw the attention paper around 2015, and the machine translation community was kind of the first to respond and say, hey, machine translation is a classic sequence-to-sequence learning problem.

Why don't we first start with rescoring? And then can we rethink machine translation natively, from scratch, with sequence-to-sequence models, right? And these are fantastic models. I don't know if you've ever done these exercises showing that LSTMs can count. For example, if you train an encoder-decoder to model a^n b^n,

so you feed in n a's and ask the decoder to predict n b's, then just a single-cell LSTM, if you know the structure of an LSTM, there's a cell state, a notion of state, and just a single cell is able to do this trivial counting.

It counts how many a's you consumed, then decrements, and when you've consumed exactly the same number of b's as a's, something lights up and says, I'm done, I've recognized this string. So you can train a^n b^n trivially.
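To make that exercise concrete, here's a minimal sketch in PyTorch, not the original experiment; the vocabulary, hyperparameters, and training budget are all made up for illustration, and a single hidden unit follows the spirit of the anecdote even though a few units train more reliably in practice. The model predicts the next symbol of strings like aaabbb$, and getting the final $ right requires counting the a's.

```python
import torch
import torch.nn as nn

VOCAB = {"a": 0, "b": 1, "$": 2}  # "$" marks the end of the string

def make_example(n):
    s = "a" * n + "b" * n + "$"
    ids = torch.tensor([VOCAB[c] for c in s])
    return ids[:-1], ids[1:]          # input symbols, next-symbol targets

class TinyLSTM(nn.Module):
    def __init__(self, hidden=1):      # one cell: the "counter"
        super().__init__()
        self.emb = nn.Embedding(3, 8)
        self.lstm = nn.LSTM(8, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 3)

    def forward(self, x):
        h, _ = self.lstm(self.emb(x).unsqueeze(0))
        return self.out(h).squeeze(0)  # (seq_len, 3) logits

model = TinyLSTM(hidden=1)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
for step in range(2000):               # convergence this fast isn't guaranteed
    x, y = make_example(torch.randint(1, 20, (1,)).item())
    loss = loss_fn(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()

# Probe: after eating all the b's, does it "light up" and emit "$",
# even for an n longer than anything seen in training?
x, y = make_example(30)
print(model(x).argmax(-1)[-1] == VOCAB["$"])
```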

And here, I'm sorry this is not clear, but you have a grammar, and you can see these are different cells, there are about eight cells here, and each one of these cells increments its counter once it sees a particular symbol, so it's able to track how deep you are in this hierarchy, in this grammar.

And the crowning achievement, perhaps, of sequence-to-sequence models, and I was fortunate to sit in the same cubicle area as this work was being done, was the Google neural machine translation system, where they took LSTMs, added many advancements, of course a lot of systems improvements and a lot of the data that Google had, and produced what was, at the time, the state-of-the-art neural machine translation system, built on sequence-to-sequence models.

So now this big, complicated system, which had looked much more complicated before, had become a single homogeneous neural network, right? At the time, LSTMs were the primary workhorse, and the biggest frustration we had was that not only did we produce the output autoregressively, sequentially decoding the output left to right, but we were also reading the input sequentially.

In order to produce the representation for the tenth word, you had to eat up the first word, the second word, the third word. So that was really slow. And another big problem with LSTMs was that you have this bottleneck that contains all the information about your past.

You have to pack both the long-distance interactions that you might have and the local interactions into this single fixed vector that you transmit, right? And sequentiality inhibits parallelism, which means the encoder couldn't even read the sentence in parallel, and of course decoding was autoregressive, so you couldn't write in parallel either, right?

And convolutions were starting to emerge as a solution. They had been very successful in computer vision, and people had also figured out how to optimize them well, how to make them really fast on GPUs, because they're basically just matrix multiplications.

And matrix multiplication is parallelizable. So convolutions were a solution to the problem of not being able to read in parallel, because every word could produce its representation in parallel by looking at its local neighbors. And there were some breakthrough papers, such as ByteNet for machine translation and the convolutional sequence-to-sequence model, which was contemporaneous with the transformer, actually probably predated it by a few months, where they used convolutions in both the encoder and decoder to get machine translation scores that were better than the Google neural machine translation system.

And of course, probably the most successful was WaveNet, which was a text-to-speech system that was state-of-the-art at the time. But convolutions still have this problem: they were parallelizable, but you couldn't directly capture long-distance interactions between words.

If your receptive field is, say, a 3 by 3 or a 1 by 3, it grows linearly with the number of layers; each layer expands it by a constant amount. So you still needed a linear number of layers to capture these long-distance relationships.
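Here's the quick arithmetic behind that "linear number of layers" point, as a sketch; dilated or strided convolutions grow the receptive field faster, but the plain stacked-convolution case is the one being described.

```python
import math

def conv_layers_needed(distance, kernel_size=3):
    # Each ordinary conv layer extends the receptive field by (kernel_size - 1)
    # positions, so reaching a token `distance` away takes ~distance/(k-1) layers.
    return math.ceil(distance / (kernel_size - 1))

print(conv_layers_needed(1024, kernel_size=3))   # 512 layers
print(conv_layers_needed(1024, kernel_size=31))  # 35 layers: still linear in distance
# A single self-attention layer connects every pair of positions directly.
```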

But attention, on the other hand, was this really effective mechanism that, in one step, could capture all the interactions between one word and every other word using content-based addressing. Convolutions basically match content against fixed parameters, the weights; attention matches content against content.

Based on how similar I am to my neighbors, I'm going to absorb their information. And this motif appears everywhere, even in computer vision. So maybe I can go there. In vision, there is this approach, do people here know non-local means?

In computer vision, there's an approach called non-local means that was originally developed for image denoising. If you want to denoise an image patch, you look at all your neighbors, you see which patches are very similar to you, and based on that similarity, you pull in their information.

And this largely works in images because images are very self-similar. This starts sounding like, hey, based on content, I want to pull in information. And there were similar approaches, like texture synthesis by Efros, where if you wanted to do inpainting, or you wanted to generate an image, you would look for a patch that's similar to this rectangle in some dictionary or database of patches that you have.

And based on what's closest, you bring that patch over and paste it there. So these approaches that look like attention were already prevalent; it's a very natural formulation. And the Bahdanau paper had shown that this actually works really well for language as well.

So the question then was, OK, why can't we learn representations this way? Instead of this source-target setup, why can't we learn representations by the sentence attending onto itself? Instead of a target sentence attending to a source sentence, can it just attend to itself?

And the original goal was actually parallel decoding. Attention by construction is parallelizable, because each token can construct its representation from its neighbors in parallel, right? And it directly captures token-to-token interactions. Of course, we'll run into complexities with length, and we'll discuss later how to overcome some of them.

Instead of having this linear growth in receptive field, you can directly capture these interactions, because with convolutions, if you want a very, very large receptive field, it gets computationally very expensive. And attention also has these explicit gating and multiplicative interactions, which we've often seen in, say, the gated PixelCNN or GLUs.

These explicit gated, multiplicative interactions have typically helped training and led to better accuracies. And as I mentioned, the original motivation was: we have good translation systems, but both reading and writing are sequential; can we do both in parallel?

We wanted to read the German sentence in parallel and then also write the translation in parallel. Instead of decoding autoregressively, instead of decoding in time, can you decode in depth? You spit out all the words at once, and then you iteratively refine them, right?

This turned out to be very, very challenging and hasn't been solved successfully to this day. The biggest challenge, essentially, is that whenever you're decoding, as you predict a word, you bend the probability distribution in a way that narrows down what you're going to predict later on.

And the ordering that allows you to nail down these modes was very hard to learn. Imposing a left-to-right ordering is much easier than not having one and having to learn it as you decode. So the original approaches didn't work, but we still had our salvation in being able to read in parallel.

So we said, all right, let's take this back to the encoder-decoder models. At that time, there were a few formulations of attention: the original formulation from Graves, then the additive attention formulation, and we took the dot-product attention formulation, largely because it allowed us to do attention as a matrix multiplication.

Physics is such a big constraint in neural networks that if you can make your architecture amenable to modern accelerators, you have a much better chance of succeeding.

Dot-product attention can be expressed as a matrix multiplication, and there were already kernels for doing matrix multiplication very efficiently on the GPU. So the formulation was: similar to dot-product attention, but with a scaling factor, simply because the dot products can become too big; you can work it out under certain assumptions about the mean and variance of the representations. Oh, the slide hasn't updated, actually.

Yeah, so in our formulation, if you have a position, you first project it into a query, and the representation of the same token also gets projected into a key and a value.

The query determines how much you're going to pull from all these keys. You first take a dot product of the query with every key, and then you combine, or pool, the content of all these positions based on what the score was after normalizing with a softmax.

So in some sense, you can think of self-attention as a kind of content-based pooling mechanism, right? The scaling factor saved us from the logits blowing up and training becoming unstable. And on the decoder side, you can trivially implement causality by just adding an attention mask.
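To put that whole description in one place, here's a minimal sketch in numpy; the shapes and the -1e9 masking constant are illustrative choices, not the exact code from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, causal=False):
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k)            # query-key similarity, scaled
    if causal:                                 # decoder side: position i may only
        n_q, n_k = logits.shape                # look at positions <= i
        mask = np.triu(np.ones((n_q, n_k), dtype=bool), k=1)
        logits = np.where(mask, -1e9, logits)
    weights = softmax(logits, axis=-1)         # content-based pooling weights
    return weights @ V                         # pooled values, one row per query

# Self-attention: queries, keys and values all come from the same sentence.
n, d_model = 6, 16
X = np.random.randn(n, d_model)
Wq, Wk, Wv = (np.random.randn(d_model, d_model) for _ in range(3))
out = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv, causal=True)
print(out.shape)  # (6, 16)
```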

Where this brings us, with a caveat on the flops that we'll cover later, is that now we have a mechanism that's parallelizable and gives you direct token-to-token interactions, which we believed would help you model the relationships between words better.

And the complexity of self-attention can be lower than convolutions, right? Convolutions are quadratic in the number of channels, the hidden dimension, while self-attention is quadratic in the length. So if your length is not much more than the hidden dimension, you've actually saved on flops.

Now this is not quite the complete picture, because not all flops are equal, and we'll talk about that later on. And when you put everything together, the result has a very strong similarity to the ResNet architecture, actually.

If we look at ResNets, you have contraction, you have spatial mixing with convolutions, and then you have the expansion again, right? The transformer, if you just shift things one step, is analogous: you have attention, then you have expansion and contraction. The residual connections sit in slightly different places, but it's a very similar basic building block, with residual connections and these contractions and expansions.

In the transformer, that was multi-head attention, with the expansion and contraction in the feed-forward layers. And then one challenge with attention: LSTMs can count, they can learn interesting temporal patterns, but attention is permutation-invariant, so we had to add position information so that we could learn ordering.

We add position information at the input, which gets transmitted to the other layers through the residual connections. In the original paper, we had post-layer norm, but later on we realized that as you make the model deeper, post-layer norm doesn't let you train effectively, so we moved to a pre-layer norm formulation, which mirrors what was observed in the original ResNet papers.

So the model is basically: you've got your input, you have spatial mixing through attention, then the feed-forward layers, and this repeats. The difference on the decoder side is that you also have encoder-decoder attention at every layer.

If there are any questions, yeah. >> Yes, what was the reasoning behind the post-layer norm? >> Oh, so, actually, Liz, do I have that slide? Let me check. I've probably deleted it. But if you do post-layer norm, then you are basically squashing both the residual and the additive parts.

Your activations from the lower layers keep going through layer norms. But in pre-layer norm, only the sublayer branch has a layer norm, which means your activations all the way from the bottom of the model are free. They're untouched, and they can pass straight through. Yeah.
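Here's a schematic sketch of those two arrangements in PyTorch-style code; attn and ffn stand for any shape-preserving self-attention and feed-forward modules, and details like dropout and the final LayerNorm are omitted.

```python
import torch.nn as nn

class PostLNBlock(nn.Module):            # original-paper arrangement
    def __init__(self, d, attn, ffn):
        super().__init__()
        self.attn, self.ffn = attn, ffn
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x):
        x = self.ln1(x + self.attn(x))   # the residual sum itself gets squashed
        x = self.ln2(x + self.ffn(x))
        return x

class PreLNBlock(nn.Module):             # what deeper models moved to
    def __init__(self, d, attn, ffn):
        super().__init__()
        self.attn, self.ffn = attn, ffn
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # only the branch input is normalized;
        x = x + self.ffn(self.ln2(x))    # the residual path is left untouched
        return x
```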

Yes, OK. So until this point, we haven't discussed multi-head attention, which ended up being very important. One of the problems with attention is, well, oftentimes language is about understanding who did what to whom.

So in this case, the cat licked the owner's hand. Who licked what? The cat licked the owner, right? Now if you want to combine information from these two slots, these two positions, these vectors, then the best you can do with a single attention head in a single layer is 0.5, 0.5, right?

Half probability, half probability. But then they get mushed together, right? Now imagine the strength that a convolution has. Well, OK, I think the point will still come across. What a convolution does, and in this case it's a 5 by 1,

is apply a different linear transformation at each relative position, right? And because these linear transformations are different, the first one can learn, I'm going to take a little bit of information from here, and another can learn, I'm going to take a little bit of information from there,

and then put them together, right? With attention, the best you could do is averaging, which mushes all these things together. But having different linear transformations allows you to take a part of the embedding here, a part of the embedding there, mix them up, and put them together without them interfering with each other.

And multi-head attention, which is a bit like a multi-tape Turing machine with different read-write heads, starts getting you that property back: you bring back the ability to select different parts of the input.

You chop up the hidden dimension into independent pieces, and each one of them does its own attention. So now you can have probability 1 in this subspace and probability 1 in that other subspace, instead of 0.5, 0.5, and you don't get these averaging effects.

You can actually be selective, right? And also, for computational reasons, instead of having eight attention heads of d dimensions each, we had eight attention heads of d/8 dimensions each, so we wouldn't incur any more flops; it's the same amount of flops.
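Here's a minimal numpy sketch of that head-splitting trick; the sizes are made up and a real implementation would batch this and add masking, but it shows how eight heads of d/8 dimensions cost roughly the same as one head of d dimensions while each head can put its probability mass somewhere different.

```python
import numpy as np

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads=8):
    n, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # (n, d_model) -> (n_heads, n, d_head): each head gets its own subspace
    split = lambda M: M.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    logits = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, n, n)
    logits -= logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    heads = weights @ Vh                                    # (n_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # stitch the pieces back
    return concat @ Wo

n, d_model = 6, 64
X = np.random.randn(n, d_model)
Wq, Wk, Wv, Wo = (np.random.randn(d_model, d_model) for _ in range(4))
print(multi_head_self_attention(X, Wq, Wk, Wv, Wo).shape)  # (6, 64)
```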

But that's only half the story, because the attention heads themselves turn out to be quite expensive, and later on there were improvements that needed to be made, right? And probably the most important result was that, with the transformer, we were able to outperform previous ensembled models as well.

And that was very, very exciting: hey, this single model is able to outperform previous ensembled models. This was on the WMT 2014 English-German and English-French machine translation tasks. Not only were we able to do it with fewer flops, but it was also very clear that this was a very general model: we immediately applied it to parsing, and with a small model we got excellent results.

So in some sense, this was very exciting, because it meant that, for this consolidation we're trying to go for in machine learning, we probably had a model that was more general than what we had before, and maybe we could now throw it at different problems, right?

And ultimately, why? Because it would be helpful to have a single model that's able to combine representations from speech, images, and language. If you had a general substrate that worked well on all tasks, then potentially you could get to that single multimodal model. Sometimes interpretability is like tea leaves.

It's like reading tea leaves, so one should be careful. But it was nice that attention by itself can give you some interpretability, and we were able to see how some of these attention heads, some of these attention mechanisms, were actually learning long-distance relationships.

Early on in the transformer, we saw this fairly invariant pattern where some of the attention heads basically turned out to just look like convolutions; they were just pulling in local information. There's now much more advanced work, of course, with the mechanistic interpretability stuff on grokking and the work happening at Anthropic, where they're learning how to interpret these induction heads.

So it's interesting. But we were able to see some anecdotal evidence of these heads performing very distinct and clear actions. OK, if there are any more questions, I'll pause for a second. >> Did the research find that it's the induction heads that are causing the in-context learning?

>> Yeah, it's hard to tell. I haven't looked at the most recent work, but they have made progress on this issue of superposition, is that right? With that solved, does that roughly mean that they'll now be able to assign distinguishing features to each one of these heads and be able to explain them, from what I understand?

Or for the in-context learning part, is it that they still have to show it, or are they saying that in-context learning happens because of induction heads? >> Yeah, the latter. >> It's the latter. Yeah, it's not clear, because I think there are probably many, many kinds; in-context learning has been shown to work on so many different tasks. And actually, I haven't followed this very closely.

I don't know specifically what kinds of properties the induction heads typically have. Do you know what kinds of mechanisms they have? OK, so since neither of us knows this really, really well, we won't be able to go very far here.

But I'm not sure they've gotten to the point where they can explain most of in-context learning through induction heads, from what I understand. They might have, yeah. Does anybody know about induction heads? OK. So now, over the years, there have been many papers, but only a few changes that have been important.

A few changes have stuck, and new transformers typically have these improvements, right? We'll go from bottom to top through some of them and see which ones have actually stuck. So to start: one of the biggest problems with self-attention is that self-attention itself is permutation-invariant, right?

You need to dope in position information in order for it to learn some kind of temporal structure. In the original transformer, we used these sinusoids, and we had hoped that it would learn relative position encodings, because you can decompose the position embedding of another position as a linear function of the previous one,

plus another factor that depends on the relative distance between the two. But that didn't happen. Learned position encodings in the original paper did just as well, and so we were not quite able to get the model to capture relative distances using the sinusoids.

So then came a couple of important papers, and this is a very biased sample, but I think it covers a large set of the work. There are roughly three categories, and all of them now model relative positions explicitly.

In the relative position transformer, we had an embedding for every relative distance between a pair of positions. We did a dot product of that embedding with the query, which produced a logit term that modulated according to the relative distance. And we found this to be extremely useful for translation, and I'll show it in music as well.

Another one, maybe a simplification, is the ALiBi paper, which is non-parametric; these are not learned. Instead of an embedding for every pair of positions, you have a single bias per relative distance, right? You just add a bias to the logit, and you can either learn it or use a heuristic, which ALiBi did.
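Here's a small numpy sketch of that "single bias per relative distance" idea; the slope is a per-head hyperparameter, and the actual ALiBi paper uses a specific geometric schedule of slopes and applies the bias in the causal setting, so treat this as illustrative only.

```python
import numpy as np

def relative_distance_bias(n, slope=0.5):
    pos = np.arange(n)
    rel = pos[None, :] - pos[:, None]      # j - i, the relative offset
    return -slope * np.abs(rel)            # one bias per distance: penalize far tokens

def attention_with_relative_bias(Q, K, V, slope=0.5):
    n, d_k = Q.shape
    logits = Q @ K.T / np.sqrt(d_k) + relative_distance_bias(n, slope)
    logits -= logits.max(axis=-1, keepdims=True)
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

n, d = 8, 16
Q, K, V = (np.random.randn(n, d) for _ in range(3))
print(attention_with_relative_bias(Q, K, V).shape)  # (8, 16)
```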

One other advantage of relative position encodings is that they can potentially allow you to extrapolate to longer sequence lengths, which you couldn't do with absolute position encodings. I'm curious what the room thinks here, but I believe the latest in relative position encodings, I believe it's called RoFormer, the rotary position embeddings, basically just rotates every pair of dimensions of the embedding a little bit.

The angle of rotation depends on your absolute position. But what ends up happening is, when you do the attention operation, you get an effect where you're modulating the logit based on relative distance. What's remarkable about this approach is that it combines the best of both worlds, right?

Relative position encodings had a couple of challenges, in that you had to maintain an extra logit, or an embedding, for every pair, so they ended up increasing your memory. Here, these are actually absolute position encodings, but they end up giving you the relative modulation in the attention operation that you wanted.

And I believe the consensus is that this is the most successful position encoding. Is that correct, or are there others? Is that the consensus? OK. So I would say that these relative rotations, the approach in RoFormer, are basically a genuine new improvement that is now going to stay with the transformer.
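A minimal numpy sketch of that rotary idea, to make the "absolute rotation, relative effect" point concrete; the pairing of dimensions follows the rotate-half convention and the base of 10000 is just the usual choice, so consider this illustrative rather than the reference implementation.

```python
import numpy as np

def rotary(x, base=10000.0):
    """x: (n_positions, d) with d even; rotate each (x1, x2) pair by pos * freq."""
    n, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)     # one frequency per pair
    angles = np.outer(np.arange(n), freqs)        # (n, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

# Rotating both q and k makes their dot product depend only on the offset:
d = 8
q, k = np.random.randn(1, d), np.random.randn(1, d)

def score(pos_q, pos_k):
    Q = rotary(np.repeat(q, pos_q + 1, axis=0))[pos_q:pos_q + 1]
    K = rotary(np.repeat(k, pos_k + 1, axis=0))[pos_k:pos_k + 1]
    return (Q @ K.T).item()

print(np.isclose(score(3, 1), score(13, 11)))     # True: only the offset matters
```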

And it has all the great properties you would want: it's an absolute position encoding that gives you relative effects, which is what we originally wanted. And I want to emphasize two things. One, modeling interesting temporal relationships, which are really important in music, requires a good position representation.

We actually found significant improvements in the Music Transformer. Is it possible to play this? OK. So here's a priming sequence. This is work by Anna Huang, by the way. This is in-context learning in music, because the model sees this prompt and you ask it to complete it.

OK. Now this is the vanilla transformer, and we tried both learned and sinusoidal position encodings. You can see that it starts off peppy and happy, but then just sort of languishes into something really sad and confused, right?

It's not able to capture these motifs. Music has interesting motifs at different levels, because there's some repetition locally, but there's repetition across the entire piece as well. Now here, this is with the relative transformer, with the first approach where we had relative embeddings.

And we had to develop a compute-efficient approach, using some matrix calisthenics to put the logits in the right place. You can read the paper; it's fun. So here's the same priming sequence, and let's hear the completion.

So Anna, who is the first author of this paper and also a musician, tells me this actually captures a lot of structure in music. It sounds nicer than the previous one, but that depends on people's tastes; maybe some avant-garde jazz fan would prefer the first piece.

But the point here is that the difference between not working and working is pretty clear. And it'd be fun to try this out with the new rotary position encodings. All right. So, walking up the stack, we now have a better mechanism than we originally had for modeling relative distances.

And there are advancements on top of rotary position encodings where, when you encounter longer sequences, you can just adjust the base frequencies and the model won't degrade. So that has good properties. Then there have been several important contributions to the attention piece itself, which is the primary workhorse here.

You can think of it either as the induction heads that are learning how to copy, or maybe all it's really doing is routing information so that the giant feed-forward layers can learn the important features. But there are broadly two classes of issues with the attention mechanism.

One that was brought up today, and is very evident, is long context itself. The complexity, as we remember, is quadratic in the length of the sequence. Once your sequences get very, very long, one problem is that it becomes computationally expensive.

But it's also that the logits become infeasible to store, right? There are generally a few groups of papers. One is restricting attention windows. We did this for images, with local 1D and 2D attention. In the first one, we just rasterized the image

and used local 1D attention, which is very similar to the sliding-window attention in the recent Mistral paper. In the 2D case, we had spatial 2D attention. Then there were these sparse versions with specific patterns where, over many layers, you can think about it as: if you have these sparse matrices, how many of them do you have to multiply together until you get a really dense matrix, right?
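A small sketch of the local, sliding-window restriction mentioned above, with a causal option; the window size is arbitrary here, and the trailing comment is exactly the sparse-matrix-product point: stacking such layers lets distant positions communicate indirectly.

```python
import numpy as np

def sliding_window_mask(n, window, causal=True):
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    allowed = np.abs(i - j) <= window   # only attend within +/- window positions
    if causal:
        allowed &= j <= i               # decoder: no peeking at the future
    return allowed                      # True where attention is permitted

print(sliding_window_mask(6, window=2, causal=True).astype(int))
# Cost per layer is O(n * window) instead of O(n^2); stacked layers still let
# information flow between distant tokens, like a growing receptive field.
```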

So roughly, it turns out that, is that for me? No, OK. You can get connectivity between distant pixels, or distant notes in a musical tune, or distant words, pretty quickly. And then there's a second group, where there hasn't been enough work

and there are some challenges: the unstructured sparse attention approaches. At a higher level, what they're really trying to do is this: imagine I walked up to you and told you, hey, these are the bunches of tokens that have very high inter-similarity.

They're likely to attend to each other. How quickly can I approximate that without having to do the whole computation, right? Two approaches. In the routing transformer, you use vector quantization, and in the other one, I forget the name of the paper,

they used LSH. In the routing transformer, most layers were actually local. The final layers, which typically are the ones that end up modeling these long-distance relationships, were the ones that used this kind of content-based unstructured sparse attention. And the results were generally better.

And it's also interesting that maybe we can build models on very long sequences where most layers are fairly local and only a few layers do these long-distance attentions. Now, one of the bigger challenges there is that even though you end up eliminating a lot of the flops you would do with full attention, the problem always ends up being memory movement.

It always ends up being memory movement. There's still more innovation to be done here, and with memory bandwidth improving, maybe some of these approaches become more feasible today than they were when we wrote these papers. But it's an interesting approach, where you're essentially trying to approximate the original attention matrix.

>> Sorry, this is kind of a silly thing, but a clarification: how is this unstructured sparse attention scheme very different from just convolutions that are sparse, in the sense that you're losing a lot of the long-distance or unrelated context between arbitrary pairs of elements? >> Right. So I would say that it is similar to a convolution in that sense.

But if you did this perfectly, then whatever you didn't attend to would have had very little attention weight anyway. You're essentially trying to guess, as well as you can, what would have attended to what, so it uses content-based unstructured sparsity. And there's probably more interesting work to be done there.

Maybe instead of doing it a token at a time, where you end up doing a lot of memory movement, you decide which chunks want to attend to which chunks, and then you move entire chunks at a time. So I think there are some interesting directions here.

And frankly, the ones that ended up sticking are the simplest ones, because structured sparsity is easy to optimize on modern accelerators. So again, you should make physics your friend. Typically, local attention, or sliding-window attention, we're still seeing it appear often and do well.

These other, really wild but very expressive unstructured sparse attention approaches typically haven't quite succeeded. There are, of course, the linear attention variants, which I don't think are in any of today's architectures. And there were other approaches that say, hey, instead of doing n squared d, you learn k new embeddings, so you do n·k·d and then n·d·k; you basically factor it, just like a low-rank matrix factorization.
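A minimal sketch of that low-rank idea in numpy; Linformer is one concrete published version of it, and the length-reduction matrix E here is a made-up learned parameter just to show where the n·k·d matmuls come from.

```python
import numpy as np

def low_rank_attention(Q, K, V, E):
    """Q, K, V: (n, d); E: (k, n), a learned projection of the length dimension."""
    d = Q.shape[-1]
    K_small, V_small = E @ K, E @ V          # (k, d): k learned summaries of keys/values
    logits = Q @ K_small.T / np.sqrt(d)      # (n, k)  -> n*k*d flops instead of n*n*d
    logits -= logits.max(axis=-1, keepdims=True)
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V_small                       # (n, d)  -> another n*k*d flops

n, d, k = 1024, 64, 32
Q, K, V = (np.random.randn(n, d) for _ in range(3))
E = np.random.randn(k, n) / np.sqrt(n)
print(low_rank_attention(Q, K, V, E).shape)  # (1024, 64)
```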

One other approach that's interesting, and that I would like to investigate myself, is that we're seeing retrieval being used as a tool in general. So why don't you just pretend that your memories themselves were documents and use retrieval as a tool there?

The memorizing transformer essentially does a mix of local attention and retrieval from very, very long memories. And they find that you don't need to train the model from scratch; all you need to do is adapt it with this approach on some small amount of data, and you're able to learn a good retrieval mechanism.

I think it's quite interesting. It still comes down to this content-based decision of what I should attend to, but I like the fact that it makes retrieval a tool that you can use either on your own memories or on documents. It's a nice, general way of looking at things.

OK, so now the second piece, where you run into the issue that not all flops are equal, right? If you look at the memory hierarchy, a lot of your activations are stored in the GPU's HBM, which today, on the H100, is about 80 gigabytes.

The H100 has 80 gigabytes and the A100 has 40 gigabytes, right? So it's a limited amount of high-bandwidth memory. And you first have to go from high-bandwidth memory to SRAM, and then to the compute elements and back, right? Every single time. If you're interested, look up roofline analysis.

Roofline analysis gives you a nice picture, for any device, of where your workload or operation needs to be so that you can effectively utilize the compute. You want to be compute-bound, because ultimately, if you don't calculate representations, you're not going to get any output.

But if you spend a lot of time moving things around and relatively less time calculating, then you're kind of wasting effort, right? So if you look at the standard attention mechanism, one of the issues is this: imagine you have your queries, keys, and values all in memory.

The standard approach would be: you move them from HBM, you do the calculations, you compute the logits, you move the logits back into HBM. Then you compute the softmax and write it back into HBM. And then you load the probabilities and the values to finally compute the outputs, right?

So the arithmetic intensity, or operational intensity, which is the number of flops you do per byte moved, is lower for attention, even though it's fewer flops than, say, a one-by-one convolution, because it typically has more memory movement, whereas one-by-one convolutions have less memory movement.

You just move the weights, move the activations, do the calculations, and bring them back, right? The same goes for convolutions, which have a very high arithmetic intensity. It's not that you just want the highest arithmetic or operational intensity, because you still want to have useful parameters, right?
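A rough back-of-the-envelope sketch of that flops-per-byte comparison; the traffic counts ignore the softmax flops, caching, and many other details, and the fp16 byte size and shapes are assumptions, but it shows why a naive attention pass sits much lower on the roofline than a big dense matmul.

```python
def matmul_intensity(m, k, n, bytes_per=2):
    flops = 2 * m * k * n
    traffic = bytes_per * (m * k + k * n + m * n)     # read A and B, write C
    return flops / traffic

def naive_attention_intensity(n, d, bytes_per=2):
    flops = 2 * n * n * d * 2                         # QK^T and PV matmuls
    traffic = bytes_per * (4 * n * d                  # read Q, K, V, write output
                           + 4 * n * n)               # write/read logits and probs via HBM
    return flops / traffic

print(f"dense matmul 4096^3:           {matmul_intensity(4096, 4096, 4096):.0f} flops/byte")
print(f"naive attention n=4096, d=128: {naive_attention_intensity(4096, 128):.0f} flops/byte")
```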

So it's a trade-off. There have been a bunch of improvements that will stick, they're almost certain to stay, that try to combat this issue both at training time, because your logits can get really big, and at inference time with your KV cache. When you're doing inference, you have a single query,

but you have to maintain your KV cache, your keys and values, which can grow quite a bit, and you have to move that around. So the first idea is simple: let's just decrease the activation memory. That's the multi-query approach, where you keep multiple query heads but reduce the number of key-value heads to just one.

So you have just one key head and one value head. That does reduce your expressivity. Grouped-query attention is a simple balance that says, hey, let's not go to that extreme; let's group the queries, so a bunch of queries attend to the same keys and values.

Another point to note here is that all of this is relative, because, well, a third approach to not worrying about your attention is to just make the model really big.

Then your feed-forward computations dominate, and your attention computation is just a small slice of that, so you don't worry about it, right? Typically, with these larger models, even though grouped-query attention has more activation memory than multi-query, it's still a smaller proportion of what you're doing in the feed-forward layers, right?
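To make the KV-cache point concrete, here's a toy calculation; the sequence length, layer count, head dimension, and fp16 precision are all made-up numbers, purely to show how the cache shrinks as the number of key-value heads goes down.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, d_head, bytes_per=2):
    # keys + values, for every layer and every cached position
    return 2 * seq_len * n_layers * n_kv_heads * d_head * bytes_per

cfg = dict(seq_len=32_768, n_layers=32, d_head=128)
print("MHA (32 kv heads):", kv_cache_bytes(n_kv_heads=32, **cfg) / 2**30, "GiB")
print("GQA ( 8 kv heads):", kv_cache_bytes(n_kv_heads=8,  **cfg) / 2**30, "GiB")
print("MQA ( 1 kv head ):", kv_cache_bytes(n_kv_heads=1,  **cfg) / 2**30, "GiB")
```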

So I guess three things: ignore it by making the model really big; second, even with very long context, you can do some of the approaches we talked about; but then you also have these systems optimizations, which are pretty cool. The softmax has an interesting property: you can compute it in an online fashion.

You can compute it incrementally. If you've got a bunch of logits streaming in, and you have a partial softmax and a new logit comes in, you can update it online, right? What does that mean? It means that you never need to write the logits or the probabilities into HBM.

So you save a lot, right? With an extremely long sequence, you would otherwise end up writing a lot, so you save on that. The first paper, which was on TPUs, introduced, or took advantage of, this property of being able to compute the softmax in an online fashion.

And the second paper, which is flash attention today, and they've had many advancements since, added systems-level optimizations for GPUs so that you can handle very, very long sequences, by not moving the logits back into HBM, using this online property, and writing kernels that make good use of the SRAM on the GPU.
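Here's a small numpy sketch of that online-softmax identity; the blocking into four chunks is arbitrary, and a real fused kernel does this per tile in SRAM, but the algebra is the same: keep a running max, a running normalizer, and a running weighted sum, and you never need to materialize all the logits.

```python
import numpy as np

def streaming_softmax_weighted_sum(logit_blocks, value_blocks):
    m = -np.inf      # running max of the logits seen so far
    z = 0.0          # running sum of exp(logit - m)
    acc = 0.0        # running sum of exp(logit - m) * value
    for logits, values in zip(logit_blocks, value_blocks):
        new_max = max(m, logits.max())
        scale = np.exp(m - new_max)          # rescale the old statistics
        e = np.exp(logits - new_max)
        z = z * scale + e.sum()
        acc = acc * scale + e @ values
        m = new_max
    return acc / z

logits = np.random.randn(12)
values = np.random.randn(12)
blocked = streaming_softmax_weighted_sum(np.split(logits, 4), np.split(values, 4))
p = np.exp(logits - logits.max()); p /= p.sum()
print(np.allclose(blocked, p @ values))      # True: same result, computed in chunks
```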

Any questions? What's the time? We have basically 20 minutes; I'll finish in 10. So I just covered these two. There are many other important improvements. We talked about pre- versus post-layer norm. There have been some changes to the feed-forward layers themselves.

You can stare at the feed-forward layers; I mean, if you stare at anything long enough, everything becomes attention. But it's true that in the feed-forward case, if you look at it, you can think of them as something that looks like attention, and there was a paper that turned those into memories.

It was originally by Facebook; I forget exactly which one it was. But it didn't stick, and the feed-forward layers have just stayed; we typically haven't seen a lot of improvements on them. There have been some efforts on higher-order attention. Attention, if you think about it, is a third-order interaction:

you have queries, keys, and values. But you could imagine fourth-order interactions where you're computing logits of pairs of things against all pairs of things, right? These are higher-order interactions where you can include more complicated geometries in your attention computation.

Maybe that's important for, say, biology, but it hasn't been explored much. What has actually worked, and is likely to stay, are some approaches to parallel decoding. Not quite the original non-autoregressive aspirations that we had, but speculative decoding, where the heuristic is pretty simple.

Instead of generating from a heavy model, you generate from a really light model that captures the diversity, and then you score with the heavy model and re-rank the list. That ends up working quite well, and most production deployments likely use speculative decoding. OK.
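Here's a toy sketch of that draft-then-verify control flow (greedy variant), with made-up stand-in models; real speculative decoding uses a rejection-sampling rule so the accepted tokens exactly match the large model's distribution, and the big model's scores for all the drafted positions come from a single parallel forward pass.

```python
def speculative_step(prefix, draft_next, target_all, k=4):
    """draft_next(seq) -> next token from the cheap model (called k times).
    target_all(seq) -> the heavy model's next-token choice after every prefix of
    seq, which a transformer produces in one forward pass."""
    proposed = []
    for _ in range(k):                                # cheap model drafts k tokens
        proposed.append(draft_next(list(prefix) + proposed))
    verdicts = target_all(list(prefix) + proposed)    # one heavy-model pass scores them all
    accepted = []
    for i, tok in enumerate(proposed):
        big_choice = verdicts[len(prefix) + i - 1]    # heavy model's pick for the same slot
        if big_choice != tok:
            accepted.append(big_choice)               # first disagreement: keep the heavy
            break                                     # model's token and stop
        accepted.append(tok)
    return accepted

# Trivial stand-ins, just to show the control flow runs end to end:
draft_next = lambda seq: (len(seq) * 2) % 7
target_all = lambda seq: [(i * 2 + 2) % 7 for i in range(len(seq))]
print(speculative_step([3, 1], draft_next, target_all, k=4))  # [4, 6, 1, 3]
```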

So now, switching gears: we started by quoting the Dartmouth conference, where they wanted to build a single machine. The question now is, with large language models that are eating up most of the internet, are we getting there? We're finally seeing self-supervised learning work at an unprecedented scale, where, by digesting carefully curated, colossal amounts of text with very, very large models, they're able to perform, presumably, or it's still waiting to be confirmed, at least a broad variety of tasks just by specifying them in the prompt.

It's almost like you now have a new computer. And for people who are really excited about the future of agents, now they have several agents that they can program with the same computer, which then coordinate to solve problems.

So we're getting much closer to the single model: not quite specifying all the rules of intelligence, but at least learning the rules from data. We're much closer than we were before. Now, this doesn't include all the important specialization that has to happen afterwards, like RLHF, or the alignment you have to do to make a model more steerable.

And as it stands today, the scaling laws that the transformer exhibits are better than any other existing model's, right? There's an interesting question of whether we can build a better model, and there are efforts; from Stanford, from Chris Ré's lab, there have been a couple of efforts.

There's been some revival of RNNs. But the only thing I'll say is that the attention operation itself, this operation of moving information around, routing information based on content, is very, very useful. And it's maybe not a surprise that this general spatial mixing plus upsampling and downsampling architecture has stayed, both in computer vision and in language, now with the transformer.

So there are some invariants that are likely to stay, but there is certainly much more room to improve, not just in the architecture but on data itself; there are probably 2x improvements to be had on data. And I wouldn't say that there aren't architectures in the future that will get better scaling laws.

They might, but there are properties of the transformer, such as self-attention and its general structure, that we're likely to see in future architectures. Also, if somebody really, really wanted to study large-scale modern transformers, you'd have to study all-reduces, InfiniBand, RoCE, how they get congestion, and these very, very large clusters.

The transformer is now, in some sense, a data center, because it's now split up: these large transformers run on potentially tens of thousands of GPUs. So you have to focus on several parts, the infrastructure and the model itself.

But what's really interesting, I think, is-- you know, I was just thinking of the smallest model that has exhibited emergent phenomena. Well, so we certainly know that GPT-4, which is likely-- I don't know if you're allowed to say it's some big-- like, trillion parameters. Yeah, I think you're allowed to say it, yeah.

So it's a trillion-parameter size model. That's what everybody says. Size model. And then you have Brocking, which is a two-layer transformer that has this weird emergent behavior that, when you just keep training it on just-- on some amount of data, suddenly it just exhibits a space shift, right? So we're lucky.

There's weirdness everywhere: weirdness in small models and in large models. And maybe we can learn something about large models by studying these small models, one would hope. But it's funny; there are still unexplained phenomena in very, very large models and in very, very small models.

But large transformers are no longer just, you know, a single piece of code. I mean, they could still be, but there's so much you have to keep in your stack in order to really optimize the entire model. Of course, some of the very exciting directions are LLMs using tools.

Yeah, so now language models, or transformers, are actually starting to use external entities; they're connecting with the rest of the world. And I guess that's a good pitch: it makes a lot of sense to actually build products today, because if you want to get to the next tranche of capabilities, where will they come from? It's through interactions.

And likely, with a lot of usage, you will learn much more about how to guide these models and how to train them than you would in a vacuum. Now, you can definitely still do very, very important work with a smaller model, or even without building a product, because there are so many important unsolved problems.

And maybe you shouldn't even work on the transformer, because it's like Burning Man right now; everybody's going to the same party. But I think you will be able to build new capabilities with this human-machine collaboration. Of course, teaching models, models being able to express what they don't know, and learning new skills at inference time are all important; there's some interesting work on Minecraft, I think, that showed some evidence this also matters for agents.

And another great property that some of these diffusion models have is that the more compute you spend, the better, potentially, the quality of the image gets. We don't quite have that for language. And what does that mean? Today, the models with the most proficient reasoning and planning are also the largest ones.

Can we separate that out? Can we have smaller models that do some adaptive thinking and are able to match the capabilities of potentially larger models in reasoning and planning? Maybe the answer will come from connecting to external planners, or maybe with better representations of the data you can actually reason better on it.

Also, this is again more of a systems piece, but it's fascinating how few bits you can actually use and still get something useful out. We already went from the original transformer, which was trained in 32-bit precision, to BFLOAT16.

And now there are good signs that INT8 and FP8 will also work, and I think there's useful work to be done there. Again, it goes back to the same argument: if you're using fewer bits to represent a number, you're transmitting fewer bytes from HBM.
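As a rough back-of-the-envelope sketch of that point; the layer dimensions below are made-up illustrative numbers, not any specific model's.

```python
# Bytes read from HBM for one weight matrix at different precisions.
bytes_per_param = {"fp32": 4, "bf16": 2, "fp8/int8": 1}

d_model, d_ff = 4096, 16384              # one hypothetical feed-forward weight matrix
n_params = d_model * d_ff

for fmt, nbytes in bytes_per_param.items():
    gib = n_params * nbytes / 2**30
    print(f"{fmt:8s}: {gib:.3f} GiB moved per read of this matrix")
# Fewer bits per number -> less HBM traffic per matmul, so memory-bound steps
# get faster and the matrix units sit idle less often.
```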

So you can actually get faster; you can utilize your matrix multipliers much more effectively. That was it. There are many topics, but hopefully we covered something fun. Thank you. >> Could you talk about what you're working on now? >> Yeah. So I'm a co-founder of a startup with my transformer co-author, Niki.

We're working on building models that will ultimately automate workflows, and we're starting with data. It's very puzzling what happens in a company. Companies are basically just masses of dark knowledge, right? And there are very few people who have both the technical privilege and the understanding to ask questions, typically analysts.

But the less you understand, the less effective your company can be. So how can you eventually help anyone become an effective analyst, in some sense? Help them ask the right questions, help them eventually figure out the whys, which then requires some kind of counterfactual reasoning that's very complicated.

But we start with data, since it's so important and companies are essentially drowning in it, and then spread out from there and try to automate other workflows. And we believe, from some of the early signs we're seeing, that this is going to require a full-stack approach.

So not just building the model, but the product too, because then you can control what feedback you get. If you have a gap in the model, you ask for that feedback, you start to get it, and then you can improve the model. That's what we're doing. Please come talk to us afterwards.

Yes? >> I'm surprised to hear that you're fairly bullish about tools in the end. We talked about, in the beginning, how your motivation was that transformers enabled us to get rid of pipelines, but I feel like tools pull us toward pipelines again.

So I'm surprised at this. Can you talk about that, and where do you think it's going to go? >> Right. So, until we get to the point where it's transformers all the way down, like turtles all the way down... No, I think tools just allow you to... it's kind of like, how do you interface with a machine that can think, right?

You have to build some kind of interface. And if you build useful functionality, you want the machine to be able to take your functionality and do generally useful things with it, right? I think using tools is just a way of leveraging the things people have built, the software that's out there.

Certain tools will probably get absorbed into the model, right? Some others won't. And there are certain things that transformers shouldn't even do. I mean, you don't want to spend a billion flops per position just to do arithmetic on two numbers, right?

You don't want to spend a billion flops on an operation that needs only a handful, right? So there are certain things the model should not do; it should use external tools. And there's a certain kind of thinking the model should do. So even from a capability perspective, there's an important question of which capabilities should live inside the neural network, right?
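As a toy illustration of that division of labor, here is a hypothetical sketch of routing arithmetic to an external tool instead of the model. The `calculator`, the `model_generate` stub, and the routing rule are all made up for illustration; real tool-use systems typically let the model itself emit the tool call.

```python
# Hypothetical sketch: send cheap, exact sub-tasks to a tool, everything else
# to the model. Not a real system's API.
import re

def calculator(expression: str) -> str:
    # A cheap exact tool for arithmetic the model shouldn't burn flops on.
    # eval is used only for this sketch, with builtins disabled.
    return str(eval(expression, {"__builtins__": {}}, {}))

def model_generate(prompt: str) -> str:
    return f"<model answer to: {prompt!r}>"    # stand-in for an LLM call

def answer(prompt: str) -> str:
    match = re.fullmatch(r"\s*([\d\s\+\-\*/\(\)\.]+)\s*", prompt)
    if match:                                  # pure arithmetic -> external tool
        return calculator(match.group(1))
    return model_generate(prompt)              # everything else -> the model

print(answer("123456 * 789"))                  # handled by the calculator
print(answer("Summarize the lecture"))         # handled by the model
```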

But then there's also being able to utilize the work others have done, the software other people have built. Yeah. >> Can you talk more about why the original approach of decoding in parallel and then iteratively refining it didn't work, and what you learned? >> Yeah, so sometimes, if you knew exactly why things don't work, maybe you could make them work.

But it ended up being... so, take a silly task like sorting: if somebody walks up to you with a sequence, there are two modes, because you can sort it ascending or descending. So how do I say this? Typically, when you decode, imagine that when you give a prompt, there are many possible continuations, right?

And each time you make a choice, you narrow that space, and each time you make another choice, you narrow it further, right? You've learned to narrow the set of all possible paths, in a sense. And the model doesn't have to decide the order in which to generate; it's fixed, left to right.

When you're doing this non-autoregressive generation, you have to do both, and learning both simultaneously is hard. But eventually, I think this is probably true: if an oracle walked up to me and said, this is the order in which these sentences should be generated...

First you should generate these three words, then these other two, then these other two. If somebody gave you that oracle ordering for all of human language, I think you would have a much better chance, and you could actually get this non-autoregressive generation to work.

So one thing was basically the ordering itself. And I think it kind of has to be that, because the ordering helps you lock down the modes; it narrows down what you're going to generate next. So ultimately, I think it does boil down to: what's the right non-autoregressive ordering?

And that could mean you're still generating one word at a time, just not in the standard left-to-right order, or you're generating a few words, and then, based on those, you're generating the next few. The words you generate all at once should be conditionally independent of each other, right? What you've generated so far should have completely explained them.
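The sorting example from a moment ago can be made concrete with a toy calculation. The per-position probabilities below are hand-written to stand in for a conditionally independent decoder that has learned both modes equally well; they are not the output of any trained model.

```python
# Toy illustration of why conditionally independent decoding mixes modes.
# Suppose the data contains both ascending (1 2 3) and descending (3 2 1)
# answers for the same input, with equal probability.
import itertools

targets = [("1", "2", "3"), ("3", "2", "1")]   # the two valid modes

# Per-position marginals an independence-assuming decoder would learn:
marginals = [{"1": 0.5, "3": 0.5}, {"2": 1.0}, {"1": 0.5, "3": 0.5}]

# Decoding each position independently puts mass on invalid mixtures:
for combo in itertools.product(*[m.keys() for m in marginals]):
    prob = 1.0
    for pos, tok in enumerate(combo):
        prob *= marginals[pos][tok]
    label = "valid" if combo in targets else "mode-mixed"
    print(" ".join(combo), f"p={prob:.2f}", label)
# Half the probability mass lands on "1 2 1" and "3 2 3", which are not valid
# outputs. An autoregressive decoder, or an oracle ordering, commits to one
# mode and avoids this.
```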

And then what you generate after that should again be conditionally independent given everything so far, right? So how do you learn these conditional independences? If somebody walked up to me and gave them to me, I think we could probably learn the rest. Yeah. Christian? >> Yeah, I think his view is more that just scaling language models doesn't help them actually learn how the real world works,

and that we need a good grounding in truth and the real world. Do you agree with him? >> So yeah, I think it's interesting. The argument is that you can't learn a world model with just language, right? But some of these models are not exactly being trained that way.

You're doing RLHF, you're getting some feedback, which means the models are being modified toward some preference, right? So it's not just a pure language model. But it's interesting: you've seen some of the work where robotics is now potentially starting to flourish because they're able to use these large models as planners, right?

And so I think it's surprising how much information about the world these models carry. And if I understand correctly, SayCan basically used a language model as the planner, right? And then they left the rest to standard perception and the classical parts of solving the robotics problem.

So, while Yann is probably still right, the usefulness is evident in anything that needs world knowledge, right? I think you can do a lot with what you have; we still haven't quite extracted all the usefulness out of these models.

He might be right about some things, but there's still a lot more to be gained. Yeah. >> So, similar to the previous question, where you were also talking about emergence: I'm curious what your thoughts are on generalization and emergence, especially... I know there was a paper from DeepMind about this. >> Yeah, I think so, yeah.

>> Like, the claim is they can't really generalize outside of what they've been trained on. And especially because these large models are now trained on everything, is there truly anything left that's out of distribution that you could really benchmark them on? >> So, I have been caught saying that if I had all my test data in my training data, I'd make a billion dollars.

So I don't have a problem with it. But still... OK, correct me if I'm wrong, but the general argument is that these models have learned such a vast set of distributions and phenomena that, typically, when you interrogate them, they're often very cleverly blending or bringing together information from what they've learned, right?

It might be, yes. And then there are these algorithmic tasks where the models fail to generalize, right? So I'll focus on the former. I think that's an incredibly useful property. Maybe the feeling is that we actually don't quite understand how much is even represented in text.

And second, how far we could go if we were able to blend information from different sources. Certainly, a write-up of this lecture in the rhyme, meter, and words of Chaucer doesn't exist, because nobody did it, right? But I think the model could do it, right?

Now, is that blending information from what you already have? If so, that's an incredible skill, right? Yeah, I haven't read that paper; it's very recent, but I believe the work. Still, I think there's a surprising amount of seemingly new things you can do just by blending information from what you've already learned.

And yeah, it largely probably has to do with the fact that there's so much of it. Yeah. >> So, you had a question? We have two questions to go; I had an ordering in mind and then came back to you, sorry. >> Give me a second. I was wondering if you might have insights into connecting different agents, transformers, or whatnot.

I feel like the transformer is essentially a great way of connecting neurons in a specific way, and it's awesome, right? So you've figured out the best way to connect them so far. >> The agents? >> No, the neurons. >> Oh, the neurons. You're talking about... do I somehow know how to do this in the brain?

>> No, the neurons in the transformer, right? The transformer is the way you connect different pieces together, and when you connect them together, it works. I was wondering if you have some insights into building systems that can actually perform best together. >> Yeah. I like to make this joke that the best agents are actually just the neurons, because they can communicate with each other.

They can update themselves really, really well based on what the other agents are doing. So, what is the fundamental problem... I'm trying to understand what the fundamental issues are in making a bunch of systems work together, if that's what you're asking.

One is goal decomposition, right? The second big one is coordination, and the third one is verification. If you can successfully decompose the goals based on your estimate of the skills of these agents, verify what they've done, and coordinate them, then I think you could make a lot of progress, right?

So, while I didn't quite answer your question, I don't know in general how much progress we've made in all three of these areas. But does somebody have any input here? >> If you have something and you want to verify it, you have to verify everything and make it scale.

You can see how this is almost everything: making it efficient over time, and figuring out how you break the task up. >> Yeah, right. But I think these are, to some degree, the same problems. What's on your mind? >> It's actually one question. So, the human brain is very modular.

Is modularity, like emergent phenomena, something where you need some special structure to make it happen? >> Yeah, and by modularity here, do you mean that vision has this responsibility and another part has that responsibility, or do you mean the composition itself, the construction, is different? What do you mean by that?

Because you could have both, right? You could argue that the responsibility is diffused across the model, and it's mixture of experts that tries to go in the opposite direction, which I should probably mention; that's another really exciting direction, which has certainly happened in a few of these models, and it's going to stick.

I totally missed it earlier. It tries to get at specialization, right? So maybe that is some kind of modularity, learned modularity. The rest of the responsibility for performing the task is likely distributed. But if you're asking about subsystems that are themselves of different composition, then you get back to... I know this was a goal of the Pathways project at Google, where you wanted these really modular systems to communicate with each other.

And I think it's just hard to get gradient descent through that. In fact, sometimes I think we should build architectures that deserve gradient descent, and I feel like if you can learn it with gradient descent, that's very useful. Maybe it's actually possible to make these modular systems work; we have some of it with mixture of experts, and I imagine some of the problems we discussed before apply.

Does that make sense? >> Sorry, circling back to about seven questions ago: you mentioned that one of the problems with decoding everything at once is that it assumes the outputs are conditionally independent. But aren't they, in the sense that if you're given a latent space as your prior, then your posterior outputs should be conditionally independent of each other, right?

>> So, great point. And where do you get the latent space from? >> Well, from the encoder, or whatever, at the beginning. >> Right, but there might be quite a few ways to translate something, right? So there are multiple... if there's only one mode, then yeah, it probably works. But if there are multiple modes... well, actually, there are two things.

How much does the latent space actually carry? That's an important thing to ask, right? Because it's not just one latent vector that you're transmitting; you're doing attention again and again. But we took this approach; we did precisely this. We autoregressively generated tokens in a new vocabulary using vector quantization.
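For reference, the discretization step in that kind of approach is essentially nearest-neighbor assignment against a codebook. A minimal sketch, with a random codebook standing in for one learned jointly with the model; the sizes are illustrative.

```python
# Minimal sketch of the vector-quantization step: map each continuous latent
# vector to the index of its nearest codebook entry, giving discrete tokens
# in a "new vocabulary".
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))        # 512 discrete codes, 64-dim each
latents = rng.normal(size=(10, 64))          # encoder outputs for 10 positions

# Squared L2 distance from every latent to every code, then argmin per latent.
d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
codes = d2.argmin(axis=1)                    # shape (10,): the discrete latent sequence

# An autoregressive prior is trained over `codes`; the output tokens are then
# generated conditionally independently given the quantized latents.
print(codes)
```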

So the conditional dependence was modeled in a latent space, which we discretized using vector quantization, and then, based on that, we generated everything conditionally independently. And that didn't work, at least not for translation. There were some funky issues there: the sequence of latent vectors was not effective if you learned directly on the original data.

You had to do something like distillation, because distillation itself throws away, potentially, some of the modes, so you end up training on lower-entropy data. The second piece was that, for practical systems, you have to make the whole thing really, really fast. So this was a good research exercise.

But ultimately it didn't have the right practical impact, because, practically, with what we have right now, it didn't work out better than something like speculative decoding. Yeah, exactly. But you're right: I think if you can generate a good, sufficient latent, then yes, you can assume everything is conditionally independent given it.

Yeah. And we managed to do that a bit, but it wasn't quite good enough, I think. >> I guess this is the last question now? Or are we already done? >> [Question off-mic.] >> Oh, wow, that's too personal. And I have friends there; they're all really great, they're doing terrible things. I think we'll be surprised by how much there is to do.

So first, the motivation: there's an entire new bucket, a new tranche, of capabilities that you will get with human-computer interaction. You can make a product, people use it, they give you feedback, models get smarter, and this closed-loop system can really advance models.

And then bring value, right? That's one. And I think deep learning benefits so much from a diversity of ideas and from people pursuing important directions, and I would say the same about building products and companies, or building companies that are building new kinds of products with these models, right?

So I would say there's so much surface area that we could build something incredible; that's the second piece. Third... yeah, maybe that's the more personal reason, which I won't go into. Yeah.