Stanford CS25: V3 I How I Learned to Stop Worrying and Love the Transformer

00:00:10.860 | 
There's a Stanford, that's a Stanford location. 00:00:23.740 | 
>> That's right, yeah, and then what does the association to Stanford get? 00:00:30.300 | 
>> I believe this is McCarthy, yeah, who started SAIL, if I understand 00:00:35.580 | 
correctly, is that right, he started SAIL? 00:00:38.900 | 
Yeah, I think he did, but anyways, so what's interesting is, 00:00:43.900 | 
so it's amusing to actually look at what they wrote in their, 00:00:48.780 | 
I don't know, is it brochure or what they wrote in their goals, right? 00:00:57.700 | 
Okay, so the study is to proceed on the basis of the conjecture that every aspect 00:01:02.460 | 
of learning or any other feature of intelligence can in principle be so 00:01:06.660 | 
precisely described that a machine can be made to simulate it, right, fantastic. 00:01:11.540 | 
So single machine, you wanna simulate all of human intelligence, okay. 00:01:16.300 | 
And carefully selected group of scientists, and we think that we can make, 00:01:20.960 | 
actually, the paragraph right before the second set of red underline, 00:01:26.040 | 
is we think that a significant advance can be made in one or two of these problems. 00:01:31.840 | 
If a carefully selected group of scientists work on together for a summer, okay. 00:01:38.080 | 
I don't think they knew of AI winters then actually, they didn't know of it then. 00:01:46.440 | 
the major obstacle is not lack of machine capacity, but 00:01:50.360 | 
our inability to write programs taking full advantage of what we have. 00:01:53.880 | 
>> So, while the goals are noble, it's surprising how wrong you can be with some 00:02:00.120 | 
So Selfridge, a neural network OG who wrote the original Pandemonium, 00:02:05.200 | 
I think he got everything basically set for path problems in black box optimization. 00:02:09.320 | 
Then Minsky, of course, Shannon, Solomonoff, I think it was Solomonoff, MDL. 00:02:16.720 | 
In many ways, you can argue that's the underpinning of 00:02:24.320 | 
But it's really amusing to see the first, I mean, at least I don't know if we'll 00:02:30.160 | 
be able to characterize or write down all the rules for intelligence. 00:02:35.680 | 
So you can imagine that the approaches they were taking were all these rule-based 00:02:39.960 | 
And they couldn't be more wrong on machine capacity. 00:02:44.240 | 
Today's transformers, they don't, they're data centers, right? 00:02:48.440 | 
And I guess they needed a really, really long summer to solve this one. 00:02:55.440 | 
But yeah, so it's 1955, so yeah, like about 60 years. 00:03:03.960 | 
No, not even, I'm getting close to 70, 70 years, right? 00:03:06.600 | 
So, and we're basically talking about the same problems again, 00:03:11.920 | 
except maybe some things work, some things don't work. 00:03:16.840 | 
And this talk is about some of the, one of the pieces that has made this larger 00:03:22.920 | 
enterprise work, and we're getting closer to the original goals of the 00:03:32.760 | 
I mean, so what eventually happened in the field was that their goal of having a 00:03:36.720 | 
single system that explained most of, that was able to mimic our cognitive 00:03:42.960 | 
abilities, which would definitely mean like image processing or image 00:03:46.480 | 
understanding and language processing as well, right? 00:03:48.800 | 
That, I mean, the field got, I mean, a single model or a single approach to do 00:03:54.560 | 
all these things was shattered by like thousands of like different research 00:03:59.040 | 
So, I mean, there was no consolidation, but here's another, here's another, 00:04:02.760 | 
here's another, this is going to be a harder one. 00:04:06.240 | 
Can you tell what is a, this is, this is 2009, and this is a, and this is not a 00:04:11.440 | 
This is a complicated machine translation system. 00:04:13.760 | 
So when I started my PhD, our machine translation systems used to be a bit more 00:04:17.040 | 
complicated, complicated than this, actually. 00:04:19.640 | 
Thousands of pipeline systems, you had to first extract, you had to first do word 00:04:23.880 | 
alignments that actually looked like attention as like art. 00:04:27.000 | 
You think about it as hard attention, then based on that, we extracted like all 00:04:32.320 | 
Then you had to figure out how they then had to, then you had to teach, there was 00:04:36.240 | 
some machine learning there, you had to teach the model how to score them 00:04:40.040 | 
So can you tell, can you, does anybody know where a neural network is in this? 00:04:51.880 | 
So CS, so this is the, this is a machine translation system from 2009, and CSM is a 00:05:03.480 | 
So the world was so discrete then that he had to call these models like continuous 00:05:08.240 | 
And, I mean, it was a lot, it was largely inspired by the neural probabilistic 00:05:20.880 | 
The neural probabilistic language model by Bengio, I think it was in 2003. 00:05:26.040 | 
And so we were, even in 2013, when I published a paper on neural network language 00:05:33.880 | 
models, these models were still being put into the fee for neural network language 00:05:42.240 | 
And now it's incredible if you think about it. 00:05:45.360 | 
So just in terms of consolidation, how all of these complicated systems that have now 00:05:50.360 | 
been replaced by just neurons that talk to each other, and you just learn the rules 00:05:53.840 | 
from, you just learn the rules from data automatically. 00:06:02.400 | 
And so since then, you know, like, so this is what the EMNLP 2013 conference was like. 00:06:10.200 | 
You see these different, like these, you can call it verticalized NLP, these 00:06:15.360 | 
different areas like morphology, dialogue, and discourse. 00:06:19.080 | 
I mean, I don't even know if people talk about dialogue and discourse. 00:06:23.680 | 
There's, I don't know if there's a research track anymore. 00:06:27.360 | 
So there's opinion mining and sentiment analysis. 00:06:32.960 | 
So it's, and so you could see that just in 2013, the field, even research was divided 00:06:39.440 | 
into these smaller tracks and everybody had their own specific, they were bringing their 00:06:46.880 | 
And they had to specialize in a domain in order to solve some tasks. 00:06:51.800 | 
Machine translation, because, I mean, probably because of a lot of government funding as 00:06:56.280 | 
well, we had made a lot of progress and we were making practical translation systems 00:07:01.840 | 
Google Translate was a great example of that, right? 00:07:05.600 | 
And so since then, you're like, you have this, you know, we started to, through first, first 00:07:12.040 | 
we all agree, we need distributed, we need distributed word representations. 00:07:15.400 | 
And you saw this, like, people probably don't know this funky, this funky embedding algebra 00:07:20.800 | 
of king minus man plus woman equals, equals queen from word2vec. 00:07:27.000 | 
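As a rough illustration of the embedding algebra just mentioned, here is a minimal sketch in plain Python. The three-dimensional vectors in `EMB` are hand-built assumptions chosen so the arithmetic is easy to follow; real word2vec vectors are learned from data and have far more dimensions. The helper names `cosine` and `analogy` are hypothetical.

```python
import math

# Hypothetical 3-d embeddings: dimension 0 ~ "royalty", 1 ~ "male", 2 ~ "female".
# These numbers are invented for illustration, not taken from word2vec.
EMB = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, emb=EMB):
    """Return the word nearest to emb[a] - emb[b] + emb[c], excluding the inputs."""
    target = [x - y + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

result = analogy("king", "man", "woman")
print(result)
```

With these toy vectors the nearest remaining word is `queen`; whether the same holds for a trained model depends on the training data.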
And we had a, we had a, and there was a, there was a, there was a big industry of models 00:07:34.000 | 
that actually, that just, that learned word representations and the word representations 00:07:41.040 | 
And then, then came like, you know, another step in this process. 00:07:45.380 | 
Another step in this process where now we started saying, okay, these representations 00:07:49.600 | 
are like in there, but they're, they're, they're only helpful if they're learned in context, 00:07:55.120 | 
So the king should change based on context, like the, the, the, the king of Persia, or 00:08:00.000 | 
the king has no clothes, or the emperor has no clothes, right? 00:08:03.840 | 
And so, so, so that, so we saw these, we saw approaches like sequence to sequence, sequence 00:08:09.120 | 
learning where we started to like formulate, we started to create these general formulations 00:08:18.360 | 
So sequence to sequence formulation, if you can, you can, you can, you can formulate many 00:08:22.040 | 
tasks in language of sequence to sequence, question answering, machine translation, dialogue. 00:08:27.400 | 
So, and then, and then of course we had, then we, then we developed attention, right? 00:08:31.680 | 
Which was a, which was a very effective content based way to summarize information. 00:08:37.000 | 
If you were a, typically you have these encoder decoder architectures, everybody has probably, 00:08:41.560 | 
I'm guessing, familiar with encoder decoder architectures, right? 00:08:44.300 | 
So yeah, encoder decoder architecture and a, and a, and a position on the decoder side 00:08:48.540 | 
would summarize based on its content, all the information on the, on the source sentence, 00:08:53.740 | 
And this was really effective content-based way of summarizing information. 00:08:56.580 | 
And what, what started happening was we started these, these general, these general paradigms 00:09:03.860 | 
Sequence to sequence learning can solve, it can solve most language problems because 00:09:08.220 | 
most language problems have to deal with learning representations of variable length. 00:09:12.100 | 
The goal is to learn representations of variable length sequences. 00:09:14.740 | 
And if you do that successfully, you can then potentially solve that problem. 00:09:18.280 | 
And then attention was an excellent way, a content-based way to actually summarize information 00:09:25.060 | 
And so the major workhorse until then were 00:09:33.780 | 
Where basically the, the, the, the method was typically the same. 00:09:37.820 | 
You had a, you had a sentence and you crushed the sentence into a, into a set of, into a 00:09:41.980 | 
set of vectors, set of representations, one typically, typically one for each position, 00:09:47.860 | 
And the way LSTMs did it was where they walked along the sentence, they ate up a word, and 00:09:52.260 | 
then they summarized, they summarized the entire history into one fixed bottleneck. 00:09:57.500 | 
And that bottleneck was then transmitted, was updated based on the next word. 00:10:01.020 | 
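The "walk along the sentence and fold everything into one fixed state" idea can be sketched as follows. This is only an illustration under stated assumptions: a plain tanh recurrence stands in for the LSTM (a real LSTM adds gates and a separate cell state), and the scalar weights `w_in` and `w_rec` are made-up values rather than learned parameters.

```python
import math

def rnn_encode(token_vectors, w_in=0.5, w_rec=0.9):
    """Read tokens left to right, updating one fixed-size state per step.

    Returns the state after each token, i.e. one representation per position.
    """
    state = [0.0] * len(token_vectors[0])  # the fixed-size bottleneck
    states = []
    for x in token_vectors:
        # Step t needs the state from step t-1, so the loop cannot be
        # parallelized across positions.
        state = [math.tanh(w_rec * s + w_in * xi) for s, xi in zip(state, x)]
        states.append(state)
    return states

# Four hypothetical 2-d token vectors.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
states = rnn_encode(sentence)
print(len(states), len(states[0]))
```

However long the sentence is, the state keeps the same size, which is the bottleneck being described.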
So, so, so now, and, and, and, and if we, if you were successfully able to learn representations, 00:10:07.360 | 
then we could solve these tasks, translation, summarization, dialogue. 00:10:09.620 | 
So it's an important movement, and, I guess, when 00:10:14.580 | 
was the sequence to sequence learning papers, 2015, NeurIPS, then we saw, then we saw the 00:10:19.140 | 
attention paper in around 2015, 2016, and the machine translation community was kind 00:10:23.440 | 
of the first to respond and say, hey, yeah, you know, machine translation is a classic 00:10:31.860 | 
And then can we still build native, greeny, rethink machine translation with the sequence 00:10:39.780 | 
I don't know if you guys have ever done these exercises on LSTMs can, can count, like if 00:10:45.260 | 
you, if you, for example, if you, if you train an encoder decoder on, if you, like on A, 00:10:50.980 | 
to model A to the N, B to the N. So you feed in NAs and you ask the decoder to predict 00:10:55.020 | 
N, NBs, and you actually, just a single cell LSTM, if you know the structure of an LSTM, 00:11:01.380 | 
there's a cell that basically keeps, so it's a, it's a notion of state, and just a single 00:11:05.700 | 
cell is able to actually just do trivial counting. 00:11:08.980 | 
It counts how many A's you consumed, and then it decrements it, and then when you consume 00:11:13.820 | 
all the, exactly the same number of B's as the number of A's, something lights up and 00:11:17.620 | 
says, I'm done, I've recognized this language, so you can train trivial A to the N, B to 00:11:21.860 | 
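The counting behavior described here can be written down by hand. The sketch below is not a trained LSTM; it is a hand-coded single counter, assumed here to mirror what one cell would need to learn in order to recognize a^n b^n. The function name `recognizes_anbn` is hypothetical.

```python
def recognizes_anbn(s):
    """Accept strings of the form a^n b^n (n >= 1) using a single counter."""
    count = 0        # plays the role of the single cell's state
    seen_b = False
    for ch in s:
        if ch == "a":
            if seen_b:           # an 'a' after a 'b' breaks the pattern
                return False
            count += 1           # increment for every 'a' consumed
        elif ch == "b":
            seen_b = True
            count -= 1           # decrement for every 'b' consumed
            if count < 0:        # more b's than a's
                return False
        else:
            return False
    return seen_b and count == 0  # "lights up" only when the counts match

for example in ["aabb", "ab", "aab", "abab"]:
    print(example, recognizes_anbn(example))
```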
And here, you have a, I'm sorry, this is not clear, but you have somewhat of a, you have 00:11:26.140 | 
a grammar here, and you can see that these are different cells, there's about eight cells 00:11:29.740 | 
here, and each one of these cells actually increments its counter once it reads a particular 00:11:34.060 | 
symbol, and it's able to actually track how deep you are in this, how deep you are in 00:11:40.020 | 
And Google, of course, the crowning achievement, perhaps, of sequence-to-sequence models, which 00:11:47.740 | 
I was actually, right, I was fortunate to be in the same cubicle as this work was being 00:11:55.340 | 
done, was the Google neural machine translation system, where they took LSTMs, I mean, they 00:11:59.940 | 
added many advancements, of course, a lot of systems improvements, a lot of data that 00:12:04.700 | 
Google had, and they produced what you might, at that time, the state-of-the-art neural 00:12:09.540 | 
machine translation system, sequence-to-sequence models. 00:12:12.020 | 
So now, this big consolidated, this big complicated system, which looked much, which looked much 00:12:17.380 | 
more complicated, had now become just a single homogeneous neural network, 00:12:23.820 | 
So at the time, the biggest frustration we had was, this was, I mean, these, the LSTMs 00:12:29.620 | 
were the primary workhorse, and the biggest, the biggest frustration we had was, I mean, 00:12:36.100 | 
not only were we producing, not only were we, did we produce the output autoregressively, 00:12:41.420 | 
we were sequentially decoding the output, left to right, but also, we were reading the 00:12:46.500 | 
So you had to kind of, in order to produce that, in order to produce that representation 00:12:49.980 | 
for the 10th word, you had to eat up, you had to, the first word, the second word, the 00:12:54.820 | 
So that was really slow, 00:12:59.580 | 
and another big problem with LSTMs were that you have this bottleneck that basically, that 00:13:05.740 | 
contains all the information about your past. 00:13:08.340 | 
So you have to now, you have to now crush, you have to, you have to pack both long distance 00:13:12.820 | 
interactions that you might have, and local interactions through the single, single fixed 00:13:19.200 | 
And sequentiality inhibits parallelism, which means that you couldn't 00:13:23.940 | 
even read, like the encoder couldn't even read the sentence in parallel, and of course 00:13:28.220 | 
decoding was autoregressive, so you couldn't even write in parallel, right? 00:13:32.620 | 
And convolutions, they were starting to emerge as a solution largely. I mean, they had been 00:13:39.100 | 
very successful in-- they had been very successful in computer vision. They had also figured 00:13:43.220 | 
out how to optimize them well, how to make them really fast on GPUs, because they're 00:13:50.340 | 
just basically matrix multiplications. And matrix multiplication is largely-- it's parallelizable. 00:13:55.980 | 
So convolutions were a solution to this problem of not being able to read in parallel, because 00:14:03.300 | 
you could-- in parallel, every word could basically produce its representation by looking 00:14:08.340 | 
at its neighbors, its local neighbors. And there were some breakthrough papers, such 00:14:16.380 | 
as ByteNet for machine translation, the convolutional sequence-to-sequence model that was contemporaneous 00:14:22.020 | 
to the transformer, actually, probably predated by a few months, where they used convolutions 00:14:26.260 | 
both in the encoder and decoder to get good scores on machine translation that were better 00:14:31.180 | 
than the Google neural machine translation system. And of course, probably the most successful 00:14:39.780 | 
was WaveNet, which was a text-to-speech system that was state-of-the-art at the time. 00:14:45.260 | 
And again, so convolutions still have this problem that, one, I guess they were parallelizable, 00:14:53.940 | 
but the issue was that you still-- you couldn't directly capture long-distance interactions 00:14:59.880 | 
between-- you couldn't directly capture long-distance interactions between words. So if you're 00:15:04.860 | 
basically a receptive field, if it's like a 3 by 3, if it's a 1 by 3, then it basically 00:15:10.620 | 
grows linearly, either with the factor of-- it grows linearly with the number of layers 00:15:15.860 | 
each time it expands by 3. So you still needed a linear number of layers to capture these 00:15:22.020 | 
long-distance relationships. But attention, on the other hand, was this really effective 00:15:26.500 | 
mechanism that we knew was-- that could actually get-- in one, it could actually capture all 00:15:34.660 | 
the interactions between one word and every other word using content-based addressing. 00:15:40.780 | 
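To make the linear receptive-field growth concrete, here is a small arithmetic sketch. It assumes stride-1, undilated 1-D convolutions, for which each extra layer widens the receptive field by `kernel_width - 1` positions; the function names are hypothetical. Self-attention, by contrast, connects any two positions within a single layer.

```python
import math

def receptive_field(num_layers, kernel_width=3):
    """Receptive field of stacked stride-1, undilated 1-D convolutions."""
    return 1 + num_layers * (kernel_width - 1)

def layers_to_cover(span, kernel_width=3):
    """Fewest stacked layers whose receptive field is at least `span` positions."""
    return math.ceil((span - 1) / (kernel_width - 1))

for n in (9, 101, 1001):
    print(n, "positions:", layers_to_cover(n), "conv layers vs 1 attention layer")
```

Dilated or strided convolutions grow the receptive field faster, so this sketch describes only the plain case discussed in the talk.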
Because convolutions basically match-- convolutions match content with parameters. Attention was 00:15:45.540 | 
actually able to use content with content. So based on how similar I am to my neighborhood, 00:15:49.820 | 
based on how similar I am to my neighbors, I'm going to absorb that information. And 00:15:54.420 | 
this motif actually appears everywhere, even in computer vision. 00:15:58.340 | 
So maybe, actually, I can go there. So here's a-- in vision, there is this approach-- do 00:16:05.460 | 
people here know non-local means? So in computer vision, there's an approach called non-local 00:16:11.940 | 
means that's basically-- it was originally developed for image denoising. So if you want 00:16:18.380 | 
to denoise an image patch, you look at all your neighbors, and you see which patch is 00:16:23.460 | 
very similar to you. And based on the similarity, you actually pull in that information. And 00:16:27.980 | 
this largely works in images because images are very self-similar. This starts sounding 00:16:31.980 | 
like, hey, based on content, I want to pull in information. And again, there were similar-- 00:16:36.740 | 
there were approaches like texture synthesis by Efros, where if you wanted to-- if you 00:16:41.960 | 
wanted to do inpainting, or if you wanted to generate an image, then you would look at 00:16:45.640 | 
a patch that's similar to this rectangle in some other-- in your dictionary, or in a database 00:16:51.740 | 
that you have of patches. And then based on what's closest to it, you actually bring it. 00:16:55.940 | 
So you'll bring that patch, and then you'll paste it there. So these approaches that looked 00:16:59.580 | 
like attention were already prevalent. It's a very natural formulation. And the Bahdanau 00:17:08.460 | 
paper had shown that this actually works really well for language as well. 00:17:12.180 | 
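The non-local means idea mentioned above (compare your patch with every other patch, then pull in values according to similarity) can be sketched in one dimension. This is a simplified illustration, not the published algorithm in full: the exponential weighting, the bandwidth `h`, the patch radius, and the border clamping are all assumptions made for the example.

```python
import math

def nonlocal_means_1d(signal, patch_radius=1, h=0.5):
    """Replace each sample by a patch-similarity-weighted average of all samples."""
    n = len(signal)

    def patch(i):
        # Clamp indices at the borders so every patch has the same length.
        return [signal[min(max(j, 0), n - 1)]
                for j in range(i - patch_radius, i + patch_radius + 1)]

    patches = [patch(i) for i in range(n)]
    out = []
    for i in range(n):
        # Similar patches get weights near 1, dissimilar ones near 0.
        weights = []
        for j in range(n):
            dist2 = sum((a - b) ** 2 for a, b in zip(patches[i], patches[j]))
            weights.append(math.exp(-dist2 / (h * h)))
        total = sum(weights)
        out.append(sum(w * x for w, x in zip(weights, signal)) / total)
    return out

noisy = [0.0, 0.1, -0.1, 0.0, 1.0, 0.9, 1.1, 1.0]
denoised = nonlocal_means_1d(noisy)
print([round(v, 3) for v in denoised])
```

The structure (similarity scores, normalization, weighted sum of values) is the same content-based pooling pattern that attention uses.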
So the question then was, OK, why can't we then learn representations? Instead of being 00:17:17.220 | 
this source target, why can't we actually learn representations by the sentence attending 00:17:21.740 | 
onto itself? So now you basically use-- instead of attending a source sentence, attending 00:17:26.900 | 
to a target sentence, can it just attend to itself? 00:17:29.260 | 
And the original goal of actually when we wanted to actually do parallel decoding-- 00:17:36.420 | 
so attention by construction is parallelizable, because each token can basically construct 00:17:44.340 | 
its representations from its neighbors in parallel, right? And it directly captures 00:17:49.900 | 
token-to-token interactions, because now, of course, we'll run into complexities of 00:17:54.420 | 
length, but we can-- and we'll discuss how to solve some of these things later, how to 00:17:59.580 | 
overcome them. But you can direct-- instead of having this sort of linear growth in receptive 00:18:03.220 | 
field, you can directly capture these interactions. 00:18:05.620 | 
Because convolutions, if you have a very, very large receptive field, it gets computationally 00:18:08.660 | 
very expensive. And it also had these explicit gating and multiplicative interactions, which 00:18:12.740 | 
we've often seen, like, in Gated PixelCNN or GLUs. These explicit gated-multiplicative 00:18:20.220 | 
interactions have typically helped training and have led to better accuracies. 00:18:26.420 | 
And as I mentioned, the original motivation of why we actually wanted to do this was, 00:18:31.300 | 
we said, hey, OK, so the LSTMs are-- we have good translation systems, but the problem 00:18:37.340 | 
is that actually, both reading and writing sequentially, can we actually do both in parallel? 00:18:42.820 | 
So we wanted to read-- we wanted to read the German sentence in parallel and then translate 00:18:47.540 | 
it in-- and then also write in parallel by that, instead of actually decoding it sort 00:18:52.180 | 
of autoregressively, can you decode it-- instead of decoding in time, can you decode it in 00:18:57.500 | 
So, like, you first spit out one word, or you spit out all the words, and you iteratively 00:19:01.000 | 
refine them, right? And this was-- this turned out to be very, very challenging and hasn't 00:19:07.260 | 
been solved successfully until today. Because the biggest challenge, essentially, is when 00:19:11.580 | 
you-- whenever you're decoding, right, essentially, as you predict a word, you kind of bend the 00:19:15.100 | 
probability distribution that then nails down-- narrows down what you're going to predict 00:19:19.540 | 
later on. And the ordering that allows you, basically-- the ordering that allows you to 00:19:25.340 | 
nail these modes was very hard to learn. So imposing a left-to-right ordering is much 00:19:30.500 | 
easier than actually not having one and having to learn it as you're decoding. 00:19:34.740 | 
So the original approaches didn't work, but then we still had-- we still had-- we still 00:19:40.140 | 
had our salvation in being able to read it parallelly. So we said, all right, let's take 00:19:43.700 | 
this back to the encoder-decoder models. And unlike-- at that time, there were a few formulations, 00:19:50.700 | 
right? So we had this sort of-- the original formulation of attention from Graves, then 00:19:55.780 | 
we had the additive attention formulation, and we took the-- and we took the dot product 00:20:00.020 | 
attention formulation, largely because it allowed us to do-- it-- because it allowed 00:20:05.940 | 
us to actually do attention as a matrix multiplication. And oftentimes, some of the biggest constraints 00:20:11.340 | 
that actually-- physics is such a big constraint in neural networks that if you can-- if you 00:20:17.220 | 
can make your-- if you can make your architecture amenable to modern accelerators, you have 00:20:22.660 | 
a much better chance of-- you have a much better chance of succeeding. And dot product 00:20:26.820 | 
attention could be expressed as a matrix multiplication, and it's-- and there are already sub-- there 00:20:31.260 | 
are already kernels for being able to do matrix multiplication very effectively on the GPU. 00:20:36.300 | 
So we had-- so the formulation was, all right, so now we have-- similar to the dot product 00:20:40.500 | 
attention, we had a scaling factor, simply because if the dot product actually becomes 00:20:45.140 | 
too big and you can solve it under certain assumptions of mean and variance in the representations, 00:20:50.860 | 
you can-- it hasn't updated, actually. Yeah, so our formulation is basically you have-- 00:20:58.420 | 
you now have your queries, which-- what you end up doing is if you have a-- if you have 00:21:04.740 | 
a position, you first project it into queries, and then the same-- the same token-- the same-- 00:21:10.500 | 
the representation of the same token gets projected into-- to also keys and values. 00:21:14.740 | 
And the first-- the query determines how much-- how much you're actually going to pull from 00:21:20.780 | 
all these keys. So you first do a dot product of the query with every key, and then based 00:21:25.380 | 
on that, you combine or you pool the content of-- in all these positions based on-- based 00:21:30.820 | 
on what the-- based on what the score was after-- after normalizing and using a softmax. 00:21:34.980 | 
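The query/key/value computation described above can be sketched in plain Python. Several things here are assumptions made for brevity: the learned projections into queries, keys and values are omitted (the caller passes the already-projected vectors), and the names `softmax` and `attention` are hypothetical. The scale factor is 1/sqrt(d_k): if query and key components are independent with zero mean and unit variance, their dot product has variance d_k, so dividing by sqrt(d_k) keeps the logits at unit scale. The optional `causal` flag is an added illustration of decoder-side masking, where position i may only look at positions up to i.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, causal=False):
    """Scaled dot-product attention; returns (outputs, attention weights)."""
    d_k = len(keys[0])
    scale = 1.0 / math.sqrt(d_k)
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Dot product of this query with every key, scaled by 1/sqrt(d_k).
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        if causal:
            # Mask out future positions so they receive zero weight.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Pool the values according to the normalized scores.
        pooled = [sum(w * v[c] for w, v in zip(weights, values))
                  for c in range(len(values[0]))]
        outputs.append(pooled)
        all_weights.append(weights)
    return outputs, all_weights

# Self-attention: the same (hypothetical) vectors act as queries, keys and values.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, w = attention(x, x, x)
out_causal, w_causal = attention(x, x, x, causal=True)
print(w_causal[0])
```

Written with matrices, the loop over queries collapses into two matrix multiplications around a row-wise softmax, which is what makes this formulation friendly to accelerators.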
So in some sense, you can think of self-attention also as kind of a content-based pooling mechanism, 00:21:41.860 | 
right? And the scaling factor basically avoids you-- avoids you-- like, it saved us from 00:21:47.260 | 
these logits actually blowing up and training becoming unstable. And on the decoder side, 00:21:52.820 | 
you could trivially-- you can trivially implement causality by just adding an-- adding an attention-- 00:21:58.260 | 
adding an attention mask. And what this-- where this-- where this brings us is that-- 00:22:03.620 | 
all right, so-- so now we've-- we've solved-- now there's-- it's-- so a caveat on the flops. 00:22:09.260 | 
We'll actually cover this later. But now what we have is a mechanism that-- that's parallelizable. 00:22:14.340 | 
It gives you direct-- it gives you direct content-- it gives you direct token interactions 00:22:19.500 | 
that will-- and that-- that we-- that we assume-- that we believe is going to help you actually 00:22:22.940 | 
learn-- model these relationships between the words better. And it's-- and it's-- and 00:22:26.500 | 
the complexity of self-attention is faster than convolutions, right? Because it was-- 00:22:30.100 | 
because convolutions are quadratic in the number-- they're quadratic in the number of 00:22:34.420 | 
channels and the number of-- in the hidden dimension, but a self-attention is quadratic 00:22:37.660 | 
in the length. So if your length is not much more than a hidden dimension, you've actually 00:22:40.500 | 
saved on flops. Now this is a-- not-- not quite a complete picture because not all flops 00:22:45.860 | 
are equal, and we'll talk about this later on. And-- and-- and-- and now when you put 00:22:52.060 | 
it-- when you put everything together, what we-- what-- basically, we-- we kind of took 00:22:56.060 | 
the-- the-- the-- the-- the basis-- this has a very strong similarity to the-- to the ResNet 00:23:01.260 | 
architecture, actually. So if we look at ResNets, right? So in ResNets, you have contraction, 00:23:06.260 | 
you have spatial mixing with convolutions, and then you have the expansion again, right? 00:23:10.380 | 
If you just-- the transformer, if you just adjust, if you just move it one-- one step 00:23:14.580 | 
down, it's very-- it's analogous. You have-- you have attention, then you have expansion 00:23:17.980 | 
and contraction, but it is a-- and-- and the difference in where the residual connections 00:23:21.940 | 
are, but it's a-- it's a very similar-- it's a very similar sort of basic building block 00:23:25.940 | 
with, say, the residual-- with the residual connections, and you have these contractions 00:23:29.540 | 
and expansions. And in the transformer, those were-- there was multi-head attention with 00:23:34.660 | 
expansion and contraction, which was in the feed-forward layers. And with-- and-- and 00:23:39.020 | 
then one-- one-- one challenge with attention, we loo-- LSTMs can count, they can impact, 00:23:44.220 | 
they can-- they can-- they can-- they can count-- they can learn interesting temporal 00:23:49.100 | 
patterns, but attention is permutation-invariant, so we had to actually add position-- we had 00:23:55.940 | 
to add position information so that we could-- we could learn ordering. So we add position 00:23:59.860 | 
information at the input, which trans-- gets transmitted to the other layers through-- 00:24:04.580 | 
through the-- through the residual connections. And the-- the original paper, we had those-- 00:24:09.500 | 
we had post-layer norm, but later on, we realized that as we actually make the model deeper, 00:24:14.660 | 
post-layer norm is-- doesn't-- doesn't allow you to train effectively, so we have to-- 00:24:18.620 | 
then we did-- then we used a pre-layer norm formulation, which was also observed in the 00:24:24.780 | 
And so the model is basically, all right, you've got your input, well, you have spatial 00:24:29.940 | 
mixing-- spatial mixing through attention, feed-forward layers, and this sort of repeats. 00:24:34.900 | 
And the-- the difference in-- on the decoder side is that you also now have encoder-decoder 00:24:39.780 | 
attention and encoder-decoder attention at every-- at every-- at every layer. If there's 00:24:47.380 | 
Yes, what was your question behind the [INAUDIBLE] post-layer norm? 00:24:51.500 | 
Oh, so-- so it ended up-- so if you do post-layer norm, then-- then-- 00:24:57.860 | 
actually, Liz-- Liz, do I have that slide? Let me check. Probably I've deleted it. But 00:25:02.380 | 
if you do post-layer norm, then you are basically squashing both the residual and the additive 00:25:09.060 | 
parts. So when you-- so your activations from the lower layers keep getting-- keep going 00:25:13.340 | 
through layer norms. But in pre-layer norm, you're only-- you're only-- a residual path 00:25:17.660 | 
has a layer norm, which means your-- your activations all the way from the bottom of 00:25:21.340 | 
the model are free. They're untouched, and they can pass through the-- yeah. 00:25:27.820 | 
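The difference between the two placements can be sketched as follows. This is a simplified illustration: `layer_norm` omits the learned gain and bias, and `sublayer` is a stand-in for attention or the feed-forward layer. Using a sublayer that outputs zeros makes the point visible: the pre-layer-norm block passes its input through untouched, while the post-layer-norm block still rescales it.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_ln_block(x, sublayer):
    """Post-LN: out = LayerNorm(x + sublayer(x)); the skip path is normalized too."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln_block(x, sublayer):
    """Pre-LN: out = x + sublayer(LayerNorm(x)); the skip path is left untouched."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def zero_sublayer(x):
    """Hypothetical sublayer that contributes nothing, to isolate the skip path."""
    return [0.0] * len(x)

x = [1.0, 2.0, 4.0]
print(pre_ln_block(x, zero_sublayer))
print(post_ln_block(x, zero_sublayer))
```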
Yes, OK. OK, so now-- so now-- I mean, so until this point, we haven't discussed why 00:25:34.180 | 
did we-- you know, we haven't discussed multi-head attention, which ended up being very important. 00:25:40.220 | 
So one of the problems with attention is that imagine if you wanted to-- I mean, so oftentimes 00:25:48.900 | 
language is about understanding who did what to whom. So in this case, the cat licked the 00:25:54.340 | 
owner's hand. So licked-- who licked what? Like, the cat licked the owner, right? So 00:25:58.620 | 
now if you actually want to combine information from these two slots, these positions, these 00:26:03.940 | 
vectors, then the best you could do with attention is 0.5, 0.5 to the single layer, right? Half 00:26:08.700 | 
probability, half probability. But then they get mushed together, right? But now imagine 00:26:12.140 | 
the-- imagine the strength that a convolution has. It can actually have-- that actually 00:26:18.940 | 
should have-- well, OK, well, I think the point will still come across. So now what 00:26:25.300 | 
a convolution can do is because it has-- it basically applies-- essentially, a convolution, 00:26:30.460 | 
in this case, it's a 5 by 1. All it really does is it just applies a different linear 00:26:36.540 | 
transformation at each position, right? So it can take any-- and because these linear 00:26:41.740 | 
transformations are different, it can-- the first linear transformation can learn, I'm 00:26:46.100 | 
going to take a little bit of information from here. I'm going to take a little bit 00:26:49.140 | 
of information from here. And I'm going to put them together, right? And the attention, 00:26:53.460 | 
the best way that you could actually just do this is best by averaging. That would mush 00:26:57.300 | 
But having different linear transformations allows you to take a part of the embedding 00:27:00.660 | 
here, a part of the embedding here, mix it up, and then maybe put it together without 00:27:04.300 | 
actually then interfering with each other. And multi-head attention, which is a bit like 00:27:08.500 | 
basically a multi-tape, multi-head Turing machine with different read-write 00:27:16.300 | 
heads, essentially allows you-- starts getting you that property back, where now what you 00:27:22.600 | 
do is you essentially-- you now-- you bring back the ability to select different parts 00:27:28.580 | 
of the input. So you chop up the hidden dimension into independent pieces. And then each one 00:27:33.580 | 
of them is now able to do attention. So now you can have probability 1 in this place and 00:27:37.980 | 
probability 1 in this other subspace, instead of having 0.5, 0.5. So now you don't have 00:27:42.460 | 
to-- you don't have to get these averaging effects. You can actually be selective, right? 00:27:48.340 | 
And also, for computational reasons, instead of actually having eight attention heads 00:27:53.420 | 
of d dimensions, 00:27:58.780 | 
we had eight attention heads of d by 8 dimensions, right? 00:28:03.260 | 
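A minimal sketch of the head-splitting idea described above, in plain Python. The function names (`attend`, `multi_head`, `logit_flops`) are illustrative, and the learned projection matrices are left out as a simplifying assumption. It shows a case where one head is forced to average two positions 0.5/0.5, while two heads over half-size subspaces can each pick a different position.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    probs = softmax(logits)
    dim = len(values[0])
    return [sum(p * v[j] for p, v in zip(probs, values)) for j in range(dim)]

def split(vec, num_heads):
    """Chop the hidden dimension into num_heads independent pieces."""
    size = len(vec) // num_heads
    return [vec[h * size:(h + 1) * size] for h in range(num_heads)]

def multi_head(q, keys, values, num_heads):
    """Run attention separately in each subspace and concatenate the results."""
    q_parts = split(q, num_heads)
    k_parts = [split(k, num_heads) for k in keys]
    v_parts = [split(v, num_heads) for v in values]
    out = []
    for h in range(num_heads):
        out += attend(q_parts[h], [k[h] for k in k_parts], [v[h] for v in v_parts])
    return out

def logit_flops(n, d, num_heads):
    """Multiply-adds for all query-key dot products with heads of size d / num_heads."""
    return num_heads * n * n * (d // num_heads)

# Toy input (made up): the query matches position 0 in the first half of the
# hidden dimension and position 1 in the second half.
q = [10.0, 0.0, 10.0, 0.0]
keys = [[10.0, 0.0, -10.0, 0.0], [-10.0, 0.0, 10.0, 0.0]]
values = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
single = attend(q, keys, values)        # one head: the two matches cancel, so it averages
multi = multi_head(q, keys, values, 2)  # two heads: each subspace selects its own position
```

With heads of size d / 8, `logit_flops` gives the same count for eight heads as for one head of size d, which is the cost argument made in the talk.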
So we wouldn't incur any more flops; we stay at the same amount 00:28:10.540 | 
of flops. But that's only half the story, because the attention heads themselves turn 00:28:14.180 | 
out to be quite expensive, and later on there were improvements 00:28:17.620 | 
that needed to be made, right? And the most important part-- probably the most important 00:28:26.260 | 
result was that, with the transformer, we were able to outperform previous ensembled 00:28:33.860 | 
models as well. And that was very, very exciting, that, hey, this single model actually is able 00:28:38.500 | 
to outperform previous ensembled models. And not only that-- and this was machine translation 00:28:44.580 | 
on the WMT 2014 English-German and English-French machine translation tasks. And not 00:28:51.780 | 
only were we able to do it in less flops, but also these-- it was very clear that this 00:29:01.660 | 
was a very general model, as we immediately applied it to parsing, and we were able to 00:29:06.220 | 
get-- we were able to get, with a small model, excellent results. 00:29:10.980 | 
So in some sense, this was very exciting, because this meant that, all right, now this 00:29:18.900 | 
consolidation that we're trying to go for in machine learning, we probably have a model 00:29:22.460 | 
that's more general than what we had before, and we can now throw it at different-- maybe 00:29:27.420 | 
we can now throw it at different problems, right? And ultimately, why? Because it would 00:29:32.580 | 
be helpful to have a single model that's able to combine representations from speech, images, 00:29:40.420 | 
and language. And if you had a general substrate that worked well in all tasks, then potentially 00:29:45.940 | 
you could get to the single multimodal model. 00:29:51.180 | 
Sometimes interpretability is like tea leaves. It's like reading tea leaves, so one should 00:29:57.660 | 
be careful. But it was nice that the attention by itself can give you some interpretability, 00:30:03.180 | 
and we were able to kind of see how some of these attention heads, or some of these attention 00:30:09.140 | 
mechanisms were actually able to learn long-distance relationships. And 00:30:14.100 | 
early on in the transformer, we saw this generally invariant pattern, where some 00:30:20.340 | 
of the attention heads basically turned out to just look like convolutions. They were 00:30:23.740 | 
just putting in local information. There has, of course, now been much more advanced work 00:30:27.860 | 
with some of the mechanistic interpretability stuff with grokking and the stuff that's happening 00:30:34.060 | 
at Anthropic, where they're now actually learning how to interpret 00:30:40.300 | 
these induction heads. So it's interesting. But we were able to see some anecdotal evidence 00:30:46.900 | 
of these heads actually performing very, very distinct and clear actions. 00:30:51.060 | 
OK, so if there's any more questions, then I'll pause for a second. 00:30:57.740 | 
Do you, by the research, find that it's the induction heads that are causing the in-context 00:31:04.180 | 
Yeah, it's hard to tell. So I haven't looked at the most recent work, but they have 00:31:10.220 | 
solved this issue of superposition. Is that right? So now, with having solved that, they're 00:31:14.180 | 
able to-- does that roughly mean that now they'll be able to assign distinguishing features 00:31:19.260 | 
to each one of these heads and be able to explain it, from what I understand? Or the 00:31:25.140 | 
in-context learning part is that-- is it that they have to show it, or is it that they're 00:31:30.660 | 
saying that in-context learning happens because of induction heads? 00:31:34.340 | 
Yeah, it's the latter. Yeah, it's not clear, because-- yeah, I think there's probably many, 00:31:42.860 | 
many kinds of-- in-context learning is shown to work in so many different tasks that-- 00:31:49.140 | 
and actually, I haven't followed this quite well. I don't know specifically-- what are 00:31:53.300 | 
the induction heads typically-- what kinds of properties do they have? Do you know what 00:31:59.660 | 
OK, so yeah, so then-- so since both of us don't know this really, really well, we won't 00:32:05.420 | 
be able to go very far here. But I'm not sure if they've gotten to the point where they're 00:32:10.580 | 
able to explain most of the in-context learning because of induction heads, from what I understand. 00:32:14.420 | 
They might have, yeah. Does anybody know about the induction heads? OK, so now, over the 00:32:25.780 | 
years, so there have been a few-- there have been many papers, but there have been a few 00:32:33.340 | 
changes that have been important. There have been a few changes that have stuck, and the 00:32:39.060 | 
new transformers typically have these improvements, right? And we'll go from bottom to top with 00:32:47.500 | 
some of them and see which ones have actually stuck, right? 00:32:51.260 | 
So we started with the first-- one of the biggest problems with self-attention was that 00:32:55.140 | 
self-attention itself is permutation invariant, right? You need to dope position 00:33:04.260 | 
information in order for it to learn some kind of temporal structure. And in the original 00:33:09.260 | 
transformer, we used these sinusoids, and we had hoped that it would actually learn 00:33:13.540 | 
relative position encodings, because 00:33:19.660 | 
you could decompose the position embedding of another position as some linear function 00:33:24.060 | 
of the previous one, with a factor that depends on the relative 00:33:30.500 | 
distance between the two. But that didn't happen. Learned position encodings in the 00:33:34.900 | 
original paper did just as well, and so we were not quite 00:33:43.620 | 
able to get the model to learn relative distances using the sinusoids. 00:33:46.980 | 
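The decomposition mentioned above can be checked numerically. This is a small sketch, assuming the usual interleaved sine/cosine layout and a base of 10000; the function names are illustrative. It shows that the encoding at `pos + offset` is a linear function of the encoding at `pos`, where the linear map depends only on `offset`.

```python
import math

def sinusoid(pos, d_model, base=10000.0):
    """Sinusoidal position encoding as interleaved (sin, cos) pairs."""
    pe = []
    for i in range(d_model // 2):
        w = base ** (-2.0 * i / d_model)
        pe += [math.sin(pos * w), math.cos(pos * w)]
    return pe

def shift(pe, offset, base=10000.0):
    """Map PE(pos) to PE(pos + offset) using a per-pair rotation that depends only on offset."""
    d_model = len(pe)
    out = []
    for i in range(d_model // 2):
        w = base ** (-2.0 * i / d_model)
        s, c = pe[2 * i], pe[2 * i + 1]
        cs, sn = math.cos(offset * w), math.sin(offset * w)
        # Angle-addition identities for sin(a + b) and cos(a + b).
        out += [s * cs + c * sn, c * cs - s * sn]
    return out

shifted = shift(sinusoid(7, 16), 5)  # matches sinusoid(12, 16) up to rounding
```

The property holds exactly; the observation in the talk is that the model did not end up exploiting it on its own.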
So then a couple of important-- and this is a very biased sample, but I think it generally 00:33:52.820 | 
covers a large set of papers. There's roughly sort 00:33:59.820 | 
of three categories, right? And all of them are kind of now explicitly learning 00:34:05.700 | 
relative embeddings. So in the relative position transformer, 00:34:13.340 | 
we had an embedding for every pair of relative positions. And using that, we basically then 00:34:18.300 | 
did a dot product of that embedding with a query that produced a logit that modulated 00:34:23.060 | 
according to the relative distance. And we found this 00:34:27.900 | 
to be extremely useful for translation, but I'll show also in music. 00:34:34.180 | 
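A rough sketch of the relative-embedding logit described above: the usual content term plus a dot product between the query and an embedding chosen by relative offset. The clipping distance `max_dist` and all names are assumptions for illustration, and this naive double loop is not the memory-efficient formulation.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def relative_logits(queries, keys, rel_emb, max_dist):
    """logit[i][j] = q_i . k_j + q_i . r[clip(j - i)]

    rel_emb holds 2 * max_dist + 1 vectors, one per clipped offset
    from -max_dist to +max_dist.
    """
    n = len(queries)
    logits = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            offset = max(-max_dist, min(max_dist, j - i))
            r = rel_emb[offset + max_dist]
            content_term = dot(queries[i], keys[j])
            position_term = dot(queries[i], r)
            logits[i][j] = content_term + position_term
    return logits

# Made-up toy input: identical queries and zero keys, so only the
# position term contributes and the logit depends only on j - i.
queries = [[1.0, 0.0]] * 3
keys = [[0.0, 0.0]] * 3
rel_emb = [[-2.0, 0.0], [-1.0, 0.0], [0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
logits = relative_logits(queries, keys, rel_emb, max_dist=2)
```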
Another sort of-- maybe a simplification, this is the ALiBi paper where-- this is non-parametric. 00:34:40.100 | 
These are not learned, where instead of an embedding for every pair of positions, you 00:34:44.740 | 
actually have a single bias, right? So you just add a single bias to the logit, and you 00:34:50.220 | 
can either learn it, or you can use a heuristic, which ALiBi did. And one other advantage about 00:34:58.300 | 
relative position encodings is that they could potentially allow you to extrapolate to 00:35:02.500 | 
longer sequence lengths, which you couldn't do with absolute position encodings. 00:35:08.340 | 
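A small sketch of the single-bias idea: a fixed, non-learned penalty proportional to distance is added to each logit. The geometric slope schedule below is my recollection of the ALiBi heuristic for power-of-two head counts and should be treated as an assumption, as should the function names.

```python
def alibi_slopes(num_heads):
    """One fixed slope per head: 2^(-8/H), 2^(-16/H), ... (assumed schedule)."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(n, slope):
    """Causal bias matrix: query i sees key j <= i with penalty slope * (i - j)."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(n)]
        for i in range(n)
    ]

slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])  # added to the q . k logits before the softmax
```

Because the bias depends only on `i - j`, the same rule applies unchanged at sequence lengths not seen in training, which is the extrapolation point made in the talk.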
I'm curious about what the room thinks here, but I believe that the latest 00:35:15.040 | 
in relative position encodings-- I believe it's called the RoFormer-- 00:35:20.740 | 
is where they basically just rotate every pair of dimensions of the embedding a little bit. 00:35:28.600 | 
And the angle of rotation depends on your actual absolute position. But what ends up 00:35:33.060 | 
happening is, when you do the attention operation, you end up basically 00:35:38.940 | 
getting an effect where you're modulating the logit based on relative distance. 00:35:43.140 | 
So now what's remarkable about this approach is that it combines the best of both worlds, 00:35:48.180 | 
right? Relative position encodings had 00:35:52.420 | 
a couple of challenges, in that you have to maintain an extra logit, or an embedding, 00:35:57.700 | 
for every pair, so it ended up increasing your memory. Here, 00:36:02.860 | 
these are actually absolute position encodings, but they ended up giving you 00:36:07.060 | 
the relative modulation in the attention operation that you needed. 00:36:10.900 | 
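The rotation trick can be sketched in a few lines. This assumes adjacent dimensions are paired and a frequency base of 10000; the names are illustrative. Each vector is rotated by an angle that depends only on its own absolute position, yet the query-key dot product ends up depending only on the difference of the two positions.

```python
import math

def rotate(vec, pos, base=10000.0):
    """Rotate each adjacent pair of dimensions by an angle proportional to pos."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta = pos * base ** (-2.0 * i / d)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [
            x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta),
        ]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]
near = dot(rotate(q, 3), rotate(k, 7))      # positions 3 and 7
far = dot(rotate(q, 103), rotate(k, 107))   # same offset, shifted by 100
# near and far agree up to floating-point rounding.
```

No per-pair embedding table is stored, which is the memory advantage over the explicit relative embeddings.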
And I believe the consensus is that this is the most successful 00:36:15.300 | 
position encoding. Is that correct, or are there others that 00:36:19.780 | 
people-- is that the consensus? OK. So I would say that 00:36:27.620 | 
these relative rotations, or the approach that's in the RoFormer, is 00:36:33.940 | 
basically an actual new genuine improvement that is now going to stay with 00:36:39.260 | 
the transformer. And it has all the great properties of what you would 00:36:42.980 | 
want. It's an absolute position encoding that gives you relative effects, which is 00:36:46.940 | 
what we originally wanted. And to emphasize that we needed relative-- like that 00:36:56.580 | 
being-- emphasize two things. One, that modeling, like, interesting temporal relationships, 00:37:04.100 | 
which is-- which are really important in music, requires a good position representation. We 00:37:09.660 | 
actually found significant improvements in the music transformer. Is it-- is it possible 00:37:13.820 | 
to play this? OK. So here's a priming sequence. This is work 00:37:21.740 | 
by Anna Huang, by the way. So this is in-context learning in music, because 00:37:36.380 | 
you actually see this prompt and you ask the model to complete it. OK. So now this is the 00:37:41.180 | 
vanilla transformer. And you can already-- so you can see that these were using-- I mean, 00:37:53.140 | 
we tried both learned and sinusoids. And you can see that it starts off peppy and happy, 00:37:57.660 | 
but then just sort of languishes into something really sad and confused, right? So it's not 00:38:02.260 | 
able to capture these-- because music has these interesting motifs where-- well, there's 00:38:07.620 | 
motifs at different levels, because there's some repetition locally, but there's a repetition 00:38:12.860 | 
across the entire piece as well. So now here, this is with the relative transformer. And 00:38:20.980 | 
this is with the first approach where we had relative embeddings. And we had 00:38:26.060 | 
to develop a compute-efficient approach, by using some 00:38:32.260 | 
matrix calisthenics to actually put the logits in the right place. So you can read the papers 00:38:37.600 | 
here. It's fun. So here's the same prime sequence. And let's see the completion here. 00:38:52.140 | 
So Anna, who is the first author of this paper, and also a musician, tells me this actually 00:39:07.720 | 
captures a lot of structure in music. It sounds nicer than the previous one, but maybe-- depends 00:39:12.880 | 
on what people's tastes are. Like maybe some avant-garde jazz fan would like the second-- 00:39:18.120 | 
would like the first piece. But the point here was that the difference is pretty clear 00:39:23.840 | 
between not working and working. And I think people-- it'd be fun to try this out with 00:39:28.680 | 
the new rotary position encodings. All right. OK. So walking up, now that we have a good 00:39:37.920 | 
mechanism-- a better mechanism than we originally had for modeling relative distances. And there's 00:39:45.520 | 
advancements on top of the rotary position encodings where, by adjusting the base frequencies, 00:39:49.840 | 
you can-- when you encounter longer sequences, you can just adjust the base frequencies. 00:39:54.720 | 
And then the model's not going to degrade. So that has good properties. 00:40:01.480 | 
There have probably been several important contributions to the attention piece itself, 00:40:09.400 | 
which is the primary workhorse here. You can think of it as 00:40:13.920 | 
either-- there's induction heads that are learning how to copy. Or maybe all it's really 00:40:19.640 | 
doing is just routing information so that the giant feed-forward layers can actually 00:40:23.600 | 
learn the important features. But there's broadly two classes of problems. There are 00:40:28.440 | 
two classes of issues with the attention mechanism. One that was brought up today that's very 00:40:32.600 | 
evident is long context itself. So the complexity, as we remember, was quadratic in the length 00:40:40.560 | 
of the sequence. And once your sequences get 00:40:44.560 | 
very, very long, then, I mean, one problem is that it's going to 00:40:51.040 | 
become computationally expensive. But it's also the logits that are 00:40:56.760 | 
So there's just generally a few groups of papers. One is restricting attention windows. 00:41:02.160 | 
And we did this for images, where we had local 1D and 2D attention for images. And 00:41:09.440 | 
in the first one, we actually just rasterized the image. And we had local 1D attention, 00:41:13.760 | 
which is very similar to the sliding window attention in the recent Mistral paper. And 00:41:20.000 | 
then in the 2D case, we have a spatial 2D attention. Then there was these sparse versions 00:41:28.960 | 
where you actually-- you had these specific patterns that over many layers-- I mean, you 00:41:34.200 | 
can think about it as, if you have these sparse matrices, how many of them do you have to 00:41:39.800 | 
multiply with each other until you get a really dense matrix, right? So roughly, this kind 00:41:43.640 | 
of turns out to be-- so here, you can get connectivity-- is that for me? No, OK. You 00:41:54.040 | 
can get connectivity between distant pixels or distant notes in a musical tune or words 00:42:02.840 | 
pretty quickly. And then there's a second one, which there hasn't been enough work. 00:42:08.360 | 
And there's some challenges there. But it's these unstructured sparse attention approaches. 00:42:13.400 | 
And they're typically-- they're essentially-- at a higher level, what they're really trying 00:42:18.120 | 
to do is imagine that I walked up to you and I told you that, hey, these are the bunches 00:42:25.600 | 
of tokens that just have very high inter-similarity. Like, they're likely to attend to each other. 00:42:33.080 | 
How quickly can I approximate it without actually having to do the whole computation, right? 00:42:37.480 | 
Two approaches. In routing attention, you use vector quantization. And in the other 00:42:42.320 | 
one-- I forget the name of the paper-- but in that paper, 00:42:48.720 | 
they used LSH. And in the routing transformer, most layers were actually local. The final 00:42:58.960 | 
layers, which typically are the ones that end up modeling these 00:43:03.640 | 
long-distance relationships, were the ones that actually used this kind of content-based 00:43:07.680 | 
unstructured sparse attention. And the results were generally better. And it's also interesting 00:43:13.240 | 
that maybe we can build models on very long sequences, where most layers are fairly local. 00:43:20.120 | 
And you have only a few layers that are actually doing these long-distance attentions. 00:43:23.400 | 
Now, one of the bigger challenges there, actually, even though 00:43:28.720 | 
you end up nullifying a lot of the flops that you would do if you did full attention, the 00:43:34.320 | 
problem always ends up being memory movement. And 00:43:39.720 | 
there's still more innovation to be done here, also, with memory bandwidth improving. Maybe 00:43:43.960 | 
some of these approaches become more feasible today than they were when we wrote these papers. 00:43:50.000 | 
But this is an interesting approach, where you're essentially trying to approximate the 00:43:53.820 | 
Sorry. This is kind of a silly thing, but a clarification. How is this unstructured 00:43:57.920 | 
sparse attention scheme very different from just convolutions that are sparse, in the 00:44:02.440 | 
sense that you're losing a lot of the long-distance or unrelated context from any arbitrary comparison 00:44:09.160 | 
Right. So I would say that this is similar to the convolution there. If you did this 00:44:15.120 | 
perfectly, then what you didn't attend to would have very little attention in itself. 00:44:22.720 | 
So you're essentially trying to guess, as best as you can, what would have attended 00:44:27.840 | 
to each other. And so it uses content based unstructured sparsity. 00:44:35.200 | 
And there's probably more interesting work to be done there. Maybe instead of actually 00:44:39.320 | 
just doing a token at a time, you end up doing a lot of memory movement. You end up deciding 00:44:43.800 | 
which chunks want to self-attend to which chunks. So then you just move entire chunks 00:44:48.160 | 
Right. So I think there's some interesting directions here. And frankly, the ones that 00:44:56.400 | 
ended up sticking are the simplest ones. And because structured sparsity is easy, you're 00:45:03.700 | 
able to optimize easily in modern accelerators. So again, you should make physics your friend. 00:45:12.400 | 
And so typically, local attention or sliding window attention, we're still seeing it often 00:45:17.320 | 
appear and do well. These other sort of really wild but very expressive unstructured sparse 00:45:24.320 | 
attention approaches typically haven't quite succeeded. 00:45:27.160 | 
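The earlier question of how many sparse matrices you have to multiply before the product becomes dense can be made concrete for the simplest structured pattern, a causal sliding window. This is a sketch under that assumption, with illustrative names; it composes boolean masks the way stacked layers compose connectivity.

```python
def local_mask(n, window):
    """mask[i][j] is True when causal query i may attend to key j inside the window."""
    return [[0 <= i - j < window for j in range(n)] for i in range(n)]

def compose(a, b):
    """Boolean matrix product: paths that take one step in a, then one step in b."""
    n = len(a)
    return [
        [any(a[i][k] and b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]

def layers_until_connected(n, window):
    """Stacked local layers needed before the last position can reach position 0.

    Assumes window >= 2; with window == 1 each token only sees itself.
    """
    mask = local_mask(n, window)
    reach = mask
    layers = 1
    while not reach[n - 1][0]:
        reach = compose(reach, mask)
        layers += 1
    return layers

depth = layers_until_connected(16, 4)  # each layer extends reach by window - 1
```

Each layer extends the reachable distance by `window - 1`, so depth grows roughly as `n / window` for this pattern; strided or content-based patterns aim to get there in fewer layers.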
There's, of course, linear attention variants that I don't think today are in any of the 00:45:32.360 | 
[INAUDIBLE] architectures. There were other approaches that, hey, instead of actually 00:45:35.560 | 
doing n squared d, you learn k new embeddings, so you do nkd 00:45:45.080 | 
and then you do ndk. So you basically factor it, right? It's analogous to a matrix factorization. 00:45:51.400 | 
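A back-of-the-envelope sketch of the factorization argument, counting multiply-adds only. The talk does not name a specific method, so the breakdown below (project keys and values from n rows down to k rows, then attend against the k rows) is an assumption in the spirit of low-rank attention, and the function names are illustrative.

```python
def full_attention_cost(n, d):
    """Logits (n * n * d) plus the weighted sum over values (n * n * d)."""
    return 2 * n * n * d

def factored_attention_cost(n, k, d):
    """Project keys and values to k rows (2 * n * k * d), then
    logits (n * k * d) and weighted sum (n * k * d) against the k rows."""
    return 4 * n * k * d

n, k, d = 4096, 256, 64
ratio = full_attention_cost(n, d) / factored_attention_cost(n, k, d)
```

The factored cost is linear in n for fixed k, so the saving grows with sequence length.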
Something that's-- one other approach that's interesting that I would like myself to actually 00:45:55.660 | 
investigate is we are seeing, in general, using retrieval as a tool. So why don't you 00:46:00.640 | 
just pretend that your memories, your memories themselves were documents and use retrieval 00:46:05.320 | 
as a tool there. So the memorizing transformer, basically, it essentially does a mix of local attention, 00:46:12.000 | 
and it then retrieves from very, very long memories. And they find that you don't need 00:46:16.120 | 
to train the model from scratch. All you need to do is adapt with this approach on some 00:46:22.400 | 
small amount of data. And you're able to learn a good retrieval mechanism. I think it's quite 00:46:26.880 | 
So it still comes in this content-based decision of what I should attend to. But I like the 00:46:34.200 | 
fact that it just makes retrieval a tool that you can use either on your own memories or 00:46:38.960 | 
you could use it on documents. It's a nice general view of looking at things. 00:46:44.600 | 
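A toy sketch of retrieval over your own memories: score a query against stored keys, keep the top k, and attend only over those. Exact search stands in here for the approximate nearest-neighbor lookup a real system would use, and the mixing with local attention is omitted; both simplifications, and all names, are assumptions for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve_and_attend(q, mem_keys, mem_values, k):
    """Attend over only the k memory slots whose keys score highest against q."""
    scores = [dot(q, mk) for mk in mem_keys]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    probs = softmax([scores[i] for i in top])
    dim = len(mem_values[0])
    out = [sum(p * mem_values[i][j] for p, i in zip(probs, top)) for j in range(dim)]
    return top, out

# Made-up memory with four slots.
mem_keys = [[0.0, 1.0], [5.0, 0.0], [4.0, 0.0], [-3.0, 0.0]]
mem_values = [[10.0], [20.0], [30.0], [40.0]]
top, out = retrieve_and_attend([1.0, 0.0], mem_keys, mem_values, k=2)
```

The same function would work unchanged if `mem_keys` came from external documents rather than the model's own past activations, which is the general view described above.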
OK, so now the second piece, which you basically run into-- you run into the issue that not 00:46:52.080 | 
all flops are equal, right? So if you look at the memory hierarchy, a lot of your activations 00:47:00.120 | 
are stored in the GPU HBM, which today in the H100 is about 80 gigabytes. So the 00:47:08.980 | 
H100 is 80 gigabytes, and the A100 is 40 gigabytes, right? So it's a limited amount of high-bandwidth 00:47:16.120 | 
memory. And so you have to first go from high-bandwidth memory to the SRAM. And then you have to go 00:47:21.160 | 
to the compute elements and then back, right? 00:47:23.240 | 
So every single time-- and, I mean, if you're interested, you can 00:47:32.400 | 
look at roofline analysis. The roofline analysis actually gives you a nice picture to characterize, 00:47:39.920 | 
for any device, where your workload or operation needs to be so 00:47:48.320 | 
that you can effectively utilize the compute as much as possible. You want to be compute-bound, 00:47:52.460 | 
because ultimately, if you don't calculate representations, if you don't calculate, you're 00:47:55.920 | 
not going to get any output. But if you spend a lot of time moving things around and spend 00:47:59.640 | 
less relative time calculating, then you're actually-- you're kind of wasting effort, 00:48:06.680 | 
So one of the-- so if you look at standard attention mechanism, right, one of the issues 00:48:10.520 | 
is that-- OK, so imagine you have your queries, keys, and values all in your memory. But then 00:48:15.600 | 
you need to then-- your standard approach would be you move it from HBM. You do the 00:48:21.120 | 
calculations. You compute the attention. You compute the logits. You move logits back into 00:48:25.640 | 
HBM. And then you compute softmax, write the softmax back into HBM. And then you basically 00:48:31.840 | 
load the probabilities and the values to finally compute the outputs, right? 00:48:38.320 | 
So the arithmetic intensity, or operational intensity, which 00:48:43.880 | 
is the amount of flops that you do per byte, on attention, even though it's fewer flops 00:48:49.360 | 
than, say, a one-by-one convolution, is lower, because it typically has 00:48:53.800 | 
more memory movement. Whereas one-by-one convolutions have less memory movement. You just move the 00:48:58.240 | 
weights, move the activations, you do the calculations, and you bring them back, right? 00:49:01.440 | 
And same goes for convolutions, too. And convolutions have a very high arithmetic intensity. It's 00:49:04.960 | 
not that you just want the highest arithmetic intensity or operational intensity operations, 00:49:08.800 | 
because you still want to have useful parameters, right? So it's a trade-off. 00:49:13.640 | 
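The roofline picture mentioned above reduces to one formula: attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity. The device numbers in this sketch are hypothetical round figures, not specifications of any real accelerator, and the function names are illustrative.

```python
def attainable_flops(intensity, peak_flops, bandwidth):
    """Roofline bound: min(peak compute, bytes per second * flops per byte)."""
    return min(peak_flops, intensity * bandwidth)

def ridge_point(peak_flops, bandwidth):
    """Arithmetic intensity (flops per byte) above which an op is compute-bound."""
    return peak_flops / bandwidth

# Hypothetical device: 100 TFLOP/s peak, 2 TB/s memory bandwidth.
PEAK = 100e12
BANDWIDTH = 2e12

ridge = ridge_point(PEAK, BANDWIDTH)
memory_bound = attainable_flops(10.0, PEAK, BANDWIDTH)    # below the ridge
compute_bound = attainable_flops(200.0, PEAK, BANDWIDTH)  # above the ridge
```

An operation that moves a lot of data per flop, such as attention that writes logits and probabilities back to memory, sits on the sloped part of the roofline; raising its intensity is what the later system optimizations target.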
So there's been a bunch of improvements that will stick. I mean, they're almost certainly 00:49:19.680 | 
going to stay, that try to combat this issue both at training time, because your logits 00:49:24.640 | 
can get really big, but also at inference time, with your KV cache. When you're doing inference, then 00:49:29.560 | 
you have a single query. But with your KV cache, right, you have to maintain your keys and 00:49:35.000 | 
values that can grow quite a bit. So you have to move that around. 00:49:37.800 | 
And so the first step of the day is simple. Let's just decrease the activation memory. 00:49:42.840 | 
So in the multi-query approach, you have multiple 00:49:50.160 | 
queries, but you reduce the number of read heads to just one. So you have just one 00:49:55.000 | 
key and one value. That does reduce your expressivity. 00:49:57.920 | 
So grouped query, which is now a simple balance, basically says, hey, let's not take the 00:50:02.680 | 
extreme of having all this temporary activation memory. Let's actually group the 00:50:07.800 | 
queries. So a bunch of queries will attend to the same keys and values. 00:50:12.480 | 
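A short sketch of how grouping trades cache size against expressivity. The mapping from query heads to shared key/value heads and the cache-size formula are the standard bookkeeping as I understand it; the model dimensions are made up for illustration, and 2 bytes per value assumes a 16-bit format.

```python
def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Which shared key/value head a given query head reads from."""
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Keys plus values, for every layer, KV head, and cached position."""
    return 2 * seq_len * num_layers * num_kv_heads * head_dim * bytes_per_value

# Hypothetical model: 32 layers, 8 query heads of size 128, 4096 cached tokens.
multi_head = kv_cache_bytes(4096, 32, num_kv_heads=8, head_dim=128)
grouped = kv_cache_bytes(4096, 32, num_kv_heads=2, head_dim=128)
multi_query = kv_cache_bytes(4096, 32, num_kv_heads=1, head_dim=128)
```

Setting `num_kv_heads` equal to the number of query heads recovers ordinary multi-head attention, and setting it to 1 recovers multi-query.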
And then what ends up happening is-- another point to note here is that all of this is 00:50:18.200 | 
relative, because most of the work in these very, very-- oh, but a third approach, actually, 00:50:22.920 | 
that I should mention, for not worrying about your attention, is just to make the model bigger. 00:50:29.160 | 
Then most of your compute is in your feed-forward computations, and your attention computation is 00:50:33.000 | 
just like a small slice of that. So you don't worry about it, right? 00:50:36.000 | 
So typically, with these larger models, even though grouped query attention has more activation 00:50:42.320 | 
memory than multi-query, it's still 00:50:46.680 | 
a smaller proportion of what you're doing in the feed-forward 00:50:50.480 | 
layers, right? So I guess three things. One is ignore it and make the model really big. Second is, 00:50:56.800 | 
I guess, even with long context, you can do some of these approaches that we 00:51:04.720 | 
talked about. But then you also have these system optimizations, which are pretty cool. 00:51:10.600 | 
So the softmax has an interesting property that you can compute it in an online fashion. 00:51:15.400 | 
You can compute it incrementally. So if you've got a bunch of logits, so you're kind of streaming 00:51:20.000 | 
them, if you've got a partial softmax and a new logit comes in, you can update it in 00:51:24.920 | 
an online fashion, right? So what does that mean? That means that now you never need 00:51:31.320 | 
to write logits or the p's into the HBM. So you save a lot, right? If there's an extremely 00:51:36.280 | 
long sequence, you end up writing a lot. So you save on that. And there were two 00:51:41.600 | 
approaches. The first paper was on TPUs, and it introduced, or took 00:51:48.640 | 
advantage of, this property of being able to compute the softmax in an online 00:51:53.160 | 
fashion. And the second paper, which is flash attention, has had many advancements since. 00:52:00.000 | 
They actually had some systems-level optimizations where now you can actually have very, very 00:52:04.600 | 
long sequences on GPUs, by basically not moving the logits 00:52:12.280 | 
back into HBM, using this property and also writing the right kernels 00:52:16.280 | 
that use the SRAM on the GPU. Any questions? What's the time? 00:52:27.600 | 
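The online softmax property discussed above can be shown in a scalar setting. This is a minimal single-pass sketch, not the tiled kernel from any particular paper: it keeps a running maximum, a running normalizer, and a running weighted sum, and rescales them whenever a larger logit arrives, so the full list of logits or probabilities never has to be stored. Names are illustrative.

```python
import math

def online_softmax_weighted_sum(logits, values):
    """Compute sum_i softmax(logits)_i * values_i in one streaming pass."""
    running_max = float("-inf")
    denom = 0.0  # running sum of exp(logit - running_max)
    acc = 0.0    # running sum of exp(logit - running_max) * value
    for x, v in zip(logits, values):
        new_max = max(running_max, x)
        # Rescale what was accumulated under the old maximum.
        # On the first step exp(-inf) is 0.0, which is harmless.
        rescale = math.exp(running_max - new_max)
        weight = math.exp(x - new_max)
        denom = denom * rescale + weight
        acc = acc * rescale + weight * v
        running_max = new_max
    return acc / denom

def two_pass_reference(logits, values):
    """Ordinary softmax that materializes all probabilities first."""
    m = max(logits)
    es = [math.exp(x - m) for x in logits]
    total = sum(es)
    return sum(e / total * v for e, v in zip(es, values))

streamed = online_softmax_weighted_sum([1.0, 3.0, -2.0, 10.0], [1.0, 2.0, 3.0, 4.0])
```

In an attention kernel the same update is applied per block of keys and values held in fast on-chip memory, with vectors in place of the scalar `values`.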
So we are basically 20 minutes. I'll finish in 10. So I just covered these two. There's 00:52:33.680 | 
many other important improvements, I guess. 00:52:40.520 | 
We talked about pre- versus post-layer norm. There's been some changes of the feed-forward 00:52:46.440 | 
layers themselves. You can stare at the feed-forward layers. I mean, you can stare at anything 00:52:51.360 | 
long enough, everything becomes attention. But it's true in the feed-forward case that 00:52:54.640 | 
if you look at it, you can think about them as-- it looks like attention. And there was 00:52:58.880 | 
a paper that sort of turned that into a bit of a-- turned those into memories. It was 00:53:06.000 | 
originally by Facebook. I actually forget what it was. But it didn't-- and the feed-forward 00:53:10.520 | 
layers just stayed-- I mean, we typically haven't seen a lot of improvements on them. 00:53:16.280 | 
There have been some efforts on higher-order attention. Right now, attention, if you think 00:53:21.720 | 
about it, is a third-order interaction. You have queries, keys, and values. But 00:53:25.800 | 
you could imagine actually having fourth-order interactions where you're 00:53:29.800 | 
actually computing logits of pairs of things against all pairs of things, right? So these 00:53:34.240 | 
are now higher-order interactions where now you can have complicated geometries that you 00:53:39.040 | 
actually include in your attention computation. And maybe it's important for, say, biology 00:53:44.320 | 
or some biology, but it's not been explored much. 00:53:47.840 | 
What has actually worked and is likely to now stay is some approaches on faster decoding. 00:53:53.040 | 
Not quite the original non-autoregressive aspirations that we had, but this more speculative 00:53:58.960 | 
decoding, where the heuristic there is pretty simple. Instead 00:54:04.440 | 
of generating from a heavy model, generate from a really light model that captures the 00:54:09.040 | 
diversity and then score with a heavy model. So then you re-rank the list. And that ends 00:54:12.800 | 
up working quite well. And most production deployments likely use speculative decoding. 00:54:19.240 | 
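A toy sketch of the draft-then-score heuristic. It uses a greedy accept-or-correct rule, which is one simple variant and not necessarily what any production system does; the sampling-based acceptance rule is omitted. The two "models" are stand-in functions from a token prefix to the next token, and every name here is hypothetical.

```python
def speculative_greedy(target_next, draft_next, prefix, num_tokens, k=4):
    """Generate num_tokens tokens that match greedy decoding from target_next.

    The light draft model proposes k tokens at a time; the heavy target model
    checks them and keeps the longest agreeing prefix plus one correction.
    """
    out = list(prefix)
    goal = len(prefix) + num_tokens
    target_rounds = 0
    while len(out) < goal:
        # Light model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(out + proposal))
        # Heavy model scores all k positions; in a real system this is one
        # batched forward pass, so it is counted as a single round.
        target_rounds += 1
        for i in range(k):
            expected = target_next(out + proposal[:i])
            if expected != proposal[i]:
                out += proposal[:i] + [expected]
                break
        else:
            out += proposal
    return out[:goal], target_rounds

# Toy models: the target counts upward mod 10; the draft agrees
# except right after a 4, where it guesses wrong.
def target_next(seq):
    return (seq[-1] + 1) % 10

def draft_next(seq):
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10

tokens, rounds = speculative_greedy(target_next, draft_next, [0], 8)
```

The output is identical to decoding from the heavy model alone, but the heavy model is invoked in fewer rounds than there are generated tokens whenever the draft agrees often enough.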
OK. So now switching gears, I guess we started by quoting the Dartmouth 00:54:31.760 | 
conference where they wanted to build a single machine. And the question now is, with large 00:54:35.240 | 
language models that are now eating up most of the internet, are we quite getting there? 00:54:41.440 | 
And we are seeing some remarkable-- we're finally seeing self-supervised learning work 00:54:45.360 | 
at an unprecedented scale, where now by digesting carefully curated 00:54:54.480 | 
and colossal amounts of text with very, very large models, they're able 00:54:58.720 | 
to perform, presumably, or it's still waiting to be confirmed, or at least they're 00:55:06.560 | 
able to actually perform a broad variety of tasks by just specifying 00:55:12.200 | 
them in the prompt. And it's almost like now you have a new computer. 00:55:18.120 | 
And for people who are really excited about the future of agents, now they can program 00:55:21.600 | 
thousands of agents with the same computer. Or, maybe, now they have 00:55:27.040 | 
several agents that they can program with the same computer that 00:55:31.400 | 
then coordinate to solve problems. So we're getting much closer to the single model, not 00:55:37.200 | 
quite being able to specify all the rules of intelligence, but at least learning all 00:55:40.960 | 
the rules from data. We're much closer than we were before. Now, 00:55:46.600 | 
this doesn't include all the important specialization that has 00:55:51.720 | 
to happen after, like, RLHF or the alignment that you have to do to make a model more steerable. 00:55:59.440 | 
But as it stands today, the scaling laws that the transformer exhibits are better 00:56:08.000 | 
than any other existing model, right? And there's an interesting question of, you know, 00:56:14.000 | 
can we build a better model? And there are efforts. I guess, from Stanford, 00:56:18.640 | 
from Chris Ré's lab, there have been a couple of efforts. There's been some revival of RNNs. 00:56:23.840 | 
But I think the only thing I'll say is that the attention 00:56:28.640 | 
operation itself, this operation of actually moving information around or routing information 00:56:33.040 | 
based on content, is very, very useful. And it's maybe not a surprise that this general 00:56:39.400 | 
sort of spatial mixing of sampling, downsampling architecture has kind of stayed both in cognition, 00:56:44.720 | 
computer vision, and language, now with the transformer. So there are some invariants 00:56:48.320 | 
that are likely to stay, but there is certainly much 00:56:52.880 | 
more room there to improve, I mean, not just in the architecture, but on data itself. Like, 00:56:58.680 | 
there's probably 2x improvements on data. But I wouldn't say that there aren't 00:57:04.160 | 
architectures in the future that will get better scaling laws. There might be, but there 00:57:08.600 | 
are properties about the transformer, such as self-attention and its general structuring, 00:57:12.880 | 
that we're likely to see in future architectures to come. Also, it's 00:57:19.320 | 
hard to really think of a modern-- like, if somebody really, really wanted to study large-scale 00:57:25.200 | 
modern transformers, you'd have to study, like, all-reduces, InfiniBand, RoCE, and 00:57:31.720 |
what are-- like, well, but they get congestion, and they have very, very large clusters. So 00:57:37.640 | 
the computer is no-- the computer-- the transformer is now, in some sense, a data center, because 00:57:42.320 | 
it's now split up. These large transformers are trained with tens of-- potentially tens of thousands 00:57:46.360 |
of GPUs. So-- and so if you-- so now you actually have to really focus on several parts, the 00:57:56.720 | 
infrastructures and the model itself. But what's really interesting, I think, is-- you 00:58:01.560 | 
know, I was just thinking of the smallest model that has exhibited emergent phenomena. 00:58:05.560 | 
Well, so we certainly know that GPT-4, which is likely-- I don't know if you're allowed 00:58:10.480 | 
to say it's some big-- like, trillion parameters. Yeah, I think you're allowed to say it, yeah. 00:58:16.400 | 
So it's a trillion-parameter size model. That's what everybody says. Size model. And then 00:58:19.600 | 
you have grokking, which is a two-layer transformer that has this weird emergent behavior that, 00:58:26.280 |
when you just keep training it on just-- on some amount of data, suddenly it just exhibits 00:58:30.400 | 
a phase shift, right? So we're lucky. There are these, like, really-- there's strange-- 00:58:35.840 |
there's weirdness everywhere. There's weirdness in small models and large models. And maybe 00:58:40.240 | 
we can learn something about large models by studying these small models, one would 00:58:44.720 | 
hope. But it's funny. There's still unexplained phenomena in very, very large models and very, 00:58:51.160 | 
very small models. But large transformers are no more just, you know, like a Colab. There's 00:58:57.480 |
just-- I mean, it could still be, but it's-- you have to-- there's so many-- there's so 00:59:02.360 | 
much that you have to keep in your stack in order to really optimize this entire-- this 00:59:07.640 | 
model. Of course, some of the very exciting directions are LLMs using tools. Yeah, that's-- 00:59:14.640 | 
so now the benefits of-- now language models or transformers are actually starting to use 00:59:19.800 | 
external entities. So they're connecting with the rest of the world. And I guess that's 00:59:24.000 | 
a good-- that's a good pitch for-- it makes a lot of sense to actually build products 00:59:28.820 | 
today because it's through interactions with-- like, if you want to get to the next tranche 00:59:33.520 | 
of capabilities, where will they come from? And likely, with a lot of usage, you will 00:59:38.840 | 
learn much more about how to guide these models and how to train them than in a vacuum. 00:59:43.320 |
Now, you can definitely do very, very important work still, even with a smaller model 00:59:49.200 |
or even without building a product, because there's so many 00:59:52.360 |
important unsolved problems. And maybe you shouldn't even work on the transformer because 00:59:56.880 | 
it's like Burning Man right now. Everybody's going to the same party. But I think that 01:00:03.320 | 
you will be able to build new capabilities once these-- with this human-machine collaboration. 01:00:09.800 | 
Of course, teaching models or models being able to express what they don't know, how 01:00:15.120 | 
do you learn new skills at inference time, important for-- there's some interesting work, 01:00:18.640 |
I think, on Minecraft that showed some evidence of this is also important for agents. And 01:00:23.760 | 
another-- a great property that some of these diffusion models have is the more compute 01:00:28.720 | 
you spend, the potentially better the quality of the image gets. But we don't exactly quite 01:00:32.720 | 
have that for language. And what does that mean? So today, the best-- the models that 01:00:37.240 | 
can reason-- that have the most proficient reasoning and planning are also the largest 01:00:42.600 | 
ones. Can we separate it out? Can we have smaller models that do some adaptive thinking 01:00:47.600 | 
and are able to match the capabilities of potentially larger models in reasoning and 01:00:52.080 |
planning? And maybe the answer is going to come by connecting to external planners, 01:00:56.280 |
or maybe with better representations of data, you can actually reason better on 01:01:03.000 |
Also, this is, again, a more systems piece, but it's fascinating how low you can actually 01:01:09.040 | 
get on your-- how low you can-- how few bits you can actually use and still get something 01:01:15.320 | 
useful out. We already went from-- the original transformer was trained in 32-bit precision. 01:01:19.600 |
Then we went to bfloat16. And now there's good signs that INT8 and FP8 would also work. 01:01:25.040 |
And I think there's useful work to be done there. Again, going back to the same-- this 01:01:29.360 | 
argument about if you're actually-- if you're vector-- if you're using fewer bits to represent 01:01:36.000 | 
a number, you're actually transmitting fewer bits to the-- from HBM. So actually, you can 01:01:40.400 |
get faster. You can utilize your matrix multipliers much more effectively. 01:01:45.240 | 
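To make the point about fewer bits concrete, here is a minimal sketch of the arithmetic. The parameter count and memory bandwidth are hypothetical numbers chosen only for illustration, and the function names are not from the talk. It only counts bytes read from memory for the weights and ignores everything else that affects real throughput.

```python
# Bits per value for the number formats mentioned in the talk.
BITS_PER_VALUE = {"fp32": 32, "bf16": 16, "fp8": 8, "int8": 8}


def weight_bytes(num_params: int, fmt: str) -> int:
    """Bytes that must be read to fetch num_params values stored in fmt."""
    return num_params * BITS_PER_VALUE[fmt] // 8


def transfer_time_s(num_params: int, fmt: str, bandwidth_bytes_per_s: float) -> float:
    """Idealized time to stream the weights once at a given memory bandwidth."""
    return weight_bytes(num_params, fmt) / bandwidth_bytes_per_s


if __name__ == "__main__":
    params = 1_000_000_000   # hypothetical 1B-parameter model
    bandwidth = 1e12         # hypothetical 1 TB/s memory bandwidth
    for fmt in BITS_PER_VALUE:
        gb = weight_bytes(params, fmt) / 1e9
        ms = transfer_time_s(params, fmt, bandwidth) * 1e3
        print(f"{fmt}: {gb:.1f} GB to read, about {ms:.1f} ms per full pass")
```

Under these assumptions, halving the bits per value halves the bytes moved, which is the sense in which lower precision lets the matrix multipliers wait less on memory.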
That was it. So there's many topics, but hopefully, we covered something fun. Thank you. 01:01:52.880 | 
Could you talk about what you're working on now? 01:02:03.880 |
Yeah. So I'm a co-founder of a startup with my transformer co-author, Niki. And we're 01:02:12.680 |
working on building models that will ultimately automate workflows. And we're starting with 01:02:21.080 | 
data. So it's very puzzling what happens in a company. Companies are just basically just 01:02:26.320 | 
masses of dark knowledge, right? And there's very few people that have both the technical 01:02:31.200 | 
privilege and the understanding to ask questions, like typically analysts. But the less you 01:02:36.400 | 
understand, the less effective your company can be. So how can you eventually help anyone 01:02:41.320 | 
become an effective analyst, in some sense, right? So help them ask the right question, 01:02:46.120 | 
help them figure out, eventually, the whys, which then requires some kind of counterfactual 01:02:50.760 | 
reasoning that's very complicated. But start with data, since it's so important, and companies 01:02:55.280 | 
are essentially drowning in it. And then spread out from there, and then try to automate 01:03:00.320 |
other workflows and be impressed. But we believe that some of the early signs that we're seeing 01:03:06.440 | 
and our position is that I believe that this is going to require a full stack approach. 01:03:12.600 | 
So not just building the model, because you can then control what feedback you get. And 01:03:18.920 | 
so if you have a gap in the model, you ask for that. You start to get that feedback, 01:03:23.040 | 
so then you can improve the model. That's what we're doing. 01:03:30.880 | 
I'm surprised to hear that you're fairly bullish about tools in the end, like in our 01:03:34.680 | 
transparency control and third-party things. We talked about in the beginning that your 01:03:37.840 | 
motivation was transformers that enabled us to get rid of pipelines. But I feel like the 01:03:41.040 | 
rule was against pipelines again. So I'm surprised at this. Can you talk about that and where 01:03:47.680 | 
Right. So until we get to the point where it's like, you know, we're turtles all the 01:03:52.880 | 
way down, it's like transformers all the way down. No, I think that tools just allow you 01:03:57.480 |
to-- so it's kind of like, how do you interface with a machine that can think, right? You 01:04:05.120 | 
have to build some kind of interface. And if you build a useful functionality, you want 01:04:08.240 | 
the machine to be able to take your functionality and do generally useful things with it, right? 01:04:12.840 | 
And I think that using tools is just a way of leveraging things that people have built 01:04:18.160 | 
and software out there. Certain tools will probably get absorbed in the model, right? 01:04:23.520 | 
Some others won't. And that still gives us the ability to-- yeah, it still gives us the 01:04:29.120 | 
ability to-- and certain things that transformers shouldn't even do, sorry. I mean, like you 01:04:35.640 | 
don't want to spend a billion flops per position to calculate two numbers, right? You don't 01:04:40.800 | 
want to spend more flops to do an operation that required like 1 billion flops, right? 01:04:45.240 | 
So there's certain things that the model should not do. It should use external tools. And 01:04:50.960 | 
there's certain things that the-- certain kind of thinking that the model should do. 01:04:55.880 | 
So even from a capability perspective, there's an important question of what all the capability 01:05:00.560 | 
should be in this neural network, right? But then also being able to utilize the work that 01:05:04.680 | 
others have done, software that other people have built. Yeah. 01:05:11.120 | 
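A toy sketch of that division of labor, assuming a hypothetical `fake_model` stand-in and a tiny arithmetic "tool". None of these names come from the talk, and real tool-use systems let the model itself decide when to call a tool rather than relying on a regular expression. The point is only that exact arithmetic is routed to ordinary software instead of being computed by the network.

```python
import operator
import re

# Hypothetical "tools": plain Python functions that do exact arithmetic.
TOOLS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

# Matches prompts of the form "<int> <op> <int>", e.g. "12 * -4".
_ARITHMETIC = re.compile(r"^\s*(-?\d+)\s*([+\-*])\s*(-?\d+)\s*$")


def fake_model(prompt: str) -> str:
    """Stand-in for an expensive neural network call (an assumption)."""
    return f"[model output for: {prompt}]"


def answer(prompt: str) -> str:
    """Route simple arithmetic to a tool; send everything else to the model."""
    match = _ARITHMETIC.match(prompt)
    if match:
        left, op, right = int(match.group(1)), match.group(2), int(match.group(3))
        return str(TOOLS[op](left, right))
    return fake_model(prompt)


if __name__ == "__main__":
    print(answer("2 + 3"))
    print(answer("Why is attention useful?"))
```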
Could you talk more about why the original approach of decoding in parallel and then iteratively 01:05:16.440 |
didn't work and what-- Yeah, so sometimes if you know exactly why 01:05:20.520 |
things work, maybe you can make it work. But it ended up being-- so you're able to do silly 01:05:25.360 | 
things like randomly sort, which means that if somebody walks up to you with a sequence 01:05:31.160 | 
and you can-- I mean, you can break two modes. Like you can say ascending or descending. 01:05:36.080 | 
So how do I say this? So typically, when you decode, right, imagine that when you give 01:05:42.580 | 
a prompt, you have many possible computations, right? And each time you make a choice, you 01:05:49.600 | 
narrow that space. And each time another choice, you narrow that space, right? And you have 01:05:54.720 | 
a very-- and you've learned to narrow the set of all possible, in some sense, paths 01:06:00.840 | 
in a way. The model doesn't have to decide what's the order in which you have to go. 01:06:05.520 | 
When you're doing this less or non-autoregressive generation, you have to do both, right? And 01:06:11.880 | 
learning both simultaneously is hard. I mean, but eventually, I think that if for 01:06:17.880 |
a particular-- I think this is probably true, right? If an oracle walked up to me and said, 01:06:25.840 | 
this is the order in which all these sentences should be generated. First, you should generate 01:06:30.120 | 
these three words. Then you should generate these other two. Then these other two. If 01:06:33.200 | 
somebody walked up to you and gave you this oracle ordering for all of human language, 01:06:36.880 | 
I think you would have a much better chance. And you could actually get this less or non-autoregressive 01:06:40.440 |
generation. So one thing was basically the ordering itself. And I think it kind of has 01:06:50.000 | 
to do that, because the ordering helps you then lock down the modes. It narrows down 01:06:53.880 | 
what you're going to generate next. So ultimately, I think it does boil down to what's the right 01:06:58.640 | 
non-autoregressive ordering. And that could be either you're still generating one word 01:07:02.640 | 
at a time, but not autoregressively, or you're generating a few. And then based on that, 01:07:06.280 | 
you're generating the other few. So the words that you can generate all at once should be 01:07:11.200 | 
conditionally independent of each other, right? What you've generated so far should have completely 01:07:15.000 | 
explained them. And then what you generate after should again be-- they should be conditionally 01:07:20.720 | 
independent, right? So how do you learn these conditional independences? Yeah. And if somebody 01:07:24.400 | 
walked up to me and gave them to me, I think they'd probably learn them. Yeah. 01:07:35.880 | 
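One way to write down the conditional independence described here, as a sketch. The notation is an assumption and not from the talk: x is the input, y_1..y_T are the output tokens, and G_1..G_K is a hypothetical partition of the positions into groups that are generated together.

```latex
% Autoregressive: one token at a time, each conditioned on all earlier tokens.
p(y \mid x) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, x)

% Less autoregressive: positions 1..T are partitioned into groups G_1, ..., G_K.
% Tokens inside a group are conditionally independent given the earlier groups.
p(y \mid x) = \prod_{k=1}^{K} \prod_{i \in G_k} p\bigl(y_i \mid y_{G_{<k}}, x\bigr)
```

With K = T and singleton groups in left-to-right order, the second form reduces to the first. With K = 1, every token is generated at once and must be independent of the others given x alone. The "oracle ordering" in the answer corresponds to being handed the groups G_k instead of having to learn them.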
I think more of his thinking is that only scaling small and small doesn't help them 01:07:46.880 | 
to actually learn how the real world actually works. And we have a good idea of truth and 01:07:57.380 | 
real world now. And do you agree with him? Do you think that [INAUDIBLE] 01:08:08.760 | 
So yeah, I think it's interesting. You can't learn a world model with just language, right? 01:08:16.680 |
So I mean, some of these models are not exactly being learned that way. You're doing RLHF. 01:08:20.400 |
You're getting some feedback, which means there's some-- you're applying some-- they 01:08:26.680 | 
are modifying themselves to some preference, right? So it's not just a pure language model. 01:08:33.240 | 
But it's interesting. So you've seen some of the work where robotics is now potentially 01:08:37.840 | 
starting to flourish because they're able to use these large models as planners, right? 01:08:43.040 | 
And so I think that it's surprising how much of the world-- how much information about 01:08:47.400 | 
the world that they carry. And if I understand, is that right that the SayCan work basically 01:08:51.320 |
used a language model now as a planner, right? And then they left the rest of it to just 01:08:55.960 | 
the standard perception and the classical tasks of even solving the robotics. So that's 01:09:01.240 | 
a-- I mean, that's-- no, while Yann is probably still right, but the usefulness of it is evident 01:09:07.920 |
in something that needs world knowledge, right? 01:09:11.240 | 
So I think you can do a lot with what you have. I mean, they're probably-- yeah, I mean, 01:09:21.500 | 
we still haven't quite extracted all the usefulness out of these models as well. And you might 01:09:27.560 | 
be right in some things. But there's still a lot more to be gained. Yeah. 01:09:35.520 | 
So, similar to the previous question, and you're also talking about emergence, right? 01:09:41.480 |
I'm just curious to know what your thoughts are more on generalizability and emergence, 01:09:47.920 |
especially in the-- I know there was a paper from DeepMind about the science-- yeah, I 01:09:56.040 | 
Like, they can't really generalize outside of what they've been trained on as-- especially 01:09:59.600 | 
because these large models now that they're just trained on everything. Is there truly 01:10:03.560 | 
anything left that's out of distribution that you could really sort of benchmark it on? 01:10:08.600 | 
So I have been caught saying that if I had all my test data in my training, I'd make 01:10:12.520 | 
a billion dollars. Yeah. So I don't have a problem with it. But I still think-- so OK, 01:10:19.320 | 
so correct me if I'm wrong, but the general argument is that these models have learned 01:10:25.400 | 
such a vast set of distributions and phenomena that typically, when you interrogate them, 01:10:32.360 | 
they're often very cleverly blending or bringing information from what they've learned, right? 01:10:40.720 | 
It might, yes. And then they have these algorithmic tasks where the models fail to generalize, 01:10:47.640 | 
So I'll focus on the former. I think that that's an incredibly useful property. It might 01:10:54.680 | 
be that-- so I think maybe the feeling is that we actually don't quite understand how 01:10:59.720 | 
much we could-- how much is even represented in text. And second, how much-- how far we 01:11:05.240 | 
could go if we were able to blend information from different-- like, certainly being able 01:11:11.560 | 
to write about the Stanford-- this lecture in the rhyme, meter, and words of Chaucer, not 01:11:17.840 |
possible because nobody did it, right? But I think that you could do it, right? 01:11:22.160 | 
Now, is that blending information from what you already have? If so, that's-- that means 01:11:27.160 | 
you can-- that's an incredible skill, right? Yeah, I haven't read it. It's very recent, 01:11:35.560 | 
but I believe the work. I think you can show that in these [INAUDIBLE] but I think there's 01:11:40.080 | 
a surprising amount of new-- seemingly new things you could do by just blending information 01:11:47.120 | 
from what you've already learned. And yeah, it largely probably has to do with-- there's 01:11:52.080 | 
so much of it. Yeah, yeah. So you have a question? Yeah, I have two questions to go. [INAUDIBLE] 01:12:00.960 | 
I think I had an ordering in mind and then I came back to you. Sorry. But [INAUDIBLE] 01:12:08.800 | 
Give me a second. [INAUDIBLE] I was wondering if you might have insights into connecting 01:12:15.120 | 
different agents, transformers, or whatnot. Neurons is a great-- I feel like transformers are 01:12:21.920 |
essentially like a great connection of neurons in a specific way and it's awesome, right? 01:12:27.120 |
So you figured out the best way to connect them so far. 01:12:30.040 | 
The agents? No, the neurons. Oh, the neurons. You're talking about-- do 01:12:33.520 | 
I know somehow to do this in the brain? No, the neurons in the transformer, right? 01:12:38.120 | 
The transformer is the way you connect different pieces together. And then when you connect 01:12:43.880 | 
them together, it works. Yeah. [INAUDIBLE] I was wondering if you have some insights 01:12:48.240 | 
in building systems that can actually perform the best together. 01:12:52.480 |
Yeah. [INAUDIBLE] I like to make this joke that the best agents are actually just the 01:12:58.840 | 
neurons because they can communicate with each other. They can update themselves really, 01:13:02.960 | 
really well by what the other agents are doing. What is the fundamental problem by making-- 01:13:09.840 | 
what is the fundamental issue in making a bunch of-- I'm trying to understand what are 01:13:16.560 | 
the fundamental problems in trying to make a bunch of systems work together, if that's 01:13:19.760 | 
what you're asking. One is goal decomposition, right? The second big one is coordination, 01:13:27.200 |
and the third one is verification. If you achieve a successful decomposition of the goals based 01:13:32.160 |
on what your estimate of the skills of these agents are, if you're able to verify what they've 01:13:36.920 |
done, and if you're able to coordinate, then I think you could make a lot of progress, 01:13:40.840 | 
right? So while I didn't answer your question, I don't know in general how much progress 01:13:45.320 | 
you've made in all these three areas. But does somebody have any input here? 01:13:50.360 | 
[INAUDIBLE] and you have something that's [INAUDIBLE] and you want to verify this [INAUDIBLE] 01:14:00.120 | 
and make sure [INAUDIBLE] and verify everything and make it big enough. You can see how this 01:14:12.480 | 
is almost [INAUDIBLE] everything and making it efficient over time. And it's a lot of 01:14:18.800 | 
time [INAUDIBLE] how do you break the stuff, how do you [INAUDIBLE] 01:14:25.800 | 
Yeah, right. But I think these [INAUDIBLE] are probably maybe to some degree [INAUDIBLE] 01:14:34.720 | 
It's actually one question. So [INAUDIBLE] now we have a [INAUDIBLE] but the human brain 01:14:44.400 | 
is very modular. So it's modularity like the emergence phenomena, you need some spatial 01:14:53.280 | 
space to make that happen. Yeah, and by modularity here you mean that 01:14:58.440 | 
is it modularity in that they have this vision, has this responsibility, or even is the composition 01:15:06.000 | 
different, the construction different? What do you mean by that? Because you could have 01:15:11.440 | 
both, right? You could argue that there's no-- the responsibility is diffused across 01:15:15.360 | 
the model. And it's just that mixture of experts tries to go in the opposite direction, which I should 01:15:20.040 |
probably mention. That's another really exciting direction, which certainly has happened in 01:15:23.800 | 
a few folders, and it's going to stick. I totally missed it. That tries to get the specialization, 01:15:29.680 | 
right? So maybe that is some kind of modularity, right? Learned modularity. The rest of responsibility 01:15:36.440 |
for performing the task is likely distributed. But if now you're going to these subsystems 01:15:41.520 | 
themselves of different composition, then you get back to-- and I know that this was 01:15:47.200 | 
a goal with the Pathways Project at Google, where you wanted to have these really modular 01:15:52.360 | 
systems communicate with each other. And I think there's-- it's just taken so long to 01:15:57.640 | 
get gradient descent. In fact, sometimes I think that rigid building architectures deserve 01:16:02.360 | 
gradient descent. And I feel like if you can learn with gradient descent, it's very useful. 01:16:11.040 | 
Maybe it's actually possible to make these modular systems work. We have some of the 01:16:14.080 | 
three experts, and I imagine some of these problems that we discussed before. Does that 01:16:22.880 | 
Sorry, circling back to whatever seven questions ago, you mentioned that the problem with decoding 01:16:29.400 |
all at once was that generating all at once has this assumption 01:16:33.960 |
that the outputs are conditionally independent. But aren't they, in a sense that if you have 01:16:38.280 | 
a latent space-- if you're given the latent space as your prior, then your posterior 01:16:42.720 | 
outputs should be conditionally independent of each other, right? 01:16:45.520 | 
So great point. And where do you get the latent space from? 01:16:48.920 | 
Well, from the encoder, or whatever, in the beginning. 01:16:51.840 | 
Right, but there might be quite a few ways to translate something, right? Yeah, there's 01:16:57.280 | 
a multiple-- so if there's only one mode, then yeah, it's probably [INAUDIBLE] right? 01:17:03.160 | 
But if there's multiple ways of-- well, actually, there's two things. How much does the latent 01:17:07.760 | 
space actually carry? That was an important thing to ask, right? How much does it actually 01:17:11.880 | 
carry? Because it's not just the one latent vector that you're transmitting every-- you're 01:17:16.720 | 
doing attention again and again. But we took this approach, where we did precisely this. 01:17:25.000 | 
We autoregressively generated tokens in a new vocabulary using vector quantization. 01:17:33.920 | 
So the conditional dependence was modeled in a latent space, where we discretized using 01:17:40.480 | 
vector quantization. And then, based on that, we generated everything conditionally independent. 01:17:45.800 | 
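For readers unfamiliar with vector quantization, here is a minimal sketch of the lookup step only: each continuous latent vector is replaced by the index of its nearest codebook entry, which yields a sequence of discrete tokens in a new vocabulary. The codebook values below are made up, and how the codebook was learned in the work being described is not stated in the talk, so this is not a reconstruction of that system.

```python
from typing import List, Sequence


def squared_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def quantize(vector: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to vector."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(vector, codebook[i]))


def encode(vectors: Sequence[Sequence[float]],
           codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map a sequence of latent vectors to a sequence of discrete token ids."""
    return [quantize(v, codebook) for v in vectors]


if __name__ == "__main__":
    # Hypothetical 2-D codebook with three entries (three "latent tokens").
    codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
    latents = [[0.1, -0.1], [0.9, 1.2], [0.2, 1.9]]
    print(encode(latents, codebook))
```

In the setup described in the answer, a short sequence of such discrete latents would be generated autoregressively, and the output tokens would then be generated in parallel conditioned on it.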
And that didn't work. But again, so that didn't work in translations. The issue-- there were 01:17:52.600 | 
some funky issues there, where the latent-- the latent sequence of latent vectors were 01:18:00.280 | 
only effective-- were not effective if you can learn directly on the original data. You 01:18:03.960 | 
have to do something like distillation. Because distillation itself throws away, potentially, 01:18:08.360 | 
some of the modes. So generally, lower entropy data was-- we had to train on it. 01:18:12.640 | 
The second piece was, for practical systems, you have to make the whole thing really, really 01:18:17.080 | 
fast. But this was a good research exercise. But ultimately, it didn't have the right practical 01:18:23.040 | 
impact. Because speculative decoding, practically, with what we have right now didn't work well. 01:18:28.800 | 
Yeah, exactly. Yeah. But you're right. I think if you can generate a sufficiently-- if you 01:18:33.520 |
can generate a good, sufficient latent, then yes, you're right. We can assume-- that makes 01:18:38.560 |
everything conditionally independent. Yeah. And we managed to do that a bit. But it wasn't 01:18:48.160 | 
I guess this is the last question, now? Or are we already done now? 01:19:05.280 | 
And I have friends there. They're all really great. They're doing terrible things. I think 01:19:13.080 | 
that there's-- we'll be surprised how much there is to do. And if-- so first, the motivation, 01:19:20.560 | 
right? That is an entire new-- there's an entire new bucket of-- or like a new tranche 01:19:28.960 | 
of capabilities that you will get with human-computer interaction. So you can make a product. People 01:19:33.160 | 
use it. They give you feedback. Models get smarter. And this closed-loop system can really 01:19:38.680 | 
bring-- can really advance models. And then bring value, right? That's one. 01:19:45.160 | 
And I think it's helpful to have some-- deep learning benefits so much from a diversity 01:19:52.720 |
of ideas and people pursuing important directions. And I would say the same about building company 01:20:02.800 | 
products, as well, or building companies that are building new kinds of products with these 01:20:10.160 | 
models, right? So I would say that we have-- there's so much surface area that we could 01:20:16.160 | 
build something incredible. So that's the second piece. Third, yeah. Maybe that's the 01:20:23.360 | 
more personal direction I want to bring on, right? Yeah.