Language Diffusion Survey

00:00:05.080 |
So just general, yeah, intro yourself. Don't forget. We want to know about you and your personal 00:00:13.480 |
Yeah, for sure. Yeah, so I'm currently a master's student at Georgia Tech. I know a handful of 00:00:20.040 |
folks in this community have also taken some classes there. But yeah, I quit my job about a 00:00:27.100 |
year ago to go full time just to speed up the process and take some harder classes and kind 00:00:31.120 |
of dig a little deeper. So I've got about a year left trying to focus on machine learning stuff 00:00:39.280 |
and a bunch of systems courses. I didn't have like a CS undergrad, so I'm really trying to fill in 00:00:44.440 |
some gaps there. But did some web dev stuff, some data engineering, some information retrieval stuff 00:00:50.080 |
at a legal company, and then at Rosetta Stone before that. 00:00:54.200 |
Let's see. So I guess sort of the motivation for this is definitely like the Gemini diffusion 00:01:02.700 |
announcement. I know that Mercury got introduced by Inception Labs, I think back in January. 00:01:10.680 |
They're not a big outfit, but one of the people on some of these papers is one of the founders 00:01:19.020 |
there. And one of their big claims is just like they can achieve similar levels of accuracy at much 00:01:25.520 |
faster output speed. So I think that's just an interesting dimension. I guess sort of for the 00:01:34.360 |
structure of this talk, I think about this as, like, how do we teach in general? I spent a little bit of time 00:01:42.260 |
teaching at a coding boot camp, and I'm trying to keep you guys in that flow state between boredom and 00:01:48.060 |
anxiety. So, you know, if it's going too fast, feel free to ask questions. If you're bored, I'm sorry. 00:01:52.580 |
You know, take a look at some of the papers that are referenced. This is definitely optimized for 00:01:58.580 |
exploration as opposed to exploitation. So I'm going over a bunch of different papers, but not very deep, 00:02:03.620 |
but happy to kind of pause and redirect as needed. Don't want to keep this like super formal. 00:02:27.620 |
Flow is still going. Keep it going. Who's seen Gemini diffusion? Thousands of tokens per second. 00:02:35.060 |
I don't think thousands. I think a thousand, right? 00:02:38.420 |
We shall see. Let's ask Gemini. How many tokens per second is Gemini diffusion? 00:02:52.660 |
I think that there's quite a few ways to still optimize it. Wow. Gemini is so smart. Gemini says 00:03:00.100 |
it is 1479 tokens per second, but there's a brief overhead delay of 0.84 seconds. 00:03:09.300 |
That's rough. You got to wait a second for your thousand tokens per second. 00:03:15.380 |
There's a question here from IO about are diffusion models auto-regressive? Typically not. But you can 00:03:21.860 |
have an auto-regressive body and a diffusion head. 00:03:27.540 |
Okay. Sorry about that. I think Zoom crashed on me. That's the last of it. Cool. So I'll try to run 00:03:39.540 |
through this. So really just kind of motivating this with some of the foundations. What is a generative 00:03:45.860 |
model in the abstract? We're trying to model kind of an underlying probability distribution of data 00:03:53.140 |
that's unknown. And then once we've learned the distribution, we're able to do stuff with 00:03:58.340 |
it. So this slide is based on a talk from Yang Song, who did some foundational stuff starting kind of 00:04:06.020 |
in 2019 and has been a big contributor to diffusion models in general since then. But I thought his graphic 00:04:13.380 |
was pretty cool. So we've got the probability of a data point x under the true data distribution. 00:04:32.420 |
And then we're trying to find a model that's able to learn that distribution based on some 00:04:39.700 |
learnable parameters theta. During training, p_theta(x) is treated as a likelihood function, 00:04:46.580 |
measuring how plausible the observed data is under the current parameters theta. We can think about this 00:04:52.020 |
as a KL divergence between the two probability distributions, and sort of the optimization 00:04:57.460 |
we're trying to do is minimize the difference between these two distributions. Another equivalent 00:05:03.940 |
framing is to say that we're trying to find the parameters theta that maximize the likelihood of 00:05:12.100 |
the observed data. And then we can use this equivalence relation. The problem is that 00:05:18.020 |
for most data distributions that we care about, the shape of the distribution is really complex and 00:05:28.340 |
hard to model. And then it's also hard to find a set of weights that are able to appropriately model it. 00:05:36.180 |
So in the naive case, we can say we're using a Gaussian distribution, which isn't very expressive and 00:05:46.020 |
isn't able to model complex spaces. Or we can use kind of a neural net, which, by the universal approximation 00:05:56.420 |
theorem, is able to model our data. And then we can sample from it. Sort of the challenge becomes, you know, 00:06:02.420 |
how do we make sure that it's appropriately expressive, but isn't too expressive. So another view of this is 00:06:12.820 |
just like a narrow slice of a two-dimensional mapping of sort of the feature space that Claude 3 learns. 00:06:25.140 |
So just another kind of perspective into what the general modeling task is. We're trying to learn this 00:06:30.980 |
distribution or a distribution that's able to model the true space of the data that we care about. 00:06:40.100 |
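To make that maximum-likelihood framing concrete, here's a minimal sketch (a toy example of my own, not from the talk) of fitting the parameters theta of a Gaussian by minimizing the negative log-likelihood, which, up to a constant, is the same as minimizing the KL divergence from the data distribution:

```python
import numpy as np

# stand-in for the unknown "true" data distribution we're trying to model
data = np.random.normal(loc=3.0, scale=2.0, size=10_000)

def gaussian_nll(mu, log_sigma, x):
    # average negative log-likelihood of x under N(mu, sigma^2)
    sigma = np.exp(log_sigma)
    return np.mean(0.5 * ((x - mu) / sigma) ** 2 + log_sigma + 0.5 * np.log(2 * np.pi))

# naive grid search over theta = (mu, log_sigma); a real model would use gradients
thetas = [(mu, ls) for mu in np.linspace(0, 6, 61) for ls in np.linspace(-1, 2, 31)]
best = min(thetas, key=lambda t: gaussian_nll(t[0], t[1], data))
print("MLE estimate:", best[0], float(np.exp(best[1])))  # should recover roughly 3.0 and 2.0
```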
Cool. So jumping into a bunch of different types of generative models. Again, this is sort of more 00:06:48.980 |
background information. This is inspired by an old talk by Ian Goodfellow, breaking down 00:06:58.500 |
generative models into explicit density and implicit density. I'm starting kind of on the left branch here. 00:07:04.740 |
We're saying that a generative model aims to learn the underlying probability distribution of the data. Cool. 00:07:10.100 |
So that's what we just discussed. And then the explicit density 00:07:13.780 |
is we're defining p theta x. Again, using the maximum likelihood. These are split, again, 00:07:25.380 |
into tractable density models and approximate density models. The tractable density models include 00:07:32.660 |
autoregressive models and normalizing flows. So sort of the classic formulation for autoregressive models 00:07:40.260 |
(RNNs, LSTMs, GPT, et cetera) kind of follows this factorization of the joint probability into one token at a time, 00:07:55.300 |
given the previous tokens. And that just kind of continues autoregressively. 00:07:59.140 |
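As a quick illustration of that chain-rule factorization (a toy sketch; the mean-pooled "state" here is just a stand-in for a real RNN or transformer), the log-likelihood of a sequence is a sum of next-token log-probabilities:

```python
import torch
import torch.nn.functional as F

vocab_size, dim = 100, 32
embed = torch.nn.Embedding(vocab_size, dim)
lm_head = torch.nn.Linear(dim, vocab_size)

def sequence_log_prob(tokens):
    # log p(x) = sum_t log p(x_t | x_<t), the autoregressive factorization
    logp = torch.tensor(0.0)
    for t in range(1, len(tokens)):
        h = embed(tokens[:t]).mean(dim=0)  # toy stand-in for an RNN/transformer state
        logp = logp + F.log_softmax(lm_head(h), dim=-1)[tokens[t]]
    return logp

print(sequence_log_prob(torch.randint(0, vocab_size, (8,))))
```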
For approximate density models, they're still modeling the density itself, but it's intractable to compute 00:08:16.340 |
that directly, so we use an approximation. And these both use lower bounds on the density 00:08:27.060 |
function we're trying to learn. So we talked about VAE or VQVAE recently. I think Ted did a talk on that. 00:08:36.900 |
And then today we're getting into diffusion models, but this is sort of just like framing it with other 00:08:43.300 |
types. And then, you know, going back up to our tree, we've got implicit density functions, which include 00:08:50.740 |
GANs and energy-based models. Energy-based models aren't used a ton, but sort of inspire or motivate some of the other approaches. 00:08:59.780 |
So with GANs, you've got a generator and a discriminator, and you're trying to 00:09:09.380 |
generate a sample that's able to fool the discriminator. 00:09:14.500 |
But with all of these, we can view it as forming some sort of latent representation of 00:09:24.180 |
what our data distribution is, compressing it into a different form and then trying to learn 00:09:34.900 |
how to reassemble it from that latent representation. There's a lot of kind of back and forth between 00:09:44.500 |
the different generative modeling approaches. The space is pretty rich, and people cross-pollinate ideas 00:09:54.020 |
between all of this stuff. Again, this is sort of hopefully not too much in the boredom side of 00:09:59.860 |
things. But for folks that are newer to the space, I think it's helpful to kind of have the preliminaries. 00:10:06.260 |
So jumping into diffusion models foundations. Actually, let me pause for questions real quick and check chat. 00:10:18.420 |
Just let me know if anybody's got something I can pause for. 00:10:22.020 |
There's a question about ELBO. I vaguely remember ELBO. I forget what ELBOs are. 00:10:27.300 |
Yeah, I think it's the evidence lower bound. So it comes up in VAEs, and I saw it in some of the diffusion stuff. 00:10:40.100 |
But it's sort of a trick for how to work around learning, like, the true probability distribution, 00:10:48.340 |
because that's intractable. So we're saying we can get as close as this approximation. 00:11:02.980 |
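For reference, this is the standard evidence lower bound as it appears in VAEs (textbook form, not a formula shown in the talk); maximizing the right-hand side is tractable even when the log-likelihood on the left isn't:

```latex
\log p_\theta(x) \;\ge\;
  \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]}_{\text{reconstruction}}
  \;-\;
  \underbrace{\mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)}_{\text{regularizer}}
```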
On my side, I didn't super get implicit versus explicit. I guess like we're used to explicit. 00:11:16.180 |
This was an obsidian canvas. And I just took screenshots of it. 00:11:30.100 |
Like, everyone is explicit, right? Based on the diagram that you... 00:11:39.780 |
Because they do like the weird generator discriminator min/max game to like try to optimize. 00:11:48.260 |
I don't think it's weird at all. I think we're about to see that with multi-agent. That's basically 00:12:00.100 |
It's a little spoiler that we recorded an episode with Noam Brown. 00:12:08.100 |
So, trying to go back as far as I could, the first reference I found to kind of iteratively 00:12:20.260 |
applying noise was this paper, Extracting and Composing Robust Features with Denoising Autoencoders. 00:12:26.820 |
So, this is before like diffusion became a thing. 00:12:31.380 |
And then Pascal Vincent also shows up in a lot of the kind of earlier work out of the University of Montreal. 00:12:37.860 |
So, a quote from the paper on the destruction step: for each input x, a fixed number vd of components are 00:12:46.500 |
chosen at random and their value is forced to zero while the others are left untouched. All information 00:12:51.780 |
about the chosen components is thus removed from the particular input pattern. The autoencoder will be 00:12:56.820 |
trained to fill in those particular introduced blanks. 00:13:02.020 |
So, they're just randomly picking elements and masking them out and then trying to learn a representation 00:13:12.180 |
based on the masked stuff. They found that it improves sort of the performance. 00:13:16.660 |
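A minimal sketch of that corruption step (my own toy version of the zero-masking corruption described in the quote; the helper name is made up):

```python
import torch

def corrupt(x, num_zeroed):
    # zero a fixed number of randomly chosen components per example, per the paper's destruction step
    x = x.clone()
    for row in x:
        idx = torch.randperm(row.numel())[:num_zeroed]
        row[idx] = 0.0
    return x

x = torch.rand(4, 16)               # a toy batch of 16-dimensional inputs
x_tilde = corrupt(x, num_zeroed=5)  # corrupted input fed to the autoencoder
# the training target stays the clean x: minimize ||decode(encode(x_tilde)) - x||^2
```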
Bengio picked that up a couple of years later. So, that was 2008, and he had a handful of papers in between. 00:13:25.700 |
And then, in 2013, this one I thought was kind of interesting, definitely in the same vein, 00:13:35.140 |
but they've taken it a bit further. Quote: we have proven that training a model to denoise is a way to 00:13:39.380 |
implicitly estimate the underlying data generating process and that a simple Markov chain that alternates 00:13:45.380 |
sampling from the denoising model and from the corrupting process converges to the estimator. 00:13:50.420 |
This provides a means for generating data from any denoising autoencoder. So, they've got, you know, 00:13:57.620 |
they had a figure from MNIST where they're iteratively applying noise on one side and then walking it back 00:14:07.060 |
on the other side. So, can we take a noisy image and reconstruct it? Which definitely lays the 00:14:13.380 |
foundation for some of the stuff to come. This was like the first paper that people associate with 00:14:19.300 |
the diffusion process. So, kind of like the past ones, a lot of this comes from physics. Like a lot of 00:14:30.740 |
old school ML, like early 2000s and before is like statistics and statistical mechanics. So, one of the 00:14:40.500 |
examples I've heard to describe this is that a stochastic process is like dropping a drop 00:14:47.700 |
of ink in water and seeing it kind of spread out. And then the reverse process is like, how would you 00:14:52.660 |
imagine where the drop came from? Okay. So, similar to the past ones, they iteratively add 00:15:09.220 |
noise. They talk about stochastic differential equations, which the math is deeper than I fully 00:15:17.860 |
understand or want to try to get into. But the underpinnings are just kind of these stochastic 00:15:23.540 |
processes, which were heavily inspired by physics and some of the work on how particles move, 00:15:32.820 |
and how heat, and the entropy associated with heat, kind of behave. 00:15:37.860 |
And some of the examples from that paper: on the left, in figure A, we've got CIFAR-10 holdout 00:15:50.420 |
images. They corrupt them in figure B with like Gaussian noise. And then they denoise the images 00:16:00.420 |
and try to reconstruct them. And we can see that the reconstructions are pretty decent. And I think 00:16:06.020 |
this was with a two layer network. So, not a ton of parameters, but they're still able to get some 00:16:12.900 |
interesting results. And again, yeah, we're gradually applying noise in the forward diffusion and the 00:16:19.540 |
reverse diffusion, recovering the distribution. 00:16:26.900 |
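Here's a minimal sketch of that gradual forward noising (an illustration using the now-standard closed form for sampling x_t, not the 2015 paper's exact formulation):

```python
import torch

def forward_diffuse(x0, t, T, beta_min=1e-4, beta_max=0.02):
    # closed-form sample of x_t ~ N(sqrt(abar_t) * x0, (1 - abar_t) * I)
    betas = torch.linspace(beta_min, beta_max, T)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t]
    noise = torch.randn_like(x0)
    return alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * noise, noise

x0 = torch.rand(3, 32, 32)                     # a CIFAR-10-sized image in [0, 1]
x_t, eps = forward_diffuse(x0, t=500, T=1000)  # heavily noised by step 500
```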
Kind of a speed summary of 2015 to 2019. GANs got released; I think Bengio was also on the GAN paper with Ian Goodfellow, around 2014. 00:16:35.460 |
They blew up. They had kind of leading results for the Frechet Inception Distance, the FID score, 00:16:47.620 |
for a while. But GANs kind of struggle with mode collapse. So, I did a project on GANs a few months ago, 00:16:59.300 |
and I was trying to reconstruct Fashion-MNIST. And like half of my results came out as boots 00:17:08.180 |
when there should be shirts and hats and all kinds of stuff. VQVAE was released, I think, 2018. We talked 00:17:19.060 |
about that recently. But it compresses images into a discrete one-dimensional space. And then tries to 00:17:24.740 |
predict it from the compressed space. And these kind of like went back and forth on what the state-of-the-art 00:17:34.100 |
results were. Then in 2020, Song and Ermon took another stab at diffusion models. They framed it in 00:17:47.700 |
terms of this thing called score matching. Their approach to this was very math heavy, but they're 00:17:59.540 |
trying to understand how a diffusion process is able to learn a function to score 00:18:04.820 |
the data in a high-dimensional space. And then they also introduced some processes for scaling that. 00:18:13.940 |
And then they introduced a slightly larger network. And based on their principled approach, were able to 00:18:24.340 |
increase the sample size and quality of the images that they could generate. So just a snapshot from this. 00:18:30.740 |
So the samples still kind of look bad, but better. And then their FID score was 10. 00:18:39.940 |
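For context, the "score" being matched is the gradient of the log-density, and once learned it can be used to sample via Langevin dynamics (standard background added here for reference, not a formula from the talk):

```latex
s_\theta(x) \approx \nabla_x \log p(x), \qquad
x_{i+1} = x_i + \tfrac{\epsilon}{2}\, s_\theta(x_i) + \sqrt{\epsilon}\, z_i,
\quad z_i \sim \mathcal{N}(0, I)
```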
The big breakout paper for diffusion was this one, which is usually called DDPM, for Denoising Diffusion 00:18:51.140 |
Probabilistic Models. Similar to previous work, this one uses a forward and reverse diffusion process. 00:18:59.860 |
One of the key breakthroughs here was that they restricted this to only predicting the mean of the Gaussian 00:19:09.860 |
distribution from which the noise was added. And they were able to show, with a bunch of math, that 00:19:19.460 |
predicting the mean is equivalent to predicting the sample x, the image sample that they're trying 00:19:24.980 |
to find. So effectively, they're trying to predict the sample x, but they do that by predicting 00:19:37.460 |
the noise applied to x. And then they also introduced a much larger network. This is a U-Net that they pulled 00:19:45.380 |
from image segmentation tasks. And they achieved state of the art for FID. Again, this is 2020. 00:19:54.100 |
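A minimal sketch of that simplified noise-prediction objective (the formulation is the standard one from the DDPM paper, but the code is my toy version; `model(x_t, t)` is assumed to return the predicted noise, and `alpha_bar` is the cumulative product of (1 - beta_t) over the schedule):

```python
import torch

def ddpm_loss(model, x0, alpha_bar):
    # sample a random timestep per example, noise the batch, predict the noise
    t = torch.randint(0, alpha_bar.numel(), (x0.shape[0],))
    a = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))   # broadcast over image dims
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps           # forward-noised sample
    return torch.mean((model(x_t, t) - eps) ** 2)        # simple L2 on predicted noise
```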
I'm not going to go through all this. Song came back pretty shortly afterwards and iterated on his last 00:20:10.580 |
framework. More stuff with stochastic differential equations. Some of these papers or all these papers 00:20:17.220 |
that I've talked about sort of form the underpinnings for current work. They're really kind of like the 00:20:22.420 |
landmark ones. And people keep referencing particularly this, the DDPM paper. 00:20:33.220 |
And again, the results look even better. You can see here, this is the score function in the middle. 00:20:50.340 |
Okay. So, jumping ahead... actually, let me pause there. Any questions? A lot of the stuff is kind of 00:20:59.060 |
background, so trying to go through quickly. But that doesn't mean it's not interesting or important. 00:21:10.020 |
And then also, I think the other thing I was waiting for you to cover, but it looks like you 00:21:16.740 |
stopped at like 2020, right? Did that paper get covered? 00:21:23.620 |
2021. You didn't cover latent diffusion? Was that latent diffusion? 00:21:29.620 |
Well, latent diffusion is very important. And then consistency models and flow matching. 00:21:34.900 |
Those are the three things that I think the last three years of diffusion have kind of thought taught us. 00:21:45.940 |
We've covered those things in previous paper clubs. 00:21:51.620 |
Yeah, I think that the background was super helpful. Thank you, Tyler. 00:22:10.980 |
So after the DDPM paper got state of the art on FID scores for ImageNet, it got a lot of attention. 00:22:22.180 |
The research kind of spiked afterwards. A lot of the focus was still on applications with image generation. 00:22:32.180 |
But increasingly, it started to become with text modeling and other modalities. So audio, video. 00:22:38.180 |
This was one of the first ones I could find with reference to 00:22:44.740 |
text modeling. And this is the Austin et al. 2021 paper, Structured Denoising Diffusion Models in Discrete State-Spaces. 00:23:00.020 |
One of the quotes from the paper is: we develop a structured corruption 00:23:07.060 |
process appropriate for text, using similarity between tokens to enable gradual corruption and 00:23:12.580 |
denoising. Expanding further, we also explore a corruption process that inserts mask tokens, 00:23:17.620 |
allowing us to draw parallels to autoregressive and mask-based generative models. 00:23:28.980 |
So the process they introduce is they think of, you know, a string of text as a sequence of discrete 00:23:41.300 |
tokens, which we can reshape into a matrix. And then if we're randomly masking a token, 00:23:52.020 |
we can do that by sampling noise. But each one of these samples is discrete, versus 00:23:59.460 |
some of the previous approaches we talked about, which assume that the distribution is continuous, 00:24:04.740 |
which is more appropriate for, like, color values or, you know, an audio signal. 00:24:11.860 |
Whereas for text, we're treating this as just one of the values from a vocab. And we can iteratively apply 00:24:24.500 |
this as the forward process q(x_t | x_s). And they use, like, a transition matrix, which I think is from 00:24:39.700 |
a Markov process. But feel free to jump in if someone has a better understanding 00:24:50.020 |
of that. But at each time step, they determine randomly, based on a probability from 00:24:58.580 |
this transition matrix, whether to mask a token at each time step t. 00:25:08.020 |
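Here's a toy sketch of that masking process in the absorbing-state special case (my own simplification; the paper expresses this with full transition matrices Q_t over the vocabulary):

```python
import torch

MASK = 0  # assumption: reserve token id 0 for [MASK]

def forward_mask_step(x, beta_t):
    # one step of q(x_t | x_{t-1}): each token flips to MASK with probability beta_t
    flip = torch.rand(x.shape) < beta_t
    return torch.where(flip, torch.full_like(x, MASK), x)

tokens = torch.randint(1, 1000, (16,))  # toy sequence over a 1000-token vocab
x = tokens
for t in range(10):
    x = forward_mask_step(x, beta_t=0.15)  # more positions absorbed into MASK each step
```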
And then the reverse process is similar to that, where if we know what the true distribution of 00:25:15.300 |
the data x_0 is, we know, like, what the unmasked value should be. And so we're able to sample from, like, 00:25:24.340 |
the reverse of the distribution to unmask it. But of course, you know, in actual generative modeling, 00:25:33.460 |
we don't know what the true distribution of x_0 is. So in our reverse process, 00:25:41.380 |
p_theta(x), we can instead approximate this using our neural network 00:25:57.140 |
to determine what the probability should be. So this is what the network learns. 00:26:14.100 |
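And a toy sketch of that learned reverse step (illustrative only; the names are made up): the network's per-position logits stand in for p_theta, and we only resample at masked positions:

```python
import torch

def reverse_unmask_step(x_t, logits, mask_id=0):
    # sample from the network's p_theta(x_0 | x_t) at masked positions only
    probs = logits.softmax(dim=-1)                         # (T, vocab)
    sampled = torch.multinomial(probs, num_samples=1).squeeze(-1)
    return torch.where(x_t == mask_id, sampled, x_t)

x_t = torch.tensor([5, 0, 9, 0])       # two masked positions (id 0)
logits = torch.randn(4, 1000)          # toy network output
x_prev = reverse_unmask_step(x_t, logits)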
This is a figure from the paper, 00:26:16.260 |
based on the LM1B task, generating new sentences. And then the bottom 00:26:26.820 |
portion on the right is reconstruction samples. And so what this is trying to show is that on 00:26:33.780 |
the top, as t goes up, more stuff is masked. And then on the bottom, as t moves forward, more stuff is 00:26:40.340 |
unmasked. And then if we learned a good representation of our data, then what gets reconstructed is, like, 00:26:55.540 |
So we can see, you know, it masks the original, "Caterpillar is eager to expand in Asia," and then the 00:27:03.940 |
reconstructed version is "Caterpillar is eager to expand in China." 00:27:06.820 |
Some of the claims in the paper, this is a big block quote that I'm not going to read, 00:27:18.260 |
but I encourage people to check out. 00:27:25.140 |
They claim that BERT, viewed through this lens that they've established, is a diffusion model, 00:27:33.540 |
and that autoregressive models are discrete diffusion models. 00:27:39.060 |
And that generative masked language models, so the type of model they've proposed, 00:27:47.700 |
or masked language models in general, like BERT, are diffusion models. 00:27:53.780 |
So an interesting parallel with kind of like previous NLP research, 00:27:59.460 |
trying to establish a background for their approach. 00:28:04.420 |
This plays out in some of the, um, future papers. So I wasn't sure how much depth to go into, 00:28:11.780 |
so I cut out a bunch of stuff. Um, so this one was 2021. Um, there's a ton of... 00:28:17.620 |
Sorry Tyler. Oh yeah. I was going to actually ask a question on this. I think this feels actually 00:28:23.060 |
kind of important. Um, so I want to talk through my thinking of it. And I think you, you will know 00:28:30.580 |
this a lot more than me. I think BERT is a one-step diffusion model because we add noise, right? We corrupt 00:28:36.660 |
the tokens. That's why it's a one-step diffusion model. 00:28:40.980 |
I'm a little bit stuck on how autoregressive models are discrete diffusion models. Could 00:28:45.460 |
you talk a little bit more about that piece? Yeah, totally. Um, so 00:28:52.260 |
I think there, um, I'm not as firm on this as I could be. Um, but 00:29:06.980 |
you know way more than me at this point. I don't know. I went really broad and skimmed a ton of 00:29:12.580 |
papers. Um, and I can share a bit of a high level about what they're saying here. So what's the 00:29:17.940 |
intuition? Yes, please. Discrete diffusion in this sense is basically this. Uh, so at each step, 00:29:27.220 |
there's a deterministic answer and they're just masking the next token, right? So when you train the 00:29:32.660 |
autoregressive model, what you're basically doing is you're predicting one token at a time 00:29:38.020 |
and in a sense, right, that's, that's just, you have deterministic outputs of what the token should be. 00:29:43.940 |
You're just now masking one token at a time. So it's deterministic in the sense of, you know, 00:29:49.940 |
you know exactly what it is. It's equivalent to like a single diffusion step, but instead of stochastic 00:29:55.620 |
diffusion, it's deterministic because you know what the token should be. But yeah, I like read a little 00:30:02.420 |
bit more about this and that's, that's kind of all they're saying. They're just saying you're masking token 00:30:06.580 |
by token and that's all it is. Right. So it is discrete as opposed to probabilistic or continuous 00:30:14.980 |
like images because images, image pixels are continuous, right? Right. In this case, it's text 00:30:19.540 |
tokens. That's why it's discrete. Is that it? Images, images are continuous in the sense of you're predicting 00:30:26.180 |
how much noise was applied and what the noise is. In this case, you have exactly one token of measurable, 00:30:32.180 |
predictable, discrete change. Right. So that's, that's how they're framing the idea, which, which 00:30:38.420 |
kind of in some sense makes sense, right? Like how much noise did I apply over an image? There's a 00:30:43.060 |
spectrum of answers, right? And then your loss is measured differently. In this case, it's a very 00:30:47.540 |
discrete. Yeah. I also wonder in this case, like maybe the word big, it could be large or huge. I think 00:30:54.420 |
all of them, all of them are actually, the word big itself may not be the, you have many synonyms that 00:31:01.380 |
could also be correct answers. So in that sense, framing as discrete loss or discrete diffusion. 00:31:07.860 |
Never mind. I don't know enough of this to comment more. Sorry, please go ahead. 00:31:12.020 |
So in that case, why are masked language models... Oh, RJ, go. Yeah, sorry. One interesting thing that I 00:31:21.220 |
learned when I did the consistency paper was that the diffusion process is actually modeled from the 00:31:30.020 |
beginning, given the predictions up until now. Right. So like, it's not just, given 00:31:37.860 |
the last prediction, what's the next step. It's given, what's the next step, given all the predictions up 00:31:45.700 |
until now. And this is the same for a language model. And so, from that perspective, 00:31:50.580 |
it looks very much like an autoregressive model. Right. That makes sense. Yeah. And the other thing, 00:31:58.180 |
maybe this is stating the obvious, but BERT, with the masked language 00:32:04.660 |
modeling, is just sort of like a generalization of masking only the next token. Right. So it's sort of 00:32:10.900 |
like, you just happen to have always masked the next token every time for the, 00:32:16.980 |
um, for the autoregressive model. Whereas with a masked language model, you're masking anywhere, 00:32:24.580 |
like a chunk, or tokens inside of the text, instead of at the end. 00:32:32.740 |
Right. That makes sense. I kind of think like, but it's like, you're masking tokens in the middle, 00:32:39.380 |
like 15% of the time. Yeah. Whereas for autoregressive, you're masking everything ahead. 00:32:45.460 |
Um, everything's a mask. Okay. And then they denoise one position at a time, if you think of it that way. 00:32:50.100 |
But maybe more. Yeah. So maybe it's more just like you mask whatever the next token is. 00:32:55.540 |
Or maybe it's better. Maybe what you said is a better way to look at it, in the sense 00:33:01.460 |
it, because it parallels what I said about diffusion models in the, um, like for whatever the stable 00:33:09.220 |
diffusion or whatever. No, that's totally right. Yeah. And I'm going to get into it in a minute, 00:33:14.420 |
at least a little bit, but some of the state of the art stuff for language diffusion, 00:33:20.180 |
of course they lean heavier on big transformer models. And they're able to remove 00:33:25.780 |
kind of the causal mask that's applied for attention. So you think of, like, 00:33:32.500 |
you know, the bottom triangle that we usually see for attention, or if you have, like, grouped query 00:33:36.980 |
attention, it's like the little squares down your triangular matrix. For 00:33:43.380 |
diffusion language models, they're able to look at, you know, the whole KV for 00:33:53.700 |
the attention operator. So, so if I'm jumping ahead here, I'm sorry, 00:34:00.020 |
sorry, please finish, uh, yeah, no, that's it. So if I'm jumping ahead here, let's say I have a code base 00:34:06.260 |
and I say, I want to refactor this specific object that's being used. 00:34:13.060 |
Discrete model, uh, a generative model just fills in the blanks. 00:34:17.060 |
And that's why you don't actually have to go left to right, like autoregressive models, 00:34:21.620 |
which you have to go left to right, even when they're just filling in the blanks. 00:34:24.340 |
Actually, I don't know if there's, uh, better techniques for filling in the blanks with an 00:34:29.060 |
autoregressive model for code models, but a discrete model can just say, okay, 00:34:32.180 |
someone will just draft out, okay, these are all the definitions I need, all the functions I need, 00:34:36.500 |
some autoregressive model drafts that out, and then the discrete model just fills in the blanks, and 00:34:41.060 |
it can do that very fast. I'm, I'm just kind of trying to match it to what we saw at Google I/O 00:34:46.980 |
when they were demoing this, um, sorry, I mean a diffusion model for coding. 00:34:52.660 |
Totally. Yeah. I mean, I think one of the challenges, too, is the context for 00:34:57.220 |
things that are ahead. Um, so if you're, you know, in that example, and I don't know the specific 00:35:05.940 |
processes that are applied for code models, and assuming it's, you know, a model with a similar 00:35:10.660 |
loss function, um, and that they're just sampling from it, saying, here's what it is, here's what we 00:35:18.580 |
want. And it's able to interpret that there should be a larger output space, um, or the output 00:35:25.060 |
sequence should be longer. Um, that's true. Expanding that dynamically, right? 00:35:29.860 |
Yeah, totally. Yeah. But if you have, like, you know, your hundred-line file, 00:35:35.460 |
if you're doing it autoregressively, and you're making an edit at, 00:35:43.460 |
like, line 50 or something, you have to feed in the first 50 lines in context and say, we want the 00:35:49.700 |
edit at line 50, and then also feed in the remaining 50 lines. And then 00:35:58.740 |
your autoregressive model has to attend to the stuff that's ahead of it with reference 00:36:04.740 |
to the stuff that's behind it. Versus if we don't have the causal mask, it can attend to all of it 00:36:10.100 |
at once, um, or it's able to attend to it at once in a different way. Um, but hopefully I'm not... 00:36:17.060 |
Makes sense. Thank you, Tyler. This was helpful for my intuition. Thank you. 00:36:22.660 |
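To make the causal-versus-bidirectional point concrete, here's a toy comparison of the two attention masks (an illustration, not any particular model's code):

```python
import torch

T = 6
scores = torch.randn(T, T)  # toy attention logits

# autoregressive: causal (lower-triangular) mask, position t only sees positions <= t
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
ar_attn = scores.masked_fill(~causal, float("-inf")).softmax(dim=-1)

# diffusion LM: no causal mask, every position attends over the whole sequence
bidir_attn = scores.softmax(dim=-1)
```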
Cool. So yeah, there's a bunch of papers that happened from 2021. I'm jumping ahead to 2024. 00:36:31.220 |
Um, the state of the art kind of kept ticking up a little bit, but still significantly worse than 00:36:38.340 |
autoregressive models, um, even for comparable model sizes. Um, there was stuff like the 00:36:46.500 |
back and forth between discrete state spaces and then continuous state spaces where they have like 00:36:51.460 |
a continuous representation and then they have like a separate process. So the continuous representation is 00:36:56.180 |
like an embedding. Um, and then they have a separate process to sample from the embedding 00:37:00.740 |
to determine what the token should be. Um, which hasn't really panned out in terms of like what the, 00:37:09.700 |
what is now state of the art. So that's part of the reason I chose to skip over it. Um, but, um, 00:37:15.140 |
encourage folks that are interested to look into it. There's a bunch of papers that came out, 00:37:20.180 |
um, kind of in this time span that I'm glossing over. Um, one of the next ones that I thought was, 00:37:30.740 |
was pretty cool. Um, and I think this is kind of looking backwards from what this year's state of 00:37:35.380 |
the art is and what's referenced from that. So looking at 00:37:41.380 |
this study from October of last year, um, where they started scaling up masked diffusion models, 00:37:54.100 |
and they were able to achieve results, um, that are competitive with autoregressive 00:38:01.780 |
language models of similar sizes, um, and with relatively similar levels of training 00:38:11.220 |
compute. Um, so they trained up to a 1.1 billion parameter model, and, depending on the 00:38:20.980 |
benchmark, it was competitive with GPT-2, the 1.5B version, and then Llama 2, 00:38:27.460 |
the 7B version. Um, so those weren't quite state of the art when this was released, um, 00:38:32.820 |
but still kind of a step forward. Um, and then, I don't know, I kind of like this scaling 00:38:39.460 |
work, just to see, like, the number go down as compute goes up. Um, there were pretty graphs. 00:38:49.620 |
Um, so, you know, the classic Kaplan and Chinchilla IsoFLOP curves: based on our 00:38:57.700 |
training budget, we make the loss go down. Um, 00:39:03.140 |
and here they're saying they followed a similar scaling law to autoregressive 00:39:14.740 |
models with their masked diffusion models, um, but with some constant multiplier. Um, 00:39:20.900 |
but the general curve, you know, on our log-log plot, was pretty similar for their 00:39:28.580 |
approach. Um, digging into this a little bit more: to achieve a similar validation loss, um, 00:39:40.900 |
they had to have 16x more compute than the autoregressive model on the 00:39:48.020 |
approach that they model here. So that's the left plot. And then on the right plot, 00:39:53.300 |
they were able to achieve similar, um, what is this? 00:40:03.460 |
I think better performance with fewer parameters is the right plot. 00:40:06.900 |
And then, yeah, here are the results, um, which are competitive with the models we listed. 00:40:18.020 |
Um, this one we talked about last week, um, so I'm not going to dig into it too much. Um, 00:40:24.420 |
so I just want to hit some things that I thought were cool that we didn't quite touch on. Um, 00:40:28.580 |
so this is the same, uh, mostly the same authors as the scaling laws paper, um, where they continue 00:40:36.820 |
the trend and scale up to 7 billion parameters. Um, and then the results got even better. It's now 00:40:45.460 |
as good or better than Llama 2 7B or Llama 3 8B on a handful of different benchmarks. 00:40:50.420 |
Um, one of the things, you know, we were just talking about is the bi-directional reasoning. 00:40:56.900 |
Um, so one of the tasks they looked at was reversing a poem. Um, so if you have, like, 00:41:10.740 |
the last couple of lines in a poem, can you predict the lines that came before it? 00:41:14.820 |
Um, and for this specific task, um, this model significantly outperformed 00:41:21.140 |
most of the other models they tested against, including GPT-4o. Um, and they attribute 00:41:28.340 |
that to kind of the lack of the causal mask for attention that, you know, we were just talking about. 00:41:34.260 |
Um, so again, a figure from the paper where they're masking stuff. Um, they also introduced this 00:41:42.180 |
step where, um, they're able to score the probability of a token that's predicted by 00:41:49.620 |
the mask predictor. Um, and things that are low probability, they can then potentially remask and 00:41:54.820 |
then try to resample again to get a better prediction. Um, so that's the right, or part C, of this 00:42:02.500 |
figure, um, where they're remasking the tokens and then re-predicting them again, which is kind of neat. 00:42:07.940 |
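A toy sketch of that low-confidence remasking step (my own illustration; function and variable names are made up):

```python
import torch

def remask_low_confidence(pred_tokens, logits, num_remask, mask_id=0):
    # confidence = model probability assigned to each predicted token
    conf = logits.softmax(dim=-1).gather(-1, pred_tokens.unsqueeze(-1)).squeeze(-1)
    out = pred_tokens.clone()
    out[conf.topk(num_remask, largest=False).indices] = mask_id  # re-mask the least confident
    return out

pred = torch.randint(1, 1000, (16,))
logits = torch.randn(16, 1000)
pred = remask_low_confidence(pred, logits, num_remask=4)  # 4 positions go back to MASK
```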
Um, and then more scaling laws stuff: um, for different tasks, their model was able to 00:42:20.020 |
achieve better performance at lower training compute. Um, and sometimes it's worse. 00:42:26.820 |
So for the middle bottom plot, GSM8K, um, which I think is a math-focused task, um, 00:42:34.740 |
they outperformed their autoregressive baseline. And then in the plot to the left of it, 00:42:39.380 |
the bottom left plot there, um, we can see all the orange stars are sort of below the blue dots. 00:42:48.420 |
Um, and so lower is worse accuracy, and this is a zero-shot task, um, for the same level of compute. 00:42:58.260 |
This next one, uh, block diffusion. Um, so this paper came out in 00:43:08.660 |
March of this year. Um, this uses a hybrid architecture where it has 00:43:18.340 |
blocks, um, where each block is generated autoregressively, but anything within the block 00:43:24.020 |
is generated using diffusion. Um, part of the reason they did that is, um, 00:43:30.740 |
I think part of it was to make the sampling task easier. Um, but they were also able 00:43:39.220 |
to take advantage of KV caching, um, versus the previous paper, which, despite the results, 00:43:48.420 |
hasn't been optimized with a lot of the tricks, like KV caching. Um, 00:43:53.540 |
I didn't dig into the specifics of why KV caching doesn't work for that previous paper. Um, kind of 00:44:00.660 |
thinking about it, um, my intuition is that if you have, like, a dialogue, um, you know, between 00:44:09.860 |
the system and a user, um, it has to generate, like, the next response starting from zero, 00:44:16.500 |
um, as opposed to finding a way to cache the previous tokens like you can 00:44:23.300 |
if you're generating autoregressively. Um, so block diffusion was able to 00:44:28.180 |
get around that by having, um, these chunks of a fixed length that they then generate. 00:44:34.100 |
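Here's a conceptual sketch of block diffusion's generation loop (not the paper's code; `model` is assumed to return per-position logits over the vocabulary, and the mask id is made up):

```python
import torch

MASK = 0  # assumption: token id 0 is [MASK]

def generate(model, num_blocks, block_len, steps=4):
    # autoregressive over blocks (so a KV cache can be reused across blocks),
    # diffusion-style iterative unmasking within each block
    seq = torch.empty(0, dtype=torch.long)
    for _ in range(num_blocks):
        block = torch.full((block_len,), MASK)
        for s in range(steps):
            logits = model(torch.cat([seq, block]))      # assumed: returns (T, vocab)
            conf, pred = logits[-block_len:].softmax(-1).max(-1)
            masked = (block == MASK).nonzero().squeeze(-1)
            k = max(1, masked.numel() // (steps - s))    # unmask a fraction per step
            keep = masked[conf[masked].topk(min(k, masked.numel())).indices]
            block[keep] = pred[keep]
        seq = torch.cat([seq, block])
    return seq
```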
So again, we talked about this last week: um, the large language diffusion paper applied 00:44:46.100 |
pretty similar pre-training. Um, I think they used like 2.3 trillion tokens, um, and we saw 10 to the 23rd 00:44:53.860 |
flops on H100s, um, but they didn't do any post-training. Um, and then some of the new work 00:45:05.380 |
that's coming out is improving kind of the post-training, um, especially with a reasoning slant. 00:45:11.860 |
Um, so this paper uses, as a base model, the LLaDA that they introduced, 00:45:20.660 |
um, and then applies their custom GRPO post-training. Um, and then they use, like, the s1 00:45:34.580 |
reasoning dataset to do supervised fine-tuning, um, with a slant towards, um, like math and reasoning 00:45:42.020 |
and code tasks. Um, and based on that, they're able to, um, you know, dramatically improve the 00:45:47.620 |
performance on those specific tasks over the base model. Um, so that's the bottom table here, 00:45:57.940 |
they're in green. Um, the top row is the base model. Um, so they bumped up the numbers a 00:46:05.380 |
good amount. Um, let's see, and that's all. So thanks, all. 00:46:16.020 |
This is really good. Um, Tyler, there's a question that Eric posed in the channel 00:46:27.540 |
that I'm also curious about: how does the compute for the output window scale? Is it N squared 00:46:35.140 |
for diffusion models, or is it linear? Would you know? 00:46:45.940 |
Some of the big ones that we hear about, like Stable Diffusion, DALL-E, Imagen, uh, diffusion transformers, 00:46:57.220 |
video diffusion transformers. Those are all transformer based. So there is, um, there is 00:47:03.780 |
that, um, quadratic scaling issue, but some of them are not. So some basic denoising 00:47:10.740 |
diffusion probabilistic models, they're more CNN based. And then there's no longer that transformer 00:47:16.820 |
complexity issue. So depending on what you're doing, like some early work trying to do diffusion 00:47:23.460 |
for completion, so like short completion for codegen stuff, is not transformer based. So you're 00:47:30.100 |
no longer complexity bound. But I found it weird, because, you know, that's completion, it's like 00:47:35.940 |
not long context. So I don't know, it's just what they did though. But, um, some of the latent 00:47:42.980 |
diffusion models, like from stability, those I believe are also not transformer based. So they're, 00:47:50.260 |
you know, that's more popular. You've probably heard of some latent diffusion stuff. Um, 00:47:54.740 |
they don't use, um, they don't use transformers for the diffusion itself. 00:48:01.380 |
I think they just use it for the, um, text encoding. So, you know, it, it kind of depends on where you see 00:48:20.660 |
That's cool. Um, that's cool. Yeah, I guess we have a few open follow-up questions now. 00:48:25.620 |
Yeah. I know RJ had a question. I don't know if you want to come on camera and just ask it. 00:48:30.500 |
Oh, no, I was commenting that I think that most, uh, or Stable Diffusion 00:48:41.300 |
anyway, has attention blocks inside of the diffusion block. But I don't know, 00:48:48.580 |
uh, so that would be across all the, um, sort of latent space tokens. So that's a fixed 00:48:59.620 |
size and wouldn't be impacted. So I'm not sure that impacts, like, a text 00:49:05.380 |
diffusion model, except, because you would, yeah. Um, if you look at, like, BERT being a diffusion model, 00:49:13.220 |
right, then it obviously has a transformer block, and so therefore would. Uh, so I think that 00:49:18.500 |
the question about that is a little bit orthogonal, right? Because I view diffusion as 00:49:23.780 |
like an alternative process to, uh, autoregressive modeling, and not so much transformer versus... 00:49:35.700 |
Yeah, so people have incorporated transformers into diffusion models in a handful 00:49:47.300 |
of different ways. Um, so let me see if we can pull this up. Um, uh, I think that's my old desktop. 00:50:01.780 |
Well, so this is the U-Net image that I pulled from one of the slides. Um, this is from 00:50:07.380 |
Prince's Understanding Deep Learning, which is pretty solid. Um, so starting with DDPM, 00:50:13.780 |
this is sort of the base model architecture that they use to predict the noise at each time 00:50:19.780 |
step. Um, so they have, you know, the previous time step, and then it's fed into this. And then the output is the image 00:50:31.540 |
minus the noise. Um, or maybe, sorry, I think it's actually just the noise itself, 00:50:37.140 |
as a difference from what the image would be, but regardless, um, they've got like a bunch of 00:50:44.340 |
convolutional blocks scaling it down and then back up. Um, and they don't make it very clear, 00:50:51.460 |
um, the coloring's not very good, but even within the original DDPM model from 2020, there's attention 00:50:58.980 |
operators kind of at the 16 by 16 chunks here. So it's able to attend to what came in. Um, and then as 00:51:09.300 |
models got bigger and more complicated, people started tossing in more complicated model 00:51:15.780 |
architectures and a lot more attention at different parts of this. Um, a lot of them still 00:51:20.020 |
kind of retained this U-Net shape, um, but stuff kind of got fancier from there. So that's one approach. 00:51:26.900 |
And then I don't know if I can find the right one. And then, can I just add to that? Yeah, please. 00:51:31.460 |
Yeah. So, as Vibu was saying, uh, the attention is primarily used in some of these earlier 00:51:37.540 |
image diffusion models for, uh, the time embedding space, as well as the prompt that would guide the 00:51:47.060 |
image generation. Uh, sometimes these are text prompts, but sometimes these are 00:51:54.180 |
image prompts, like depth maps or contours or outlines of different things. So the 00:52:01.060 |
attention is basically used as a grounding mechanism, uh, to flow forward, but the actual 00:52:06.980 |
process itself is diffusion. So there is a separation that we can make. The attention is used 00:52:12.980 |
for helping the model understand the semantics of the image as well as for the generation. 00:52:19.700 |
But the actual diffusion process itself is orthogonal to that. 00:52:23.380 |
Yeah. The term for that is conditioning. So they call it conditioning sometimes. 00:52:29.300 |
Yeah. And there's even class conditioning. So you can have like class based labels and 00:52:34.660 |
guide, um, generation towards that. So like stylistic labels, right? I want anime and that's 00:52:40.500 |
separate from text conditioning of, like, "ultra realistic." And you know, you're basically using text 00:52:47.220 |
embeddings, but the attention there is typically done, um, with something like cross 00:52:54.180 |
attention over a text embedding dimension. And that's separate from diffusion scaling. Uh, 00:53:05.860 |
And then things get much more complex. They have something called ControlNets, 00:53:09.460 |
which is slightly different concept than conditioning. Yeah. 00:53:13.860 |
It's interesting where, like, you know, we basically had our GPT-3 moment, where we used to have 00:53:22.820 |
like all these temporal nets to fix consistency when you scale stuff up. And then with Sora, you know, 00:53:28.340 |
it turns out that video generation is just scaled up diffusion and you just scale it up a lot and 00:53:33.700 |
you solve a lot of these little like nuances. And it just kind of works. 00:53:38.260 |
So I guess I don't want to be argumentative, but, uh, I think that, um, unless I'm misinterpreting this 00:53:47.460 |
image, uh, that I put in the chat, I'm pretty sure it's saying that there are actually 00:53:56.900 |
attention blocks in the backbone of the diffusion. 00:54:08.180 |
So I don't know if I can pull it up and, or someone can share it if you want. 00:54:11.540 |
Yeah. I mean, it's been a while since we last did it, but, uh, that conditioning is primarily 00:54:16.980 |
for, uh, that attention is primarily for the conditioning, whether it's text or, or image prompts. 00:54:25.540 |
Okay. Uh, well, okay. I guess we can argue about it offline. I think 00:54:36.820 |
that, uh, as far as I can tell, as, uh, Tyler was noting, I think it's actually part of 00:54:44.500 |
the U-Net backbone, or whatever, uh, the backbone is made of. And there's, like, in the Stable Diffusion... 00:54:54.820 |
I'll have to pull up more background information. Um, thanks again, Tyler. Really, really fun one this 00:55:04.740 |
time. Um, I think Cirque had to take a call. So just wrapping things up here: next week, we have the 00:55:12.820 |
AI engineer world's fair. So if any of you guys are around, you know, we'll share something on 00:55:17.940 |
discord. We'll, we'll do like a little meetup. I think we have time for like an in-person paper club 00:55:23.140 |
workshop thing. Uh, we're supposed to be announcing our test of time paper club V2. So we'll share it 00:55:30.500 |
remote too, but, um, TBD on what the paper is, but we'll have some sort of session in person and also 00:55:39.140 |
remote next week. So if you're around at the conference, come by, otherwise, you know, same 00:55:44.020 |
zoom thing and then it'll just be a different one. But yeah, thanks everyone for coming. Thanks, Tyler, 00:55:49.300 |
for sharing. Thank you. Uh, Tyler, someone asked about slides. Oh yeah. I'll post those in the discord. 00:55:59.300 |
Yeah. Thanks everyone. Perfect. Yeah. Thanks. Thanks.