All right. So jumping into it. So just general, yeah, intro yourself. Don't forget. We want to know about you and your personal why. Okay. Yeah, there you go. Yeah, for sure. Yeah, so I'm currently a master's student at Georgia Tech. I know a handful of folks in this community have also taken some classes there.
But yeah, I quit my job about a year ago to go full time just to speed up the process and take some harder classes and kind of dig a little deeper. So I've got about a year left trying to focus on machine learning stuff and a bunch of systems courses.
I didn't have like a CS undergrad, so I'm really trying to fill in some gaps there. But did some web dev stuff, some data engineering, some information retrieval stuff at a legal company, and then at Rosetta Stone before that. Let's see. So I guess sort of the motivation for this is definitely like the Gemini Diffusion announcement.
I know that Mercury got introduced by Inception Labs, I think back in January, which they're not a big outfit. But one of the people on some of these papers is one of the founders there. And one of their big claims is just like they can achieve similar levels of accuracy at much faster output speed.
So I think that's just an interesting dimension. I guess sort of for the structure of this talk, I think about this as like, how do we teach in general. I spent a little bit of time teaching at a coding boot camp. And I'm trying to keep you guys in that flow state between boredom and anxiety.
So, you know, if it's going too fast, feel free to ask questions. If you're bored, I'm sorry. You know, take a look at some of the papers that are referenced. This is definitely optimized for exploration as opposed to exploitation. So I'm going over a bunch of different papers, but not very deep, but happy to kind of pause and redirect as needed.
Don't want to keep this like super formal. nice. Did he freeze? Did in fact freeze. Oh, no. Yeah. Okay. It's okay. We broke out a flow. Flow is still going. Keep keep it going. Who's seen Gemini diffusion? Thousands of tokens per second. So smart. I don't think thousands. I think I think a thousand, right?
We shall see. Let's ask Gemini. How many tokens per second is Gemini diffusion? Yeah, let's let's also ping him on discord. I think that there's quite a few ways to still optimize it. Wow. Gemini is so smart. Gemini says that it says it is 1479 tokens per second. But there's a brief overhead delay of 0.84 seconds.
That's rough. You got to wait a second for your thousand tokens per second. There's a question here from IO about are diffusion models auto-regressive? Typically not. But you can have an auto-regressive body and a diffusion head. Okay. Sorry about that. I think Zoom crashed on me. That's the last of it.
Cool. So I'll try to run through this. So really just kind of motivating this with some of the foundations. What is a generative model in the abstract? We're trying to model kind of an underlying probability distribution of data that's unknown. And then once we learned the distribution, then we're able to do stuff with it.
So this slide is based on a talk from Yang Song, who did some foundational stuff starting kind of in 2019 and has been a big contributor to diffusion models in general since then. But I thought his graphic was pretty cool. So we've got our probability of a data point X given the -- based on the probability of the data.
And then we're trying to find a model that's able to learn that distribution based on learnable parameters theta. And during training, p theta of X is treated as a likelihood function, measuring how well it describes the observed data under the current parameters theta. We can think about this as a KL divergence between the two probability distributions, and sort of the optimization we're trying to do is minimize the difference between these two distributions.
Another equivalent framing is to say that we're trying to find the parameters theta that maximize the expected log-likelihood of the observed data. And then we can do this equivalence relation. The problem is that for most data distributions that we care about, the shape of the distribution is really complex and hard to model.
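The equivalence between the KL view and the maximum-likelihood view mentioned above can be written out as follows. This is a sketch in standard notation rather than a copy of the slide: p_data stands for the unknown data distribution and p_theta for the model. The step from the second line to the third holds because the log p_data term does not depend on theta.

```latex
\begin{aligned}
\theta^{*} &= \arg\min_{\theta} \, D_{\mathrm{KL}}\big(p_{\text{data}}(x) \,\|\, p_{\theta}(x)\big) \\
           &= \arg\min_{\theta} \, \mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_{\text{data}}(x) - \log p_{\theta}(x)\big] \\
           &= \arg\max_{\theta} \, \mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_{\theta}(x)\big]
\end{aligned}
```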
And then it's also hard to find a set of weights that are able to appropriately model it. So in that naive case, we can say we're using a Gaussian distribution, which isn't very expressive and isn't able to model complex spaces. Or we can use kind of a neural net, which, based on the universal approximation theorem, is able to model our data.
And then we can sample from it. Sort of the challenge becomes, you know, how do we make sure that it's appropriately expressive, but isn't too expressive. So another view of this is just like a narrow slice of a two-dimensional mapping of sort of the feature space that Claude 3 learns.
So just another kind of perspective into what the generative modeling task is. We're trying to learn this distribution or a distribution that's able to model the true space of the data that we care about. Cool. So jumping into a bunch of different types of generative models. Again, this is sort of more background information.
This is inspired by an old talk by Ian Goodfellow, but breaking down the generative models into explicit density and implicit density. I'm starting kind of on the left branch here. We're saying that a generative model aims to learn the underlying probability distribution of the data. Cool. So that's what we just discussed.
And then for explicit density, we're defining p theta of x, again using maximum likelihood. These are split, again, into tractable density models and approximate density models. The tractable density models include autoregressive models and normalizing flows. So sort of the classic formulation for autoregressive models, RNNs, LSTMs, GPT, et cetera, kind of follows this joint probability distribution factored into the probability of a single token given the previous tokens.
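As a rough illustration of that factorization, the joint probability of a sequence is the product of per-token conditionals, so the log-probability is a sum. The bigram table below is a made-up toy, not anything from the talk; a real autoregressive model conditions on the whole prefix rather than just the previous token.

```python
import math

# Hypothetical toy conditional model: P(next token | previous token).
# A real autoregressive model would condition on the entire prefix.
BIGRAM = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.3, "dog": 0.7},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def sequence_log_prob(tokens):
    """Chain rule: sum the log-probability of each token given what came before."""
    log_p = 0.0
    prev = "<s>"
    for tok in tokens:
        log_p += math.log(BIGRAM[prev][tok])
        prev = tok
    return log_p

# Joint probability of "the cat </s>" = 0.6 * 0.5 * 1.0
joint = math.exp(sequence_log_prob(["the", "cat", "</s>"]))
```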
And that just kind of continues autoregressively. For approximate density models, they're modeling the density itself, but it's intractable to compute that directly. So we use an approximation on that. And so these both use lower bounds on what the density function we're trying to learn is. So we talked about VAE or VQ-VAE recently.
I think Ted did a talk on that. And then today we're getting into diffusion models, but this is sort of just like framing it with other types. And then, you know, going back up to our tree, we've got implicit density functions, which includes GANs, energy-based models, which aren't used a ton, but sort of inspire or motivate a lot of the -- or some of the approaches that are used.
So with GANs, you've got a generator and a discriminator, and you're trying to generate a sample that's able to fool the discriminator. But with each of these, we can view it as forming some sort of latent representation of what our data distribution is, compressing it into a different form and then trying to learn how to reassemble it from that latent representation.
There's a lot of kind of back and forth between the different generative modeling approaches. The space is pretty rich and people pollinate ideas between all of this stuff. Again, this is sort of hopefully not too much in the boredom side of things. But for folks that are newer to the space, I think it's helpful to kind of have the preliminaries.
So jumping into diffusion models foundations. Actually, let me pause for questions real quick and check chat. Just let me know if anybody's got something I can pause for. There's a question about ELBO. I vaguely remember ELBO. I forget what ELBOs are. Yeah, I think it's the evidence lower bound.
So it comes up in VAEs and I saw it in some of the diffusion stuff. But it's sort of a trick for how to work around learning, like the true probability distribution, because that's intractable. So we're saying we can get as close as this approximation of what the distribution should be.
On my side, I didn't super get implicit versus explicit. I guess like we're used to explicit. Yeah, for sure. Also, how did you make these? This was an obsidian canvas. And I just took screenshots of it. God. Yeah. It looks great. Thanks. Like, everyone is explicit, right? Based on the diagram that you...
Everyone except for GANs. Yeah. Okay. Because they do like the weird generator discriminator min/max game to like try to optimize. I don't think it's weird at all. I think we're about to see that with multi-agent. That's basically what OpenAI is working on. Oh, really? That's cool. Well, I'll bite my tongue.
It's a little spoiler that we recorded an episode with Noam Brown. Damn. That's cool. Anyway. So, trying to go back as far as I could. The first reference I found to kind of iteratively applying noise was in this paper, Extracting and Composing Robust Features with Denoising Autoencoders. So, this is before like diffusion became a thing.
Yoshua Bengio's on here. And then Pascal Vincent also shows up in a lot of kind of earlier work out of the University of Montreal. So, a quote from the paper, on the destruction process: for each input x, a fixed number νd of components are chosen at random, and their value is forced to zero, while the others are left untouched.
All information about the chosen components is thus removed from that particular input pattern, and the autoencoder will be trained to fill in these artificially introduced blanks. So, they're just randomly picking elements and masking them out and then trying to learn a representation based on the masked stuff. They found that it improves sort of the performance.
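The corruption step in that quote is easy to sketch. The following is a minimal NumPy illustration, not the paper's code: the function name `corrupt` and the 30% destruction fraction are assumptions chosen for the example, and the autoencoder that would be trained to fill the blanks back in is left out.

```python
import numpy as np

def corrupt(x, frac, rng):
    """Force a fixed number of randomly chosen components of x to zero."""
    x = np.asarray(x, dtype=float)
    n_destroy = int(round(frac * x.size))  # the "fixed number" of components
    idx = rng.choice(x.size, size=n_destroy, replace=False)
    x_tilde = x.copy()
    x_tilde[idx] = 0.0                     # chosen components lose all information
    return x_tilde, idx

rng = np.random.default_rng(0)
x = np.arange(1.0, 11.0)                   # 10 nonzero components
x_tilde, idx = corrupt(x, frac=0.3, rng=rng)
# A denoising autoencoder would be trained to map x_tilde back to x.
```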
Bengio picked that up a couple of years later. So, that was 2008. He had a handful of papers between this. And then, in 2013, this one I thought was kind of interesting, where it's definitely in the same vein, but they've taken it a bit further. We have proven that training a model to denoise is a way to implicitly estimate the underlying data generating process and that a simple Markov chain that alternates sampling from the denoising model and from the corrupting process converges to that estimator.
This provides a means for generating data from any denoising autoencoder. So, they've got, you know, they had a figure from MNIST where they're iteratively applying noise on one side and then walking it back on the other side. So, can we take a noisy image and reconstruct it? Which definitely lays the foundation for some of the stuff to come.
This was like the first paper that people associate with the diffusion process. So, kind of like the past ones, a lot of this comes from physics. Like a lot of old school ML, like early 2000s and before, is like statistics and statistical mechanics. So, one of the examples I've heard to describe this is that a stochastic process is like dropping a drop of ink in water and seeing it kind of like spread out.
And then the reverse process is like how would you like imagine where the drop came from? Shit. Okay. Okay. So, similar to the past ones, they iteratively add noise. They talk about stochastic differential equations, which the math is deeper than I fully understand or want to try to get into.
But the underpinnings are just kind of on these stochastic processes, which were heavily inspired by physics and some of the stuff on like how do particles move? How do heat and the entropy associated with heat kind of behave? And some of the examples from that paper, they've got like on the left, figure A, we've got CIFAR-10 holdout images.
They corrupt them in figure B with like Gaussian noise. And then they denoise the images and try to reconstruct them. And we can see that the reconstructions are pretty decent. And I think this was with a two layer network. So, not a ton of parameters, but they're still able to get some interesting results.
And again, yeah, we're gradually applying noise in the forward diffusion and, in the reverse diffusion, recovering the distribution. Kind of a speed summary of 2015 to 2019. 2017. GANs got released. I think Bengio was also on the GAN paper with Ian Goodfellow, I think around 2014. They blew up, they had kind of leading results for the Fréchet Inception Distance, the FID score, for a while.
GANs kind of struggle with mode collapse. So, I did a project on GANs a few months ago. And I was trying to reconstruct Fashion-MNIST. And like half of my results came out as boots, when there should be shirts and hats and all kinds of stuff.
VQ-VAE was released, I think, 2018. We talked about that recently. But it compresses images into a discrete one-dimensional space. And then tries to predict it from the compressed space. And these kind of like went back and forth on what the state-of-the-art results were. Then in 2020, Song and Ermon took another stab at diffusion models.
They framed it in terms of this thing called score matching. Where their approach to this was very math heavy. But they're trying to understand how a diffusion process is able to learn a function to score the data in a high-dimensional space. And then they also introduced some processes for scaling that.
And then they introduced a slightly larger network. And based on their principled approach, they were able to increase the sample size and quality of the images that they could generate. So just a snapshot from this. So they still kind of look bad, but better. And then their FID score was 10.
The big breakout paper for diffusion was this one, which is usually called DDPM, denoising diffusion probabilistic models. Similar to previous work, this one uses a forward and a reverse diffusion process. One of the key breakouts here was that they restricted this to only predicting the mean of the Gaussian distribution from which the noise was added.
And then they were able to show with a bunch of math that predicting the mean is equivalent to predicting the sample X, or the image sample that they're trying to find. So effectively, they're trying to predict the sample X, but they do that by predicting the noise applied to X.
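To make that equivalence concrete, here is a small NumPy sketch under the standard closed-form Gaussian forward process. The linear schedule and its endpoints are illustrative assumptions, and the true noise `eps` stands in for what a trained noise-prediction network would output; no network is trained here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed linear noise schedule; the exact values are for illustration only.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def q_sample(x0, t, eps):
    """Closed-form forward process: jump from x0 straight to the noisy x_t."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def x0_from_eps(xt, t, eps):
    """Invert q_sample: knowing the noise is the same as knowing x0."""
    return (xt - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])

x0 = rng.standard_normal(8)
eps = rng.standard_normal(8)   # a trained network would predict this from (x_t, t)
t = 500
xt = q_sample(x0, t, eps)
x0_rec = x0_from_eps(xt, t, eps)
```

With a perfect noise prediction the original sample is recovered exactly, which is why a training loss on the noise can stand in for a loss on the sample itself.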
And then they also introduced a much larger network. So this is a U-Net that they pulled from image segmentation tasks. And they achieved a state of the art for FID. Again, this is 2020. I'm not going to go through all this. Song came back pretty shortly afterwards. And he iterated on his last framework.
More stuff with stochastic differential equations. Some of these papers or all these papers that I've talked about sort of form the underpinnings for current work. They're really kind of like the landmark ones. And people keep referencing particularly this, the DDPM paper. And again, the results look even better. You can see here, this is the score function in the middle.
From the Song paper. Okay. So jumping ahead to, actually, let me pause there. Any questions? A lot of the stuff is kind of background, so trying to go through quickly. But that doesn't mean it's not interesting or important. I had a side tangent on video diffusion. And then also, I think the other thing I was waiting for you to cover, but it looks like you stopped at like 2020, right?
Did the paper be covered? Yeah, I think so. 2021. You didn't cover latent diffusion? Was that latent diffusion? No, I didn't. Well, latent diffusion is very important. And then consistency models and flow matching. Those are the three things that I think the last three years of diffusion have kind of taught us.
For sure. Yeah. We've covered those things in previous paper clubs. Yeah, I think that the background was super helpful. Thank you, Tyler. It was a good catch up for me. Okay, cool. Cool. So jumping into this one, it's 2021. So after the DDPM paper got state of the art on FID scores for ImageNet, it got a lot of attention.
The research kind of spiked afterwards. A lot of the focus was still on applications with image generation. But increasingly, it started to become about text modeling and other modalities. So audio, video. This was one of the first ones I could find with reference to text modeling. And this is the Austin et al.
2021, Structured Denoising Diffusion Models in Discrete State-Spaces. So let's see. One of the quotes from the paper is: we develop structured corruption processes appropriate for text data, using similarity between tokens to enable gradual corruption and denoising. Expanding further, we also explore corruption processes that insert [MASK] tokens, which let us draw parallels to autoregressive and mask-based generative models.
I'm actually going to skip the slide. So the process they introduce is they think of, you know, a string of text as a sequence of discrete tokens. We can reshape it into a matrix. And then if we're randomly masking a token, we can do that by sampling Gaussian noise.
But each one of these samples is discrete, versus some of the previous approaches we talked about, which assume that the distribution is continuous, which is more appropriate for like color values or, you know, an audio signal, whereas for text, we're treating this as just one of the values from a vocab.
And we can iteratively apply noise using these masks. So again, this is the forward process q of x_t conditioned on x_s. And they use like a transition matrix, which I think is from, like, a Markov process. But feel free to jump in if someone has a better understanding of that.
But at each time step, they determine, you know, randomly, based on a probability from this transition matrix, should we mask a token at each time step t. And then the reverse process is similar to that, where if we know what the true distribution is of the data x_0, we know like what the unmasked value should be.
And so we're able to sample from like the reverse of the distribution to unmask it. But of course, you know, in actual generative modeling, we don't know what the true distribution of x_0 is. So in our, you know, our reverse process, probability of X given theta, we can instead approximate this using our neural network to determine what the probability should be.
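A small sketch of the masking forward process described above, written as a Markov transition matrix with an absorbing [MASK] state. The four-word vocabulary, the per-step masking probability `beta`, and the function names are assumptions made up for illustration; the learned reverse model is not shown.

```python
import numpy as np

VOCAB = ["the", "cat", "sat", "[MASK]"]   # hypothetical tiny vocabulary
MASK_ID = 3

def absorbing_transition(beta, k=len(VOCAB), mask_id=MASK_ID):
    """Row-stochastic matrix: keep the token with prob 1 - beta, mask it with prob beta."""
    Q = (1.0 - beta) * np.eye(k)
    Q[:, mask_id] += beta                 # [MASK] row becomes absorbing (prob 1 of staying)
    return Q

def forward_step(tokens, Q, rng):
    """Sample x_t from q(x_t | x_{t-1}) independently at each position."""
    return np.array([rng.choice(len(Q), p=Q[tok]) for tok in tokens])

rng = np.random.default_rng(0)
Q = absorbing_transition(beta=0.3)
x_start = np.array([0, 1, 2, 1])          # "the cat sat cat"
x = x_start.copy()
for _ in range(20):                       # more steps, more positions masked
    x = forward_step(x, Q, rng)
```

Each position either keeps its original token or has been replaced by [MASK], and once masked it stays masked, which is the gradual corruption the reverse network learns to undo.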
So this is, this is what the network learns. And here's kind of a sample of like what the what the model is, this is a figure from the paper. And based on the LM1B task to generate new sentences. And then the bottom is the bottom portion on the right is reconstruction samples.
And so what this is trying to show is like on the, on the top as T goes up, more stuff is masked. And then on the bottom as T moves forward, more stuff is unmasked. And then if we learned a good representation of the, of our data, then what gets reconstructed is, is like pretty close to what the data should be.
So we can see, you know, it masks. The original is Caterpillar is eager to expand in Asia. And then the reconstructed version is Caterpillar is eager to expand in China. Some of the claims in the paper, this is a big block quote that I'm not going to read, but I encourage people to check out because I think it's pretty interesting.
But they claim that BERT, viewed through this lens that they've established, is a one-step diffusion model. And the autoregressive models are discrete diffusion models. And the generative masked language models, so the type of model they've proposed, or masked language models in general,
so like BERT, are diffusion models. So interesting parallel with kind of like previous NLP research, and trying to like establish a background for their approach. This plays out in some of the future papers. So I wasn't sure how much depth to go in. So I cut out a bunch of stuff.
Um, so this one was 2021. Um, there's a ton of. Sorry Tyler. Oh yeah. I was going to actually ask a question on this. I think this feels actually kind of important. Um, so I want to talk through my thinking of it. And I think you, you will know this a lot more than me.
I think BERT is a one-step diffusion model because we add noise, right? We corrupt the tokens. That's why it's a one-step diffusion model. I'm a little bit stuck on how autoregressive models are discrete diffusion models. Could you talk a little bit more about that piece? Yeah, totally. Um, so I think there, um, I'm not as, as firm on this as, as I, as I could be.
But you know way more than me at this point. I don't know. I went really broad and skimmed a ton of papers. And I can share a bit of a high level about what they're saying here. So what's the intuition? Yes, please. Discrete diffusion in this sense is basically them.
So at each step there's a deterministic answer and they're just masking the next token, right? So when you train the autoregressive model, what you're basically doing is you're predicting one token at a time, and in a sense, right, you have deterministic outputs of what the token should be.
You're just now masking one token at a time. So it's deterministic in the sense of, you know, you know exactly what it is. It's equivalent to like a single diffusion step, but instead of stochastic diffusion, it's deterministic because you know what the token should be. But yeah, I like read a little bit more about this and that's, that's kind of all they're saying.
They're just saying you're masking token by token and that's all it is. Right. So it is discrete as opposed to probabilistic or continuous like images because images, image pixels are continuous, right? Right. In this case, it's text tokens. That's why it's discrete. Is that it? Images, images are continuous in the sense of you're predicting how much noise was applied and what the noise is.
In this case, you have exactly one token of measurable, predictable, discrete change. Right. So that's, that's how they're framing the idea, which, which kind of in some sense makes sense, right? Like how much noise did I apply over an image? There's a spectrum of answers, right? And then your loss is measured differently.
In this case, it's a very discrete. Yeah. I also wonder in this case, like maybe the word big, it could be large or huge. I think all of them, all of them are actually, the word big itself may not be the, you have many synonyms that could also be correct answers.
So in that sense, framing as discrete loss or discrete diffusion. Never mind. I don't know enough of this to comment more. Sorry, please go ahead. So in that case, why are masked language models? Oh, RJ, go. Yeah, sorry. One interesting thing that I learned when I did the consistency paper was that the diffusion process is actually the model from the beginning, given the predictions up until now.
Right. So like, it's not, it's not just like given the last prediction, what's the next step. It's given, what's the next step, given all the predictions up until now. And this is the same for our language model. And so that it, in, from that perspective, it looks very much like an auto regressive model.
Right. That makes sense. Yeah. And the other thing, maybe this is stating the obvious, but BERT, with the masked language modeling, is just sort of like a generalization of masking only the next token. Right. So it's sort of like you just happen to have always masked the next token every time for the autoregressive model.
Whereas with a masked language model, you're masking anywhere, or like a chunk of tokens inside of the text, instead of at the end. Right. That makes sense. I kind of think like, BERT is like, you're masking tokens in the middle, like 15% of the time.
Yeah. Whereas for auto regressive, you're masking everything ahead. Yeah, exactly. Um, everything's a mask. Okay. And then they denoise one position at a time, if you think of it that way. But, but, but maybe more. Yeah. So maybe more just like you mask whatever the next token is. Or, or maybe it, maybe it's better.
Maybe what you said is better way to look at it in the sense it, because it parallels what I said about diffusion models in the, um, like for whatever the stable diffusion or whatever. No, that's totally right. Yeah. And then going to get into it in a minute. Um, at least a little bit is that, um, some of the state of the art stuff for language diffusion, um, of course they, they lean heavier on big transformer models.
And they're able to remove kind of the causal mask that's applied for attention. So you think of like, you know, the bottom triangle that we usually see for attention, or if you have like grouped query attention, it's like the little squares down your triangular matrix. For diffusion language models, they're able to look at, you know, the whole KV for the attention operator.
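The difference between the two attention patterns can be sketched with boolean masks. This is a generic NumPy illustration rather than code from any of the papers; the scores are all zeros on purpose so that the resulting weights show only the effect of the mask.

```python
import numpy as np

def causal_mask(n):
    """Lower-triangular mask: position i may attend to positions j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def bidirectional_mask(n):
    """Full mask: every position may attend to every other position."""
    return np.ones((n, n), dtype=bool)

def masked_softmax(scores, mask):
    """Softmax over keys, with disallowed positions set to -inf first."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

n = 4
scores = np.zeros((n, n))  # uniform scores, so weights only reflect the mask
causal = masked_softmax(scores, causal_mask(n))
full = masked_softmax(scores, bidirectional_mask(n))
```

Under the causal mask the first position can only see itself, while under the full mask every position spreads its attention over the whole sequence, including tokens that come later.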
So, so if I'm jumping ahead here, I'm sorry, sorry, please finish, uh, yeah, no, that's it. So if I'm jumping ahead here, let's say I have a code base and I say, I want to refactor this specific object that's being used. Discrete model, uh, a generative model just fills in the blanks.
And that's why you don't actually have to go left to right, like autoregressive models, which you have to go left to right, even when they're just filling in the blanks. Actually, I don't know if there's, uh, better techniques for filling in the blanks with an autoregressive model for code models, but discrete model can just say, okay, someone will just draft out, okay, these are all the definitions I need, all the functions I need, some autoregressive model draft that out and then discrete model just fill in the blanks and it can do that very fast.
I'm, I'm just kind of trying to match it to what we saw at Google I/O when they were demoing this, um, sorry, I mean a diffusion model for coding. Totally. Yeah. I mean, I think one of the challenges is too, is that the context for things that are ahead.
Um, so if you're, you know, in that example, and I, I don't know the specific process that are applied for, for code models and assuming it's, you know, mom with a, with a similar loss function, um, and that they're just sampling from, um, saying, here's what it is, here's what we want.
And it's able to interpret that there should be a larger output space, or the output sequence should be longer. That's true. Expanding that dynamically, right? Yeah, totally. Yeah. But if you have like, you know, your hundred-line file, if you're doing it autoregressively, and you're making an edit at character 50 or something, you have to feed in the first 50 lines in context and say, we want the edit at line 50, and then also feed in the remaining 50 lines. And then your autoregressive model has to attend to the stuff that's ahead of it with reference to the stuff that's behind it, versus if we don't have the causal mask, it can attend to all of it at once, or it's able to attend to it at once in a different way.
But hopefully I'm not. Makes sense. Thank you, Tyler. This was helpful for my intuition. Thank you. For sure. Cool. So yeah, there's a bunch of papers that happened from 2021. I'm jumping ahead to 2024. The state of the art kind of like kept ticking up a little bit, but still significantly worse than autoregressive models,
even for comparable model sizes. There was stuff like the back and forth between discrete state spaces and continuous state spaces, where they have like a continuous representation and then they have like a separate process. So the continuous representation is like an embedding, and then they have a separate process to sample from the embedding to determine what the token should be,
which hasn't really panned out in terms of what is now state of the art. So that's part of the reason I chose to skip over it. But I encourage folks that are interested to look into it. There's a bunch of papers that came out kind of in this time span that I'm glossing over.
One of the next ones that I thought was pretty cool, and I think this is kind of looking backwards, it's sort of what the state of the art is and what's referenced from that. So looking at the study in October of last year, where they started scaling up masked diffusion models.
And they were able to achieve results that are competitive with autoregressive language models of similar sizes and with relatively similar levels of training compute. So they trained up to a 1.1 billion parameter model. And then, depending on the benchmark, it was competitive with GPT-2, the 1.5B version, and then Llama 2, the 7B version.
So those weren't quite state of the art when this was released, but still kind of a step forward. And then, I don't know, I kind of like this scaling work just to see like the number goes down as compute goes up.
There were pretty graphs. So, you know, the classic Kaplan, Chinchilla isoFLOP curves, based on our training budget, we make stuff go down. And here they're saying they followed a similar scaling law to autoregressive models with their masked diffusion models,
but with some constant multiplier. But the general curve, you know, on our log-log plot was pretty similar for their approach. Digging into this a little bit more, to achieve a similar validation loss, they had to have 16x more compute than the autoregressive model, under the approach that they model here.
So that's the left plot. And then on the right plot, they were able to achieve similar, what is this? I think better performance with fewer parameters is the right plot. And then, yeah, here's their results, which are competitive with the models we listed.
This one we talked about last week, so I'm not going to dig into it too much. I just want to hit some things that I thought were cool that we didn't quite touch on. So this is most of the same authors as the scaling laws paper, where they continue the trend and scale up to 7 billion parameters.
And then the results got even better. It's now as good or better than Llama 2 7B or Llama 3 8B on a handful of different benchmarks. One of the things, you know, we were just talking about is the bidirectional reasoning. So one of the tasks they looked at was reversing a poem.
So if you have like the last couple of lines in a poem, can you predict the lines that came before it? And for this specific task, this model significantly outperformed most of the other models they tested against, including GPT-4o. And they attribute that to kind of the lack of the causal mask for attention that we were just talking about.
So again, a figure from the paper where they're masking stuff. They also introduced this step where they're able to score the probability of a token that's predicted by the mask predictor. And things that are low probability, they can then potentially remask and then try to resample again to get a better prediction.
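One simple way to picture that remasking step is to keep the most confident predictions and put a mask back over the least confident ones. The sketch below is only an illustration of that idea: the `MASK_ID` sentinel, the function name, and the choice to remask a fixed count are assumptions, and the probabilities are hand-written rather than produced by a real mask predictor.

```python
import numpy as np

MASK_ID = -1  # hypothetical sentinel for a masked position

def remask_low_confidence(probs, n_remask):
    """Take the argmax token per position, then remask the least confident ones.

    probs: array of shape (seq_len, vocab_size); each row is a distribution
    that some mask predictor (not implemented here) assigned to a position.
    """
    tokens = probs.argmax(axis=-1)
    confidence = probs.max(axis=-1)
    lowest = np.argsort(confidence)[:n_remask]  # positions with lowest confidence
    tokens = tokens.copy()
    tokens[lowest] = MASK_ID                    # these get re-predicted next step
    return tokens

probs = np.array([
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.10, 0.80, 0.10],
    [0.34, 0.33, 0.33],
])
out = remask_low_confidence(probs, n_remask=2)
```

Positions 1 and 3 have the flattest distributions, so they are the ones handed back to the predictor for another try.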
Um, so that's, um, the right or part C of this figure, um, where they're remasking, um, the figure and then re-predicting it again, which is kind of neat. Um, and then more scaling loss stuff, um, for different tasks, um, their model was able to, um, achieve better performance at, at lower, um, training compute.
Um, and sometimes it's worse. So for the middle bottom plot GSM, GSM 8K, um, which I think is a math focused task, um, they outperformed, um, their autoregressive baseline and then to the plot to the left of it, the bottom left plot there, um, we can see all the orange stars are sort of below, um, the, the blue dots.
And so lower is worse accuracy, and this is a zero-shot task, for the same level of compute. So definitely some trade-offs there. This next one is block diffusion. This one came out in March of this year.
This uses a hybrid architecture where it has blocks: the blocks are generated autoregressively, one after another, but within each block the tokens are generated using diffusion. Part of the reason they did that, I think, was to make the sampling task easier.
But they were also able to take advantage of KV caching, whereas this previous paper, despite the results, hasn't been optimized to use a lot of the tricks like KV caching. I didn't dig into the specifics of why KV caching doesn't work for this previous paper.
Thinking about it, my intuition is that if you have a dialogue between the system and a user, it has to generate the next response starting from zero, as opposed to finding a way to cache the previous tokens like you can if you're generating autoregressively. Block diffusion was able to get around that by having these chunks of a fixed length that they then generate.
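The block-by-block idea can be sketched as a toy loop. This is only an illustration of the control flow: `toy_denoiser` is a random stand-in for a trained model, and the even-share unmasking schedule is an assumption, not the schedule from the block diffusion paper. The point is that finished blocks are never revisited, which is what would let a real model cache their keys and values.

```python
import random

MASK = None  # placeholder for a masked position

def toy_denoiser(prefix, block, rng):
    """Stand-in for a trained model: proposes a token for every masked slot.
    A real model would condition on `prefix` (via cached keys/values) and `block`."""
    return [tok if tok is not MASK else rng.randrange(100) for tok in block]

def generate(num_blocks, block_len, steps_per_block, seed=0):
    rng = random.Random(seed)
    prefix = []  # finished tokens; a real model would keep their keys/values cached
    for _ in range(num_blocks):            # autoregressive over blocks
        block = [MASK] * block_len
        for step in range(steps_per_block):  # diffusion-style unmasking inside a block
            proposal = toy_denoiser(prefix, block, rng)
            masked = [i for i, tok in enumerate(block) if tok is MASK]
            # Unmask an even share of the remaining slots at each step.
            remaining_steps = steps_per_block - step
            n_unmask = -(-len(masked) // remaining_steps)  # ceiling division
            for i in rng.sample(masked, n_unmask):
                block[i] = proposal[i]
        prefix.extend(block)  # block is final; later blocks only read it
    return prefix

sequence = generate(num_blocks=3, block_len=4, steps_per_block=2)
print(len(sequence))  # 12
```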
So again, we talked about this last week: the large language diffusion paper applied pretty similar pre-training. I think they used like 2.3 trillion tokens, and we saw 10 to the 23rd FLOPs on H100s, but they didn't do any post-training.
And then some of the new work that's coming out is improving the post-training, especially with a reasoning slant. So this paper uses as its base model the LLaDA model that was introduced in that paper, and then applies their custom GRPO post-training.
And then they use the s1 reasoning dataset to do supervised fine-tuning, with a slant towards math and reasoning and code tasks. And based on that, they're able to dramatically improve the performance on those specific tasks over the base model.
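For context on the GRPO part, here is a minimal sketch of the group-relative advantage from the standard GRPO formulation: sample several completions for one prompt, score them, and normalize each reward against the group. This is not the paper's custom variant, whose details weren't covered in the talk; the function name and the reward values are made up for illustration.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt.

    Completions that beat the group average get a positive advantage,
    the rest get a negative one. `eps` avoids division by zero when
    every completion receives the same reward.
    """
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical rewards for four sampled answers to one math problem
# (1.0 = correct final answer, 0.0 = incorrect).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

The appeal of this setup is that it needs no separate value network; the group itself serves as the baseline.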
Um, so that's the, the bottom table here, they're in green. Um, the top row is the, the base model. Um, so they, they bumped up the numbers a good amount. Um, let's see, and that's, that's all. So thanks all. This is really good. Um, Tyler, I, there's a question that Eric posed in the channel that I'm also curious about.
How does the compute for the output window scale, in big-O terms? Is it N squared for diffusion models or is it linear? Oh, would you know? I don't know offhand. Some of the big ones that we hear about, like Stable Diffusion, DALL-E image gen, diffusion transformers, video diffusion transformers.
Those are all transformer based. So there is that quadratic scaling issue, but some of them are not. So some basic denoising diffusion probabilistic models, they're more CNN based. And then there's no longer that transformer complexity issue. So depending on what you're doing, like some early work trying to do diffusion for completion.
So like short completion for codegen stuff is not transformer based. So you're no longer complexity bound. But I found it weird, because, you know, that's completion. It's not long context. So I don't know. It's just what they did though. But some of the latent diffusion models, like from Stability, those I believe are also not transformer based.
So, you know, that's more popular. You've probably heard of some latent diffusion stuff. They don't use transformers for the diffusion itself. I think they just use it for the text encoding. So it kind of depends on where you see the quadratic scaling complexity.
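As a back-of-the-envelope take on the scaling question, here is a small counting sketch. It only counts query-key pairs scored by attention and ignores layers, heads, the prompt, and everything that isn't attention, so it's a counting exercise under those assumptions rather than a benchmark. Both function names are made up for illustration.

```python
def autoregressive_pairs(n: int) -> int:
    """Query-key pairs scored when generating n tokens one at a time with a
    KV cache: the i-th new token attends to i tokens (itself and all before it)."""
    return n * (n + 1) // 2

def full_attention_pairs(n: int, steps: int) -> int:
    """Query-key pairs scored when each of `steps` denoising passes runs
    bidirectional attention over all n positions with no caching."""
    return steps * n * n

for n in (256, 1024, 4096):
    print(n, autoregressive_pairs(n), full_attention_pairs(n, steps=64))
```

Both counts grow quadratically in the window length `n`; for a transformer-based denoiser, the number of denoising steps acts as an extra multiplier on top of that.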
That's cool. Yeah, I guess we have a few follow-up open questions now. I know RJ had a question. I don't know if you want to come on camera and just ask it. Oh, no, I was commenting that I think that most, or Stable Diffusion anyway, has attention blocks inside of the diffusion block.
But that would be across all the latent space tokens, so that's a fixed size and wouldn't be impacted. So I'm not sure that that impacts a text diffusion model, except because you would, yeah.
If you look at BERT as being a diffusion model, right, then it obviously has a transformer block. So I think the question is a little bit orthogonal, right? Because I view diffusion as an alternative process to autoregressive modeling, and not so much transformer versus non-transformer.
Yeah, people have incorporated transformers into diffusion models in a handful of different ways. So let me see if we can pull this up. I think that's my old desktop. Let's see. Well, so this is the U-Net image that I pulled from one of the slides.
This is from Prince's Understanding Deep Learning, which is pretty solid. So starting with DDPM, this is the base model architecture that they use to predict the noise at each time step. So they have the previous time step and then it's fed into this.
And then the output is the image minus the noise. Or sorry, I think it's actually just the noise itself, as a difference from the image. But regardless, they've got a bunch of convolutional blocks scaling it down and then back up.
They don't make it very clear, and the color's not very good, but even within the original DDPM model from 2020, there are attention operators at the 16 by 16 chunks here. So it's able to attend to what came in. And then as models got bigger and more complicated, people started tossing in more complicated model architectures and a lot more attention at different parts of this.
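To illustrate what attention at the 16 by 16 resolution means, here is a minimal single-head self-attention over a feature map. There are no learned projections, and the channel count and random input are hypothetical; this is not the DDPM implementation. The thing to notice is that the number of attention tokens is fixed by the spatial size (256 here), which matches the earlier point about latent tokens being a fixed size.

```python
import numpy as np

def spatial_self_attention(feature_map):
    """Single-head self-attention over a (H, W, C) feature map.

    Each spatial location becomes one token, so a 16x16 map gives 256 tokens
    and a 256x256 attention matrix. No learned projections, for brevity.
    """
    h, w, c = feature_map.shape
    x = feature_map.reshape(h * w, c)             # one token per location
    scores = x @ x.T / np.sqrt(c)                 # (H*W, H*W) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return (weights @ x).reshape(h, w, c), weights

rng = np.random.default_rng(0)
features = rng.normal(size=(16, 16, 8))  # hypothetical 16x16 map with 8 channels
out, weights = spatial_self_attention(features)
print(out.shape, weights.shape)  # (16, 16, 8) (256, 256)
```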
A lot of them still retained this U-Net shape, but stuff got fancier from there. So that's one approach. And then I don't know if I can find the right... And then, can I just add to that? Yeah, please. So as Vibu was saying, the attention is primarily used in some of these earlier image diffusion models for the time embedding space, as well as the prompt that would guide the image generation.
Sometimes these prompts are text prompts, but sometimes these are image prompts like depth maps or contours or outlines of different things. So the attention is basically used as a grounding mechanism to flow forward, but the actual process itself is diffusion. So there is a separation that we can make.
The attention is used for helping the model understand the semantics of the image as well as for the generation. But the actual diffusion process itself is orthogonal to that. Yeah. The term for that is conditioning. So they call it conditioning sometimes. Yeah. And there's even class conditioning. So you can have class-based labels and guide generation towards that.
So like stylistic labels, right? I want anime, and that's separate from text conditioning of like "ultra realistic". And you know, you're basically using text embeddings, but the attention there is typically done with something like cross-attention over a text embedding dimension. And that's separate from diffusion scaling, you know, a different complexity.
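The cross-attention idea can be sketched like this: queries come from the image latents, keys and values come from the text embedding. The sizes below are hypothetical and there are no learned projections, so treat it as a shape-level illustration rather than any particular model's conditioning layer.

```python
import numpy as np

def cross_attention(latents, text_emb):
    """latents: (n_img, d) queries; text_emb: (n_txt, d) keys and values."""
    d = latents.shape[-1]
    scores = latents @ text_emb.T / np.sqrt(d)    # (n_img, n_txt)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over text tokens
    return weights @ text_emb, weights

rng = np.random.default_rng(0)
latents = rng.normal(size=(256, 8))   # hypothetical: 16x16 latent grid, 8 channels
text_emb = rng.normal(size=(10, 8))   # hypothetical: 10 prompt tokens
conditioned, attn = cross_attention(latents, text_emb)
print(conditioned.shape, attn.shape)  # (256, 8) (256, 10)
```

The attention matrix here is `n_img` by `n_txt`, so its cost grows with the product of the two lengths rather than with the square of either one, which is the "different complexity" being described.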
Correct. And then things get much more complex. They have something called ControlNets, which is a slightly different concept than conditioning. Yeah. It's interesting where, you know, we basically had our GPT-3 moment, where we used to have all these temporal nets to fix consistency when you scale stuff up.
And then with Sora, you know, it turns out that video generation is just scaled-up diffusion, and you just scale it up a lot and you solve a lot of these little nuances. And it just kind of works. So I guess I don't want to be argumentative, but I think that, unless I'm misinterpreting this image that I put in the chat, I'm pretty sure it's saying that there are actually attention blocks in the backbone of the diffusion.
Not just in the conditioning. Okay. So I don't know if I can pull it up, or someone can share it if you want. Yeah. I mean, it's been a while since I last looked at it, but that attention is primarily for the conditioning, whether it's text or image prompts.
Okay. Well, I guess we can argue about it offline. I think, as far as I can tell, as Tyler was noting, it's actually part of the U-Net backbone, or whatever the backbone is made of.
And in Stable Diffusion, the diffusion is a U-Net, right? Yeah. Anyway. I'll always help pull for more background information. Thanks again, Tyler. Really, really fun one this time. I think Cirque had to take a call. So just wrapping things up here: next week, we have the AI Engineer World's Fair.
So if any of you guys are around, you know, we'll share something on Discord. We'll do like a little meetup. I think we have time for an in-person paper club workshop thing. We're supposed to be announcing our Test of Time Paper Club V2. So we'll share it remote too, but TBD on what the paper is. We'll have some sort of session in person and also remote next week.
So if you're around at the conference, come by. Otherwise, you know, same Zoom thing, and then it'll just be a different one. But yeah, thanks everyone for coming. Thanks, Tyler, for sharing. Thank you. Tyler, someone asked about slides. Oh yeah. I'll post those in the Discord. Yeah. Thanks everyone.
Perfect. Yeah. Thanks. Thank you.