
Language Diffusion Survey


Whisper Transcript

00:00:00.000 | All right. So jumping into it.
00:00:05.080 | So just general, yeah, intro yourself. Don't forget. We want to know about you and your personal
00:00:11.640 | why. Okay. Yeah, there you go.
00:00:13.480 | Yeah, for sure. Yeah, so I'm currently a master's student at Georgia Tech. I know a handful of
00:00:20.040 | folks in this community have also taken some classes there. But yeah, I quit my job about a
00:00:27.100 | year ago to go full time just to speed up the process and take some harder classes and kind
00:00:31.120 | of dig a little deeper. So I've got about a year left trying to focus on machine learning stuff
00:00:39.280 | and a bunch of systems courses. I didn't have like a CS undergrad, so I'm really trying to fill in
00:00:44.440 | some gaps there. But did some web dev stuff, some data engineering, some information retrieval stuff
00:00:50.080 | at a legal company, and then at Rosetta Stone before that.
00:00:54.200 | Let's see. So I guess sort of the motivation for this is definitely like the Gemini diffusion
00:01:02.700 | announcement. I know that Mercury got introduced by Inception Labs, I think back in January,
00:01:10.680 | which they're not a big outfit. But one of the people on some of these papers is one of the founders
00:01:19.020 | there. And one of their big claims is just like they can achieve similar levels of accuracy at much
00:01:25.520 | faster output speed. So I think that's just an interesting dimension. I guess sort of for the
00:01:34.360 | structure of this talk, I think about this as like, how do we teach in general. I spent a little bit of time
00:01:42.260 | teaching at a coding boot camp. And I'm trying to keep you guys in that flow state between boredom and
00:01:48.060 | anxiety. So, you know, if it's going too fast, feel free to ask questions. If you're bored, I'm sorry.
00:01:52.580 | You know, take a look at some of the papers that are referenced. This is definitely optimized for
00:01:58.580 | exploration as opposed to exploitation. So I'm going over a bunch of different papers, but not very deep,
00:02:03.620 | but happy to kind of pause and redirect as needed. Don't want to keep this like super formal.
00:02:08.420 | nice.
00:02:13.620 | Did he freeze?
00:02:19.780 | Did in fact freeze. Oh, no.
00:02:24.260 | Yeah. Okay. It's okay.
00:02:26.260 | We broke out of flow.
00:02:27.620 | Flow is still going. Keep it going. Who's seen Gemini diffusion? Thousands of tokens per second.
00:02:34.260 | So smart.
00:02:35.060 | I don't think thousands. I think I think a thousand, right?
00:02:38.420 | We shall see. Let's ask Gemini. How many tokens per second is Gemini diffusion?
00:02:48.420 | Yeah, let's let's also ping him on discord.
00:02:52.660 | I think that there's quite a few ways to still optimize it. Wow. Gemini is so smart. Gemini says that
00:03:00.100 | it says it is 1479 tokens per second. But there's a brief overhead delay of 0.84 seconds.
00:03:09.300 | That's rough. You got to wait a second for your thousand tokens per second.
00:03:15.380 | There's a question here from IO about are diffusion models auto-regressive? Typically not. But you can
00:03:21.860 | have an auto-regressive body and a diffusion head.
00:03:27.540 | Okay. Sorry about that. I think Zoom crashed on me. That's the last of it. Cool. So I'll try to run
00:03:39.540 | through this. So really just kind of motivating this with some of the foundations. What is a generative
00:03:45.860 | model in the abstract? We're trying to model kind of an underlying probability distribution of data
00:03:53.140 | that's unknown. And then once we learned the distribution, then we're able to do stuff with
00:03:58.340 | it. So this slide is based on a talk from Yang Song, who did some foundational stuff starting kind of
00:04:06.020 | in 2019 and has been a big contributor to diffusion models in general since then. But I thought his graphic
00:04:13.380 | was pretty cool. So we've got the probability of a data point x under the true distribution
00:04:32.420 | of the data. And then we're trying to find a model that's able to learn that distribution based on
00:04:39.700 | learnable parameters theta. And during training, p theta of x is treated as a likelihood function,
00:04:46.580 | measuring how well it describes the observed data under the current parameters theta. We can think about this
00:04:52.020 | as a KL divergence between the two probability distributions and sort of the optimization function
00:04:57.460 | we're trying to do is minimize the difference between these two distributions. Another equivalent
00:05:03.940 | framing is to say that we're trying to find the parameters theta that maximize the likelihood of
00:05:12.100 | the observed data. And then we can use this equivalence relation. The problem is that,
00:05:18.020 | for most data distributions that we care about, the shape of the distribution is really complex and
00:05:28.340 | hard to model. And then it's also hard to find a set of weights that are able to appropriately model it.
00:05:36.180 | So in that naive case, we can say we're using a Gaussian distribution, which isn't very expressive and
00:05:46.020 | isn't able to model complex spaces. Or we can use kind of a neural net, which, by the universal approximation
00:05:56.420 | theorem, is able to model our data. And then we can sample from it. Sort of the challenge becomes, you know,
00:06:02.420 | how do we make sure that it's appropriately expressive, but isn't too expressive. So another view of this is
00:06:12.820 | just like a narrow slice of a two-dimensional mapping of sort of the feature space that Claude 3 learns.
00:06:25.140 | So just another kind of perspective into what the general modeling task is. We're trying to learn this
00:06:30.980 | distribution or a distribution that's able to model the true space of the data that we care about.
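To make the objective above concrete, the two framings mentioned are equivalent: minimizing the KL divergence between the data distribution and the model is the same, up to a constant, as maximizing the expected log-likelihood of the data under the model.

```latex
\theta^{\ast}
  = \arg\min_{\theta} \; D_{\mathrm{KL}}\!\left(p_{\mathrm{data}}(x)\,\|\,p_{\theta}(x)\right)
  = \arg\max_{\theta} \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log p_{\theta}(x)\right]
```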
00:06:40.100 | Cool. So jumping into a bunch of different types of generative models. Again, this is sort of more
00:06:48.980 | background information. This is inspired by an old talk by Ian Goodfellow, but breaking down the
00:06:58.500 | generative models into explicit density and implicit density. I'm starting kind of on the left branch here.
00:07:04.740 | We're saying that a generative model aims to learn the underlying probability distribution of the data. Cool.
00:07:10.100 | So that's what we just discussed. And then the explicit density
00:07:13.780 | is we're defining p theta x. Again, using the maximum likelihood. These are split, again,
00:07:25.380 | into tractable density models and approximate density models. The tractable density models include
00:07:32.660 | autoregressive models and normalizing flows. So sort of the classic formulation for autoregressive models.
00:07:40.260 | RNNs, LSTMs, GPT, et cetera, kind of follows this joint probability distribution of a single token,
00:07:55.300 | given the previous tokens. And that just kind of continues autoregressively.
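For reference, the autoregressive factorization being described is the chain rule over tokens, with each token conditioned on everything before it:

```latex
p_{\theta}(x) \;=\; \prod_{t=1}^{T} p_{\theta}\!\left(x_{t} \mid x_{<t}\right)
```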
00:07:59.140 | For approximate density models, we're still modeling the density itself, but it's intractable to compute
00:08:16.340 | that directly. So we use an approximation on that. And so these both use lower bounds on what the density
00:08:27.060 | function we're trying to learn is. So we talked about VAE or VQVAE recently. I think Ted did a talk on that.
00:08:36.900 | And then today we're getting into diffusion models, but this is sort of just like framing it with other
00:08:43.300 | types. And then, you know, going back up to our tree, we've got implicit density functions, which includes
00:08:50.740 | GANs, energy-based models, which aren't used a ton, but sort of inspire or motivate a lot of the -- or some of the
00:08:58.900 | approaches that are used.
00:08:59.780 | So with GANs, you've got a generator and a discriminator, and you're trying to
00:09:09.380 | generate a sample that's able to fool the discriminator.
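The generator and discriminator setup mentioned here is usually written as a min-max game: the discriminator D tries to tell real data from generated samples, and the generator G tries to fool it.

```latex
\min_{G}\;\max_{D}\;
  \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p(z)}\!\left[\log\!\left(1 - D(G(z))\right)\right]
```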
00:09:14.500 | But with each of all of these, we can view it as forming some sort of latent representation of
00:09:24.180 | what our data distribution is, compressing it into a different form and then trying to learn
00:09:34.900 | how to reassemble it from that latent representation. There's a lot of kind of back and forth between
00:09:44.500 | the different generative modeling approaches. The space is pretty rich and people pollinate ideas
00:09:54.020 | between all of this stuff. Again, this is sort of hopefully not too much in the boredom side of
00:09:59.860 | things. But for folks that are newer to the space, I think it's helpful to kind of have the preliminaries.
00:10:06.260 | So jumping into diffusion models foundations. Actually, let me pause for questions real quick and check chat.
00:10:18.420 | Just let me know if anybody's got something I can pause for.
00:10:22.020 | There's a question about ELBO. I vaguely remember ELBO. I forget what ELBOs are.
00:10:27.300 | Yeah, I think it's the evidence lower bound. So it comes up in VAEs and I saw it in some of the diffusion stuff.
00:10:40.100 | But it's sort of a trick for how to work around learning, like the true probability distribution,
00:10:48.340 | because that's intractable. So we're saying we can get as close as this approximation
00:10:54.180 | of what the distribution should be.
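For reference, the ELBO (evidence lower bound) being discussed is the standard VAE-style bound: since the true log-likelihood is intractable, you maximize a lower bound on it instead, with q_phi an approximate posterior over the latent z.

```latex
\log p_{\theta}(x) \;\ge\;
  \mathbb{E}_{q_{\phi}(z \mid x)}\!\left[\log p_{\theta}(x \mid z)\right]
  - D_{\mathrm{KL}}\!\left(q_{\phi}(z \mid x)\,\|\,p(z)\right)
```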
00:11:02.980 | On my side, I didn't super get implicit versus explicit. I guess like we're used to explicit.
00:11:09.780 | Yeah, for sure.
00:11:13.460 | Also, how did you make these?
00:11:16.180 | This was an obsidian canvas. And I just took screenshots of it.
00:11:24.100 | Yeah.
00:11:24.100 | It looks great.
00:11:25.620 | Thanks.
00:11:30.100 | Like, every one is explicit, right? Based on the diagram that you...
00:11:34.740 | Every one except for GANs.
00:11:38.020 | Yeah. Okay.
00:11:39.780 | Because they do like the weird generator discriminator min/max game to like try to optimize.
00:11:48.260 | I don't think it's weird at all. I think we're about to see that with multi-agent. That's basically
00:11:53.620 | what OpenAI is working on.
00:11:54.580 | Oh, really? That's cool.
00:11:58.900 | Well, I'll bite my tongue.
00:12:00.100 | It's a little spoiler that we recorded an episode with Noam Brown.
00:12:06.660 | Damn. That's cool.
00:12:08.100 | Anyway.
00:12:08.100 | So, trying to go back as far as I could. The first reference I found to kind of iteratively
00:12:20.260 | applying noise was in this paper, Extracting and Composing Robust Features with Denoising Autoencoders.
00:12:26.820 | So, this is before like diffusion became a thing.
00:12:29.140 | Yoshua Bengio is on here.
00:12:31.380 | And then Pascal Vincent also shows up in a lot of kind of earlier work out of the University of Montreal.
00:12:37.860 | So, a quote from the paper, on destruction: for each input x, a fixed number νd of components are
00:12:46.500 | chosen at random and their value is forced to zero while the others are left untouched. All information
00:12:51.780 | about the chosen components is thus removed from the particular input pattern. The autoencoder will be
00:12:56.820 | trained to fill in those particular introduced blanks.
00:13:02.020 | So, they're just randomly picking elements and masking them out and then trying to learn a representation
00:13:12.180 | based on the masked stuff. They found that it improves sort of the performance.
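A minimal sketch of the corruption step being quoted; the function name and the NumPy setup are illustrative, not from the paper:

```python
import numpy as np

def destroy_components(x, num_destroyed):
    """Zero out a fixed number of randomly chosen components of the input,
    as described in the denoising autoencoder quote above (a sketch)."""
    x_corrupted = x.copy()
    idx = np.random.choice(len(x), size=num_destroyed, replace=False)
    x_corrupted[idx] = 0.0
    return x_corrupted

# The autoencoder is then trained to reconstruct the original x from
# x_corrupted, e.g. by minimizing a reconstruction loss between them.
```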
00:13:16.660 | Bengio picked that up a couple of years later. So, that was 2008. He had a handful of papers in between.
00:13:25.700 | And then, in 2013, this one I thought was kind of interesting where definitely in the same vein,
00:13:35.140 | but they've taken it a bit further. Quoting the paper: we have proven that training a model to denoise is a way to
00:13:39.380 | implicitly estimate the underlying data generating process and that a simple Markov chain that alternates
00:13:45.380 | sampling from the denoising model and from the corrupting process converges to the estimator.
00:13:50.420 | This provides a means for generating data from any denoising autoencoder. So, they've got, you know,
00:13:57.620 | they had a figure from MNIST where they're iteratively applying noise on one side and then walking it back
00:14:07.060 | on the other side. So, can we take a noisy image and reconstruct it? Which definitely lays the
00:14:13.380 | foundation for some of the stuff to come. This was like the first paper that people associate with
00:14:19.300 | the diffusion process. So, kind of like the past ones, a lot of this comes from physics. Like a lot of
00:14:30.740 | old school ML, like early 2000s and before is like statistics and statistical mechanics. So, one of the
00:14:40.500 | examples I've heard to describe this is that a stochastic process is like dropping a drop
00:14:47.700 | of ink in water and seeing it kind of spread out. And then the reverse process is like, how would you
00:14:52.660 | imagine where the drop came from? Okay. Okay. So, similar to the past ones, they iteratively add
00:15:09.220 | noise. They talk about stochastic differential equations, which the math is deeper than I fully
00:15:17.860 | understand or want to try to get into. But the underpinnings are just kind of these stochastic
00:15:23.540 | processes, which were heavily inspired by physics and some of the stuff into like how do particles move?
00:15:32.820 | How does heat and entropy associated with heat kind of perform?
00:15:37.860 | And some of the examples from that paper, they've got, like on the left, figure A, we've got CIFAR-10 holdout
00:15:50.420 | images. They corrupt them in figure B with like Gaussian noise. And then they denoise the images
00:16:00.420 | and try to reconstruct them. And we can see that the reconstructions are pretty decent. And I think
00:16:06.020 | this was with a two layer network. So, not a ton of parameters, but they're still able to get some
00:16:12.900 | interesting results. And again, yeah, we're gradually applying noise in the forward diffusion and the
00:16:19.540 | reverse diffusion, recovering the distribution. Kind of speed summary of 2015 to 2019.
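A rough sketch of the forward noising process described here, using DDPM-style notation; the beta schedule below is a common choice, not necessarily the one from the 2015 paper:

```python
import numpy as np

def forward_diffusion(x0, betas):
    """Iteratively add a small amount of Gaussian noise at each step,
    so the sample gradually drifts toward an isotropic Gaussian."""
    x = x0
    trajectory = [x0]
    for beta in betas:
        noise = np.random.randn(*x.shape)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
        trajectory.append(x)
    return trajectory

betas = np.linspace(1e-4, 0.02, 1000)  # an illustrative noise schedule
```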
00:16:26.900 | 2017. GANs got released. I think Bengio was also on the GAN paper with Ian Goodfellow, I think around 2014.
00:16:35.460 | They blew up, they had kind of leading results for the Fréchet Inception Distance, the FID score,
00:16:47.620 | for a while. But GANs kind of struggle with mode collapse. So, I did a project on GANs a few months ago.
00:16:59.300 | And I was trying to reconstruct Fashion-MNIST. And like half of my results came out as boots,
00:17:08.180 | when there should be shirts and hats and all kinds of stuff. VQVAE was released, I think, 2018. We talked
00:17:19.060 | about that recently. But it compresses images into a discrete one-dimensional space. And then tries to
00:17:24.740 | predict it from the compressed space. And these kind of like went back and forth on what the state-of-the-art
00:17:34.100 | results were. Then in 2020, Song and Ermon took another stab at diffusion models. They framed it in
00:17:47.700 | terms of this thing called score matching. Where their approach to this was very math heavy. But they're
00:17:59.540 | trying to understand how a diffusion process is able to learn a function to score
00:18:04.820 | the data in a high-dimensional space. And then they also introduced some processes for scaling that.
00:18:13.940 | And then they introduced a slightly larger network. And based on their principled approach, were able to
00:18:24.340 | increase the sample size and quality of the images that they could generate. So just a snapshot from this.
00:18:30.740 | So the samples still kind of look bad, but better. And then their FID score was 10.
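A hedged sketch of the denoising score matching objective that Song and Ermon's line of work builds on: perturb the data with Gaussian noise at scale sigma, and train a network to predict the score (the gradient of the log-density) of the perturbed data. The score_net interface here is an assumption, and the per-sigma weighting used in the papers is omitted.

```python
import torch

def denoising_score_matching_loss(score_net, x, sigma):
    """Train score_net(x_noisy, sigma) to match the score of the Gaussian
    perturbation kernel; for N(x, sigma^2 I) that score is -(noise) / sigma^2."""
    noise = torch.randn_like(x) * sigma
    x_noisy = x + noise
    target = -noise / (sigma ** 2)
    prediction = score_net(x_noisy, sigma)
    return ((prediction - target) ** 2).mean()
```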
00:18:39.940 | The big breakout paper for diffusion was this one, which is usually called DDPM, or denoising diffusion
00:18:51.140 | probabilistic models. Similar to previous work, this one uses a forward and reverse diffusion process.
00:18:59.860 | One of the key breakouts here was that they restricted this to only predicting the mean of the Gaussian
00:19:09.860 | distribution from which the noise was added. And then they were able to show with a bunch of math that
00:19:19.460 | predicting the mean is equivalent to predicting the sample x, or the image sample, that they're trying
00:19:24.980 | to find. So effectively, they're trying to predict the sample x, but they do that by predicting
00:19:37.460 | the noise applied to x. And then they also introduced a much larger network, a U-Net that they pulled
00:19:45.380 | from image segmentation tasks. And they achieved state of the art for FID. Again, this is 2020.
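A minimal sketch of the DDPM training step described above: pick a random timestep, jump straight to the noised sample with the closed-form formula, and train the network to predict the noise that was added. Variable names here (eps_model, alphas_cumprod) are illustrative.

```python
import torch
import torch.nn.functional as F

def ddpm_simple_loss(eps_model, x0, alphas_cumprod):
    """One training step of the 'simple' DDPM objective from Ho et al. 2020."""
    batch = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (batch,))          # random timestep per sample
    a_bar = alphas_cumprod[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over data dims
    eps = torch.randn_like(x0)                                   # noise the model must recover
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps
    return F.mse_loss(eps_model(x_t, t), eps)
```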
00:19:54.100 | I'm not going to go through all this. Song came back pretty shortly afterwards and iterated on his last
00:20:10.580 | framework. More stuff with stochastic differential equations. Some of these papers or all these papers
00:20:17.220 | that I've talked about sort of form the underpinnings for current work. They're really kind of like the
00:20:22.420 | landmark ones. And people keep referencing particularly this, the DDPM paper.
00:20:33.220 | And again, the results look even better. You can see here, this is the score function in the middle.
00:20:39.700 | From the song paper.
00:20:50.340 | Okay. So jumping ahead to actually pause there. Any questions? A lot of the stuff is kind of
00:20:59.060 | background, so trying to go through quickly. But that doesn't mean it's not interesting or important.
00:21:03.540 | I had a side tangent on video diffusion.
00:21:10.020 | And then also, I think the other thing I was waiting for you to cover, but it looks like you
00:21:16.740 | stopped at like 2020, right? Did that paper get covered?
00:21:21.300 | Yeah, I think so.
00:21:23.620 | 2021. You didn't cover latent diffusion? Was that latent diffusion?
00:21:28.500 | No, I didn't.
00:21:29.620 | Well, latent diffusion is very important. And then consistency models and flow matching.
00:21:34.900 | Those are the three things that I think the last three years of diffusion have kind of taught us.
00:21:39.300 | For sure. Yeah.
00:21:45.940 | We've covered those things in previous paper clubs.
00:21:51.620 | Yeah, I think that the background was super helpful. Thank you, Tyler.
00:21:57.780 | It was a good catch up for me.
00:21:59.860 | Okay, cool.
00:22:04.580 | Cool. So jumping into this one is 2021.
00:22:10.980 | So after the DDPM paper got state of the art on FID scores for ImageNet, it got a lot of attention.
00:22:22.180 | The research kind of spiked afterwards. A lot of the focus was still on applications with image generation.
00:22:32.180 | But increasingly, it started to be applied to text modeling and other modalities. So audio, video.
00:22:38.180 | This was one of the first ones I could find with reference to
00:22:44.740 | text modeling. And this is the Austin et al. 2021 paper, Structured Denoising Diffusion Models in Discrete State-Spaces.
00:22:56.020 | So let's see.
00:23:00.020 | One quote from the paper is: we develop structured corruption
00:23:07.060 | processes appropriate for text, using similarity between tokens to enable gradual corruption and
00:23:12.580 | denoising. Expanding further, we also explore corruption processes that insert [MASK] tokens,
00:23:17.620 | which let us draw parallels to autoregressive and mask-based generative models.
00:23:20.900 | I'm actually going to skip the slide.
00:23:28.980 | So the process they introduce is they think of, you know, a string of text as a sequence of discrete
00:23:41.300 | tokens. We can reshape that into a matrix. And then we can randomly mask tokens
00:23:52.020 | by sampling noise. But each one of these samples is discrete, versus
00:23:59.460 | some of the previous approaches we talked about, which assume that the distribution is continuous,
00:24:04.740 | which is more appropriate for like color values or, you know, an audio signal.
00:24:11.860 | With text, we're treating this as just one of the values from a vocab. And we can iteratively apply
00:24:19.780 | noise using these masks. So again,
00:24:24.500 | this is the forward process q of x_t conditioned on x_s. And they use like a transition matrix, which I think comes from
00:24:39.700 | Markov chains, like a Markov process. But feel free to jump in if someone has a better understanding
00:24:50.020 | of that. But at each time step, they determine, you know, randomly, based on a probability
00:24:58.580 | from this transition matrix, whether to mask a token at each time step t.
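A very simplified sketch of the absorbing-state (masking) forward process being described: at step t, each token has been replaced by a [MASK] token with some probability taken from the transition schedule. The linear schedule and the mask id here are placeholders, not the paper's exact parameterization.

```python
import torch

MASK_ID = 0  # hypothetical vocabulary id for the [MASK] token

def mask_forward_process(tokens, t, num_steps):
    """Each position is independently replaced by [MASK] with probability
    roughly t / num_steps (a stand-in for the transition matrix schedule)."""
    p_mask = t / num_steps
    is_masked = torch.rand(tokens.shape) < p_mask
    return torch.where(is_masked, torch.full_like(tokens, MASK_ID), tokens)
```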
00:25:08.020 | And then the reverse process is similar to that, where if we know what the true distribution is of
00:25:15.300 | the data x0, we know like what the unmasked value should be. And so we're able to sample from like
00:25:24.340 | the reverse of the distribution to unmask it. But of course, you know, in actual generative modeling,
00:25:33.460 | we don't know what the true distribution of x0 is. So in our, you know, our reverse process,
00:25:41.380 | p theta of x, we instead approximate this using our neural network
00:25:57.140 | to determine what the probability should be. So this is, this is what the network learns.
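And a matching sketch of what the learned reverse step looks like: the network predicts a distribution over the vocabulary at every position, and only the currently masked positions get filled in from that prediction. The model interface is assumed, and the real D3PM reverse step also folds in the transition matrix posterior, which is omitted here.

```python
import torch

def reverse_unmask_step(model, x_t, mask_id):
    """Fill in masked positions by sampling from the network's predicted
    distribution over the vocab; unmasked tokens are left untouched."""
    logits = model(x_t)                               # (batch, seq_len, vocab_size)
    probs = torch.softmax(logits, dim=-1)
    sampled = torch.distributions.Categorical(probs=probs).sample()
    still_masked = x_t == mask_id
    return torch.where(still_masked, sampled, x_t)
```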
00:26:02.500 | And here's kind of a sample of what the
00:26:14.100 | model does. This is a figure from the paper,
00:26:16.260 | based on the LM1B task to generate new sentences. And then the bottom
00:26:26.820 | portion on the right is reconstruction samples. And so what this is trying to show is that
00:26:33.780 | on the top as T goes up, more stuff is masked. And then on the bottom as T moves forward, more stuff is
00:26:40.340 | unmasked. And then if we learned a good representation of the, of our data, then what gets reconstructed is, is like
00:26:50.660 | pretty close to what the data should be.
00:26:55.540 | So we can see, you know, it masks. The original is Caterpillar is eager to expand in Asia. And then the,
00:27:03.940 | the reconstructed version of Caterpillar is eager to expand in China.
00:27:06.820 | Some of the claims in the paper, this is a big block quote that I'm not going to read,
00:27:18.260 | but I encourage people to check out because I think it's
00:27:20.340 | pretty, pretty interesting.
00:27:25.140 | But they claim that BERT, viewed through this lens that they've established,
00:27:31.220 | is a one-step diffusion model.
00:27:33.540 | And that autoregressive models are discrete diffusion models.
00:27:39.060 | And that generative masked language models, so the type of model they've proposed,
00:27:47.700 | or masked language models in general, like BERT, are diffusion models.
00:27:53.780 | So interesting parallel with kind of like previous NLP research, um,
00:27:59.460 | and trying to like establish a background for, for their approach.
00:28:04.420 | This plays out in some of the, um, the future papers. So I wasn't sure how much depth to go in.
00:28:11.780 | So I cut out a bunch of stuff. Um, so this one was 2021. Um, there's a ton of.
00:28:17.620 | Sorry Tyler. Oh yeah. I was going to actually ask a question on this. I think this feels actually
00:28:23.060 | kind of important. Um, so I want to talk through my thinking of it. And I think you, you will know
00:28:30.580 | this a lot more than me. I think BERT is a one-step diffusion model because we add noise, right? We corrupt
00:28:36.660 | the tokens. That's why it's a one-step diffusion model.
00:28:40.980 | I'm a little bit stuck on how autoregressive models are discrete diffusion models. Could
00:28:45.460 | you talk a little bit more about that piece? Yeah, totally. Um, so
00:28:52.260 | I think there, um, I'm not as, as firm on this as, as I, as I could be. Um, but
00:29:06.980 | you know way more than me at this point. I don't know. I went really broad and skimmed a ton of
00:29:12.580 | papers. Um, and I can, I can share a bit of a high level about what they're saying here. So what's the
00:29:17.940 | intuition? Yes, please. Discrete diffusion in this sense is basically them. Uh, so at each step there,
00:29:27.220 | there's a deterministic answer and they're just masking the next token, right? So when you train the
00:29:32.660 | autoregressive model, what you're basically doing is you're predicting one token at a time
00:29:38.020 | and in a sense, right, that's, that's just, you have deterministic outputs of what the token should be.
00:29:43.940 | You're just now masking one token at a time. So it's deterministic in the sense of, you know,
00:29:49.940 | you know exactly what it is. It's equivalent to like a single diffusion step, but instead of stochastic
00:29:55.620 | diffusion, it's deterministic because you know what the token should be. But yeah, I like read a little
00:30:02.420 | bit more about this and that's, that's kind of all they're saying. They're just saying you're masking token
00:30:06.580 | by token and that's all it is. Right. So it is discrete as opposed to probabilistic or continuous
00:30:14.980 | like images because images, image pixels are continuous, right? Right. In this case, it's text
00:30:19.540 | tokens. That's why it's discrete. Is that it? Images, images are continuous in the sense of you're predicting
00:30:26.180 | how much noise was applied and what the noise is. In this case, you have exactly one token of measurable,
00:30:32.180 | predictable, discrete change. Right. So that's, that's how they're framing the idea, which, which
00:30:38.420 | kind of in some sense makes sense, right? Like how much noise did I apply over an image? There's a
00:30:43.060 | spectrum of answers, right? And then your loss is measured differently. In this case, it's a very
00:30:47.540 | discrete. Yeah. I also wonder in this case, like maybe the word big, it could be large or huge. I think
00:30:54.420 | all of them, all of them are actually, the word big itself may not be the, you have many synonyms that
00:31:01.380 | could also be correct answers. So in that sense, framing as discrete loss or discrete diffusion.
00:31:07.860 | Never mind. I don't know enough of this to comment more. Sorry, please go ahead.
00:31:12.020 | So in that case, why are mass language models? Oh, RJ, go. Yeah, sorry. One interesting thing that I
00:31:21.220 | learned when I did the consistency paper was that the diffusion process is actually the model from the, from the
00:31:30.020 | beginning, given the predictions up until now. Right. So like, it's not, it's not just like given
00:31:37.860 | the last prediction, what's the next step. It's given, what's the next step, given all the predictions up
00:31:45.700 | until now. And this is the same for our language model. And so that it, in, from that perspective,
00:31:50.580 | it looks very much like an auto regressive model. Right. That makes sense. Yeah. And, and, and the other thing that I
00:31:58.180 | I just, maybe, maybe this is stating the obvious, but BERT, with the masked language
00:32:04.660 | modeling, is just sort of like a generalization of masking only the next token. Right. So it's sort of
00:32:10.900 | like, you just happen to have masked the next token every time for the,
00:32:16.980 | um, for the autoregressive model. Whereas with a masked language model, you're masking anywhere,
00:32:24.580 | or like a chunk inside of the text, like tokens inside of the text, instead of at the end.
00:32:32.740 | Right. That makes sense. I kind of think like, but it's like, you're masking tokens in the middle,
00:32:39.380 | like 15% of the time. Yeah. Whereas for autoregressive, you're masking everything ahead.
00:32:43.860 | Yeah, exactly.
00:32:45.460 | Um, everything's a mask. Okay. And then they denoise one position at a time, if you think of it that way.
00:32:50.100 | But, but, but maybe more. Yeah. So maybe more just like you mask whatever the next token is.
00:32:55.540 | Or, or maybe it, maybe it's better. Maybe what you said is better way to look at it in the sense
00:33:01.460 | it, because it parallels what I said about diffusion models in the, um, like for whatever the stable
00:33:09.220 | diffusion or whatever. No, that's totally right. Yeah. And then going to get into it in a minute. Um,
00:33:14.420 | at least a little bit is that, um, some of the state of the art stuff for language diffusion,
00:33:20.180 | um, of course they, they lean heavier on big transformer models. Um, and they're able to remove
00:33:25.780 | kind of the, the causal, causal mask, um, that's applied for attention. Um, so you think of like the,
00:33:32.500 | you know, the bottom triangle, um, that we usually see for attention, or if you have like group query
00:33:36.980 | attention, it's like the little squares, um, down your, your triangular matrix, um, for, for
00:33:43.380 | diffusion language models, they're able to look at, you know, the whole, um, uh, KV, um, for,
00:33:53.700 | for the attention, um, operator. So, so if I'm jumping ahead here, I'm sorry,
00:34:00.020 | sorry, please finish, uh, yeah, no, that's it. So if I'm jumping ahead here, let's say I have a code base
00:34:06.260 | and I say, I want to refactor this specific object that's being used.
00:34:13.060 | Discrete model, uh, a generative model just fills in the blanks.
00:34:17.060 | And that's why you don't actually have to go left to right, like autoregressive models,
00:34:21.620 | which you have to go left to right, even when they're just filling in the blanks.
00:34:24.340 | Actually, I don't know if there's, uh, better techniques for filling in the blanks with an
00:34:29.060 | autoregressive model for code models, but discrete model can just say, okay,
00:34:32.180 | someone will just draft out, okay, these are all the definitions I need, all the functions I need,
00:34:36.500 | some autoregressive model draft that out and then discrete model just fill in the blanks and
00:34:41.060 | it can do that very fast. I'm, I'm just kind of trying to match it to what we saw at Google I/O
00:34:46.980 | when they were demoing this, um, sorry, I mean a diffusion model for coding.
00:34:52.660 | Totally. Yeah. I mean, I think one of the challenges is too, is that the context for
00:34:57.220 | things that are ahead. Um, so if you're, you know, in that example, and I, I don't know the specific
00:35:05.940 | processes that are applied for, for code models, and assuming it's, you know, trained with a similar
00:35:10.660 | loss function, um, and that they're just sampling from, um, saying, here's what it is, here's what we
00:35:18.580 | want. And it's able to interpret that there should be a larger output space, um, or the output space
00:35:25.060 | sequence should be longer. Um, that's true. Expanding that dynamically, right?
00:35:29.860 | Yeah, totally. Yeah. But if you're, if you have like, you know, your, your hundred line file,
00:35:35.460 | if you're doing it autoregressively, and you're making an edit at
00:35:43.460 | line 50 or something, you have to feed in the first 50 lines in context and say, we want the
00:35:49.700 | edit at line 50, and then also feed in the remaining 50 lines, and then
00:35:58.740 | your autoregressive model has to attend to the stuff that's ahead of it, um, with reference
00:36:04.740 | to the stuff that's behind it versus if we don't have the, the causal mask, it can attend to all of it
00:36:10.100 | at once, um, or it's able to attend to it at once in a different way. Um, but hopefully I'm not.
00:36:17.060 | Makes sense. Thank you, Tyler. This was helpful for my intuition. Thank you.
00:36:20.740 | For sure. Um,
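To make the point about the causal mask concrete, here is the difference in the attention mask itself: autoregressive models only let a position attend to itself and earlier positions (the lower triangle), while a diffusion language model can attend in both directions.

```python
import torch

seq_len = 6

# Autoregressive attention: lower-triangular mask, no looking ahead.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Diffusion LM attention: every position can attend to every other position,
# including tokens that come later in the sequence.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)
```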
00:36:22.660 | cool. So yeah, there's a bunch of papers that happened from 2021. I'm jumping ahead to 2024.
00:36:31.220 | Um, the state of the art kind of like kept ticking up a little bit, but still significantly worse than
00:36:38.340 | autoregressive models. Um, even for comparable model sizes, um, there was stuff like the
00:36:46.500 | back and forth between discrete state spaces and then continuous state spaces where they have like
00:36:51.460 | a continuous representation and then they have like a separate process. So the continuous representation is
00:36:56.180 | like an embedding. Um, and then they have a separate process to sample from the embedding
00:37:00.740 | to determine what the token should be. Um, which hasn't really panned out in terms of like what the,
00:37:09.700 | what is now state of the art. So that's part of the reason I chose to skip over it. Um, but, um,
00:37:15.140 | encourage folks that are interested to look into it. There's a bunch of papers that came out,
00:37:20.180 | um, kind of in this time span that I'm glossing over. Um, one of the next ones that I thought was,
00:37:30.740 | was pretty cool. Um, and I think this is kind of looking backwards. It's sort of what this year,
00:37:35.380 | the art is and what's referenced from that. So looking at, um, um, um,
00:37:41.380 | the study in October of last year, um, where they started scaling up, um, masked diffusion models. Um,
00:37:54.100 | and they were able to achieve results, um, that are competitive with, um, autoregressive
00:38:01.780 | language models, um, of similar sizes, um, and with relatively similar, um, levels of training
00:38:11.220 | compute. Um, so they trained up to a 1.1 billion parameter model. Um, and then, depending on the
00:38:20.980 | benchmark, it was competitive with GPT-2, um, the 1.5B version, and then Llama 2,
00:38:27.460 | um, the 7B version. Um, so those weren't quite state of the art when this was released. Um,
00:38:32.820 | but still kind of a, um, a step forward. Um, and then it's, I don't know, I kind of like this scaling,
00:38:39.460 | um, scaling work just to see like the number goes down as compute goes up. Um, there was pretty graphs.
00:38:49.620 | Um, so, you know, the classic Kaplan, um, Chinchilla isoFLOP curves, um, based on our,
00:38:57.700 | our training budget, we make stuff go down. Um,
00:39:03.140 | and here they have, like, they're saying they followed a similar, um, scaling law to autoregressive
00:39:14.740 | models with their masked diffusion models. Um, but with some constant multiplier. Um,
00:39:20.900 | but the, the general curve, you know, on our, our log log plot was, was pretty similar for, for their
00:39:28.580 | approach. Um, digging into this a little bit more, um, the, to achieve a similar validation loss, um,
00:39:40.900 | they had to have 16x, um, more compute, um, than the autoregressive model on their,
00:39:48.020 | their approach that they model here. So that's the, the left plot. And then on the right plot,
00:39:53.300 | they were able to achieve similar, um, what is this?
00:40:03.460 | I think better performance with fewer parameters is the right plot.
00:40:06.900 | And then, yeah, here's their, the results, um, which are competitive with the, the models we listed.
00:40:18.020 | Um, this one we, we talked about last, last week. Um, so I'm not going to dig into it too much. Um,
00:40:24.420 | so I just want to hit some things that I thought were cool that we didn't quite touch on. Um,
00:40:28.580 | so this is the same, uh, most of the same authors as the scaling laws paper, um, where they continue
00:40:36.820 | the trend and scale up to, um, 7 billion parameters. Um, and then the results got even better. Um, it's now
00:40:45.460 | as good or better than Llama 2 7B or Llama 3 8B, um, on a handful of different benchmarks.
00:40:50.420 | Um, one of the things, you know, we were just talking about is the bi-directional reasoning.
00:40:56.900 | Um, so one of the tasks they, um, they looked at was, um, reversing a poem. Um, so if you have like
00:41:10.740 | the last couple of lines in a poem, can you predict the lines that came before it?
00:41:14.820 | Um, and for this specific task, um, this model significantly outperformed
00:41:21.140 | most of the other models they, they tested against, including, um, GPT-4o. Um, and they, they attribute
00:41:28.340 | that to kind of the, um, the lack of the causal mask, um, for attention that, you know, we were just talking about.
00:41:34.260 | Um, so again, a figure from the paper where they're, they're masking stuff. Um, they also introduced this
00:41:42.180 | step where, um, they're able to score the probability of a token, um, that's predicted by the,
00:41:49.620 | the mask predictor. Um, and things that are low probability, they can then potentially remask and
00:41:54.820 | then try to resample again to get a better prediction. Um, so that's, um, the right, or part C, of this
00:42:02.500 | figure, um, where they're remasking, um, the tokens and then re-predicting them again, which is kind of neat.
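A sketch of the low-confidence remasking idea described here: predict every masked position, keep only the most confident predictions, and remask the rest for another round. The keep_ratio and the greedy argmax choice are assumptions for illustration, not the paper's exact procedure.

```python
import torch

def remask_low_confidence(logits, x_t, mask_id, keep_ratio=0.5):
    """Fill in masked positions, then remask the least confident fraction."""
    probs = torch.softmax(logits, dim=-1)
    confidence, prediction = probs.max(dim=-1)        # per-position confidence and argmax token
    was_masked = x_t == mask_id
    filled = torch.where(was_masked, prediction, x_t).flatten()
    # Positions that were never masked should never be remasked.
    confidence = torch.where(was_masked, confidence,
                             torch.full_like(confidence, float("inf"))).flatten()
    num_remask = int(was_masked.sum().item() * (1 - keep_ratio))
    if num_remask > 0:
        filled[confidence.argsort()[:num_remask]] = mask_id  # remask lowest-confidence positions
    return filled.view(x_t.shape)
```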
00:42:07.940 | Um, and then more scaling laws stuff, um, for different tasks, um, their model was able to, um,
00:42:20.020 | achieve better performance at, at lower, um, training compute. Um, and sometimes it's worse.
00:42:26.820 | So for the middle bottom plot GSM, GSM 8K, um, which I think is a math focused task, um,
00:42:34.740 | they outperformed, um, their autoregressive baseline and then to the plot to the left of it,
00:42:39.380 | the bottom left plot there, um, we can see all the orange stars are sort of below, um, the, the blue dots.
00:42:48.420 | Um, and so lower is a worse accuracy and this is your zero shot task, um, for the same level of compute.
00:42:55.380 | Um, so definitely some trade-offs there. Um,
00:42:58.260 | This next one, uh, block diffusion. Um, so this paper, um, this one came out in, um,
00:43:08.660 | March of this year. Um, this uses, um, a hybrid architecture where it, um, has, um,
00:43:18.340 | has blocks, um, where the blocks are generated autoregressively, but within each block,
00:43:24.020 | tokens are, um, generated using diffusion. Um, part of the reason they did that is to, um,
00:43:30.740 | I think part of it was to make the, um, the sampling task easier. Um, but they were also able
00:43:39.220 | to take advantage of KV caching, um, versus this, this previous paper, um, despite the results, um,
00:43:48.420 | hasn't been optimized to use a lot of the tricks, like KV caching. Um,
00:43:53.540 | I didn't dig into the specifics of why KV caching doesn't work for this previous paper. Um, kind of
00:44:00.660 | thinking about it, um, my intuition is that, um, if you have like a, a dialogue, um, you know, between
00:44:09.860 | the system and a user, um, that it's, it has to generate like the next response starting from zero,
00:44:16.500 | um, as opposed to finding a way to, to cache like the previous, um, tokens like you can, um,
00:44:23.300 | if you're generating autoregressively, um, and they're able to, so block diffusion was able to
00:44:28.180 | get around that by having, um, like these, these chunks of a fixed length that they then generate.
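A very rough sketch of the block diffusion generation loop described above: blocks are produced left to right (so the attention over earlier blocks can reuse a KV cache), while the tokens inside each block start fully masked and are filled in over a few diffusion steps. The model.denoise_block call and all argument names are hypothetical, just to show the control flow.

```python
def generate_block_diffusion(model, prompt_tokens, num_blocks, block_size,
                             steps_per_block, mask_id):
    """Autoregressive over blocks, diffusion within each block (a sketch)."""
    sequence = list(prompt_tokens)
    for _ in range(num_blocks):
        block = [mask_id] * block_size               # new block starts fully masked
        for _ in range(steps_per_block):
            # The model conditions on everything generated so far plus the
            # partially denoised block; earlier blocks can come from a KV cache.
            block = model.denoise_block(sequence, block)
        sequence.extend(block)
    return sequence
```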
00:44:34.100 | So we, again, we talked about this last week that, um, the large language diffusion paper, um, applied
00:44:46.100 | pretty similar pre-training. Um, I think they used like 2.3 trillion tokens. Um, and we saw 10 to the 23rd,
00:44:53.860 | um, FLOPs on H100, um, but they didn't do any post-training. Um, and then some of the, the new work
00:45:05.380 | that's coming out is improving kind of the, the post-training, um, uh, especially with a reasoning slant.
00:45:11.860 | Um, so this paper, uh, uses this as a base model, the, the LLaDA model that they introduced,
00:45:20.660 | um, and then applies, um, their custom GRPO, um, post-training. Um, and then they use like the S1, uh,
00:45:34.580 | reasoning dataset to do supervised fine-tuning, um, with a slant towards, um, like math and reasoning
00:45:42.020 | and code tasks. Um, and based on that, they're able to, um, you know, dramatically improve the
00:45:47.620 | performance on those specific tasks over the base model. Um, so that's the, the bottom table here,
00:45:57.940 | they're in green. Um, the top row is the, the base model. Um, so they, they bumped up the numbers a
00:46:05.380 | good amount. Um, let's see, and that's, that's all. So thanks all.
00:46:16.020 | This is really good. Um, Tyler, I, there's a question that Eric posed in the channel
00:46:27.540 | that I'm also curious about. Does the compute for the output window scale as O of log? Is it N squared
00:46:35.140 | for diffusion models or is it linear? Oh, would you know?
00:46:39.460 | I don't know offhand. Um,
00:46:45.940 | Some of the big ones that we hear about like stable diffusion, dolly image gen, uh, diffusion transformer,
00:46:57.220 | video diffusion transformers. Those are all transformer based. So there is, um, there is
00:47:03.780 | that, um, quadratic scaling issue, but some of them are not. So some basic denoising
00:47:10.740 | diffusion probabilistic models, they're more CNN based. And then there's no longer that transformer
00:47:16.820 | complexity issue. So depending on what you're doing, like some early work trying to do diffusion
00:47:23.460 | for completion. So like short completion for cogen stuff is not transformer based. So you're
00:47:30.100 | no longer complexity bound. But I found it weird. Cause like, you know, that's completion. It's like
00:47:35.940 | not long, long context. So I don't know. It's just what they did though. But, um, some of the latent
00:47:42.980 | diffusion models, like from stability, those I believe are also not transformer based. So they're,
00:47:50.260 | you know, that's more popular. You've probably heard of some latent diffusion stuff. Um,
00:47:54.740 | they don't use, um, they don't use transformers for the diffusion itself.
00:48:01.380 | I think they just use it for the, um, text encoding. So, you know, it, it kind of depends on where you see
00:48:09.700 | the quadratic scaling complexity.
00:48:20.660 | That's cool. Um, that's cool. Yeah, I guess we have a few open follow-up questions now.
00:48:25.620 | Yeah. I know RJ had a question. I don't know if you want to come on camera and just ask it.
00:48:30.500 | Oh, no, I was, I was come, I was commenting that I think that most, uh, or stable diffusion
00:48:41.300 | anyway, has a, has attention blocks in inside of the diffusion block. So I think it, but I don't know
00:48:48.580 | that would be, uh, so that would be across all the, um, sort of latent space tokens. So that's a fixed
00:48:59.620 | size and wouldn't be impacted. So I don't really, I'm not sure that this, that impacts like a, a text
00:49:05.380 | diffusion model, except because you would, yeah. Um, if you look at like BERT being a diffusion model,
00:49:13.220 | right, then it is obviously, it has a transformer block. And so therefore would, uh, so I think that
00:49:18.500 | the question about that is it's a little bit orthogonal, right? Because I, I view diffusion as
00:49:23.780 | like a alternative process to, uh, autoregressive modeling kind of, and not so much transformer versus
00:49:34.900 | non-transformer.
00:49:35.700 | Yeah, there was, the people have incorporated transformers into diffusion models in a handful
00:49:47.300 | of different ways. Um, so let me see if we can pull this up. Um, uh, I think that's my old desktop.
00:49:55.860 | Let's see.
00:50:01.780 | Well, so this is the, the U-Net image that I pulled from one of the slides. Um, this is from
00:50:07.380 | Prince's Understanding Deep Learning, which is, which is pretty solid. Um, so starting with DDPM,
00:50:13.780 | this is sort of the, the base model architecture that they use to predict, um, the noise at each time
00:50:19.780 | step. Um, so they have, um, you know, the previous time step and then it's fed into this. And then the output is the, um, the image
00:50:31.540 | minus the noise. Um, or maybe it's just, sorry, I think it's actually just the noise itself, um,
00:50:37.140 | as a difference from what the image would be, but regardless, um, they've got like a bunch of, um,
00:50:44.340 | convolutional blocks, um, scaling it down and then back up. Um, and they, they don't make it very clear.
00:50:51.460 | Um, this color's not very good, but even within the original DDPM model from 2020, um, there's attention
00:50:58.980 | operators kind of at the 16 by 16, um, chunks here. So it's able to attend to what came in. Um, and then as
00:51:09.300 | models got bigger, um, and more complicated, um, people started tossing in more complicated model
00:51:15.780 | architectures and a lot of more attention at different parts of this. Um, a lot of them still
00:51:20.020 | kind of retained this unit shape, um, but stuff kind of got fancier from there. So that's one approach.
00:51:26.900 | And then I don't know if I can find the right. And then, can I just add to that? Yeah, please.
00:51:31.460 | Yeah. So, so as Vibu was saying, uh, the attention is primarily used in the, some of these earlier
00:51:37.540 | image diffusion models to, uh, to the time embedding space, as well as the prompt that would guide the
00:51:47.060 | image generation. Uh, sometimes these prompts are text prompts, but sometimes these are
00:51:54.180 | image prompts like depth maps or contours or outlines of different things. So the
00:52:01.060 | attention is basically used as a grounding mechanism, uh, to flow forward, but the actual
00:52:06.980 | process itself is diffusion. So we give, there is a separation that we can make. The attention is used
00:52:12.980 | for helping the model understand the semantics of the image as well as for the generation.
00:52:19.700 | But the actual diffusion process itself is orthogonal to that.
00:52:23.380 | Yeah. The term for that is conditioning. So they call it conditioning sometimes.
00:52:29.300 | Yeah. And there's even class conditioning. So you can have like class based labels and
00:52:34.660 | guide, um, generation towards that. So like stylistic labels, right? I want anime and that's
00:52:40.500 | separate than text conditioning of like ultra realistic. And you know, you're basically using text
00:52:47.220 | embeddings, but the, the attention there is it's typically done, um, with something like cross
00:52:54.180 | attention over a text embedding dimension. And that's, that's separate than diffusion scaling, uh,
00:53:01.460 | you know, a different complexity.
00:53:04.100 | Correct.
00:53:05.860 | And then things get much more complex. They have something called ControlNets,
00:53:09.460 | which is a slightly different concept than conditioning. Yeah.
00:53:13.860 | It's interesting where like, you know, we basically had our like GPT three moment where we used to have
00:53:22.820 | like all these temporal nets to fix consistency when you scale stuff up. And then with Sora, you know,
00:53:28.340 | it turns out that video generation is just scaled up diffusion and you just scale it up a lot and
00:53:33.700 | you solve a lot of these little like nuances. And it just kind of works.
00:53:38.260 | So I guess I don't want to be argumentative, but, uh, I think that, um, unless I'm misinterpreting this
00:53:47.460 | image, uh, that I put in the chat, I'm pretty sure it's saying that there is actually, um, uh,
00:53:56.900 | attention blocks in the backbone of the diffusion.
00:54:00.180 | Um, not, not just in the conditioning.
00:54:04.260 | Okay.
00:54:08.180 | So I don't know if I can pull it up and, or someone can share it if you want.
00:54:11.540 | Yeah. I mean, it's been a while since the last time we did it, but, uh, that conditioning is primarily
00:54:16.980 | for, uh, that attention is primarily for the conditioning, whether it's text or, or image prompts.
00:54:25.540 | Okay. Uh, okay. Uh, well, okay. We, I guess we can argue about it offline. I think,
00:54:36.820 | I think that, uh, as far as I can tell, as, uh, Tyler was noting, I think it's actually part of
00:54:44.500 | the U-Net backbone or whatever, uh, the backbone is made of. And like in Stable Diffusion,
00:54:50.500 | the backbone is a U-Net, right? Yeah.
00:54:52.500 | Anyway.
00:54:54.820 | It always helps to pull in more background information. Um, thanks again, Tyler. Really, really fun one this
00:55:04.740 | time. Um, I think Cirque said to take a call. So just wrapping things up here next week, we have the
00:55:12.820 | AI Engineer World's Fair. So if any of you guys are around, you know, we'll share something on
00:55:17.940 | discord. We'll, we'll do like a little meetup. I think we have time for like an in-person paper club
00:55:23.140 | workshop thing. Uh, we're supposed to be announcing our test of time paper club V2. So we'll share it
00:55:30.500 | remote too, but, um, TBD on what the paper is, but we'll have some sort of session in person and also
00:55:39.140 | remote next week. So if you're around at the conference, come by, otherwise, you know, same
00:55:44.020 | zoom thing and then it'll just be a different one. But yeah, thanks everyone for coming. Thanks, Tyler,
00:55:49.300 | for sharing. Thank you. Uh, Tyler, someone asked about slides. Oh yeah. I'll post those in the discord.
00:55:59.300 | Yeah. Thanks everyone. Perfect. Yeah. Thanks. Thanks.