[Paper Club] Intro to Diffusion Models and OpenAI sCM: Simple, Stable, Scalable Consistency Models

00:00:00.000 | It should just record to my account so that I can upload it to YouTube.

00:00:06.000 | Is done. I did it. Okay, great. Oops.

00:00:11.000 | All right, great. So, we're off.

00:00:14.000 | So, Hi everybody I'm RJ.

00:00:19.000 | And we're going to talk about this paper, I hope people had a chance to at least pick through it I know this is one of the hardest papers I've read in a long time.

00:00:32.000 | So I understand if you didn't understand everything that's going on I certainly didn't, and I had to dig through a bunch of other papers to have the context also to understand it so.

00:00:43.000 | And then there's a lot of proofs that take too long to read and are not as important for understanding the paper, so there's a lot there.

00:00:52.000 | But so, without further ado: this sCM stands for simple, stable, and scalable, and the authors do a great job of actually going through all three of those things in the paper.

00:01:08.000 | The CM is consistency model, we'll talk about what that means in a little bit. And then, why are they doing this? Well, because these continuous-time consistency models show promise, but they're hard to train

00:01:26.000 | and hard to scale. So the paper kind of present some techniques for getting over those challenges.

00:01:34.000 | So I'm going to, you know, as in, so we're talking about in the beginning.

00:01:41.000 | There's, I think it is good to have just a recap of diffusion in general and this is more complex version of diffusion but I thought a lot of people have seen this diagram so it'd be useful to talk about what we're talking about here.

00:01:57.000 | And in, in reference to, you know, sort of stable diffusion and latent diffusion.

00:02:04.000 | So, you know, the left side is the variational autoencoder, and that lifts the pixels into latent space.

00:02:16.000 | And so we're not going to talk about that.

00:02:20.000 | And then the right hand side is the conditioning on images and text, so that's how you actually prompt the model instead of just generating random images.

00:02:30.000 | And so we're also not going to really talk about that.

00:02:33.000 | But suffice it to say these mechanisms are compatible with the existing technologies for these things, so you can just view it kind of as a drop-in replacement for the green part.

00:02:50.000 | And actually, keep this diagram in mind: the U-Net in the middle, the big part with the attention blocks, is basically identical even.

00:03:03.000 | So, a lot of a lot of this diagram is the same.

00:03:07.000 | It's really about the training process and, you know, some tweaks to make the, the main network work well with that training process.

00:03:18.000 | And, and, you know, specifically, with, with the regular diffusion models that we all know and love. They have to iterate multiple times generally to generate a good image, whereas the, these consistency models are designed so they can generate a good

00:03:40.000 | image with only one pass through the network. There's a mechanism by which you can iterate to refine if you want to, but the idea is that you don't need to; even one shot through the network is enough to get a really good image.

00:03:56.000 | I tried to put all the links that I took images and other things from, so I'll share this and everyone can everyone can click on those if they want.

00:04:10.000 | So, please, and also please, if you guys have questions I know they go in the chat. I'm not watching the chat so let me know if you want to interrupt and ask a question.

00:04:21.000 | So, you know, this is just regular diffusion, and the idea is quite simple: you're just adding, step by step, some amount of noise, and this is actually scheduled.

00:04:35.000 | That is, it's not necessarily adding a constant amount of noise at a time. It can work that way, and that's the top row of images, but they found in this paper that it's actually inefficient: you end up with a lot

00:04:54.000 | of wasted backward diffusion steps that you don't need, and you can cut out a whole bunch of them just by using a cosine schedule.

00:05:06.000 | So the intensity of the noise is modulated by this cosine.
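
A minimal sketch of the difference between a linear and a cosine noise schedule, in terms of how much of the original signal survives at each step. The step count (1000), the linear beta range (1e-4 to 0.02), and the cosine offset `s=0.008` are commonly quoted defaults and are assumptions here, not values stated in the talk.

```python
import math

def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Fraction of signal variance left after each step of a linear beta schedule."""
    alpha_bar, out = 1.0, []
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        alpha_bar *= 1.0 - beta
        out.append(alpha_bar)
    return out

def cosine_alpha_bar(num_steps, s=0.008):
    """alpha_bar_t = f(t) / f(0) with f(t) = cos^2(((t/T + s) / (1 + s)) * pi/2)."""
    def f(t):
        return math.cos(((t / num_steps + s) / (1 + s)) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(1, num_steps + 1)]

T = 1000
linear, cosine = linear_alpha_bar(T), cosine_alpha_bar(T)
for step in (250, 500, 750):
    # The linear schedule destroys the signal much earlier than the cosine one.
    print(step, round(linear[step - 1], 4), round(cosine[step - 1], 4))
```

With these assumed defaults the linear schedule has almost no signal left well before the last step, which is one way to read the "wasted steps" remark.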

00:05:11.000 | Yeah.

00:05:13.000 | And then I think this is where it gets interesting the reverse diffusion process and I actually learned a little bit about this.

00:05:22.000 | when I was studying this, because my impression was that, you know, you sort of crank in the image

00:05:31.000 | And it just sort of like progressively refines and refines and refines but that's not actually quite what's happening. And this, this diagram here kind of shows this, and so I want to, we're gonna have to look at the next one to really understand what's

00:05:47.000 | going on here, maybe fully but the, the. If you look on the right, maybe starting on the right.

00:05:54.000 | You see that like the predicted noise every step is actually like you can't really make any sense of it right it's all just noise.

00:06:02.000 | And then the middle column, you have like slightly more, more and more refined picture and then on the left you have this like sort of full noise removed column.

00:06:18.000 | And the weird thing is that like if I look at, for example, on the T equals 40 row.

00:06:28.000 | And then I look at the T equals 30, the full noise removed.

00:06:33.000 | That doesn't look like, like the one in the full noise remove column looks a lot more refined than the one in the input at T.

00:06:41.000 | Right. And so like in the 30, you would expect 30, the output at 30 would be more than, like, nicer than the input at 40, and it's actually not the case and the reason is because that's not exactly what's happening right it's not the output

00:06:55.000 | goes into the input, but rather the, the input is only being used to improve the noise estimate that's in the right hand column.

00:07:06.000 | So that noise estimate, the right-hand column, is added to the generated noise to get the left-hand column.

00:07:26.000 | So the middle column is only an input to the model, in order to refine how to generate that noise estimate. I hope that thought makes sense.
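
A minimal sketch of how a "full noise removed" picture can be computed from a noisy input and a noise estimate, assuming the standard forward process x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps. The four-number "image", the value `alpha_bar=0.3`, and the perturbed estimate are made up for illustration.

```python
import math
import random

random.seed(0)

def add_noise(x0, eps, alpha_bar):
    """Forward process: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]

def estimate_x0(x_t, eps_hat, alpha_bar):
    """Invert the forward process using a noise estimate eps_hat."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - b * e) / a for x, e in zip(x_t, eps_hat)]

x0 = [0.5, -1.0, 0.25, 2.0]                # stand-in "image"
eps = [random.gauss(0, 1) for _ in x0]     # the noise that was actually added
x_t = add_noise(x0, eps, alpha_bar=0.3)

# A perfect noise estimate recovers the data (up to float error).
perfect = estimate_x0(x_t, eps, alpha_bar=0.3)

# A slightly wrong noise estimate gives a slightly wrong reconstruction.
eps_hat = [e + 0.1 * random.gauss(0, 1) for e in eps]
rough = estimate_x0(x_t, eps_hat, alpha_bar=0.3)

err_perfect = max(abs(p - x) for p, x in zip(perfect, x0))
err_rough = max(abs(r - x) for r, x in zip(rough, x0))
```

The quality of the reconstruction depends only on how good the noise estimate is, which matches the point that the intermediate image is an input used to improve that estimate.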

00:07:37.000 | And I'll switch this page but if anyone wants to ask a question that now would be a good time, because I think, and I try to illuminate a little bit what's happening here.

00:07:50.000 | But I want to. Okay, so I'll if please interrupt if you want to ask about this. It's good so far.

00:07:56.000 | Okay, great.

00:07:57.000 | Um, so, uh, so now what's happening. And this is sort of like a bad illustration of the same concept, but in a different way, so that if you look on the T equals 10 right hand column that's like just random noise that was generated as input into the model.

00:08:18.000 | And then on the right hand column.

00:08:21.000 | You have denoised versions of that noise. And so, if you look at like, let's look at the bottom t equals eight. Right, so time goes kind of backwards here so left and left hand is t equals zero right hand is t equals 10.

00:08:37.000 | And then t equals eight is two to the left of equals 10.

00:08:41.000 | And so then, because the backwards diffusion process goes backwards in time, at t equals eight what's happening here is that the thing at the top of the line is actually the input to the model. It goes through

00:08:59.000 | that thing with the attention blocks, and then it outputs something to add to t equals 10. And then when you add those two things together you get the thing at t equals zero down at the very left, lower left.

00:09:12.000 | Right, so then, as time goes on, we get to t five, and we have a better estimate of that noise because we've gone through several steps, and then we add that to the t equals 10, and we get that airplane.

00:09:27.000 | It's kind of blurry, in the top left, and then t equals three, and then finally we get to something that's really close to the original data.

00:09:36.000 | So, so I'm making this distinction because it's very different.

00:09:42.000 | I want this to be clear because it's sort of the motivation for how to understand what's happening with these trajectories in the paper: you can see that these are separate trajectories, and I've intentionally drawn that. It's not just

00:09:57.000 | like me being inaccurate. I've intentionally drawn it that way because the, the locations that you're learning are not basically on the same trajectory in the, in this latent space.

00:10:11.000 | They're not on the same trajectory as each other, right, and this causes a lot of inefficiency. And that's sort of the whole point of these consistency models, and flow matching and other things that use this technique,

00:10:32.000 | is to sort of like be more efficient about the places you're sampling.

00:10:37.000 | And you're saying that you're saying that the trajectories are not the same, like T equals eight, T equals five, T equals three, they're not the same trajectories on each other, in the sense that with T equals three, you could not have gotten what you have gotten at T equals five.

00:10:50.000 | That, that's right, that's right.

00:10:52.000 | No, don't restart your computer.

00:10:57.000 | Okay, I understand that intuition. Thank you. And why is that?

00:11:02.000 | I, so I, this is slightly vague in my mind but basically because the, you know, all that the like the.

00:11:12.000 | Maybe I've drawn them like further than they might be in reality right they'll probably be pretty close but the point is that when they're, when the input to the model is like the T equals 10 and the T equals eight noise at the top.

00:11:29.000 | Right. And I'm just trying to generate something that I can add to T equals 10 to do a little better than the T equals eight one.

00:11:41.000 | Right, so I'm trying to like, I'm trying to, I didn't draw this, the lines on the score but I'm trying to generate this thing. And so there's nothing constraining it in the latent space to be on that same trajectory so it doesn't necessarily pick something.

00:11:55.000 | Okay, so each trajectory at T8, T5, T3, they each try to optimize as much as they can and therefore they arrive at the different, that's the intuition, thank you.

00:12:05.000 | Yeah, they're always, always like this T equals three is just trying to get this, this like little fuzzy airplane noise back, or something like that, not actually not the fuzzy airplane, sorry, I misspoke.

00:12:18.000 | Just some noise that produces something like that corresponds to that fuzzy airplane thing.

00:12:25.000 | That, does that make sense?

00:12:29.000 | A little bit.

00:12:30.000 | Yes, yes it does. Thank you. Okay, great. Okay, so okay so this fancy plot is sort of the alternative that I just alluded to. And we're not.

00:12:41.000 | These, these stochastic differential equations, we're not going to really talk about this and this is like a thing that you don't really need to understand so I just got it from this diagram so you can just kind of ignore the SDE part.

00:12:57.000 | It looks almost the same, it's slightly different, doesn't really matter but it's actually a cool diagram because you can see this, there's these stochastic differential equations that are like sort of modeling this random process by which, you know, sort of, you have, you have noise,

00:13:16.000 | and it has a continuous instead of a discrete.

00:13:22.000 | It's a continuous instead of a discrete process so like adding noise to like on your left hand side, very very left is the data and these, these two, like normal looking distribution so there's like bimodal normal distribution.

00:13:36.000 | And then, and then what this differential, so this is sort of like heat, heat diffusion right and as time like time is going to the right and as, as the, you know, sort of like heat that is kind of sort of condensed in those two the modes of these two Gaussians

00:13:58.000 | like over time it kind of spreads out and, and ends up, you know, sort of like in the middle with this prior here, and that's not exactly what would happen because it's not quite this, it's not the heat equation but like it's a useful intuition for what's happening

00:14:13.000 | here. So that like, there's some differential equation and it's describing this, like sort of diffusion process by which everything ends up in this sort of like literally Gaussian prior for the distribution.

00:14:27.000 | And so this is where we generate our noise. And then we use the backwards process to get back to these, you know, these distributions, these bimodal distribution, and there's like a each one of these squiggly lines is a trajectory from that some random process

00:14:50.000 | could have taken based on the sort of based on sort of likelihood that is imposed on the field by the, by the diffusion process it in the colored diffusion process in the background.

00:15:07.000 | Right. So, and then, and then this probability flow ODE is sort of a deterministic version that looks at what is the maximum likelihood path.

00:15:23.000 | If I started at that trajectory. Right. So in this, so like looking at what's the easiest one to see is this bottom white one. So if you look, what this is saying is I started here.

00:15:34.000 | What is the maximum likelihood path for me to get to somewhere in this middle area? Right, so you can imagine if I sampled a huge number of these SDEs.

00:15:49.000 | And then I took the each one of these columns I took the, you know, sort of the means, then I would get. If I started here I would sort of follow this path right to get to this point.

00:16:04.000 | And I'm not speaking super accurately but that's sort of the intuition. Okay, so this looks quite different from this previous diagram but it's actually the same kind of space.

00:16:19.000 | So I want to, and that's, so I want to.

00:16:22.000 | So, in this diagram you would have, you know, maybe a chunk right here, which is, like, I did a discrete Gaussian blurring of my data, and then I did another one and another one and another one, and that's discrete. And so

00:16:41.000 | this is what happens if you just make the number of those discrete, discrete Gaussian blurrings infinite and infinitely small.

00:16:53.000 | Okay, so and then the other thing, just to look at this is a score function, we don't have to really understand it right now and or not too much ever in this talk but the score function is more or less.

00:17:08.000 | This is the mathematical mechanism by which the reverse diffusion can happen, and inside of this score function is our U-Net model, right, so our U-Net is there.

00:17:26.000 | And so I'm using that U-Net to predict what's called the marginal distribution, the probability for that data, and then that's what helps it to figure out how to do this process in reverse.

00:17:43.000 | Okay.

00:17:45.000 | I know this is not super clear.

00:17:48.000 | Hopefully it will become slightly more clear. And then just to be clear, this is a U-Net. I don't think it really matters, but this is the name of the particular U-Net model.

00:17:59.000 | And then I, this is sort of what I just said it estimates a gradient of the log probability density with respect to data at a given time step.
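
A minimal sketch of what "gradient of the log probability density with respect to the data" means, on a one-dimensional bimodal density like the one in the diagram. The mode locations (-2 and 2) and width (0.5) are arbitrary assumptions; in a real model the analytic `score` below is what the network has to approximate.

```python
import math

def mixture_pdf(x, means=(-2.0, 2.0), std=0.5):
    """Equal-weight mixture of two 1-D Gaussians (a bimodal 'data' density)."""
    norm = 1.0 / (std * math.sqrt(2 * math.pi))
    return sum(0.5 * norm * math.exp(-0.5 * ((x - m) / std) ** 2) for m in means)

def score(x, means=(-2.0, 2.0), std=0.5):
    """Analytic d/dx log p(x): a responsibility-weighted pull toward each mode."""
    weights = [math.exp(-0.5 * ((x - m) / std) ** 2) for m in means]
    total = sum(weights)
    return sum(w / total * (m - x) / std ** 2 for w, m in zip(weights, means))

def score_fd(x, h=1e-5):
    """Finite-difference check of the same gradient."""
    return (math.log(mixture_pdf(x + h)) - math.log(mixture_pdf(x - h))) / (2 * h)

for x in (-1.0, 0.0, 1.0, 2.0):
    # Positive score means "density increases to the right", negative to the left.
    print(x, round(score(x), 4), round(score_fd(x), 4))
```

The score points toward the nearest mode and vanishes at the symmetric midpoint, which is the kind of field the reverse process follows.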

00:18:08.000 | And, most importantly, that's where your U-Net lives.

00:18:13.000 | This, this is just this works in latent space too so this is like all the other stuff that is in that in that first diagram that's all here in the left hand side.

00:18:27.000 | Okay, good. So now let's talk about consistency model so you can see this, this looks a lot like this, but it's slightly different right, same author so not surprising, but, um, so this is sort of the left hand side.

00:18:40.000 | Right. And then we got rid of all the squiggly SDE stuff.

00:18:45.000 | So, uh, so what's happening here and this is this goes back to the discussion about trajectories these are trajectories so you start, you know, in the forward direction you start at one of these green dots and then you follow this trajectory and you end up

00:18:59.000 | somewhere in this noise distribution here, so that this structured data turns into a Gaussian distribution. And then we have this

00:19:16.000 | probability flow ordinary differential equation (I think it's probability flow), and it's a differential equation that can map from a given point here back to a given point here.

00:19:33.000 | So then we can do that because since it's deterministic, then there's exactly one for any one place here there's exactly one place here.

00:19:42.000 | Right. And so, because you have that, what a consistency model does is it says that everything should be on the same trajectory. Right, so I'm going to, if I estimate it, I can.

00:19:55.000 | I'm going to learn how to map from any point on this trajectory to the to this point in the data so including these very left hand things.

00:20:07.000 | And you'll see that it's useful to be able to map to the middle in a minute, but, but the point is that the differential equation solver tries to find the parameters of the differential equation, such that if I know this, then I know this.
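
A minimal sketch of the self-consistency idea on a toy problem where everything has a closed form. It assumes one-dimensional Gaussian data with standard deviation `SIGMA_D` and the noising rule x_t = x_0 + t * z, for which the deterministic ODE is dx/dt = t * x / (SIGMA_D^2 + t^2). None of these choices come from the talk; they are picked so the exact consistency function can be written down.

```python
import math

SIGMA_D = 0.5  # std of the toy 1-D Gaussian data distribution (assumed)

def ode_velocity(x, t):
    """dx/dt = -t * score_t(x), with score_t(x) = -x / (SIGMA_D**2 + t**2)."""
    return t * x / (SIGMA_D ** 2 + t ** 2)

def trajectory(x_start, t_start, t_end, steps=20000):
    """Follow the ODE with small Euler steps and return the end point."""
    x, dt = x_start, (t_end - t_start) / steps
    for i in range(steps):
        x += ode_velocity(x, t_start + i * dt) * dt
    return x

def consistency_fn(x_t, t):
    """Closed-form map from any point on a trajectory back to its t = 0 origin."""
    return x_t * SIGMA_D / math.sqrt(SIGMA_D ** 2 + t ** 2)

x0 = 0.3
# Three different points on the same trajectory, at increasing noise levels.
points = [(trajectory(x0, 0.0, t), t) for t in (0.5, 2.0, 10.0)]
# The consistency function sends all of them back to (nearly) the same origin.
origins = [consistency_fn(x, t) for x, t in points]
```

A trained consistency model plays the role of `consistency_fn` when no closed form is available: one evaluation, from any noise level, lands on the data end of the trajectory.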

00:20:25.000 | Okay.

00:20:26.000 | And the link. If you want to learn about that, that's the sort of some, the first link about that.

00:20:33.000 | And then, so just to tie it. I want to just tie this back to this is the same information even the same paper.

00:20:40.000 | But I just want to tie it back to our, you know, sort of noising and denoising idea is that we're, you know, sort of like mapping this so this is one trajectory in this case it's straight but like this is one of those curvy trajectories from the previous

00:20:55.000 | diagram, and it's mapping back to this data, this image at x zero this is the data that this particular noise.

00:21:08.000 | Best corresponds to.

00:21:10.000 | Right. So, um, and, and, you know, so the idea is just that I'm mapping all the way from any point on that trajectory.

00:21:20.000 | And this is why it's useful to useful to be able to map from any point in the trajectory is that now I can, if I want to.

00:21:31.000 | And most of the time it's two times but if I want to.

00:21:37.000 | If I want to do like refine my image by throwing more compute at the problem what I can do is I can, I can add some this, you know, sort of these these time steps, I can divide my space up into time steps.

00:21:55.000 | I can sample noise, I can re-add it to the trajectory at that time step, and then I can refine based on that noise. And that seems to help the model hone in on the data, just having another opportunity to refine.

00:22:18.000 | I can put it on a slightly different trajectory and help the model to learn from a closer space.

00:22:24.000 | what the original data was. So for example, let's go back here: in step one I start at T, I have this noise, I go all the way to x zero, but then I add back enough noise to get back to this step here, and then I follow

00:22:40.000 | where that lands me, so I will be slightly off this trajectory. So maybe this is a better model, if you think of multiple trajectories: maybe I end up on a different trajectory. That's not what this diagram is for, but I end up on a different trajectory,

00:22:53.000 | so that my, my data is slightly different, the data that I'm that that I'm outputting is slightly different. So I want to just any questions in the comments.
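
A minimal sketch of the sample, re-noise, refine loop described here, using a closed-form stand-in for the trained model. It assumes one-dimensional Gaussian data with standard deviation `SIGMA_D` and noising x_t = x_0 + t * z; the maximum noise level (80) and the two refinement levels are illustrative assumptions.

```python
import math
import random

SIGMA_D = 0.5  # toy data std, an assumption for illustration

def consistency_fn(x_t, t):
    """Stand-in for a trained consistency model (exact for 1-D Gaussian data)."""
    return x_t * SIGMA_D / math.sqrt(SIGMA_D ** 2 + t ** 2)

def multistep_sample(rng, t_max=80.0, refine_times=(2.0, 0.5)):
    """One pass from pure noise, then optional re-noise / denoise rounds."""
    x = consistency_fn(rng.gauss(0.0, t_max), t_max)  # single pass: noise -> data
    for t in refine_times:                            # decreasing noise levels
        x_t = x + t * rng.gauss(0.0, 1.0)             # add noise back up to level t
        x = consistency_fn(x_t, t)                    # jump back to data again
    return x

rng = random.Random(0)
samples = [multistep_sample(rng) for _ in range(20000)]
sample_std = math.sqrt(sum(s * s for s in samples) / len(samples))
one_shot = [multistep_sample(rng, refine_times=()) for _ in range(20000)]
one_shot_std = math.sqrt(sum(s * s for s in one_shot) / len(one_shot))
```

Because fresh noise is drawn at each round, the refined sample generally sits on a different trajectory from the first one, which is the "slightly different data" effect mentioned above. In this toy case the stand-in is already exact, so both variants reproduce the data spread.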

00:23:06.000 | I keep just plowing ahead or do people want to come in or ask questions.

00:23:12.000 | I'm not seeing any questions in the comments. Okay, yeah, what are some good open source models I don't know if you're a chime in on that RJ.

00:23:20.000 | Yeah, um, especially if they can generate images with good text.

00:23:26.000 | I, I, well I, I'm actually not the best person to answer that I think that, like the xl three and flux, but those are not this type of model I don't think.

00:23:42.000 | Perfect. That's what Alexander suggested as well. And then you would add your own text LoRAs.

00:23:47.000 | Yeah, exactly flux flux is, is great for text as well as the 3.5.

00:23:54.000 | Yeah. Additionally, you can always add some LoRAs.

00:23:58.000 | There's plenty of LoRAs for making better text.

00:24:05.000 | I can do this right now actually, I have ComfyUI open, so if anyone wants to test I can just throw in some text and generate some images quickly.

00:24:18.000 | Awesome. So, so like, but I think there's like maybe this is actually a good slightly good talking point so these, these types of models are not like these consistency models and there's like related models there.

00:24:30.000 | So, I, with the exception of maybe some of the closed sourced models from.

00:24:36.000 | I think that.

00:24:38.000 | Well, a lot of these papers come from OpenAI so I suspect they're maybe using some of these techniques, maybe. But I haven't really looked at the sort of big models that are out there to see if any of them are using any of these techniques but like

00:24:55.000 | these this is pretty cutting edge research so none of this technology that we're discussing today is in any of really in any of the big models that we know and love, with, maybe, maybe with the exception of flux.

00:25:12.000 | I don't, but that's actually a great question I'm going to look, look into that as soon as this is over.

00:25:20.000 | Okay, so let's, let's keep going in. So this is now so this.

00:25:26.000 | These ODE solvers, they're discrete-time solvers, so they chunk up the time into little chunks and then they

00:25:36.000 | solve the ODE as best they can, given the discretization, the quantization.

00:25:42.000 | But that can cause errors. And so that's what this diagram is trying to present: if this delta t here is very big, you see it goes far from x at t to x at t minus delta t, then the error that it can have is very big, and that can put it

00:26:01.000 | on a different trajectory, so you get the wrong trajectory.

00:26:06.000 | And then, if the discretization is maybe a bit smaller, then the error might be a little bit smaller, and then you'll get a closer but not the same trajectory, and that's sort of the point. And then in continuous time, in theory, because your discretization

00:26:24.000 | is infinitely small, then you, you, you, it's like, quote unquote, impossible to get on the wrong trajectory, although there are still obviously ways that you can.

00:26:37.000 | But that's the theory anyway.
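
A minimal sketch of the step-size effect, using a toy ODE with a known exact solution so the error can be measured. The ODE dx/dt = t * x / (SIGMA_D^2 + t^2) and the start point are assumptions chosen for illustration; the solver is plain Euler, not the solver used in any particular paper.

```python
import math

SIGMA_D = 0.5  # assumed toy data std

def velocity(x, t):
    """Right-hand side of the toy ODE."""
    return t * x / (SIGMA_D ** 2 + t ** 2)

def exact(x_start, t_start, t_end):
    """Closed-form solution of the toy ODE between two times."""
    return x_start * math.sqrt((SIGMA_D ** 2 + t_end ** 2) / (SIGMA_D ** 2 + t_start ** 2))

def euler(x_start, t_start, t_end, steps):
    """Fixed-step Euler integration (steps go backward when t_end < t_start)."""
    x, dt = x_start, (t_end - t_start) / steps
    for i in range(steps):
        x += velocity(x, t_start + i * dt) * dt
    return x

x_T, T = 3.0, 5.0   # start from a noisy point and integrate back to t = 0
truth = exact(x_T, T, 0.0)
errors = {n: abs(euler(x_T, T, 0.0, n) - truth) for n in (2, 10, 100, 1000)}
for n, err in errors.items():
    print(n, err)
```

With few steps the solver ends at a noticeably different point, that is, on a different trajectory; the error shrinks as the step size does, which is the motivation for going to continuous time.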

00:26:40.000 | So in this, I was scratching my head a little bit, I didn't see the ODE solver. Sorry, the continuous models don't use an ODE solver. They do actually in the paper reference using ODE solvers in certain steps, but they're not essential to the process

00:26:59.000 | here is, I think the point that they're making.

00:27:03.000 | And this unbiased estimator.

00:27:06.000 | They don't give any information on how to do this, so I think what they just mean is an empirical estimation of the marginal distribution, but I think they left that as an exercise for the reader, or somewhere in the appendix that I didn't see.

00:27:20.000 | Okay, so I'm going to get very mathy for a minute.

00:27:24.000 | I know, I like I understand that, like, first of all, even, you know, the math geniuses among us might have trouble following because there's just too little context and unless you really read the paper carefully you're not gonna.

00:27:40.000 | So, you're not going to be able to follow super well, but I want to just call out a few things. There are these two equations, one and two, that I'm going to reference a little bit to explain how they accomplish what they did.

00:27:54.000 | And, and also these.

00:27:56.000 | There's this, this is sort of this canonical equation that everybody has been using and when I say everybody I think I mean, mostly the same author in previous papers but also some other people.

00:28:08.000 | And so this is sort of the canonical setup, and then there are these constraints on c_skip and c_out.

00:28:18.000 | And the interesting thing to note, and the important reason those are there, is this: t is the time variable in c_skip and c_out, and if time is zero,

00:28:29.000 | that means that you just have the data, right, and with c_out at zero you have none of the contribution from your neural network.

00:28:42.000 | Right, so this is a way to guarantee that at time zero, you're getting back your data.

00:28:50.000 | And then c_skip and c_out just trade off in some way between each other, and as time gets later, you are giving more weight to the neural network.
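
A minimal sketch of the boundary condition, assuming the EDM-style coefficients c_skip(t) = sigma_d^2 / (sigma_d^2 + t^2) and c_out(t) = sigma_d * t / sqrt(sigma_d^2 + t^2). These formulas are written from memory of the earlier literature rather than taken from the talk, and `untrained_network` is a made-up stand-in for the neural network.

```python
import math

SIGMA_D = 0.5  # assumed data std

def c_skip(t):
    """Weight on the input itself; equals 1 at t = 0."""
    return SIGMA_D ** 2 / (SIGMA_D ** 2 + t ** 2)

def c_out(t):
    """Weight on the network output; equals 0 at t = 0."""
    return SIGMA_D * t / math.sqrt(SIGMA_D ** 2 + t ** 2)

def f_theta(x, t, network):
    """Model output: skip connection plus scaled network output."""
    return c_skip(t) * x + c_out(t) * network(x, t)

def untrained_network(x, t):
    """Arbitrary stand-in: the boundary condition must hold whatever this returns."""
    return 123.0 * x + 7.0

x = 1.7
at_zero = f_theta(x, 0.0, untrained_network)   # must hand back x unchanged
weights = [(t, c_skip(t), c_out(t)) for t in (0.0, 0.5, 5.0, 80.0)]
for t, skip, out in weights:
    print(t, round(skip, 4), round(out, 4))
```

The printed weights show the trade-off: as t grows, the skip weight shrinks toward zero and the network weight grows toward `SIGMA_D`.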

00:29:08.000 | Okay.

00:29:09.000 | And then, so then, and then this is just, I don't think we need to go through this in any detail. This is just the training objective.

00:29:17.000 | And then if you look at this equation two is just that, you know, sort of instantiated with this certain kind of distance function.

00:29:31.000 | So you'll see a distance function appears here, this, you know, L2 loss is being used, which is just squared error loss that's being used.

00:29:45.000 | And then you, you know, sort of plug everything in, you, you know, you sort of take the limit as t goes to zero, or delta t goes to zero and you get this thing out and there's a proof of it elsewhere.

00:29:58.000 | And this, this tangent function is kind of the main actor here. So, like, you don't really have to care again, what this is just if you see this thing.

00:30:12.000 | Or if you see like this delta maybe then think tangent function, mostly.

00:30:19.000 | Okay, so I know that this is like probably like pretty opaque, it's, I would be surprised if it wasn't unless you read the paper carefully.

00:30:28.000 | And then, so, again reminding you: this is great, but it has a problem in that it's unstable and hard to understand.

00:30:41.000 | So, what they did was they took this is just a repeat of what we saw there's nothing different here.

00:30:48.000 | Just previously, so they took and they, this is like the old way of doing things where you have the, this is for, you know, for these discrete, mostly being used for discrete but also for continuous and they have these, you know, sort of, they did some derivations

00:31:03.000 | and got these values for C skip and for C out and for C in, and, and then they have all these stability problems, and so they came up with this other idea and we'll motivate this a little bit in a minute.

00:31:20.000 | They use cosine and sine, and this one over sigma d for the, for the C skip C out and C in, and then when you plug that in, then you end up with this instead.

00:31:34.000 | Right. So nothing magical happened here. I'm just plugging these values into my f theta, and I get this.
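
A minimal sketch of the cosine and sine parameterization, assuming the form f(x_t, t) = cos(t) * x_t - sin(t) * sigma_d * F(x_t / sigma_d, t) with t between 0 and pi/2, and noisy inputs built as x_t = cos(t) * x0 + sin(t) * z. This is my reading of the setup rather than something spelled out in the talk, and `oracle_network` is a hypothetical cheat that knows x0 and z, used only to show that the algebra closes.

```python
import math

SIGMA_D = 0.5  # assumed data std

def trigflow_f(x_t, t, network):
    """f = cos(t) * x_t - sin(t) * SIGMA_D * F(x_t / SIGMA_D, t)."""
    return math.cos(t) * x_t - math.sin(t) * SIGMA_D * network(x_t / SIGMA_D, t)

def make_noisy(x0, z, t):
    """Interpolate between data x0 (t = 0) and noise z (t = pi/2)."""
    return math.cos(t) * x0 + math.sin(t) * z

def oracle_network(x0, z):
    """Hypothetical network that returns (cos(t) * z - sin(t) * x0) / SIGMA_D."""
    return lambda _x_scaled, t: (math.cos(t) * z - math.sin(t) * x0) / SIGMA_D

x0, z, t = 0.3, -0.8, 1.1
x_t = make_noisy(x0, z, t)
# cos^2 + sin^2 = 1 makes the noise terms cancel and leaves exactly x0.
recovered = trigflow_f(x_t, t, oracle_network(x0, z))
# At t = 0 the network term is multiplied by sin(0) = 0.
at_zero = trigflow_f(1.7, 0.0, oracle_network(x0, z))
```

The same boundary behaviour as before falls out of the trigonometry: cos(0) = 1 keeps the input and sin(0) = 0 removes the network.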

00:31:46.000 | Okay, so now here's, this is the thing. And this is I, in my opinion, the meat of the paper. So, you have this, you have this part of the, you have this tangent function that I had called out.

00:32:01.000 | And this is like if you plug stuff in again into that tangent function there's nothing magical here I'm just plugging that cosine and sine and everything in there, and you get this big long thing here that's really hard to read.

00:32:15.000 | And in the paper they talk about, okay, if you look at just this thing, this is stable; this is not causing instability.

00:32:26.000 | This thing here is not causing instability. Oh, this part, this sine times this expression, is causing instability. So, let's look there. Okay, so then they say, okay look, this part is also not causing instability, it's this part.

00:32:43.000 | So we have the sine with a differential of F.

00:32:48.000 | And then, so I take that, and, and, and so now we're going to address all the sources of instability in these three, or not necessarily these three classes but in these this expression.

00:33:01.000 | Right, so let's go through piece by piece and decide and figure out what is causing all of the instability you'll notice this is the chain rule.

00:33:14.000 | I hope you guys are following I'm trying to keep it very as understandable as possible. So, any questions.

00:33:24.000 | No, I was able to follow the last slide. Thank you for breaking it up and reminding us of chain rule. Yeah, I think I'm going to get started getting lost here but you're doing a great job.

00:33:34.000 | Okay, good. So, this, and then the noise. Um, this, this was just from the previous paper papers again, um, I, they just, you know, sort of give it to you and say go read the paper if you want to know, and they, so they have this.

00:33:49.000 | C noise which you'll recall appears here, right, the C noise.

00:33:56.000 | And this is basically just a coefficient function that can alter the rate at which time changes, basically, and they are saying this definitely will cause instability at t equals pi over two, because

00:34:16.000 | it's zero and so therefore this goes to infinity. So, definitely can't have that, so they just say let's just set it to t, and they don't really motivate this except that this just makes time pass at a constant rate, which makes, I think, intuitive sense.
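
A minimal sketch of why the time-warping function matters, assuming the earlier choice was c_noise(t) = log(sigma_d * tan(t)). That formula is my recollection of the prior setup and is not stated in the talk. Its derivative is 1 / (sin(t) * cos(t)), which grows without bound as t approaches pi/2, while c_noise(t) = t has derivative 1 everywhere.

```python
import math

SIGMA_D = 0.5  # assumed data std

def old_c_noise(t):
    """Assumed earlier choice: log(SIGMA_D * tan(t))."""
    return math.log(SIGMA_D * math.tan(t))

def old_c_noise_rate(t):
    """Its time derivative, 1 / (sin(t) * cos(t))."""
    return 1.0 / (math.sin(t) * math.cos(t))

def new_c_noise_rate(t):
    """With c_noise(t) = t, time passes at a constant rate."""
    return 1.0

half_pi = math.pi / 2
rates = {gap: old_c_noise_rate(half_pi - gap) for gap in (1e-1, 1e-3, 1e-6)}
for gap, rate in rates.items():
    # The closer t gets to pi/2, the larger the factor multiplying the gradient.
    print(gap, rate, new_c_noise_rate(half_pi - gap))
```

Since this rate multiplies other terms in the chain rule, any blow-up here is passed straight into the tangent.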

00:34:32.000 | Okay, and then. So now, so that's like this C noise from this page.

00:34:38.000 | It appears here, it appears here, it appears here. Right. And then, okay, so now what about this embedding thing.

00:34:48.000 | So what they say is, this is sort of basically, you know, similar to the Fourier.

00:34:55.000 | This is a Fourier embedding similar to what you have in the positional embeddings in transformers, and there's a reason for that, because this is using the attention block.

00:35:08.000 | But they point out that the scale that they had set here was 16. That's very high and it causes lots of instability, including that right when you get to pi over two, it goes to infinity, so that's obviously bad.

00:35:27.000 | So they played with this and they, I think, empirically determined that, oh, if we just make this really small, to this value. I don't know where they got 0.02, but it just basically corresponds to the same positional embeddings that you see in the transformer

00:35:42.000 | in Attention Is All You Need.

00:35:48.000 | Hopefully that's also not so hard I think they were very hand wavy so I don't think.

00:35:54.000 | So what they say is just set s to 0.02. Yeah, that's right.
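
A minimal sketch of how the embedding scale feeds into the time derivative, assuming each embedding channel has the form sin(scale * 2 * pi * w * t + p) with random frequency w and phase p. The exact embedding form, the channel count (256), and the sampling of w and p are assumptions for illustration; only the two scale values, 16 and 0.02, come from the discussion.

```python
import math
import random

def fourier_embedding(t, freqs, phases, scale):
    """Assumed form: sin(scale * 2*pi * w * t + p) per channel."""
    return [math.sin(scale * 2 * math.pi * w * t + p) for w, p in zip(freqs, phases)]

def embedding_rate(t, freqs, phases, scale):
    """d/dt of each channel: scale * 2*pi * w * cos(scale * 2*pi * w * t + p)."""
    return [scale * 2 * math.pi * w * math.cos(scale * 2 * math.pi * w * t + p)
            for w, p in zip(freqs, phases)]

def rms(values):
    """Root mean square, used as a single-number size of the derivative."""
    return math.sqrt(sum(v * v for v in values) / len(values))

rng = random.Random(0)
freqs = [rng.gauss(0.0, 1.0) for _ in range(256)]
phases = [rng.uniform(0.0, 1.0) for _ in range(256)]

rate_large = rms(embedding_rate(0.8, freqs, phases, scale=16.0))
rate_small = rms(embedding_rate(0.8, freqs, phases, scale=0.02))
print(rate_large, rate_small)
```

The scale multiplies the derivative directly, so dropping it from 16 to 0.02 shrinks this factor by orders of magnitude.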

00:36:00.000 | Man I wish I had the math to understand how they got to the intuition or maybe it's empirical but okay. I think, I think they.

00:36:08.000 | Yeah, I think that basically what it scales, the coefficients to the same scale that the ones in attention out is all you need are at. I think that's, and I don't I don't think it's, I think it's all algebra to do that.

00:36:23.000 | If I understood if I recall and understood correctly.
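
To make the scale argument concrete, here is a minimal sketch assuming a random Fourier embedding of the form (cos(2*pi*w*c), sin(2*pi*w*c)) with frequencies w drawn as scale times a standard normal. The exact embedding used in the paper may differ, and all names here are hypothetical. The point is only that the derivative of the embedding with respect to its input grows linearly with the scale, so 16 versus 0.02 is a very large difference in sensitivity.

```python
import math
import random


def sample_frequencies(scale, dim, seed=0):
    """Draw `dim` frequencies as scale * N(0, 1); same seed gives same draws."""
    rng = random.Random(seed)
    return [scale * rng.gauss(0.0, 1.0) for _ in range(dim)]


def fourier_embedding(c, freqs):
    """Embed a scalar c as interleaved cos/sin features."""
    out = []
    for w in freqs:
        out.append(math.cos(2 * math.pi * w * c))
        out.append(math.sin(2 * math.pi * w * c))
    return out


def embedding_derivative_norm(freqs):
    """Norm of d(embedding)/dc.

    Each (cos, sin) pair contributes (2*pi*w)^2 to the squared norm,
    independent of c, so the result depends only on the frequencies.
    """
    return math.sqrt(sum((2 * math.pi * w) ** 2 for w in freqs))


large_scale = embedding_derivative_norm(sample_frequencies(16.0, 64))
small_scale = embedding_derivative_norm(sample_frequencies(0.02, 64))
print(f"sensitivity ratio (s=16 vs s=0.02): {large_scale / small_scale:.1f}")
```

Because both frequency sets come from the same random draws, the ratio is just 16 / 0.02, which is why shrinking the scale tames the time derivative that flows through this embedding.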

00:36:28.000 | Thank you. Um, yeah.

00:36:30.000 | And then this adaptive double normalization, I don't think it's that important; they didn't really talk about it much. This is everything they say about it here.

00:36:38.000 | Basically they just say, okay, there's this adaptive normalization, we think it works, but it doesn't work in our case, so we just do it twice, and that seems to work.

00:36:48.000 | So, I don't think we need to talk about it very much.

00:36:52.000 | So you can see like there's a bunch of stuff they're stacking on top of each other.

00:36:56.000 | Okay, and then there's this tangent normalization. They tried this, and then they also tried just clipping to between negative one and one.

00:37:05.000 | And they see, oh yeah, with our models normalization has a lower FID score. This is Fréchet inception distance; it's a measure of quality, or perceptual closeness, of images.

00:37:24.000 | So,

00:37:27.000 | you can see, with either two steps or one step in our consistency model,

00:37:35.000 | we do better

00:37:37.000 | if we have normalization, and clipping might be good enough, because it's obviously cheaper than doing this normalization here.
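
The two options mentioned above can be sketched as follows. This assumes normalization means dividing the tangent vector g by (norm(g) + c) with a small constant c, and clipping means clamping each entry to the range negative one to one. The constant c = 0.1 and the function names are assumptions for illustration.

```python
import math


def normalize_tangent(g, c=0.1):
    """Rescale g by 1 / (||g|| + c), so the result always has norm below 1."""
    norm = math.sqrt(sum(x * x for x in g))
    return [x / (norm + c) for x in g]


def clip_tangent(g, lo=-1.0, hi=1.0):
    """Cheaper alternative: clamp every entry of g to [lo, hi]."""
    return [min(max(x, lo), hi) for x in g]


# A tangent with a couple of exploding entries.
g = [300.0, -400.0, 0.5]
print("normalized:", normalize_tangent(g))
print("clipped:   ", clip_tangent(g))
```

Normalization keeps the direction of the tangent and only shrinks its length, while clipping changes the direction when some entries are much larger than others; that is one plausible reason the two behave differently even though clipping is cheaper.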

00:37:50.000 | And then finally, they have this adaptive weighting. Basically, I think all you need to understand is they throw this

00:38:00.000 | weight term into the loss function. So this is the loss function for the optimizer when they're doing the gradient descent, and they throw this weight term in there.

00:38:16.000 | And that comes from right somewhere here.

00:38:22.000 | Right, so this is our gradient, and this weight term isn't here. So when they derive the loss function they just throw in that weight term, and that seems to help a little bit.

00:38:36.000 | Right.

00:38:38.000 | This yellow.

00:38:43.000 | Oh, no, actually, are they saying.

00:38:47.000 | No, it looks like it's better in some cases, but as you go it looks like it's worse, or no, I guess...

00:38:56.000 | Sorry, if you do two steps, then it's worse; if you do one step, it's slightly better. So it doesn't look like it matters that much.

00:39:06.000 | And then tangent warmup; again, you probably don't need to understand that. The sine t, they just put r in front of it, and that r just literally increases from zero to one over the first 10k iterations, and they just do this because it's unstable

00:39:20.000 | in the beginning.
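
The warmup described here is just a ramp. A minimal sketch, assuming the ramp is linear in the training step (the function name and the linear shape are assumptions; the talk only says r goes from zero to one over the first 10k iterations):

```python
def warmup_coefficient(step, warmup_steps=10_000):
    """Return r, rising linearly from 0 to 1 over the first `warmup_steps` steps."""
    return min(1.0, step / warmup_steps)


# r multiplies the unstable term in the tangent, so that term is switched
# off at the start of training and fully on after the warmup period.
for step in (0, 2_500, 5_000, 10_000, 50_000):
    print(step, warmup_coefficient(step))
```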

00:39:22.000 | No need to discuss a lot.

00:39:25.000 | Okay, and so when you stack all of these things together, you're able to train much more effectively, and continuous time does much better than these discrete ones. So this N is the number of discrete steps that your model is taking.

00:39:45.000 | And maybe one interesting thing about this plot is that the best you can do is at 1024, and then it gets worse. So it's better from here to green, and then green to purple gets way better, and then it

00:40:01.000 | gets worse again. So at some point

00:40:05.000 | the issues that they brought up start to matter.

00:40:11.000 | Why is continuous so much better than discrete? Is it because it's just continuous and therefore it's easy to learn? But I thought the continuous was unstable as well.

00:40:20.000 | No, all of these techniques that we just talked about in the last few slides are making continuous stable. So then they are able to train more effectively.

00:40:34.000 | In the previous attempts, which I think other authors actually did, they found that they had to be very conservative in the way that they trained the model in order to avoid all the instabilities.

00:40:52.000 | And so because of that they were not able to get good results, whereas now they're able to. It's sort of like this intuition that they have right here.

00:41:01.000 | It's this quantization: if I have only a few steps, the time delta is very big between these, and I'm going to have a lot of error in my tangent calculation.

00:41:12.000 | Right, so this tangent is that tangent that we talked about. And if there's a big quantization here, or only a few time steps, then the ODE solver has only a limited amount of data to work with

00:41:30.000 | and it ends up making mistakes due to that discretization, and so it ends up having big errors and you get on the wrong trajectory.
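
A toy illustration of that discretization intuition, not the paper's setup: estimate the tangent of a known curve (sin, whose true slope is cos) with a backward finite difference, using a time delta that shrinks as the number of steps N grows. The curve, the evaluation point, and the names are all assumptions chosen so the true answer is known.

```python
import math


def finite_difference_tangent(f, t, dt):
    """Backward-difference estimate of df/dt using two neighboring time steps."""
    return (f(t) - f(t - dt)) / dt


def tangent_error(num_steps, t=1.0, t_max=math.pi / 2):
    """Absolute error of the estimate when [0, t_max] is cut into num_steps steps."""
    dt = t_max / num_steps
    estimate = finite_difference_tangent(math.sin, t, dt)
    return abs(estimate - math.cos(t))


for n in (4, 16, 64, 256, 1024):
    print(f"N={n:5d}  tangent error={tangent_error(n):.6f}")
```

The error shrinks roughly in proportion to the time delta. This toy does not reproduce the part of the plot where very large N gets worse again; it only shows why a coarse grid gives a poor tangent.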

00:41:39.000 | Makes sense. That makes sense. Yeah, and we also have a question from the chat: what does the FID metric evaluate?

00:41:47.000 | Yeah, that's a great question. I had to look it up myself, because I've seen it before and I forgot it completely.

00:41:53.000 | So it is a numerical measure of the difference between two images, but the reason why it's popular is that it was explicitly designed to match closely to human perceptual difference.

00:42:15.000 | There's a paper that talks about it, and I think it's in the bibliography, but I can dig it up if anyone's interested.

00:42:29.000 | It basically describes the previous methods that were being used, which were just, like, one of the divergence measures. Or not measures, divergences.

00:42:48.000 | They didn't match what humans were actually using visually to distinguish between images, so this one was designed to do a better job of that. It's supposed to be like a human perceptual distance metric between the input, like an input

00:43:13.000 | image. So, I think that's the meaning.

00:43:22.000 | Actually, you.

00:43:24.000 | Yeah, that's a great question. I don't know exactly how they do the experiment.

00:43:32.000 | I think that they do a forward pass with the image, and then they add some noise, and then they do the backwards pass and they see the difference between the images, but I'm not 100% sure. Does anyone know how this works

00:43:48.000 | exactly?

00:43:52.000 | No. Okay, well, so yeah that's actually, that's something that I want to follow up on.
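
For reference on the open question above: as FID is commonly defined, it compares two sets of images rather than two individual images. Each set is passed through a pretrained Inception network, a Gaussian is fit to the resulting feature vectors, and the score is the Fréchet distance between the two Gaussians. The sketch below is a simplified stand-in under loud assumptions: random vectors replace Inception features, and covariances are treated as diagonal so no matrix square root is needed. The helper names are hypothetical.

```python
import random
import statistics


def feature_stats(features):
    """Per-dimension mean and standard deviation of a list of feature vectors."""
    columns = list(zip(*features))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means, stds


def frechet_distance_diagonal(feats_a, feats_b):
    """Squared Frechet distance between Gaussians with diagonal covariance.

    With diagonal covariances the trace term reduces to the sum of squared
    differences of the per-dimension standard deviations.
    """
    mu_a, sd_a = feature_stats(feats_a)
    mu_b, sd_b = feature_stats(feats_b)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    cov_term = sum((a - b) ** 2 for a, b in zip(sd_a, sd_b))
    return mean_term + cov_term


rng = random.Random(0)


def fake_features(n, shift, dim=8):
    """Stand-in for Inception features: n Gaussian vectors centered at `shift`."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


real = fake_features(500, 0.0)
close = fake_features(500, 0.1)   # a generator whose outputs nearly match
far = fake_features(500, 2.0)     # a generator whose outputs are far off

print("real vs close:", frechet_distance_diagonal(real, close))
print("real vs far:  ", frechet_distance_diagonal(real, far))
```

Lower is better: a set of generated images whose feature statistics match the real set gets a score near zero.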

00:43:59.000 | Okay. And then I talked about tangent warmup, continuous versus discrete. Okay, so here's the part that everybody should be a little more comfortable with, hopefully: we're just evaluating models.

00:44:12.000 | You have this NFE, which stands for

00:44:18.000 | number of function evaluations. So this is how many iterations I have to do to generate my images.

00:44:30.000 | And you see that for all their numbers in this section of the paper, they compare two evaluations and one.

00:44:41.000 | And they do slightly better whenever they do two.

00:44:46.000 | So there are several things to note about this. One thing that they said in the paper, it's not here, is that

00:44:55.000 | it takes about 2x the compute to train this consistency model as a distillation of whatever they distilled from.

00:45:09.000 | Approximately twice the compute. So if you spend a lot of compute to train a model and then you want to distill it with this mechanism, you're going to pay about twice as much.

00:45:19.000 | So that could present an operational problem, or it might not.

00:45:26.000 | Another thing to note: these joint training sections, these are basically GANs, and you'll see some of the more common GANs here. I don't know exactly the difference between these guys, but this CTM was the sort of overall winner for this

00:45:44.000 | CIFAR-10 benchmark, and then for class-conditional ImageNet 64x64 it's a different one, but it's also in this joint training. So the GANs tend to be winning on these benchmarks.

00:46:03.000 | So my belief is that the reason why people don't use them is that they're hard to train and they have mode-seeking behavior, meaning it's hard to get any diversity and hard to control them.

00:46:16.000 | But for these benchmarks they do the best.

00:46:20.000 | And you see it down here.

00:46:23.000 | So their consistency model is actually quite close, for what it's worth. And this is from distillation.

00:46:35.000 | And then this is if you train from scratch, and they do better.

00:46:39.000 | So that's maybe another interesting thing: the distillation doesn't work quite as well as the training from scratch does.

00:46:48.000 | This is a place where someone might want to comment or ask questions, so I want to pause and make sure.

00:46:58.000 | Okay, so they also compared this variational score distillation, which is kind of in the same ballpark in terms of effectiveness, and they found that VSD has higher precision and lower recall, meaning lower diversity.

00:47:17.000 | And as the guidance scale gets higher it does worse. Because, if you'll notice, the diffusion teacher model and

00:47:37.000 | the models that they built are very close to each other in these cases, both for precision and recall, so you end up having a very similar FID score as well.

00:47:51.000 | And then another interesting thing here.

00:47:56.000 | So, a couple of things. Their model does quite well

00:48:05.000 | when it's trained and not distilled.

00:48:10.000 | Or sorry, let's see.

00:48:16.000 | The distillation doesn't do quite as well as the training, but neither of them does as well as the diffusion teacher, including the one that was trained from scratch. And you'll see also, interestingly,

00:48:32.000 | the two-step trained model does worse

00:48:46.000 | in the smaller models but better in the bigger models.

00:48:51.000 | Okay, and then this is their scaling study. We're kind of out of time so I won't talk too much, but you can see they did well; they have up to a 1.5B model.

00:49:05.000 | Yeah, that's sort of the main takeaway. Okay, so that's all I have.

00:49:13.000 | I can take any questions if people want to stick around for a few minutes.

00:49:24.000 | I know this is not at all an easy paper.

00:49:32.000 | I think it's one of the hardest papers. Yeah, I think the other one was probably PPO or PPO.

00:49:40.000 | I missed that paper so.

00:49:43.000 | Yeah, I mean, to me this was challenging for several reasons. One is just the topic: diffusion by itself is hard, and then you have this really obscure diffusion that is even more complicated, and you have to understand a little bit about

00:49:58.000 | differential equations and whatever. And then on top of that there's just a ton of literature to read in order to read the paper.

00:50:05.000 | So all those things, it's like a triple whammy.

00:50:13.000 | But I must say, I really, really enjoyed learning by reading, because I dug into all these papers, which I normally don't have time to do, but because I'm presenting I took the time to really look at all the references and try

00:50:27.000 | to understand things well enough to hopefully explain them right to other people. And that was a super valuable experience, and I'm glad I did that on such a hard paper.

00:50:40.000 | I feel like

00:50:43.000 | you guys felt that you got some of the intuition behind this.

00:50:51.000 | I know it's not easy I tried to focus on the intuitions for the paper, and not so much on the math.

00:51:03.000 | So, let me stop sharing.

00:51:08.000 | Great guys, if there's nothing else.

00:51:12.000 | You know, I thoroughly enjoyed doing this. I hope somebody new,

00:51:22.000 | someone who's never presented before, will take up the mantle for the next session. Or the next open session, that would be awesome.

00:51:31.000 | If anyone has any paper that they want to cover, it's great for you to voice it and say so, because we always welcome new paper presenters.

00:51:40.000 | Yeah, I think we had a volunteer for next week, but it's on the loom; I don't remember which one it was, but people were signing someone up.

00:51:49.000 | Awesome. Yay.

00:51:51.000 | More still needed.

00:51:56.000 | Okay guys, well, enjoy your Wednesday then.

00:52:01.000 | I'll see you guys on Discord.

00:52:04.000 | Thank you very much once again for presenting. My pleasure. Yeah, you're welcome.

00:52:10.000 | Bye bye.