Back to Index

Lesson 9B - the math of diffusion


Chapters

0:00 Introduction
2:19 Data distribution
6:38 Math behind lesson 9’s “Magic API”
18:50 CLIP (Contrastive Language–Image Pre-training)
27:04 Forward diffusion (Markov process with Gaussian transitions)
36:11 Likelihood vs log likelihood
42:16 Denoising diffusion probabilistic model (DDPM)
48:04 Conclusion

Transcript

Hello, everyone. My name is Wasim. I'm an entrepreneur in residence at fast.ai, and I'm currently at the fast.ai headquarters in Australia, although I'm originally from Cape Town, South Africa. And I'm joined here today by Tanishq. Tanishq works at Stability AI, and we've been working together with a couple of other people on diffusion models and generative modeling.

And that's been super fun. So, Tanishq, do you want to maybe, you know, introduce yourself as well? Yeah, so my name is Tanishq. I am a PhD student at UC Davis, but I also work at Stability AI. And I've been exploring and playing around with diffusion models for the past several months.

And so it's been great to also explore that with the fast.ai community in these last few weeks. Awesome. Cool. Cool. So this talk is about me trying to understand the math behind diffusion. So, you know, if you've done the fast.ai courses before, you know that you don't need to understand the math to be effective with any of these models.

In fact, you don't even need the math to do novel research and contribute to these models. But for me, it came out of interest. And, you know, I think it's kind of beautiful how diffusion models were discovered. And I think a large part of that was thanks to some really clever math.

And so I wanted to understand that. I don't have a math background, so I want to help describe how I think about it and how you can interpret all of these notations and things. Cool. Yeah, so I can just dive into it, I think.

So the first bit of math that we see in this paper is Q of X superscript zero. And they call this the data distribution. Do you want to mention exactly which paper this is? Good question. So this paper is the 2015 paper. Do you remember the authors of that paper, Tanishq?

I think it's Jascha Sohl-Dickstein, who now works at Google, I think. And it's from Surya Ganguli's lab. So cool. Yeah, so this was the paper, as far as I understand, that introduced this idea of diffusion. Yeah, 2015 by those authors. They start out by defining this data distribution, and they use this notation.

And already, like a lot of people, you know, myself included, find this quite confusing. But let's go through what's described here. So they have an X. And, you know, in math, X is often used as the input variable, much like Y, which is then used often as the output variable.

Yeah, and the fact that it has a superscript also implies something. So the fact that we have X superscript zero implies that there might be a sequence of Xs. And, you know, I think it's useful to get comfortable with this idea of simple compact notations implying a lot more than, you know, might be obvious at first glance.

So X implies something about this quantity: it's an input variable. And, you know, the zero implies that there might be other things. You might have an X1, an X2, and so on, and we'll see that shortly. And then the third part is you have Q.

And Q is what we call a probability density function. So the first part here is probability. And the question is, you know, what does Q have to do with probabilities? Well, it's because usually we use the letter P to describe probability density functions of interest. And then because Q is right after that, it's another common one.

So it's kind of like how you use X and Y. We use P and Q. And the fact that we use Q here instead of P is because it suggests that there might be a P that we'll introduce. And maybe P is the thing that we're modeling and Q is kind of supplementary to that.

Does that sound right, Tanishq? Yeah, yeah. And I think it's also helpful to maybe think about X0 in a more practical, concrete way. Of course, if we're working with images, then X0 is what's representing the images. So it's also useful to think about it from that concrete, practical approach as well.

Right. So X0 might be, you know, an MNIST digit. And then we got Q. So Q, I'll just use this to mean Q is some function. So we look at it as a box. And it takes in X0 and it gives us the probability that this X0, which is an image, looks like an MNIST digit.

So in this case, you know, this would be 0.9 or maybe even, yeah, it's 0.98. So this is quite high probability that this is an MNIST digit. Hi, this is Jeremy. Can I jump in for a moment? Please do. Oh, thank you. I just wanted to double check. This looks a lot like the magic API that we had at the start of lesson one that you feed in a digit and it gives you back a probability.

Is that basically what Q is doing here? Absolutely, yeah. It's a magic API. That's a good way to think of it. We don't know, we couldn't write down what Q is, but we imagine that somebody has it somewhere. Yeah, so this is a concrete example. And like, if you had to do something to this image, you might get a smaller number here.

So another thing worth mentioning here is probability density functions. So these are these magic APIs that give us a number, telling us how likely the thing is. You don't often see them; they very rarely make it all the way into your code.

But it turns out that they're very useful tools for working with random quantities, because they allow you to represent random quantities as functions, just ordinary functions. And because they're functions, you have a whole century's worth of math to analyze and understand them. So you'll often find probability density functions in papers, and eventually they work out to really simple equations or formulas that end up in your code.
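As a purely illustrative example of a density function reducing to a simple formula in code, here is the 1-D Gaussian PDF, a toy stand-in for the "magic API" q. The real q over images has no known closed form; this sketch just shows what a density function looks like as ordinary code:

```python
import math

def gaussian_pdf(x, mean=0.0, var=1.0):
    # Density of a 1-D normal distribution N(mean, var):
    # a toy stand-in for the "magic API" q.
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Points near the mean get a high density; far-away points get a low one.
print(gaussian_pdf(0.0))  # ~0.3989
print(gaussian_pdf(3.0))  # ~0.0044
```

Swap in a high-dimensional image for x and this is conceptually what q does, even though nobody can write its formula down.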

Do you want to add anything, Tanishq? I think that all sounds correct. Of course, I think you'll probably go over some examples of probability density functions, especially the ones relevant here. But yeah, it's useful to think about the sorts of functions you may have in a simplified case.

And that's what we're probably going to talk about next, right? Yeah, yeah, that's exactly what we'll talk about. So we have this Q of X0, and then we introduce another one. And like you said, this is going to turn out to have a really nice simple form. But before that, the next thing we define is Q of XT given XT minus one.

So we'll say what we define this to be, but to begin with, this is another probability density function. And this bar over here means it's a conditional probability density function, which you can think of as you are given the thing on the right to calculate probabilities over the thing on the left.

In this case, you can think of it as something that takes images. So maybe another magic API and produces other images. But we don't know what these look like yet because we haven't defined over here. And this, we would call XT minus one, which could be X0. And this would be XT, which in the X0 case would be X1.

Something worth noting here is this notation can be a little bit confusing, because we've said Q is one thing earlier and now we think Q is another thing. So this here, I'm going to need your help on this one, Tanishq. I think people would usually, in the strictest sense, define the first one like this, maybe, and the second one with a subscript.

And this notation that we see here on the left is just a shortcut where they wanted to save the space of writing that out, and implied it by the context. Is that true? Yeah. I think they use the variable Q and then, of course, later on, we see P to describe, as we'll see, different aspects of the diffusion model and its different processes.

So I think that's why they use the same variables to kind of demonstrate this is corresponding to this process and the other variable corresponds to the other process of the diffusion model. So we'll obviously go over that. So I think that's where those variables or those letters are being used in that matter.

But if you do want to make it more specific, more clear, yeah, I think that that notation is fine as well. Okay. Yeah, that makes sense. Okay. So let's describe what this Q does to the image on the left to produce the one on the right. So I'll start over here so we have more space.

I'll write it out first and then we can go into the details. Okay. So kind of like the bar, you can think of this semicolon as grouping things together. And so you have the things on the left and the things on the right. My understanding is these two things on the right are the parameters of the model, sorry, of the probability.

And the thing on the left is actually... Tanishq, could you help me understand what the thing on the left is? Do you know? Right. Well, so this is, again, a probability distribution. And the thing on the left is saying this is a probability distribution for this particular variable.

So that's just representing what it is a probability distribution for. And then the stuff on the right are the parameters for this probability distribution. So that's kind of what's going on here. So like, yeah, anytime you have a normal distribution and it's describing some variable, you'll have that sort of notation where it's the normal distribution of some variable.

And then these are the parameters that describe that normal distribution. Right. So just to clarify, the bit after the semicolon is the bit that we're kind of used to seeing to describe a normal distribution, which is the mean and variance of the normal distribution. So we're going to be sampling random numbers from that normal distribution according to that mean and that variance.

Is that right? Yes, that's correct. Yeah. So we need to describe a bit more there about normal distribution. We kind of skip past that. So we have this fancy N and fancy letters in math for distributions usually refer to well-known distributions. And the N here stands for normal, which is also known as the Gaussian distribution.

And it's probably the most well-known probability distribution there is. And when I say well-known, I mean that these things pop up everywhere. In all sorts of fields, measuring all sorts of things, it turns out they follow roughly something that looks like this distribution.

And because they pop up so much, people have studied all of their properties, and we understand them really well now. The reason they're used so often in cases like this is that they turn out to have really useful properties and they're easy to work with.

Some reasons are: they're described by just two parameters, called the mean and the covariance. Another property is that they have what people would call thin tails, which means that you only need to describe their behavior in a small region of space.

You can kind of just ignore the rest. Yeah. Do you mind drawing a quick example of a normal distribution? That's a good point. So let's say our random variable is just one-dimensional, so just a single float. This is sort of what the normal distribution would look like.

And in this case, that would be our mean. And the variance would sort of describe the width over here, which in this case, you'd use a small sigma because you're doing a single variable. In our case, we used a capital sigma, which is the symbol for multiple variables or multiple dimensions.

And yeah, I also didn't say that this is the Greek letter mu. So capital sigma, mu, and lowercase sigma. I just wanted to note that typically the lowercase sigma represents the standard deviation, which is the square root of the variance. So for example, sometimes you may see in papers sigma squared, and that's just the variance, but they will write it sometimes as sigma squared instead.

So it depends on the notation. So sigma is often the standard deviation, and sigma squared would then be the variance. Cool. Yeah, we can also show with our example what this would look like. So we start out with an MNIST digit, put it through this magic API, and what would we get out?

Okay, so something we didn't describe is what this I means. Did you want me to talk about that, Wasim? Yes, please. Okay, sure. Because I think this is something which actually-- can I borrow your pen? It actually came up in the lesson we were doing in an interesting way.

So in that lesson-- Do you want to get in the video? Ah, no, they know what I look like. Oh, well, okay, I'm in the video now. Yeah, in the video-- Hi, Tanishq, nice to see you. Yeah, so in the lesson, we did this thing for CLIP, I don't know if you remember, where we had the various pictures down here.

I'm so embarrassed, you're better at the graphics tablet than I am, and it's my graphics tablet. And we had the various sentences along here. And we said, oh, it would be kind of cool to take the dot product of their embeddings. Because if their dot products are high, that means they're similar to each other.

And if we subtracted the means from those first, then you've got the dot-- and instead of having images down here, what if we had the exact same vectors on each side? Then what you've got down here is basically x minus its average, squared.

And that is the variance. So that's the variance for each one of these vectors. But what's interesting, as you pointed out, is that normally, at high school, when we look at a normal distribution, it looks like this, right? But you're not just doing one normal distribution. You've got a whole bunch of normal distributions for all of your different pixels.

They're the pixels, right, Tanishq? Normal distribution of every pixel. So there's a whole bunch of them. And so one of them might have a normal distribution that's there. And another one might have a normal distribution that's here. And another one might have a normal distribution that's here. And it's more than that, though, because it's possible that one pixel tends to be higher when another pixel tends to be higher, or one pixel tends to be higher when another pixel's lower.

So it actually has kind of created this surface in n-dimensional space where n is the number of pixels. So if you now, like, look at, like, OK, well, what happens if we multiply this by this, just like we did in CLIP, right? Then if this number is high, then it's saying that when this variable is high, where this pixel's high, this pixel tends to be high, and vice versa.

Or if it's low, it's saying when this pixel tends to be high, this one tends to be low. Or, interesting to us, what happened-- oopsie, Daisy, sorry -- what happens if this is zero? That says that if this is high, then this could be anything. Or if this is high, this could be anything.

There's no relationship between them. So statistically, we would say that these two pixels are independent. And so now, that basically means we could do that for all of these. We could say, oh, these are all zeros. And what that says is that, oh, every pixel is independent of every other pixel.

Now, of course, in real pictures, that's not how real pixels work. But that's the assumption we're making. Because we start with a very special matrix called I, the identity matrix, which has 1s along the diagonal and 0s everywhere else. If we take this very special matrix-- it's very special because I can multiply it by something, say beta.

And if I multiply it by a matrix, I get back the original matrix. If I multiply it by a scalar, I'm going to get beta, beta, beta along the diagonal, and lots of zeros. And so if I multiply something by this matrix, then I'm just multiplying it by beta.

But what's interesting about this is that this is what Wasim wrote. Wasim wrote i times beta, i times beta t. So what he's saying is, oh, we've now got a covariance matrix where for each individual pixel, it's like pixel number 1, beta 1, pixel number 2, beta 2. This is the variances of each one.

And the covariances, the relationships between the pixels, are 0. They're expected to be independent. So that's where we're going from statistics you do in high school to statistics you do at university. It's like suddenly covariances are now matrices, not individual numbers. Does that sound about right to you, Tanishq?

Yeah, that's a great explanation of it, yes. Awesome. Cool. So now let's try to describe what this would do to MNIST digits. So let's put back our mean and our covariance. And let's look at how this behaves at the edges. So it's really hard to understand this.

I don't think anybody can just look at this and know what it means. What we typically do is we try to describe it at the edges. And so we'll start with what happens if that's 0. And we'll work with x0 as well instead of xt minus 1, which would mean an MNIST digit.

So if beta is 0 then we get our x0, you know, square root 1 minus 0, which is 1 and square root of 1 is 1. So that kind of falls away. So we just have a mean of our previous image. And this is just variance of 0. So we have a normal distribution with a mean of our previous image of variance of 0, which means we have the same image.

Yeah, just to clarify, when you have variance of 0, that means that there's really no noise or anything. It's just at that mean and, you know, your distribution is just saying that's the only point that you can get from it. So yeah, that just becomes the same image because yeah, there's no noise or variance because the variance is 0.

Yeah, exactly. And then when our beta is 1, we still have this and then we have, you know, square root 1 minus 1 and that becomes 0. So this whole thing becomes 0. And this thing becomes i times beta t, which is, you know, i. And if it's just i, then as Jeremy described, it would, you know, imply a variance of 1.

And so our image through this function would just be pure noise. So, you know, mean of 0, standard deviation of 1, and it would just be a bunch of noise. And somewhere in between, we'd have to say over here, you know, what would it produce?

It would be some mixture. So, you know, like maybe a light, the lighter pixels of 8 and some noise, maybe a bit darker. And we can kind of draw this and you would have seen this in the previous lecture. You can draw the sequence of things that become progressively more noisy in very small steps, all the way until it becomes pure noise.
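The step just described can be sketched in a few lines. Here an "image" is just a short list of floats, and the values and loop length are illustrative assumptions; the point is that sampling from N(sqrt(1 - beta) * x, beta * I) is a per-pixel scale-and-noise operation:

```python
import math
import random

def forward_step(x_prev, beta):
    # Sample x_t from N(sqrt(1 - beta) * x_prev, beta * I).
    # The identity covariance means each pixel gets independent noise.
    return [math.sqrt(1 - beta) * p + math.sqrt(beta) * random.gauss(0.0, 1.0)
            for p in x_prev]

x0 = [0.9, 0.1, 0.8, 0.0]          # a pretend 4-pixel "image"
print(forward_step(x0, 0.0))       # beta = 0: exactly x0 again

x = x0
for _ in range(1000):              # many small steps: x drifts to pure noise
    x = forward_step(x, 0.02)
```

With beta = 1 the sqrt(1 - beta) term vanishes and the output is pure unit-variance noise, independent of the input, matching the edge cases discussed above.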

This is what we call the forward diffusion process. And we can now describe some of these things. So this would be a sample from our data distribution Q of X0. This would be the conditional probability density function of X1 given X0, and so on. And the terminology that mathematicians use to describe this is a Markov process with Gaussian transitions.

And this can sound quite scary, but we've just described exactly what this is. So when we say process, it usually means, you know, something where there's a sequence involved. When we say Markov, it means that the thing at time t depends only on the thing at t minus 1.

The transition is this function. How do you actually go from t minus 1 to t? And Gaussian is the fact that that transition is the normal distribution. Does that sound right? Yes. Just to also clarify a couple of things, when we say that, you know, we're sampling from the data distribution, what that is referring to is trying to find some random sample or some random data point that maximizes that likelihood or that has a high likelihood.

So when we say that, you know, we're looking at that API, that magic API we were talking about, and we're trying to get some, you know, some data points that have a high value from that API. And, you know, for some distributions, it's very simple and we know how it works like a Gaussian distribution.

If we know the parameters of that Gaussian distribution, it's very easy to do that sampling. And then of course, in other cases, it's quite difficult to do that sampling. So then we have to figure out alternative ways of doing that sampling.

But that's why in this case, with the forward distribution, we just have these simple Gaussian transitions. And we already know the parameters of those Gaussian transitions, so we can easily do that sampling. And going back also to that, I think it's worthwhile to also kind of show and think about maybe how this is again done practically.

Because one of the nice properties of Gaussian distributions is that you can simply take some normal noise with a mean of zero and variance of one. I think they typically call that the unit (or standard) normal distribution: just normal of zero, one.

And then if you want to get to some other point with a mean of whatever value you specify and a variance of whatever value you specify, you can simply take that normal sample, scale it by the standard deviation (the square root of the variance), and then add your mean.

So there's a simple equation you can use to get any particular mean and variance. So that's how you would get the samples for these other distributions that we have defined throughout the forward process. So, you know, for example, when you're coding this up, most software will have a way of getting a sample from this normal distribution of zero, one.

And then you just use that equation to get the desired mean and variance. So that's how it happens under the hood when you describe this with code. That's really helpful. Yeah. And this idea that we can't really sample from this thing.
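That scale-and-shift trick, sketched below for the 1-D case (the specific mean, variance, and sample count are arbitrary choices for the demo; libraries do the same thing with image-shaped tensors):

```python
import random
import statistics

def sample_normal(mean, var, n):
    # Draw unit normals, multiply by the standard deviation
    # (the square root of the variance), then add the mean.
    std = var ** 0.5
    return [mean + std * random.gauss(0.0, 1.0) for _ in range(n)]

random.seed(42)
draws = sample_normal(mean=5.0, var=4.0, n=100_000)
print(round(statistics.mean(draws), 1))      # ~5.0
print(round(statistics.variance(draws), 1))  # ~4.0
```

The empirical mean and variance of the draws land on the requested parameters, which is all "sampling from N(mean, var)" means here.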

That's exactly, you know, the problem that generative modeling is trying to solve: how do you represent this in such a way that you can easily sample from it? And so it turns out that if you have one of these processes with many, many steps, say a thousand very small steps that eventually go to noise, somebody discovered, I think in the 1950s, that you can represent the process of going backwards in exactly the same functional form, with just different parameters.

So what that means is if we say P is the thing that goes backwards. So, you know, the previous one, given the current one, the P has the same functional form. So it's also, the transitions are also normal, but the mean is, you know, some unknown. So we'll use a square and the variance is some unknown.

So we use a triangle. Is that correct? Yeah, that's correct. And just going back to our previous point about P versus Q, here we can see that the Q was describing the sort of forward process going, you know, yeah, the sort of steps that we're doing. And then the P is describing what we're going in the reverse way.

So that's why, you know, these papers are using, you know, Q for one process and then P for another. That's what they're kind of indicating, at least in the diffusion model literature. And P is kind of like X, you know, it's the one we want to figure out. So like Q is kind of like Y and P is kind of like X.

That's how I like to think of that. And so, you know, we have this functional form, and the next question is how can we use this when we just don't know what these parameters are. How can we figure out what those are? And this goes back to early statistics literature, where you can fit this model by maximizing what's called the likelihood function.

So we can try different parameters until we have one that maximizes the likelihood. It turns out that we can't quite do this exactly because you would need to calculate some integral and that integral is over very high dimensional values, continuous values. So you can't actually calculate this. I think you can think of it because, you know, we're having these thousands of steps that we're trying to go in this reverse process.

And so, you know, you have these thousands of steps that there are going to be many possible values for each step. So it's kind of hard to evaluate it over all these thousands of steps and all the possible values for all these different steps. So I think that's kind of where the challenges arise and that's what it makes it difficult because you have to evaluate it over these multiple steps and try to find these functions for all these different steps.

So that's kind of where the challenge is. And so you might see people talk not about the likelihood function, but about the log-likelihood. And correct me if I'm wrong here, Tanishq, but I think the log here is a bit of a computational trick almost. So I think it has a few properties.

The first is that it's always increasing; people call this monotonic. And because it's always increasing, you get the same parameters whether you optimize the log-likelihood or the likelihood. It also takes products to sums. And that's helpful because we have joint distributions, you know, which turn out to be products.

So it turns out we have a lot of products here and they become sums, which is easy to work with. And the last thing is that, you know, this normal distribution has exponential functions and those disappear with the log. So this is the much friendlier thing to optimize. Cool.
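A quick numeric sanity check of the products-to-sums property (the probabilities here are arbitrary made-up numbers for the demo):

```python
import math

probs = [0.2, 0.5, 0.1, 0.3]  # arbitrary example probabilities

# A joint likelihood is a product of many small numbers, which
# underflows quickly; the log turns it into a well-behaved sum.
product = math.prod(probs)
log_sum = sum(math.log(p) for p in probs)

assert math.isclose(math.log(product), log_sum)
print(product)            # ~0.003
print(math.exp(log_sum))  # same value, recovered from the sum of logs
```

With thousands of factors the product would underflow to 0.0 in floating point, while the sum of logs stays perfectly representable, which is the practical reason the log shows up everywhere.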

And then there's one more step. You know, we still can't optimize the log-likelihood of the thing that this eventually describes. But again, and this is kind of the beauty of math, somebody figured out a long time ago that there's a way to optimize some other quantity called the ELBO for short, which stands for Evidence Lower Bound.

And the evidence is just another name for the likelihood. And the lower bound means it's, you know, the lower bound of the evidence. And if you optimize that, it's almost as good as optimizing the thing that we really want to. But this one we can calculate very, very easily.

And so you can use this as a loss function to train two neural networks that predict our square from earlier, which was our mean, and our triangle, which is our variance of this reverse process. And once you have that, you go all the way back here. So then you have these values.

You can start with pure noise and keep calling these neural networks sampling from those normal distributions, kind of applying that iteratively over many steps and you recover this data distribution. One thing that's important to clarify here is that you can recover the whole distribution, but you can't necessarily take a single image, get it to pure noise and then convert it back.

So this operates sort of at the distribution level. So you can take this kind of magic API, you can reconstruct that whole API. And if you can do that, then you can generate images, MNIST digits or cats or dogs or whatever you want to. I want to just clarify one thing about this process of the kind of the loss function.

So this evidence lower bound loss function, the approach that it's taking is that we have this forward process, right? We can go from the original images and figure out these intermediate distributions, going all the way finally to noise. With this evidence lower bound loss function, what we're really doing is trying to match the distribution that we're trying to optimize to those distributions that we saw in the forward process.

So that's what we're trying to do. We're trying to match those distributions, and there's a specific type of function that's able to do that, called the KL divergence. That's the sort of function that can compare probability distributions. And again, because we're dealing with Gaussians, you can calculate that analytically and a lot of the math becomes very simple.

So that's again, with the whole Gaussians, we know them quite well and the math is very simple. So that allows us to do this sort of comparison between these distributions very easily and optimize that. And so we want to kind of minimize the difference between the distributions we see in the forward process and the distributions we're trying to determine for the reverse process.
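For two 1-D Gaussians the KL divergence has a simple closed form, which gives a glimpse of why the Gaussian assumption keeps this comparison analytic. This sketch uses the standard textbook formula, not anything specific to the diffusion papers:

```python
import math

def kl_gaussians(mu1, var1, mu2, var2):
    # KL(N(mu1, var1) || N(mu2, var2)) for 1-D Gaussians, in closed form:
    # 0.5*log(var2/var1) + (var1 + (mu1 - mu2)^2) / (2*var2) - 0.5
    return (0.5 * math.log(var2 / var1)
            + (var1 + (mu1 - mu2) ** 2) / (2 * var2)
            - 0.5)

print(kl_gaussians(0.0, 1.0, 0.0, 1.0))  # 0.0: identical distributions
print(kl_gaussians(0.0, 1.0, 1.0, 1.0))  # 0.5: means one std apart
```

Because both the forward and reverse transitions are Gaussian, terms like this can be computed exactly, with no sampling or integration needed.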

Perfect. Then there's one more thing, I think, one more major step to get closer to the form that you would have seen in Jeremy's lesson. So there was a 2020 paper. The initials of that model are DDPM. Tanishq, do you know what this stands for? Yeah, that's denoising diffusion probabilistic model.

Okay, cool. And what they did was they said, let's assume that this variance is just a constant so we don't learn it. And we assume also that the step size from earlier, you know, the variance of the noise that we add at each step is also a constant. We don't learn that.

So we're just predicting the mean and these are set to some really convenient values. Then the last turns out to be that you predict the noise. So you can restructure this whole thing as you need to train a network that takes in images. So here's your network and it tells you what of this image is noise.

Thanks to these simplifying assumptions. And even though they're assumptions, it turns out you can train models that produce much better images. Now, I think this relates to something from the lesson that Jeremy gave. Tanishq, do you remember that there was something about the gradient or something like that?

Yes, yes. So this idea of, you know, adding noise and learning to remove noise, the idea is that kind of by, you know, again, you have this sort of this image that you have noise, right? And by, sorry, let me think about the best way to say this. Oh, yeah, sorry.

Okay, let me just turn it over. So I'll just start. Yeah, so like Jeremy will say in the lesson, what we want to do is we want to figure out the gradient of this likelihood function. So this is just kind of a different way about thinking about this. If we had some information about this gradient, then we could, for example, you know, use that information to produce like we talked about, kind of this optimization, kind of produce images with high likelihood.

So the idea is that we can add noise to the images that we have. So those are samples that we have. And that kind of takes us away from, you know, the regular images that we know that we have. And, you know, that kind of decreases the likelihood, right?

So we have those images and we're adding noise that decreases the likelihood. And we want to kind of learn how to get back to high likelihood images and kind of use that to provide some sort of estimate of our gradient. So this sort of denoising process actually allows us to do that.

So there are actually theorems also, I think, from the 1950s that demonstrate that especially in the case of this sort of Gaussian noise that we're working with, this denoising process is equivalent to learning what is known as the score function. And the score function is the gradient of the log of the likelihood.

So again, they have this log here, which, again, makes the math nicer and easier to work with. But the general idea is the same because as we talked about, log is a monotonic function. So again, the general ideas are the same, but the score function specifically refers to the gradient of the log likelihood.
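For a single 1-D Gaussian the score function is easy to verify by hand: the gradient of log N(x; mean, var) with respect to x is -(x - mean) / var. A sketch checking the analytic form against a numerical derivative (the values are illustrative):

```python
import math

def log_pdf(x, mean, var):
    # Log-density of a 1-D Gaussian.
    return -((x - mean) ** 2) / (2 * var) - 0.5 * math.log(2 * math.pi * var)

def score(x, mean, var):
    # Score function: gradient of the log-likelihood, -(x - mean) / var.
    # It points from x back toward the high-density region at the mean.
    return -(x - mean) / var

x, mean, var, h = 1.5, 0.0, 2.0, 1e-6
numeric = (log_pdf(x + h, mean, var) - log_pdf(x - h, mean, var)) / (2 * h)
print(score(x, mean, var))   # -0.75
print(round(numeric, 6))     # ~-0.75
```

Note the negative sign: points above the mean get pushed down, points below get pushed up, which is the "move toward higher likelihood" intuition from the discussion above.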

So this sort of denoising process allows us to learn the score function. So that's when we're doing this noise prediction: we had this whole probabilistic framework using that sort of likelihood framework, and it came back down to just predicting the noise. And that's what the DDPM paper showed in 2020.
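A sketch of what a DDPM-style training pair looks like in code. The 100-step linearly spaced beta schedule and the 4-pixel "image" are illustrative assumptions (the paper uses a similar linear schedule over 1,000 steps), a dummy all-zeros prediction stands in for the neural network, and the jump straight to step t uses the standard closed-form noising identity from the DDPM paper:

```python
import math
import random

T = 100  # number of diffusion steps (illustrative; DDPM uses 1000)
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

# alpha_bar[t] = product of (1 - beta) up to step t; it shrinks toward 0,
# so x_t carries less and less of the original image.
alpha_bar = []
prod = 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bar.append(prod)

def training_pair(x0, t):
    # Noise x0 straight to step t: x_t = sqrt(a)*x0 + sqrt(1-a)*eps.
    eps = [random.gauss(0.0, 1.0) for _ in x0]   # the target the net must predict
    a = alpha_bar[t]
    x_t = [math.sqrt(a) * p + math.sqrt(1.0 - a) * e for p, e in zip(x0, eps)]
    return x_t, eps

def mse(pred, target):
    # The simplified DDPM objective: mean squared error on the noise.
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

random.seed(0)
x0 = [0.9, 0.1, 0.8, 0.0]             # pretend 4-pixel image
x_t, eps = training_pair(x0, t=50)
pred = [0.0] * len(x0)                # stand-in for the network's noise prediction
print(mse(pred, eps))                 # the loss for this one sample
```

A real training loop would replace the stand-in prediction with a network's output for (x_t, t) and backpropagate through this MSE.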

But it turns out that is equivalent to calculating out this sort of score function and using that information to be able to sample from our distribution. So that's kind of how these two approaches connect. So there's a lot of literature talking about maybe the sort of probabilistic likelihood perspectives of diffusion models.

And there's also a lot of literature talking about this score-based perspective. But this hopefully allows you to think about the similarities and how these two approaches connect with each other. Yeah, awesome. Yeah. And that's kind of the beauty, I think, of the math side of things here is that you find all of these relationships between different fields and also like between different centuries, basically.

And that allows you to do really powerful and unexpected things. Okay, so let's do a quick recap of where we got to. So we started out with our data distribution, which we want to model. We said we'll define this forward diffusion process, which is a way of adding noise to the data.

And because we add it in this specific way, thanks to some discovery in the 1950s, the reverse process has the same form. And then we already know how to train a neural network for this using the ELBO. And then a couple of years later came the discovery of simplifying assumptions such that, in the end, all we do is predict the noise.

And I just remembered: we actually take the MSE of this noise prediction, the mean squared error, which is a nice, very simple framing of the model. And Tanishq spoke about another way to derive all of this, which is the score function approach, the gradient of the log-likelihood. Okay, cool.

Yeah, I highly recommend checking out the course lesson as well, if you haven't. You know, if you don't understand this, there's no need to be intimidated. You can still be very effective at deep learning without ever using the math, as fast.ai has shown us, and you can do novel research as well.

For me, this is, it's interesting. And, you know, it's even beautiful, in a way. So I recommend checking it out, but don't feel intimidated. You can find the course lesson links in the fast AI forum. We'll add those links as well in the description of this video. We'll also have a topic in the forum for this lesson.

You can have discussions there, post any comments, add any relevant links to the math. And then we have another lesson video by Jono, which I really recommend checking out. He's a great teacher, and I think he was the first person to do a full course on Stable Diffusion.

Yeah, Jono's video is kind of a deep dive into some of the code a little bit more and into some of the concepts a little bit more. So I feel like between these three videos, it's a good overview. You know, I think, I mean, just to clarify, you don't need to understand all the math that was described in this video.

That's not to say you won't need to understand math. We'll be covering lots of math in these lessons. But we'll be covering just the math you need to understand and build on the code. And we'll be covering it over many, many more hours than this rather rapid overview. Perfect.

Cool. And yeah, thank you so much, Tanishq. I had a lot of fun. And thank you so much, Wasim. That was awesome. Awesome. Cool. Bye-bye.