
Lesson 19: Deep Learning Foundations to Stable Diffusion


Chapters

0:00 Introduction and quick update from last lesson
2:08 Dropout
12:07 DDPM from scratch - Paper and math
40:17 DDPM - The code
41:16 U-Net Neural Network
43:41 Training process
56:07 Inheriting from miniai TrainCB
60:22 Using the trained model: denoising with “sample” method
69:09 Inference: generating some images
74:56 Notebook 17: Jeremy’s exploration of Tanishq’s notebook
84:09 Make it faster: Initialization
87:41 Make it faster: Mixed Precision
89:40 Change of plans: Mixed Precision goes to Lesson 20

Transcript

Okay, hi everybody, and this is Lesson 19 with extremely special guests Tanishq and Jono. Hi guys, how are you? Hello. Hey Jeremy, good to be here. And it's New Year's Eve 2022, finishing off 2022 with a bang, or at least a really cool lesson. And most of this lesson's going to be Tanishq and Jono, but I'm going to start with a quick update from the last lesson.

What I wanted to show you is that Christopher Thomas on the forum came up with a better winning result for our challenge, the Fashion MNIST Challenge, which we are tracking here. And be sure to check out this forum thread for the latest results.

And he found that he was able to get better results with Dropout. Then Peter on the forum noticed I had a bug in my code, and the bug in my code for ResNets, actually I won't show you, I'll just tell you, is that in the res block I was not passing along the batch norm parameter, and as a result, all the results I had were without batch norm.

So then when I fixed batch norm and added Dropout at Christopher's suggestion, I got better results still, and then Christopher came up with a better Dropout and got better results still for 50 epochs. So let me show you the 93.2-for-5-epochs improvement. I won't show the change to batch norm, because that'll just be in the repo now.

So the batch norm is already fixed. So I'm going to tell you about what Dropout is and then show that to you. So Dropout is a simple but powerful idea where, with some particular probability, so here that's a probability of 0.1, we randomly delete some activations.

And when I say delete, what I actually mean is we change them to zero. So one easy way to do this is to create a binomial distribution object where the probability is 1 minus p, and then sample from that. And that will give you zeros 0.1 of the time. So in this case, oh, this is perfect.

I have exactly one zero. Of course, randomly, that's not always going to be the case. But since I asked for 10 samples, and 0.1 of the time each should be 0, I so happened to get, yeah, exactly one of them. And so if we took a tensor like this and multiplied it by our activations, that would set about a tenth of them to 0, because multiplying by 0 gives you 0.

So here's a Dropout class. So you pass in what probability of dropout there is, and we store it away. Now we're only going to do this during training time. So at evaluation time, we're not going to randomly delete activations. But during training time, we will create our binomial distribution object.

We will pass in the 1 minus p probability. And then you say, how many binomial trials do you want to run, so how many coin tosses or dice rolls or whatever each time, and so it's just one. And this is a cool little trick. If you put that one onto your accelerator, GPU or MPS or whatever, it's actually going to create a binomial distribution that runs on the GPU.

That's a really cool trick that not many people know about. And so then if I sample and I make a sample exactly the same size as my input, then that's going to give me a bunch of ones and zeros and a tensor the same size as my activations. And then another cool trick is this is going to result in activations that are on average about one tenth smaller.

So if I multiply by 1 over 1 minus 0.1, so in this case I multiply by 1 over 0.9, then that's going to scale up to undo that difference. And Jeremy, in the line above where you have probs equals 1 minus p, should that be 1 minus self.p? Oh, it absolutely should.

Thank you very much, Jono. Not that it matters too much because, yeah, you can always just use nn.Dropout at this point with p=0.1, which is why I didn't even see that. So as you can see, I'm not even bothering to export this because I'm just showing how to repeat what's already available in PyTorch.

So yeah, thanks, Jono. That's a good fix. So if we're in evaluation mode, it's just going to return the original. If p equals 0, then these are all going to be just ones anyway, so we'll be multiplying by 1 divided by 1, and there's nothing to change. So with p of 0, it does nothing, in effect.
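
In case it's helpful to see it all in one place, here's a minimal sketch of the class being described, assuming PyTorch's distributions API; the notebook's version may differ in small details:

```python
import torch
from torch import nn

class Dropout(nn.Module):
    def __init__(self, p=0.1):
        super().__init__()
        self.p = p

    def forward(self, x):
        if not self.training: return x  # no dropout at evaluation time
        # putting total_count on x's device makes the sampling run on the GPU/MPS
        dist = torch.distributions.Binomial(total_count=torch.tensor(1.0, device=x.device),
                                            probs=1 - self.p)
        mask = dist.sample(x.size())  # ones (keep) and zeros (drop), same shape as x
        return x * mask * (1 / (1 - self.p))  # rescale so the expected activation is unchanged
```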

Yeah, and otherwise it's going to kind of zero out some of our activations. So a pretty common place to add dropout is before your last linear layer. So that's what I've done here. So yeah, if I run the exact same epochs, I get 93.2, which is a very slight improvement.

And so the reason for that is that it's not going to be able to memorize the data, or the activations, because there's a little bit of randomness. So it's going to force it to try to identify just the actual underlying differences. There's a lot of different ways of thinking about this.

You can almost think of it as a bagging thing, a bit like a random forest. Each time, it's giving a slightly different kind of random subset. Yeah, but that's what it does. I also added a Dropout2d layer right at the start, which is not particularly common. I was just kind of showing it.

This is also what Christopher Thomas tried, although he didn't use Dropout2d. What's the difference between Dropout2d and Dropout? So this is actually something I'd like you to implement yourself as an exercise: implement Dropout2d. The difference is that with Dropout2d, rather than using x.size as the size of our tensor of ones and zeros,

in other words, potentially dropping out every single batch, every single channel, every single x, y position independently, instead we want to drop out an entire grid area at a time, all of the channels together. So if any of them are zero, then they're all zero. So you can look up the docs for Dropout2d for more details about exactly what that looks like.

But yeah, so the exercise is to try to implement that from scratch and come up with a way to test it. So, like, actually check that it's working correctly, because it's a very easy thing to think it's working and then realize it's not. So then, yeah, Christopher Thomas actually found that if you remove this entirely and only keep this, then you end up with better results for 50 epochs.
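
If you want something to check your attempt against afterwards, here's one possible sketch; it follows the semantics of PyTorch's nn.Dropout2d (one keep/drop decision per sample and channel, zeroing that channel's whole grid), which is an assumption about what the exercise is after, so try your own version first:

```python
class Dropout2d(nn.Module):
    def __init__(self, p=0.1):
        super().__init__()
        self.p = p

    def forward(self, x):
        if not self.training: return x
        n, c, h, w = x.shape
        dist = torch.distributions.Binomial(total_count=torch.tensor(1.0, device=x.device),
                                            probs=1 - self.p)
        # one keep/drop decision per (sample, channel), broadcast over the whole h-by-w grid
        mask = dist.sample((n, c, 1, 1))
        return x * mask * (1 / (1 - self.p))
```

One way to test it: run a large batch through and check that every zeroed activation sits inside a channel that is zero everywhere, and that roughly p of the channels are zeroed.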

Coming back to Christopher Thomas's 50-epoch result: he's the first to break 95%. So I feel like we should insert some kind of animation or trumpet sounds or something at this point. I'm not sure if I'm clever enough to do that in the video editor, but I'll see how I go. Hooray. Okay. So that's about it for me.

Did you guys have any other things to add about dropout, how to understand it, or what it does, or interesting things? Oh, I did have one more thing, but you go ahead if you've got anything to mention. And I was going to ask, because I think the standard is to remove the dropout before you do inference.

But I was wondering if there's anyone you know of, or if it works, to use it for some sort of test time augmentation. Oh, dude. Thank you. Because I wrote a callback for that. Did you see this, or were you just like, okay, test time dropout callback? Nice. So yeah, in before_epoch, if you remember, in Learner we put the model into training mode, which actually, what it does is it puts every individual layer into training mode.

So that's why, for the module itself, we can check whether that module is in training mode. So what we can actually do is, after that's happened, we can then go back in this callback and apply a lambda that says: if this is a dropout, then put it in training mode all the time, including during evaluation.
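
A sketch of what that can look like, assuming the miniai Callback API from the earlier lessons (the actual callback in the notebook may differ in detail):

```python
from torch import nn

class TTD_CB(Callback):  # Callback base class as defined in the miniai learner notebook
    def before_epoch(self, learn):
        # after the learner has set the train/eval modes, force dropout layers back
        # into training mode so they stay stochastic during evaluation too
        learn.model.apply(lambda m: m.train() if isinstance(m, nn.Dropout) else None)
```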

And so then you can run it multiple times, just like we did for TTA, with this callback. Now, that's very unlikely to give you a better result, because it's not showing it different versions of the image, like TTA does, that are kind of meant to be the same.

But what it does do is it gives you a sense of how confident the model is. If it has no idea, then that little bit of dropout is quite often going to lead to different predictions. So this is a way of doing some kind of confidence measure.

You'd have to calibrate it by kind of looking at things that it should be confident about and not confident about and seeing how that test time dropout changes. But the basic idea, it's been used in medical models before. I wouldn't say it's totally popular, which is why I didn't even bother to show it being used, but I just want to add it here because I think it's an interesting idea and maybe could be more used than it is, or at least more studied than it has been.

A lot of stuff that gets used in the medical world is less well-known in the rest of the world. So maybe that's part of the problem. Cool. All right. So I will stop my sharing and we're going to switch to Tanishq, who's going to do something much more exciting, which is to show that we are now at a point where we can do DDPM from scratch, or at least everything except the model.

And so to remind you, DDPM doesn't have the latent VAE thing and we're not going to do conditional, so we're not going to get to tell it what to draw. And the UNET model itself is the one bit we're not going to do today. We're going to do that next lesson, but other than the UNET, it's going to be unconditional DDPM from scratch.

So Tanishq, take it away. Okay. Hi. Welcome back. Sorry for the slight continuity problem. You may notice people look a little bit different. That's because we had some Zoom issues. So a couple of days have passed and we're back again. And then Jono recorded his bit before we did Tanishq's bit.

And then we're going to post them in backwards. So hopefully there are not too many confusing continuity problems as a result, and it all goes smoothly. But it's time to turn it over to Tanishq to talk about DDPM. So we've reached the point where we have this miniai framework, and I guess it's time to now start using it to build more sophisticated models.

And as we'll see here, we can start putting together a diffusion model from scratch using the miniai library, and we'll see how it makes our life a lot easier. And also, it'd be very nice to see how, you know, the equations in the papers correspond to the code.

So I have here, of course, the notebook that we'll be working from, and the paper: Denoising Diffusion Probabilistic Models, which was published in 2020. It was one of the original diffusion model papers that kind of set off the entire trend of diffusion models, and it is a good starting point as we delve into this topic further.

And also, I have some diagrams and drawings that I will also show later on. But yeah, basically, let's just get started with the code here, and of course, the paper. So just to provide some context with this paper, you know, this paper that was published from this group in UC Berkeley, I think a few of them have gone on now to work at Google.

And this is a big lab at UC Berkeley. And so diffusion models were actually originally introduced in 2015. But this paper in 2020 greatly simplified the diffusion models and made it a lot easier to work with and, you know, got these amazing results, as you can see here, when they trained on faces and in this case, CIFAR-10, and, you know, this really was very kind of a big leap in terms of the progress of diffusion models.

And so just to kind of briefly provide, I guess, kind of an overview. If I could just quickly mention something, which is, you know, when we started this course, we talked a bit about how perhaps the diffusion part of diffusion models is not actually all that important. Everybody's been talking about diffusion models, particularly because that's the open source thing we have that works really well.

But this week, actually a model that appears to be quite a lot better than stable diffusion was released that doesn't use diffusion at all. Having said that, the basic ideas, like most of the stuff that Tanishk talks about today, will still appear in some kind of form, you know, but a lot of the details will be different.

But strictly speaking, actually, I don't even know if I've got a word anymore for the kind of modern generative model things we're doing. So in some ways, when we're talking about diffusion models, you should maybe replace it in your head with some other word which is more general and includes this paper that Tanishq is looking at here.

Iterative refinement, perhaps. That's what I... Yeah, that's not bad. Iterative refinement. I'm sure by the time people watch this video, probably, you know, somebody will have decided on something. We will keep our course website up to date. Yeah, yeah, this is the paper that Jeremy was talking about.

And yeah, every week there seems to be another state of the art model. But yeah, like Jeremy said, a lot of the principles are the same, but the details can be different for each paper. And just to, again, like Jeremy was saying, zoom back a little bit: I want to provide a review of what we're trying to do here, right?

So with this task, we're trying to, in this case, do image generation; of course, it could be other forms of generation, like text generation or whatever. And the general idea is that we have some data points, you know, in this case, we have some images of dogs, and we want to produce more like these data points that we're given.

So in this case, maybe dog image generation or something like this. And the overall idea that a lot of these approaches take for some sort of generative modeling task is they try to model p(x), which is basically the likelihood of a data point x.

So let's say x is some image; then p(x) tells us, like, what is the probability that you would see that image in real life. And we can take a simpler example, which may be easier to think about: a one dimensional data point, like height, for example.

And if we were to look at height, of course, we know we have a data distribution that's kind of a bell curve. And, you know, you have maybe some mean height, which is something like 5'9" or 5'10", or whatever.

And of course, you have some more unlikely points that are still possible, like, for example, seven feet, or something that's maybe even less likely, like, you know, three feet or something like this.

So here my x axis is height, and the y axis is the probability of some random person you meet being that tall. Exactly. So, you know, yeah, this is basically the probability. And so of course, you have this sort of peak, which is where you have higher probability.

And so those are the sorts of values that you would see more often. So this is what we call our p(x). And the important part about p(x) is that you can use it to sample new values, if you know what p(x) is, or even if you just have some sort of information about p(x).

So for example, let's say you have some game, and you have some human characters in the game, and you just want to randomly generate a height for a character. You wouldn't want to select a random height between three and seven feet uniformly distributed; you would instead want the height to depend on this sort of function, where you would more likely sample values in the middle, and less likely sample the sort of extreme points.

So it's dependent on this function, p(x). So having some information about p(x) will allow you to sample more data points. And that's kind of the overall goal of generative modeling: to get some information about p(x) that then allows us to sample new points and, you know, create new generations.
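
To make the height example concrete, here's a tiny sketch (the mean and spread are just illustrative numbers, not real statistics):

```python
import torch

# uniformly random heights between 3 and 7 feet ignore p(x):
# extreme heights come up just as often as average ones
uniform_heights = 36 + (84 - 36) * torch.rand(10_000)              # in inches

# sampling from a bell curve instead respects p(x): most samples land
# near the mean, and extreme heights are rare
normal_heights = torch.normal(mean=69.0, std=3.0, size=(10_000,))  # in inches
```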

So that's kind of a high level description of what we're trying to do when we're doing generative modeling. And of course, there are many different approaches. We have our famous GANs, which used to be the common method back in the day before diffusion models; we have VAEs, which I think we'll probably talk a little bit more about later as well.

We'll be talking about both of those techniques later here. Yeah. And so there are many different techniques, including some niche techniques that are out there as well. But of course, now the popular ones are these diffusion models, or, you know, as we talked about, maybe a better term might be iterative refinement, or whatever the term ends up being.

But yeah, so there are many different techniques. So this is kind of the general diagram that shows what diffusion models are. And if we look at the paper here, let's pull up the paper. Yeah, you see here, this is what they call a directed graphical model.

It's a very complicated term. It's just kind of showing what's going on in this, you know, in this process. And, you know, there's a lot of complicated math here, but we'll highlight some of the key variables and equations here. So basically the idea is that, okay, so let's see here.

So this is an image that we want to generate, right? And so x0 is basically, you know, these are actually the samples that we want. So x0 is what we want to generate. And, you know, these are images.

And we start out with pure noise. That's x uppercase T: pure noise. And the whole idea is that we have two processes. We have this process where we're going from pure noise to our image, and we have this process where we're going from image to pure noise.

So the process where we're going from our image to pure noise, this is called the forward process. Forward, sorry, my handwriting, as opposed to my typing, is not so good. So hopefully it's clear enough. Let me know if it's not. So we have the forward process, which is mostly just used for our training.

Then we also have our reverse process. So this is the reverse process, right up here. Reverse process. So this is a bit of a summary, I guess, of what you and Waseem talked about in lesson 9b, and it's mostly to highlight what the different variables are, so that we recognize the different variables when we look at the code.

Okay. So we'll be focusing today on the code, but the code will be referring to things by name, and those names won't make much sense unless we see what they're used for in the math. Okay. And I won't dive too much into the math. I just want to focus on the sorts of variables and equations that we'll see in the code.

So basically the general idea is that, you know, we do this in multiple steps. We have here from time step zero all the way to time step uppercase T. So there's some fixed number of steps, but then we have this intermediate process where we're going from some particular time step to the next.

Yeah, we have this time step lowercase t, which is some noisy image, and we're transitioning between these two different noisy images. So we have what is sometimes called the transition kernel. Whatever it's called, it basically just tells us how we go from one to the other; in this case, we're going from a less noisy image to a more noisy image.

And then going backwards is going from a more noisy image to a less noisy image. So let's look at the equations. So the forward direction is trivially easy: to make something more noisy, you just add a bit more noise to it. And the reverse direction is incredibly difficult; in particular, to go from the far left to the far right is strictly speaking impossible, because none of that person's face exists anymore. But somewhere in between, you could certainly go from something that's partially noisy to less noisy with a learned model.

But somewhere in between, you could certainly go from something that's partially noisy to less noisy by a learned model. Exactly. And that's like what I'm going to write down right now in terms of, you know, in terms of, I guess, the symbols in the map. So yeah, basically, I'm just trying to pull out the, just to write down the equations here.

So we have, let me zoom in a bit. Yeah, so we have our q(x_t | x_{t-1}). Or actually, you know what, maybe it's just better if I snip it from the paper. So the forward process is this equation here: q(x_t | x_{t-1}) = N(x_t; sqrt(1 - β_t) x_{t-1}, β_t I).

So I'll just make that a little smaller for you guys. So right there. And basically, to explain: we have this sort of script notation here, which is maybe a little bit confusing, but basically it's referring to a normal distribution, or Gaussian distribution.

It's saying, okay, N is our normal or Gaussian distribution, and it's describing this variable x_t. And then the next argument here is the mean, and the last one is the variance.

And this is the variance. So just to again, clarify, I think we've talked about this before as well. But like, you know, this is a, you know, this is, of course, a bad drawing of a Gaussian. But, you know, our mean is just, our mean is just, you know, this, you know, the middle point here is the mean, and the variance just kind of describes the sort of spread of, of the Gaussian distribution.

So if you think about this a little further, you have this beta, β_t, which is one of the important variables that describe the diffusion process. So you'll see the beta t in the code. And basically, beta t increases as t increases. So your beta t will be greater than your beta t minus one.

So if you think about that a little bit more carefully, you can see that, okay, at t minus one, at this time point here, and then you're going to the next time point, you're going to increase your beta t, so increasing the variance. But then you have this one minus beta t, and you take the square root of that and multiply it by x t minus one.

So as your t is increasing, this term actually decreases. So your mean is actually decreasing, and you're getting less of the original image, because the original image is part of x t minus one. And just to let you know, we can't see your pointer. So if you want to point at things, you would need to highlight them or something.

So if you want to point at things, you would need to highlight them or something. Yeah, so yeah, I'll just, let's see. Yeah. Or yeah, basically, I mean, I don't particularly play anything in specific, I was just saying that, yeah, basically, if we have our, our, our x of t here, as, as the time step increases, you know, you're getting less contribution from your x of x t minus one.

And so that means your mean is going towards zero. So you end up with a mean of zero, and the variance keeps increasing, and basically you just have a Gaussian distribution: you lose any contribution from the original image as your time step increases. So that's why, when we start out from x0 and go all the way to our x uppercase T here, this becomes pure noise. It's because we're doing this iterative process where we keep adding noise, we lose the contribution from the original image, and that leads to the image being pure noise at the end of the process.

So just something I find useful here is to consider one extreme, which is to consider x1. So at x1, the mean is going to be the square root of one minus beta 1, times x naught. And the reason that's interesting is x naught is the original image. So we're taking the original image.

And at this point, one minus beta 1 will be pretty close to one. So at x1, we're going to have something whose mean is very close to the image, and the variance will be very small. And so that's why we will have an image that just has a tiny bit of noise.

Right, right. And then another thing that is sometimes easier to work with: you can write out q of x t given x0 directly. Because each of these steps depends only on the previous one: q of x t depends only on x t minus one, and x t minus one depends only on x t minus two. And using the laws of probability, you can get your q of x t given x0 in closed form.

And you can you can use this independent, each of these steps are independent. So based on the different laws of probability, you can get your q above x t in close form. So yeah, that's what's shown here. q of x t did it the original image. So this is also another way of kind of seeing this more clearly where you can see you can see that.

Anyway, going back here. Yeah, so this is another way to see it more directly. So this is, of course, our clean image, and this is our noisy image. And you can also see, again, alpha bar t is dependent on beta t; basically, it's the cumulative product of one minus beta.

We'll see the code for it in a moment, so it might become clearer there. But basically, the idea is that alpha bar t is going to be less than alpha bar t minus one.

So basically, this keeps decreasing, right? This decreases as the time step increases. And on the other hand, this is going to be increasing as the time step increases. So again, you can see the contribution from the original image decreases as the time step increases, while the noise, as shown by the variance, is increasing as the time step increases.

Anyway, so that hopefully clarifies the forward process, and then the reverse process is basically a neural network, as Jeremy had mentioned. And yeah, here's a screenshot: this is our reverse process, p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), Σ_θ(x_t, t)). And basically, the idea is, well, this is a neural network.

And this is also a neural network, and we learn them during the training of the model. But the nice thing about this particular diffusion model paper, which made it so simple, was that they completely ignored this one and actually set it to a constant based on, you know... We can't see what you're pointing at.

So I think it's important to mention what this is here, this term here. So this one, we just kind of ignore: it's just a constant dependent on beta t. So you only have one neural network that you need to train, which is basically this mean. And the nice thing about this diffusion model paper is that it also reparameterizes this mean into an easier form, through a lot of complicated math which we will not get into here.

But basically, you get this kind of simplified training objective, which you can see here. In the simplified training objective, you instead have this epsilon theta function. And let me just screenshot that again. This is the loss function that we train with: || ε - ε_θ(sqrt(ᾱ_t) x0 + sqrt(1 - ᾱ_t) ε, t) ||². You can see it's a very simple loss function, right?

This is just, you can just write this down, this is just an MSE loss. And we have this epsilon theta function here. That is our... I mean, to folks like me who are less mathy, it might not be obvious that it's a simple thing, because it looks quite complicated to me, but once we see it in code, it'll be simple.

Yes, yes. You'll see in code how simple it is. But this is just an MSE loss. We've seen MSE loss before, and you'll see how this is basically MSE. So just to take a step back again: what is this epsilon theta?

Because this is a new thing that seems a little bit confusing. Basically, epsilon... You can see here... Yeah, absolutely. This here is actually equivalent to this equation here. These two are equivalent. It's just another way of writing x of t.

So this is giving x of t, just in a different way. But epsilon is actually sampled from this normal distribution with a mean of zero and a variance of one, and then you have all these scaling terms that change the mean to be the same as this equation that we have over here.

So this is our X of t. And so what epsilon is, it's actually the noise that we're adding to our image to make it into a noisy image. And what this neural network is doing is trying to predict that noise. So what this is actually doing is this is actually a noise predictor.

And it is predicting the noise in the image. And why is that important? Basically, the general idea is if we were to think about our distribution of data, let's just think about it in a 2D space. Just here, each data point here represents an image. And they're in this blob area, which represents a distribution.

So this is in distribution, and this is out of distribution. And basically, the idea is that, OK, if we want to generate some random image, and we were to take a random data point, it would most likely be a noisy image.

So if we take some random data point, it's going to be just noise. But we want to keep adjusting this data point to make it look more like an image from our distribution. That's the whole idea of the iterative process that we're doing in our diffusion model.

So the way to get that information is actually to take images from your data set and add noise to them. So that's what we do in this process. We have an image here, and we add noise to it. And then what we do is we train a neural network to predict the noise.

And by predicting the noise and subtracting it out, we're going back to the distribution. So adding the noise takes you away from the distribution, and then predicting the noise brings you back to distribution. So then if we know at any given point in this space how much noise to remove, that tells you how to keep going towards the data distribution and get a point that lies within the distribution.

So that's why we have noise prediction, and that's the importance of doing this noise prediction is to be able to then do this iterative process where we can start out at a random point, which would be, for example, pure noise and keep predicting and removing that noise and walking towards the data distribution.

Okay. So yeah, let's get started with the code. And so here, we of course have our imports, and we're going to load our dataset. We're going to work with our Fashion MNIST dataset, which is what we've been working with for a while already. And yeah, this is basically similar code to what we've seen before in terms of loading the dataset.

And then we have our model. It's going to remove the noise from the image: what our model is going to take in is the noisy image, and it will predict the noise. So the shapes of the input and the output are the same.

They're going to be in the shape of an image. So what we use is a U-Net neural network, which takes in an input image. And we do see your pointer now, by the way, so feel free to point at things. Yeah. So yeah, it takes an input image.

And in this case, a U-Net serves our purpose, but they can also be used for any sort of image-to-image task, where we're going from an input image and outputting some other image of some sort. And it's a new architecture, which we haven't learned about yet, and which we will be learning about in the next lesson.

But broadly speaking, those gray arrows going from left to right are very much like ResNet skip connections, but they're being used in a different way. Everything else is stuff that we've seen before. So basically, we can pretend that those don't exist for now. It's a neural network that the output is the same size or a similar size to the input, and therefore you can use it to learn how to go from one image to a different image.

Yeah. So that's what the U-Net is. And yeah, like Jeremy said, we'll talk about it more. The sort of U-Nets that are used for diffusion models also tend to have some additional tricks, which, again, we'll talk about later on as well. But yeah, for the time being, we will just import a U-Net from the Diffusers library, which is the Hugging Face library for diffusion models.

So they have a U-Net implementation, and we'll just be using that for now. And so, yeah, strictly speaking, we're cheating at this point, because we're using something we haven't written from scratch, but we're only cheating temporarily, because we will be writing it from scratch. And then, of course, we're working with Fashion MNIST images, which are one-channel images, so we just have to specify that.
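
Concretely, the import and model construction look something like this (a sketch: the block channel sizes here are illustrative assumptions, not necessarily the notebook's exact values):

```python
from diffusers import UNet2DModel

# one-channel Fashion MNIST in, one-channel noise prediction out
model = UNet2DModel(in_channels=1, out_channels=1, block_out_channels=(32, 64, 128, 128))
```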

So we just have to specify that. And then of course, the channels of the different blocks within the unit are also specified. And then let's go into the training process. So basically, the general idea of course, is we want to train with this MSE loss. What we do is we select a random time step, and then we add noise to our image based on that time step.

So of course, if we have a very high time step, we're adding a lot of noise. If we have a lower time step, then we're adding very little noise. So we're going to randomly choose a time step. And then yeah, we add the noise accordingly to the image. And then we pass the noisy image to a model as well as the time step.

And we are trying to predict the amount of noise that was in the image, and we penalize it with the MSE loss. So we can see all the... I have some pictures of some of these variables I could share, if that would be useful. So I have a version. So I think Tanishq is sharing notebook number 15.

Is that right? And I've got here notebook number 17. And so I took Tanishq's notebook, and just as I was starting to understand it, I like to draw pictures for myself to understand what's going on. So I took the things which are in Tanishq's class and just put them into a cell.

So I just copied and pasted them, although I replaced the Greek letters with English written-out versions. And then I just plotted them to see what they look like. So in Tanishq's class, he has this thing called beta, which is just a linspace. So that's just literally a line.

So beta, there's going to be 1,000 of them, and they're just going to be equally spaced from 0.0001 to 0.02. And then there's something called sigma, which is the square root of that. So that's what sigma is going to look like. And then he's also got alpha bar, which is the cumulative product of 1 minus this.

And there's what alpha bar looks like. So you can see here, as Tanishq was describing earlier, that when t is higher, and this is t on the x axis, beta is higher; and when t is higher, alpha bar is lower. So yeah, if you want to remind yourself: each of these things, beta, sigma, alpha bar, they've each got 1,000 things in them.

And this is the shape of those 1,000 things. So this is the amount of variance, I guess, added at each step. This is the square root of that. So it's the standard deviation added at each step. And then if we do 1 minus that, it's just the exact opposite.

And then this is what happens if you multiply them all together up to that point. And the reason you do that is because if you add noise to something, and then add noise to that, and then add noise again, you have to multiply together all those amounts of noise to say how much noise you would end up with.
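
Spelled out in code, the three quantities being plotted are just a few lines (a sketch following the description above):

```python
import torch
import matplotlib.pyplot as plt

n_steps = 1000
beta = torch.linspace(0.0001, 0.02, n_steps)   # variance added at each step
sigma = beta.sqrt()                            # standard deviation added at each step
alphabar = (1 - beta).cumprod(dim=0)           # product of (1 - beta) up to each step

for name, v in [("beta", beta), ("sigma", sigma), ("alphabar", alphabar)]:
    plt.plot(v, label=name)
plt.legend()
plt.show()
```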

So yeah, those are my pictures, if that's helpful. Yep, it's good to see the diagrams and see the actual values and how they change over time. So yeah, let's see here, sorry. Yeah, so like Jeremy was showing, we have our linspace for our beta. In this case, we're using more of the Greek letters.

So you can see the Greek letters that we see in the paper, and now we have them here in the code as well. And we have our linspace from our minimum value to our maximum value, and we have some number of steps. So this is the number of time steps.

So here, we use a thousand time steps, but that can depend on the type of model that you're training, and it's one of the hyperparameters of your model. And this is the callback you've got here. So this callback is going to be used to set up the data, I guess: you're going to be using this to add the noise, so that the model then gets the data that we're trying to get it to learn to denoise.

Yeah, so the callback, of course, makes life a lot easier in terms of setting everything up and still being able to use the miniai learner with these more complicated and maybe a little bit more unique training loops. So yeah, in this case, we're just able to use the callback in order to set up the batch that we are passing into our learner.
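
Here's a sketch of what the callback's setup can look like; the import path and the TrainCB base are assumed from the earlier miniai lessons, and, as discussed just below, the notebook itself stores these under Greek-letter names:

```python
import torch
from miniai.learner import TrainCB  # import path assumed from earlier miniai notebooks

class DDPMCB(TrainCB):
    def __init__(self, n_steps=1000, beta_min=0.0001, beta_max=0.02):
        super().__init__()
        self.n_steps = n_steps
        self.beta = torch.linspace(beta_min, beta_max, n_steps)  # β: variance added at each step
        self.alpha = 1.0 - self.beta                             # α
        self.alphabar = self.alpha.cumprod(dim=0)                # ᾱ: cumulative product of α
        self.sigma = self.beta.sqrt()                            # σ
```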

I just wanted to mention, when you first did this, you wrote out the Greek letters in English, alpha and beta and so forth. And at least for my brain, I was finding it difficult to read because they were literally going off the edge of the page and I couldn't see it all at once.

And so we did a search and replace to change them to the actual Greek letters. I still don't know how I feel about it. I'm finding it easier to read, because I can see it all at once without scrolling, so I don't get as overwhelmed.

But when I need to edit the code, I kind of just tend to copy and paste the Greek letters, which is why we use the actual word beta in the init parameter list, so that somebody using this never has to type a Greek letter. But I don't know, Jono or Tanishq, if you had any thoughts over the last week or two since we made that change, about whether you guys like having the Greek letters in there or not.

I like it for this demo in particular. I don't know that I'd do it in my own code, but because we're looking back and forth between the paper and the implementation here, I think it works in this case just fine. Yeah, I agree. I think it's good for, like, when you're trying to study something or trying to implement something: having the Greek letters is very useful to be able to match the math more closely, and it's just easy to do.

You take the equation and put it into code, or, you know, vice versa: looking at the code and trying to match it to the equation. So I think for educational purposes, I tend to like the Greek letters. So yeah, so, you know, we have our initialization, where we're just defining all these variables.

We'll get to the predict in just a moment. But first, I just want to go over the before_batch, where we're setting up our batch to pass into the model. So remember that the model is taking in our noisy image and the time step, and of course the target is the actual amount of noise that we are adding to the image.

So basically, we generate that noise. So that's what... epsilon is that target. So epsilon is not the amount of noise; it is the actual noise. Yes, epsilon is the actual noise that we're adding, and that's the target as well, because our model is a noise-predicting model. It's predicting the noise in the image.

And so our target should be the noise that we're adding to the image during training. So we have our epsilon, and we're generating it with this random function: the random normal distribution with a mean of zero and variance of one. So that's what that's doing, with the appropriate shape and device.

Then the batch that we get originally will contain the clean images. These are the original images from our dataset. So that's x0. And then what we want to do is we want to add noise. So we have our alpha bar and we have a random time step that we select.

And then we just simply follow that equation, which again, I'll just show in just a moment. That equation, you can make a tiny bit easier to read, I think. If you were to double-click on that first alpha bar underscore t, cut it and then paste it, sorry, in the xt equals torch dot square root, take the thing inside the square root, double-click it and paste it over the top of the word torch.

That would be a little bit easier to read, I think. And then you'll do the same for the next one. There we go. Yeah, so basically, yeah, so yeah, I guess let's just pull up the equation. So let's see. There's a section in the paper that has the nice algorithm.

Let's see if I can find it. No, here. I think earlier. Yes, string. Right, so we're just following the same sort of training steps here, right? We select a clean image that we take from our data set. This fancy kind of equation here is just saying take an image from your data set, take a random time step between this range.

Then this is our epsilon that we're getting, just to say get some epsilon value. And then we have our equation for x of t, right? This is the equation here. You can see that square root of alpha bar t x0 plus square root of 1 minus alpha bar t times epsilon.

So that's the same equation that we have right here, right? And then what we need to do is we need to pass this into our model. So we have x t and t. So we set up our batch accordingly. So this is the two things that we pass into our model.

And of course, we also have our target, which is our epsilon. And so that's what this is showing here. We pass in our x of t as well as our t here, right? And we pass that into a model. The model is represented here as epsilon theta. And theta is often used to represent like this is a neural network with some parameters and the parameters are represented by theta.

So epsilon theta is just representing our noise predicting model. So this is our neural network. So we have passed in our x of t and our t into a neural network. And we are comparing it to our target here, which is the actual epsilon. And so that's what we're doing here.

We have our batch where we have our x of t and t and epsilon. And then here we have our prediction function. And because we actually have, I guess, in this case, we have two things that are in a tuple that we need to pass into our model. So we just kind of get those elements from our tuple with this.

Yeah, we get the elements from the tuple, pass them into the model, and then Hugging Face has its own API in terms of getting the output: you need to call .sample in order to get the predictions from your model. So we just do that. And then, yeah, we have learn.preds, and that's what's going to be used later on when we're doing our loss function calculation.

And that's what's going to be used later than when we're trying to do our loss function calculation. So, I mean, it's just worth looking at that a little bit more since we haven't quite seen something like this before. And it's something which I'm not aware of any other framework that would let you do this.

Literally replace how prediction works. And miniai is kind of really fun for this. So because you've inherited from TrainCB, and TrainCB has predict already defined, and you've defined a new version, it's not going to use the TrainCB version anymore. It's going to use your version.

And what you're doing is, instead of passing learn.batch[0] to the model, you've got a star in front of it. And we know learn.batch[0] has two things in it, because the learn.batch that you set at the end of the before_batch method has two things in its first element.

So that star will unpack them and send each one of those as a separate argument. So our model needs to take two things, which the diffusers U-Net does. So that's the main interesting point. And then something I find a bit awkward, honestly, about a lot of Hugging Face stuff, including diffusers, is that generally their models don't just return the result; they put it inside some name.

And so that's what happens here: they put it inside something called sample. So that's why Tanishq added .sample at the end of the predict, because of this somewhat awkward thing which Hugging Face like to do for some reason. But yeah, now that you know. I mean, this is something that people often get stuck on.

I see it on Kaggle and stuff like that. It's like, how on earth do I use these models? Because they take things in weird forms and they give back things in weird forms. Well, this is how: if you inherit from TrainCB, you can change predict to do whatever you want, which I think is quite sweet.
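
The override itself is tiny; a sketch of what's being described:

```python
    def predict(self, learn):
        # learn.batch[0] is the (x_t, t) tuple; the * unpacks it into two model arguments.
        # Diffusers models return an output object, so .sample pulls out the actual tensor.
        learn.preds = learn.model(*learn.batch[0]).sample
```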

Yep. So yeah, that's the training loop. And then of course you have your regular training loop that's implemented in miniai, where you have your loss function calculation with the predictions, learn.preds, and of course the target is our learn.batch[1], which is our epsilon.

So, you know, we have those and we pass it into the loss function. It calculates the loss function and does the back propagation. So I'll just go over that. We'll get back to the sampling in just a moment. But just to show the training loop. So most of this is copied from our, I think it's 14 augment notebook.

The way you've got the tmax and the sched. The only thing I think you've added here is the DDPM callback, right? Yes, the DDPM callback. And the MSE loss function. Yes. So basically we have to initialize our DDPM callback with the appropriate arguments, like the number of time steps and the minimum beta and maximum beta.

And then of course we're using an MSE loss, as we talked about. It just becomes a regular training loop, and everything else is from before. Yeah, we have your scheduler, your progress bar, all of that we've seen before. I think that's really cool: we're using basically the same code to train a diffusion model as we've used to train a classifier, just with, yeah, an extra callback.
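
A sketch of that wiring, assuming the miniai pieces from the earlier notebooks (Learner, DeviceCB, ProgressCB, MetricsCB, BatchSchedCB) and illustrative values for the learning rate and epoch count:

```python
from functools import partial
from torch import nn, optim

lr, epochs = 4e-3, 5                     # assumed values, not necessarily the notebook's
tmax = epochs * len(dls.train)           # dls: the DataLoaders built earlier in the notebook
sched = partial(optim.lr_scheduler.OneCycleLR, max_lr=lr, total_steps=tmax)
ddpm_cb = DDPMCB(n_steps=1000, beta_min=0.0001, beta_max=0.02)
learn = Learner(model, dls, nn.MSELoss(), lr=lr,
                cbs=[ddpm_cb, DeviceCB(), ProgressCB(plot=True), MetricsCB(), BatchSchedCB(sched)])
learn.fit(epochs)
```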

Yeah. So I think callbacks are very powerful for allowing us to do such things. You can take all this code, and now we have a diffusion training loop, and we can just call learn.fit. And yeah, you can see we get a nice training run, a nice loss curve.

We can save our model; we have saving functionality to save our model, and we can load it in. But now that we have our trained model, the question is: what can we do with it to sample? So the basic idea, of course, was that, you know, basically we're here, right?

We have, let's see here. Okay. So the basic idea is that we start out with a random data point. And of course, that's not going to be within the distribution at first, but now we've learned how to move from one point towards the data distribution.

That's what our noise predicting function does. It basically tells you in what direction and how much to... So the basic idea is that, yeah, I guess I'll start from a new drawing here. Again, we have our distribution, and we have a random point. And we use our noise predicting model that we have trained to tell us which direction to move.

So it tells us some direction. Or I guess, let's... It tells us some direction to move. At first, that direction is not going to be one you can follow all the way to get the correct data point. Because basically, what we're doing is trying to reverse the path that we were following when we were adding noise.

So, because we originally had a data point, and we kept adding noise to it, maybe it followed some path like this. And we want to reverse that path to get back. So our noise predicting function will give us an initial direction, which would be some kind of...

It's going to be kind of tangential to the actual path at that location. So what we would do is follow that direction all the way: we're going to try to predict the fully denoised image by following this noise prediction.

But our fully denoised image is also not going to be a real image. So let me show you an example of that over here in the paper, where they show this a little bit more carefully. Let's see here. So x0... Yeah. So basically, you can see the different data points here.

You can see the different data points here. It's not going to look anything like a real image. So you can see all these points. You know, it doesn't look anything. That we would do is we actually had a little bit of noise back to it and we start... We have a new point where then we could maybe estimate a better...

We get a better estimate of which direction to move, follow that all the way again, and get a new point. We add back a little bit of noise, make a new estimate of the noise prediction, remove the noise, follow that all the way again, add a little bit of noise again to the image, and eventually converge onto an image.

So that's kind of what we're showing here as well. That's a lot like SGD. With SGD, we don't take the gradient and jump all the way. We use a learning rate to go some of the way because each of those estimates of where we want to go are not that great, but we just do it slowly.

Exactly. And at the end of the day, that's what we're doing with this noise prediction. We are predicting the gradient of this p of x, but of course, we need to keep making estimates of that gradient as we're progressing. So we have to keep evaluating our noise prediction function to get updated and better estimates of our gradient in order to finally converge onto our image.

And then you can see that here, where we have maybe this fully predicted denoised image, which at the beginning doesn't look anything like a real image, but then, as we continue throughout the sampling process, we finally converge on something that looks like an actual image. Again, these are CIFAR-10 images, and it's maybe still a little bit unclear how realistic these very small images look, but that's kind of the general principle, I would say.

And so that's what I can show in the code. The idea is: we're going to start out with a random image, right? And this random image is going to be a pure noise image, and it's not going to be part of the data distribution. It's not anything like a real image; it's just a random image.

And so this is going to be our x, I guess, x uppercase T, right? That's what we start out with. And we want to go from x uppercase T all the way to x0. So what we do is we go through each of the time steps and we have to put it in this sort of batch format because that's what our neural network expects.

So we just have to format it appropriately. And we'll get to Z in just a moment. I'll explain that in just a moment. But of course, we just again have similar alpha bar, beta bar, which is getting those variables that we -- And we faked beta bar because we couldn't figure out how to type it, so we used b bar instead.

Yeah, we were unable to get beta bar to work, I guess. But anyway, at each step, what we're trying to do is to try to predict what direction we need to go. And that direction is given by our noise predicting model, right? So what we do is we pass in x of t and our current time step into our model and we get this noise prediction and that's the direction that we need to move it.

So basically, we take x of t, and we first attempt to completely remove the noise, right? That's what this is doing. That's what x0-hat is: completely removing the noise. And of course, as we said, that estimate at the beginning won't be very accurate. So now what we do is we have some coefficients here: a coefficient for how much we keep of this estimate of our denoised image, and how much of the original noisy image we keep.

And on top of that, we're going to add in some additional noise. So that's what we do here. We have x0-hat, and we multiply it by its coefficient, and we have x of t, which we multiply by some coefficient, and we also add some additional noise. That's what the z is.

It's basically a weighted average of the two, plus the additional noise. And then the whole idea is that, as we get closer and closer to time step zero, our estimate of x0 will be more and more accurate. So our x0 coefficient will get closer and closer to one as we're going through the process, and our x t coefficient will get closer and closer to zero.

So basically we're going to be weighting more and more of the x0 hat estimate and less and less of the x t as we're getting closer and closer to our final time step. And so at the end of the day, we will have our estimated generated image. So that's kind of an overview of the sampling process.

So yeah, basically, the way I implemented it here was I had the sample function as part of our callback, and it takes in the model and the shape that you want for the images you're producing. So if you want to specify how many images you produce, that's going to be part of your batch size or whatever, and you'll see that in a moment. But yeah, it's just part of the callback.
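
Here's a sketch of that sampling loop, following the description above; method and variable names are assumed, and the coefficients are the posterior-mean ones from the DDPM paper:

```python
    @torch.no_grad()
    def sample(self, model, sz):
        device = next(model.parameters()).device
        x_t = torch.randn(sz, device=device)                 # start from pure noise, x_T
        preds = []
        for t in reversed(range(self.n_steps)):
            t_batch = torch.full((sz[0],), t, device=device, dtype=torch.long)
            z = torch.randn(sz, device=device) if t > 0 else torch.zeros(sz, device=device)
            abar = self.alphabar[t]
            abar_1 = self.alphabar[t - 1] if t > 0 else torch.tensor(1.0)
            b_bar = self.beta[t] * (1 - abar_1) / (1 - abar)  # the "b bar" posterior variance
            noise_pred = model(x_t, t_batch).sample
            # x0-hat: attempt to completely remove the noise,
            # inverting x_t = sqrt(abar) x0 + sqrt(1 - abar) eps
            x0_hat = (x_t - (1 - abar).sqrt() * noise_pred) / abar.sqrt()
            # weighted average of x0-hat and x_t, plus a little fresh noise
            x0_coeff = abar_1.sqrt() * self.beta[t] / (1 - abar)
            xt_coeff = (1 - self.beta[t]).sqrt() * (1 - abar_1) / (1 - abar)
            x_t = x0_coeff * x0_hat + xt_coeff * x_t + b_bar.sqrt() * z
            preds.append(x_t.cpu())
        return preds
```

So a call like samples = ddpm_cb.sample(model, (16, 1, 32, 32)) would give back one tensor per time step, with samples[-1] being the final generation.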

And you'll just see that in a moment. But yeah, it's just part of the callback. So then we basically have our DDPM callback and then we could just call the sample method of our DDPM callback and we pass in our model. And then here you can see we're going to produce, for example, 16 images and it just has to be a one channel image of shape 32 by 32.

And we get our samples. And one thing I forgot to note was that I am collecting, at each time step, the x of t. So for the predictions here, you can see that there are a thousand of them. We want the last one, because that is our final generation. So we want the last one, and that's what we have.

They're not bad, actually. Yeah. And things have come a long way since DDPM, so this is slower and less great than it could be. But considering that, except for the U-Net, we've done this from scratch, you know, literally from matrix multiplication, I think those are pretty decent. Yeah. And we only trained for about five epochs.

It took, you know, maybe four minutes to train this model, something like that. It's pretty quick. And this is what we could get with very little training. And it's, yeah, pretty decent. You can see there are some clear shirts and shoes and pants and whatever else. So yeah.

And you can see fabric, and it's got texture, and things have buckles, and yeah. You know, something to compare: we did generative modeling the first time we did part two, back in the days when something called Wasserstein GAN was just new, which was actually created by the same guy that created PyTorch, or one of the two guys, Soumith.

And we trained for hours and hours and hours, and got things that I'm not sure were any better than this. So things have come a long way. Yeah. Yeah. And of course, then, yeah, we can see how this sampling progresses over time, over the multiple time steps.

So that's what I'm showing here, because during the sampling process we are collecting, at each time step, what that estimate looks like. And you can kind of see here: this is the estimate of the noisy image over the time steps. Oops. And I guess I have to pause it for a minute.

Yeah, you can kind of see. But you'll notice that, actually, what we did is: we selected an image, the ninth image. So that's this image here; we're looking at this image in particular. And we have a function here that's showing the i-th time step during the sampling process of that image.

And we're just getting the images, and we're only showing basically from time step 800 to 1,000, looking at maybe every five steps as we go from 800 to 1,000. That makes it a little bit visually easier to see the transition.

But what you'll notice is I didn't start all the way from zero, I started from 800. And the reason we do that is because, actually, between zero and 800 there's very little change: it's just mostly a noisy image. And it turns out, as I note here, that's actually a limitation of the noise schedule that is used in the original DDPM paper.

Especially when applied to some of these smaller images, when we're working with images of, like, size 32 by 32 or whatever. And so there are some other papers, like the Improved DDPM paper, that propose other sorts of noise schedules. And what I mean by noise schedule is basically how beta is defined.

So, you know, we had this definition of torch.linspace for our beta, but people have different ways of defining beta that lead to different properties. So people have come up with different improvements like that, and those sorts of improvements work well when we're working with these smaller images.
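
For reference, the linear schedule from the original DDPM paper looks like this (these endpoint values are the standard ones from the paper; a different schedule just means a different definition of beta):

```python
import torch

n_steps = 1000
beta = torch.linspace(0.0001, 0.02, n_steps)  # linear schedule from the DDPM paper
alpha = 1.0 - beta
alphabar = alpha.cumprod(dim=0)               # fraction of signal surviving up to step t
```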

And basically, the point is, if we are working from 0 to 800 and it's mostly just noise that entire time, you know, we're not actually making full use of all those time steps. So it would be nice if we could actually make full use of those time steps and have the model do something during that time period.

So there are some papers that examine this a little bit more carefully. And it would be kind of interesting for maybe some of you folks to also look at these papers and see if you can try to implement, you know, those sorts of models with this notebook as a starting point.

And it should be a fairly simple change in terms of the noise schedule or something like that. So, yeah. So I actually think, you know, this is the start of our next journey. Our previous journey was going from being totally rubbish at Fashion MNIST classification to being really good at it.

I would say now we're a little bit rubbish at doing Fashion MNIST generation. And yeah, I think, you know, we should all now work from here over the next few lessons and so forth, with people trying things at home and all of us trying to make better and better generative models, initially on Fashion MNIST, and hopefully we'll get to the point where we're so good at that that we're like, oh, this is too easy.

And then we'll pick something harder. Yeah. And eventually that'll take us to Stable Diffusion and beyond, I imagine. Yeah. That's cool. I've got some stuff to show you guys, if you're interested. I tried to better understand what was going on in Tanishk's notebook, tried doing it in a thousand different ways, and also wanted to see if I could start to make it a bit faster.

So that's what's in notebook 17, which I will share. So we've already seen the start of notebook 17. Well, one thing I did do is draw a picture for myself, partly just to remind myself what the real ones look like, and they definitely have more detail than the samples that Tanishk was showing.

But they're just 28 by 28, you know. I mean, they're not super amazing images, and they're just black and white. So even if we're fantastic at this, they're never going to look great, because we're using a small, simple dataset. As you always should when you're doing any kind of R&D or experiments: you should always use a small, simple dataset up until you're so good at it that it's not challenging anymore.

And even then, when you're exploring new ideas, you should explore them on small, simple datasets first. Yeah. So after I drew the various things, well, one thing I found challenging about working with your class, Tanishk, is that when stuff is inside a class, it's harder for me to explore.

So I copied and pasted the before_batch contents into a function and called it noisify. And one of the fun things about doing that is it forces you to figure out what the actual parameters to it are. So rather than keeping everything in the class, I've now got all of my various bits separated out: these are the three parameters to the DDPM callback's __init__.

So then these things we can calculate from that. And with those, all we actually need is: what's the image that we're going to noisify, and what's the alpha bar, which, I mean, we could get from here, but it's a bit more general if you can pass in your alpha bar.

So yeah, this is just copied and pasted from the class. But the nice thing is that then I can experiment with it. So I can call noisify on my first 25 images, each one with a different random t. And so I can print out the t's, and then I can actually use those as titles.
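
Pulled out as a standalone function, noisify looks roughly like this (a sketch reconstructed from the description above, so treat the exact signature as illustrative):

```python
import torch

def noisify(x0, alphabar):
    "Pick a random t for each image, then mix in the corresponding amount of noise."
    device = x0.device
    n = len(x0)
    t = torch.randint(0, len(alphabar), (n,), dtype=torch.long, device=device)
    eps = torch.randn(x0.shape, device=device)   # the noise the model must learn to predict
    ab = alphabar[t].reshape(-1, 1, 1, 1).to(device)
    xt = ab.sqrt() * x0 + (1 - ab).sqrt() * eps
    return (xt, t), eps                          # model inputs, and the target
```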

And so this lets me, well, I thought this was quite nice. I might actually rerun this, because none of these look like anything: as it turns out, in this particular case all of the t's are over 200. And as Tanishk mentioned, once you're over 200, it's almost impossible to see anything.

So let me just rerun this and see if we get a better one. There we go, there's a better one. So with a t of 0, right? So remember, t equals 0 is the pure image. At t equals 7, it's just a slightly speckled image. And by 67, it's a pretty bad image.

And by 94, it's very hard to see what it is at all. And by 293, maybe I can see a pair of pants? I'm not sure I can see anything. So yeah, by the way, there's a handy little trick here; I think we've looked at map before in the course.

There's an extended version of map in fastcore. And one of the nice things is that if you pass it a string rather than a function, it basically just calls format with that string. And so this is going to stringify everything using its representation.
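
As a tiny illustration, using fastcore's L (the exact call in the notebook may differ slightly):

```python
from fastcore.all import L

# Passing a string instead of a function makes map use str.format on each element
L([0, 7, 67, 293]).map('t = {}')
# -> (#4) ['t = 0','t = 7','t = 67','t = 293']
```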

That's how I got the titles out of it, just by the way. So yeah, I found this useful to be able to draw a picture of everything. And then I wanted to look at what else I could do. So then, you won't be surprised to see, I took the sample method and turned that into a function.

And I actually decided to pass in everything that it needs, even though, I mean, you could calculate pretty much all of these. But I thought since I'd calculated them before, I'd just pass them in. So this is all copied and pasted from Tanishk's version. And that means the callback now is tiny, right?

Because before_batch is just noisify, and the sample method just calls the sample function. Now, what I did do is, yeah, I wanted to try as many different ways of doing this as possible, partly as an exercise to help everybody see all the different ways we can work with our framework, you know.

So I decided not to inherit from TrainCB; instead I inherited from Callback. So that means I can't use Tanishk's nifty trick of replacing predict. So instead I now need some way to pass the two parts of the first element of the tuple as separate things to the model, and to return the .sample.

So how else could we do that? Well, what we could do is actually inherit from UNet2DModel, which is what Tanishk used directly, and replace the model, specifically replacing the forward function. That's the thing that gets called.

And we can just call the original forward function, but rather than passing in x, we pass in *x, and rather than returning the result directly, we return its .sample. Okay. So if we do that, then we don't need the TrainCB anymore and we don't need predict. And if you're not working with something as beautifully flexible as miniai, you can always do this, you know: replace your model so that it has the interface that you need it to have.
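
In code, the wrapper is roughly this (a sketch assuming diffusers' UNet2DModel, whose forward returns an output object with a .sample attribute):

```python
from diffusers import UNet2DModel

class UNet(UNet2DModel):
    def forward(self, x):
        # x is the (noised images, timesteps) tuple from noisify; unpack it
        # with *x, and pull the tensor out of the returned output object
        return super().forward(*x).sample
```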

So now, again, we do the same as Tanishk had: create the callback. And now when we create the model, we'll use our UNet class, which we just created. I wanted to see if I could make things faster. I tried dividing all of Tanishk's channel counts by two, and I found it worked just as well.

One thing I noticed is that it uses group norm in the U-Net, which we have briefly learned about before. In group norm, the channels are split up into a certain number of groups. And I needed to make sure that those groups had more than one thing in them. So you can actually pass in how many groups you want to use in the normalization.

So that's what this is for. You've got to be a little bit careful of these things; I didn't think of it at first. I think num_groups might have been 32, and I got an error saying you can't split 16 things into 32 groups.
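
The constraint is just that the channel count must divide evenly into the groups. For instance, with PyTorch's nn.GroupNorm directly (diffusers exposes the same setting on UNet2DModel as norm_num_groups, I believe):

```python
import torch.nn as nn

gn = nn.GroupNorm(num_groups=8, num_channels=16)  # fine: two channels per group
# nn.GroupNorm(num_groups=32, num_channels=16)    # ValueError: 16 isn't divisible by 32
```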

But it also made me realize that even in Tanishk's version, the first layer probably had 32 channels with 32 groups, so maybe the group norm wouldn't have been working as well there. So these are little subtle things to look out for. Now that we're not using anything inherited from TrainCB, that means we either need to use TrainCB itself or just use our TrainLearner, and everything else is the same as what Tanishk had.

So then I wanted to look at the results of noisify here, and we've seen this trick before: we call fit, but don't do the training part of the fit, and use the SingleBatchCB callback that we created way back when we first built Learner. And now learn.batch will contain the tuple of tuples, which we can then show using that trick.
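
Something like this, assuming miniai's fit signature and the SingleBatchCB from earlier lessons:

```python
# Run just far enough to populate learn.batch, without any training
learn.fit(1, train=False, cbs=[SingleBatchCB()])
(xt, t), eps = learn.batch  # noisified images, timesteps, and noise targets
```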

So, I mean, obviously we'd expect it to look the same as before, but it's nice. I always like to draw pictures of everything all along the way, because very, very often (I mean, the first six or seven times I do anything) I do it wrong. So given that I know that, I might as well draw a picture to try and see how it's wrong until it's fixed.

It also tells me when it's not wrong. Isn't there a show_batch function now that does something similar? Yes, you wrote that show_image_batch, didn't you? I can't quite remember. Yeah, we should remind ourselves how that worked. That's a good point. Thanks for the reminder. Okay, so then I'll just go ahead and do the same thing that Tanishk did.

But then the next thing I looked at was: how am I going to make this train faster? I want a higher learning rate. And I realized, oddly enough, that the diffusers code does not initialize anything at all. They use the defaults, which just goes to show that even the experts at Hugging Face don't necessarily think, oh, maybe the PyTorch defaults aren't perfect for my model.

Of course they're not, because they depend on what activation function you have and what res blocks you have and so forth. So I wasn't exactly sure how to initialize it. Partly by chatting to Kat Crowson, who's the author of k-diffusion, partly by looking at papers, and partly by thinking about my own experience, I ended up doing a few things.

One is I did do the thing that we talked about a while ago, which is to take every second convolutional layer and zero it out. You could do the same thing using batch norm, which is what we tried before. And since we've got quite a deep network, that seemed like it might help, basically by having the non-identity path in the ResNets do nothing at first.

So they can't cause problems. We haven't talked about orthogonalized weights before, and we probably won't, because you would need to take our Computational Linear Algebra course to learn about that, which is a great course. Rachel Thomas did a fantastic job of it. I highly recommend it, but I don't want to make it a prerequisite.

But Kat mentioned she thought that using orthogonal weights for the downsamplers was a good idea. And then for the up blocks, they also set the second convs to zero. And something Kat mentioned she found useful, which I think is also from a Google paper, is to zero out the weights of basically the very last layer.

And so it's going to start by predicting zero as the noise, which is something that can't hurt. So that's how I initialized the weights: I call init_ddpm on my model.
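
Put together, an init along these lines might look roughly like this. It's a sketch, assuming the diffusers UNet2DModel attribute names down_blocks, up_blocks, resnets, downsamplers, and conv_out:

```python
from torch.nn import init

def init_ddpm(model):
    "Zero the residual branches and the final layer; orthogonal downsamplers."
    for o in model.down_blocks:
        for p in o.resnets:
            p.conv2.weight.data.zero_()      # non-identity path starts as a no-op
        for p in o.downsamplers or []:
            init.orthogonal_(p.conv.weight)  # orthogonal init for the downsampling convs
    for o in model.up_blocks:
        for p in o.resnets:
            p.conv2.weight.data.zero_()
    model.conv_out.weight.data.zero_()       # start out predicting zero as the noise
```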

Something that I found made a huge difference is that I replaced the normal Adam optimizer with one that has an epsilon of 1e-5; the default, I think, is 1e-8. And to remind you, this epsilon comes in when we divide by the exponentially weighted moving average of the squared gradients: if that's a very, very small number, the division makes the effective learning rate huge.

And so we add this epsilon to it to make it not too huge. It's nearly always a good idea to make this bigger than the default; I don't know why the default is so small. And I found that until I did this, any time I tried to use a reasonably large learning rate, somewhere around the middle of the one-cycle training it would explode.
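
The change itself is a one-liner; opt_func here is just an illustrative name for whatever optimizer factory the learner takes:

```python
from functools import partial
from torch import optim

# Bump eps so dividing by the moving average of squared gradients can't blow up
opt_func = partial(optim.Adam, eps=1e-5)  # PyTorch's default is eps=1e-8
```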

So that makes a big difference. So this way, yeah, I could train and get 0.016 after five epochs, and then sample. It all looks pretty similar; we've got some pretty nice textures, I think. So then I was thinking, how do I make it faster? One way we can make it faster is to take advantage of something called mixed precision.

Currently we're using 32-bit floating point values. That's the default, also known as single precision. GPUs are pretty fast at doing 32-bit floating point operations, but they're much, much faster at doing 16-bit ones. However, 16-bit floating point values aren't able to represent a very wide range of numbers, or much precision in the differences between numbers.

And so they're quite difficult to use, but if you can, you'll get a huge benefit, because modern GPUs, modern Nvidia GPUs specifically, have special units that do matrix multiplies of 16-bit values extremely quickly. You can't just cast everything to 16 bit, though, because then there's not enough precision to calculate gradients and things properly.

So we have to use something called mixed precision. Depending on how enthusiastic I'm feeling, I guess we ought to do this from scratch as well. We'll see. We do have a from-scratch implementation in an earlier version of fastai; we actually implemented this before Nvidia did.

Anyway, we'll see. So basically the idea is that we use 32 bit for the things that need 32 bit, and 16 bit for the things where 16 bit is enough. So that's what we're going to do: use mixed precision. For now, we're going to use Nvidia's, you know, semi-automatic or fairly automatic code to do that for us.
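
As a preview, the fairly automatic version in PyTorch looks roughly like this (a minimal sketch assuming a CUDA GPU and an existing model, opt, loss_fn, and dataloader; the proper treatment comes in lesson 20):

```python
import torch

scaler = torch.cuda.amp.GradScaler()   # keeps small gradients representable in fp16
for xb, yb in dataloader:
    with torch.autocast('cuda', dtype=torch.float16):
        loss = loss_fn(model(xb), yb)  # eligible ops run in 16-bit here
    scaler.scale(loss).backward()      # scale the loss up before backward
    scaler.step(opt)                   # unscales gradients, then steps
    scaler.update()
    opt.zero_grad()
```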

Actually, we had a slight change of plan at this point when we realized this lesson was going to be over three hours in length and we should actually split it into two. So we're going to wrap up this lesson here and we're going to come back and implement this mixed precision thing in lesson 20.

So we'll see you then.