
Lesson 21: Deep Learning Foundations to Stable Diffusion


Chapters

0:00 A super cool demo with miniai and CIFAR-10
2:55 The notebook
7:12 Experiment tracking and W&B callback
16:09 Fitting
17:15 Comments on experiment tracking
20:50 FID and KID, metrics for generated images
23:35 FID notebook (18_fid.ipynb)
31:07 Get the FID from an existing model
37:22 Covariance matrix
42:21 Matrix square root
46:17 Why it is called Fréchet Inception Distance (FID)
47:54 Some FID caveats
50:13 KID: Kernel Inception Distance
55:30 FID and KID plots
57:09 Real FID - The Inception network
61:16 Fixing (?) UNet feeding - DDPM_v3
68:49 Schedule experiments
74:52 Train DDPM_v3 and testing with FID
79:01 Denoising Diffusion Implicit Models - DDIM
86:12 How does DDIM work?
90:15 Notation in Papers
92:21 DDIM paper
113:49 Wrapping up

Transcript

Hello, Jono. Hello, Tanishk. Are you guys ready for lesson 21? Ready. Yep, I'm excited. I don't know what I would have said if you had said no. So, good. I'm actually particularly excited because I had a little bit of a sneak preview of something that Jono has been working on, which I think is a super cool demo of what's possible with very little code with miniai.

So let me turn it over to Jono. Great, thanks, Jeremy. Yeah, so as you'll see when it's back to Jeremy to talk through some of the experiments and things we've been doing, we've been using the Fashion MNIST dataset at a really small scale to really rapidly try out these different ideas and see some maybe nuances or things that we'd like to explore further.

And so as we were doing that, I started to think that maybe it was about time to explore just ramping up the level, like seeing if we can go to the next slightly larger datasets, slightly harder difficulty, just to double check that these ideas still hold for longer training runs and different more difficult data.

That's a really good idea because I feel pretty confident that the learnings from Fashion MNIST are going to move across; most of the time these things seem to, but sometimes they don't and it can be very hard to predict. So it seems like a very wise choice. Yeah. And so we'll keep ramping up, but as a next step, one above Fashion MNIST, I thought I'd look at this dataset called CIFAR-10.

And so CIFAR10 dataset is a very popular dataset originally for things like image classification, but also now for any paper on generative modeling. It's kind of like the smallest dataset that you'll see in these papers. And so yeah, if you look at the classification results, for example, pretty much every classification paper since they started tracking has reported results on CIFAR10 as well as their larger datasets.

And likewise with image generation, very, very popular; all of the recent diffusion papers will usually report CIFAR10, and their ImageNet, and then whatever large massive dataset they're training on. We were somewhat notable in 2018 for managing to train it quickly. So for CIFAR10, 94% classification accuracy is kind of the benchmark.

So there was a competition a few years ago where we managed to get to that point at a cost of like 26 cents worth of AWS time, I think, which won a big global competition. So I actually hate CIFAR10, but we had some real fun with it a few years ago.

Yeah. And it's good. It's a nice dataset for quickly testing things out, but we'll talk about why we as a group also don't like it all that much. And we'll pretty soon move on to something better. So one of the things you'll notice in this notebook is I'm basically using all of the same code that Jeremy is going to be looking at and explaining.

So I won't go into too much, but the dataset's also on HuggingFace, so we can load it just like we did Fashion MNIST. The images are three channel rather than single channel, so the shape of the data is slightly different to what we've been working with. That's weird.

Yeah. So instead of the single channel image, we have a three channel red, green, and blue image. And this is what a batch of data looks like. And you've got then two images in your batch. So that's batch by channel by height by width, right? Batch, then channel by height and width.

I was a little confused by the 32 by 32, is it? Oh, yeah. It's fine. I got it now. Batch size is arbitrary. And so if you plot these, one of the things, if you look at this, okay, I can see these are different classes. Like I know this is an airplane, a frog, an airplane (but it's actually a puzzle with an airplane on the cover), a bird, a horse, a car.

That one, if you squint, you can tell it's a deer, but only if you really know what you're looking for. And so when we started to talk about generating these images, this is actually quite frustrating. Like, if I generated this, I'd say this might be the model doing a really bad job.

But it's actually that this is a boat, this is a dog. It's just that this is what the data looks like. And so I've actually got something that can help you out. I'll show later today, which is something like this. It's really actually hard to see whether it's good because the images are bad.

It can be helpful to have a metric that can say how good generated samples are. So I'll be showing a metric for that later today. Yeah. And that'll be great. And I hope to have it automated. But anyway, I just wanted to flag that for visually inspecting these, it's not great.

And so we don't really like CIFAR-10 because it's hard to tell, but it's still a good one to test with. So for the noisify and everything, I'm following exactly what Jeremy is going to be showing; the code works without any changes because we're adding random noise in the same shape as our data.

So even though our data now has three channels, the noisify function still works fine. If we try and visualise the noised images, because we're adding noise in the red, green and blue channels, and some of that's quite extreme values, yeah, it looks slightly different, looks all crazy RGB. But you can see, for example, this frog doesn't have as much noise and it's vaguely visible.

But it is, it's a nearly impossible task to look at this and tell what image is hiding under all of that noise. So I think this is really neat that you could use the same noisify. Yeah, and it still works thanks to, it's not just that shape thing, but I guess just thanks to kind of PyTorch's broadcasting kind of stuff.

This often happens, you can kind of change the dimensions of things and it just keeps working. Exactly. And we've been paying attention to those broadcasting rules and the right dimensions and so on. Cool. So I'm going to use the same sort of learner and UNet approach, except that now obviously I need to specify three input channels and three output channels because we're working with three channel images.
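
To make the broadcasting point concrete, here's a minimal sketch of a noisify along the lines of the course notebooks (the names noisify and alphabar follow the notebooks, but treat the details as illustrative rather than the exact code): the noise is drawn with the same shape as the batch, and alphabar is reshaped so it broadcasts over channels, height and width, so nothing changes when the images go from one channel to three.

```python
import torch

def noisify(x0, alphabar):
    # x0: (batch, channels, height, width) -- works unchanged for 1- or 3-channel images
    device = x0.device
    n = len(x0)
    t = torch.randint(0, len(alphabar), (n,), dtype=torch.long)   # a random timestep per image
    eps = torch.randn(x0.shape, device=device)                    # noise with the same shape as x0
    abar_t = alphabar[t].reshape(-1, 1, 1, 1).to(device)          # broadcasts over c, h, w
    xt = abar_t.sqrt() * x0 + (1 - abar_t).sqrt() * eps           # noised image
    return (xt, t.to(device)), eps                                # model inputs and the target noise
```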

But I did want to explore for this demo, like, okay, how could I maybe justify wanting to do this kind of experiment tracking thing that I'll talk about. And so I'm bumping up the size of the model substantially. I've gone from the default settings that we were using for Fashion MNIST, but the Diffusers default UNet has, what, nearly 20 times as many parameters, 274 million versus 15 million.
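
For reference, instantiating a three-channel diffusers UNet and counting its parameters looks roughly like this; the sample_size and default block settings below are assumptions for illustration, not necessarily the configuration used in the notebook.

```python
from diffusers import UNet2DModel

# Three input and three output channels for RGB CIFAR-10 images.
model = UNet2DModel(sample_size=32, in_channels=3, out_channels=3)
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params/1e6:.1f}M parameters")
```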

So we're going to try a larger model. We're going to try some longer training. And so I could just do the same training that we've always done just in the notebook, set up a learner with a ProgressCB to kind of plot the loss, track some metrics. But yeah, I don't know about you, but once it's beyond a few minutes' training, I quickly get impatient and I have to wait for it to finish before we can sample.

So I'm doing the DDPM sample, but I have to, I actually interrupted the training to say, I just want to get a look at what it looks like initially and to plot some samples. And again, the sampling function works without any modification, but I'm passing in my size to be a three channel image.

Yeah. And so this is like, we could do it like this, but at some point I would like to A, keep track of what experiments I've tried and B, be able to see things as it's going over time, including like, I'd love to see what the samples look like if you generated after the first epoch, after the second epoch.

And so that's where my little callback that I've been playing with comes in. So just before you do that, I'll just mention like, I mean, there are simple ways you could do that, right? Like, you know, one popular way a lot of people do is that they'll save some sample images as files every epoch or two, or we could like, the same way that we have an updating plot, as we train with fast progress, we could have an updating set of sample images.

So there's a few ways we could solve that. That wouldn't handle the tracking that you mentioned of like looking over time at how different changes have improved things or made them worse, whatever that would, I guess, would require you kind of like saving multiple versions of a notebook or keeping some kind of research journal or something.

That'd be a bit fiddly. It is. And all of that's doable, but I also find, like, I'm a little bit lazy sometimes. Maybe I don't write down what I'm trying, or yeah, I've got Untitled notebook number 37 saved. So yeah, the idea that I wanted to show here is just that there are lots of other solutions for this kind of experiment tracking and logging.

And one that I really like is called Weights and Biases. And so I'll explain what's going on in the code here: I'm running a training with this additional Weights and Biases callback. And what it's doing is it's allowing me to log whatever I'd like. So I can log samples at a different-- Okay, so you're switching to a website here called wandb.ai.

So that's where your callback is sending information to. Yeah, so Weights and Biases accounts are free for personal and academic use. And it's very, very, like, I don't think anyone hates Weights and Biases. But it's a very nice service. You sign in and you log in on your computer or you get an authentication token.

And then you're able to log these experiments and you can log into different projects. And what it gives you is for each experiment, anything that you call Weights and Biases.log at any step in the training, that's getting logged and sent to their server and stored somewhere where you can later access it and display it.

They have these plots that you can visualize easily. And you can also share them very easily in these reports that integrate this data sort of interactive thing. And why that's nice is that later you can go and look at-- So this is now the project that I'm logging into.

You can log multiple runs with different settings. And for each of those, you have all of these things that you've tracked, like your training, loss, and validation. But you can also track your learning rate if you're doing a learning rate schedule. And you can save your model as an artifact and it'll get saved on their server so you can see exactly what run reproduced, what model.

It logs the code, if you set that to-- you can set save_code equals true. And then it creates a copy of your whole Python environment, what libraries were installed, what code you ran. So being able to come back later and say, oh, these images here, these look really good.

I can go back and see, oh, that was this experiment here. I can check what settings I used. In the initialization, you can log whatever configuration details you'd like in any comments. And yeah, there's other frameworks for this. Yeah, in some ways, it's kind of-- initially, when I first saw Weights and Biases, it felt a bit weird to me actually sending your information off to an external website because, I mean, before Weights and Biases existed, the most popular way to do this was something called TensorBoard, which Google provides, which is actually a lot like this, but it's a little server that runs on your computer.

And so like when you log things, it just puts it into this little database on your computer, which is totally fine. But I guess actually, there are some benefits to having somebody else run this service instead of running your own little TensorBoard or whatever server. One is that you can have multiple people working on a project collaborating.

So I've done that before, where we will each be sending different sets of hyperparameters, and then they'll end up in the same place. Or if you want to be really antisocial, you can interrupt your romantic dinner and look at your phone to see how your training's going. So yeah, I'm not going to say it's always the best approach to doing things, but I think there's definitely benefits to using this kind of service.

And it looks like you're showing us that you can also create reports for sharing this, which is also pretty nifty. Yeah, yeah. So I like for working with other people or you want to show somebody the final results and being able to, yeah, pull together the results from some different runs or just say, oh, by the way, here's a set of examples from my two most recent.

And things track to different steps. What do you think of this? And being able to have this place where everyone can go and they can inspect the different loss curves. For any run, they can say, oh, what was the batch size for this? Let me go look at the info there.

OK, I didn't log it, but I logged time and the epochs and the learning rate. So yeah, I find it quite nice, especially in a team or if you're doing lots and lots of experiments, to be able to have this permanent record that somebody else deals with, and they host the storage and the tracking.

Yeah, it's quite nice. Wait, and this is all the code you had to write? That's amazing. Yeah. So this is using the callback system. The way Weights and Biases works is that you start an experiment with this wandb.init, and you can specify any configuration settings that you used there.

And then anything you need to log is wandb.log, and you pass in whatever the name of your value is, again, logging the loss and then the value. And once you've done wandb.finish, that syncs everything up and sends it to the server. Oh, this is wild, the way you've inherited from MetricsCB, and you replaced that underscore log that we previously used to allow fastprogress to do the logging, and you've replaced it to have Weights and Biases do the logging.

So yeah, it's really sweet. Yeah, yeah. So this is using the callback system. I wanted to do the things that MetricsCB normally does, which is tracking different metrics that you pass in. So this will still do that. And I just offload to the super, like the original MetricsCB method, for things like the after_batch.

But in addition to that, I'd also like to log to Weights and Biases. And so before I fit, I initialize the experiment; every batch, I'm going to log the loss; after every epoch, the default metrics callback is going to accumulate the metrics and so on, and then it's going to call this underscore log function.

So I chose to modify that to say, I'm going to log my training loss if it's training, I'm going to log my validation loss if I'm doing validation, and I'd like to log some samples. And Weights and Biases is quite flexible in terms of what you can log, you can create images or videos or whatever.

But it also takes a matplotlib figure. And so I'm generating samples and plotting them with show_images and getting back the matplotlib figure, which I can then log, and that becomes these pretty pictures. And you can see over time, every time that log function runs, which is after every epoch, you can go in and see what the images look like.
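
Here's a rough sketch of the kind of callback being described, assuming miniai's MetricsCB with its _log method and a learner that exposes loss and training attributes; the import path, project name, attribute names, and sample-logging details are illustrative rather than Jono's exact code.

```python
import wandb
from miniai.learner import MetricsCB  # import path assumed

class WandBCB(MetricsCB):
    "Log losses and metrics to Weights and Biases as well as printing them (a sketch)."
    order = 100

    def __init__(self, config=None, *ms, **metrics):
        super().__init__(*ms, **metrics)
        self.config = config or {}

    def before_fit(self, learn):
        wandb.init(project="ddpm_cifar10", config=self.config)   # project name is illustrative
        super().before_fit(learn)

    def after_batch(self, learn):
        super().after_batch(learn)
        if learn.training:
            wandb.log({"train_loss": learn.loss.item()})          # assumes learn.loss holds the batch loss

    def _log(self, d):
        # MetricsCB calls _log with a dict of accumulated metrics after each epoch;
        # sample images could be logged here too, e.g. wandb.log({"samples": wandb.Image(fig)}).
        wandb.log(d)
        print(d)                                                  # keep the printed output for the notebook

    def after_fit(self, learn):
        wandb.finish()                                            # sync everything to the server
```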

So maybe we can make your code even simpler in the future. If we had show_images, maybe it could have like an optional return_fig parameter that returns the figure, and then we could replace those four lines of code with one, I suspect. Yeah. Yeah. And I mean, this, I just sort of threw this together.

It's quite early still. You could also, what I've done in the past is usually just create a PIL image where you can, you know, make a grid or overlay text or whatever else you'd like, and then just log that as a wandb.Image. And otherwise, apart from that, I'm just passing in this callback as an extra callback to my set of callbacks for the learner instead of a metrics callback.

And so when I call that fit, I still get my little progress bar, I still get this printed out version because my log function still also prints those metrics just for debugging. But instead of having to like watch the progress in the notebook, I can set this running disconnect from the server, go have dinner, and then I can check on my phone or whatever.

What do the samples look like? And okay, cool. They're starting to look like less than random nonsense, but still not necessarily recognizable. Maybe we need to train for longer. That can be the next experiment. What I should probably do next is think of some extra metrics, but Jeremy's going to talk about that.

So for now, that's pretty much all I had to show is just to say, yeah, it's worth as you move to these longer, you know, 10 minutes, one hour, 10 hours, these experiments, it's worth setting up a bit of infrastructure for yourself so that you know what were the settings I used.

And maybe you're saving the model so you have the artifact as a result. And yeah, I like this Weights and Biases approach, but there's lots of others. The main thing is that you're doing something to track these experiments beyond just, you know, creating many different versions of your notebook.

I love it. One thing I was going to note, that I don't know if many people know, is that Weights and Biases can also save the exact code that you used for that run. So if you make any changes to your code, and then you don't know which version of your code you used for a particular experiment, you can figure out exactly what code you used.

So it's all completely reproducible. And so I love, you know, Weights and Biases, all these different features it has. And I use Weights and Biases all the time for my own research, like almost daily; I had a run going just last night and checked on it this morning.

So it's like, I use it all the time for my own research. And yeah, I use it especially to just know, oh, this run had this particular config. And then, yeah, the models go straight into Weights and Biases. And then if I want to run a model on the test set, I literally take it off of Weights and Biases, like download it from Weights and Biases, and run it on the test set.

So I use it all the time. And also just having the ability to have everything reproducible and know exactly what you were doing is very convenient, instead of having to like manually track it in some sort of like, I guess, a big Excel sheet or some sort of journal or something like that.

Sometimes this is a lot more convenient, I feel. So yeah, lest we get into too much shilling for Weights and Biases, I'm going to put a slightly alternative point of view, which is I don't use it or any experiment tracking framework myself. Which is not to say maybe I couldn't get some benefits by doing so, but I fairly intentionally don't, because I don't want to make it easy for myself to try 1,000 different hyperparameters or do kind of, you know, undirected sampling of things. I like to be very directed, you know.

And so that's, that's kind of the workflow I'm looking for is one that allows that to happen, right? Constantly going back and refactoring and thinking what did I learn and how do I change things from here and never kind of doing like 17 learning rates and six architectures and whatever.

Now, obviously, that's not something that Jono is doing at the moment, but it'd be so easy for him to do if he wanted to. I could easily write a script that just does a hundred runs with different models and different tasks, and then I can look at my Weights and Biases and say, filter by the best loss, which is very tempting.

So I would say to people, like, yeah, definitely be aware that these tools exist. And I definitely agree that as we do this, which is early 2023, Weights and Biases is by far the best one I've seen. It has by far the best integration with fastai. And as of today, if Jono has pushed it yet, it has by far the best integration with miniai.

I think also fastai is the best library for using with Weights and Biases. It works both ways. So yeah, it's there. Consider using it, but also consider not going crazy on experiments, because, you know, I think experiments have their place clearly, but carefully thought out hypotheses, testing them, and changing your code is overall the approach that I think is best.

Well, thank you, Jono. I think that's awesome. I've got some fun stuff to share as well, or at least I think it's fun. And what I wanted to share is, well, first of all, I should say we had said, we all had said, that we were going to look at UNets this week.

We are not going to look at UNets this week, but we have good reason, which is that we had said we're going to go from foundations to Stable Diffusion. That was also a lie, because we're actually going beyond Stable Diffusion. And so we're actually going to start showing today some new research directions.

I'm going to describe the process that I'm using at the moment to investigate some new research directions. And we're also going to be looking at some other people's research directions that have gone beyond Stable Diffusion over the past few months. So we will get to UNets, but we haven't quite finished, you know, as it turns out, the training and sampling yet.

Now, one challenge that I was having as I started experimenting with new things was that I started getting to the point where actually the generated images looked pretty good, and it felt, you know, almost like being a parent: each time a new set of images would come out, I would want to convince myself that these were the most beautiful.

And so, yeah, when they're crap, it's obvious they're crap, you know, but when they're starting to look pretty good, it's very easy to convince yourself you're improving. So I wanted to have a metric which could tell me how good they were. Now, unfortunately, there is no such metric. There's no metric that actually says do these images, would these images look to a human being like pictures of clothes?

Because only talking to a person can do that. But there are some metrics which give you an approximation of that. And as it turns out, these metrics are not actually a replacement for human beings looking at things, but they're a useful addition. So, and I certainly found them useful.

So I'm going to show you the two most common, well, there's really the one most common metric, which is called FID, and I'm going to show another one called KID. So let me describe and show how they work. And I'm going to demonstrate them using the model we trained in the last lesson, which was in DDPM v2.

And you might remember, we trained one with mixed precision, and we saved it as fashion DDPM MP for mixed precision. Okay, so this is all the usual imports and stuff. This is all the usual stuff. But there's a slight difference this time, which is that we're going to try to get the FID for a model we've already trained.

So basically, to get the model we've already trained to get its FID, we can just torch.load it, right, and then .cuda to pop it on the GPU. So I'm going to call that the S model, which is the model for samples, the samples model. And this is just a copied and pasted DDPM from last time.

So that's for sampling. So we're going to do sampling from that model. And so once we've sampled from the model, we're then going to try and calculate this score called the FID. Now, what the FID is going to do is it's not going to say how good are these images.

It's going to say how similar they are to real images. And so the way we're going to do that is we're actually going to look specifically at features of the images that we generated in these samples. We're going to look at some statistics of some of the activations. So what we're going to do is, we've generated these samples, and we're going to create a new DataLoaders, which contains no training batches, and it contains one validation batch, which contains the samples.

It doesn't actually matter what the dependent variable is, so I just put in the same dependent variable that we already had. And then what we're going to do is we're going to use that to extract some features from a model. Now, what do we mean by that? So if you remember back to notebook 14, we created this thing called summary.

And summary shows us, at different blocks of our model, the various different output shapes. In this case, it's a batch size of 1024. And so after the first block, we had 16 channels, 28 by 28, and then we had 32 channels, 14 by 14, and so forth, until just before the final linear layer, where we had the 1024 batch items and 512 channels with no height and width.

Now, the idea of FID and KID is that the distribution of these 512 channels for a real image has a particular kind of signature, right? It looks a particular way. And so what we're going to do is we're going to take our samples, we're going to run them through a model that's learned to predict, you know, fashion classes, and we're going to grab this layer, right?

And then we're going to average it across a batch, right, to get 512 numbers. And that's going to represent the mean of each of those channels. So those channels might represent, for example, does it have a pointed collar? Does it have, you know, smooth fabric? Does it have sharp heels, and so forth, right?

And you could recognize that something's probably not a normal fashion image if it says, "Oh, yes, it's got sharp heels and flowing fabric." It's like, "Oh, that doesn't sound like anything we recognize," right? So there are certain kind of like sets of means of these activations that don't make sense.

So this is a metric for... it's not a metric for an individual image necessarily, but it's across a whole lot of images. So if I generate a bunch of fashion images, and I want to say, does this look like a bunch of fashion images? If I look at the mean, like maybe X percent have this feature and X percent have that feature.

So if I'm looking at those means, it's like comparing the distribution within all these images I generated: do roughly the same amount have sharp collars as those in the training set? Yeah, that's a very good point. Yeah, and it's actually gonna get even more sophisticated than that. But let's just start at that level, which is these features.

So the basic idea here is that we're going to take our samples and we're going to pass them through a pre-trained model that has learned to predict what type of fashion something is. And of course, we train some of those in this notebook. And specifically, we trained a nice 20 epoch one in the data augmentation section, which had a 94.3% accuracy.

And so if we pass our samples through this model, we would expect to get some, you know, useful features. One thing that I found made this a bit complicated, though, is that this model was trained using data that had gone through this transformation of subtracting the mean and dividing by the standard deviation.

And that's not what we're creating in our samples. And so, generally speaking, samples in most of these kinds of diffusion models tend to be between negative one and one. So I actually added a new section to the very bottom of this notebook, which simply replaces the transform with something that goes from negative one to one and just creates those data loaders and then trains something that can classify fashion.
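
The transform change being described amounts to something like the following; the HuggingFace dataset key "image" and the function name are assumptions to illustrate the idea.

```python
import torchvision.transforms.functional as TF

def transformi(b):
    # map PIL images to tensors in [-1, 1] rather than the default [0, 1]
    b["image"] = [TF.to_tensor(o) * 2 - 1 for o in b["image"]]
    return b
```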

And I save this as not data aug, but data aug two. So this is just exactly the same as before, but it's a fashion classifier where the inputs are expected to be between minus one and one. Having said that, it turns out that our samples are not between minus one and one.

But actually, if you go back and you look at DDPM v2, we just use TF.to_tensor, and that actually makes images that are between zero and one. So actually, that's a bug. Okay, so our images have a bug, which is they go between zero and one. So we'll look at fixing that in a moment.

But for now, we're just trying to get the fit of our existing model. So let's do that. So what we need to do is we need to take the output of our model, and we need to multiply by two, so that'll be between zero and two, and subtract one.

So that'll change our samples to be between minus one and one, and we can now pass them through our pre-trained fashion classifier. Okay, so now, how do we get the output of that pooling layer? Because that's actually what we want, to remind you: we want the output of this layer.

So just to kind of flex our PyTorch muscles, I'm going to show a couple of ways to do it. So we're going to load the model I just trained, the data aug 2 model. And what we could do is, of course, we could use a hook. And we have a hooks callback.

So we could just create a function which just appends the output. So very straightforward. Okay, so that's what we want. We want the output. And specifically, these are all sequentials, so we can just go through and go, oh, one, two, three, four, five, the layer that we want.

Okay, and so that's the module that we want to hook. So once we've hooked that, we can pass that as a callback. And we can then, it's a bit weird calling fit, I suppose, because we're saying train equals false, but we're just basically capturing. This is just to make one batch go through and grab the outputs.

So this means now in our hook, there's going to be a thing called outp, because we put it there. And we can grab, for example, a few of those to have a look. And yep, here we've got a 64 by 512 set of features. Okay, so that's one way we can do it.
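
In plain PyTorch (without miniai's hooks callback), the same idea looks roughly like this; the layer index and the variable names classifier and xb are assumptions for illustration.

```python
import torch

feats = []
def hook_fn(module, inp, outp):
    feats.append(outp.detach().cpu())        # stash the pooled activations

# classifier is assumed to be an nn.Sequential; index 6 (the pooling layer) is illustrative
handle = classifier[6].register_forward_hook(hook_fn)
with torch.no_grad():
    classifier(xb.cuda())                    # run one batch through the model
handle.remove()
print(feats[0].shape)                        # e.g. torch.Size([64, 512])
```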

Another way we could do it is that sequential models are what's called, in Python, collections; they have a certain API that they're expected to support. And one thing a collection can do, like a list, is you can call del to delete something. So we can delete this layer and this layer and be left with just these layers.

And once we do that, that means we can just call capture_preds, because now it doesn't have the last two layers. So we can just delete layers eight and seven and call capture_preds. And one nice thing about this is it's going to give us the entire 10,000 images in the test set.

So that's what I ended up deciding to do. There are lots of other ways I played around with which worked, but I decided to show these two as being pretty good techniques. Okay, so now we've got what 10,000 real images look like at the end of the pooling layer.
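
As a sketch of that second approach (again assuming the classifier is an nn.Sequential and the layer indices are as described; miniai's capture_preds would handle the loop for you):

```python
import torch

# Delete the final layers so the truncated model outputs the 512-d pooled features directly.
del classifier[8]
del classifier[7]

with torch.no_grad():
    feats = torch.cat([classifier(xb.cuda()).cpu() for xb, yb in valid_dl])
print(feats.shape)   # e.g. torch.Size([10000, 512])
```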

So now we need to do the same for our samples. So we'll load up our fashion DDPM MP, we'll sample, let's just grab 256 images for now, make them go between minus one and one, make sure they look okay. And as I described before, create a DataLoaders where the validation set just has one batch, which contains our samples, and call capture_preds.

Okay, so that's going to give us our features. And the reason why is because we're passing the samples to the model, and the model is the classifier which we've deleted the last two layers from. So that's going to give us our 256 by 512. So now we can get the means.

Now, that's not really enough to tell us whether something looks like real images. So maybe I should draw here. So we started out with our batch of 256 samples and our channels of 512. And we squished them by taking their mean, so it's now just a vector of 256... no, wrong way around.

We squished them this way, so it's 512, because this is the mean for each channel. Okay. And we did exactly the same thing for the much bigger, you know, full set of real images. So this is our samples and this is our reals. But when we squish that, which was 10,000 by 512, we again get 512.

So we could now compare these two, right? But, you know, you could absolutely have some samples that don't look anything like images, but have similar averages for each channel. So we do a second thing, which is we create a covariance matrix. Now, if you've forgotten what this is, you should go back to our previous lesson where we looked at it, but just remind you a covariance matrix says, in this case, we do it across the channels.

So it's going to be 512 by 512. So it's going to take each of these columns, and in each cell, so here's cell one-one, it's basically saying: what's the difference between each element here and the mean of the whole column, multiplied by exactly the same thing for a different column.

Now, on the diagonal, it's the same column twice. So that means that the values on the diagonal are just the variance, right? But more interestingly, the ones off the diagonal, like here, are actually saying, what's the relationship between column one and column two, right? So if column one and column two are uncorrelated, then this would be zero, right?

If they were identical, right, then it would be the same as the variance in here. So it's how correlated are they. And why is this interesting? Well, if we do the same, exactly the same thing for the reals, that's going to give us another 512 by 512. And it's going to say things like, so let's say this first column was kind of like that, you know, doesn't have pointy heels, right?

And sorry, I can't spell heels. And the second one might be, doesn't have flowing fabric, right? And this is where we say, okay, you know, generally speaking, you would expect these to be negatively correlated, right? So over here in the reals, this is probably going to have a negative, right?

Whereas if over here it was like zero or even worse if it's positive, it'd be like, oh, those are probably not real, right? Because it's very unlikely you're going to have images that have both, where pointy heels are positively associated with a flowing fabric. So we're basically looking for two data sets where their covariance matrices are kind of the same and their means are also kind of the same.

All right. So there are ways of comparing these, you know, basically comparing two sets of data to say, are they from the same distribution? And you can broadly think of it as being like, oh, do they have pretty similar covariance matrices? Do they have pretty similar mean vectors?

And so this is basically what the Fréchet Inception Distance does. Does that make sense so far, guys? Yes. What's striking me now is the similarity to when we were talking about the style loss and those kinds of things: how do we get the types of features that occur together without worrying about exactly where they are in the data, the Gram matrices and so on.

Yeah. Now, the particular way of comparing. So, okay. So I've got the means and I've got the covariances for my samples. And I've actually just created this little calc_stats, right? So I'm showing you how I build things, not just things that are built, right? I always create things step by step and check their shapes, right?

And then I paste them in or merge the cells, copy the cells and merge them into functions. So here's something that gets the means and the covariance matrix. So then I basically call that both for my sample features and for the features of the actual dataset, the test set.
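
A minimal calc_stats along those lines might look like this; the exact notebook version may differ slightly, and sample_feats and real_feats stand for the feature tensors computed above.

```python
def calc_stats(feats):
    "Mean vector and covariance matrix of an (n_samples, n_features) feature tensor."
    feats = feats.squeeze()
    return feats.mean(0), feats.T.cov()   # cov() wants variables in rows, so transpose first

s_means, s_covs = calc_stats(sample_feats)   # stats of the generated samples' features
r_means, r_covs = calc_stats(real_feats)     # stats of the real images' features
```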

Now, with those stats, I can calculate this thing called the Fréchet Inception Distance, which is here. And basically what happens is we multiply together the two covariance matrices, and that's now going to make them bigger, right?

So we now need to basically scale that down again. Now, if we were working with, you know, non-matrices, you know, if you kind of like multiply two things together, then to kind of bring it back down to the original scale, you know, you could kind of like take the square root, right?

So particularly if it was by itself, you took the square root, you get back to the original. And so we need to do exactly the same thing to renormalize these matrices. The problem is that we've got matrices and we need to take the matrix square root. Now the matrix square root, you might not have come across this before, but it exists and it's the thing where the matrix square root of the matrix A times itself is A.

Now, I'm going to slightly cheat because we've used the float square root before and we did not re-implement it from scratch because it's in the Python standard library and also it wouldn't be particularly interesting. But basically the way you can calculate the float square root from scratch is by using, there's lots of ways, but you know, the classic way that you might have done it in high school is to use Newton's method, which is where you basically can solve if you're trying to calculate A equals root x, then you're basically saying A squared equals x, which means you're saying A squared minus x equals zero.

And that's an equation that you can solve and you can solve it by basically taking the derivative and taking a step along the derivative a bunch of times. You can basically do the same thing to calculate the matrix square root. And so here it is, right? It's the Newton method, but because it's for matrices it's slightly more complicated, so it's a short method and I'm not going to go through it, but it's basically the same deal.
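
For the scalar case, Newton's method on f(a) = a^2 - x gives the familiar update a <- a - (a^2 - x) / (2a); the matrix version described next follows the same pattern. A tiny worked example:

```python
def newton_sqrt(x, n_iter=20):
    "Approximate sqrt(x) by Newton's method on a**2 - x = 0."
    a = x
    for _ in range(n_iter):
        a = a - (a * a - x) / (2 * a)   # step along the derivative towards the root
    return a

print(newton_sqrt(2.0))   # ~1.4142135623730951
```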

You go through up to 100 iterations and you basically do something like traveling along that kind of derivative, and then you say, okay, well, the result times itself ought to equal the original matrix. So let's subtract the result times itself from the original matrix and see whether the absolute value is small, and if it is, we've calculated it.

Okay. So that's basically how we do a matrix square root. So we do that, that's that. And so now that we have, strictly speaking, implemented it from scratch, we're allowed to use the one that already exists. PyTorch doesn't have one, sadly, so we have to use the one from SciPy, scipy.linalg.sqrtm.

So this is basically going to give us a measure of similarity between the two covariance matrices. And then, here's the measure of similarity between the two mean vectors, which is just the sum of squared errors. And then basically, for reasons that aren't really interesting, but it's just normalizing, we add what's called the trace, which is the sum of the diagonal elements, and we subtract two times the trace of this thing.
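
Putting that together, the standard Fréchet distance between two Gaussians fitted to the features is ||mu1 - mu2||^2 + Tr(C1 + C2 - 2*sqrt(C1 C2)); a sketch using SciPy's matrix square root, close to but not necessarily identical to the notebook's calc_fid:

```python
import numpy as np
import scipy.linalg

def calc_fid(m1, c1, m2, c2):
    "Fréchet distance between two Gaussians given their means and covariances (torch tensors)."
    m1, c1, m2, c2 = (o.cpu().numpy().astype(np.float64) for o in (m1, c1, m2, c2))
    csr = scipy.linalg.sqrtm(c1 @ c2)             # matrix square root of the covariance product
    if np.iscomplexobj(csr):
        csr = csr.real                            # sqrtm can return tiny imaginary parts
    return ((m1 - m2) ** 2).sum() + np.trace(c1 + c2 - 2 * csr)
```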

And that's called the Fréchet Inception Distance. So, a bit hand-wavy on the math, because I don't think it's particularly relevant to anything, but it gives you a number which represents how similar, you know, this for the samples is to this for some real data. Now it's weird, it's called the Fréchet Inception Distance when we've done nothing to do with Inception.

Well, the reason why is that people do not normally use the fast.ai part two custom Fashion MNIST data aug 2 pickle. They normally use a more famous model. They normally use the Inception model, which was an ImageNet-winning model from Google Brain from a few years ago. There's no reason whatsoever that Inception is a good model to use for this; it just happens to be the one which the original paper used.

And as a result, everybody now uses that, not because they're sheep, but because you want to be able to compare your results with other papers' results, perhaps. We actually don't. We actually want to compare our results with our other results, and we're going to get a much more accurate metric if we use a model that's good specifically at recognizing fashion.

So that's why we're using this. So very, very few people bother to do this. Most people just pip install pytorch-fid or whatever it's called and use Inception. But unless you're comparing to papers, it's actually better to use a model that you've trained on your data and that you know is good at that.

So I guess this is not a FID, it's a... well, maybe the F now stands for Fashion MNIST. I don't know what it stands for. I should have called it something. I wanted to bring up two other caveats of FID, especially in papers. One is that FID is dependent on the number of samples that you use.

So as for the number of samples you use for measuring FID, it's more accurate if you use more samples and less accurate if you use fewer samples. Well, it's actually biased. So if you use fewer samples, it's too high, specifically. Yeah. So in papers, you'll see them report how many samples they used.

And so even then, comparing to other papers and comparing between different models and different things, you want to make sure that you're comparing with the same number of samples. Otherwise, it might just be high because they used a smaller number of samples or something like this. So you want to make sure that's comparable.

And then the other thing that is because I guess it's a kind of a side effect of using the Inception network in these papers is the fact that all of these are at a size 299 by 299, which is like the size that the Inception model was trained. So actually, when you're applying this Inception network for measuring this distance, you're going to be resizing your images to 299 by 299, which in different cases that may not make much sense.

So like in our case, we're working with 32 by 32 or 28 by 28 images. These are very small images and if you resize it to 299, or in other cases, this is now kind of an issue with some of these latest models, you have these large 512 by 512 or 1024 by 1024 images.

And then you're, you know, kind of shrinking these images to 299 by 299. And you're losing a lot of that detail and quality in those images. So actually, it's kind of become a problem with some of these latest papers, when you look at the FID scores and how they're comparing them.

And then visually, when you see them, you can kind of notice, oh, yeah, these are much better images, but the FID score doesn't capture that as well, because you're actually using these much smaller images. So there are a bunch of different caveats. And so FID, you know, it's very good for like, yeah, it's nice and simple and automated for this sort of comparison, but you have to be aware of all these different caveats of this metric as well.

So, excellent segue, because we're going to look at exactly those two things right now. And in fact, there is a metric that compares the two distributions in a way that is not biased. So it's not necessarily higher or lower if you use more or fewer samples, and it's called KID, the Kernel Inception Distance.

It's actually significantly simpler to calculate than the Fréchet Inception Distance. And basically, what you do is you create a bunch of groups, a bunch of partitions, and you go through each of those partitions and you grab a few of your x's at a time and a few of your y's at a time.

And then you calculate something called the MMD, which is here, which is basically, again, the details don't really matter. We basically do a matrix product and we actually take the cube of it. This k is the kernel. And we basically do that for the first sample compared to itself, the second compared to itself, and the first compared to the second.

And we then normalize them in various ways, add the two 'compared with themselves' terms together, and subtract the 'compared with the other one' term. And this one actually does not use the stats. It doesn't use the means and covariance matrices. It uses the features directly. And the actual final result is basically the mean of this calculated across different little batches.
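
As a sketch of what that computation looks like: the published KID uses a degree-3 polynomial kernel and an unbiased MMD estimator that drops the diagonal terms; the simplified, biased version and the subset sizes below are assumptions for illustration.

```python
import torch

def kernel(x, y):
    "Degree-3 polynomial kernel: k(x, y) = (x.y / d + 1)**3."
    return (x @ y.T / x.shape[1] + 1) ** 3

def mmd(x, y):
    "A simple (biased) MMD^2 estimate between two sets of feature vectors."
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

def calc_kid(x_feats, y_feats, n_subsets=100, subset_size=100):
    "Average the MMD over random subsets of the two feature sets."
    scores = []
    for _ in range(n_subsets):
        xi = x_feats[torch.randperm(len(x_feats))[:subset_size]]
        yi = y_feats[torch.randperm(len(y_feats))[:subset_size]]
        scores.append(mmd(xi, yi))
    return torch.stack(scores).mean()
```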

Yeah, again, the math doesn't really matter as to exactly why all these are what they are, but it's going to give you, again, a measure of the similarity of these two distributions. At first I was confused as to why more people weren't using this, because people don't tend to use it, and it doesn't have this nasty bias problem.

And now that I've been using it for a while, I know why, which is that it has a very high variance, which means when I call it multiple times with just like samples with different random seeds, I get very different values. And so I actually haven't found this used at all.

So we're left in the situation, which is, yeah, we don't actually have a good unbiased metric. And I think that's the truth of where we are with best practices. And even if we did, all it would tell you is how similar the distributions are to each other. It doesn't actually tell you whether they look any good, really.

So that's why pretty much all good papers have a section on human testing. But I've definitely found this fairly useful for comparing fashion images in particular; humans are good at looking at faces that are reasonably high resolution and being like, "Oh, that eye looks kind of weird," but we're not good at looking at 28 by 28 fashion images.

So it's particularly helpful for stuff that our brains aren't good at. So I basically wrapped this up into a class, which I call ImageEval, for evaluating images. And so what you're going to do is you're going to pass in a pre-trained classifier model and your DataLoaders, which is the thing that we're going to use to basically get the real images.

So that's going to be the data loaders that were in this learn, so the real images. And so what it's going to do in this class, then again, this is just copying and pasting the previous lines of code and putting them into a class. This is going to be then something that we call capture preds on to get our features for the real images, and then we can also calculate the stats for the real images.

And so then we can get the FID by calling calc_fid, which is the thing we already had, passing in the stats for the real images and calculating the stats for the features from our samples, where the features are the thing that we've seen before: we pass in our samples, any random y value is fine, so I just have a single tensor there, and call capture_preds.

So we can now create an ImageEval object, passing in our classifier, passing in our DataLoaders with the real data, and any other callbacks you want. And if we call fid, it takes about a quarter of a second, and 33.9 is the FID for our samples. Then KID, KID's going to be on a very different scale.

It's only 0.05, so KIDs are generally much smaller than FIDs. So I'm mainly going to be looking at FIDs. And so here's what happens if we calculate the FID on sample zero and then sample 50 and then sample 100 and so forth, all the way up to 900. And then we also do samples 975, 990, and 999.

And so you can see over time, our samples' FIDs improved. So that's a good little test. There's something curious about the fact that they stopped improving about here. So that's interesting. I've not seen anybody plot this graph before. I don't know if Jono or Tanishk have, but I feel like it's something people should be looking at, because it's really telling you whether your sampling is making consistent improvements.

And to clarify, this is like the predicted de-noised sample at the different stages during sampling, right? Yes, exactly. If I was to stop sampling now and just go straight to the predicted x0, what would the FID be? So I just want to check our samples. Yeah, we save the x0-hat at each time step.

Yep. Yep, exactly. Same for KID. And I was hoping that they would look the same, and they do. So that's encouraging, that KID and FID are basically measuring the same thing. And then something else that I haven't seen people do, but I think it's a very good idea, is to take the FID of an actual batch of data.

Okay. And so that tells us how good we could get. Now, that's a bit unfair because of the different sizes, our data batch is 512, our sample is 256, but anyway, it's a pretty huge difference. And then, yeah, the second thing that Tanishk talked about, which I thought I'd actually show, is what does it take to get a real FID, to use the Inception network?

So I didn't particularly feel like re-implementing the Inception network. So I guess I'm cheating here. I'm just going to grab it from pytorch-fid. But there's absolutely no reason to study the Inception network because it's totally obsolete at this point. And as Tanishk mentioned, it wants 299 by 299 images, which actually you can just pass resize_input to have done for you.

It also expects three-channel images. So what I did is I created a wrapper for an Inception v3 model that when you call forward, it takes your batch and replicates the channel three times. So that's basically creating a three-channel version of a black and white image just by replicating it three times.

So with that wrapping, and again, this is good flexing of your PyTorch muscles, try to make sure you can replicate this, that you can get an Inception model working on your Fashion MNIST samples. And yeah, then from there, we can just pass that to our ImageEval instead.
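
The wrapper being described is roughly this, assuming the pytorch-fid package's InceptionV3 class with its resize_input flag; the exact notebook code may differ.

```python
import torch.nn as nn
from pytorch_fid.inception import InceptionV3   # assumed: the pytorch-fid package

class IncepWrap(nn.Module):
    "Wrap Inception so it accepts single-channel 28x28 images."
    def __init__(self):
        super().__init__()
        self.m = InceptionV3(resize_input=True)  # resizes inputs up to 299x299 internally
    def forward(self, x):
        return self.m(x.repeat(1, 3, 1, 1))[0]   # replicate the grey channel into R, G and B
```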

And so on our samples, that gives us 63.8. And on a real batch of data, it gives 27.9. And I find it a good sign that this is much less effective than our real Fashion MNIST classifier, because that's only a difference of a ratio of three or so. The fact that our FID for real data using our own classifier was 6.6, I think that's pretty encouraging.

Yeah, so that is that. And we now have a FID. More specifically, we now have an ImageEval. Did you guys have any questions or comments about that before we keep going? No. It's worth bearing in mind that pretty much every other FID you see reported is going to be set up for CIFAR-10: tiny 32 by 32 pixel images resized up to 299 and fed through an Inception that was trained on ImageNet, not CIFAR-10.

Yeah, so bear in mind that, once again, this is a slightly weird metric. And even things like the image resizing algorithms in PyTorch and TensorFlow might be slightly different. Or if you saved your images as JPEGs and then reloaded them, your FID might be twice as bad.

Yeah, it makes a big difference. Yeah, exactly. So just to reiterate, the takeaway from all of this that I get is that it's really useful if everything's the same: using the same backbone model, using the same approach, the same number of samples, then you can compare it to other samples.

But yeah, for one set of experiments, a FID you see might be good, because of the way everything's set up, and for another, that might be terrible. So comparing to a paper or whatever isn't very easy. So maybe the approach is that, like, if you're doing your own experiments, these sorts of metrics are good.

But then if you're going to compare to other models, it's best to rely on human studies if you're comparing to other models. And that, yeah, I think that's kind of the sort of approach or mindset that we should be having when it comes to this. Yeah, or both, you know.

But yeah, so we're going to see this is going to be very useful for us. And we're just going to be using the same setup pretty much all the time: we're going to use the same number of samples, and we're going to use the same Fashion MNIST-specific classifier. So the first thing I wanted to do was fix our bug.

And to remind you, the bug was that we were feeding into our UNet, in DDPM v2 and the original DDPM, images that go from zero to one. And yeah, that's wrong. Like, nobody does that. Everybody feeds in images that are from minus one to one.

So that's very easy to fix. You just... Do you mind me just asking, like, why is that a bug? Why is it a bug? I mean, it's like, everybody knows it's a bug because that's what everybody does. Like, I've never seen anybody do anything else, and it's very easy to fix.

So I fixed it by adding this to DDPM v2 and I reran it, and it didn't work. It made it worse. And this was the start of, you know, a few horrible days of pain, because, like, when you fix a bug and it makes things worse, that generally suggests there's some other bug somewhere else that has somehow been offsetting your first bug.

And so I had to go, you know, I basically went back through every other notebook and every cell, and I did find at least one bug elsewhere, which is that we hadn't been shuffling our training sets the whole time. So I fixed that, but it's got absolutely nothing to do with this.

And I ended up going through everything from scratch three times, rerunning everything three times, checking every intermediate output three times. So days of, you know, depressing and annoying work, and no progress at all. At which point I then asked Jono's question to myself more carefully and provided a less dismissive response to myself, which was, well, I don't know why everybody does this, actually.

So I asked Tanishk and Jono, I was like, have you guys seen any math, papers, whatever, that's based on this particular input range? And yeah, you guys were both like, no, I haven't. It's just, it's just what everybody does. So at that point, it raised the possibility that, okay, maybe what everybody does is not the right thing to do.

And is there any reason to believe it is the right thing to do? Given that it seemed like fixing the bug made it worse, maybe not. But then it's like, well, okay, we are pretty confident from everything we've learned and discussed that having centered data is better than uncentered data.

So having data that go from zero to one clearly seems weird. So maybe the issue is not that we've changed the center, but that we've scaled it down so that rather than having a range of two, it's got a range of one. So at that point, you know, I did something very simple, which was I did this, I subtracted 0.5.
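
In code, that change is tiny; something like the following, where the dataset key is again an assumption:

```python
import torchvision.transforms.functional as TF

def transformi(b):
    # DDPM v3: keep a range of 1, but centre it at zero: [0, 1] -> [-0.5, 0.5]
    b["image"] = [TF.to_tensor(o) - 0.5 for o in b["image"]]
    return b
```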

So now rather than going from 0 to 1, it goes from minus 0.5 to 0.5. And so the theory here then was, okay, if our hypothesis is correct, which is that the negative one to one range has no foundational reason for being. And we've accidentally hit on something, which is that a range of one is better than a range of two.

And this should be better still, because this is a range of one and it's centered properly. And so this is DDPM v3. And I ran that. And yes, it appeared to be better. And this is great because now I've got FID. I was able to run FID on DDPM v2 and on DDPM v3, and it was dramatically, dramatically, dramatically better.

And in fact, I was running a lot of other experiments at the time, which we will talk about soon. And all of my experiments were totally falling apart when I fixed the bug. And once I did this, all the things that I thought weren't working suddenly started working.

So this is often the case, I guess, is that bugs can highlight accidental discoveries. And the trick is always to be careful enough to recognize when that's happened. Some people might remember the story. This is how the noble gases were discovered. A chemistry experiment went wrong and left behind some strange bubbles at the bottom of the test tube.

And most people would just be like, huh, whoops, bubbles. But people who are careful enough actually went, no, there shouldn't be bubbles there. Let's test them carefully. It's like they don't react. Again, most people would be like, oh, that didn't work. The reaction failed. But if you're really careful, you'll be like, oh, maybe the fact they don't react is the interesting thing.

So yes, being careful is part of the journey. When you say things like it didn't work or it was worse: when you first showed us this thing, I kind of said the images looked fine. The FID was slightly worse. But it was okay. And if you trained it longer, it eventually got better, mostly.

There were some things where sampling occasionally went wrong, one image in a hundred or something like that. But it wasn't like everything completely fell apart. It's just that the results were slightly worse than expected. And if you were doing the run-and-gun, try-a-bunch-of-things approach, it's like, oh well, I just doubled my training time and set a few runs going and looked at the Weights and Biases stats later.

And oh, that seems like it's better now, we just needed to train for longer. And if we had infinite GPUs and lots of money, you would never notice this. So yeah, the fact that you picked up on it showed that you had this deep intuition for where it should be at this stage in training versus where it was, what the samples should look like.

And you had the FID as well, to say, okay, I would have expected a FID of nine and I'm getting 14, what's up here? And that was enough to start asking these questions, and we all jumped on it and started to think about where this came from. Yeah, I mean, definitely.

I drive the people I work with crazy. I don't know why you guys aren't crazy yet, but it's this kind of, no, I need to know exactly why this is not exactly what we expected. But yeah, this is why: I find that when something's mysterious and weird, it means that there's something you didn't understand, and that's an opportunity to learn something new.

So that's what we did. And that was quite exciting because, yeah, going -0.5 to 0.5 made the FID better still. And I moved from this frame of mind of, like, total depression. I was so mad. I still remember when I spoke to Jono, I was just so upset.

And then suddenly it was like, oh my gosh, we're actually onto something. So I started experimenting more, with a bit more confidence at this point, I guess. And one thing I started looking at was our schedule. We'd always been copying and pasting this standard set of stuff, and I started questioning everything.

Why is this the standard? Why are these numbers here? We didn't see any particular reason why those numbers were there, and I thought, well, we should maybe experiment with them. So to make it easier, I created a little function that would return a schedule. Now, you could create a new class for a schedule, but something that's really cool is there's a thing in Python called SimpleNamespace, which is a lot like a struct in C: it basically lets you wrap up a little bunch of keys and values as if it's an object.

So I created this little SimpleNamespace, which contains our alphas, our alpha bars, and our sigmas for our normal beta max of 0.02. This is what we always do. And then there's another paper which mentions an alternative approach, the cosine schedule, where you basically set alpha bar to cos(t/T · π/2)².

And if you make that your alpha bar, you can then basically reverse back out to calculate what alpha must have been. So we can create a schedule for this cosine schedule as well. And this cosine schedule is, I think, pretty widely recognized as being better than the linear schedule.
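Here's a rough sketch of what such a schedule function might look like (the field names, the beta_min of 0.0001, and the exact formulation are assumptions, not the notebook's code):

```python
import math
import torch
from types import SimpleNamespace

def linear_sched(betamax=0.02, n_steps=1000):
    # linear beta schedule; alpha-bar is the cumulative product of the alphas
    beta = torch.linspace(0.0001, betamax, n_steps)
    alpha = 1. - beta
    return SimpleNamespace(alpha=alpha, abar=alpha.cumprod(0), sigma=beta.sqrt())

def cos_sched(n_steps=1000):
    # cosine schedule: abar_t = cos(t/T * pi/2)^2, then back out alpha_t = abar_t / abar_{t-1}
    t = torch.linspace(0, 1, n_steps)
    abar = (t * math.pi / 2).cos() ** 2
    alpha = abar / torch.cat([torch.ones(1), abar[:-1]])
    return SimpleNamespace(alpha=alpha, abar=abar, sigma=(1 - alpha).sqrt())
```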

So I thought, okay, it'll be interesting to look at how they compare. And in fact, really all that matters is the alpha bar. The alpha bar determines the total amount of noise that you're adding. So in DDPM, when we do noisify, it's alpha bar that we're actually using: it's the amount of the image, and 1 minus alpha bar is the amount of noise.

Exactly, yeah. So I just printed those out and plotted them for the normal linear schedule and this cosine schedule. And you can really see the linear schedule really sucks badly. It's got a lot of time steps where alpha bar is basically zero, and that's something we can't really do anything with, whereas the cosine schedule is really nice and smooth and there aren't many steps which are nearly zero or nearly one.
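As a reminder of where alpha bar shows up, here's roughly what the noisify from the earlier DDPM notebooks does (a sketch, not the exact course code):

```python
import torch

def noisify(x0, abar):
    # pick a random timestep per image, then mix image and noise according to alpha-bar
    n = len(x0)
    t = torch.randint(0, len(abar), (n,))
    abar_t = abar[t].reshape(n, 1, 1, 1).to(x0)
    eps = torch.randn_like(x0)
    xt = abar_t.sqrt() * x0 + (1 - abar_t).sqrt() * eps   # "amount of image" + "amount of noise"
    return (xt, t.to(x0.device)), eps
```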

So I was kind of inclined to try using the cosine schedule, but then I thought, well, it'd be easy enough to get rid of this big flat bit by just decreasing beta max. That'd be another thing we could do. Oh, first of all, I should mention that the other thing that's really important is the slope of these curves, because that's how much things are stepping during the sampling process.

And so here's the slope of the linear and the cosine. And you can see the cosine slope is really nice, right? You have this nice smooth curve, whereas the linear is just a disaster. So if I change beta max to 0.01, that actually gets you nearly the same curve as the cosine.

So I thought that was very interesting. It kind of made me think like, why on earth does everybody always use 0.02 as the default? And so we actually talked to Robin, who is one of the two lead authors on the stable diffusion paper. And we talked about all of these things and he said, oh yeah, we noticed not exactly this, but we experimented with everything.

And we noticed that when we decreased beta max, we got better results. And so actually stable diffusion uses beta max of 0.012. I think that might be a little bit higher than they should have picked, but it's certainly a lot better than the normal default. So it was interesting talking to Robin and seeing that all of these kinds of experiments and things we tried out, they had been there as well and noticed the same things.

And the input range as well: they have this magic factor of 0.18215 that they scale the latents by. And if you ask why, they're like, oh yeah, we wanted the latents to be roughly unit variance or whatever, but that's also reducing the range of your inputs to a reasonable value.

Yeah, exactly. We independently discovered this idea. So we'll be talking more about what's actually going on with that, maybe next lesson. Anyway, so here are the curves as well; they're also pretty close. So at this point I was thinking, well, I'd like to change as little as possible.

So I'm going to keep using a linear schedule, but I'm just going to change beta max to 0.01 for my next version of DDPM. So that's what I've got here: linear schedule, beta max 0.01. And so that I wouldn't have to change any of my code, I just put those into the same variable names that I've always used.

So then noisify is exactly the same as it always has been, and I just repeat everything that I've done before. Now when I show a batch of data, I can already see that there are more actually recognizable images, which I think is very encouraging. Previously, almost all of them had been pure noise, which is not a good sign.

So now I just train it exactly the same as DDPM v2, and save this as fashion DDPM 3. Oh, and the other thing I've done here is, since this did turn out to work pretty well, I decided to keep going even further. So I doubled all of my channels from before, and I also multiplied the number of epochs by 3, because things were going so well.

I was like, how well could it go? So we've got a bigger model trained for longer; it takes a few minutes, and the 25 here is the number of epochs. Sampling is exactly the same as it always has been, so create 512 samples, and here they are. And they definitely look great to me.

I'm not sure I could recognize whether these are real samples or generated samples. But luckily we can test them: we can load up our data_aug2 classifier, delete the last two layers, pass that to ImageEval, and get a FID for our samples. And it's eight.

And I chose 512 for a reason, because that's our batch size. So then I can compare like with like: the FID for the actual data is 6.6. So this is hugely exciting to me. We've got down to a FID that is nearly as good as real images.

So in terms of image quality for small unconditional sampling, I feel like we're pretty much done. And so at this point, I was like, okay, well, can we make it faster, at the same quality? And I just wanted to experiment with a few things, like really obvious ideas.

In particular, I thought: we're looping a thousand times, which means we're running the model a thousand times, and that's slow. And most of the time you just move a tiny bit, so the model input is pretty much the same, and the noise being predicted is pretty much the same.

So I just did something really obvious, which is I decided, let's only call the model every third time, and maybe also do every step for the last 50 to help with fine-tuning; I don't know if that's necessary. Other than that, it's exactly the same. So now this is basically three times faster.
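Here's a sketch of the idea only (the function name, the step_fn signature, and the thresholds are assumptions): call the UNet on a subset of steps and reuse the cached prediction in between:

```python
import torch

@torch.no_grad()
def sample_skip(model, step_fn, xt, n_steps=1000, skip=3, full_last=50):
    # reuse the previous noise prediction except every `skip`-th step,
    # and call the model on every step for the final `full_last` steps
    eps = None
    for t in reversed(range(n_steps)):
        if eps is None or t % skip == 0 or t < full_last:
            eps = model(xt, t)          # the expensive UNet call
        xt = step_fn(xt, t, eps)        # the cheap denoising update, reusing eps
    return xt
```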

And yeah, the samples look basically the same. The FID is 9.78 versus 8.1, and that's within the normal variance of FID. So I don't know, you'd have to run this a few times or use bigger samples, but this is basically saying, yeah, you probably don't need to call the model a thousand times.

I did something else slightly weird, which is I said, let's create a different schedule for how often we call the model. It basically said: for the first few time steps, just call it every 10, then for the next few, every nine, then every eight, and so forth.

And just for the last hundred, do it every single step. So that makes it even faster. The samples look good; it's definitely worse now, but it's still not bad. So I kind of felt like, all right, this is encouraging. And before we fixed the minus-one-to-one thing, this stuff looked really bad, which is why I was thinking my code was full of bugs.

So at this point I'm thinking, okay, we can create extremely high quality samples using DDPM. What's the best paper out there for doing it faster? The most popular paper for doing it faster is DDIM, so I thought we might switch to this next.

So we're now at the point where we're not actually going to retrain our model at all, right? If you noticed with these different sampling approaches, I didn't retrain the model at all. We're just saying, okay, we've got a model. The model knows how to estimate the noise in an image.

How do we use that, calling it multiple times to denoise using iterative refinement, as Jono calls it? DDIM is another way of doing that. So I'm going to show you how I built my own DDIM from scratch.

And I kind of cheated: there's already an existing one in diffusers, so I decided I'd use that first, make sure everything works, and then try and re-implement it from scratch myself. When there's an existing thing that works, that's what I like to do.

And it's been really good to have my own DDIM from scratch, because now I can modify it, and I've made the code much more concise than the diffusers version. So, we had created this class called UNet, which passed the tuple of x's through as individual parameters and returned the .sample.

But not surprisingly, given that this comes from diffusers and we want to use the diffusers schedulers, the diffusers schedulers assume this has not happened: they want the x as a tuple and they expect to find the thing called .sample. So here's something crazy.

When we save this thing, this pickle, it doesn't really know anything about the code, right? It just knows that it's from a class called UNet. So we can actually lie. We can say, oh yeah, that class called UNet? It's actually the same as UNet2DModel, with no other changes, and Python doesn't know or care, right?

So we can now load up this model, and it's going to use this UNet. This is where it's useful to understand how Python works behind the scenes; it's a very simple programming language. So we've now got the model we've trained, but it's just going to use the .sample on it.
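Something like this (the class definition and file name here are assumptions for illustration): because the pickle only records the class name, rebinding that name before loading lets the saved weights come back as a plain diffusers model:

```python
import torch
from diffusers import UNet2DModel

# "That class called UNet? It's actually just UNet2DModel" - Python doesn't know or care.
class UNet(UNet2DModel): pass

model = torch.load('fashion_ddpm3.pkl', map_location='cpu')  # hypothetical filename
# model(xt, t) now returns the usual diffusers output with a .sample attribute
```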

That means we can use it directly with the diffusers schedulers. So we'll start by repeating what we already know how to do, which is use a DDPM scheduler. We have to tell it what beta we used to train. And so we can grab some random data and say, okay, we're going to start at time step 999.

So let's create a batch of data and then predict the noise. And the way the diffusers thing works is you call scheduler.step, and that's the thing which does those lines: that's the thing that calculates the new x_t given the noise. So that's what scheduler.step does, and that's why you pass in x_t and the time step and the noise.

And that's going to give you a new x_t. So I ran that as usual, first cell by cell, to make sure I understood how it all worked. I then copied those cells, merged them together, and chucked them in a loop. So this is now going to go through all the time steps, use a progress bar to see how we're going, get the noise, call step, and append.
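Something along these lines (the exact scheduler kwargs, shapes, and variable names are assumptions, and `model` is assumed to be the UNet we just loaded):

```python
import torch
from diffusers import DDPMScheduler
from tqdm import tqdm

sched = DDPMScheduler(beta_end=0.01)                 # tell it the beta we trained with
xt = torch.randn(512, 1, 32, 32)                     # start from pure noise (shape is an assumption)
for t in tqdm(sched.timesteps):                      # 999, 998, ..., 0
    with torch.no_grad():
        noise_pred = model(xt, t).sample             # the UNet's predicted noise
    xt = sched.step(noise_pred, t, xt).prev_sample   # one denoising step
```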

So this is just DDPM, but using diffusers, and not surprisingly it gives basically the same results, very nice results, as we got from our own DDPM. And so we can now use the same code we've used before to create our image evaluator.

And I decided we're now going to go right up to 2048 images at a time. This is the size I found is big enough that it's reasonably stable. And so we're now down to 3.7 for our FID, whereas the data itself has a FID of 1.9.

So again, it's showing that our DDPM is very nearly unrecognizably different from real data, using the distribution of those activations. So then we can switch to DDIM by just saying DDIMScheduler. And with DDIM, you can say, I don't want to do all thousand steps, I just want to do 333 steps, so every third.

So that's basically a bit like the skipped sampling of doing every third, but DDIM, as we'll see, does it in a smarter way. And so here's basically exactly the same code as before, but I put it into a little function, so I can pass in my model, the size, the scheduler.
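A rough sketch of that little sampling function (names, kwargs, and defaults are assumptions), using the diffusers DDIMScheduler:

```python
import torch
from diffusers import DDIMScheduler
from tqdm import tqdm

@torch.no_grad()
def sample(model, sz, n_steps=333, eta=1.):
    sched = DDIMScheduler(beta_end=0.01)
    sched.set_timesteps(n_steps)                # e.g. 333 -> roughly every third timestep
    xt = torch.randn(sz)
    for t in tqdm(sched.timesteps):
        noise_pred = model(xt, t).sample
        xt = sched.step(noise_pred, t, xt, eta=eta).prev_sample
    return xt
```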

And then there's a parameter called eta, which is basically how much noise to add; here, just add all the noise. And this is now three times faster. And yeah, the FID's basically the same, that's encouraging. So then I did 200 steps.

FID's basically the same. 100 steps, and at this point, okay, the FID's getting worse. And then 50 steps, and 25 steps, and, that's interesting, when you get down to 25 steps, what does it look like? You can see that they're kind of too smooth.

They don't have interesting fabric swirls so much, or buckles or logos or patterns, as much as these ones, which have a lot more texture to them. So that's what tends to happen: you can still get something out pretty fast, but that's how they suffer.

So, okay, how does DDIM work? Well, DDIM is nice; in my opinion, it actually makes things a lot easier than DDPM. There's basically an equation from the paper, which Tanishk will explain shortly. But what I did is I grabbed the sample function from here and split it out into two bits.

One bit is the bit that says what the time steps are, creates that random starting point, loops through, finds what my current alpha bar is, gets the noise, and then basically does the same as sched.step: it calls some function, and that's been pulled out. So this allows me to now create my own different step functions.

So here's the DDIM step, and basically all I did was take this equation and turn it into code. Actually, this one is a second equation from the paper. Now it's a bit confusing, because the notation here is different: what DDPM calls alpha bar, this paper calls alpha.

So you've got to look out for that. Basically you'll see, I've got here x_t, minus, okay, one minus alpha bar, which we'll call beta bar. So beta bar square root times noise; and this here is the neural net, so this is the predicted noise.

And here I've got my next x_t: here's my alpha bar t-1 square root times this, and you can see here it says predicted x0, so here's my predicted x0; plus the square root of beta bar t-1 minus sigma squared, times, again, the noise, the same thing as here.

And then plus a bit of random noise, which we only add if you're not at the last step.
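Here's a sketch of that step written in DDPM-style notation, i.e. using alpha bar where the DDIM paper writes alpha (the function name and signature are assumptions, not the notebook's code):

```python
import torch

def ddim_step(xt, eps, abar_t, abar_t1, eta=1., last=False):
    # sigma from the DDIM paper (eq. 16), scaled by eta
    sigma = eta * ((1 - abar_t1) / (1 - abar_t)).sqrt() * (1 - abar_t / abar_t1).sqrt()
    x0_hat = (xt - (1 - abar_t).sqrt() * eps) / abar_t.sqrt()          # predicted x0
    xt1 = abar_t1.sqrt() * x0_hat + (1 - abar_t1 - sigma**2).sqrt() * eps
    if not last: xt1 = xt1 + sigma * torch.randn_like(xt)              # random noise, skipped at the last step
    return xt1
```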

So I can call that. Rather than saying a hundred steps, I said skip 10 steps at a time, so it's basically going to be a hundred steps. And you can see here, this actually happened to do a bit better for my hundred steps; it's not bad at all. So yeah, getting to this point has been a bit of a lifesaver, to be honest, because I can now run a batch of 2048 samples.

I can sample them in under a minute, which doesn't feel painful. So I'm now at a point where I can actually get a pretty good measure of how I'm doing in a pretty reasonable amount of time, and I can easily compare things. And I've got to admit, between a FID of five and eight and 11, I can't necessarily tell the difference by eye.

So for Fashion-MNIST, I think FID is better than my eyes for this, as long as I use a consistent sample size. So yeah, Tanishk, did you want to talk a bit about the ideas of why we do this, or where it comes from, or what the notation means?

Can I say a little bit before we do that? What you have there, Jeremy, which is a screenshot from the paper and then code that tries to follow it as closely as possible, the difference that makes for people is huge. I've got a little research team that I'm doing some contract work with.

And the fact that it's called alpha in the DDIM paper and alpha bar elsewhere, and then in the code that they were copying and pasting from it was called a and b for alpha and beta, it's like, you can get things kind of working by copying and pasting things around.

And it all just sort of kind of works. But just spending the time to actually take screenshots of equations 14 and 16 from the paper, put them in there, and rewrite the code with some comments and things to say, this is what this is, this is that part from the equation.

The look of pain on their faces when I said, oh, by the way, did you notice that it's called alpha there and alpha bar there? They're like, yes, how could they do that? You could just tell how many hours had been spent squinting at text and asking what's wrong here.

Yeah, and building this stuff in notebooks is such a good idea, like we're doing with miniai, because the next engineer to come along and work on it can see the equation right there, and you can add prose and stuff. So I think nbdev works particularly well for this kind of development.

Yeah. Before I talk about this, I just wanted to briefly mention, in the context of all of these different notations, I recently created this meme, which I thought was relevant: each diffusion model paper basically has a different notation. It's just like this, they all try to come up with their own universal notation and it just keeps proliferating.

To me, we should all just use APL. Yes, exactly. We need to implement diffusion models in APL. So yeah, the paper that Jeremy had implemented was this denoising diffusion implicit models paper. And if you look at the paper, you can see the notation could again be a little bit intimidating, but when we walk through it, we'll see it's not too bad actually.

So I'll just bring up some of the important equations, and also compare and contrast DDPM's notation and equations with DDIM. Not only is it not too bad, I actually discovered it makes life a lot easier: the DDIM notation and equations are a lot easier to work with than DDPM's.

So I found my life is better since I discovered DDIM. Yes, yes, I think a lot of people prefer to use DDIM as well. So yeah, basically, let's see here: in both DDIM and DDPM, we have this same sort of equation.

This equation is exactly the same. This is telling us the predicted denoised image. You can see my pointer, right? Just want to confirm. By the way, the little double-headed arrow in the top right, if you click that, do you get more room for us to see what's going on?

I'm sorry? Ah, yeah, I see. Yeah, that works much better. So we have our predicted noise; our model is predicting the noise in the image. It's also passed the time step, but that's just omitted in the notation here: it's kind of implied in the x_t, but our model also takes in the time step.

So it's predicting the noise in this x_t, our noisy image, and we are trying to remove that noise; that's what this whole term here is doing. Because the noise that we're predicting is unit-variance noise, we have to scale it appropriately to remove it from our noisy image.

So we have to scale the noise and subtract it out of the noisy image, and that's how we get our predicted denoised image. And I think we derived this one before by looking at the equation for x_t in the noisify function and rearranging it to solve for x0, and that's what you get.
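In DDPM-style notation, that rearrangement is just:

$$x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon \;\;\Longrightarrow\;\; \hat x_0 = \frac{x_t - \sqrt{1-\bar\alpha_t}\;\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}$$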

Yes, that's basically what this is. So the idea is, instead of noisifying, where we start out with x0 and some noise and get an x_t, we're doing the opposite: we have the predicted noise and we have x_t, so how can we get x0?

So that's what this equation is: the predicted x0, or our predicted clean image. And this equation is the same for both DDPM and DDIM, but these distributions are what's different between DDPM and DDIM. We have this distribution which tells us, okay, if we have x_t, which is our current noisy image, and x0, which is our clean image, can we find out what some sort of intermediate noisy image in that process is?

And that's x_{t-1}. So we have a distribution for that, and that tells us how to get such an image. In the DDPM paper they define some distribution and explain the math behind it. Basically they have some equations: you have, again, a Gaussian distribution with some sort of mean and variance, but it's some form of interpolation between your original clean image and your noisy image.

And that gives you your intermediate, slightly less noisy image. So that's what this is giving: given a clean image and a noisy image, your slightly less noisy image. And so the sampling procedure that we do with DDPM is basically: predict the x0, and then plug it into this distribution to give you your slightly less noisy image.

So maybe it's worth drawing that out. Let's say we have some sort of curve; I'm just sketching something here. In this case, I'm showing a one-dimensional example.

It's kind of a one-dimensional example that still sits in a 2D space. Any point on this curve represents an actual image that you want to sample from, right? So this is where your distribution of actual images would lie.

And you want to estimate this. So this sort of algorithm that we've been seeing says, okay, we take some random point, some random point that we choose when we start out. And what we did is we learned this function, the score function, to take us to this manifold, but it's only going to be accurate in some region.

So it would only be accurate in some area. We get an estimate of the score function and it tells us the direction to move in, and it's going to give us the direction to predict our denoised image, right? So let's say your score function is, in reality, some curve, okay?

So in reality it's some curve that points to, oops, it points here. That's your score function; the score function basically means your gradient. Yeah. Yes, yes, it's a gradient. So we're doing some form of, in this case I guess you'd say gradient ascent, because you're not really minimizing, you're maximizing.

You're maximizing the likelihood of that data point being an actual data point; you want to go towards it. So you're doing this sort of gradient ascent process, following the gradient to get there. So when we estimate epsilon theta and predict our noise, what we're doing is getting the score value here.

And then we can follow that, and we follow it to some point. I'm exaggerating a bit here, but this point will now represent our x0 hat. And in reality, that's probably not going to be a point that's actually on the distribution.

So it's not going to be a very good estimate of a clean image at the beginning. But that's the only estimate we have at this point, and we have to follow it all the way to some place. So this is where we follow it to.

And then we want to find some sort of x_{t-1}; that's what our next point is, and that's what our second distribution tells us. It basically takes us back to maybe some point here. And now we can re-estimate the score function, or our gradient, over there, and do this prediction of noise.

And it may be a more accurate estimate of the score function, and maybe we go somewhere here. Then we re-estimate, get another point, and follow it again. So that's this iterative process where we're trying to follow the score function towards an actual data point.

And in order to do so, we first have to estimate our x0 hat, then add back some noise to get a new estimate, and keep following, add back a little more noise, and keep estimating. So that's what we're doing here in these two steps.

We have our x0 hat, and then we have this distribution, and that's how we do regular DDPM. And I think breaking it up into two steps like that makes it a bit clearer. I don't think the DDPM paper really clarifies that or talks about it much.

But the DDIM paper really hammers that point home, I think, especially in their update equation. So that's DDPM; and with DDPM, the one thing is that you look at your prediction and use that to make a step, but you also add back some additional noise that's always fixed.

Right, there's no parameter to control how much extra noise you add back at each step. Right, exactly. So, let's see here. Basically, you won't be exactly at this point, you'll be in the general vicinity, and adding that noise also helps because you don't want to fall into specific modes where it's like, oh, this is the most likely data point; you want to add some noise so you can explore other data points as well.

So the noise can help, and that's something you really can't control with DDPM. This is something that DDIM explores a bit further, in terms of the noise, and even trying to get rid of the noise altogether. With the DDIM paper, the main difference is literally this one equation; all it really is is changing this distribution where you predict the less noisy image.

And as you can see, you now have this additional parameter, sigma. And sigma controls how much noise, like we were just mentioning, is going to be part of this process. And if you want, you could actually set sigma to zero.

And then you can see here, the variance would be zero, and so this becomes a completely deterministic process. So if you want, this could be completely deterministic. That's one aspect of it. And the other aspect, the reason it's called DDIM and not DDPM, is because it's not probabilistic anymore.

It can be made deterministic, so the name was changed for that reason. But the other thing is, you would think you've kind of changed the model altogether with this new distribution, so you'd say, oh, wouldn't you have to train a different model for this purpose?

But it turns out the math works out such that the same model objective works with this distribution as well. In fact, I think that's what they were setting out to do from the very beginning: what other kinds of models can we get with the same objective? And so this is what they're able to do: you can introduce this new parameter, in this case controlling the stochasticity of the model.

And you can still use the exact same trained model that you had. So what this means is that this is really just a new sampling algorithm, and not anything new with the training itself; it's just, like we talked about, a new way of sampling the model.

So given this equation, you can now rewrite your x_{t-1} term. And again, we're doing the same sort of thing where we split it up into predicting the x0, and then adding back the part that points back towards your x_t.

And also, if you want to add a little bit of noise back in, like Jono was saying, you can do so: you have this extra term here, and sigma controls that term. And again, looking at the DDIM equation versus the DDPM equation, you have to be careful that the alphas here are referring to alpha bars in the DDPM notation.
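Written out in DDPM-style notation (alpha bar where the DDIM paper writes alpha), the DDIM update is:

$$x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\underbrace{\frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t)}{\sqrt{\bar\alpha_t}}}_{\text{predicted }x_0} \;+\; \underbrace{\sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\;\epsilon_\theta(x_t)}_{\text{direction pointing to }x_t} \;+\; \underbrace{\sigma_t\,\epsilon_t}_{\text{random noise}}$$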

So that's the other caveat. And if you set sigma_t to this particular value, it will give you back DDPM. So sometimes instead they will write, as Jeremy mentioned, this eta, where sigma is equal to eta times this coefficient.

So let me just go back. Basically, you have eta here; this is where eta would go. If it's one, it becomes regular DDPM, and if it's zero, of course, that's the deterministic case. So this is the eta that all these APIs, and the code that we have, also the code that Jeremy was showing, use: they have eta equal to one, which they say corresponds to regular DDPM.
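That coefficient, the sigma that recovers DDPM when eta is one, is (again with alpha bar for the DDIM paper's alpha):

$$\sigma_t(\eta) = \eta\,\sqrt{\frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}}\;\sqrt{1-\frac{\bar\alpha_t}{\bar\alpha_{t-1}}}$$

so $\eta=0$ gives the deterministic sampler and $\eta=1$ matches the DDPM noise level.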

This is actually where the eta goes in the equation. So finally, you could just pass in sigma, right? If you weren't trying to match previous papers, you could just say, oh well, we have this parameter sigma that controls the amount of noise.

So let's just take a noise scale as an argument. But for convenience, they said, let's create this new thing, eta, where zero means sigma is equal to zero, which if you look at the equation works out, and one means we match the amount of noise that's in vanilla DDPM.

And so that gives you a nice scale. You could say eta equals two, or eta equals 0.7, or whatever, but with a meaningful unit where one equals the same as this previous reference work. Well, it's also convenient because it's sigma_t, which is to say, different time steps, unless you choose eta equals zero, in which case it doesn't matter.

Different time steps probably want different amounts of noise, and so here's a reasonable way of scaling that noise. Then the last thing of importance, which is of course one of the reasons we were exploring this in the first place, is being able to do this sort of rapid sampling.

The basic idea here is that you can define a similar distribution, where again the math works out similarly, but now you have some subset of the diffusion steps. In this case it uses a tau variable to denote that subset.

So if it's 10 diffusion steps, then tau one would just be zero, tau two would be a hundred, and you keep going all the way up to a thousand.

And so you'd get 10 diffusion steps. That's what they're referring to with this tau variable here. And you can do a similar derivation to show that this distribution again meets the same objective that you used for training.
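For example, a sub-sequence of 10 out of 1000 training timesteps might just be (an illustration, not the paper's exact indexing):

```python
taus = list(range(0, 1000, 100))   # [0, 100, 200, ..., 900]
# sampling then only visits these timesteps, using their corresponding alpha-bars
```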

And you can now use this for faster sampling, where basically all you have to do is select the appropriate alpha bar. This one I've written out, so here alpha bar actually is the regular alpha bar that we've talked about; sorry, it's a little bit confusing switching between different notations.

But basically, you have this distribution, and then you just have to select the appropriate alpha bars, and the math follows the same way, giving you the appropriate sampling process. And I guess that makes it a lot simpler to do this accelerated sampling.

Yeah, any other notes or comments you guys had? Yeah, the key for me is that in this equation we only need one parameter, which is the alpha bar, or alpha depending on which notation is used, and everything else is calculated from that.

So we don't have what DDPM calls the alpha or the beta anymore. And that's more convenient for doing this kind of smaller number of steps, because we can jump straight from a time step to an alpha bar. It's also particularly convenient with the cosine schedule, because you can calculate the inverse of the cosine schedule function, which means you can also go from an alpha bar back to a t.

So it's really easy to say, oh, what would alpha bar be 10 time steps before this one? You could just call a function. We don't need anything else.
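For the cosine schedule this round trip is just a function and its inverse (a sketch, in scalar form):

```python
import math

def abar(t, T):
    return math.cos(t / T * math.pi / 2) ** 2          # alpha-bar as a function of t

def inv_abar(ab, T):
    return 2 * T / math.pi * math.acos(math.sqrt(ab))  # recover t from alpha-bar
```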

And actually, the original cosine schedule paper has to fuss around with various epsilon-style small numbers that they add to things to avoid weird numerical problems, and when we only deal with alpha bar, all that stuff also goes away. So if you look at the DDIM code, it's simpler code with fewer parameters than our DDPM code. And of course it's dramatically faster, and it's also more flexible, because we've got this eta thing we can play with.

Yes, yeah, that's the other thing: this idea of controlling stochasticity. I think that's something that's interesting to explore, and we've been exploring that a bit now, and I think we'll continue to, in terms of deterministic versus stochastic sampling. So it's worth talking about the sigma in the middle equation you've got there.

You've got the sigma_t epsilon_t adding the random noise. And intuitively it makes sense that if you're adding random noise there, you want to move less back towards x_t, which is your noisy image. So that's why you've got the 1 minus alpha_{t-1} minus sigma squared.

And you're taking the square root of that. So basically you're subtracting sigma_t's contribution from the direction pointing to x_t and adding it to the random noise, or vice versa. So yes, everything's there for a reason.

And the predicted x0, that entire equation we've derived previously, remains the same in pretty much any diffusion model methodology. Well, we'll actually be talking about some places where it changes, probably next week. Well, yeah, I guess as long as you're predicting the noise.

Yes, yes. If you're predicting the noise, yes. Okay. So I think let's wrap it up here, so that we leave ourselves plenty of time to cover the new research directions next lesson in more detail. But as I mentioned, in terms of where we're at: just like we hit a point a few weeks ago of, okay, we can really predict classes for Fashion-MNIST, I think we're there now for unconditional generation. We can do stable diffusion style sampling, except for the UNet architecture, and we can generate Fashion-MNIST so it's almost unrecognizably different from the real samples, and DDIM is the scheduler that the original stable diffusion paper used.

So we're actually about to go beyond stable diffusion for our sampling and UNet training now. So I think we're definitely meeting our stretch goals so far, and all from scratch, with Weights and Biases experiment logging. And if you wanted to have fun, there's no reason you couldn't have a little callback that instead logs things into a SQLite database, and then you could write a little front end to show your experiments; that'd be fun as well.

Yeah, I mean, you could also have it send you a text message when the loss gets good enough. Alright, well, thanks, guys. That was really fun. Thanks, everybody. Alright, bye. Okay, talk to you later then. Bye.