All right. Hi, gang. And here we are in Lesson 21, joined by the legends themselves, Jono and Tanishk. Hello. Hello. And today, you'll be shocked to hear that we are going to look at a Jupyter notebook. We're going to look at notebook 22. This is pretty quick. Just, you know, improvement, pretty simple improvement to our DDPM/DDIM implementation for Fashion MNIST.
And this is all the same so far, but what I've done is I've made one quite significant change. And some of the changes we'll be making today are all about making life simpler. And they're kind of reflecting the way the papers have been taking things. And it's interesting to see how the papers have not only made things better, they made things simpler.
And so one of the things that I've noticed in recent papers is that there's no longer a concept of n steps, which is something we've always had before, and it always bothered me a bit, this capital T thing. You know, this T over T, it's basically saying this is time step number, say 500 out of 1000, so it's time step 0.5.
Why not just call it 0.5? And the answer is, well, we can. So we talked last time about the cosine scheduler. We didn't end up using it because I came up with an idea which was, you know, simpler and nearly the same, which is just to change our beta max.
But in this next notebook, I decided let's use the cosine scheduler, but let's try to get rid of the n steps thing and the capital T thing. So here is A bar again. And now I've got rid of the capital T. So now I'm going to assume that your time step is between 0 and 1, and it basically represents what percentage of the way through the diffusion process are you?
So 0 would be all noise, and 1 would be - well, no, sorry, the other way around - 0 would be all clean, and 1 would be all noise. So how far through the forward diffusion process? So other than that, this is exactly the same equation we've already seen.
And I realized something else, which is kind of fun, which is you can take the inverse of that. So you can calculate T. So we would basically first take the square root, and we would then take the inverse cos, and we would then divide by pi over 2, or times 2 over pi.
So we can both - so it's interesting now we don't - the alpha bar is not something we look up in a list, it's something we calculate with a function from a float. And so yeah, interestingly, that means we can also calculate T from an alpha bar. So Noisify has changed a little.
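A minimal sketch of the two functions just described, assuming the cosine schedule takes the form cos(t·π/2)² with t between 0 and 1. The names `abar` and `inv_abar` are illustrative, and plain Python floats stand in for tensors.

```python
import math

def abar(t):
    """Cosine schedule: fraction of signal remaining at time t in [0, 1]."""
    return math.cos(t * math.pi / 2) ** 2

def inv_abar(a):
    """Inverse of abar: square root, then inverse cos, then times 2 over pi."""
    return math.acos(math.sqrt(a)) * 2 / math.pi

# t = 0 is all clean (abar = 1); t = 1 is all noise (abar = 0).
print(abar(0.0), abar(0.5), abar(1.0))
print(inv_abar(abar(0.3)))  # recovers roughly 0.3
```

Because alpha bar is now computed from a float rather than looked up in a list, any value of t in the range is valid, not just multiples of 1/1000.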
So now when we get the alpha bar for our time step, we don't look it up, we just call the function. And now the time step is a random float between 0 and 1, actually between 0 and 0.999, which actually I'm sure there's a function I could have chosen to do a float in this range, but I just clipped it because I was lazy.
Couldn't be bothered hooking it up. Other than that, Noisify is exactly the same. So we're still returning the xt, the time step, which is now a float, and the noise. That's the thing we're going to try and predict, dependent variable, this tuple there as our inputs to the model.
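As a rough sketch of this continuous-time noisify: each item in the batch gets its own random float t, clipped to 0.999, and alpha bar is computed by calling the function. This version is an assumption-laden stand-in that uses flat Python lists instead of tensors so it runs without any libraries; the return layout ((xt, t), noise) follows the description above.

```python
import math
import random

def abar(t):
    """Cosine schedule for t in [0, 1]."""
    return math.cos(t * math.pi / 2) ** 2

def noisify(x0_batch, rng=random):
    """Noise each image with its own random continuous time step.

    x0_batch: list of images, each a flat list of floats (stand-in for a tensor).
    Returns a list of ((xt, t), eps): model inputs and the noise to predict.
    """
    out = []
    for x0 in x0_batch:
        t = min(rng.random(), 0.999)             # random float, clipped to 0.999
        a = abar(t)                              # computed, not looked up
        eps = [rng.gauss(0.0, 1.0) for _ in x0]  # the dependent variable
        xt = [math.sqrt(a) * x + math.sqrt(1 - a) * e for x, e in zip(x0, eps)]
        out.append(((xt, t), eps))
    return out

batch = noisify([[0.1, -0.2, 0.3], [0.0, 0.5, -0.5]], random.Random(0))
for (xt, t), eps in batch:
    print(round(t, 3), [round(v, 3) for v in xt])
```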
All right, so here is what that looks like. So now when we look at our input to our U-Net training process, you can see we've got a T of 0.05, so 5% of the way through the forward diffusion process, it looks like this, and 65% through it looks like this.
So now the time step and basically the process is more of a kind of a continuous time step and a continuous process. Rather than the discrete time steps we were having before, here we get just any random value that could be between 0 and 1. I think, yeah, that's also something.
Yeah, we should kind of get this more convenient, you know, to have... Yeah, it is convenient. To have a function to call. Yeah, I find this life a little bit easier. So the model's the same, the callbacks are the same, the fitting process is the same. And so something which is kind of fun is that we could now, we can, when we do now, create a little denoise function.
So we can take, you know, this batch of data that we generated, the noisified data, so here it is again, and we can denoise it. So we know the T for each element, obviously. So remember T is different for each element now. And we can therefore calculate the alpha bar for each element.
And then we can just undo the noisification to get the denoised version. And so if we do that, there's what we get. And so this is great, right? It shows you what actually happens when we run a single step of the model on variously, partially noised images. And this is something you don't see very often because I guess not many people are working in these kind of interactive notebook environments where it's really easy to do this kind of thing.
But I think this is really helpful to get a sense of like, okay, if you're 25% of the way through the forward diffusion process, this is what it looks like when you undo that. If you're 95% of the way through it, this is what happens when you undo that.
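The "undo the noisification" step can be sketched as solving the noising equation for the clean image. This is only an illustration under the same cosine-schedule assumption; `denoise` is an illustrative name, and in the lesson the noise would come from the model's prediction, whereas here the true noise is passed in to show that the algebra inverts exactly.

```python
import math

def abar(t):
    """Cosine schedule for t in [0, 1]."""
    return math.cos(t * math.pi / 2) ** 2

def denoise(xt, t, noise_pred):
    """Estimate x0 from xt = sqrt(abar)*x0 + sqrt(1-abar)*noise, given a noise estimate."""
    a = abar(t)
    return [(x - math.sqrt(1 - a) * n) / math.sqrt(a) for x, n in zip(xt, noise_pred)]

# Round trip with the true noise: 65% of the way through the forward process.
x0, eps, t = [0.2, -0.4], [1.0, -0.5], 0.65
a = abar(t)
xt = [math.sqrt(a) * x + math.sqrt(1 - a) * e for x, e in zip(x0, eps)]
print(denoise(xt, t, eps))  # close to the original x0
```

As t approaches 1, `abar(t)` approaches 0 and the division amplifies any error in the noise estimate, which is one way to see why the heavily noised examples come back as a mess.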
So you can see here, it's basically like, oh, I don't really know what the hell's going on, so it leaves a noisy mess. Yeah, I guess my feeling from looking at this is, I'm impressed, you know, like this 45% noise thing, it looks all noise to me. It's found the long-sleeved top.
And yeah, it's actually pretty close to the real one. I looked it up, or you might see it later, it's a little bit more of a pattern here, but it even gives a sense of the pattern. So it shows you how impressive this is. So this is 35%. You can kind of see there's a shoe there, but it's really picked up the shoe nicely.
So these are very impressive models in one step, in my opinion. So, okay, so sampling is basically the same, except now rather than starting with using the range function to create our timesteps, we use linspace to create our timesteps. So our timesteps start at, you know, if we did 1000, it would be 0.999, and they end at 0, and then they're just linearly spaced with this number of steps.
So other than that, you know, a bar we now calculate, and the next a bar is going to be whatever the current step is, minus 1 over steps. So if you're doing 100 steps, then you'd be minus 0.01. So this is just stepping through linearly. And yeah, that's actually it for changes.
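A small sketch of the time steps used for sampling, assuming they run linearly from 1 - 1/steps down to 0 and that the "next" alpha bar is evaluated at the current t minus 1/steps (floored at 0). The helper name `sampling_schedule` is hypothetical.

```python
import math

def abar(t):
    """Cosine schedule for t in [0, 1]."""
    return math.cos(t * math.pi / 2) ** 2

def sampling_schedule(steps):
    """Yield (t, abar_t, abar_next) from t = 1 - 1/steps down to t = 0."""
    for i in range(steps):
        t = (steps - 1 - i) / steps       # same values as a linspace(1 - 1/steps, 0, steps)
        t_next = max(t - 1 / steps, 0.0)  # one linear step closer to the clean image
        yield t, abar(t), abar(t_next)

sched = list(sampling_schedule(100))
print(sched[0][0], sched[1][0], sched[-1][0])  # 0.99, 0.98, ..., 0.0
```

With 100 steps each step moves t by 0.01; with 1000 steps the first t would be 0.999.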
So if we just do DDIM for 100 steps, you know, that works really well. We get a FID of 3, which is actually quite a bit better than we had on 100 steps for our previous DDIM. So this definitely seems like a good sampling approach. And I know Jono's going to talk a bit more shortly about, you know, some of the things that can make better sampling approaches, but yeah, definitely we can see it making a difference here.
Did you guys have anything you wanted to say about this before we move on? No, but it is a nice transition towards some of the other things we'll be looking at to start thinking about how do we frame this. And it's also good, like the idea, so the original DDPM paper has this 1,000 time steps, and a lot of people followed that.
But the idea that you don't have to be bound to that, and maybe it is worth breaking that convention. I know Tanishk made that meme about, you know, this 15 competing different standards of notation. But yeah, sometimes it's helpful to reframe it, okay, time goes from 0 to 1. That can simplify some things.
It complicates others, but yeah, it's nice to think how you can reframe stuff sometimes. In fact, where we will head today by the time we get to notebook 23, we will see, you know, even simpler notation. And yeah, simpler notation generally comes. I think what happens is over time people understand better what's the essence of the problem and the approach, and then that gets reflected in the notation.
So okay, so the next one I wanted to share is something which is an idea we've been working on for a while, and it's some new research. So partly, I guess this is an interesting like insight into how we do research. So this is 22 noise pred. And the basic idea of this was, well, actually, I've got to take you through it to see what the basic idea is.
So what I'm going to do is I'm going to create, okay, so Fashion MNIST as before, but I'm going to create a different kind of model. I'm not going to create a model that predicts the noise given the noised image and t. Instead, I'm going to try to create a model which predicts t given the noised image.
So why did I want to do that? Well, partly, well, entirely because I was curious. I felt like when I looked at something like this, I thought it was pretty obvious roughly how much noise each image had. And so I thought, why are we passing noise when we call the model?
Why are we passing in the noised image and the amount of noise or the t? Given that I would have thought the model could figure out how much noise there is. So I wanted to check my contention, which is that the model could figure out how much noise there is.
So I thought, okay, well, let's create a model that would try and figure out how much noise there is. So I created a different noisify now, and this noisify grabs an alpha bar t randomly. And it's just a random number between 0 and 1. You know, one per item in the batch.
And so then after just randomly grabbing an alpha bar t, we then noisify in the usual way. But now our independent variable is the noised image and the dependent variable is alpha bar t. And so we've got to try to create a model that can predict alpha bar t given a noised image.
Okay, so everything else is the same as usual. And so we can see an example. You've got alpha bar t.squeeze.logit. Oh, yeah, that's true. So the alpha bar t goes between 0 and 1. So we've got a choice. Like, I mean, we don't have to do anything. But normally, if you've got something between 0 and 1, you might consider putting a sigmoid at the end of your model.
But I felt like the difference between 0.999 and 0.99 is very significant. So if we do logit, then we don't need the sigmoid at the end anymore. It will naturally cover the full range of kind of-- it ought to be centered at 0. It would have covered all the normal kind of range of numbers.
And it also will treat equal ratios as equally important at both ends of the spectrum. So that was my hypothesis was that using logit would be better. I did test it and it was actually very dramatically better. So without this logit here, my model didn't work well at all.
And so this is like an example of where thinking about these details is really important. Because if I hadn't have done this, then I would have come away from this bit of research thinking like, oh, I was wrong. We can't predict noise amount. Yeah. So thanks for pointing that out, China.
Yeah. So that's why in this example of a mini batch, you can see that the numbers can be negative or positive. So 0 would represent alpha bar of 0.5. So here, 3.05 is not very noisy at all, whereas negative 1 is pretty noisy. So the idea is that, yeah, given this image, you would have to try to predict 3.05.
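To make the logit target concrete, here is a minimal sketch using the standard definitions of logit and sigmoid (plain floats, not the notebook's tensor code). It shows why 0 corresponds to an alpha bar of 0.5, and how the logit stretches out differences near the ends of the range.

```python
import math

def logit(p):
    """Log-odds of p, mapping (0, 1) onto the whole real line."""
    return math.log(p / (1 - p))

def sigmoid(x):
    """Inverse of logit, mapping a real number back into (0, 1)."""
    return 1 / (1 + math.exp(-x))

print(logit(0.5))                   # 0.0: alpha bar of 0.5 sits at the centre
print(sigmoid(3.05))                # about 0.955: mostly signal, not very noisy
print(sigmoid(-1.0))                # about 0.269: pretty noisy
print(logit(0.999) - logit(0.99))   # a large gap in logit space
print(logit(0.509) - logit(0.5))    # the same 0.009 step near the middle is tiny
```

The last two lines illustrate the point about ratios: going from 0.99 to 0.999 changes the odds by roughly a factor of ten, and the logit treats that as a big move even though the raw difference is small.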
So one thing I was kind of curious about is like, it's always useful to know is like, what's the baseline? Like, what counts as good? Because often people will say to me like, oh, I created a model and the MSE was 2.6. And I'll be like, well, is that good?
Well, it's the best I can do, but is it good? Like, or is it better than random or is it better than predicting the average? So in this case, I was just like, okay, well, what if we just predicted? Actually, this is slightly out of date. I should have said 0 here rather than 0.5, but never mind close enough.
So this is before I did the logit thing. So I basically was looking at like, what's the loss if you just always predicted a constant, which as I said, I should have put 0 here, haven't updated it. And so it's like, oh, that would give you a loss of 3.5.
Or another way to do it is you could just put MSE here and then look at the MSE loss between 0.5 and your various, just a single mini batch, mini batch of alpha bar logits. Yeah, so we wanted to get some, if we're getting something that's about 3, then we basically haven't done any better than random.
And so in this case, this model, it doesn't actually have anything to learn. It always returns the same thing. So we can just call fit with train=False just to find the loss. So this is just a couple of ways of quickly finding a loss for a baseline naive model.
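One way to sanity-check those baseline numbers without training anything is a quick simulation, sketched here under the assumption that alpha bar is drawn uniformly from (0, 1) and the target is its logit. The logit of a uniform variable follows a standard logistic distribution, whose variance is π²/3 (about 3.29), so always predicting 0 should give an MSE near 3.3 and always predicting 0.5 should give about 0.25 more. The function name and sample size are illustrative.

```python
import math
import random

def logit(p):
    return math.log(p / (1 - p))

def constant_baseline_mse(const, n=100_000, seed=0):
    """MSE of always predicting `const` for logit(alpha_bar), alpha_bar ~ U(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        a = min(max(rng.random(), 1e-6), 1 - 1e-6)  # keep logit finite
        total += (logit(a) - const) ** 2
    return total / n

print(constant_baseline_mse(0.0))  # near pi**2 / 3
print(constant_baseline_mse(0.5))  # slightly worse, near pi**2 / 3 + 0.25
```

Any trained model should land well below these values before its loss means anything.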
One thing that thankfully PyTorch will warn you about is if you try to use MSE and your inputs and targets have different shapes, it will broadcast and give you probably not the result you would expect, and it will give you a warning. So one way to avoid that is just to use dot flatten on each.
So this kind of flattened MSE is useful to avoid the warning and also avoid getting weird errors or weird results. So we use that for our loss. So the model's the model that we always use. So it's kind of nice. We just use our same old model. Same changes.
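The idea behind a flattened MSE can be sketched without any tensor library: flatten both predictions and targets to one dimension before comparing, so a prediction of shape (n, 1) lines up element by element with a target of shape (n,). In PyTorch, comparing those two shapes directly would broadcast to an (n, n) grid of differences. The helper names below are illustrative, with nested lists standing in for tensors.

```python
def flatten(x):
    """Recursively flatten nested lists/tuples into a flat list of numbers."""
    if isinstance(x, (list, tuple)):
        return [v for item in x for v in flatten(item)]
    return [x]

def flat_mse(pred, targ):
    """Mean squared error after flattening both arguments to 1-D."""
    p, t = flatten(pred), flatten(targ)
    if len(p) != len(t):
        raise ValueError(f"size mismatch: {len(p)} predictions vs {len(t)} targets")
    return sum((a - b) ** 2 for a, b in zip(p, t)) / len(p)

# Predictions shaped (3, 1), targets shaped (3,): compared one to one.
print(flat_mse([[1.0], [2.0], [3.0]], [1.0, 2.0, 5.0]))
```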
Even though we're doing something totally different. Oh, well, okay, that's not quite true. One difference is that our output, we just have one output now, because this is now a regression model. It's just trying to predict a single number. And so our learner now uses MSE as a loss.
Everything else is the same as usual. So we could go ahead and train it. And you can see, okay, the loss is already much better than 3, so we're definitely learning something. And we end up with a 0.075 mean squared error. That's pretty good considering, you know, there's a pretty wide range of numbers we're trying to predict here.
So I've got to save that as noise prediction on sigma. So save that model. And so we can take a look at how it's doing by grabbing our one batch of noised images, putting it through our T model. Actually, it's really an alpha bar model, but never mind, call it a T model.
And then we can take a look to see what it's predicted for each one. And we can compare it to the actual for each one. And so you can see here it said, oh, I think this is about 0.91. And actually, it is 0.91. It said, oh, here it looks like about 0.36.
And yeah, it is actually 0.36. So, you know, you can see overall 0.72. It's actually 0.72. Well, those are exactly right. This one's 0.02 off. But yeah, my hypothesis was correct, which is that we, you know, we can predict the thing that we were putting in manually as input.
So there's a couple of reasons I was interested in checking this out. You know, the first was just like, well, yeah, wouldn't it be simpler if we weren't passing in the T each time? You know, why not pass in the T each time? But it also felt like it would open up a wider range of kind of how we can do sampling.
The idea of doing sampling by like precisely controlling the amount of noise that you try to remove each time and then assuming you can remove exactly that amount of noise each time feels limited to me. So I want to try to remove this constraint. So having, yeah, built this model, I thought, okay, well, you know, which is basically like, okay, I think we don't need to pass T in.
Let's try it. So what I then did is I replicated the 22 cosine notebook. I just copied it, pasted it in here. But I made a couple of changes. The first is that Noisify doesn't return T anymore. So there's no way to cheat. We don't know what T is.
And so that means that the U-Net now doesn't have T, so it's actually going to pass zero every time. So it has no ability to learn from T because it doesn't get T. So it doesn't really matter what we pass in. We could have changed the U-Net to like remove the conditioning on T.
But for research, this is just as good, you know, for finding out. And it's good to be lazy when doing research. There's no point doing something a fancy way where you can do it a quick and easy way before you even know if it's going to work. So yeah, that's the only change.
So we can then train the model and we can check the loss. So the loss here is 0.034. And previously it was 0.033. So interestingly, you know, maybe it's a tiny bit worse at that, you know, but it's very close. Okay, so we'll save that model. And then for sampling, I've got exactly the same DDIM step as usual.
And my sampling is exactly the same as usual, except now, when I call the model, I have no T to pass in. So we just pass in this. I mean, I still know T because I'm still using the usual sampling approach, but I'm not passing it to the model.
And yeah, we can sample. And what happens is actually pretty garbage. 22 is our FID. And as you can see here, you know, some of the images are still really noisy. So I totally failed. And so that's always a little discouraging when you think something's going to work and it doesn't.
But my reaction to that is like, if I think something's going to work and it doesn't is to think, well, I'm just going to have to do a better job of it. You know, it ought to work. So I tried something different, which is I thought like, okay, since we're not passing in the T, then we're basically saying like, how much noise should you be removing?
It doesn't know exactly. So it might remove a little bit more noise than we want or a little bit less noise than we want. And we know from the testing we did that sometimes it's out by like, in this case, 0.02. And I guess if you're out consistently, sometimes it's got to end up not removing all the noise.
So the change I made was to the DDIM step, which is here. And let me just copy this and get rid of the commented out sections just to make it a bit easier to read. Okay. So the DDIM step, this is the normal DDIM step. Okay. And so step one is the same.
So don't worry about that because it's the same as we've seen before. But what I did was I actually used my T model. So I passed the noised image into my T model, which is actually an alpha bar model, to get the predicted alpha bar. And this is remember the predicted alpha bar for each image, because we know from here that sometimes, so sometimes it did a pretty good job, right?
But sometimes it didn't. So I felt like, okay, we need a predicted alpha bar for each image. What I then discovered is sometimes that could be really too low. So what I wanted to make sure is it wasn't too crazy. So I then found the median for a mini batch of all the predicted alpha bars, and I clamped it to not be too far away from the median.
And so then what I did when I did my X naught hat is rather than using alpha bar T, I used the estimated alpha bar T for each image, clamped to be not too far away from the median. And so this way it was updating it based on the amount of noise that actually seems to be left behind, rather than the assumed amount of noise that should be left behind you know, if we assume it's removed the correct amount.
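A sketch of the clamping idea: take the per-image predicted alpha bars, find the median across the mini batch, and clamp each prediction to stay within some band around it before using it to estimate the clean image. The transcript does not say how wide the band is, so the factor of 2 here is purely an assumption, as are the function names; flat lists stand in for tensors.

```python
import math
import statistics

def clamp_to_median(abar_preds, factor=2.0):
    """Clamp each predicted alpha bar to [median/factor, median*factor] (capped at 1).

    `factor` is an assumed value chosen for illustration only.
    """
    med = statistics.median(abar_preds)
    lo, hi = med / factor, min(med * factor, 1.0)
    return [min(max(a, lo), hi) for a in abar_preds]

def x0_hat(xt, noise_pred, a):
    """Estimate the clean image using the (clamped) estimated alpha bar `a`."""
    return [(x - math.sqrt(1 - a) * n) / math.sqrt(a) for x, n in zip(xt, noise_pred)]

preds = [0.5, 0.52, 0.48, 0.01, 0.99]  # one wildly low prediction in the batch
print(clamp_to_median(preds))          # the 0.01 outlier is pulled up to 0.25
```

The point of the clamp is that a single near-zero alpha bar would otherwise put a tiny number in the denominator of `x0_hat` and blow that image's values up.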
And then everything else is the same. So when I did that, say, whoa, made all the difference. And here it is. They are beautiful pieces of clothing. So 3.88 versus 3.2. That's possibly close enough, like I'd have to run it a few times, you know, my guess is maybe it's a tiny bit worse, but it's pretty close.
But like this definitely gives me some encouragement that, you know, even though this is like something I just did in a couple of days, whereas the with-T approaches have been developed since 2015, and we're now in 2023. You know, I would expect it's quite likely that these kind of no-T approaches could eventually surpass the T-based approaches.
And like one thing that definitely makes me think there's room to improve is if I plot the FID or the KID for each sample during the reverse diffusion process, it actually gets worse for a while. I'm like, okay, well, that's, that's a bad sign. I have no idea why that's happening.
But it's a sign that, you know, if we could improve each step that one would assume we could get better than 3.8. So yeah, Tanishk, Jono, do you have any thoughts about that, or questions or comments? Maybe just to highlight the research process a little bit: it wasn't like this linear thing of like, Oh, here's this issue.
Not performing as well as we thought. Oh, here's the fix. We just scrapped this. You know, this was like multiple days of like, discussing and like Jeremy saying, like, you know, I'm tearing my hair out. Do you guys have any ideas? And Oh, what about this? And Oh, and I just in the team paper, they do this clamping, maybe that'll help.
You know, so there's a lot of back and forth. And also a lot of like, you saw that code that was commented out there, prints, xt.min, xt.max, alphabar, pred, you know, just like seeing, oh, okay, you know, my average prediction is about what I would expect. But sometimes the min or the max goes, you know, 2, 3, 8, 16, 150, 212 million, infinity, you know, maybe like one or two little values that would just skyrocket out.
Yeah, and so that kind of like, debugging and exploring and printing things out. And actually, our initial discussions about this idea, I kind of said to you guys, before lesson one of part two, I said, like, it feels to me like we shouldn't need the t thing. And so it's actually been like, bubbling away in the background for months.
Yeah, yeah. And I guess I mean, we should also mention we have tried this, like a friend of ours trained a no T version of stable diffusion for us. And we did the same sort of thing. I trained a pretty bad T predictor and it sort of generates samples.
So we're not like focusing on that large scale stuff yet. But it is fun to like, every now and again, got this idea from Fashion MNIST, we are trying these out on some bigger models and seeing, okay, this does seem like maybe it'll work. And down the line, the future plan is to actually, you know, spend the time, train a proper model, and see, yeah, see how well that does.
It's interesting, you say a friend of ours, we can be more specific. It's Robin, one of the two lead authors of the Stable Diffusion paper, who actually has been fine tuning a real-scale Stable Diffusion model, which is without T, and it's looking super encouraging. So yeah, that'll be fun to play with with this new, you know, we'll have to train a T predictor for that.
See how it looks. Yeah. All right. So I guess the other area we've been talking about kind of doing some research on is this weird thing that came up over the last two weeks with our bug in the DDPM implementation, where we accidentally weren't doing it from minus one to one for the input range. It turned out that actually being from minus one to one wasn't a very good idea anyway.
And so we ended up centering it as being from minus point five to point five, and Jono and Tanishk have managed to actually find a paper. Well, I say find a paper, a paper has come out in the last 24 hours, which has coincidentally cast some light on this and has also cited a paper that we weren't aware of, which was not released in the last 24 hours.
So Jono, are you going to tell us a bit about that? Yeah, sure. I can do that. So it's funny, this was such perfect timing because I actually got up early this morning planning to run with the different input scalings and the cosine schedule that Jeremy was showing and some of the other schedulers we look at.
I thought it might be nice for the lesson to have a little plot of like, what is the FID with these different solvers and input scalings, but it was going to be a lot of work. I wasn't looking forward to doing the groundwork. And then Tanishk sent me this paper, which AK had just tweeted out because he reviews anything that comes up on arXiv every day: On the Importance of Noise Scheduling for Diffusion Models.
And this is by a researcher at the Google Brain team, who's also done a really cool recent paper on something called a recurrent interface network outside of the scope of this lesson, but also worth checking out. Yeah, so this paper they're hoping to study this noise scheduling and the strategies that you take for that.
And they want to show that, number one, noise scheduling is crucial for performance and the optimal one depends on the task. When increasing the image size, the noise scheduling that you want changes and scaling the input data by some factor is a good strategy for working with this. And that's the big thing we've been talking about, right?
Yeah, that's what we've been doing where we said, oh, do we scale from minus 0.5 to 0.5 or minus 1 to 1 or do we normalize? And so they demonstrate the effectiveness by training a really good high resolution model on ImageNet, so a class-conditioned model. That's correct. Yeah, amazing samples.
They'll show one later. So I really like this paper. It's very short and concise, and it just gets all the information across. And so they introduce it here. We have this noising process or noisify function where we have square root of something times x plus square root of 1 minus that something times the noise.
And here they use gamma, gamma of t, which is often used for the continuous time case. So instead of the alpha bar and the beta bar schedule for 1,000 time steps, there'll be some function gamma of t that tells you what your alpha bar should be. Okay, so our function is actually called abar, but it's the same thing.
Yeah, same thing. It takes in a time step from 0 to 1, and then that's used to noise the image. Interestingly, what they're showing here actually is something that we had discovered, and I've been complaining about that my DDIMs with an eta of less than one weren't working, which is to say when I added extra noise to the image, it wasn't working.
And what they're showing here is like, oh yeah, duh, if you use a smaller image, then adding extra noise is probably not a good idea. Yeah. And so they use a lot of reference in this paper to like information being destroyed and signal to noise ratios. And that's really helpful for thinking about because it's not something that's obvious, but at 64 by 64 pixels, adjacent pixels might have much less in common versus the same amount of noise added at a much higher resolution, the noise kind of averages out and you can still see a lot of the image.
So yeah, that's one thing they highlight is that the same noise level for different image sizes, it might be a harder or easier task. And so they investigate some strategies for this. They look at the different noise schedule functions. So we've seen the original version from the DDPM paper.
We've seen the cosine schedule and we've seen, I think we might look at, or the next thing that Jeremy's going to show us, a sigmoid based schedule. So they show the continuous time versions of that and they plot how you can change various parameters to get these different gamma functions or in our case, the alpha bar, where we're starting at all image, no noise at t equals zero, moving to all noise, no image at t equals one.
But the path that you take is going to be different for these different classes of functions and parameters. And the signal to noise ratio, that's what this or the log signal to noise ratio is going to change over that time as well. And so that's one of the knobs we can tweak.
We're saying our diffusion model isn't training that well, we think it might be related to the noise schedule and so on. One of the things you could do is try different noise schedules, either changing the parameters in one class of noise schedule or switching from a linear to a cosine to a sigmoid.
And then the second strategy is kind of what we were doing in those experiments, which is just to add some scaling factor to x zero. Well, we were accidentally using b of 0.5. Exactly. And so that's the second dial that you can tweak is to say keeping your noise schedule fixed, maybe you just scale x zero, which is going to change the ratio of signal to noise.
And that's what I think there's four in C there is what we were accidentally doing. Yes. Yeah, exactly. And so see if we can get to Oh, yeah. So that again, changes the signal to noise for different scalings you get. And so that's fine. So they have a compound, they have a strategy that combines some of those things.
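To see how an input scale factor changes the signal-to-noise ratio, here is a small sketch. It assumes the noising process takes the form x_t = sqrt(gamma)·b·x0 + sqrt(1 - gamma)·noise with x0 of roughly unit variance, so the signal power is b²·gamma and the noise power is 1 - gamma. The function name is illustrative.

```python
import math

def log_snr(gamma, b=1.0):
    """Log signal-to-noise ratio for scaled input b*x0 at schedule value gamma.

    Assumes x0 has unit variance: signal power b^2 * gamma, noise power 1 - gamma.
    """
    return math.log(b * b * gamma / (1 - gamma))

for gamma in (0.9, 0.5, 0.1):
    print(gamma, round(log_snr(gamma, b=1.0), 3), round(log_snr(gamma, b=0.5), 3))
```

Under these assumptions, scaling the input by b shifts the whole log signal-to-noise curve by 2·log(b) at every time step, so a b of 0.5 makes every step look noisier without touching the schedule itself.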
And this is the important part, they do their experiments. And so they have a nice table of investigating different schedules, cosine schedules and sigmoid schedules. And in bold are the best results. And you can see for 64 by 64 images versus 128 versus 256, the best schedule is not necessarily always the same.
And so that's like important finding number one, depending on what your data looks like using a different noise schedule might be optimal. There's no one true best schedule. There's no one value of, you know, beta min and beta max, that's just magically the best. Likewise, for this input scaling at different sizes, with whatever schedules they tested, different values were kind of optimal.
And so, yeah, it's just a really great illustration, I guess that this is another design choice that's implicit or explicitly part of your diffusion model training and sampling is how are you dealing with this, this noise schedule, what schedule are you following, what scaling are you doing with your inputs.
And by using this thinking and doing these experiments, and they come up with a kind of rule of thumb for how to scale the image based on image size, they show that they can, as they increase the resolution, they can still maintain really good performance. Where previously it was quite hard to train a really large resolution pixel space model, and they're able to do that, they get some advantage from their fancy recurrent interface network, but still, it's kind of cool that they can say, look, we get state of the art, high quality, 512 by 512 or 1024 by 1024 samples on class-conditioned ImageNet.
And using this approach to really like consider how well do you train, how many steps do we need to take, one of the other things in this table is that they compare to previous approaches. Oh, we used, you know, a third of the training steps and for the same other settings, and we get better performance.
Just because we've chosen that input scaling better. And yeah, so that's the paper, really, really nice, great work to the team. And that was very useful. I love that you got up in the morning and thought, oh, it's going to be a hassle training all these different models I need to train for different input scalings and different sampling approaches.
I just look at Twitter first, and then you looked at Twitter and there was a paper saying like, hey, we just did a bunch of experiments for different noise schedules and input scaling. Yeah, does your life always work that way each other? It seems quite blessed. Yeah, it's very lucky like that.
Yeah. You wait long enough, someone else will do it. That's why the time when AK starts posting on Twitter is like my favourite hour of the day, for all the papers to be posted. Oh, well, thank you for that. So let me switch to notebook 23.
Because this notebook is actually an implementation of some ideas from this paper that everybody tends to just call Karras, though there's other people on it. But I will do it anyway: the Karras paper. And the reason we're going to look at this is because in this paper, the authors actually take a much more explicit look at the question of input scaling.
Their approach was not apparently to accidentally put a bug in their code, and then take it out, find it worked worse, and then just put it back in again. Their approach was actually to think, how should things be? So that's an interesting approach to doing things, and I guess it works for them.
So that's fine. I think our approach is more exciting. Yeah, exactly. Our approach is much more fun because you never quite know what's going to happen. And so, yeah, in their approach, they actually tried to say, like, OK, given all the things that are coming into our model, how can we have them all nicely balanced?
So we will skip back and forth between the notebook and the paper. So the start of this is all the same, except now we are actually going to do it minus one to one because we're not going to rely on accidental bugs anymore, but instead we're going to rely on the Karras paper's carefully designed scaling.
I say that, except that I put a bug in this notebook as well. One of the things that's in the Karras paper is what is the standard deviation of the actual data, which I calculated for a batch. However, this used to say minus 0.5. I used to do the minus 0.5 to 0.5 thing.
And so this is actually the standard deviation of the data from back when it was still minus 0.5 to 0.5. So this is actually half the real standard deviation. For reasons I don't yet understand, this is giving me better scaled results. So this actually should be 0.66. So there's still a bug here and the bug still seems to work better.
So we still got some mysteries involved. So we're going to leave this. So it's actually, it's actually not 0.33, it's actually 0.66. Okay, so the basic idea of this, actually I'll come back. Well, let me have a little think. Yeah, okay. Now we'll start here. The basic idea of this paper is to say, you know what, sometimes maybe predicting the noise is a bad idea.
And so like you can either try and predict the noise or you can try and predict the clean image and each of those can be a better idea in different situations. If you're given something which is nearly pure noise, you know, the model's given something which is nearly pure noise and is then asked to predict the noise.
That's basically a waste of time, because the whole thing's noise. If you do the opposite, which is you try to get it to predict the clean image, well, then if you give it an image that's nearly clean and ask it to predict the clean image, that's nearly a waste of time as well.
So you want something which is like, regardless of how noisy the image is, you want it to be kind of like an equally difficult problem to solve. And so what Karras et al. do is they basically use this new thing called C skip, which is a number, which is basically saying like, you know what we should do for the training target is not just predict the noise all the time, not just predict the clean image all the time, but predict kind of a lerped version of one or the other depending on how noisy it is.
So here y is the clean image and n is the noise. So y plus n is the noised image. And so if C skip was 0, then we would be predicting the clean image. And if C skip was 1, we would be predicting y minus (y plus n), so we would be predicting the noise (with its sign flipped).
And so you can decide by picking a different C skip whether you're predicting the clean image or the noise. And so, as you can see from the way they've written it, they make this a function. They make it a function of sigma. Now, this is where we've got to a point now where we've got a much simpler notation.
There's no more alpha bars, no more alphas, no more betas, no more beta bars. There's just a single thing called sigma. Unfortunately, sigma is the same thing as alpha bar used to be, right? So we've simplified it, but we've also made things more confusing by using an existing symbol for something totally different.
So this is alpha bar. Okay. So there's going to be a function that says, depending on how much noise there is, we'll either predict the noise or we'll predict the clean image or we'll predict something between the two. So in the paper, they showed this chart where they basically said like, okay, let's look at the loss to see how good are we with a trained model at predicting when sigma is really low.
So when there's very small alpha bar, or when sigma is in the middle or when sigma is really high. And they basically said, you know what, when it's nearly all noise or nearly no noise, you know, we're basically not able to do anything at all. You know, we're basically good at doing things when there's a medium amount of noise.
So when deciding, okay, what, what sigmas are we going to send to this thing? The first thing we need to do is to, is to figure out some sigmas. And they said, okay, well, let's pick a distribution of sigmas that matches this red curve here, as you can see.
And so this is a normally distributed curve where this is on a log scale. So this is actually a log normal curve. So to get the sigmas that they're going to use, they picked a normally distributed random number and then they exp'd it. And this is called a log normal distribution.
And so they used a mean of minus 1.2 and a standard deviation of 1.2. So zero is one standard deviation above the mean, which means that about one sixth of the time, they're going to be getting a number that's bigger than zero here. And e to the zero is one. So about one sixth of the time, they're going to be picking sigmas that are bigger than one.
And so here's a histogram I drew of the sigmas that we're going to be using. And so it's nearly always less than five, but sometimes it's way out here. And so it's quite hard to read these histograms. So this really nice library called Seaborn, which is built on top of Matplotlib, has some more sophisticated and often nicer looking plots.
And one of them they have is called a KDE plot, which is a kernel density plot. It's a histogram, but it's smooth. And so I clipped it at 10 so that you could see it better. So you can basically see that the vast majority of the time it's going to be somewhere about 0.4 or 0.5, but sometimes it's going to be really big.
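As a rough sketch of the sampling just described, here is one way to draw log-normal sigmas in plain Python. The mean and standard deviation of minus 1.2 and 1.2 come from the discussion above; the function name `sample_sigmas`, the seed, and the sample count are only illustrative assumptions, and the notebook itself works on tensors rather than lists.

```python
import math
import random

def sample_sigmas(n, mean=-1.2, std=1.2, seed=0):
    """Draw n noise levels as sigma = exp(z), with z ~ Normal(mean, std)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [math.exp(rng.gauss(mean, std)) for _ in range(n)]

sigmas = sample_sigmas(10_000)
ordered = sorted(sigmas)
median_sigma = ordered[len(ordered) // 2]
# Most sigmas are well below 1, but the right tail is long.
print(f"min={ordered[0]:.3f} median={median_sigma:.3f} max={ordered[-1]:.1f}")
```

Because the exponential of a normal variable is always positive, every sigma is a valid noise level, and the long right tail is what makes the plain histogram hard to read.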
So our Noisify is going to pick a sigma using that log-normal distribution. And then we're going to get the noise as usual, but now we're going to calculate C skip, right? Because we're going to do that thing we just saw. We're going to find something between the clean image and the noised input.
So what do we use for C skip? We calculate it here. And so what we do is we say what's the total amount of variance at some level of sigma? Well it's going to be sigma squared, that's the definition of the variance of the noise, but we also have the variance of the data itself, right?
So if we add those two together we'll get the total variance. And so what the Karras paper said to do is to do the variance of the data divided by the total variance and use that for C skip. So that means that if your total variance is really big, so in other words it's got a lot of noise, then C skip's going to be really small.
So if you've got a lot of noise then this bit here will be really small. So that means if there's a lot of noise try to predict the original image, right? That makes sense because predicting the noise would be too easy. If there's hardly any noise then this will be, total variance will be really small, right?
So C skip will be really big and so if there's hardly any noise then try to predict the noise. And so that's basically what this C skip does. So it's a kind of slightly weird idea is that our target, the thing we're trying to do actually is not the input image, sorry the original image, it's not the noise but it's somewhere between the two.
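To make the C skip calculation concrete, here is a minimal sketch of "variance of the data divided by total variance". The name `c_skip` as a function, and the choice of 0.66 for the data standard deviation (the value described earlier as the correct one), are assumptions for illustration; the three sigmas are the ones that appear in the example pictures.

```python
def c_skip(sigma, sigma_data):
    """Variance of the data divided by total variance (data plus noise)."""
    total_var = sigma**2 + sigma_data**2
    return sigma_data**2 / total_var

# Low, medium, and high noise levels.
for sigma in (0.06, 0.64, 4.53):
    print(f"sigma={sigma}: c_skip={c_skip(sigma, sigma_data=0.66):.3f}")
```

With hardly any noise C skip is close to 1 (so the target is mostly noise), and with a lot of noise it is close to 0 (so the target is mostly the clean image).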
And I've found the easiest way to understand that is to draw a picture of it. So here is some examples of noised input, right? With various sigma's, remember sigma is alpha bar, right? So here's an example with very little noise, 0.06. And so in this case the target is predict the noise, right?
So that's the hard thing to do is predict the noise. Whereas here's an example, 4.53 which is nearly all noise. So for nearly all noise the target is predict the image, right? And then for something which is a little bit between the two like here, 0.64, the target is predict some of the noise and some of the image.
So that's the idea of Karras. And so what this does is it's making the problem to be solved by the U-Net equally difficult regardless of what sigma is. It doesn't solve our input scaling problem, it solves our kind of difficulty scaling problem. To solve the input scaling problem they do it.
I just want to make one quick note. This sort of idea of interpolating between the noise and the image is similar to what's called the v-objective as well. So yeah, it's quite similar to what Karras et al. have, and that's also now been used in a lot of different models.
Like for example Stable Diffusion 2.0 was trained with this sort of v-objective. So people are using this sort of methodology and getting good results. And yeah, so it's an actual practical thing that people are doing. So yeah, I just want to make a note of that. Yeah, as is the case of basically all papers created by Nvidia researchers, of which this is one.
It flies under the radar and everybody ignores it. The v-objective paper, the senior author was Tim Salimans, which is Google, right? Yeah. And so anything from Google and OpenAI everybody listens to. So yeah, although Karras I think has done the more complete version of this, and in fact the v-objective was almost like mentioned in passing in the distillation paper.
But yeah, that's the one that everybody has ended up looking at. But I think this is the more... Yeah, I think what happened with the v-objective is not many people paid attention to it. I think folks like Kat and Robin and these sorts of folks were actually paying attention to that v-objective in that Google Brain paper.
But then also this paper did a much more principled analysis of this sort of thing. So yeah, I think it's very interesting how sometimes even these sort of side notes in papers that maybe people don't pay much attention to, they can actually be quite important. Yeah. Yeah. So, okay.
So the noised input as usual is the input image plus the noise times the sigma. But then, and then as we discussed, we decide how to kind of decide what our target is. But then we actually take that noised input and we scale it up or down by this number.
And the target, we also scale up or down by this number. And those are both calculated in this thing as well. So here's C out and here's C in. Now I just wanted to show one example of where these numbers come from, because for a while they all seemed pretty mysterious to me and I felt like I'd never be smart enough to understand them, particularly because they were explained in the mathematical appendix of this paper. Those are always the bits I don't understand, until I actually try to, and then it tends to turn out they're not so bad after all, which was certainly the case here. Which appendix was it?
I think it was B something, I think. So B.6, I think? Is that the one? So in appendix B.6, which does look pretty terrifying, but if you actually look at, for example, what we were just looking at, C in, it's like, how do they calculate it? So C in is this.
Now this is the variance of the noise, this is the variance of the data, add them together to get the total variance, square root, that's the total standard deviation. So it's just the inverse of the total standard deviation, which is what we have here. Where does that come from? Well, they just said, you know what?
The inputs for a model should have unit variance. Now we know that. We've done that to death in this course. So they said, right, well, the input to the model is the clean data plus the noise, times some number we're going to calculate, and we want the variance of that to be one.
Okay. So the variance of the clean images plus the noise is equal to the variance of the clean images plus the variance of the noise. Okay. So if we want variance to be one, then divide both sides by this and take the square root.
And that tells us that our multiplier has to be one over this. That's it. So it's like literally, you know, classical math. The only bit you have to know is that the variance of two independent things added together is the variance of one plus the variance of the other, which is not rocket science either.
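Written out, the argument for C in is only a couple of lines. This is a sketch in the notation used above, with y the clean image, n the noise with standard deviation sigma, and sigma_data the standard deviation of the data; it assumes y and n are independent.

```latex
\begin{aligned}
\operatorname{Var}\!\left[c_{\text{in}}\,(y + n)\right]
  &= c_{\text{in}}^{2}\left(\operatorname{Var}[y] + \operatorname{Var}[n]\right)
   = c_{\text{in}}^{2}\left(\sigma_{\text{data}}^{2} + \sigma^{2}\right) = 1 \\
c_{\text{in}}
  &= \frac{1}{\sqrt{\sigma^{2} + \sigma_{\text{data}}^{2}}}
\end{aligned}
```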
And in this context, like why we want to do this, when we looked at those sigma's that you're putting like the distribution, you've got some that are fairly low, but you've also got some where the standard deviation sigma is like 40, right? So the variance is super high. Yes.
And so we don't want to feed something with standard deviation 40 into our model. You would like it to be closer to unit variance. So we're thinking, okay, well, if you divide by roughly 40, that would scale it down. But then we've also got some extra variance from our data.
It's just like 40 plus a little bit from the data. We want to scale back down by that to get unit variance. Yeah. I mean, I love this paper because it's basically just doing what we spent weeks doing of like, I feel like everything that we've done that's improved every model has always been one thing, which is, can we get mean zero variance one inputs to our model and for all of our activations?
And then the only other thing is include enough compute by adding enough layers and enough activations. Those two things seem to be all that matters. Basically, well, I guess ResNets added an extra cool little thing to that, which is to make it even smoother by giving this kind of like identity path.
So yeah, basically trying to make things as smooth as possible and as equal everywhere as possible. So yeah, this is what they've done. So they did that for the inputs, and then they've also done it for the outputs and for the outputs, it's basically the same idea. They have basically the same kind of analysis to show that.
And so with this, so now, yeah, we've basically we've got our noised input, we've got the linear version somewhere between X0 and the noised input, we've got the scaling of the output and we've got the scaling of the input. So now for the inputs to our model, we're going to have the scaled noise, we're going to have the sigma and we're going to have the target, which is somewhere between the image and the noise.
And so, yeah, so I've never seen anybody draw a picture of this before. So it was really cool when being in a notebook, being able to see like, oh, that's what they're doing. So yeah, have a good look at this notebook to see exactly what's going on because I think it gives you a really good intuition around what problem it's trying to solve.
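Putting the pieces together, here is a sketch of what a noisify step along these lines could look like on a flat list of pixel values. The C skip and C in formulas follow the discussion above; the C out formula (sigma times sigma_data over the total standard deviation) is my reading of the Karras et al. paper, and the function names, the scalar-list representation, and the synthetic "data" are all assumptions made so the example runs without any libraries.

```python
import math
import random
import statistics

def scalings(sigma, sigma_data):
    """Return (c_skip, c_out, c_in) for one noise level."""
    total_var = sigma**2 + sigma_data**2
    c_skip = sigma_data**2 / total_var
    c_out = sigma * sigma_data / math.sqrt(total_var)
    c_in = 1 / math.sqrt(total_var)
    return c_skip, c_out, c_in

def noisify(x0, sigma, sigma_data, rng):
    """Noise a flat list of values; return (scaled model input, scaled target)."""
    c_skip, c_out, c_in = scalings(sigma, sigma_data)
    noised = [x + sigma * rng.gauss(0, 1) for x in x0]
    # Target sits somewhere between the clean image and the noise, scaled by 1/c_out.
    target = [(x - c_skip * xn) / c_out for x, xn in zip(x0, noised)]
    model_input = [xn * c_in for xn in noised]
    return model_input, target

rng = random.Random(0)
sigma_data = 0.66
# Synthetic stand-in for pixel data with standard deviation sigma_data.
x0 = [rng.gauss(0, sigma_data) for _ in range(20_000)]
model_input, target = noisify(x0, sigma=4.53, sigma_data=sigma_data, rng=rng)
input_std = statistics.pstdev(model_input)
target_std = statistics.pstdev(target)
print(f"input std={input_std:.3f} target std={target_std:.3f}")
```

Both standard deviations should come out close to 1 whatever sigma you pick, which is the whole point of the scaling. This only holds when `sigma_data` really matches the spread of the data, so it is a handy check to run on a real batch.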
So then I actually checked the noised input has a standard deviation of 1. The mean's not 0, and of course, why would it be? We didn't do anything about that. The only thing Karras cared about was having the variance 1. We could easily adjust the input and output to have a mean of 0 as well.
And that's something I think we or somebody should try because I think it does seem to help a bit as we saw with that generalised ReLU stuff we did, but it's less important than the variance. And so same with the target, it's got a standard deviation of 1. And yeah, this is where if I change this to the correct value, which is 0.66, then actually it's slightly further away from 1, both here and here, quite a lot further away.
And maybe that's because actually the data is, well, we know the data is not Gaussian distributed. Pixel data definitely isn't Gaussian distributed. So this bug turned out better. Okay. So the U-Net's the same, the initialisation's the same. This is all the same. Train it for a while. We can't compare the losses because our target's different.
But what we can do is we can create a denoise that just undoes the thing that, as per usual, we had in noisify, right? And so to get x0, it's going to multiply by c out and then add c skip times the noised input. Here it is, multiply by c out, add noised input times c skip.
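Here is a small sketch showing that this denoise really is the inverse of the target calculation: if the model predicted the target perfectly, multiplying by c out and adding c skip times the noised input gives back the clean values. The `scalings` helper, the C out formula, and the hand-picked numbers are assumptions for illustration only.

```python
import math

def scalings(sigma, sigma_data):
    """Return (c_skip, c_out, c_in) for one noise level."""
    total_var = sigma**2 + sigma_data**2
    return (sigma_data**2 / total_var,
            sigma * sigma_data / math.sqrt(total_var),
            1 / math.sqrt(total_var))

def denoise(target_pred, noised, sigma, sigma_data):
    """Undo the target scaling: prediction * c_out + noised input * c_skip."""
    c_skip, c_out, _ = scalings(sigma, sigma_data)
    return [t * c_out + xn * c_skip for t, xn in zip(target_pred, noised)]

# Round trip with a "perfect" model whose prediction equals the true target.
x0 = [0.2, -0.5, 0.9]
noise = [0.3, -1.1, 0.4]
sigma, sigma_data = 0.64, 0.66
c_skip, c_out, _ = scalings(sigma, sigma_data)
noised = [x + sigma * n for x, n in zip(x0, noise)]
target = [(x - c_skip * xn) / c_out for x, xn in zip(x0, noised)]
recovered = denoise(target, noised, sigma, sigma_data)
print(recovered)  # matches x0 up to floating point rounding
```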
Okay, so we can denoise. So let's grab our sigmas from the actual batch we had. Let's calculate c skip c out and c in for the sigmas in our mini batch. Let's use the model to predict the target given the noised input and the sigmas, and then denoise it.
And so here's our noised input, which we've already seen, and here's our predictions. And these are absolutely remarkable, in my opinion. Yeah. Like this one here, I can barely see it. You know, it's really found, look at the shirt. There's a shirt here. It's actually really finding the little thing on the front.
And let me show you, here's what it should look like, right? And in cases where the sigma's pretty high, like here, you can see it's really like saying, like, I don't know, maybe it's shoes, but it could be something else. Is it shoes? Yeah, it wasn't shoes. But at least it's kind of got the, you know, the bulk of the pixels in the right spot.
Yeah, something like this one is 4.5, it has no idea what it is. It's like, oh, maybe it's shoes, maybe it's pants. You know, it turns out it is shoes. Yeah. So I think that's fascinating how well it can do. And then the other thing I did, which I thought was fun, was I just used a sigma of 80, which is actually what they do when they're doing sampling from pure noise.
That's what they consider the pure noise level. So I just created some pure noise and denoised it just for one step. And so here's what happens when you denoise it for one step. And you can see it's kind of overlaid all the possibilities. It's like, I can see a pair of shoes here, a pair of pants here, a top here.
And sometimes it's kind of like more confident that the noise is actually a pair of pants. And sometimes it's more confident that it's actually shoes. But you can really get a sense of how like from pure noise, it starts to make a call about like what this noise is actually covering up.
And this is also the bit which I feel is like, I'm the least convinced about when it comes to diffusion models. This first step of going from like pure noise to something and like trying to have a good mix of all the possible somethings, I'm, I don't know, it feels a bit hand-waving to me.
It clearly works quite well, but I'm not sure if it's like we're getting the full range of possibilities. And I feel like some of the papers we're starting to see are starting to say like, you know what, maybe this is not quite the right approach. And maybe later in the course, we'll look at some of the ones that look at what we call VQ models and tokenized stuff.
Anyway, I thought this is pretty interesting to see these pictures, which I don't think, yeah, I've never seen any pictures like this before. So I think this is a fun result from doing all this stuff in notebooks step by step. Okay, so sampling. So one of the nice things with this is the sampling becomes much, much, much simpler.
And so, and I should mention a lot of the code that I'm using, particularly in the sampling section is heavily inspired by, and some of it's actually copied and pasted from Kat's K-diffusion repo, which is, I think I mentioned before, some of the nicest generative modeling code or maybe the nicest generative modeling code I've ever seen.
It's really great. So before we talk about the actual sampling, the first thing we need to talk about is what sigma do we use at each reverse time step? And in the past, we've always, well, nearly always done something which I've always felt is sketchy as all hell, which is we've just linearly gone down the sigmas or the alpha bars or the t's.
So here, when we're sampling in the previous notebook, we used linspace. So I always felt like that was questionable. And I felt like at the start, you probably, like it was just noise anyway. So who cared? Who cares? So I, in DDPMv3, I experimented with something that I thought intuitively made more sense.
I don't know if you remember this one, but I actually said, oh, for the first 100 time steps, let's actually only run the model every 10 steps. And then for the next 100, let's run it every nine. The next 100, let's run it every eight. So basically at the start, be much less careful.
And so Karras actually ran a whole bunch of experiments and they said, yeah, you know what? At the start of sampling, you know, you can start with a high sigma, but then like step to a much lower sigma in the next step and then a much lower sigma in the next step.
And then the longer you go, the more you step by smaller and smaller steps, so that you spend a lot more time fine-tuning carefully at the end and not very much time at the start. Now, this has its own problems. And in fact, a paper just came out today, which we probably won't talk about today, but maybe another time, which talked about the problems, which is that in these very early steps, this is the bit where you're trying to create a composition that makes sense.
Now for Fashion MNIST, we don't have much composing to do. It's just a piece of clothing. But if you're trying to do an astronaut riding a horse, you know, you've got to think about how all those pieces fit together. And this is where that happens. And so I do worry that the Karras approach is not giving that maybe enough time.
But as I've said, that's really the same as this step. That whole piece feels a bit wrong to me. But aside from that, I think this makes a lot of sense, which is that, yeah, the sampling, you should jump, you know, by big steps early on and small steps later on and make sure that the fine details are just so.
So that's what this function does, is it creates this plot. Now it's this schedule of reverse diffusion sigma steps. It's a bit of a weird function in that it's the rho-th root of sigma, where rho is seven. So the seventh root of sigma is basically what it's scaling on.
But the answer to why it's that is because they tried it and it turned out to work pretty well. Do you guys remember where this was? This is the truncation error analysis, D.1. So this image here, so thanks for telling me where this is, shows FID as a function of rho.
So it's basically, what root are we taking? And they basically said, like, if you take the fifth root or up, it seems to work well, basically. So, yeah, so that's a perfectly good way to do things is just to try things and see what works. And you'll notice they tried things just like we love on small data sets.
Not as small as us because we're the kings of small data sets, but smallish: CIFAR-10, ImageNet 64. That's the way to do things. I saw, like, it might have even been the CEO of Hugging Face the other day, tweet something saying only people with huge amounts of GPUs can do research now.
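As a sketch of that schedule: interpolate linearly between the rho-th roots of the largest and smallest sigma, then raise back to the power rho, and append a final zero so the last step lands on a clean image. The starting sigma of 80 and rho of 7 come from the discussion; the name `sigmas_karras` and the smallest sigma of 0.01 are assumptions for illustration.

```python
def sigmas_karras(n, sigma_min=0.01, sigma_max=80.0, rho=7.0):
    """n sigmas from sigma_max down to sigma_min, linear in sigma**(1/rho), then 0."""
    min_inv_rho = sigma_min ** (1 / rho)
    max_inv_rho = sigma_max ** (1 / rho)
    ramp = [i / (n - 1) for i in range(n)]  # 0.0 ... 1.0, needs n >= 2
    sigmas = [(max_inv_rho + t * (min_inv_rho - max_inv_rho)) ** rho for t in ramp]
    return sigmas + [0.0]

schedule = sigmas_karras(10)
print([round(s, 3) for s in schedule])
```

Printing the schedule shows the behaviour described above: the first few steps drop sigma by large amounts, and the later steps get smaller and smaller.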
And I think it totally misunderstands how research is done, which is research is done on very small data sets. That's that's the actual research. And then when you're all done, you scale it up at the end. I think we're kind of pushing the envelope in terms of like, yeah, how how much can you do?
And yeah, we've, like, we covered this kind of main substantive path of diffusion models history, step by step, showing every improvement and seeing clear improvements across all the papers using nothing but Fashion MNIST running on a single GPU in like 15 minutes of training or something per model. So, yeah, definitely you don't need lots of models.
Anyway, OK, so this is the sigma we're going to jump to. So the denoising is going to involve calculating the C skip, C out and C in and calling our model with the C in scaled data and the sigma and then scaling it with C out and then doing the C skip.
OK, so that's just undoing the Noisify. So check this out. This is all that's required to do one step of denoising for the simplest kind of scheduler, which sorry, the simplest kind of sampler, which is called Euler. So we basically say, OK, what's the sigma at time step I?
What's the sigma 2 at time step I? And now when I'm talking about at time step, I'm really talking about like the step from this function. Right. So this is this is the sampling step. Yeah. OK, so then denoise using the function and then we say, OK, well, just send back whatever you were given plus move a little bit in the direction of the denoised image.
So the direction is X minus denoised. So that's the noise. That's the gradient as we discussed right back in the first lesson of this part. So we'll take the noise. If we divide it by sigma, we get a slope. It's how much noise is there per sigma. And then the amount that we're stepping is sigma 2 minus sigma 1.
So take that slope and multiply it by the change. Right. So that's the distance to travel towards the noise at this fraction. You know, or you could also think of it this way. And I know this is a very obvious algebraic change. But if we move this over here, you could also think of this as being, oh, of the total amount of noise, the change in sigma we're doing, what percentage is that?
OK, well, that's the amount we should step. So there's two ways of thinking about the same thing. So again, this is just, you know, high school math. Well, I mean, actually, my seven-year-old daughter has done all these things. It's plus, minus, divide and times. So we're going to need to do this once per sampling step.
So here's a thing called sample, which does that. It's going to go through each sampling step, call our sampler, which initially we're going to do sample Euler. Right. With that information, add it to our list of results and do it again. So that's it. That's all the sampling is.
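Here is a minimal sketch of the Euler step and the sampling loop in plain Python. The names `euler_step`, `sample`, and `denoise_fn` are hypothetical, the sigma list is hand-picked rather than the real schedule, and the "model" is a toy stand-in that always says the clean value is 0.5, so this only illustrates the arithmetic, not the notebook's actual code.

```python
import random

def euler_step(x, sigma, sigma_next, denoise_fn):
    """One Euler step from noise level sigma to sigma_next."""
    denoised = denoise_fn(x, sigma)
    # Slope: how much noise there is per unit of sigma.
    slope = [(xi - di) / sigma for xi, di in zip(x, denoised)]
    # Move along that slope by the change in sigma (negative when denoising).
    return [xi + si * (sigma_next - sigma) for xi, si in zip(x, slope)]

def sample(step_fn, sigmas, denoise_fn, n_values, seed=0):
    """Start from pure noise at sigmas[0] and apply step_fn once per sampling step."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) * sigmas[0] for _ in range(n_values)]
    results = [x]
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x = step_fn(x, sigma, sigma_next, denoise_fn)
        results.append(x)
    return results

def toy_denoise(x, sigma):
    """Toy stand-in for the trained model: every clean value is 0.5."""
    return [0.5 for _ in x]

sigmas = [80.0, 20.0, 5.0, 1.0, 0.2, 0.0]  # hand-picked for illustration
results = sample(euler_step, sigmas, toy_denoise, n_values=4)
print(results[-1])  # each value ends up very close to 0.5
```

Passing the step function in as an argument is what lets the same loop be reused for other samplers later on.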
And of course, we need to grab our list of sigmas to start with. So I think that's pretty cool. And at the very start, we need to create our pure noise image. And so the amount of noise we start with has got a sigma of 80. OK, so if we call sample using sample Euler, we get back some very nice looking images and believe it or not, our FID is 1.98.
So this extremely simple sampler, three lines of code plus a loop, has given us an FID of 1.98, which is clearly substantially better than our cosine. Now we can improve it from there. So one potential improvement is, you might have noticed we added no new noise at all.
This is a deterministic scheduler. There's no rand anywhere here. So we can do something called an ancestral Euler sampler, which does add rand. So we basically do the denoising in the usual way, but then we also add some rand. And so what we do need to make sure is given that we're adding a certain amount of randomness, we need to remove that amount of randomness from the step that we take.
So I won't go into the details, but basically there's a way of calculating how much new randomness and how much just going back in the existing direction do we do. And so there's the amount in the existing direction and there's the amount in the new random direction. And you can just pass in eta, which is just going to, when we pass it into here, is going to scale that.
So if we scale it by half, so basically half of it is new noise and half of it is going in the direction that we thought we should go, that makes it better still. Again with 100 steps. And just make sure I'm comparing to the same, yep, 100 steps.
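For the curious, here is a sketch of how that split between "existing direction" and "new randomness" can be computed. The formula is my understanding of the ancestral step in the k-diffusion repo, so treat it as an assumption rather than a definitive version; the function names and the toy denoiser are hypothetical. The key property is that the squared amounts add up to the squared target sigma, so the total noise level after the step is still what the schedule asked for.

```python
import math
import random

def ancestral_sigmas(sigma, sigma_next, eta=1.0):
    """Split the step into a deterministic part (sigma_down) and fresh noise (sigma_up)."""
    sigma_up = min(sigma_next,
                   eta * math.sqrt(sigma_next**2 * (sigma**2 - sigma_next**2) / sigma**2))
    sigma_down = math.sqrt(sigma_next**2 - sigma_up**2)
    return sigma_down, sigma_up

def euler_ancestral_step(x, sigma, sigma_next, denoise_fn, rng, eta=1.0):
    """Euler step down to sigma_down, then add new noise with std sigma_up."""
    sigma_down, sigma_up = ancestral_sigmas(sigma, sigma_next, eta)
    denoised = denoise_fn(x, sigma)
    out = []
    for xi, di in zip(x, denoised):
        slope = (xi - di) / sigma
        out.append(xi + slope * (sigma_down - sigma) + rng.gauss(0, 1) * sigma_up)
    return out

def toy_denoise(x, sigma):
    """Toy stand-in for the trained model: every clean value is 0.5."""
    return [0.5 for _ in x]

down, up = ancestral_sigmas(2.0, 1.0, eta=0.5)
print(f"sigma_down={down:.3f} sigma_up={up:.3f}")
stepped = euler_ancestral_step([1.5], 2.0, 1.0, toy_denoise, random.Random(0), eta=0.0)
print(stepped)  # with eta=0 no noise is added, so this is a plain Euler step
```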
Okay, so that's fair, like with like. Okay, so that's adding a bit of extra noise. Now then, something that I think we might have mentioned back in the first lesson of this part is something called Heun's method. And Heun's method does something which we can pictorially see here to decide where to go, which is basically we say, okay, where are we right now?
What's the, you know, at our current point, what's the direction? So we take the tangent line, the slope, right? That's basically all it does is it takes a slope. It says, oh, here's a slope, you know. Okay, and so if we take that slope, and that would take us to a new spot, and then at that new spot, we can then calculate a slope at the new spot as well.
And at the new spot, the slope is something else. So that's it here, right? And then you say, like, okay, well, let's go halfway between the two. And let's actually follow that line. And so basically, it's saying, like, okay, each of these slopes is going to be inaccurate. But what we could do is calculate the slope of where we are, the slope of where we're going, and then go halfway between the two.
It's, I actually found it easier to look at in code personally. I'm just going to delete a whole bunch of stuff that's totally irrelevant to this conversation. So take a look at this compared to Euler. So here's our Euler, right? So we're going to do the same first line exactly the same, right?
Then the denoising is exactly the same. And then this step here is exactly the same. I've actually just done it in multiple steps for no particular reason. And then you say, okay, well, if this is the last step, then we're done. So actually, the last step is Euler. But then what we do is we then say, well, that's okay, for an Euler step, this is where we'd go.
Well, what does that look like if we denoise it? So this calls the model the second time, right? And where would that take us if we took an Euler step there? And so here, if we took an Euler step there, what's the slope? And so what we then do is we say, oh, okay, well, it's just, just like in the picture, let's take the average.
Okay, so let's take the average and then use that for the step. So that's all the Heun sampler does, is it just takes the average of the slope where we're at and the slope where the Euler method would have taken us. And so notice that it called the model twice for a single step.
So to be fair, since we've been taking 100 steps with Euler, we should take 50 steps with Heun, right? Because it's going to call the model twice. And still that is now, whoa, we beat one, which is pretty amazing. And so we could keep going, check this out, we could even go down to 20.
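Here is a sketch of a Heun step in plain Python, with a toy problem where the exact answer is known so you can see the averaging help. The names `heun_step` and `gaussian_denoise` are hypothetical. The toy denoiser assumes the clean data is Gaussian with standard deviation 1, in which case the ideal denoiser just shrinks x and following the slope exactly from sigma 2 to sigma 1 would scale x by sqrt(2/5).

```python
import math

def heun_step(x, sigma, sigma_next, denoise_fn):
    """Euler step, then average the slope at the start with the slope at the Euler point."""
    denoised = denoise_fn(x, sigma)
    slope = [(xi - di) / sigma for xi, di in zip(x, denoised)]
    dt = sigma_next - sigma
    x_euler = [xi + si * dt for xi, si in zip(x, slope)]
    if sigma_next == 0:
        return x_euler  # the last step is plain Euler (avoids dividing by zero)
    denoised_2 = denoise_fn(x_euler, sigma_next)  # second model call
    slope_2 = [(xi - di) / sigma_next for xi, di in zip(x_euler, denoised_2)]
    return [xi + 0.5 * (s1 + s2) * dt for xi, s1, s2 in zip(x, slope, slope_2)]

def gaussian_denoise(x, sigma, sigma_data=1.0):
    """Ideal denoiser for toy data drawn from Normal(0, sigma_data)."""
    shrink = sigma_data**2 / (sigma_data**2 + sigma**2)
    return [xi * shrink for xi in x]

x_heun = heun_step([1.0], 2.0, 1.0, gaussian_denoise)
x_exact = math.sqrt((1.0 + 1.0**2) / (1.0 + 2.0**2))
print(f"heun={x_heun[0]:.4f} exact={x_exact:.4f}")
```

On this toy problem a single Euler step lands on 0.6, while the Heun step lands nearer to the exact value, at the cost of calling the denoiser twice.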
This is actually doing 40 model evaluations and this is better than our best Euler, which is pretty crazy. Now, something which you might have noticed is kind of weird about this or kind of silly about this is we're calling the model twice just in order to average them. But we already have two model results, like without calling it twice, because we could have just looked at the previous time step.
And so something called the LMS sampler does that instead. And so the LMS sampler, if I call it with 20, it actually literally does 20 evaluations and actually it beats Euler with 100 evaluations. And so LMS, I won't go into the details too much. It didn't actually fit into my little sampling very well.
So I basically largely copied and pasted Kat's code. But the key thing it does is look, it gets the current sigma, it does the denoising, it calculates the slope, and it stores the slope in a list, right? And then it grabs the first one from the list. So it's kind of keeping a list of up to, in this case, four at a time.
And so it then uses up to the last four to basically guess, kind of, the curvature of this and take the next step. So that's pretty smart. And yeah, I think if you wanted to do super fast sampling, it seems like a pretty good way to do it. And I think, Johnno, you were telling me that, or maybe it was Pedro who was saying that currently people have started to move away.
This was very popular, but people started to move towards a new sampler, which is a bit similar called the DPM++ sampler, something like that. Yeah. Yeah. Yeah. Yeah. But I think it's the same idea. I think it kind of keeps a, I said, keep a list of recent results and use that.
I'll have to check it more closely. I'll have to look at the code. Yeah. That's a similar idea. It's like, if it's done more than one step, then it's using some history to the next thing. Yeah. This history and thing doesn't make a huge amount of sense, I guess, from that perspective.
I mean, still works very well. This makes more sense. So then we can compare: if we use an actual mini-batch of data, we get about 0.5. So yeah, I feel like this is quite a stunning result, to get very close to real data in terms of FID, really with 40 model evaluations, and nearly the entire thing here is by making sure we've got unit variance inputs, unit variance outputs, and kind of equally difficult problems to solve in our loss function.
Yeah. Thus having that different schedule for sampling. That's completely unrelated to the training schedule. I think that was one of the big things with Karas et al's paper was they also could apply this to like, oh, existing diffusion models that have been trained by other papers. We can use our sampler and in fewer steps get better results without any of the other changes.
And yeah, I mean, they do a little bit of rearranging equations to get the other papers versions into their C skip C and C out framework. But then yeah, it's really nice that these ideas can be applied to, so for example, I think stable diffusion, especially version one was trained DDPM style training, Epsilon objective, whatever.
But you can now get these different samplers, and sometimes different schedules and things like that, and use that to sample it and do it in 15, 20 steps and get pretty nice samples. Yeah. And another nice thing about this paper is, in fact, the name of the paper: Elucidating the Design Space of Diffusion-Based Generative Models.
They looked at various different papers and approaches and were trying to say, like, oh, you know what? These are all doing the same thing when we kind of parameterize things in this way. And if you fill in these parameters, you get this paper, and these parameters, you get that paper.
And then, so, they found a better set of parameters, which was very nice to code because, you know, it really actually ended up simplifying things a whole lot. And so if you look through the notebook carefully, which I hope everybody will, you'll see, you know, that the code is really clear and simple compared to all the previous ones, in my opinion.
Like, I feel like every notebook we've done from DDPM onwards, the code's got easier to understand. And just to, again, clarify how this connects with some of the previous papers that we've looked at. So, for example, with DDIM, that's again the sort of deterministic approach that's similar to the Euler method sampler that we were just looking at, which was completely deterministic.
And then something like the Euler ancestral sampler that we were looking at is similar to the standard DDPM approach, which was a more stochastic approach. So again, there are all those sorts of connections that are kind of nice to see, the connections between the different papers and how they can be expressed in this common framework.
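To make the deterministic versus stochastic contrast concrete, here is a minimal sketch of one Euler step and one Euler ancestral step on a single scalar "image". The `denoise` function is a hypothetical stand-in for a trained model (it always predicts the same clean value), and the ancestral split into `sigma_down` and `sigma_up` follows the commonly used formulation with full noise re-injection; treat it as an illustration rather than the notebook's code.

```python
import math
import random

def denoise(x, sigma, target=0.3):
    # Hypothetical stand-in for a trained denoiser: always predicts the same clean value
    return target

def euler_step(x, sigma, sigma_next):
    # Deterministic: follow the slope from sigma to sigma_next
    d = (x - denoise(x, sigma)) / sigma
    return x + d * (sigma_next - sigma)

def ancestral_sigmas(sigma, sigma_next):
    # Split sigma_next into a deterministic part (sigma_down) and fresh noise
    # (sigma_up) so that sigma_down**2 + sigma_up**2 == sigma_next**2
    sigma_up = math.sqrt(sigma_next**2 * (sigma**2 - sigma_next**2) / sigma**2)
    sigma_down = math.sqrt(max(sigma_next**2 - sigma_up**2, 0.0))
    return sigma_down, sigma_up

def euler_ancestral_step(x, sigma, sigma_next, rng):
    # Stochastic: step down further than needed, then add new noise back
    sigma_down, sigma_up = ancestral_sigmas(sigma, sigma_next)
    d = (x - denoise(x, sigma)) / sigma
    x = x + d * (sigma_down - sigma)
    return x + rng.gauss(0, 1) * sigma_up

def sample(step, sigmas, x, **kwargs):
    # Walk down the list of noise levels, applying the chosen step each time
    for s, s_next in zip(sigmas[:-1], sigmas[1:]):
        x = step(x, s, s_next, **kwargs)
    return x

sigmas = [10.0, 5.0, 2.0]
print(sample(euler_step, sigmas, 8.0))
print(sample(euler_ancestral_step, sigmas, 8.0, rng=random.Random(0)))
print(sample(euler_ancestral_step, sigmas, 8.0, rng=random.Random(1)))
```

Running the Euler version twice from the same start gives the same value, while the ancestral version depends on the random seed, which is the same split as between DDIM-style and DDPM-style sampling described above.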
Yeah. Thanks, Tanishk. So we definitely now are at the point where we can show you the U-Net next time. And so I think, unless any of us come up with interesting new insights on the unconditional diffusion training and sampling process, we might be putting that aside for a while, and instead we're going to be looking at creating a good quality U-Net from scratch.
And we're going to look at a different dataset to do that as we start to scale things up a bit, as Johnno mentioned in the last lesson. So we're going to be using a 64 by 64 pixel ImageNet subset called Tiny ImageNet. So we'll start looking at some three-channel images.
So I'm sure we're all sick of looking at black and white shoes. So now we get to look at cliff dwellings and trolley buses and koala bears and yeah, 200 different things. So that'll be nice. Yeah. All right. Well, thank you, Johnno. Thank you, Tanishk. That was fun as always.
And yeah, next time we'll be lesson 22. Bye. Oh, listen to me. Hey, this was lesson 22. Oh, no way. Okay. You're right. See ya. Bye.