Back to Index

Lesson 25: Deep Learning Foundations to Stable Diffusion


Transcript

Hi everybody, and welcome to the last lesson of part two. Greetings Jono and Greetings Tanishk, how are you guys doing? Good thanks. Doing well. Excited for the last lesson. It's been an interesting, fun journey. Yeah. I should explain, we're not quite completing all of stable diffusion in this part of the course.

There's going to be one piece left for the next part of the course, which is the CLIP embeddings. Because CLIP is NLP, and so the next part of the course we will be looking at NLP. So we will end up finishing stable diffusion from scratch, but we're going to have to have a significant diversion.

And what we thought was, given everything that's happened with GPT-4 and stuff since we started this course, we thought it makes more sense to delve into that quite deeply, more soon. And delay CLIP as a result. So hopefully people will feel comfortable with that decision, but I think we'll have a lot of exciting NLP material coming up.

So that's the rough plan. All right. So I think what we might do is maybe start by looking at a really interesting and quite successful application of pixel level diffusion by applying it not to pixels that represent an image, but pixels that represent a sound, which is pretty crazy.

So maybe Jono, of course it's going to be Jono, he does the crazy stuff, which is great. So Jono, show us your crazy and crazily successful approach to diffusion for pixels of sounds, please. Sure thing. Right. So this is going to be a little bit of show and tell.

Most of the code in the notebook is just copied and pasted from, I think, notebook 30. But we're going to be trying to generate something other than just images. So specifically, I'm going to be loading up a dataset of bird calls. These are just like short samples of, I think, 10 different classes of birds calling.

And so we need to understand like, okay, well, this is a totally different domain, right? This is audio. If you look at the data, like, let's look at an example of the data. This is coming from a Hugging Face dataset, so that line of code will download it automatically if you haven't got it before, right?

Yeah. Yeah. So this will download it into a cache and then sort of handle a lot of... You created this dataset, right? Or is this already a dataset you found somewhere else, or you made it, or what? This is a subset that I made from a much larger dataset of longer call recordings, from an open website called Xeno-canto.

So they collect all of these sound recordings from people. They have experts who help identify what birds are calling. And so all I did was find the audio peaks, like where is there most likely to be a bird call and clip around those just to get a smaller dataset of things where there's actually something happening.

Not a particularly amazing dataset in terms of like the recordings have a lot of background noise and stuff, but a fun, small audio one to play with. Um, yeah. And so when we talk about audio, you've got a microphone somewhere, it's reading like a pressure level, essentially, um, in the air with these sound waves.

And it's doing that some number of times per second. So we have a sample rate, and in this case, the data has a sample rate of 32,000 samples per second. So for every second of waveform that's being approximated, there's lots of little up-across, up-across, up-across kind of things, basically, correct?

Yeah. Um, and so that's great for, you know, capturing the audio, um, but it's not so good for modeling because we now have 32,000 values per second in this one big 1D array. Um, and so yeah, you can try and find models that can work with that kind of data.

Uh, but what we're going to do is a little hack, and we're instead going to use something called a spectrogram. So with the original data, the main issue is it's too big and slow to work with? It's too big, um, but also you have some, like some sound waves are at a hundred Hertz, right?

So they're, they're going up and down a hundred times a second and some are at a thousand and some are at 10,000. And often there's background noise that can have extremely high frequency components. And so if you're looking just at the waveform, there's lots and lots of change second to second.

And there's some very long range dependencies of like, Oh, it's generally high here. It's generally low there. And so it can be quite hard to capture those patterns. Um, and so part of it is, it's just a lot of samples to deal with. Um, but part of it also is that it's not like an image where you can just do a convolution and things nearby each other tend to be related or something like that.

Um, it's quite tricky to disentangle what's going on. Um, and so we have this idea of something called a spectrogram. Uh, this is a fancy 3d visualization, but it's basically just taking that audio and mapping time on one axis. So you can see as time goes by, we're moving along the X axis and then on the Y axis is frequency.

And so the, um, the peaks here show like intensity at different frequencies. And so if I make a pure note, you can see that that maps in the frequency domain. Um, but when I'm talking, there's lots and lots of peaks and that's because our voices tend to produce a lot of overtones.

So if I go, you can see there's a main note, but there's also the subsequent notes. And if I play something like a chord, you can see there's, you know, maybe three main peaks, and then each of those has these harmonics as well. Um, so it captures a lot of information about the signal.

Um, and so we're going to turn our audio data into something like this, where even just visually, if I make a bird-like sound, you can see this really nice spatial pattern. And the hope is if we can generate that, and then if we can find some way to turn it back into audio, then we'll be off to the races.

And so, yeah, that's what I'm doing in this notebook. We have, um, I'm leaning on the diffusers pipelines.audio_diffusion Mel class. And so within the realm of spectrograms, there's a few different ways you can do it. So this is from the torchaudio docs, um, but this notebook is from the Hugging Face diffusion models class.

So we had that waveform, that's those raw samples, and we'd like to convert that into what they call the frequency domain, um, which is things like these spectrograms. Um, and so you can do a normal spectrogram, a power spectrogram or something like that. Um, but we often use something called a mel spectrogram, which is exactly the same idea.

It's actually probably what's being visualized here. And it's something that's designed to map the, like the frequency ranges into a range that's, um, like tied to what human hearing is based on. And so rather than trying to capture all frequencies from, you know, zero Hertz to 40,000 Hertz, a lot of which we can't even hear, it focuses in on the range of values that we tend to be interested in as, as humans.

And also it does like a transformation into kind of like a log space, um, so that the intensities, the highs and lows, correspond to loud and quiet for human hearing. So it's very tuned for, um, the types of audio information that we actually might care about, rather than, you know, tens of thousands of hertz that only bats can hear.
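Just to make that concrete, here's a minimal sketch of computing a log-scaled mel spectrogram with torchaudio; the n_fft, hop length and number of mel bins here are assumptions rather than the notebook's settings:

```python
import torch
import torchaudio

# a minimal sketch, not the notebook's code: waveform -> mel spectrogram -> log (dB) scale
sample_rate = 32_000
waveform = torch.randn(1, 2 * sample_rate)            # stand-in for two seconds of audio

to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=1024, hop_length=256, n_mels=128)  # assumed settings
to_db = torchaudio.transforms.AmplitudeToDB(top_db=80)                # log scale, clipped at 80 dB

spec = to_db(to_mel(waveform))                         # shape: (1, 128 mel bins, n_time_steps)
```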

Um, okay. So we're going to rely on a class to abstract this away, but it's going to basically give us a transformation from waveform to spectrogram. And then it's also going to help us go from spectrogram back to waveform. Um, and so, uh, let me show you my data.

I have this to_image function that's going to take the audio array. It's going to use the Mel class to handle turning that into spectrograms. And the class also does things like splitting it up into chunks, based on a desired resolution you can set; I'd like a 128 by 128 spectrogram.

It says, okay, great. I know you need 128 frequency bins for the frequency axis and 128 steps on the time axis. So it kind of handles that converting and resizing. Um, and then it gives us this audio_slice_to_image. So that's taking a chunk of audio and turning it into the spectrogram.

And it also has the inverse. Um, so our dataset is fairly simple. We're just referencing our original audio dataset, but we're calling that to_image function, turning it into a tensor, and mapping it to minus 0.5 to 0.5, similarly to what we've done with the grayscale images in the past.

Um, so if you look at a sample from that data, we now have, instead of an audio waveform of 32,000 samples, or 64,000 if it's two seconds, this 128 by 128 pixel spectrogram, which looks like this. Um, and it's just, it's grayscale. So this is just matplotlib's colors.

Um, but we can test out going from the spectrogram back to audio using the image_to_audio function that the Mel class has, um, and that should give us... Now, this isn't perfect, because the spectrogram shows the intensity at different frequencies, but with audio, you've also got to worry about something called the phase.

And so this image_to_audio function is actually, behind the scenes, doing a kind of iterative approximation, um, with something called the Griffin-Lim algorithm. Um, so I'm not going to try and describe that here, but it's approximating. It's guessing what the phase should be, it's creating a spectrogram, it's comparing that to the original, it's updating; it's doing some sort of iterative, very similar to an optimization, thing to try and generate an audio signal that would produce the spectrogram which we're trying to invert.
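Roughly, the round trip through the Mel class looks like this. It's a sketch of the idea rather than the notebook's exact code, and the Mel class's import path and constructor arguments have changed between diffusers versions, so treat those as assumptions:

```python
import numpy as np
import torch
from diffusers.pipelines.audio_diffusion import Mel   # import path varies by diffusers version

mel = Mel(x_res=128, y_res=128, sample_rate=32_000)   # assumed settings

def to_image(audio_array):
    # waveform -> grayscale spectrogram image of the first 128x128 slice
    mel.load_audio(raw_audio=audio_array)
    return mel.audio_slice_to_image(0)                 # a PIL image

def to_tensor(img):
    # PIL image -> float tensor scaled to [-0.5, 0.5], as in the dataset above
    return torch.from_numpy(np.array(img)).float() / 255 - 0.5

# and the approximate (Griffin-Lim based) inverse used to listen to generated samples:
# audio = mel.image_to_audio(img)
```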

So just to clarify, so my understanding, what you're saying is that the spectrogram is a lossy conversion of the sound into an image. Um, and specifically it's lossy because it, um, tells you the kind of intensity at each point, but it's not, it's kind of like, is it like the difference between a sine wave and a, and a cosine wave?

Like they're just shifted in different ways. We don't know how much it's shifted. So coming back to the sound, you do have to get that, that shifting the phase correct. And so it's trying to guess something and it sounds like it's not doing a great guess from the thing you showed.

The original audio is also not that amazing. Um, but yes, the, the spectrogram back to audio task, this, these dotted lines are like highlighting this is, yeah, it's an approximation and there are deep learning methods now that can do that better, or at least that sound much higher quality, um, because you can train a model somehow to go from this image-like representation back to an audio signal.

Um, but we just use the approximation for this notebook. Um, okay. So now that we can represent our data as like a grayscale 128 by 128 pixel image, um, everything else becomes very much the same as the previous diffusion models examples. We're going to use this noiseify function to add different amounts of noise.

And so we can see now we have our spectrograms, but with varying amounts of noise added. We can create a simple diffusion model. I'm just copying and pasting the results, but with one extra layer, um, just with very few channels, to go from 128 to 64 to 32, to 16, to 8, um, no attention.

Just, I think, pretty much copied and pasted from notebook 30, uh, and trained it, in this case, for 15 epochs. It took about... This is interesting. So you're using simple diffusion? Yes. Um, so specifically, this is the simple diffusion model that you, um, I think we've already introduced. Maybe not.

Um, yeah. So we briefly looked at it, so let's remind ourselves of what it does here. Okay. Yeah. Um, so we have some number of down blocks with a specified number of channels. And then the key insight from simple diffusion was that you often want to concentrate the compute in the middle, at the low resolution.

So that's these mid blocks, and they're transformers. Um, yes. Um, yeah. Um, and so we can stack some number of those, and then, um, the corresponding up path, and this is a U-Net. So we're passing in the features from the down path as we go through those up blocks.

Um, and so we're going to take an, um, image and time step. We can embed the time step. We're going to go through our down blocks, saving the results, then we're going to go through the mid blocks. There we go, through the mid blocks. Yeah. And before that, you've also got the, um, embedding of the, uh, locations; that self.la is the learnable embeddings, using scale and shift.

If I remember. Uh, right. So this is preparing it to go through the transformer blocks by adding some learnable embeddings. Mm-hmm. All right. And then we're reshaping it to be effectively a sequence, since that's how we had written our transformer, to expect a 1D sequence of embeddings. Um, and so once you've gone through those mid blocks, we reshape it back and then we go through the up blocks, passing in also our saved outputs from the down path.

Um, yeah, so it's a nice, it's a nice model. And you can really control how many parameters and how much compute you're using just by setting what the number of features or channels is at each of those down block stages, and how many mid blocks you're going to stack.

Um, and so if you want to scale it up, it's quite easy to say, oh, let me just add more mid blocks. Maybe I'll add more channels, um, to the down and up paths. Um, so it's a very easy model to tweak to get a larger or smaller model.
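To make that shape concrete, here's a heavily simplified sketch along those lines. It is not the course's notebook code: the blocks are plain convolutions and an nn.TransformerEncoderLayer standing in for the real ResNet and attention blocks, and the timestep conditioning is deliberately crude.

```python
import torch, torch.nn as nn, torch.nn.functional as F

class TinySimpleDiffusion(nn.Module):
    # down blocks that halve the resolution, transformer blocks at the lowest resolution,
    # and up blocks that consume the saved skip connections
    def __init__(self, nfs=(32, 64, 128), n_mid=2):
        super().__init__()
        self.t_emb = nn.Linear(1, nfs[0])
        self.inp = nn.Conv2d(1, nfs[0], 3, padding=1)
        self.downs = nn.ModuleList(nn.Conv2d(nfs[i], nfs[i+1], 3, stride=2, padding=1)
                                   for i in range(len(nfs)-1))
        self.mids = nn.ModuleList(nn.TransformerEncoderLayer(nfs[-1], 4, dim_feedforward=nfs[-1]*2,
                                                             batch_first=True, norm_first=True)
                                  for _ in range(n_mid))
        self.ups = nn.ModuleList(nn.ConvTranspose2d(nfs[i+1]*2, nfs[i], 4, stride=2, padding=1)
                                 for i in reversed(range(len(nfs)-1)))
        self.out = nn.Conv2d(nfs[0]*2, 1, 3, padding=1)

    def forward(self, x, t):
        temb = self.t_emb(t[:, None].float())              # embed the timestep
        x = self.inp(x) + temb[:, :, None, None]           # crude conditioning on t
        skips = [x]
        for d in self.downs:                               # down path, saving activations
            x = F.silu(d(x)); skips.append(x)
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)                 # reshape to a 1D sequence of embeddings
        for m in self.mids: seq = m(seq)                   # transformer blocks at low resolution
        x = seq.transpose(1, 2).reshape(b, c, h, w)        # back to spatial
        for u in self.ups:                                 # up path, consuming the skip connections
            x = F.silu(u(torch.cat([x, skips.pop()], 1)))
        return self.out(torch.cat([x, skips.pop()], 1))

# e.g. TinySimpleDiffusion()(torch.randn(4, 1, 128, 128), torch.randint(0, 1000, (4,)))
```

Scaling it up or down is then just a matter of changing nfs and n_mid, which is the point being made here.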

One fun thought, you know, is, um, simple diffusion only came out a couple of months ago, and I think ours might be the first publicly available code for it, because I don't think the author has released the code. I suspect this is probably the first time maybe it's ever been used to generate audio before.

Uh, possibly. Yeah. I guess. Um, I know a couple of people who've at least privately done their implementations. When I asked the author if he was releasing code, he said, oh, but it's simple. It's just a bunch of transformer blocks. I'll release it eventually. Um, no, maybe, maybe not.

I don't know the timeline there, but they were like, oh, you can see the pseudocode, it's pretty easy. Yeah. It is pretty easy. Yeah. Cool. So it trains, the loss goes down as we hope, um, sampling is exactly the same as generating images normally. Um, and that's going to give us the spectrograms.

I'm using DDIM with a hundred steps, um, and to actually listen to these samples, we then just use that, um, image_to_audio function again, to take our grayscale image. Um, and in this case, actually it expects a PIL image. So I first converted it to PIL, um, and then turned that back into audio.

And so we can play some of the generated samples. Wow, that's so cool. I don't know that I could guarantee what bird is making these calls, and some of them are better than others. Like some of them are better than others. Yeah. Some of the original samples are like that too, right?

So. Exactly. Yeah. So yeah, that's generating fake bird calls with, um, spectrogram diffusion. There's projects that do this on music. Um, so the Riffusion project is based on text, and yeah, there's various other pre-trained models that do diffusion on spectrograms to produce, um, you know, music clips or voice or whatever.

Um, I may have frozen. Riffusion is actually this stable diffusion model that's fine-tuned specifically for this, for the spectrogram generation, which I find very impressive. It's like a model that was originally for, you know, text-to-image instead can also generate the spectrograms.

I guess there's still some useful information in, you know, the sort of text-image model that kind of generalizes, or can still be used for text-to-audio. So I found that a very interesting, impressive application. Also, Riffusion is an awesome name. Indeed it is, yeah.

Cool. And I guess since it's a latent model, that leads us onto the next topic, right? I was just going to say, we've got a natural segue there. Yes. So if we want to replicate Riffusion, then, um, we'll need latents. Yeah. So the final non-NLP part of stable diffusion is this, uh, ability to use the more compressed representation, uh, created by a VAE, called latents, um, instead of pixels.

Um, so we're going to start today by creating a VAE, taking a look at how it works. Um, so to remind you, as we learned back in the first lesson of this part, part two, um, the VAE model converts the, um, 256 by 256 pixel, three channel image into a, um, is it 64 by 64 by four?

It'll be 32 if it's 256; uh, it's 512 to 64. Oh, 512 to 64. Okay. So 256 would do a 32 by 32 by four. So dramatically smaller, which makes life so much easier, um, which is really nice. Um, having said that, you know, simple diffusion does the first, you know, few, in fact all of the downsampling pretty quickly, and all the hard work happens, you know, at 16 by 16 anyway.

So maybe it's, you know, with simple diffusion, it's not as big a deal as it used to be, but it's still, you know, it's very handy, particularly because for us folks with more normal amounts of compute, we can take advantage of all that hard work that the stability.ai computers did for us by creating the stable diffusion VAE.

Um, so that's what we're going to do today, but first of all, we're going to create our own. Um, so let's do a VAE using Fashion MNIST. So the first bit is just the normal stuff. One thing I am going to do for this simple example, though, is I'm going to flatten the, um, Fashion MNIST pixels into a vector to make it as simple as possible.

Um, okay. So we're going to end up with vectors of length 784, because 28 by 28 is 784. Uh, we're going to create a single hidden layer MLP with, um, 400 hidden and then 200 outputs. So here's a linear layer. So it's a sequential containing a linear and then an optional activation function and an optional normalization.

Um, we'll update init weights so that we initialize linear layers as well. Um, so before we create a VAE, which is a variational autoencoder, we'll create a normal autoencoder. We've done this once before and we didn't have any luck. Um, in fact, we were so unsuccessful that we decided to go back and create a learner and come back a few weeks later once we knew what we were doing.

So here we are. We're back. We think we know what we're doing. Um, so we're just going to recreate an autoencoder just like we did some lessons ago. Um, so there's going to be an encoder, which is a sequential, which goes from our 784 inputs to our 400 hidden, then a linear layer with our 400 hidden, and then an output layer from the 400 hidden to the 200 outputs of the encoder.

So there we've got our latents. And then the decoder will go from those 200 latents to our 400 hidden, have our hidden layer, and then come back to our 784 inputs. Um, all right. So we can optimize that in the usual way using Adam, um, and we'll do it for 20 epochs. It runs pretty quickly because it's quite a small dataset and quite a small model.
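As a rough sketch of what that looks like (not the exact notebook code; the choice of activation here is an assumption):

```python
import torch.nn as nn

def lin(ni, nf, act=nn.SiLU, norm=None, bias=True):
    # a linear layer plus optional activation and optional normalization, as described above
    layers = [nn.Linear(ni, nf, bias=bias)]
    if act: layers.append(act())
    if norm: layers.append(norm(nf))
    return nn.Sequential(*layers)

nin, nh, nl = 784, 400, 200       # flattened 28x28 input, hidden size, latent size

ae = nn.Sequential(               # plain (non-variational) autoencoder
    lin(nin, nh), lin(nh, nh),    # encoder: 784 -> 400 -> 400
    lin(nh, nl),                  #          -> 200 latents
    lin(nl, nh), lin(nh, nh),     # decoder: 200 -> 400 -> 400
    lin(nh, nin, act=None))       #          -> 784, no activation on the output
```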

Um, and so what we can then do, um, is we can grab a batch of our X; we actually grabbed the batch of X earlier, uh, way back here. So I've got a batch of images, um, and we can put it through our model, um, pop it back on the CPU, and we can then have a look at our original mini-batch. And we have to reshape it to 28 by 28 because we previously had flattened it.

So there's our original and then, um, we can look at the result after putting it through our model. And there it is. And as you can see, it's, you know, very roughly regenerated. And so this is, um, not a massive compression, it's compressing it from 784 to 200. And it's also not doing an amazing job of recreating the original details.

Um, but you know, this is the simplest possible auto encoder. So it's doing, you know, it's a lot better than our previous attempt. Um, so that's good. So what we could now do is we could just generate some noise and then we're not even going to do diffusion. So we're going to go and say like, okay, we've got a decoder.

So let's just decode that noise and see what it creates. And the answer is not anything great. I mean, I could kind of recognize that might be the start of a shoe. Maybe that's the start of a bag. I don't know, but it's not doing anything amazing. So we have not successfully created an image generator here, um, but there's a very simple step we can do to make something that's more like an image generator.

The problem is that, with this vector of length 200, there's no particular reason that things that are not in the dataset are going to decode to items of clothing. We haven't done anything to try to make that happen. We've only tried to make this work for things in the dataset, you know, and, um, therefore, when we just randomly generate a vector of length 200, or 16 vectors of length 200 in this case, um, and then decode them, there's no particular reason to think that they're going to create something that's recognizable as clothing.

So the way a VAE tries to fix this is by, we've got the exact same encoder as before, except it's just missing its final layer. Its final layer has been moved over to here. I'll explain why there's two of them in a moment. So we've got the inputs to hidden, the hidden to hidden, and then the hidden to the latent.

The decoder is identical, okay: latents to hidden, hidden to hidden, hidden to inputs. And then just as before, we call the encoder, um, but we do something a little bit weird next, which is that we actually have two separate final layers. We've got one called mu for the final layer of the encoder and one called LV, which stands for log of variance.

So encoder has two different final layers. So we're going to call both of them, okay. So we've now got two encoded 200 long lots of latents. What do we do with them? What we do is we use them to generate random numbers and the random numbers have a mean of mu.

So when you take a random zero-one, so this creates zero-one random numbers, mean zero, standard deviation one. So if we add mu to it, they now have a mean of mu, or approximately. And then you multiply the random numbers by e to the power of half the log of variance, right?

So given this log of variance, this is going to give you the standard deviation. So this is going to give you a standard deviation of e to the half LV and a mean of mu. Why the half? It doesn't matter too much, but if you think about it, um, standard deviation is the square root of the variance, so the variance is the standard deviation squared.

So when you take the log, you can move that half into the multiplication because of the log trick. That's why we just got the half here instead of the square root, which would be to the power of a half. So this is just, yeah, this is just the standard deviation.

So we've got the standard deviation times normally distributed random noise, plus mu. So we end up with normally distributed numbers, 200 of them for each element of the batch, where they have a mean which is the result of this final layer, mu, and a log variance which is the result of this other final layer, LV.
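In code, that sampling step is tiny. This is just a sketch of the idea, assuming the two final layers have produced mu and lv tensors of shape (batch, 200):

```python
import torch

def sample_latents(mu, lv):
    # reparameterization: mean mu, standard deviation exp(lv / 2)
    return mu + (0.5 * lv).exp() * torch.randn_like(mu)
```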

And then finally we pass that through the decoder as usual. I'll explain why we pass back three things, but for now we're just worried about the fact we passed back the result of the decoder. So what this is going to do is the result of calling, um, encode is going to be a little bit random.

On average, you know, it's still generating exactly the same as before, which is the result of a sequential model with, you know, MLP with one hidden layer, but it's also going to add some randomness around that, right? So this is, here's the bit, which is exactly the same as before.

This is the same as calling encode before, but then here's the bit that adds some randomness to it. And the amount of randomness is also itself random. Okay. So then that gets run through the decoder. Um, okay, so if we now just, um, well, you know, trained that, right, using the result of the decoder and using, um, I think we didn't use MSE loss.

We used a binary cross entropy loss, which we've seen before. Um, so if you've forgotten, you should definitely go back and rewatch that bit, really, in part one. Um, or we've done a bit of it in part two as well. Binary cross entropy loss, um, with logits means that you don't have to worry about doing the sigmoid. It does the sigmoid for you.

It does the soft max for you. Um, so if we just, um, optimize this using BCE now, you would expect, and it would, I believe I haven't checked, um, that it would basically take this final, like this layer here and turn these all into zeros, um, as a result of which it would have no variance at all.

Um, and therefore it would behave exactly the same as the previous auto encoder. Does that sound reasonable to you guys? Yeah. Okay. Um, so that wouldn't help at all because what we actually want is we want some variance and the reason we want some variance is we actually want to have it generate some latents, which are not exactly our data.

They're around our data, but not exactly our data. And then when it generates latents that are around our data, we want them to decode to our, to the same thing. We want them to decode to the correct image. And so as a result, if we can train that, right, something that it does include some variation and still decodes back to the original image, then we've created a much more robust model.

And then that's something that we would hope, then, when we say, okay, well now decode some noise, that it's going to decode to something better than this. So that's the idea of a VAE. So how do we get it to create, um, a log variance which doesn't just go to zero?

Um, well, we have a second, uh, loss term; it's called the KL divergence loss. We've got a function called KLD loss. And what we're going to do is our VAE loss is going to take the binary cross entropy between the actual decoded bit, so that's input zero, and the target.

Okay. So that's, this is exactly the same as before, as this binary cross entropy. And we're going to add it to this KLD loss, KL divergence. Now KL divergence, the details don't matter terribly much. What's important is when we look at the KLD loss, it's getting passed the input and the targets, but if you look, it's not actually using the targets at all.

So if we pull apart the input into its three pieces, which are our predicted image, our mu and our log variance, we don't use this either. So the BCE loss only uses the predicted image and the actual image. The KL divergence loss only uses mu and log variance.

And all it does is it returns a number which says, um, for each item in the batch, um, is mu close to zero and is the variance close to one. How does it do that? Well, for mu, it's very easy. Mu squared. So if mu is close to zero, then minimizing mu squared does exactly that, right?

Um, if mu is one, then mu squared is one. If mu is minus one, mu squared is one. If mu is zero, mu squared is zero. That's the lowest you can get for a squared. Um, okay. So we've got a mu squared piece here, um, and we've got a dot mean.

So we're just taking, that's just basically taking the mean of all the mus. And then there's another piece, which is we've got log variance minus e to the power of log variance. So if we look at that, so let's just grab a bunch of numbers between negative three and three and do number minus e to the power of that number.

Um, and I'm just going to pop in the one plus and the point five times as well. They don't matter much. And you can see that's got a minimum of zero. So when that's a minimum of zero, e to the power of that, which is what we're going to be using, actually half times e to the power of that, but that's okay.

Is what we're going to be using in our, um, dot forward method. That's going to be e to the power of zero, which is going to be one. So this is going to be minimized where, um, log variance exp equals one. So therefore this whole piece here will be minimized when mu is zero and LV is also zero.

Um, and so therefore e to the power of LV is one. Now, the reason that it's specifically this form is basically because, um, there's a specific mathematical thing called the KL divergence, which compares how similar two distributions are. And so the normal distribution can be fully characterized by its mean and its variance.

And so this is actually more precisely calculating the similarity, specifically the KL divergence, between the actual mu and LV that we have and a distribution with a mean of zero and a variance of one. Um, but you can see hopefully why conceptually we have this mu.pow(2) and why we have this LV minus LV.exp here.
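Putting those pieces together, the loss looks roughly like this. It's a sketch assuming the model returns a (decoded, mu, lv) tuple, not necessarily the notebook's exact code:

```python
import torch.nn.functional as F

def bce_loss(inp, target):
    # reconstruction term: compares the decoded image (logits) to the original
    return F.binary_cross_entropy_with_logits(inp[0], target)

def kld_loss(inp, target):
    # KL divergence between N(mu, exp(lv)) and N(0, 1); ignores the target entirely
    _, mu, lv = inp
    return -0.5 * (1 + lv - mu.pow(2) - lv.exp()).mean()

def vae_loss(inp, target):
    return bce_loss(inp, target) + kld_loss(inp, target)
```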

Um, so that is our VAE loss. Did you guys have anything to add to any of that description? So maybe to highlight: the objective of this is to say, rather than having it so that the exact point that an input is encoded to decodes back to that input, we're saying, number one, the space around that point should also decode to that input, because we're going to try and force some variance.

And number two, the overall variance should be, like, yeah, the overall space that it uses should be roughly zero mean and unit variance, right? So instead of being able to map each input to an arbitrary point, and then decode only that exact point to an input, we're now mapping them to a restricted range.

And we're saying that not, not just each point, but its surroundings as well should also decode back to something that looks like that image. Um, and that's trying to like condition this latent space to be much nicer so that any arbitrary point within that, um, range will hopefully map to something useful, which is a harder problem to solve, right?

So we would expect, given that this is exactly the same architecture, we would expect its ability to actually decode would be worse than our previous attempt, because it's a harder problem that we're trying to solve; we've got random numbers in there as well now. But we're hoping that its ability to generate images will improve.

Um, thanks, Jono. Okay. So I actually asked Bing about this, um, which is just, this is more of an example of, like, I think, you know, now that we've got GPT-4 and Bing and stuff, I find they're pretty good at answering questions. Like, I wanted to explain to students what would happen if the variance of the latents was very low, or what if it was very high?

So why do we want them to be one? And I thought, like, oh gosh, this is hard to explain. So maybe Bing can help. So I actually thought it's pretty good. So I'll just say what Bing said. So Bing says, if the variance of the latents is very low, then the encoder distribution would be very peaked and concentrated around the mean.

So that was the thing we were describing earlier. If we had trained this without the KLD loss at all, right, it would probably make the variance zero. And so therefore the latent space would be less diverse and expressive and limit the ability of the decoder to reconstruct the data accurately, make it harder to generate new data that's different from the training data, which is exactly what we're trying to do.

And if the variance is very high, then the encoder would be very spread out and diffuse. It would be more, the latents would be more noisy and random, make it easier to generate new data that's unrealistic or nonsensical. Okay. So that's why we want it to be exactly at a particular point.

So when we train this, we can just pass VAE loss as our loss function, but it'd be nice to see how well it's going at reconstructing the original image and how it's going at creating zero-one distributed data, separately. So what I ended up doing was creating just a really simple thing called FuncMetric, which I derived from the capital-M Mean class, just trying to find it here, from torcheval.metrics.

So they've already got something that can just calculate means. So obviously this stuff's all very simple and we've created our own metrics class ourselves back a while ago. And since we're using torcheval, I thought this is useful to see how we can create one, a custom metric where you can pass in some function to call before it calculates the mean.

So if you call, so you might remember that the way torcheval works is it has this thing called update, which gets passed the input and the targets. So I add to the weighted sum the result of calling some function on the input and the targets. So we want two kind of new metrics.

One we're going to print out as KLD, which is a FuncMetric on KLD loss, and one we're going to print out as BCE, which is a FuncMetric on BCE loss. And so when we call the learner, the loss function we'll use is VAE loss, but we're going to pass in, as metrics, these additional metrics to print out.

So it's just going to print them out. And in some ways it's a little inefficient because it's going to calculate KLD loss twice and BCE loss twice, one to print it out and one to go into the, you know, actual loss function, but it doesn't take long for that bit.
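Here's a rough sketch of that FuncMetric idea, built on torcheval's Mean metric; the course's actual class may differ in detail:

```python
from torcheval.metrics import Mean

class FuncMetric(Mean):
    # a Mean metric that applies `fn` to (predictions, targets) before averaging
    def __init__(self, fn):
        super().__init__()
        self.fn = fn
    def update(self, inp, targ):
        return super().update(self.fn(inp, targ))

# usage (with the kld_loss / bce_loss sketched earlier):
#   metrics = [FuncMetric(kld_loss), FuncMetric(bce_loss)]
```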

So I think that's fine. So now when we call learn.fit, you can see it's printing them all out. So the BCE that we got last time was 0.26. And so this time, yeah, it's not as good. It's 0.31 because it's a harder problem and it's got randomness in it.

And you can see here that the BCE and KLD are pretty similar scale when it starts. That's a good sign. If they weren't, you know, I could always in the loss function scale one of them up or down, but they're pretty similar to start with. So that's fine. So we train this for a while and then we can use exactly the same code for sampling as before.

And yeah, as we suspected, its ability to decode is worse. So it's actually not capturing the LE at all, in fact, and the shoes got very blurry. But the hope is that when we call it on noise, call the decoder on random noise, it's much better. We're getting, it's not amazing, but we are getting some recognizable shapes.

So, you know, VAEs are, you know, not generally going to get you as good a results as diffusion models are, although actually if you train really good ones for a really long time, they can be pretty impressive. But yeah, even in this extremely simple, quick case, we've got something that can generate recognizable items of clothing.

Did you guys want to add anything before we move on to the stable diffusion VAE? Okay. So this VAE is very crappy. And as we mentioned, one of the key reasons to use a VAE is actually that you can benefit from all the compute time that somebody else has put into training a good VAE.

Just also, like, one thing: when you say good VAE, the one that we've trained here is good at generating because it maps down to this one 200-dimensional vector and then back in a very useful way. And like, if you look at VAEs for generating, they'll often have a pretty small dimension in the middle, and it'll just be like this vector that gets mapped back up.

And so a VAE that's good for generating is slightly different to one that's good for compressing. And like the stable diffusion one, we'll see, has this spatial component still: it doesn't map it down to a single vector, it maps it down to 64 by 64 or whatever. And I think that's smaller than the original, but for generating, we can't just put random noise in there and hope a cohesive image will come out.

So it's less good as a generator, but it is good because it has this compression and reconstruction ability. Cool. Yeah. So let's take a look. Now, to demonstrate this, we want to move to a more difficult task, because we want to show off how using latents lets us do stuff we couldn't do well before.

So the more difficult task we're going to do is generating bigger images, and specifically generating images of bedrooms using the LSUN bedrooms dataset. So LSUN is a really nice dataset, which has many, many, many millions of images across 10 scene categories and 20 object categories. And it's very rare for people to use the object categories, to be honest, but people quite often use the scene categories.

They can be more than a little slow to download, extremely slow actually, because the website they come from is very often down. So what I did was I put a subset of 20% of them onto AWS; they kindly provide some free dataset hosting for our students. And also the original LSUN is in a slightly complicated form.

It's in something called an LMDB database. And so I turned them into just normal images in folders. So you can download them directly from the AWS dataset site that they've provided for us. So I'm just using fastcore to save it and then using Python's shutil to unpack the gzipped tar file.

Okay. So that's going to give us, once that runs, which is going to take a long time... And, you know, it might be even more reliable just to do this in the shell with wget or aria2c or something rather than doing it through Python. So this will work, but if it's taking a long time or whatever, maybe just delete it and do it in the shell instead.

Okay. So then I thought, all right, how do we turn these into latents? Well, we could create a dataset in the usual ways. It's going to have a length. So we're going to grab all the files. So glob is built into Python, which will search, in this case, for star dot jpeg.

And if you've got star star slash, that's going to search recursively as long as you pass recursive. So we're going to search for all of the jpeg files inside our data slash bedroom folder. So that's what this is going to do. It's going to put them all into the files attribute.

And so then when we get an item, the ith item, it will find the ith file and read that image. So this is PyTorch's read_image. It's the fastest way to read a JPEG image. People often use PIL, but it's quite hard to find a really well-optimized build of PIL that's compiled for speed, whereas the PyTorch torchvision team have created a very, very fast read_image.

That's why I'm using theirs. And if you pass in ImageReadMode.RGB, it will automatically turn any one-channel, black and white images into three-channel images for you. Or if there are four-channel images with transparency, it will turn those into three channels too. So this is a nice way to make sure they're all the same.

And then this turns it into floats from nought to one. And these images are generally very close to 256 by 256 pixels. So I just crop out the top 256 by 256 bit, because I didn't really care that much. And we do need them to all be the same size in order that we can then pass them to the stable diffusion VAE encoder as a batch.

Otherwise it's going to take forever. So I can create a data loader that's going to go through a bunch of them at a time. So 64 at a time. And use however many CPUs I have as the number of workers. It's going to do it in parallel. And so the parallel bit is the bit that's actually reading the JPEGs, which is otherwise going to be pretty slow.

So if we grab a batch, here it is. Here's what it looks like. Generally speaking, they're just bedrooms, although we've got one pretty risque situation in the bedroom. But on the whole, they're safe for work. This is the first time I've actually seen an actual bedroom scene taking place, as it were.

All right. So as you can see, this mini batch of, if I just grab the first 16 images, has three channels and 256 by 256 pixels. So that's how big that is for 16 images. So that's 3,145,728, so about 3.145 million floats to represent this. Okay. So as we learned in the first lesson of part two, we can grab an autoencoder directly using diffusers, using from_pretrained.

We can pop it onto our GPU. And importantly, we don't have to say with torch.no_grad anymore if we call requires_grad_(False). And remember this neat trick in PyTorch: if a method ends in an underscore, it actually changes the thing that you're calling it on in place. So this is going to stop it from computing gradients, which would take a lot of time and a lot of memory otherwise.

So let's test it out. Let's encode our mini batch. And so, just like Jono was saying, this has now made it much smaller. In our batch of 16, it's now four channels by 32 by 32. So if we compare the previous size to the new size, it's 48 times smaller.

So that's 48 times less memory it's going to need. And it's also going to be a lot less compute for a convolution to go across that image. So it's no good unless we can turn it back into the original image. So let's just have a look at what it looks like first.

Now it's a four channel image, so we can't naturally look at it. But what I could do is just grab the first three channels. And then they're not going to be between 0 and 1. So if I just do dot sigmoid, now they're between 0 and 1. And so you can see that our risque bedroom scene, you can still recognize it.

Or this bedroom, this bed here, you can still recognize it. So there's still that kind of like the basic geometry is still clearly there. But it's, yeah, it's clearly changed it a lot as well. So importantly, we can call decode on this 48 times smaller tensor. And it's really, I think, absolutely remarkable how good it is.

I can't tell the difference to the original. Maybe if I zoom in a bit. Her face is a bit blurry. Was her face always a bit blurry? No, it was always a bit blurry. First, second, third. Oh, hang on. Did that used to look like a proper ND? Yeah, OK.

So you can see this used to say, clearly there's an ND here, and now you can't see those letters. And this is actually a classic thing that's known for this particular VAE: it's not able to regenerate writing correctly at small font sizes. I think it's also, like, here the faces are already pretty low resolution.

But if you were at a higher resolution, the faces also would probably not be converted appropriately. OK, cool. But overall, yeah, it's done a great job. A couple of other things I wanted to note: so, like you mentioned, a factor of 48 decrease.

Oftentimes people refer mostly to the spatial resolution. So since it's going from 256 by 256 to 32 by 32, that's a factor of eight. So they'll sometimes denote it, I think, as f8 or something like this; they'll note the spatial resolution. So sometimes you may see that written out like that.

And of course, it is an eight squared decrease in the number of pixels, which is interesting. Right. Right. And then the other thing I wanted to note was that the VAE is also trained with a perceptual loss objective, as well as, technically, a discriminator, a GAN, objective.

I don't know if you were going to go into that later or now. So, yeah, so perceptual loss we've already discussed. Right. So the VAE, you know, when they trained it... So I think this was trained by CompVis, right, you know, Robin and gang, and stability.ai donated compute for that.

And, to be clear, actually, no, the VAE was actually trained separately. It was actually trained on the Open Images dataset. And it was just this VAE that they trained by themselves on, you know, a small subset of data. But because the VAE is so powerful, it's actually able to be applied to all these other datasets as well.

Okay, great. Yeah. So they would have had a KL divergence loss, and they would have had either an MSE or BCE loss. I think it might have been an MSE loss. They also had a perceptual loss, which is the thing we learned about when we talked about super resolution, which is where, when they compared the output images to the original images, they would have run that through a, you know, ImageNet-trained or similar classifier and confirmed that the activations they got through that model were similar.

And then the final bit, as Tanishk was mentioning, is the adversarial loss, which is also known as a GAN loss. So a GAN is a generative adversarial network. And the GAN loss, what it does is, it's actually more specifically what's called a patch-wise GAN loss.

And what it does is it takes a little section of an image, right. And what they've done is they train, let's just simplify it for a moment and imagine that they've pre-trained a classifier, right, where they've basically got something that you can pass a real, you know, patch from a bedroom scene or a fake patch from a bedroom scene.

And they go into what's called the discriminator. And this is just a normal, you know, ResNet or whatever, which basically outputs something that either says, yep, the image is real, or nope, the image is fake. So sorry, I said it passes in two things; that was wrong. You just pass in one thing and it returns either it's real or it's fake.

You just pass in one thing and it returns either it's real or it's fake. And specifically, it's going to give you something like the probability that it's real. There is another version. I don't think it's what they use. You pass in two and it tells you which one's relative.

Do you remember Tanisha? Is it a relativistic GAN or a normal GAN? I think it's a normal one. Yeah. So the realistic GAN is when you pass in two images and it says which is more real. The one we think that we remember correctly, they use as a regular GAN, which just tells you the probability that it's real.

And so you can just train that by passing in real images and fake images and having it learn to classify which ones are real and which ones are fake. So now that once you've got that model trained, then as you train your GAN, you pass in the patches of each image into the discriminator.

So let's call D here, right? And it's going to spit out the probability that that's real. And so if it's spat out 0.1 or something, then you're like, oh, dear, that's terrible. Our VAE is spitting out pictures of bedrooms where the patches of it are easily recognized as not real.

But the good news is that's going to generate derivatives, right? And so those derivatives then is going to tell you how to change the pixels of the original generated image to make it trick the GAN better. And so what it will do is it will then use those derivatives as per usual to update our VAE.

And the VAE in this case is going to be called a generator, right? That's the thing that's generating the pixels. And so the generator gets updated to be better and better at tricking the discriminator. And after a while, what's going to happen is the generator is going to get so good that the discriminator gets fooled every time, right?

And so then at that point, you can fine-tune the discriminator better by putting in your better generated images, right? And then once your discriminator learns again how to recognize the difference between real and fake, you can then use it to train the generator. And so this is kind of ping-ponging back and forth between the discriminator and the generator.

Like when GANs were first created, people were finding them very difficult to train. And actually a method we developed at fast.ai, I don't know if we were the first to do it or not, was this idea of pre-training a generator just using perceptual loss, and then pre-training a discriminator to be able to spot the generator's fakes, and then ping-ponging backwards and forwards between them.

After that, basically whenever the discriminator got too good, start training the generator; anytime the generator got too good, start training the discriminator. Nowadays, that's pretty standard, I think, to do it this way. And so, yeah, this GAN loss, which is basically saying penalize for failing to fool the discriminator, is called an adversarial loss.
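Here's a very rough sketch of that ping-pong, just to pin down the idea. Here gen stands for the autoencoder being trained (the generator) and disc for a patch discriminator that outputs real/fake logits; both are hypothetical, and this is not the actual stable diffusion VAE training code:

```python
import torch
import torch.nn.functional as F

def disc_step(disc, disc_opt, real, fake):
    # teach the discriminator to tell real patches from generated ones
    lr = disc(real)
    lf = disc(fake.detach())                 # detach: don't backprop into the generator here
    loss = (F.binary_cross_entropy_with_logits(lr, torch.ones_like(lr)) +
            F.binary_cross_entropy_with_logits(lf, torch.zeros_like(lf)))
    disc_opt.zero_grad(); loss.backward(); disc_opt.step()
    return loss

def gen_step(gen, disc, gen_opt, x):
    # reconstruction loss plus the adversarial penalty for failing to fool the discriminator
    fake = gen(x)
    recon = F.mse_loss(fake, x)              # in practice a perceptual loss would be added too
    lf = disc(fake)
    adv = F.binary_cross_entropy_with_logits(lf, torch.ones_like(lf))
    loss = recon + adv
    gen_opt.zero_grad(); loss.backward(); gen_opt.step()
    return loss, fake
```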

To maybe motivate why you do this, if you just did it with a mean squared error or even a perceptual loss with such a high compression ratio, the VAEs tend to produce a fairly blurry output because it's not sure whether there's texture or not in this image or the edges aren't super well defined where they'll be because it's going from one four-dimensional thing up to this whole patch of the image.

And so it tends to be a little bit blurry and hazy because it's kind of hedging its bets, whereas that's something that the discriminator can quite easily pick up: oh, it's blurry, it must be fake. And so then having the discriminator, that is, the adversarial loss, is just kind of saying, even if you're not sure exactly where this texture goes, rather go with a sharper looking texture that looks real than with some blurry thing that's going to minimize your MSE.

And so it tricks it into kind of faking this higher resolution looking sharper output. Yeah. And I'm not sure if we're going to come back and train our own GAN at some point, but if you're interested in training your own GAN or-- you shouldn't call it a GAN, right?

I mean, nowadays, we never really just use a GAN. We have an adversarial loss as part of a training process. So if you want to learn how to use adversarial loss in detail and see the code, the 2019 fast.ai course, Lesson 7 of part 1, has a walkthrough.

So we have sample code there. And maybe given time, we'll come back to it. OK. So quite often, people will call the VAE encoder when they're training a model, which to me makes no sense, right? Because the encoded version of an image never changes unless you are using data augmentation and want to do augmentation on-- sorry, to encode augmented images.

I think it makes a lot more sense to just do a single run through your whole training set and encode everything once. So naturally, the question is then, well, where do you save that? Because it's going to be a lot of RAM if you just leave it in RAM. And also, as soon as you restart your computer, you've lost all that work.

And also, as soon as you restart your computer, we've lost all that work. There's a very nifty file format you can use called a memory mapped numpy file, which is what I'm going to use to save our latency. A memory mapped numpy file is basically-- what happens is you take the memory in RAM that numpy would normally be using, and you literally copy it onto the hard disk, basically.

That's what they mean by memory mapped. There's a mapping between the memory in RAM and the memory in hard disk. And if you change one, it changes the other, and vice versa. They're kind of two ways of seeing the same thing. And so if you create a memory mapped numpy array, then when you modify it, it's actually modifying it on disk.

But thanks to the magic of your operating system, it's using all kinds of beautiful caching and stuff to not make that slower than using a normal numpy array. And it's going to be very clever at-- it doesn't have to store it all in RAM. It only stores the bits in RAM that you need at the moment or that you've used recently.

It's really nifty at caching and stuff. So it's kind of-- it's like magic, but it's using your operating system to do that magic for you. So we're going to create a memory mapped file using np.memmap. And so it's going to be stored somewhere on your disk. So we're just going to put it here.

And we're going to say, OK, so create a memory map file in this place. It's going to contain 32-bit floats. So write the file. And the shape of this array is going to be the size of our data set, so 303,125 images. And each one is 4 by 32 by 32.

OK. So that's our memory mapped file. And so now we're going to go through our data loader, one mini batch of 24 at a time. And we're going to VAE encode that mini batch. And then we're going to grab the means from its latency. We don't want random numbers.

We want the actual midpoints, the means. So this is using the diffusers version of that VAE. So pop that onto the CPU after we're done. And so that's going to be mini batch of size 64 as PyTorch. Let's turn that into NumPy because PyTorch doesn't have a memory mapped thing, as far as I'm aware, but NumPy does.

And so now that we've got this memory mapped array called a, then everything initially from 0 up to 64, not including the 64, that whole sub part of the array is going to be set to the encoded version. So it looks like we're just changing it in memory. But because this is a magic memory mapped file, it's actually going to save it to disk as well.

So yeah, that's it. Amazingly enough. That's all you need to create a memory mapped NumPy array of our latents. When you're done, you actually have to call dot flush. And that's just something that says like anything that's just in cache at the moment, make sure it's actually written to disk.

And then I delete it, because I just want to make sure that I then read it back correctly. So that's only going to happen once, if the path doesn't exist. And then after that, this whole thing will be skipped. And instead, we're going to call np.memmap again with the same path.

But this time in the same data type, the same shape, this time we're going to read it. Mode equals R means read it. And so let's check it. Let's just grab the first 16 latents that we read and decode them. And there they are. OK. So this is like not a very well-known technique, I would say, sadly.
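In outline, the whole memory-mapped cache looks something like this, reusing the ds, dl and vae names from the sketches above; the path and exact shape are assumptions:

```python
import numpy as np
import torch

mpath = 'data/bedroom_latents.npy'
shape = (len(ds), 4, 32, 32)                               # one 4x32x32 latent per image

a = np.memmap(mpath, dtype=np.float32, mode='w+', shape=shape)
i = 0
for xb in dl:                                              # the DataLoader of 256x256 crops
    with torch.no_grad():
        lat = vae.encode(xb.cuda()).latent_dist.mean.cpu().numpy()
    a[i:i + len(lat)] = lat                                # writes straight through to disk
    i += len(lat)
a.flush()                                                  # make sure cached pages hit the disk
del a

lats = np.memmap(mpath, dtype=np.float32, mode='r', shape=shape)   # read it back later
```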

But it's a really good one. You might be wondering like, well, what about like compression? Like shouldn't you be zipping them or something like that? But actually remember, these latents are already-- the whole point is they're highly compressed. So generally speaking, zipping latents from a good VAE doesn't do much.

Because they almost look a bit random number-ish. OK. So we've now saved our entire LSUN bedroom subset, the 20% bit that I've provided, as latents. So we can now run it through-- this is a nice thing. We can use exactly the same process from here on in as usual.

OK. So we've got the noisify in our usual collation function. Now, the latents are much higher than 1 standard deviation. So if we divide it by about 5, that takes it back to a standard deviation of about 1. I think in the paper they use like 0.18 or something.

But this is close enough to make it a unit standard deviation. So we can split it into a training and a validation set. So just grab the first 90% as the training set and the last 10% as the validation set. So those are our data sets. We use a batch size of 128.

So now we can use our data loaders class we created with the getDLs we created. So these are all things we've created ourselves with the training set, the validation set, the batch size, and our collation function. So yeah, it's kind of nice. It's amazing how easy it is. A data set has the same interface as a NumPy array or a list or whatever.

So we can literally just use the NumPy array directly as a data set, which I think is really neat. This is why it's useful to know about these foundational concepts, because you don't have to start thinking like, oh, I wonder if there's some torch vision thing to use memmap NumPy files or something.

It's like, oh, wait, they already do provide a data set interface. I don't have to do anything. I just use them. So that's pretty magical. So we can test that now by grabbing a batch. And so this is being noisified. And so here we can see our noisified images.

And so here's something crazy is that we can actually decode noisified images. And so here's I guess this one wasn't noisified much because it's a recognizable bedroom. And this is what happens when you just decode random noise, something in between. So I think that's pretty fun. Yeah, this next bit is all just copied from our previous notebook, create a model, organize it, train for a while.

So this took me a few hours on a single GPU. Everything I'm doing is on a single GPU. Literally nothing in this course, other than the stable diffusion stuff itself, is trained on more than one GPU. The loss is much higher than usual. And that's not surprising, because it's trying to generate latent pixels, where, like, it's much more precise as to exactly what it wants.

There's not like lots of pixels where the ones next to each other are really similar or the whole background looks the same or whatever. A lot of that stuff, it's been compressed out. It's a more difficult thing to predict latent pixels. So now we can sample from it in exactly the same way that we always have using DDIM.

But now we need to make sure that we decode it, because the thing that it's sampled are latents because the thing that we asked it to learn to predict are latents. And so now we can take a look and we have bedrooms. And some of them look pretty good.

I think this one looks pretty good. I think this one looks pretty good. This one, I don't have any idea what it is. And this one, like clearly there's bedroomy bits, but there's something, I don't know, there's weird bits. So the fact that we're able to create 256 by 256 pixel images where at least some of them look quite good in a couple of hours, I can't remember how long it took to train, but it's a small number of hours in a single GPU is something that was not previously possible.

And in a sense, we're totally cheating, because we're using the stable diffusion VAE to do a lot of the hard work for us. But that's fine, you know, because that VAE knows how to create all kinds of natural images and drawings and portraits and paintings or whatever.

So you can, I think, work in that latent space quite comfortably. Yeah. Do you guys have anything you wanted to add about that? Oh, actually, Tanishk, you've trained this for longer. I only trained it for 25 epochs. How long did you, how many hours did you train it for?

Cause you did a hundred epochs, right? Yes, I did a hundred epochs. I didn't keep track exactly, but I think it was about 15 hours on an A100. A single A100. Yeah. I mean, the results, yeah, I'll show them. They're, I guess, maybe slightly better, but you know, I'm not sure, maybe.

No, it is definitely slightly better. The good ones are certainly slightly better. Yeah. Yeah. Like the bottom left one is better than any of mine, I think. So it's possible. Maybe at this point, we just may need to use more data, I guess, cause I guess we were using a 20% subset.

So maybe having more of that data to provide more diversity or something like that, maybe that might help. Yeah. Or maybe, have you tried doing the diffusers one for a hundred? No, I'm using this. Okay. Our code here. Yeah. So I've got, all right. So I'll share my screen if you want to stop sharing yours.

So I do have, if we get around to this, maybe we can add the results back to this notebook, cause I do have a version that uses diffusers. Everything else is identical, 25 epochs, except for the model. For the previous one, I was using our own U-Net model.

So I had to change the channels now to four, and the number of filters, I think I might've increased it a bit. So then I tried using, yeah, the diffusers U-Net with whatever their defaults were. And so I got, what did I get here? 243. With diffusers, I got a little bit better, 239.

And yeah, I don't know if they're obviously better or not. Like, this is a bit weird. I think, actually, another thing we could try maybe is to do a hundred epochs, but use the number of channels and stuff that they used for stable diffusion, cause I think the defaults in diffusers are not actually the same as stable diffusion.

So maybe we could try a stable-diffusion-matched U-Net for a hundred epochs. And if we get any nice results, maybe we can paste them into the bottom to show people. Yeah. Yeah. Cool. Yeah. Do you guys have anything else to add at this point? All right. So I'll just mention one more thought in terms of an interesting project people could play with.

I don't know if this is too crazy, and I don't think it's been done before, but my thought was: do you remember there was a huge difference in our super resolution results when we used a pre-trained model and when we used perceptual loss, but particularly when we used a pre-trained model?

I thought we could use a pre-trained model here too, but we would need a pre-trained latent model, right? We would want something where our, you know, downsampling backbone was a model pre-trained on latents. And so I just want to show you what I've done, and you guys, you know, if anybody watching wanted to try taking this further, I've just done the first bit for you to give you a sense. Which is: I've pre-trained an ImageNet model, not Tiny ImageNet but the full ImageNet, on latents, as a classifier.

And if you use this as a backbone, you know, and also try maybe some of the other tricks that we found helpful, like having ResNet blocks on the cross connections. These are all things that I don't think anybody's done before. I don't know, the scientific literature is vast and I might've missed it, but I've not come across anybody doing these tricks before.

So one of the interesting parts of this, which is designed to be challenging, is that we're using bigger datasets now, but they're datasets that you can absolutely run on a single GPU, you know, a few tens of gigabytes, which fits on any modern hard drive easily.

So these like are good tests of your ability to kind of like move things around. And if you're somewhere that doesn't have access to a decent internet connection or whatever, this might be out of the question, in which case don't worry about it. But if you can, yes, try this because it's good practice, I think, to make sure you can use these larger datasets.

So ImageNet itself, you can actually grab from Kaggle nowadays. They call it the object localization challenge, but actually this contains the full ImageNet dataset, or the version that's used for the ImageNet competition, so I think people generally call that one ImageNet-1k. You just have to accept the terms, because that has like some distribution terms.

Yeah, exactly. So you've got to kind of sign in and then join the competition and then yeah, accept the terms. So you can then download the dataset or you can also download it from Hugging Face. It'll be in a somewhat different format, but that'll work as well. So I think I grabbed my version from Kaggle.

So on Kaggle, you know, it's just a zip file, you unzip it and it creates an ILSVRC directory, which I think is the acronym for the competition. Yeah, the ImageNet Large Scale Visual Recognition Challenge. Okay. So then inside there, there is a Data directory, and inside that there is CLS-LOC, and that's where everything actually is.

So just like before, I wanted to turn these all into latents. So in that directory I created a latents subdirectory, and this time, partly just to demonstrate how these things work, I wanted to do it a slightly different way. Okay. So again, we're going to create our pre-trained VAE, pop it on the GPU, turn off gradients for it, and I'm going to create a dataset.

Now, one thing that's a bit weird about this is that because this is really quite a big dataset, like it's got 1.3 million files, the thing where we go glob star star slash star dot JPEG takes a few seconds, you know, and particularly if you're doing this on like, you know, an AWS file system or something, it can take really quite a long time.

On mine, it only took like three seconds, but I don't want to wait three seconds. So I, you know, a common trick for these kinds of big things is to create a cache, which is literally just a list of the files. So that's what this, this is. So I decided that Z pickle means a gzipped pickle.

So what I do is, if the cache exists, we just gzip.open it and load the file list. If it doesn't, we use glob exactly like before to find all the files, and then we also save a gzip file containing a pickle.dump of the files. pickle.dump is what we use in Python to take basically any Python object, a list of dictionaries, a dictionary of lists, whatever you like, and save it.
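Something along these lines; the directory layout follows the Kaggle ILSVRC structure mentioned above, but the cache file name is my own:

```python
# A sketch of the "z pickle" file-list cache idea.
import gzip, pickle
from pathlib import Path

path  = Path('ILSVRC/Data/CLS-LOC')
cache = path/'files.zpkl'           # "z pickle" = gzipped pickle

if cache.exists():
    files = pickle.load(gzip.open(cache, 'rb'))
else:
    files = list(path.glob('**/*.JPEG'))
    pickle.dump(files, gzip.open(cache, 'wb', compresslevel=1))
```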

And it's super fast, right? And I use gzip with compresslevel 1, to basically compress it pretty well, but pretty fast. So this is a really nice way to create a little cache of that. So this is the same as always. And so our get item is going to grab the file.

It's going to read it in, turn it into a float. And what I did here was, you know, I'm being a little bit lazy, but I just decided to center crop the middle, you know, so let's say it was a 300 by 400 file, it's going to center crop the middle 300 by 300 section, and then resize it to 256 by 256.
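Roughly like this, as a sketch of the lazy centre-crop-then-resize step (the function name is mine, not the notebook's):

```python
import torchvision.transforms.functional as TF
from torchvision.io import read_image, ImageReadMode

def load_img(path, size=256):
    img = read_image(str(path), ImageReadMode.RGB).float() / 255.  # CHW in [0, 1]
    side = min(img.shape[-2:])
    img = TF.center_crop(img, side)                 # e.g. 300x400 -> centre 300x300
    return TF.resize(img, [size, size], antialias=True)
```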

So they'll be the same size. So yeah, we can now, oh, I managed to create the VAE twice. So I can now just confirm, I can grab a batch from that data loader, encode it. And here it is, and then decode it again. And here it is. So the first category must have been computer or something.

So here's, as you can see, the VAE is doing a good job of decoding pictures of computers. So I can do something really very similar to what we did before. If we haven't got that destination directory yet, create it, go through our data loader, encode a batch. And this time I'm not using a memmapped file, I'm actually going to save separate NumPy files for each one.

So go through each element of the batch, each item. I'm going to save it into the destination directory, which is the latents directory, and I'm going to give it exactly the same relative path as the original had, because the path contains the folder that tells you what the label is. Make sure the directory we're saving into exists, and save it just as a NumPy file.
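A hedged sketch of that loop; the directory names are assumptions, vae and the data loader are from earlier, and I take the mean of the latent distribution here, which may not be exactly what the notebook does:

```python
import numpy as np
import torch
from pathlib import Path

src  = Path('ILSVRC/Data/CLS-LOC')
dest = Path('ILSVRC/latents')

@torch.no_grad()
def save_latents(dl, files):
    i = 0
    for xb in dl:
        lats = vae.encode(xb.cuda()).latent_dist.mean
        for lat in lats:
            out = (dest/files[i].relative_to(src)).with_suffix('.npy')
            out.parent.mkdir(parents=True, exist_ok=True)   # keeps the label folder
            np.save(out, lat.cpu().numpy())
            i += 1
```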

This is another way to do it. So this is going to be a separate NumPy file for each item. Does that make sense so far? Okay, cool. So I could create a thing called a NumPy data set, which is exactly the same as our images data set, but to get an item, we don't have to open a JPEG anymore, we just call np.load.
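Something like this sketch (the class name is mine); the label comes from the parent folder, exactly as with the JPEG version:

```python
import numpy as np
import torch
from pathlib import Path

class NumpyDataset:
    def __init__(self, files): self.files = list(files)
    def __len__(self): return len(self.files)
    def __getitem__(self, i):
        f = Path(self.files[i])
        return torch.from_numpy(np.load(f)), f.parent.name  # (latent, class-folder name)
```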

So this is a nice way to take something you've already got and change it slightly. So it's going to return the... Why did you do this versus the memory-mapped file, Jeremy? Just out of interest? Sorry? Why did you do this versus the memory-mapped file? Was it just to show a different way?

Just to show a different way. Yeah. Yeah. Absolutely no particularly good reason, honestly. Yeah, I like to kind of like demonstrate different approaches. And I think it's good for people's Python coding if you make sure you understand what all the lines of code do. Yeah, they both work fine, actually.

It's partly also for my own experimental interest. It's like, oh, which one seems to kind of feel better? Yeah. All right. So create training and validation data sets by grabbing all the NumPy files inside the training and validation folders. And then I'm going to just create a training data loader for the training data set, just to see what the mean and standard deviation is on the channel dimension.

So I take the mean over every dimension except channel. And so there it is. And as you can see there, the mean and standard deviation are not close to zero and one. So we're going to store away that mean and standard deviation so that we can normalize with them. We've seen a transform dataset before.

This is just applying a transform to a dataset. We're going to apply the normalization transform. In the past we've used our own normalization, but torchvision has one as well, so this is just demonstrating how to use torchvision's version. It's literally just subtracting the mean and dividing by the standard deviation.
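As a rough sketch, estimating the stats from one batch of training latents (the notebook's exact numbers will differ, and the train_dl here is assumed to be a data loader over the NumPy dataset):

```python
import torch
from torchvision.transforms import Normalize

xb, _ = next(iter(train_dl))
mean = xb.mean(dim=(0, 2, 3))        # reduce over batch, height, width
std  = xb.std (dim=(0, 2, 3))        # keep one stat per latent channel
norm   = Normalize(mean.tolist(), std.tolist())
denorm = Normalize((-mean/std).tolist(), (1/std).tolist())   # inverse, for viewing results
```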

We're also going to apply some data augmentation. We're going to use the same trick we've used before for images that are very small, which is we're going to add a little bit of padding and then randomly crop our original image size from that. So it's just like shifting it slightly each time.

And we're also going to use our random erasing. And it's nice, because we did it all with broadcasting, so it applies equally well to a four-channel image as it does to three channels, or the one channel I think we originally wrote it for. Now, as far as I know, nobody has built classifiers from latents before.

So I didn't even know if this was going to work. So I visualized it. So we have a tfm_x and a tfm_y. For tfm_x, you can optionally add augmentation, and if you do, then apply the augmentation transforms. Now this is going to be applied one image at a time, but some of our augmentation transforms expect a batch.

So we create an extra unit axis on the front to be a batch of one, and then remove it again. And then tfm_y, very much like we've seen before, we're going to turn those path names into IDs. So there's our validation and training transform datasets. So that we can look at our results, we need a denormalization.
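A hedged sketch of those transforms. The torchvision augmentations here stand in for the course's own pad-crop and random-erase implementations, the sizes are guesses for 4x32x32 latents, and norm and train_files come from the earlier snippets:

```python
import torch
import torchvision.transforms as T
from pathlib import Path

aug = torch.nn.Sequential(
    T.RandomCrop(32, padding=1),   # pad a little, then crop back to the original size
    T.RandomErasing(p=0.25),       # works regardless of how many channels there are
)

def tfm_x(x, aug_tfm=None):
    x = norm(x)
    if aug_tfm is not None:
        x = aug_tfm(x[None])[0]    # add a batch axis of one, then remove it again
    return x

classes = sorted({Path(f).parent.name for f in train_files})
cls2id  = {c: i for i, c in enumerate(classes)}
def tfm_y(y): return cls2id[y]     # turn the path's class-folder name into an ID
```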

So let's create our data loaders, grab mini batches and show them. And so I was very pleased to see that the random erasing works actually extremely nicely. So you can see you get these kind of weird patches, you know, weird patches, but they're still recognizable. So this is something I very, very often do, to answer: is this thing I'm doing in computer vision reasonable?

It's like, well, can my human brain recognize it? So if I couldn't recognize that this is a drilling platform myself, then I shouldn't expect a computer to be able to do it, or that this is a compass or whatever. I'm so glad they got otters, so cute. And you can see the cropping it's done has also been fine.

Like, it's a little bit of a fuzzy edge, but basically it's not destroying the image at all; they're still recognizable. It's also a good example here of how difficult this problem is, like the fact that this is seashore; I would have called this surf, you know, but maybe that's not one of the categories.

Yeah. Okay. This could be food, but actually it's a refrigerator. Okay. So our augmentation seems to be working well. So then I, yeah, basically I've just copied and pasted, you know, our basic pieces here. And I kind of wanted to have it all in one place just to remind myself of exactly what it is.

So this is the pre-activation version of convolutions. The reason for that is that if I want this to be a backbone for a diffusion model or a U-Net, then, as I remember we found, pre-activation works best for U-Nets. So therefore our backbone needs to be trained with pre-activation. So we've got a pre-activation conv, a res block, and a res blocks model with dropout.

This is all just copied from previous. So I decided like I wanted to try to, you know, use the basic trick that we learnt about from simple diffusion of trying to put most of our work in the later layers. So the first layer just has one block, then two blocks, and then four blocks.

And then I figured that we might later delete these final blocks; maybe they're just going to end up being for classification, and the rest might end up being our pre-trained backbone, or maybe we keep them. I don't know. As I said, this hasn't been done before.

So anyway, I tried to design it in a way where we can mess around a little bit with how many of these we keep. And also I tried to use very few channels in the first blocks, and then jump the channels up where most of the work is going to be done, a jump from 128 to 512.

So that's why I designed it this way. You know, I haven't even taken it any further than this, so I don't know if it's going to be a useful backbone or not. I didn't even know if it was going to be possible to classify latents at all. It seemed very likely it was, even just based on the fact that you can still kind of recognize them, like I could probably recognize that's a computer, maybe.
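To make the shape of that idea concrete, here's a very rough, unofficial sketch of such a classifier over 4-channel latents: pre-activation res blocks, one then two then four blocks per stage, very few channels early and a big jump late. The exact block counts and channel sizes are my guesses, not the notebook's, and I've left out dropout to keep it short.

```python
import torch
from torch import nn

def pre_conv(ni, nf, ks=3, stride=1):
    # pre-activation ordering: norm -> activation -> conv
    return nn.Sequential(nn.BatchNorm2d(ni), nn.SiLU(),
                         nn.Conv2d(ni, nf, ks, stride=stride, padding=ks//2))

class ResBlock(nn.Module):
    def __init__(self, ni, nf, stride=1):
        super().__init__()
        self.convs  = nn.Sequential(pre_conv(ni, nf), pre_conv(nf, nf, stride=stride))
        self.pool   = nn.Identity() if stride == 1 else nn.AvgPool2d(stride)
        self.idconv = nn.Identity() if ni == nf else nn.Conv2d(ni, nf, 1)
    def forward(self, x): return self.convs(x) + self.idconv(self.pool(x))

def stage(n_blocks, ni, nf):
    # n_blocks res blocks; the last one downsamples by 2
    return nn.Sequential(*[ResBlock(ni if i == 0 else nf, nf,
                                    stride=2 if i == n_blocks-1 else 1)
                           for i in range(n_blocks)])

def latent_classifier(n_classes=1000):
    return nn.Sequential(
        nn.Conv2d(4, 64, 3, padding=1),   # latents of 256px images are 4x32x32
        stage(1, 64, 128),                # 32 -> 16, little work early
        stage(2, 128, 128),               # 16 -> 8
        stage(4, 128, 512),               # 8 -> 4, most of the work, big channel jump
        nn.BatchNorm2d(512), nn.SiLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(512, n_classes))
```

Again, that's just my approximation of the design being described, not the lesson's actual code.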

So I thought it was going to be possible. But yeah, this is all new. So that was the model I created. And then I trained it for 40 epochs. And you can see after one epoch, it was already 25% accurate. And that's it recognizing which one of a thousand categories it is.

So I thought that was pretty amazing. And so after 40 epochs, I ended up at 66%, which is really quite fantastic because a ResNet 34 is kind of like 73% or 74% accuracy when trained for quite a lot longer. You know, so to me, this is extremely encouraging that, you know, this is a really pretty good ResNet at recognizing images from their latent representations without any decoding or whatever.

So from here, you know, if you want to, you guys could try, yeah, building a better bedroom diffusion model or whatever you like. It doesn't have to be bedrooms. Actually, one of our colleagues, Molly, I'm just going to find it. So one of our colleagues, Molly, actually used the, do you guys remember, was it the celeb faces that she used?

So there's a CelebA-HQ data set that consists of images of faces of celebrities. And what Molly did was she basically used this exact notebook, but used this faces data set instead. And this one's really pretty good, isn't it? You know, this one's really pretty good. They certainly look like celebrities, that's for sure.

So yeah, you could try this data set or whatever, but yeah, try it. Yeah, maybe try it with the pre-trained backbone, try it with ResNets on the cross connections, try it with all the tricks we used in SuperRes, try it with perceptual loss. Some folks we spoke to about the perceptual loss think it won't help with Latents because the underlying VAE was already trained with perceptual loss, but we should try, you know, or you guys should try all these things.

Yeah, so be sure to check out the forum as well to see what other people have already tried here because it's a whole new world. But it's just an example of the kind of like fun research ideas I guess we can play with. Yeah, what do you guys think about this?

Are you like surprised that we're able to quickly get this kind of accuracy from Latents or do you think this is a useful research path? What are your thoughts? Yeah, I think it's very interesting. Oh, go ahead. I was going to say the Latents are already like a slightly compressed, richer representation of an image, right?

So it makes sense that that's a useful thing to train on. And 66%, I think AlexNet is like 63% or something like that. So, you know, we were already at state of the art, what, eight years ago, whatever. It might be more like 10 years ago. I know time passes quickly.

Yeah, yeah, I guess next year. Yeah, next year it will be 10 years ago. But yeah, I'm kind of curious about the pre-training. The whole value for me of using a pre-trained network is that someone else has done lots and lots of compute on ImageNet to learn some features and I get to use that. So it's kind of funny to be like, oh, well, let's pre-train it ourselves and then try and use that.

I'm curious how best you'd allocate that compute: if you've got 10 hours of GPU, whether you should just do 10 hours of training, versus, like, five hours of pre-training and five hours of training. I mean, based on our super-res thing, the pre-training was so much better.

So that's why I'm feeling somewhat hopeful about this direction. Yeah. Yeah, I'm really curious to see how it goes. I guess I was going to say, yeah, I think there's just a lot of opportunity for doing stuff in the latents. And, I mean, here you trained a classifier as a backbone, but you could think of training classifiers on other things for, you know, guidance or things like this.

Yeah. Of course, we've done some experiments with that. I know Jono has his mid-U guidance approach for some of these sorts of things, but there are different approaches you can play around with here. You know, exploring in the latent space can make it computationally cheaper than having to decode it every time you want to, you know, look at the image and then maybe apply a classifier, apply some sort of guidance on the image.

But if you can do it directly in the latent space, there are a lot of interesting opportunities there as well. Yeah. And, you know, now we're showing that. Yeah, style transfer on latents, everything on latents. You can also distill models; that's something I've done to make a latent CLIP, just have it try and mirror an image-space CLIP.

And so for classifiers as well, you could distill an ImageNet classifier: rather than just having the label, you try and copy the logits. And then that's an even richer signal, like you get more value per example. So then you can create your latent version of some existing image classifier or object detector or multi-modal model like CLIP.

I feel funny about this because I'm like both excited about simple diffusion on the basis that it gets rid of latents, but I'm also excited about latents on the basis of it gets rid of most of the pixels. I don't know how I can be cheering for both, but somehow I am.

I guess may the best method win. So, you know, for the folks that are finishing this course, well, first of all, congratulations, because it's been a journey. Particularly part two, it's a journey which requires a lot of patience and tenacity. You know, if you've zipped through by binging on the videos, that's totally fine.

It's a good approach, but you know, maybe go back now and do it more slowly and do the, you know, build it yourself and really experiment. But assuming, you know, for folks who have got to the end of this and feel like, okay, I get it more or less.

Yeah. Do you guys have any sense of what kind of things make sense to do now? You know, where would you guys go from here? I think there are great opportunities in implementing the papers that come along these days. But also at this stage, I think, you know, we're already discussing research ideas.

And I think, you know, we're in a solid position to come up with our own research ideas and explore those ideas. So I think that's a real opportunity that we have here. I think that's often best done collaboratively. So I'll just mention that fast.ai has a Discord, and if you've got to this point, then you're probably somebody who would benefit from being there.

And yeah, just pop your head in and say hello; there's an introductions thread for that. And, you know, maybe say what you're interested in or whatever, because it's nice to work with others, I think. I mean, I only know both Jono and Tanishk because of the Discord and the forums and so forth.

So that would be one. And we also have a generative channel, so anything related to generative models, that's the place. For example, Molly was posting some of her experiments in that channel, and I think there are other fast.ai members posting their experiments. So if you're doing anything generative-model related, that's a great way to also get feedback and thoughts from the community.

Yeah. I'd also say that if you're at the stage where you've finished this course, you actually understand how diffusion models work, you've got a good handle on what the different components in stable diffusion are, and you know how to wrangle data for training and all these things.

You're like so far ahead of most people who are building in this space. And I've got lots of, lots of companies and people reaching out to me to say, do you know anybody who has like more than just, oh, I know how to like load stable diffusion and make an image.

Like, you know, someone who knows how to actually tinker with it and make it better. And if you've got those skills, don't feel like, oh, I'm definitely not qualified to apply. There's lots of stuff where, yeah, you can just take the simple, sensible ideas that we've covered in the course and say, oh, actually, maybe I could try that.

Maybe I could play with this, you know, take this experimentalist approach. I feel like there's actually a lot of people who would love to have you helping them build the million and one little stable-diffusion-based apps or whatever that they're working on. And particularly, the thing we always talk about at fast.ai: if you can combine that with your domain expertise, whether it be from your hobbies or your work in some completely different field or whatever, there'll be lots of interesting ways to combine them. You're probably one of the only people in the world right now who understands your areas of passion or vocation as well as these techniques.

So, again, the forum or the Discord or whatever is a good place to start having those conversations, because it can be difficult when you're at the cutting edge, which you now are, by definition. All right. Well, we better go away and start figuring out how on earth GPT-4 works.

I don't think we're necessarily going to build the whole of GPT-4 from scratch, at least not at that scale, but I'm sure we're going to have some interesting things happening with NLP. And Jono, Tanishk, thank you so much. It's been a real pleasure. It was nice doing things with a live audience, but I've got to say, I really enjoyed this experience of doing stuff with you guys these last few lessons.

So thank you so much. Yeah, thanks for having us. This was really, really fun. All right. Bye. Cool. Bye. Bye. Bye.