Lesson 25: Deep Learning Foundations to Stable Diffusion
Hi everybody, and welcome to the last lesson of part two. 00:00:05.900 |
Greetings Jono and Greetings Tanishk, how are you guys doing? 00:00:17.040 |
I should explain, we're not quite completing all of stable diffusion in this part of the course. 00:00:24.440 |
There's going to be one piece left for the next part of the course, which is the CLIP text model. 00:00:29.320 |
Because CLIP is NLP, and so in the next part of the course we will be looking at NLP. 00:00:35.420 |
So we will end up finishing stable diffusion from scratch, but we're going to have to wait a bit for that last piece. 00:00:43.360 |
And what we thought was, given everything that's happened with GPT-4 and stuff since 00:00:50.680 |
we started this course, we thought it makes more sense to delve into that quite deeply in the next part. 00:01:02.800 |
So hopefully people will feel comfortable with that decision, and I think we'll have a lot of fun with it. 00:01:15.840 |
So I think what we might do is maybe start by looking at a really interesting and quite 00:01:23.720 |
successful application of pixel level diffusion by applying it not to pixels that represent 00:01:30.400 |
an image, but pixels that represent a sound, which is pretty crazy. 00:01:35.160 |
So maybe Johnno, of course it's going to be Johnno, he does the crazy stuff, which is great. 00:01:39.520 |
So Johnno, show us your crazy and crazily successful approach to diffusion for pixels that represent sounds. 00:01:49.760 |
So this is going to be a little bit of intro and tell. 00:01:54.160 |
Most of the code in the notebook is just copied and pasted from, I think, notebook 30. 00:01:59.680 |
But we're going to be trying to generate something other than just images. 00:02:03.600 |
So specifically, I'm going to be loading up a dataset of bird calls. 00:02:07.040 |
These are just like short samples of, I think, 10 different classes of birds calling. 00:02:12.600 |
And so we need to understand like, okay, well, this is a totally different domain, right? 00:02:17.400 |
If you look at the data, like, let's look at an example of the data. 00:02:19.840 |
This is coming from a hugging face dataset so that that line of code will download it 00:02:23.560 |
automatically if you haven't got it before, right? 00:02:27.840 |
So this will download it into a cache and then sort of handle a lot of that for you. 00:02:33.760 |
You created this — is this already a dataset you found somewhere else, or did you make it, or what? 00:02:38.360 |
This is a subset that I made from a much larger dataset of longer call recordings and from 00:02:45.720 |
So they collect all of these sound recordings from people. 00:02:48.320 |
They have experts who help identify what birds are calling. 00:02:51.600 |
And so all I did was find the audio peaks, like where is there most likely to be a bird 00:02:56.840 |
call and clip around those just to get a smaller dataset of things where there's actually something there. 00:03:03.920 |
Not a particularly amazing dataset in terms of like the recordings have a lot of background 00:03:08.960 |
noise and stuff, but a fun, small audio one to play with. 00:03:12.200 |
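To make the data-loading step concrete, here's a minimal sketch of pulling an audio dataset from the Hugging Face Hub with the datasets library. The dataset id and the column names below are placeholders, not necessarily what the notebook uses.

```python
# Hypothetical sketch of loading an audio dataset from the Hugging Face Hub.
# The dataset id below is a placeholder, not the one used in the lesson.
from datasets import load_dataset

ds = load_dataset("some-user/bird-calls-10", split="train")  # downloads into the local HF cache
sample = ds[0]
audio = sample["audio"]["array"]          # 1-D float waveform
sr = sample["audio"]["sampling_rate"]     # e.g. 32_000 samples per second
print(audio.shape, sr, sample.get("label"))
```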
Um, yeah. And so when we talk about audio, you've got a microphone somewhere, it's reading 00:03:16.640 |
like a pressure level, essentially, um, in the air with these sound waves. 00:03:20.960 |
And it's doing that some number of times per second. 00:03:23.000 |
So we have a sample rate and in this case, the data has a sample rate of 32,000 samples per second. 00:03:29.520 |
So it's a waveform that's being approximated — there's lots of little up-across, up-across, 00:03:34.520 |
up across kind of things, basically, correct? 00:03:39.480 |
Um, and so that's great for, you know, capturing the audio, um, but it's not so good for modeling 00:03:44.800 |
because we now have 32,000 values per second in this one big, one D array. 00:03:51.040 |
Um, and so yeah, you can try and find models that can work with that kind of data. 00:03:57.000 |
Uh, but what we're going to do is a little hack and we're instead going to use something a bit different. 00:04:05.200 |
The raw waveform is just too big and slow to work with. 00:04:11.000 |
Um, but also you have some, like some sound waves are at a hundred Hertz, right? 00:04:18.400 |
So they're, they're going up and down a hundred times a second and some are at a thousand Hertz. 00:04:23.080 |
And often there's background noise that can have extremely high frequency components. 00:04:27.260 |
And so if you're looking just at the waveform, there's lots and lots of change from second to second. 00:04:33.560 |
And there's some very long range dependencies of like, Oh, it's generally high here. 00:04:39.240 |
And so it can be quite hard to capture those patterns. 00:04:41.240 |
Um, and so part of it is, it's just a lot of samples to deal with. 00:04:44.880 |
Um, but part of it also is that it's not like an image where you can just do a convolution 00:04:51.040 |
and things nearby each other tend to be related or something like that. 00:04:54.240 |
Um, it's quite tricky to disentangle what's going on. 00:04:57.920 |
Um, and so we have this idea of something called a spectrogram. 00:05:01.760 |
Uh, this is a fancy 3D visualization, but it's basically just taking that audio and breaking it up into its frequencies. 00:05:09.120 |
So you can see as time goes by, we're moving along the X axis, and then on the Y axis is frequency. 00:05:15.480 |
And so the, um, the peaks here show like intensity at different frequencies. 00:05:19.980 |
And so if I make a pure note, you can see that that maps to a single peak in the frequency domain. 00:05:27.640 |
Um, but when I'm talking, there's lots and lots of peaks, and that's because our voices have lots of harmonics. 00:05:33.920 |
So if I go, you can see there's a main note, but there's also these subsequent harmonics. 00:05:38.880 |
And if I play something like a chord, you can see this, you know, maybe three main peaks 00:05:47.440 |
and then each of those have these harmonics as well. 00:05:50.360 |
Um, so it captures a lot of information about the signal. 00:05:53.560 |
Um, and so we're going to turn our audio data into something like this, where even just 00:05:59.720 |
visually, if I play a bird call, you can see this really nice spatial pattern. 00:06:04.960 |
And the hope is if we can generate that and then if we can find some way to turn it back 00:06:09.160 |
into audio and then we'll be off to the races. 00:06:13.720 |
And so, yeah, that's what I'm doing in this notebook. 00:06:15.800 |
We have, um, I'm leaning on the diffusers pipelines.audio_diffusion Mel class. 00:06:25.240 |
And so within the realm of spectrograms, there's a few different ways you can do it. 00:06:29.640 |
So this is from the torchaudio docs, um, but this notebook uses the Hugging Face diffusers one. 00:06:35.280 |
So we had that waveform, that's those raw samples and we'd like to convert that into 00:06:39.840 |
what they call the frequency domain, um, which is things like these spectrograms. 00:06:44.520 |
Um, and so you can do a normal, normal spectrogram, a power spectrogram or something like that. 00:06:50.960 |
Um, but we often use something called a mel spectrogram, which is the same idea. 00:06:56.440 |
It's actually probably what's being visualized here. 00:06:59.320 |
And it's something that's designed to map the, like the frequency ranges into a range 00:07:06.600 |
that's, um, like tied to what human hearing is based on. 00:07:12.280 |
And so rather than trying to capture all frequencies from, you know, zero Hertz to 40,000 Hertz, 00:07:17.440 |
a lot of which we can't even hear, it focuses in on the range of values that we tend to 00:07:25.400 |
And also it does like a transformation into, into kind of like a log space. 00:07:31.720 |
Um, so that the, the intensities like highs and lows correspond to loud and quiet for our ears. 00:07:37.320 |
So it's very tuned for, um, the types of audio information that we actually might care about, 00:07:43.240 |
rather than, you know, the tens of kilohertz that only bats can hear. 00:07:48.720 |
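To make that concrete, here's a minimal sketch of the waveform-to-mel-spectrogram step using torchaudio rather than the helper class used in the notebook; the parameter values are illustrative, not the lesson's.

```python
import torch, torchaudio

sr = 32_000                                   # sample rate of the bird-call clips
wave = torch.randn(1, sr * 2)                 # stand-in for a 2-second waveform

# MelSpectrogram maps the waveform onto mel-scaled frequency bins;
# AmplitudeToDB then converts the power values to a log (decibel) scale,
# which is the "log space" being described above.
to_mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_fft=1024,
                                              hop_length=256, n_mels=128)
to_db = torchaudio.transforms.AmplitudeToDB(top_db=80)

spec = to_db(to_mel(wave))                    # shape: (1, 128 mel bins, time frames)
print(spec.shape)
```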
So we're going to rely on a class to abstract this away, but it's going to basically give 00:07:52.200 |
us a transformation from waveform to spectrogram. 00:07:56.320 |
And then it's also going to help us go from spectrogram back to waveform. 00:08:03.320 |
I have this to_image function that's going to take the audio array. 00:08:07.400 |
It's going to use the Mel, um, class to handle turning that into, um, spectrograms. 00:08:14.320 |
And the class also does things like it splits it up into chunks based on, you can set like 00:08:18.400 |
a desired, um, resolution — I'd like a 128 by 128 spectrogram. 00:08:25.320 |
So it knows you need 128, like, frequency bins for the frequency axis and so many steps for the time axis. 00:08:33.640 |
So it kind of handles that converting and resizing. 00:08:36.720 |
Um, and then it gives us this audio_slice_to_image method. 00:08:40.520 |
So that's taking a chunk of audio and turning it into the spectrogram. 00:08:49.200 |
So in our dataset we're just referencing our original audio dataset, but we're calling that to_image function and 00:08:55.240 |
then turning it into a tensor and we're mapping it to minus 0.5 to 0.5, similarly 00:09:01.920 |
to what we've done with like the grayscale images in the past. 00:09:05.080 |
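A rough sketch of that to_image step, assuming the Mel helper from diffusers' audio-diffusion pipeline; its import path and constructor arguments have changed across diffusers versions, so treat the details as assumptions rather than the notebook's exact code.

```python
# Hedged sketch of the waveform -> spectrogram-image conversion described above.
import numpy as np, torch
from diffusers.pipelines.audio_diffusion.mel import Mel  # path varies by diffusers version

mel = Mel(x_res=128, y_res=128, sample_rate=32_000)       # 128x128 spectrogram "images"

def to_image(raw_audio):
    mel.load_audio(raw_audio=np.asarray(raw_audio))       # hand it the raw waveform
    return mel.audio_slice_to_image(0)                    # first 128x128 slice as a PIL image

def to_tensor(raw_audio):
    im = to_image(raw_audio)
    # scale 0..255 pixel values to -0.5..0.5, like the grayscale images earlier
    return torch.tensor(np.array(im), dtype=torch.float32)[None] / 255 - 0.5
```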
Um, so if you look at a sample from that data, we now have, instead of an audio waveform 00:09:10.200 |
of 32,000 or 64,000 samples if it's two seconds, we now have this 128 by 128 pixel spectrogram. 00:09:25.720 |
Um, but we can test out going from the spectrogram back to audio using the image_to_audio function 00:09:32.560 |
that the Mel class has, um, and that should give us something we can listen to. Now, this isn't perfect, because 00:09:43.080 |
the spectrogram shows the intensity at different frequencies, but with audio you've also got the phase. 00:09:50.040 |
And so this image_to_audio function is actually behind the scenes doing a kind of iterative 00:09:55.800 |
approximation, um, with something called the Griffin-Lim algorithm. 00:09:59.600 |
Um, so I'm not going to try and describe that here, but it's just, it's approximating. 00:10:06.620 |
It's creating a spectrogram, it's comparing that to the original, it's updating, it's doing 00:10:10.400 |
some sort of like iterative, very similar to like an optimization thing to try and generate 00:10:15.240 |
an audio signal that would produce the spectrogram, which we're trying to invert. 00:10:19.480 |
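As an aside, torchaudio ships building blocks that perform this kind of iterative phase reconstruction; the hedged sketch below (with made-up parameter values) shows the idea, and is not what the Mel class does internally line-for-line.

```python
import torch, torchaudio

n_fft, hop, sr, n_mels = 1024, 256, 32_000, 128
mel_power = torch.rand(1, n_mels, 251)        # stand-in mel power spectrogram (not in dB)

# Step 1: estimate a full linear-frequency spectrogram from the mel bins.
inv_mel = torchaudio.transforms.InverseMelScale(n_stft=n_fft // 2 + 1, n_mels=n_mels,
                                                sample_rate=sr)
# Step 2: Griffin-Lim iteratively guesses the missing phase so that the
# reconstructed waveform's spectrogram matches the one we have.
griffin_lim = torchaudio.transforms.GriffinLim(n_fft=n_fft, hop_length=hop, n_iter=64)

wave = griffin_lim(inv_mel(mel_power))        # approximate waveform, shape (1, samples)
```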
So just to clarify, so my understanding, what you're saying is that the spectrogram is a lossy representation. 00:10:30.880 |
Um, and specifically it's lossy because it, um, tells you the kind of intensity at each 00:10:38.800 |
point, but it's not — it's kind of like, is it like the difference between a sine wave and a cosine wave? 00:10:47.320 |
So coming back to the sound, you do have to get that, that shifting of the phase correct. 00:10:53.160 |
And so it's trying to guess something, and it sounds like it's not doing a great job of it. 00:11:01.940 |
Um, but yes, the, the spectrogram back to audio task, this, these dotted lines are like 00:11:07.240 |
highlighting this is, yeah, it's an approximation and there are deep learning methods now that 00:11:11.480 |
can do that better, or at least that sound much higher quality, um, because you can train 00:11:17.360 |
a model somehow to go from this image-like representation back to an audio signal. 00:11:24.160 |
Um, but we just use the approximation for this notebook. 00:11:28.600 |
So now that we can represent our data as like a grayscale 128 by 128 pixel image, um, everything 00:11:35.080 |
else becomes very much the same as the previous diffusion models examples. 00:11:38.800 |
We're going to use this noisify function to add different amounts of noise. 00:11:43.520 |
And so we can see now we have our spectrograms, but with varying amounts of noise added. 00:11:47.320 |
We can create a simple diffusion model. 00:11:50.920 |
I'm just copying and pasting the previous model, but with one extra layer, um, just with very few 00:11:56.320 |
channels, going from 128 down to 64, 32, 16 and 8, um, no attention. 00:12:04.960 |
Just I think pretty much copied and pasted from notebook 30, uh, and trained for a while. 00:12:19.440 |
Um, so specifically, this is the simple diffusion model that you, um, I think I've already introduced. 00:12:27.120 |
We only briefly looked at it, so let's remind ourselves of what it does here. 00:12:33.200 |
Um, so we have some number of down blocks with a specified number of channels. 00:12:37.680 |
And then the key insight from simple diffusion was that you often want to concentrate the 00:12:41.440 |
computes in the sort of middle at the low resolution. 00:12:44.480 |
So that's these, these mid blocks, and they're transformers. 00:12:51.360 |
Um, and so we can stack some number of those, and then, um, there's the corresponding up path with skip connections. 00:12:58.680 |
So we're passing in the features from the, the down path as we go through those up blocks. 00:13:03.640 |
Um, and so we're going to go take an, um, image and time step. 00:13:10.880 |
We're going to go through our down blocks, saving the results, and then we're going to go through the mid blocks. 00:13:21.800 |
And before that, you've also got the, um, embedding of the, uh, locations — that's the learnable positional embedding. 00:13:34.360 |
So this is preparing it to go through the transformer blocks by adding some learnable position embeddings. 00:13:41.240 |
And then we reshape it to be effectively a sequence, since that's how we had written 00:13:48.920 |
our transformer — to expect a 1D sequence of embeddings. 00:13:52.840 |
Um, and so once you've gone through those mid blocks, we reshape it back and then we 00:13:57.080 |
go through the up blocks passing in and also our saved outputs from the down path. 00:14:06.320 |
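Here is a hedged, pseudocode-level sketch of the forward pass just described; the attribute names are invented for illustration and will not match notebook 30 exactly.

```python
# Illustrative only: down blocks with saved skips, learnable positional embedding,
# transformer mid blocks on a flattened sequence, then up blocks consuming the skips.
def forward(self, x, t):
    emb = self.t_emb(t)                        # timestep embedding
    saved = []
    for block in self.downs:                   # down path, saving skip activations
        x = block(x, emb); saved.append(x)
    n, c, h, w = x.shape
    x = x.reshape(n, c, h * w).transpose(1, 2) # flatten to a 1-D sequence of embeddings
    x = x + self.pos_emb                       # learnable positional embedding
    for block in self.mids:                    # transformer blocks at the lowest resolution
        x = block(x)
    x = x.transpose(1, 2).reshape(n, c, h, w)  # back to a spatial feature map
    for block in self.ups:                     # up path, consuming the saved skips
        x = block(x, emb, saved.pop())
    return self.out(x)
```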
And you can really control how much parameters and compute you're doing just by setting, 00:14:11.280 |
like what are the number of features or channels at each of those down block stages and how 00:14:18.960 |
Um, and so if you want to scale it up, it's quite easy to say, oh, let me just add more 00:14:23.520 |
Maybe I'll add more channels, um, to the, to the down and up paths. 00:14:27.360 |
Um, and there's a very easy model to tweak to get a larger or smaller model. 00:14:32.280 |
One fun thought, I know, is, um, simple diffusion only came out a couple of months ago, and 00:14:38.160 |
I think ours might be the first publicly available code for it, because I don't know of any other public release. 00:14:45.840 |
I suspect this is probably the first time maybe it's ever been used to generate audio 00:14:53.680 |
Um, I know a couple of people who've at least privately done their implementations when 00:14:57.640 |
I asked the author if he was releasing code, he said, oh, but it's simple. 00:15:07.160 |
I don't know the exact line, but they were like, oh, you can see the pseudocode. 00:15:16.120 |
So it trains, the loss goes down as we hope, um, and sampling is exactly the same as generating images before. 00:15:22.660 |
Um, and that's going to give us the spectrograms. 00:15:25.840 |
I'm using DDIM with a hundred steps, um, and to actually listen to these samples, we 00:15:31.960 |
then are just going to use that, um, image_to_audio function again, to take our grayscale spectrograms back to audio. 00:15:38.840 |
Um, and in this case, actually it expects a PIL image. 00:15:41.440 |
So I first converted it to PIL, um, and then turn that back into audio. 00:15:47.200 |
And so we can play some of the generated samples. 00:15:56.680 |
I don't know that I could guarantee what bird is making these calls, and some of them are a bit strange. 00:16:13.000 |
So yeah, that's generating fake bird calls with, with, um, spectrogram diffusion. 00:16:19.720 |
Um, so there's the Riffusion project, which is based on text, and yeah, there's, there's various other like 00:16:28.720 |
pre-trained models that do diffusion on spectrograms to produce, um, you know, music clips or voice and so on. 00:16:40.560 |
Riffusion is actually this stable diffusion model that's, that's fine-tuned specifically 00:16:44.840 |
for, for this, for the spectrogram generation, which is, which I find very impressive. 00:16:49.960 |
It's like a model that was originally for, you know, text to image, but instead it can also do text to spectrogram. 00:16:55.680 |
I guess there's still some useful information in, you know, the sort of text image model 00:17:00.400 |
that kind of generalizes, or can still be used, for text to audio. 00:17:04.560 |
So I found that a very interesting, impressive application as well. Riffusion is an awesome project. 00:17:16.080 |
And I guess since it's a latent model that leads us onto the next topic, right? 00:17:18.840 |
I was just going to say, we've got a natural segue there. 00:17:22.360 |
So, um, if we want to replicate Riffusion, then, um, we'll need latents. 00:17:33.520 |
So the, the final non-NLP part of stable diffusion is this, uh, ability to use the more compressed 00:17:41.200 |
representation, uh, created by a VAE, called latents, um, instead of pixels. 00:17:48.080 |
Um, so we're going to start today by creating a VAE, taking a look at how it works. 00:17:54.520 |
Um, so to remind you, as we learned back in the, the first lesson of this part of part 00:18:01.160 |
two, um, the VAE model converts the, um, 256 by 256 pixel, three channel image into a, um, four channel, 32 by 32 latent, which is 00:18:27.000 |
so dramatically smaller, which makes life so much easier, um, which is, which is really helpful. 00:18:35.840 |
Um, having said that, you know, simple diffusion does the first, you know, few, in fact, you 00:18:44.680 |
know, all the downsampling pretty quickly and, and all the hard work happens, you know, at the lower resolutions. 00:18:52.000 |
So maybe it's, you know, with simple diffusion, it's not as big a deal as it used to be, but 00:18:56.560 |
it's still, you know, it's very handy, particularly because for us folks with more normal amounts 00:19:01.760 |
of compute, we can take advantage of all that hard work that the stability.ai computers 00:19:08.960 |
did for us by creating the stable diffusion VAE. 00:19:13.000 |
Um, so that's what we're going to do today, but first of all, we're going to create our own VAE. 00:19:23.360 |
So the first stuff is just the, the normal setup. 00:19:26.500 |
One thing I am going to do for this simple example though, is I'm going to flatten the, 00:19:31.120 |
um, fashion MNIST pixels into a vector to make it as simple as possible. 00:19:40.840 |
So we've, we're going to end up with vectors of length 784, because 28 by 28 is 784, uh, we're 00:19:48.080 |
going to create a single hidden layer MLP with, um, 400, um, hidden and then 200 outputs. 00:20:03.440 |
So it's a sequential containing a linear layer and then an optional activation function. 00:20:10.000 |
Um, we'll update init weights so that we initialize linear layers as well. 00:20:16.480 |
Um, so before we create a VAE, which is a variational autoencoder, we'll create a normal autoencoder. 00:20:23.800 |
We've done this once before and we didn't have any luck. 00:20:27.440 |
Um, in fact, we were so unsuccessful that we decided to go back and create a learner 00:20:33.060 |
and come back a few weeks later once we knew what we were doing. 00:20:39.400 |
Um, so we're just going to recreate an autoencoder just like we did some lessons ago. 00:20:44.920 |
Um, so there's going to be an encoder, which is a sequential, which goes from our 784 inputs 00:20:50.840 |
to our 400 hidden, and then a linear layer from 400 hidden to 400 hidden, and then an output layer from 00:20:56.860 |
the 400 hidden to the 200 outputs of the encoder. 00:21:05.400 |
And then the decoder will go from those 200 latents to our 400 hidden, have our hidden-to-hidden layer, and then go back up to the 784 inputs. 00:21:23.800 |
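A minimal sketch of that plain autoencoder (784 to 400 to 200 and back); the choice of activation here is an assumption, not necessarily the notebook's.

```python
import torch
from torch import nn

# Minimal sketch of the plain autoencoder described above: 784 -> 400 -> 200
# and back again. The activation (SiLU) is an assumption.
def lin(ni, nf, act=nn.SiLU): return nn.Sequential(nn.Linear(ni, nf), act())

ae = nn.Sequential(
    # encoder: inputs -> hidden -> hidden -> latents
    lin(784, 400), lin(400, 400), nn.Linear(400, 200),
    # decoder: latents -> hidden -> hidden -> inputs (logits, for BCE with logits)
    lin(200, 400), lin(400, 400), nn.Linear(400, 784),
)

xb = torch.rand(16, 784)       # a batch of flattened Fashion-MNIST images
print(ae(xb).shape)            # torch.Size([16, 784])
```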
So we can optimize that in the usual way using Adam, um, and we'll do it for 20 epochs — it runs 00:21:34.200 |
pretty quickly cause it's quite a small dataset and quite a small model. 00:21:38.440 |
Um, and so what we can then do, um, is we can grab a batch of our X's — well, I've actually grabbed one already. 00:21:53.440 |
So I've got a batch of images, um, and we can put it through our model, um, pop it back 00:22:02.360 |
on the CPU and we can then have a look at our original mini-batch and we have to reshape 00:22:09.960 |
it to 28 by 28 because we previously had flattened it. 00:22:13.840 |
So there's our original and then, um, we can look at the result after putting it through the autoencoder. 00:22:22.760 |
And as you can see, it's, you know, very roughly regenerated. 00:22:27.600 |
And so this is, um, not a massive compression, it's compressing it from 784 to 200. 00:22:35.080 |
And it's also not doing an amazing job of recreating the original details. 00:22:38.660 |
Um, but you know, this is the simplest possible auto encoder. 00:22:42.360 |
So it's doing, you know, it's a lot better than our previous attempt. 00:22:49.480 |
So what we could now do is we could just generate some noise, and then we're not even going to use the encoder. 00:22:56.080 |
So we're going to go and say like, okay, we've got a decoder. 00:22:58.160 |
So let's just decode that noise and see what it creates. 00:23:06.560 |
I mean, I could kind of recognize that might be the start of a shoe. 00:23:12.920 |
I don't know, but it's not doing anything amazing. 00:23:16.140 |
So we have not successfully created an image generator here, um, but there's a very simple 00:23:22.480 |
step we can do to make something that's more like an image generator. 00:23:26.240 |
The problem is that, um, for these 200, um, this vector of length 200 we're recreating from, there's no 00:23:35.880 |
particular reason that things that are not in the dataset are going to create items of clothing. 00:23:44.400 |
We haven't done anything to try to make that happen. 00:23:47.040 |
We've only tried to make this work for things in the dataset, you know, and, um, therefore, 00:23:54.560 |
when we just randomly generate a bunch of, you know, a vector of length 200 or 16 vectors 00:24:01.840 |
of length 200 in this case, um, and then decode them, there's no particular reason to think 00:24:07.440 |
that they're going to create something that's recognizable as clothing. 00:24:13.840 |
So the way a VAE tries to fix this is by, we've got the exact same encoder as before, except now there are two final layers. 00:24:31.520 |
I'll explain why there's two of them in a moment. 00:24:34.020 |
So we've got the inputs to hidden, the hidden to hidden, and then the hidden to the latent. 00:24:39.680 |
The decoder is identical, okay, latent's to hidden, hidden to hidden, hidden to inputs. 00:24:49.280 |
And then just as before, we call the encoder, um, but we do something a little bit weird 00:24:59.240 |
next, which is that we actually have two separate final layers. 00:25:05.080 |
We've got one called mu for the final layer of the encoder, and one called LV, which stands for log variance. 00:25:20.460 |
So we've now got two 200-long lots of encoded latents. 00:25:27.840 |
What we do is we use them to generate random numbers, and the random numbers have a mean of mu. 00:25:41.640 |
So when you take a random zero-one — so this creates random numbers with mean zero and standard deviation one — 00:25:49.780 |
So if we add mu to it, they now have a mean of mu or approximately. 00:25:54.760 |
And if you multiply the random numbers by e to the power of half of the log of variance, 00:26:02.560 |
So given this log of variance, this is going to give you standard deviation. 00:26:08.720 |
So this is going to give you a standard deviation of e to the half LV and a mean of mu. 00:26:17.640 |
It doesn't matter too much, but if you think about it, um, standard deviation is the square root of the variance. 00:26:26.460 |
So when you take the log, you can move that half into the multiplication, because of the rules of logarithms. 00:26:36.680 |
That's why we've just got the half here instead of the square root, which would be raising to the power of a half. 00:26:44.240 |
So this is just, yeah, this is just the standard deviation. 00:26:48.140 |
So we've got the standard deviation times normally distributed random noise plus mu. 00:26:52.060 |
So we end up with normally distributed numbers, we're going to have 200 of them for each element 00:27:01.160 |
of the batch, where they have a mean which is the result of this final layer, and a log variance 00:27:11.880 |
which is the result of this other final layer. 00:27:16.980 |
And then finally we passed that through the decoder as usual. 00:27:21.320 |
I'll explain later why we pass back three things, but for now we're just worried about the first one. 00:27:27.440 |
So what this is going to do is it's going to generate, um, the, the result of calling, 00:27:34.960 |
um, encode is going to be a little bit random. 00:27:40.480 |
On average, you know, it's still generating exactly the same as before, which is the result 00:27:45.640 |
of a sequential model — you know, an MLP with one hidden layer — but it's also going to add some randomness. 00:27:55.860 |
So this is, here's the bit, which is exactly the same as before. 00:27:58.320 |
This is the same as calling encode before, but then here's the bit that adds some randomness 00:28:04.120 |
And the amount of randomness is also itself random. 00:28:11.920 |
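Here's a hedged sketch of that encode-with-randomness step: two final layers for mu and log variance, then the reparameterization mu + exp(lv/2) * noise. The layer sizes follow the text; everything else is an assumption.

```python
import torch
from torch import nn

class VAE(nn.Module):
    # Hedged sketch of the model described above; layer sizes follow the
    # narration (784 -> 400 -> 200), other details (e.g. SiLU) are assumptions.
    def __init__(self, n_in=784, n_hid=400, n_lat=200):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_in, n_hid), nn.SiLU(),
                                 nn.Linear(n_hid, n_hid), nn.SiLU())
        self.mu = nn.Linear(n_hid, n_lat)   # one final layer for the means
        self.lv = nn.Linear(n_hid, n_lat)   # and one for the log variances
        self.dec = nn.Sequential(nn.Linear(n_lat, n_hid), nn.SiLU(),
                                 nn.Linear(n_hid, n_hid), nn.SiLU(),
                                 nn.Linear(n_hid, n_in))

    def forward(self, x):
        h = self.enc(x)
        mu, lv = self.mu(h), self.lv(h)
        # reparameterization: std = exp(lv/2), so z ~ N(mu, exp(lv))
        z = mu + (0.5 * lv).exp() * torch.randn_like(mu)
        return self.dec(z), mu, lv          # the three things passed back
```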
Um, okay, so if we now just, um, well, you know, trained that, right, using the result 00:28:21.800 |
of the decoder and using, um, I think we didn't use MSE loss. 00:28:25.800 |
We used a binary cross entropy loss, which we've seen before. 00:28:29.840 |
Um, so if you've forgotten, you should definitely go back and rewatch that bit in part one. 00:28:34.840 |
Um, or we've done a bit of it in part two as well, binary cross entropy loss, um, with 00:28:41.720 |
logits means that you don't have to worry about doing the sigmoid. 00:28:46.800 |
Um, so if we just, um, optimize this using BCE now, you would expect, and it would, I 00:28:55.480 |
believe — I haven't checked — um, that it would basically take this final layer 00:28:59.840 |
here and drive the standard deviations all to zero, um, as a result of which it would have no variance 00:29:04.720 |
at all. Um, and therefore it would behave exactly the same as the previous auto encoder. 00:29:15.360 |
Yeah. Okay. Um, so that wouldn't help at all because what we actually want is we want some 00:29:21.040 |
variance and the reason we want some variance is we actually want to have it generate some 00:29:28.560 |
latents, which are not exactly our data. They're around our data, but not exactly our data. 00:29:35.000 |
And then when it generates latents that are around our data, we want them to decode to 00:29:40.640 |
our, to the same thing. We want them to decode to the correct image. And so as a result, 00:29:45.920 |
if we can train that, right, something that it does include some variation and still decodes 00:29:53.160 |
back to the original image, then we've created a much more robust model. 00:29:58.360 |
And that's something where we would hope then, when we say, 00:30:02.400 |
okay, well now decode some noise that it's going to decode to something better than this. 00:30:08.320 |
So that's the idea of a VAE. So how do we get it to create, um, a variance which 00:30:18.520 |
doesn't just go to zero? Um, well, we have a second, uh, loss term; it's called the KL 00:30:26.040 |
divergence loss. We've got a function called kld_loss. And what we're going to do is our VAE 00:30:31.280 |
loss is going to take the binary cross entropy between the actual decoded bit. So that's 00:30:40.320 |
input zero and the target. Okay. So that's, this is exactly the same as before as this 00:30:46.440 |
binary cross entropy. And we're going to add it to this KLD loss, KL divergence. Now KL 00:30:52.720 |
divergence, the details don't matter terribly much. What's important is when we look at 00:30:57.560 |
the KLD loss, it's getting past the input and the targets, but if you look, it's not 00:31:03.520 |
actually using the targets at all. So if we pull out in the, the input into its three 00:31:10.840 |
pieces, which is our predicted image, our mu and our log variance, we don't use this 00:31:17.000 |
either. So the BCE loss only uses the predicted image and the actual image. The KL divergence 00:31:23.920 |
loss only uses mu and log variance. And all it does is it returns a number, which says, 00:31:35.360 |
um, for each item in the batch, um, is mu close to zero and is the variance close to 00:31:41.920 |
one. How does it do that? Well, for mu, it's very easy: mu squared. So if mu is close to 00:31:50.920 |
zero, then minimizing mu squared does exactly that, right? Um, if mu is one, then mu squared 00:31:57.360 |
is one. If mu is minus one, mu squared is one. If mu is zero, mu squared is zero. That's 00:32:03.320 |
the lowest you can get for a squared. Um, okay. So we've got a mu squared piece here, um, 00:32:14.440 |
and we've got a dot mean. So we're just taking, that's just basically taking the mean of all 00:32:17.720 |
the mus. And then there's another piece, which is we've got log variance minus e to the power 00:32:25.960 |
of log variance. So if we look at that, so let's just grab a bunch of numbers between 00:32:33.640 |
minus three and three and do number minus e to the power of that number. Um, and I'm just 00:32:40.160 |
going to pop in the one plus and the point five times as well. They don't matter much. And 00:32:44.120 |
you can see that's got a minimum of zero. So when that's a minimum of zero, e to the 00:32:51.840 |
power of that, which is what we're going to be using actually half times e to the power 00:32:57.800 |
of that, but that's okay. Is what we're going to be using in our, um, dot forward method. 00:33:05.400 |
That's going to be e to the power of zero, which is going to be one. So this is going 00:33:11.840 |
to be minimized where, um, log variance dot exp equals one. So therefore this whole piece 00:33:21.000 |
here will be minimized when mu is zero and LV is also zero. Um, and, and so therefore 00:33:31.600 |
e to the power of LV is one. Now, the reason that it's specifically this form is basically 00:33:40.280 |
because, um, there's a specific mathematical thing called the KL divergence, which compares 00:33:48.800 |
how similar two distributions are. And so the normal distribution can be fully characterized 00:33:54.640 |
by its mean and its variance. And so this is actually, more precisely, calculating the 00:34:00.460 |
similarity — specifically the KL divergence — between the distribution given by the mu and LV that we have 00:34:09.720 |
and a distribution with a mean of zero and a variance of one. Um, um, but you can see 00:34:16.920 |
hopefully why conceptually we have this mu.pow(2) and why we have this LV.exp — um, LV minus 00:34:26.480 |
LV.exp here. Um, so that is our VAE loss. Did you guys have anything to add to any of 00:34:38.720 |
that description? So maybe to highlight the, the, the objective of this is to say rather 00:34:44.020 |
than having it so that the exact point that an input is encoded to decodes back to that 00:34:50.160 |
input, we're saying, number one, the space around that point should also decode to that 00:34:55.080 |
input because we're going to try and force some variance. And number two, the overall 00:34:58.700 |
variance should be like, yeah, the, the overall space that it uses should be roughly zero 00:35:05.280 |
mean and unit variance, right? So instead of being able to like map each input to like an 00:35:12.000 |
arbitrary point and then decode only that exact point to an input, we're now mapping them 00:35:16.080 |
to like a restricted range. And we're saying that not, not just each point, but its surroundings 00:35:20.480 |
as well should also decode back to something that looks like that image. Um, and that's 00:35:25.200 |
trying to like condition this latent space to be much nicer so that any arbitrary point 00:35:29.960 |
within that, um, range will hopefully map to something useful, which is a harder problem 00:35:35.540 |
to solve, right? So we would expect given that this is exactly the same architecture, 00:35:41.020 |
we would expect its ability to actually decode would be worse than our previous attempt because 00:35:48.960 |
it's a harder problem that we're trying to solve, which is to just, we've got random 00:35:52.200 |
numbers in there as well now that we're hoping that this ability to generate images will 00:35:56.280 |
improve. Um, thanks, Jono. Okay. So I actually asked Bing about this, um, which is just, 00:36:07.760 |
this is more of an example of like, I think, you know, now that we've got GPT-4 00:36:12.480 |
and Bing and stuff, I find they're pretty good at answering questions that like I wanted 00:36:17.660 |
to explain to students what would happen if the variance of the latents was very low or 00:36:21.960 |
what if they were very high? So why do we want them to be one? And I thought like, Oh 00:36:25.760 |
gosh, this is hard to explain. So maybe Bing can help. So I actually thought it's pretty 00:36:30.800 |
good. So I'll just say what Bing said. So Bing says, if the variance of the latents 00:36:34.480 |
is very low, then the encoder distribution would be very peaked and concentrated around 00:36:41.320 |
the mean. So that was the thing we were describing earlier. If we had trained this without the 00:36:45.800 |
KLD loss at all, right, it would probably make the variance zero. And so therefore the 00:36:51.200 |
latent space would be less diverse and expressive and limit the ability of the decoder to reconstruct 00:36:56.280 |
the data accurately, make it harder to generate new data that's different from the training 00:37:00.560 |
data, which is exactly what we're trying to do. And if the variance is very high, then 00:37:07.000 |
the encoder would be very spread out and diffuse. It would be more, the latents would be more 00:37:11.440 |
noisy and random, make it easier to generate new data that's unrealistic or nonsensical. 00:37:19.560 |
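Before looking at training, here is a minimal sketch of the two-part loss that was described above, written in the standard VAE form; the notebook's exact code and scaling may differ.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of the two-part loss described above: reconstruction (BCE with
# logits) plus a KL term that pushes mu towards 0 and the variance towards 1.
def bce_loss(inp, tgt):
    return F.binary_cross_entropy_with_logits(inp[0], tgt)

def kld_loss(inp, tgt):
    # only uses mu and log variance; the targets aren't used at all
    _, mu, lv = inp
    return -0.5 * (1 + lv - mu.pow(2) - lv.exp()).mean()

def vae_loss(inp, tgt):
    return bce_loss(inp, tgt) + kld_loss(inp, tgt)
```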
Okay. So that's why we want it to be exactly at a particular point. So when we train this, 00:37:27.960 |
we can just pass VAE loss as our loss function, but it'd be nice to see how well it's going 00:37:33.280 |
at reconstructing the original image and how it's going at creating mean-zero, variance-one distributed 00:37:42.340 |
latents, separately. So what I ended up doing was creating just a really simple thing called 00:37:49.660 |
func_metric, which I derived from the capital-M Mean class in — just trying to 00:38:02.400 |
find it here — torcheval.metrics. So they've already got something that can just 00:38:08.160 |
calculate means. So obviously this stuff's all very simple and we've created our own 00:38:11.600 |
metrics class ourselves back a while ago. And since we're using torcheval, I thought 00:38:15.440 |
this is useful to see how we can create one, a custom metric where you can pass in some 00:38:20.640 |
function to call before it calculates the mean. So if you call, so you might remember 00:38:28.340 |
that the way torcheval works is it has this thing called update, which gets passed the input 00:38:32.400 |
and the targets. So I add to the weighted sum, the result of calling some function on 00:38:39.360 |
the input and the targets. So we want two kinds of new metrics. One is one we're going 00:38:50.080 |
to print out as KLD, which is a func_metric on kld_loss, and one we want to print out 00:38:54.800 |
as BCE, which is a func_metric on bce_loss. And so, when we call the learner, 00:39:02.440 |
the loss function we'll use is VAE loss, but we're going to pass in as metrics, this additional 00:39:14.160 |
metrics to print out. So it's just going to print them out. And in some ways it's a little 00:39:17.920 |
inefficient because it's going to calculate KLD loss twice and BCE loss twice, one to 00:39:23.000 |
print it out and one to go into the, you know, actual loss function, but it doesn't take 00:39:27.560 |
long for that bit. So I think that's fine. So now when we call learn.fit, you can see 00:39:33.040 |
it's printing them all out. So the BCE that we got last time was 0.26. And so this time, 00:39:42.400 |
yeah, it's not as good. It's 0.31 because it's a harder problem and it's got randomness 00:39:47.400 |
in it. And you can see here that the BCE and KLD are pretty similar scale when it starts. 00:39:56.000 |
That's a good sign. If they weren't, you know, I could always in the loss function scale 00:40:01.240 |
one of them up or down, but they're pretty similar to start with. So that's fine. So 00:40:06.760 |
we train this for a while and then we can use exactly the same code for sampling as 00:40:12.160 |
before. And yeah, as we suspected, its ability to decode is worse. So it's actually not capturing 00:40:21.240 |
the detail at all, in fact, and the shoes have got very blurry. But the hope is that when we 00:40:30.280 |
call the decoder on random noise, it's much better. We're getting — 00:40:36.920 |
it's not amazing, but we are getting some recognizable shapes. So, you know, VAEs are, 00:40:45.080 |
you know, not generally going to get you as good a results as diffusion models are, although 00:40:52.360 |
actually if you train really good ones for a really long time, they can be pretty impressive. 00:40:57.240 |
But yeah, even in this extremely simple, quick case, we've got something that can generate 00:41:01.800 |
recognizable items of clothing. Did you guys want to add anything before we move on to 00:41:08.520 |
the stable diffusion VAE? Okay. So this VAE is very crappy. And as we mentioned, one of 00:41:24.680 |
the key reasons to use a VAE is actually that you can benefit from all the compute time 00:41:30.840 |
that somebody else has put into training a good VAE. 00:41:36.240 |
Just also like one thing when you say good VAE, the one that we've trained here is good 00:41:42.160 |
at generating because it maps down to this, like, one 200-dimensional vector and then 00:41:47.040 |
back in a very useful way. And like, if you look at VAEs for generating, they'll often 00:41:52.200 |
have a pretty small dimension in the middle and it'll just be like this vector that gets 00:41:57.560 |
mapped back up. And so VAE that's good for generating is slightly different to one that's 00:42:01.760 |
good for compressing. And like the stable diffusion one, we'll see has this like special components 00:42:06.440 |
still, it doesn't map it down to a single vector, it maps it down to 64 by 64 or whatever. 00:42:12.840 |
And I think that's smaller than the original, but for generating, we can't just put random 00:42:17.160 |
noise in there and hope like a cohesive image will come out. So it's less good as a generator, 00:42:23.920 |
but it is good because it has this like compression and reconstruction ability. 00:42:27.480 |
Cool. Yeah. So let's take a look. Now, to demonstrate this, we want to move to a more 00:42:38.120 |
difficult task because we want to show off how using Latents let us do stuff we couldn't 00:42:44.760 |
do well before. So the more difficult task we're going to do is generating bigger images 00:42:53.080 |
and specifically generate images of bedrooms using the LSUN bedrooms dataset. So LSUN 00:43:01.960 |
is a really nice dataset, which has many, many, many millions of images across 10 scene categories 00:43:17.920 |
and 20 object categories. And so it's very rare for people to use the object categories, 00:43:25.600 |
to be honest, but people quite often use the scene categories. The problem, more than 00:43:31.440 |
a little, is that they can be extremely slow to download, because the website they come from is very 00:43:35.200 |
often down. So what I did was I put a subset of 20% of them onto AWS. They kindly provide 00:43:46.000 |
some free dataset hosting for our students. And also the original LSUN is in a slightly 00:43:52.920 |
complicated form. It's in something called an LMDB database. And so I turned them into 00:43:56.520 |
just normal images in folders. So you can download them directly from the AWS dataset 00:44:04.480 |
site that they've provided for us. So I'm just using fast core to save it and then using 00:44:13.200 |
Python's shutil to unpack the gzipped tar file. Okay. So that gives us the data once that runs, which 00:44:23.520 |
is going to take a long time. And, you know, it might be, you know, even more reliable 00:44:35.080 |
just to do this in the shell with wget or aria2c or something than doing it through 00:44:41.240 |
Python. So this will work, but if it's taking a long time or whatever, maybe just delete 00:44:44.760 |
it and do it in the shell instead. Okay. So then I thought, all right, how do we turn 00:44:54.880 |
these into Latents? Well, we could create a dataset in the usual ways. It's going to 00:45:04.480 |
have a length. So we're going to grab all the files. So glob is built into Python, 00:45:11.920 |
with which we'll search, in this case, for *.jpg. And if you've got **/ in the pattern, 00:45:19.280 |
that's going to search recursively, as long as you pass recursive=True. So we're going to search 00:45:24.480 |
for all of the jpeg files inside our data slash bedroom folder. So that's what this is 00:45:36.160 |
going to do. It's going to put them all into the files attribute. And so then when we get 00:45:41.000 |
an item, the ith item, it will find the ith file. It will read that image. So this is 00:45:48.400 |
PyTorch's read image. It's the fastest way to read a jpeg image. People often use PIL, 00:45:58.040 |
but it's quite hard to find a really well optimized PIL version that's really compiled 00:46:03.400 |
fast, whereas the PyTorch Torch Vision team have created a very, very fast read image. 00:46:11.320 |
That's why I'm using theirs. And if you pass in ImageReadMode.RGB, it will automatically 00:46:18.600 |
turn any one channel, black and white images, into three channel images for you. Or if there 00:46:23.100 |
are four channel images with transparency, it will convert those too. So this is a nice way 00:46:26.960 |
to make sure they're all the same. And then this turns it into floats from nought to one. 00:46:35.120 |
And these images are generally very close to 256 by 256 pixels. So I just crop out 00:46:40.480 |
the top 256 by 256 bit, because I didn't really care that much. And we do need them to all 00:46:49.000 |
be the same size in order that we can then pass them to the stable diffusion VAE decoder 00:46:55.640 |
as a batch. Otherwise it's going to take forever. So I can create a data loader that's going 00:47:01.680 |
to go through a bunch of them at a time. So 64 at a time. And use however many CPUs I 00:47:10.320 |
have as the number of workers. It's going to do it in parallel. And so the parallel 00:47:16.000 |
bit is the bit that's actually reading the JPEGs, which is otherwise going to be pretty 00:47:21.000 |
slow. So if we grab a batch, here it is. Here's what it looks like. Generally speaking, they're 00:47:27.000 |
just bedrooms, although we've got one pretty risque situation in the bedroom. But on the 00:47:32.280 |
whole, they're safe for work. This is the first time I've actually seen an actual 00:47:36.720 |
bedroom scene taking place, as it were. All right. So as you can see, this mini batch 00:47:44.320 |
of, if I just grab the first 16 images, has three channels and 256 by 256 pixels. So that's 00:47:56.560 |
how big that is for 16 images. So that's 3,145,728 — so about 3.145 million floats to represent this. 00:48:10.120 |
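As a recap, here's a hedged sketch of the dataset and data loader just described; the folder path, crop and batch size follow the narration, the rest is an assumption.

```python
import os, torch
from glob import glob
from torch.utils.data import DataLoader
from torchvision.io import read_image, ImageReadMode

# Hedged sketch of the dataset described above; assumes every image is at
# least 256 pixels each way, as the LSUN bedroom images are.
class ImagesDS:
    def __init__(self, path):
        self.files = glob(str(path) + "/**/*.jpg", recursive=True)
    def __len__(self): return len(self.files)
    def __getitem__(self, i):
        im = read_image(self.files[i], mode=ImageReadMode.RGB)  # always 3 channels
        return im[:, :256, :256].float() / 255                  # crop and scale to 0..1

ds = ImagesDS("data/bedroom")
dl = DataLoader(ds, batch_size=64, num_workers=os.cpu_count())  # parallel JPEG decoding
xb = next(iter(dl))                                             # shape: (64, 3, 256, 256)
```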
Okay. So as we learned in the first lesson of part two, we can grab an autoencoder directly 00:48:20.080 |
using diffusers using from pre-trained. We can pop it onto our GPU. And importantly, 00:48:28.320 |
we don't have to say with torch.no_grad anymore if we call requires_grad_(False). And remember 00:48:35.720 |
this neat trick in PyTorch, if it ends in an underscore, it actually changes the thing 00:48:39.840 |
that you're calling in place. So this is going to stop it from computing gradients, which 00:48:45.040 |
would take a lot of time and a lot of memory otherwise. So let's test it out. Let's encode 00:48:52.760 |
our mini batch. And so just like Johnno was saying, this has now made it much smaller. 00:48:58.920 |
For our batch of 16, it's now a four channel, 32 by 32. So if we 00:49:06.480 |
compare the previous size to the new size, it's 48 times smaller. So that's 48 times 00:49:13.960 |
less memory it's going to need. And it's also going to be a lot less compute for a convolution 00:49:19.360 |
to go across that image. So it's no good unless we can turn it back into the original image. 00:49:26.520 |
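Here's a hedged sketch of the whole encode/decode round trip being discussed, using the diffusers AutoencoderKL; the particular model id and the input scaling to roughly -1..1 are assumptions rather than the lesson's exact code.

```python
import torch
from diffusers import AutoencoderKL

# Hedged sketch: load a pre-trained SD VAE, freeze it, encode and decode a batch.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema").cuda()
vae.requires_grad_(False)                    # trailing underscore: modify in place

xb = torch.rand(16, 3, 256, 256).cuda()      # stand-in for a batch of bedrooms in 0..1
# the SD VAE is usually fed images scaled to about -1..1 (an assumption here)
latents = vae.encode(xb * 2 - 1).latent_dist.mean   # (16, 4, 32, 32): 48x fewer numbers
recon = vae.decode(latents).sample           # back to (16, 3, 256, 256)
```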
So let's just have a look at what it looks like first. Now it's a four channel image, 00:49:29.540 |
so we can't naturally look at it. But what I could do is just grab the first three channels. 00:49:36.600 |
And then they're not going to be between 0 and 1. So if I just do dot sigmoid, now they're 00:49:41.320 |
between 0 and 1. And so you can see that our risque bedroom scene, you can still recognize 00:49:46.540 |
it. Or this bedroom, this bed here, you can still recognize it. So there's still that 00:49:53.400 |
kind of like the basic geometry is still clearly there. But it's, yeah, it's clearly changed 00:50:00.840 |
it a lot as well. So importantly, we can call decode on this 48 times smaller tensor. And 00:50:13.560 |
it's really, I think, absolutely remarkable how good it is. I can't tell the difference 00:50:22.840 |
to the original. Maybe if I zoom in a bit. Her face is a bit blurry. Was her face always 00:50:34.760 |
a bit blurry? No, it was always a bit blurry. First, second, third. Oh, hang on. Did that 00:50:44.760 |
used to look like a proper ND? Yeah, OK. So you can see this used to say that clearly 00:50:49.360 |
there's an ND here. And now you can't see those letters. So and this is actually a classic 00:50:56.800 |
thing that's known for this particular VAE is it's not able to regenerate writing correctly 00:51:06.260 |
at small font sizes. I think it's also — like, I think here the faces are 00:51:12.360 |
already pretty low resolution. But if you are at a higher resolution, the faces also 00:51:16.320 |
would probably not be reconstructed appropriately. OK, cool. But overall, yeah, it's done a great 00:51:24.400 |
job. A couple of other things I wanted to note was like, so like you mentioned, like 00:51:29.440 |
a, I guess, a factor of 48 decrease. Oftentimes people refer mostly to the spatial resolution. 00:51:37.280 |
So since it's going from 256 by 256 to 32 by 32. So that's like a factor of eight. So 00:51:45.200 |
they sometimes will denote it as, I think it's like, f8 or something like this. They'll note 00:51:48.720 |
the spatial resolution. So sometimes you may see that written out like that. And of course, 00:51:54.760 |
it is an eight squared decrease in the number of pixels, which is interesting. Right. Right. 00:52:02.200 |
And then the other thing I want to note was that the VAE is also trained with with a perceptual 00:52:09.480 |
loss objective, as well as, technically, like a, like a discriminator — a GAN — objective. 00:52:16.840 |
I don't know if you were going to go into that later now. So, yeah, so perceptual loss, 00:52:22.440 |
we've we've already discussed. Right. So the VAE is going to you know, when they trained 00:52:28.520 |
it. So I think this was trained by CompVis, right — the, you know, Robin and gang — and 00:52:40.200 |
stability.ai donated compute for that. And they — well, to be clear, actually, no, the VAE 00:52:48.160 |
was actually trained separately. And it's actually trained on the Open Images dataset. 00:52:53.320 |
And it was just this VAE that they trained by themselves on, you know, a small subset 00:52:58.280 |
of data. But because the VAE is so powerful, it's actually able to be applied to all these 00:53:04.360 |
other data sets as well. Okay, great. Yeah. So they, so they would have had a KL divergence 00:53:13.960 |
loss and they would have either had an MSE or BCE loss. I think it might have been an 00:53:17.480 |
MSE loss. They also had a perceptual loss, which is the thing we learned about when we 00:53:22.880 |
talked about super resolution, which is where when they compared the the output images to 00:53:30.080 |
the original images, they would have run that through a, you know, ImageNet trained or similar 00:53:38.440 |
classifier and confirmed that the activations they got through that model were similar. And 00:53:45.560 |
then the final bit, as Tanishk was mentioning, is the adversarial loss, which is also known 00:53:55.280 |
as a, as a GAN loss. So a GAN is a generative adversarial network. And the GAN loss — what 00:54:06.040 |
it does is — it's actually more specifically what's called a patchwise GAN loss. And what 00:54:17.800 |
it does is it takes like a little section of an image. Right. And what they've done 00:54:26.200 |
is they train — let's just simplify it for a moment and imagine that they've pre-trained — 00:54:32.120 |
a classifier, right, where they've basically got something that you can pass it a real, 00:54:38.680 |
you know, patch from a bedroom scene and a and a fake patch from a bedroom scene. And 00:54:53.400 |
they both go into the what's called the discriminator. And this is just a normal, you know, ResNet 00:55:08.160 |
or whatever, which basically outputs something that either says, yep, the the image is real 00:55:22.040 |
or nope, the image is fake. So sorry, I said you pass in two things — that was 00:55:26.640 |
wrong. You just pass in one thing and it returns either it's real or it's fake. And specifically, 00:55:30.880 |
it's going to give you something like the probability that it's real. There is another 00:55:36.440 |
version. I don't think it's what they use. You pass in two and it tells you which one's 00:55:40.080 |
more real. Do you remember, Tanishk — is it a relativistic GAN or a normal GAN? I think it's 00:55:45.160 |
a normal one. Yeah. So the relativistic GAN is when you pass in two images and it says which 00:55:49.000 |
is more real. The one they use, if we remember correctly, is a regular GAN, which 00:55:54.000 |
just tells you the probability that it's real. And so you can just train that by passing 00:55:59.600 |
in real images and fake images and having it learn to classify which ones are real and 00:56:04.520 |
which ones are fake. So now that once you've got that model trained, then as you train 00:56:12.160 |
your GAN, you pass in the patches of each image into the discriminator. So let's call 00:56:21.480 |
D here, right? And it's going to spit out the probability that that's real. And so if it's 00:56:29.120 |
spat out 0.1 or something, then you're like, oh, dear, that's terrible. Our VAE is spitting 00:56:38.580 |
out pictures of bedrooms where the patches of it are easily recognized as not real. But 00:56:45.560 |
the good news is that's going to generate derivatives, right? And so those derivatives 00:56:51.320 |
then is going to tell you how to change the pixels of the original generated image to 00:56:57.480 |
make it trick the GAN better. And so what it will do is it will then use those derivatives 00:57:05.280 |
as per usual to update our VAE. And the VAE in this case is going to be called a generator, 00:57:16.120 |
right? That's the thing that's generating the pixels. And so the generator gets updated 00:57:21.360 |
to be better and better at tricking the discriminator. And after a while, what's going to happen 00:57:27.540 |
is the generator is going to get so good that the discriminator gets fooled every time, 00:57:32.920 |
right? And so then at that point, you can fine-tune the discriminator better by putting in your 00:57:39.880 |
better generated images, right? And then once your discriminator learns again how to recognize 00:57:44.960 |
the difference between real and fake, you can then use it to train the generator. And 00:57:50.600 |
so this is kind of ping-ponging back and forth between the discriminator and the generator. 00:57:56.120 |
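To make the ping-pong concrete, here's an illustrative sketch of a patch discriminator and the two losses involved; it is a toy example, not the code used to train the stable diffusion VAE.

```python
import torch
from torch import nn
import torch.nn.functional as F

# Illustrative patch discriminator: a small conv net whose output is a grid of
# real/fake logits, one per image patch (its receptive field), not one per image.
disc = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(128, 1, 4, padding=1),          # one logit per patch
)

def d_loss(real, fake):
    # train the discriminator: real patches -> 1, generated patches -> 0
    rl, fl = disc(real), disc(fake.detach())  # detach: don't update the generator here
    return (F.binary_cross_entropy_with_logits(rl, torch.ones_like(rl)) +
            F.binary_cross_entropy_with_logits(fl, torch.zeros_like(fl)))

def g_adv_loss(fake):
    # adversarial loss for the generator (the VAE): penalized when its
    # outputs fail to fool the discriminator into saying "real"
    logits = disc(fake)
    return F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
```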
Like when GANs were first created, people were finding them very difficult to train. And 00:58:04.040 |
actually a method we developed at Fast AI, I don't know if we were the first to do it 00:58:08.520 |
or not, was this idea of kind of pre-training a generator just using perceptual loss and 00:58:16.160 |
then pre-training a discriminator to be able to beat the generator, and then ping-ponging 00:58:20.320 |
backwards and forwards between them. After that, basically whenever the discriminator 00:58:25.480 |
got too good, start training the generator. Anytime the generator got too good, start training the 00:58:30.360 |
discriminator. Nowadays, that's pretty standard, I think, to do it this way. And so, yeah, 00:58:38.560 |
this GAN loss, which is basically saying penalize for failing to fool the discriminator is called 00:58:47.040 |
an adversarial loss. To maybe motivate why you do this, if you 00:58:59.000 |
just did it with a mean squared error or even a perceptual loss with such a high compression 00:59:05.880 |
ratio, the VAEs tend to produce a fairly blurry output because it's not sure whether there's 00:59:11.200 |
texture or not in this image or the edges aren't super well defined where they'll be because 00:59:17.520 |
it's going from one four-dimensional thing up to this whole patch of the image. And so 00:59:24.800 |
it tends to be a little bit blurry and hazy because it's kind of hedging its bets, whereas 00:59:29.980 |
that's something that the discriminator can quite easily pick up. Oh, it's blurry. It must 00:59:34.760 |
be fake. And so then having the discriminator — that is, the adversarial loss — is just kind of 00:59:39.480 |
saying, even if you're not sure exactly where this texture goes, rather go with a sharper 00:59:43.920 |
looking texture that looks real than with some blurry thing that's going to minimize 00:59:49.920 |
your MSE. And so it tricks it into kind of faking this higher resolution looking sharper 00:59:56.600 |
output. Yeah. And I'm not sure if we're going to come 01:00:02.400 |
back and train our own GAN at some point, but if you're interested in training your 01:00:10.660 |
own GAN or-- you shouldn't call it a GAN, right? I mean, nowadays, we never really just 01:00:17.080 |
use a GAN. We have an adversarial loss as part of a training process. So if you want 01:00:21.160 |
to learn how to use adversarial loss in detail and see the code, the 2019 fast.ai course, Lesson 01:00:28.520 |
7 in part 1, has a walkthrough. So we have sample code there. And maybe given time, 01:00:35.400 |
we'll come back to it. OK. So quite often, people will call the VAE 01:00:49.680 |
encoder when they're training a model, which to me makes no sense, right? Because the encoded 01:00:55.360 |
version of an image never changes unless you are using data augmentation and want to do 01:01:01.360 |
augmentation on-- sorry, to encode augmented images. I think it makes a lot more sense 01:01:07.520 |
to just do a single run through your whole training set and encode everything once. So 01:01:13.800 |
naturally, the question is then, well, where do you save that? Because it's going to be 01:01:17.040 |
a lot of RAM. If you put this, leave it in RAM. And also, as soon as you restart your 01:01:22.040 |
computer, we've lost all that work. There's a very nifty file format you can use called 01:01:27.680 |
a memory mapped numpy file, which is what I'm going to use to save our latents. A memory 01:01:36.000 |
mapped numpy file is basically-- what happens is you take the memory in RAM that numpy would 01:01:44.480 |
normally be using, and you literally copy it onto the hard disk, basically. That's what 01:01:53.560 |
they mean by memory mapped. There's a mapping between the memory in RAM and the memory in 01:01:58.800 |
hard disk. And if you change one, it changes the other, and vice versa. They're kind of 01:02:02.200 |
two ways of seeing the same thing. And so if you create a memory mapped numpy array, 01:02:10.280 |
then when you modify it, it's actually modifying it on disk. But thanks to the magic of your 01:02:16.120 |
operating system, it's using all kinds of beautiful caching and stuff to not make that 01:02:22.620 |
slower than using a normal numpy array. And it's going to be very clever at-- it doesn't 01:02:31.680 |
have to store it all in RAM. It only stores the bits in RAM that you need at the moment 01:02:36.360 |
or that you've used recently. It's really nifty at caching and stuff. So it's kind of-- 01:02:41.360 |
it's like magic, but it's using your operating system to do that magic for you. So we're 01:02:46.760 |
going to create a memory mapped file using np.memmap. And so it's going to be stored somewhere 01:02:53.660 |
on your disk. So we're just going to put it here. And we're going to say, OK, so create 01:02:58.840 |
a memory map file in this place. It's going to contain 32-bit floats. So write the file. 01:03:06.000 |
And the shape of this array is going to be the size of our data set, so 303,125 images. 01:03:13.840 |
And each one is 4 by 32 by 32. OK. So that's our memory mapped file. And so now we're going 01:03:22.200 |
to go through our data loader, one mini batch of 64 at a time. And we're going to VAE encode 01:03:32.120 |
that mini batch. And then we're going to grab the means from its latents. We don't want 01:03:40.000 |
random numbers. We want the actual midpoints, the means. So this is using the diffusers 01:03:48.200 |
version of that VAE. So pop that onto the CPU after we're done. And so that's going 01:03:55.120 |
to be a mini batch of size 64 as a PyTorch tensor. Let's turn that into NumPy because PyTorch doesn't 01:04:01.360 |
have a memory mapped thing, as far as I'm aware, but NumPy does. And so now that we've 01:04:05.480 |
got this memory mapped array called a, then everything initially from 0 up to 64, not 01:04:18.120 |
including the 64, that whole sub part of the array is going to be set to the encoded version. 01:04:24.040 |
So it looks like we're just changing it in memory. But because this is a magic memory 01:04:30.160 |
mapped file, it's actually going to save it to disk as well. So yeah, that's it. Amazingly 01:04:36.840 |
enough. That's all you need to create a memory mapped NumPy array of our latents. When you're 01:04:43.120 |
done, you actually have to call dot flush. And that's just something that says like anything 01:04:47.200 |
that's just in cache at the moment, make sure it's actually written to disk. And then I 01:04:53.760 |
delete it because I just want to make sure that then I read it back correctly. So that's 01:04:58.720 |
only going to happen once, if the path doesn't exist. And then after that, this whole thing 01:05:04.120 |
will be skipped. And instead, we're going to call np.memmap again with that same path, the same 01:05:09.640 |
data type and the same shape, but this time we're going to read it. Mode equals 01:05:14.400 |
'r' means read it. And so let's check it. Let's just grab the first 16 latents that we read 01:05:24.000 |
and decode them. And there they are. OK. 01:05:34.960 |
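To make that concrete, here is a minimal sketch of the memmap round trip. The VAE checkpoint, the data loader `dl` (assumed to yield batches of 256x256 images scaled to [-1, 1] plus labels), the file path and the batch size are assumptions for illustration, not the notebook's exact code:

```python
import numpy as np
import torch
from diffusers import AutoencoderKL

# Frozen Stable Diffusion VAE (assumed checkpoint); we never need gradients here.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema").cuda().requires_grad_(False)

mmap_path = "data/bedroom_latents.dat"           # hypothetical location
n_items = 303_125                                # size of the 20% LSUN bedrooms subset
a = np.memmap(mmap_path, dtype=np.float32, mode="w+", shape=(n_items, 4, 32, 32))

i = 0
for xb, _ in dl:                                 # `dl` is an assumed image data loader
    with torch.no_grad():
        lat = vae.encode(xb.cuda()).latent_dist.mean   # the means, not random samples
    lat = lat.cpu().numpy()
    a[i:i + len(lat)] = lat                      # writes through to the file on disk
    i += len(lat)

a.flush()                                        # push anything still cached out to disk
del a

# Read it back: same path, dtype and shape, but mode="r" (read-only).
lats = np.memmap(mmap_path, dtype=np.float32, mode="r", shape=(n_items, 4, 32, 32))
```

Slicing `lats[:16]` and running the result through `vae.decode` is the sanity check described here.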
So this is not a very well-known technique, I would say, sadly. But it's a really good one. You might be wondering, well, what 01:05:41.240 |
about like compression? Like shouldn't you be zipping them or something like that? But 01:05:45.720 |
actually remember, these latents are already-- the whole point is they're highly compressed. 01:05:52.200 |
So generally speaking, zipping latents from a good VAE doesn't do much. Because they almost 01:06:01.400 |
look a bit random-number-ish. OK. So we've now saved our entire LSUN bedroom dataset, that's 01:06:09.560 |
the 20% subset, the bit that I've provided, as latents. So we can now run it through-- 01:06:18.040 |
this is a nice thing. We can use exactly the same process from here on in as usual. OK. 01:06:23.560 |
So we've got the noisify in our usual collation function. Now, the latents have a standard deviation much higher 01:06:34.880 |
than 1. So if we divide them by about 5, that takes them back to a standard 01:06:39.080 |
deviation of about 1. I think in the paper they use like 0.18 or something. But this 01:06:47.120 |
is close enough to make it a unit standard deviation. So we can split it into a training 01:06:54.960 |
and a validation set. So just grab the first 90% of the training set and the last 10% for 01:07:01.080 |
the validation set. So those are our data sets. We use a batch size of 128. So now we 01:07:07.620 |
can use our data loaders class we created with the getDLs we created. So these are all 01:07:11.520 |
things we've created ourselves with the training set, the validation set, the batch size, and 01:07:17.680 |
our collation function. So yeah, it's kind of nice. It's amazing how easy it is. A data 01:07:26.360 |
set has the same interface as a NumPy array or a list or whatever. So we can literally 01:07:34.260 |
just use the NumPy array directly as a data set, which I think is really neat. This is 01:07:40.120 |
why it's useful to know about these foundational concepts, because you don't have to start 01:07:45.400 |
thinking like, oh, I wonder if there's some torch vision thing to use memmap NumPy files 01:07:50.720 |
or something. It's like, oh, wait, they already do provide a data set interface. I don't have 01:07:55.040 |
to do anything. I just use them. So that's pretty magical. 01:08:02.000 |
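As a rough stand-in for that setup, building on the `lats` array above: plain PyTorch `DataLoader`s rather than the course's `get_dls`, and a simplified noisify-style collate using the cosine `abar` schedule from earlier notebooks. The roughly 0.2 scale, the batch size of 128 and the 90/10 split come from the discussion here; everything else is an assumption:

```python
import math
import torch
from torch.utils.data import DataLoader, default_collate

scale = 0.2                                   # roughly 1/5; diffusers uses 0.18215 instead

def abar(t):                                  # cosine noise schedule, as in earlier notebooks
    return (t * math.pi / 2).cos() ** 2

def collate_ddpm(b):
    x0 = default_collate(b) * scale           # latents scaled to roughly unit std
    t = torch.rand(len(x0))                   # random timestep per item
    eps = torch.randn_like(x0)
    ab = abar(t).reshape(-1, 1, 1, 1)
    xt = ab.sqrt() * x0 + (1 - ab).sqrt() * eps
    return (xt, t), eps                       # the model learns to predict the noise

# The memmap supports len() and integer indexing, so it works directly as a dataset.
n_trn = int(len(lats) * 0.9)
tds, vds = lats[:n_trn], lats[n_trn:]
dls = (DataLoader(tds, batch_size=128, shuffle=True, collate_fn=collate_ddpm),
       DataLoader(vds, batch_size=128, collate_fn=collate_ddpm))
```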
So we can test that now by grabbing a batch. And so this is being noisified. And so here we can see our noisified images. 01:08:10.040 |
And so here's something crazy: we can actually decode noisified images. And so here's 01:08:19.600 |
I guess this one wasn't noisified much because it's a recognizable bedroom. And this is what 01:08:24.040 |
happens when you just decode random noise, something in between. So I think that's pretty 01:08:30.320 |
fun. Yeah, this next bit is all just copied from our previous notebook, create a model, 01:08:39.360 |
organize it, train for a while. So this took me a few hours on a single GPU. Everything 01:08:45.000 |
I'm doing is on a single GPU. Literally nothing in this course, other than the stable diffusion 01:08:50.000 |
stuff itself is trained on more than one GPU. The loss is much higher than usual. And that's 01:08:57.600 |
not surprising because it's trying to generate latent pixels, which are much more 01:09:05.360 |
precise as to exactly what they want to be. There's not like lots of pixels where the ones next 01:09:11.200 |
to each other are really similar or the whole background looks the same or whatever. A lot 01:09:15.200 |
of that stuff, it's been compressed out. It's a more difficult thing to predict latent pixels. 01:09:24.240 |
So now we can sample from it in exactly the same way that we always have using DDIM. But 01:09:29.920 |
now we need to make sure that we decode the result, because the things it samples are latents, 01:09:39.280 |
because the things we asked it to learn to predict are latents. 01:09:45.640 |
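A hypothetical sketch of that last step: eta-zero DDIM in latent space, then undoing the scale and decoding with the VAE. It assumes the trained noise predictor, called `model` here, takes `(x, t)`, and it reuses `abar`, `scale` and `vae` from the sketches above; the course's own `sample`/`ddim_step` functions differ in detail:

```python
import torch

@torch.no_grad()
def sample_ddim(model, sz=(16, 4, 32, 32), steps=100):
    x = torch.randn(sz).cuda()
    ts = torch.linspace(0.999, 0.0, steps + 1)
    for i in range(steps):
        t, t_prev = ts[i], ts[i + 1]
        ab, ab_prev = abar(t), abar(t_prev)
        eps = model(x, t.repeat(len(x)).cuda())            # predicted noise (assumed signature)
        x0_hat = (x - (1 - ab).sqrt() * eps) / ab.sqrt()   # predicted clean latent
        x = ab_prev.sqrt() * x0_hat + (1 - ab_prev).sqrt() * eps
    return x

lat_samples = sample_ddim(model)
imgs = vae.decode(lat_samples / scale).sample              # undo the ~1/5 scaling, then decode
```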
And so now we can take a look, and we have bedrooms. And some of them look pretty good. I think this one 01:09:54.860 |
looks pretty good. I think this one looks pretty good. This one, I don't have any idea 01:09:59.760 |
what it is. And this one, like clearly there's bedroomy bits, but there's something, I don't 01:10:07.480 |
know, there's weird bits. So the fact that we're able to create 256 by 256 pixel images 01:10:17.840 |
where at least some of them look quite good in a couple of hours, I can't remember how 01:10:23.340 |
long it took to train, but it's a small number of hours on a single GPU, is something that 01:10:26.520 |
was not previously possible. And we're in a sense, we're totally cheating because we're 01:10:33.360 |
using the stable diffusion VAE to do a lot of the hard work for us. But that's fine, 01:10:40.720 |
you know, because that VAE knows how to create all kinds of natural images and drawings and 01:10:45.600 |
portraits and oil paintings or whatever. So you can, I think, work in that latent space 01:10:53.720 |
quite comfortably. Yeah. Do you guys have anything you wanted to add about that? Oh, actually, 01:11:01.280 |
Tanishka, you've trained this for longer. I only trained it for 25 epochs. How long did 01:11:06.160 |
you, how many hours did you train it for? Cause you did, you did a hundred epochs, right? 01:11:10.200 |
Yes, I did a hundred epochs. I didn't keep track exactly, but I think it was about 15 01:11:14.400 |
hours on an A100. A single A100. Yeah. I mean, the results, yeah, I'll show it. It's 01:11:23.680 |
I guess maybe slightly better, but you know, I guess you can, I see maybe. No, it is definitely 01:11:34.360 |
slightly better. The good ones are certainly slightly better. Yeah. Yeah. Like the bottom 01:11:38.760 |
left one is better than any of mine, I think. So it's possible. Maybe at this point, we 01:11:43.520 |
just may need to use more data, I guess, cause I guess we were using a 20% subset. So maybe 01:11:48.560 |
having more of that data to provide more diversity or something like that, maybe that might help. 01:11:53.000 |
Yeah. Or maybe, have you tried doing the diffusers one for a hundred? No, I'm using this. Okay. 01:12:00.360 |
Our code here. Yeah. So I've got, all right. So I'll share my screen if you want to stop 01:12:06.000 |
sharing yours. So I do have, if we get around to this, maybe we can add the results back 01:12:17.440 |
to this notebook. Cause I do have a version that uses diffusers. So everything else is 01:12:21.840 |
identical. 25 epochs, except for the model: for the previous one, I was using our 01:12:31.920 |
own EmbUNet model. So I had to change the channels now to four, and the number of filters 01:12:37.760 |
I think I might've increased a bit. So then I tried using, yeah, the diffusers unet 01:12:46.400 |
with whatever their defaults were. And so I got, what did I get here? 243 with diffusers. 01:12:54.240 |
I got a little bit better, 239. And yeah, I don't know if they're obviously better or 01:13:07.160 |
not. Like, this is a bit weird. I think like, actually, another thing we could try maybe 01:13:17.000 |
is do a hundred epochs, but use the diffusers number of channels and stuff that they used 01:13:23.040 |
for. Cause I think the defaults that they use actually for diffusers are not the same 01:13:26.800 |
as stable diffusion. So maybe we could try a stable-diffusion-matched unet for a hundred 01:13:32.080 |
epochs. And if we get any nice results, maybe we can paste them into the bottom to show 01:13:36.400 |
people. Yeah. Yeah. Cool. Yeah. Do you guys have anything else to add at this point? All 01:13:49.600 |
right. So I'll just mention one more thought in terms of like a bit of an interesting project 01:13:56.680 |
people could play with. I don't know if this is too crazy. I don't think it's been done 01:14:02.680 |
before, but my thought was like, there was a huge difference in our super resolution. 01:14:08.720 |
Do you remember a huge difference in our super resolution results when we used a pre-trained 01:14:12.840 |
model and when we used perceptual loss, but particularly when we used a pre-trained model. 01:14:22.560 |
I thought we could use a pre-trained model, but we would need a pre-trained latent model, 01:14:27.240 |
right? We would want something where our, you know, downsampling backbone was a model 01:14:35.120 |
pre-trained on latents. And so I just want to show you what I've done and you guys, you 01:14:40.560 |
know, if anybody watching wanted to try taking this further, I've just done the first bit 01:14:45.960 |
for you to give you a sense, which is I've pre-trained an image net model, not tiny image 01:14:49.960 |
net, but a full image net model on latents as a classifier. And if you use this as a 01:14:56.120 |
backbone, you know, and also try maybe some of the other tricks that we found helpful, 01:15:00.400 |
like having res nets on the cross connections. These are all things that I don't think anybody's 01:15:04.520 |
done before. I don't know, the scientific literature is vast and I might've missed it, 01:15:09.160 |
but I've not come across anybody doing these tricks before. So obviously, like, one 01:15:09.160 |
of the interesting parts of this, which is designed to be challenging, is that we're using 01:15:21.400 |
bigger datasets now, but they're datasets that you can absolutely like run on a single 01:15:27.000 |
GPU, you know, a few tens of gigabytes, which fits on any modern hard drive easily. So these 01:15:37.400 |
like are good tests of your ability to kind of like move things around. And if you're 01:15:43.080 |
somewhere that doesn't have access to a decent internet connection or whatever, this might 01:15:47.160 |
be out of the question, in which case don't worry about it. But if you can, yes, try this 01:15:52.080 |
because it's good practice, I think, to make sure you can use these larger datasets. 01:15:59.500 |
So image net itself, you can actually grab from Kaggle nowadays. So they call it the 01:16:06.240 |
object localization challenge, but actually this contains the full image net dataset or 01:16:12.040 |
the version that's used for the image net competition. So I think people generally call 01:16:17.360 |
that ImageNet-1K. You just have to accept the terms because that has like some distribution 01:16:22.740 |
terms. Yeah, exactly. So you've got to kind of sign in and then join the competition and 01:16:28.040 |
then yeah, accept the terms. So you can then download the dataset or you can also download 01:16:35.520 |
it from Hugging Face. It'll be in a somewhat different format, but that'll work as well. 01:16:44.240 |
So I think I grabbed my version from Kaggle. So on Kaggle, you know, it's just a zip file, 01:16:49.720 |
you unzip it and it creates an ILSVRC directory, which I think is what they called the competition. 01:16:58.440 |
Yeah, ImageNet Large Scale Visual Recognition Challenge. Okay. So then inside there, there 01:17:07.320 |
is a Data directory, and inside there, there is a CLS-LOC directory, and that's 01:17:11.400 |
where actually everything's going to be. So just like before, I wanted to turn these all 01:17:16.680 |
into Latents. So I created in that directory, I created a Latents subdirectory and this 01:17:22.360 |
time partly just to demonstrate how these things work, I wanted to do it a slightly different 01:17:27.000 |
way. Okay. So again, we're going to create our pre-trained VAE, pop it on the GPU, turn 01:17:34.280 |
off gradients for it and I'm going to create a dataset. Now, one thing that's a bit weird 01:17:40.200 |
about this is that because this is really quite a big dataset, like it's got 1.3 million 01:17:49.520 |
files, the thing where we go glob star star slash star dot JPEG takes a few seconds, you 01:17:57.800 |
know, and particularly if you're doing this on like, you know, an AWS file system or something, 01:18:05.000 |
it can take really quite a long time. On mine, it only took like three seconds, but I don't 01:18:09.320 |
want to wait three seconds. So I, you know, a common trick for these kinds of big things 01:18:13.940 |
is to create a cache, which is literally just a list of the files. So that's what this, 01:18:19.560 |
this is. So I decided that Z pickle means a gzipped pickle. So what I do is, if 01:19:25.680 |
the cache exists, we just gzip dot open the files. If it doesn't, we use glob exactly 01:18:32.760 |
like before to find all the files. And then we also save a gzip file containing pickle 01:18:40.680 |
dot dump files. So pickle dot dump is what we use in Python to take basically any Python 01:18:45.960 |
object list of dictionaries and dictionary of lists, whatever you like, and save them. 01:18:52.000 |
And it's super fast, right? And I use gzip with compress level one to basically be like 01:18:58.080 |
compress it pretty well, but pretty fast. So this is a really nice way to create a little 01:19:05.480 |
cache of that. So this is the same as always. And so our get item is going to grab the file. 01:19:13.200 |
It's going to read it in, turn it into a float. And what I did here was, you know, I'm being 01:19:19.040 |
a little bit lazy, but I just decided to center crop the middle, you know, so let's say it 01:19:25.160 |
was a 300 by 400 file, it's going to center crop the middle 300 by 300 section, and then 01:19:32.080 |
resize it to 256 by 256, so they'll all be the same size. 01:19:42.480 |
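Here's roughly what that cached dataset might look like, as a sketch; the paths, the [-1, 1] scaling for the SD VAE, and returning the relative path alongside the image (so it can double as the label source and as the save location later) are assumptions of this illustration:

```python
import gzip, pickle
from pathlib import Path
from PIL import Image
import torchvision.transforms.functional as TF
from torch.utils.data import Dataset

path = Path("data/ILSVRC/Data/CLS-LOC")          # assumed Kaggle unzip location
cache = path / "files.zpkl"                      # "z pickle": a gzipped pickle of the file list

if cache.exists():
    with gzip.open(cache) as f:
        files = pickle.load(f)
else:
    files = list(path.glob("**/*.JPEG"))         # slow-ish over 1.3M files, hence the cache
    with gzip.open(cache, "wb", compresslevel=1) as f:
        pickle.dump(files, f)

class ImagesDS(Dataset):
    def __init__(self, files): self.files = list(files)
    def __len__(self): return len(self.files)
    def __getitem__(self, i):
        f = self.files[i]
        img = TF.to_tensor(Image.open(f).convert("RGB")) * 2 - 1   # floats in [-1, 1] for the VAE
        img = TF.center_crop(img, min(img.shape[-2:]))             # square crop from the middle
        img = TF.resize(img, (256, 256), antialias=True)
        return img, str(f.relative_to(path))                       # relative path keeps the label folder
```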
So yeah, we can now-- oh, I managed to create the VAE twice. So I can now just confirm: I can grab a batch from that data 01:19:48.840 |
loader, encode it. And here it is, and then decode it again. And here it is. So the first 01:19:55.560 |
category must have been computer or something. So here, as you can see, the VAE is doing 01:20:00.560 |
a good job of decoding pictures of computers. So I can do something really very similar 01:20:07.400 |
to what we did before. If we haven't got that destination directory yet, create it, go through 01:20:12.200 |
our data loader, encode a batch. And this time I'm not using a memmapped file, I'm actually 01:20:17.520 |
going to save separate NumPy files for each one. So go through each element of the batch, 01:20:23.920 |
each item. So I'm going to save it into the destination directory, which is the Latents 01:20:29.680 |
directory. And I'm going to give it exactly the same path as the original one contained, 01:20:34.800 |
because it contains the folder of what the label is. Make sure that the directory exists, 01:20:45.640 |
that we're saving it to, and save that just as a NumPy file. This is another way to do 01:20:52.080 |
it. So this is going to be a separate NumPy file for each item. Does that make sense so 01:20:58.960 |
far? Okay, cool. So I could create a thing called a NumPy data set, which is exactly 01:21:06.680 |
the same as our images data set. But to get an item, we don't have to use, you know, open 01:21:13.240 |
a JPEG anymore, we just call np.load. So this is a nice way to like take something you've 01:21:18.120 |
already got and change it slightly. So it's going to return the... 01:21:24.040 |
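Putting those two pieces together, a sketch of the per-item save loop and the matching dataset might look like this; `vae`, `ImagesDS`, `files` and `path` carry over from the sketches above, and the batch size and worker count are arbitrary:

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

dest = path / "latents"                          # destination directory for the .npy files
dl = DataLoader(ImagesDS(files), batch_size=64, num_workers=8)

if not dest.exists():
    dest.mkdir()
    for xb, rel_paths in dl:                     # the relative paths collate to a list of strings
        with torch.no_grad():
            lat = vae.encode(xb.cuda()).latent_dist.mean.cpu().numpy()
        for z, rp in zip(lat, rel_paths):
            out = (dest / rp).with_suffix(".npy")
            out.parent.mkdir(parents=True, exist_ok=True)   # mirror the class-folder structure
            np.save(out, z)

class NumpyDataset(Dataset):
    # Same shape of interface as the images dataset, but get-item is just np.load.
    def __init__(self, files): self.files = list(files)
    def __len__(self): return len(self.files)
    def __getitem__(self, i):
        f = self.files[i]
        return torch.from_numpy(np.load(f)), f.parent.name   # label comes from the folder name
```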
Why did you do this rather than the memory map file, Jeremy? Just out of interest? 01:21:30.880 |
Why did you do this versus the memory map file? Was it just to show a different way? 01:21:34.160 |
Just to show a different way. Yeah. Yeah. Absolutely no particularly good reason, honestly. Yeah, 01:21:43.240 |
I like to kind of like demonstrate different approaches. And I think it's good for people's 01:21:47.720 |
Python coding if you make sure you understand what all the lines of code do. Yeah, they 01:21:53.080 |
both work fine, actually. It's partly also for my own experimental interest. It's like, 01:21:59.560 |
oh, which one seems to kind of feel better? Yeah. All right. So create training and validation 01:22:09.420 |
data sets by grabbing all the NumPy files inside the training and validation folders. 01:22:15.680 |
And then I'm going to just create a training data loader for the training data set just 01:22:20.960 |
to see what the mean and standard deviation are on the channel dimension. So this is every 01:22:26.000 |
dimension except channel that I take the mean over. And so there it is. And as you can see there, 01:22:30.800 |
the mean and standard deviation are not close to zero and one. So we're going to store away 01:22:35.800 |
that mean and standard deviation so that we can then... We've seen transform data set before. 01:22:42.520 |
This is just applying a transform to a data set. We're going to apply the normalization 01:22:47.680 |
transform. In the past, we've used our own normalization; TorchVision has one 01:22:53.800 |
as well. So this is just demonstrating how to use TorchVision's version. But it's 01:22:58.320 |
literally just subtracting the mean and dividing by the standard deviation. We're also going 01:23:05.240 |
to apply some data augmentation. We're going to use the same trick we've used before for 01:23:10.520 |
images that are very small, which is we're going to add a little bit of padding and then 01:23:15.320 |
randomly crop our original image size from that. So it's just like shifting it slightly 01:23:21.960 |
each time. And we're also going to use our random erasing. And it's nice because we did 01:23:26.360 |
it all with broadcasting, this is going to apply equally well to a four-channel image 01:23:31.400 |
as it does to a three- or one-channel image, which I think is what we did originally. Now, I don't think anybody as far 01:23:39.960 |
as I know has built classifiers from Latents before. So I didn't even know if this is going 01:23:43.860 |
to work. So I visualized it. So we could have a tfm_x and a tfm_y. So for tfm_x, you 01:23:51.720 |
can optionally add augmentation. And if you do, then apply the augmentation transforms. 01:23:58.040 |
Now this is going to be applied one image at a time, but our augmentation transforms, some 01:24:01.920 |
of them expect a batch. So we create an extra unit axis on the front to be a batch of one 01:24:07.480 |
and then remove it again. And then tfm_y, very much like we've seen before, we're going 01:24:13.680 |
to turn those path names into IDs. So there's our validation and training transform datasets. 01:24:23.880 |
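As a rough standalone version of those steps, here is how the statistics, the TorchVision normalization, the augmentation (with the add-a-batch-axis trick), and the label mapping could fit together. The train/val folder layout, the padding and erasing parameters, and the use of TorchVision's RandomCrop/RandomErasing in place of the course's own implementations are all assumptions:

```python
import torch
import torchvision.transforms as T
from torch.utils.data import Dataset, DataLoader

trn_files = list((dest / "train").glob("**/*.npy"))     # assumed latent folder layout
val_files = list((dest / "val").glob("**/*.npy"))

# Channel statistics: mean/std over every dimension except the channel one.
xb, _ = next(iter(DataLoader(NumpyDataset(trn_files), batch_size=1024, shuffle=True)))
xmean, xstd = xb.mean(dim=(0, 2, 3)), xb.std(dim=(0, 2, 3))

norm = T.Normalize(xmean, xstd)                         # TorchVision's (x - mean) / std
aug = T.Compose([
    T.RandomCrop(32, padding=1),                        # pad a little, random-crop back to 32x32
    T.RandomErasing(p=0.5, scale=(0.02, 0.1), value="random"),
])

classes = sorted({f.parent.name for f in trn_files})
cls2id = {c: i for i, c in enumerate(classes)}

class TfmDS(Dataset):
    def __init__(self, ds, augment=False): self.ds, self.augment = ds, augment
    def __len__(self): return len(self.ds)
    def __getitem__(self, i):
        x, y = self.ds[i]
        x = norm(x)
        if self.augment:
            x = aug(x[None])[0]       # batch-expecting transforms: add a unit axis, then drop it
        return x, cls2id[y]           # tfm_y: folder name -> integer label

tds, vds = TfmDS(NumpyDataset(trn_files), augment=True), TfmDS(NumpyDataset(val_files))
```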
So that we can look at our results, we need a denormalization. So let's create our data 01:24:31.120 |
loaders and grab mini batches and show us. And so I was very pleased to see that the 01:24:38.480 |
random erasing works actually extremely nicely. So you can see you get these kind of like 01:24:42.560 |
weird patches, you know, weird patches. But they're still recognizable. So this is like 01:24:53.480 |
something I very, very often do is to answer like, oh, is this like thing I'm doing in 01:24:57.680 |
computer vision reasonable? It's like, well, can my human brain recognize it? So if I couldn't 01:25:02.160 |
recognize that this was a drilling platform myself, then I shouldn't expect a computer to be able 01:25:07.000 |
to do it, or that this is a compass or whatever. I'm so glad we got otters. So cute. And 01:25:13.840 |
you can see the cropping it's done has also been fine. Like it's a little bit of a fuzzy 01:25:18.120 |
edge, but basically like it's not destroying the image at all. They're still recognizable. 01:25:26.920 |
It's also a good example here of how difficult this problem is. Like, the fact that this 01:25:31.280 |
is seashore: I would have called this surf, you know, but maybe surf is not one of 01:25:35.080 |
the categories. Yeah. Okay. This could be food, but actually it's a refrigerator. Okay. 01:25:46.840 |
So our augmentation seems to be working well. So then I, yeah, basically I've just copied 01:25:52.040 |
and pasted, you know, our basic pieces here. And I kind of wanted to have it all in one 01:25:56.360 |
place just to remind myself of exactly what it is. So this is the preactivation version 01:26:00.760 |
of convolutions. The reason for that is if I want this to be a backbone for a diffusion 01:26:05.920 |
model or a unet, then I remember that we found that preactivation works best for unets. So 01:26:05.920 |
therefore our backbone needs to be trained with preactivation. So we've got a preactivation 01:26:17.340 |
conv, got a res block, res blocks model with dropouts. This is all just copied from previous. 01:26:28.640 |
So I decided like I wanted to try to, you know, use the basic trick that we learnt about from 01:26:34.760 |
simple diffusion of trying to put most of our work in the later layers. So the first 01:26:42.360 |
layer just has one block, then two blocks, and then four blocks. And then I figured that 01:26:48.680 |
we might then delete these final blocks. These maybe are going to just end up being for 01:26:53.560 |
classification. This might end up being our pre-trained backbone, or maybe we keep them. 01:26:57.960 |
I don't know. You know, it's like, as I said, this hasn't been done before. So anyway, I 01:27:02.960 |
tried to design it in a way that we've got some, you know, we can mess around a little 01:27:07.460 |
bit with how many of these we keep. And so also I tried to use very few channels in the 01:27:13.440 |
first blocks. And so I jump up the channels where most of the work is going to be done, a 01:27:20.800 |
jump from 128 to 512. So that's why I designed it this way. You know, I haven't even taken 01:27:28.080 |
it any further than this. So I don't know if it's going to be a useful backbone or not. 01:27:31.760 |
I didn't even know if this is going to be possible to classify. It seemed very likely 01:27:35.720 |
it was possible to classify, even based on the fact that you can still kind of recognize 01:27:39.200 |
it almost like I could probably recognize it's a computer maybe. So I thought it was 01:27:44.760 |
going to be possible. But yeah, this is all new. So that was the model I created. 01:27:50.040 |
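For concreteness, here is one way such a latent classifier could be laid out: pre-activation conv blocks, one/two/four ResBlocks per stage as described, a late jump to 512 channels, and a 1,000-way head. The exact channel plan, the BatchNorm/SiLU choices and the strides are assumptions, not the notebook's actual model:

```python
import torch
from torch import nn

def pre_conv(ni, nf, ks=3, stride=1):
    # Pre-activation ordering (norm -> act -> conv), so these blocks could later be
    # reused as the downsampling backbone of a pre-activation U-Net.
    return nn.Sequential(nn.BatchNorm2d(ni), nn.SiLU(),
                         nn.Conv2d(ni, nf, ks, stride=stride, padding=ks // 2))

class ResBlock(nn.Module):
    def __init__(self, ni, nf, stride=1):
        super().__init__()
        self.convs = nn.Sequential(pre_conv(ni, nf), pre_conv(nf, nf, stride=stride))
        self.idconv = nn.Identity() if ni == nf else nn.Conv2d(ni, nf, 1)
        self.pool = nn.Identity() if stride == 1 else nn.AvgPool2d(2)
    def forward(self, x): return self.convs(x) + self.idconv(self.pool(x))

def latent_classifier(nfs=(64, 128, 512, 512), nblocks=(1, 2, 4), n_classes=1000):
    layers = [nn.Conv2d(4, nfs[0], 3, padding=1)]                   # latents are 4 x 32 x 32
    for i, nb in enumerate(nblocks):
        layers += [ResBlock(nfs[i], nfs[i + 1], stride=2)]          # 32 -> 16 -> 8 -> 4
        layers += [ResBlock(nfs[i + 1], nfs[i + 1]) for _ in range(nb - 1)]
    layers += [nn.BatchNorm2d(nfs[-1]), nn.SiLU(),
               nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(nfs[-1], n_classes)]
    return nn.Sequential(*layers)
```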
And then I trained it for 40 epochs. And you can see after one epoch, it was already 25% accurate. 01:28:03.120 |
And that's it recognizing which one of a thousand categories is it. So I thought that was pretty 01:28:07.960 |
amazing. And so after 40 epochs, I ended up at 66%, which is really quite fantastic because 01:28:16.280 |
a ResNet 34 is kind of like 73% or 74% accuracy when trained for quite a lot longer. You know, 01:28:28.480 |
so to me, this is extremely encouraging that, you know, this is a really pretty good ResNet 01:28:35.640 |
at recognizing images from their latent representations without any decoding or whatever. So from 01:28:44.560 |
here, you know, if you want to, you guys could try, yeah, building a better bedroom diffusion 01:28:54.080 |
model or whatever you like. It doesn't have to be bedrooms. Actually, one of our colleagues 01:29:00.720 |
Molly, I'm just going to find it. So one of our colleagues Molly actually used the, do 01:29:08.640 |
you guys remember, was it the celeb faces that she used? So there's a CelebA-HQ data set 01:29:21.200 |
that consists of images of faces of celebrities. And so what Molly did was she basically used 01:29:29.240 |
this exact notebook, but used this faces data set instead. And this one's really pretty 01:29:36.220 |
good, isn't it? You know, this one's really pretty good. They certainly look like celebrities, 01:29:41.760 |
that's for sure. So yeah, you could try this data set or whatever, but yeah, try it. Yeah, 01:29:48.280 |
maybe try it with the pre-trained backbone, try it with ResNets on the cross connections, 01:29:55.160 |
try it with all the tricks we used in SuperRes, try it with perceptual loss. Some folks we 01:30:00.200 |
spoke to about the perceptual loss think it won't help with Latents because the underlying 01:30:07.480 |
VAE was already trained with perceptual loss, but we should try, you know, or you guys should 01:30:11.680 |
try all these things. Yeah, so be sure to check out the forum as well to see what other people 01:30:18.520 |
have already tried here because it's a whole new world. But it's just an example of the 01:30:23.520 |
kind of like fun research ideas I guess we can play with. Yeah, what do you guys think 01:30:28.760 |
about this? Are you like surprised that we're able to quickly get this kind of accuracy 01:30:33.200 |
from Latents or do you think this is a useful research path? What are your thoughts? Yeah, 01:30:39.560 |
I think it's very interesting. Oh, go ahead. I was going to say the Latents are already 01:30:43.080 |
like a slightly compressed, richer representation of an image, right? So it makes sense that 01:30:48.560 |
that's a useful thing to train on. And 66%, I think AlexNet is like 63% or something like 01:30:54.800 |
that. So, you know, we were already at state of the art, what, eight years ago, whatever. 01:31:01.000 |
It might be more like 10 years ago. I know time passes quickly. Yeah, yeah, I guess next 01:31:07.240 |
year. Yeah, next year it'll be 10 years ago. But yeah, I'm kind of curious with the pre-training 01:31:13.360 |
the whole, the whole value for me for like using a pre-trained network where someone 01:31:16.960 |
else has done lots and lots of compute on ImageNet to learn some features and I'm going 01:31:21.360 |
to use that because it's kind of funny to be like, oh, well, let's pre-train for ourselves 01:31:26.320 |
and then try and use that. I'm curious whether like how best you'd allocate that compute 01:31:32.240 |
whether you should, if you've got 10 hours of GPU, just do 10 hours of training versus 01:31:37.000 |
like five hours of pre-training and five hours of training. I mean, based on our super res 01:31:42.320 |
thing, the pre-training like was so much better. So that's why I'm feeling somewhat hopeful 01:31:49.360 |
about this direction. Yeah. Yeah, I'm really curious to see how it goes. 01:31:55.360 |
I guess I was going to say it's like, yeah, I think there's just a lot of opportunities 01:31:58.760 |
for, I guess, doing stuff in the latents. And like, I guess maybe, yeah, 01:32:04.600 |
you could, I mean, here you trained a classifier as a backbone, but you could think of like 01:32:08.480 |
training classifiers on other things for, you know, guidance or things like this. Yeah. 01:32:13.080 |
Of course, we've done some experiments with that. I know Jono has his mid-U guidance 01:32:18.040 |
approach for some of these sort of things, but there are different approaches that you 01:32:21.480 |
can play around here that, you know, exploring in the latent space can make it computationally 01:32:28.640 |
cheaper than, you know, having to decode it every time you want to, you know, like you 01:32:32.640 |
have to look at the image and then maybe apply a classifier, apply some sort of guidance 01:32:36.360 |
on the image. But if you can do it directly in the latent space, a lot of interesting 01:32:40.080 |
opportunities there as well. Yeah. And, you know, now we're showing that indeed. 01:32:44.120 |
Yeah, yeah, style transfer on latents, everything on latents. You can also do models. Like, that's 01:32:52.000 |
something I've done to make a latent CLIP: just have it try and mirror an image 01:32:56.520 |
space CLIP. And so for classifiers as well, you could distill an ImageNet classifier 01:33:01.440 |
rather than just having the label, you try and like copy the logits. And then that's 01:33:06.480 |
like an even richer signal, like you get more value per example. So then you can create 01:33:13.120 |
your latent version of some existing image classifier or object detector or multi-modal 01:33:20.280 |
model like CLIP. 01:33:26.200 |
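A hypothetical sketch of that distillation idea: a frozen pixel-space teacher supplies soft targets for each image, and the latent-space student is trained to match them with a temperature-scaled KL loss. The teacher choice, the paired loader `paired_dl` (aligned image/latent pairs, with the images already resized and normalized for the teacher), and the hyperparameters are all assumptions:

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50, ResNet50_Weights

teacher = resnet50(weights=ResNet50_Weights.DEFAULT).cuda().eval().requires_grad_(False)
student = latent_classifier().cuda()        # e.g. the latent model sketched earlier
opt = torch.optim.AdamW(student.parameters(), lr=1e-3)
temp = 2.0                                  # softmax temperature for soft targets

for imgs, lats in paired_dl:                # assumed: imgs preprocessed for the teacher
    with torch.no_grad():
        t_logits = teacher(imgs.cuda())
    s_logits = student(lats.cuda())
    loss = F.kl_div(F.log_softmax(s_logits / temp, dim=1),
                    F.softmax(t_logits / temp, dim=1),
                    reduction="batchmean") * temp ** 2
    opt.zero_grad(); loss.backward(); opt.step()
```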
I feel funny about this because I'm both excited about simple diffusion on the basis that it gets rid of latents, but I'm also excited about latents on the 01:33:29.440 |
basis of it gets rid of most of the pixels. I don't know how I can be cheering for both, 01:33:34.920 |
but somehow I am. I guess may the best method win. So, you know, the folks that are finishing 01:33:47.560 |
this course, well, first of all, congratulations, because it's been a journey, particularly 01:33:53.000 |
part two, it's a journey, which requires a lot of patience and tenacity. You know, if 01:33:58.840 |
you've zipped through by bingeing on the videos, that's totally fine. It's a good approach, 01:34:03.280 |
but you know, maybe go back now and do it more slowly and do the, you know, build it 01:34:08.960 |
yourself and really experiment. But assuming, you know, for folks who have got to the end 01:34:15.320 |
of this and feel like, okay, I get it more or less. Yeah. Do you guys have any sense 01:34:21.000 |
of like, what kind of things make sense to do now? You know, where would you guys go 01:34:28.840 |
from here? I think there are great opportunities implementing papers that, I guess, come along 01:34:35.520 |
these days. And I think at this stage, no way. Yeah. But also at this stage, I think, 01:34:44.760 |
you know, we're already discussing research ideas. And I think, you know, we're in a solid 01:34:49.080 |
position to come up with our own research ideas and explore, explore those ideas. So 01:34:53.800 |
I think that's a, that's a real opportunity that we have here. I think that's best done 01:34:59.520 |
often collaboratively. So I'll just, you know, mention that Fast AI has a Discord, which 01:35:06.400 |
if you've got to this point, then you're probably somebody who would benefit from, from being 01:35:11.640 |
there. And yeah, just pop your head in and say... like, there's an introductions thread 01:35:16.560 |
to say hello. And, you know, maybe say what you're interested in or whatever, 01:35:21.760 |
because it's nice to work with others, I think. I mean, I only know both Jono and Tanishk 01:35:26.840 |
because of the Discord and the forums and so forth. So that would be one. 01:35:32.480 |
And we also have a, we have a generative channel. So anything related to generative models, 01:35:38.800 |
that's the place. So for example, Molly was posting some of her experiments in that channel. 01:35:43.760 |
I think there are other Fast AI members posting their experiments. So if you're doing anything 01:35:47.640 |
generative model related, that's a great way to also get feedback and thoughts from others. 01:35:55.880 |
I'd also say that, like, if you're at the stage where you've finished this course, 01:36:00.560 |
you actually understand how diffusion models work. You've got a good handle on what the 01:36:04.520 |
different components in something like stable diffusion are. And you know how to wrangle data for 01:36:08.520 |
training and all these things. You're like so far ahead of most people who are building 01:36:13.480 |
in this space. And I've got lots of, lots of companies and people reaching out to me 01:36:18.440 |
to say, do you know anybody who has like more than just, oh, I know how to like load stable 01:36:23.040 |
diffusion and make an image. Like, you know, someone who knows how to actually like tinker 01:36:25.920 |
with it and make it better. And if you've got those skills, like don't feel like, oh, 01:36:29.400 |
I'm definitely not qualified to like apply or like, there's lots of stuff where, yeah, 01:36:34.840 |
just taking these ideas now and like just simple, sensible ideas that we've covered 01:36:39.200 |
in the course that have come up and saying, oh, actually, maybe I could try that. Maybe 01:36:42.480 |
I could play with this, you know, take this experimentalist approach. I feel like there's 01:36:46.360 |
actually a lot of people who would love to have you helping them build the million and 01:36:51.200 |
one little stable diffusion based apps or whatever that you're working on. 01:36:54.960 |
And particularly like the thing we always talk about at Fast AI, which is particularly 01:36:58.520 |
if you can combine that with your domain expertise, you know, whether it be from your, your hobbies 01:37:05.320 |
or your work in some completely different field or whatever, you know, there'll be lots 01:37:11.080 |
of interesting ways to combine, you know, you probably are one of the only people in 01:37:15.400 |
the world right now that understand your areas of passion or of vocation as well as these 01:37:24.280 |
techniques. So, and again, that's a good place to kind of get on the forum or the discord 01:37:30.680 |
or whatever and start having those conversations because it can be, yeah, it can be difficult 01:37:35.600 |
when you're at the cutting edge, which you now are by definition. 01:37:41.800 |
All right. Well, we better go away and start figuring out how on earth GPT-4 works. I don't 01:37:50.680 |
think we're going to necessarily build the whole GPT-4 from scratch, at least not at 01:37:55.320 |
that scale, but I'm sure we're going to have some interesting things happening with NLP. 01:38:01.000 |
And Jono, Tanishk, thank you so much. It's been a real pleasure. It was nice doing things 01:38:01.000 |
with the, with a live audience, but I got to say, I really enjoyed this experience of 01:38:12.320 |
doing stuff with you guys the last few lessons. So thank you so much. 01:38:16.280 |
Yeah, thanks for having us. This is really, really fun.