back to index

Lesson 25: Deep Learning Foundations to Stable Diffusion


Whisper Transcript | Transcript Only Page

00:00:00.000 | Hi everybody, and welcome to the last lesson of part two.
00:00:05.900 | Greetings Jono and Greetings Tanishk, how are you guys doing?
00:00:09.400 | Good thanks.
00:00:11.060 | Doing well.
00:00:12.060 | Excited for the last lesson.
00:00:13.560 | It's been an interesting, fun journey.
00:00:16.040 | Yeah.
00:00:17.040 | I should explain, we're not quite completing all of stable diffusion in this part of the
00:00:23.440 | course.
00:00:24.440 | There's going to be one piece left for the next part of the course, which is the CLIP
00:00:28.320 | embeddings.
00:00:29.320 | Because CLIP is NLP, and so the next part of the course we will be looking at NLP.
00:00:35.420 | So we will end up finishing stable diffusion from scratch, but we're going to have to have
00:00:41.360 | a significant diversion.
00:00:43.360 | And what we thought was, given everything that's happened with GPT-4 and stuff since
00:00:50.680 | we started this course, we thought it makes more sense to delve into that quite deeply,
00:00:56.520 | more soon.
00:01:00.240 | And delay CLIP as a result.
00:01:02.800 | So hopefully people will feel comfortable with that decision, but I think we'll have a lot
00:01:06.640 | of exciting NLP material coming up.
00:01:11.920 | So that's the rough plan.
00:01:13.960 | All right.
00:01:15.840 | So I think what we might do is maybe start by looking at a really interesting and quite
00:01:23.720 | successful application of pixel level diffusion by applying it not to pixels that represent
00:01:30.400 | an image, but pixels that represent a sound, which is pretty crazy.
00:01:35.160 | So maybe Johnno, of course it's going to be Johnno, he does the crazy stuff, which is
00:01:38.520 | great.
00:01:39.520 | So Johnno, show us your crazy and crazily successful approach to diffusion for pixels
00:01:45.760 | of sounds.
00:01:46.760 | Please.
00:01:47.760 | Sure thing.
00:01:48.760 | Right.
00:01:49.760 | So this is going to be a little bit of show and tell.
00:01:54.160 | Most of the code in the notebook is just copied and pasted from, I think, notebook 30.
00:01:59.680 | But we're going to be trying to generate something other than just images.
00:02:03.600 | So specifically, I'm going to be loading up a dataset of bird calls.
00:02:07.040 | These are just like short samples of, I think, 10 different classes of birds calling.
00:02:12.600 | And so we need to understand like, okay, well, this is a totally different domain, right?
00:02:16.080 | This is audio.
00:02:17.400 | If you look at the data, like, let's look at an example of the data.
00:02:19.840 | This is coming from a hugging face dataset so that that line of code will download it
00:02:23.560 | automatically if you haven't got it before, right?
00:02:25.840 | Yeah.
00:02:26.840 | Yeah.
00:02:27.840 | So this will download it into a cache and then sort of handle a lot of, you created
00:02:32.760 | this dataset, right?
00:02:33.760 | Did you, is this already a dataset you found somewhere else or you made it or what?
00:02:38.360 | This is a subset that I made from a much larger dataset of longer call recordings from
00:02:43.160 | an open website called Xeno-canto.
00:02:45.720 | So they collect all of these sound recordings from people.
00:02:48.320 | They have experts who help identify what birds are calling.
00:02:51.600 | And so all I did was find the audio peaks, like where is there most likely to be a bird
00:02:56.840 | call and clip around those just to get a smaller dataset of things where there's actually something
00:03:02.840 | happening.
00:03:03.920 | Not a particularly amazing dataset in terms of like the recordings have a lot of background
00:03:08.960 | noise and stuff, but a fun, small audio one to play with.
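As a rough sketch of the data-loading step being described here, loading an audio dataset from the Hugging Face Hub looks something like the following. The repository id is a placeholder, not the actual dataset used in the lesson.

```python
# Minimal sketch: load an audio dataset from the Hugging Face Hub.
# "username/birdcalls-subset" is a hypothetical repo id, not the real one.
from datasets import load_dataset

ds = load_dataset("username/birdcalls-subset", split="train")
sample = ds[0]
waveform = sample["audio"]["array"]          # 1-D float array of raw samples
sr = sample["audio"]["sampling_rate"]        # e.g. 32,000 samples per second
print(len(waveform) / sr, "seconds of audio")
```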
00:03:12.200 | Um, yeah. And so when we talk about audio, you've got a microphone somewhere, it's reading
00:03:16.640 | like a pressure level, essentially, um, in the air with these sound waves.
00:03:20.960 | And it's doing that some number of times per second.
00:03:23.000 | So we have a sample rate and in this case, the data has a sample rate of 32,000 samples
00:03:28.520 | per second.
00:03:29.520 | So every second, the waveform that's being approximated is captured as lots of little up-across,
00:03:34.520 | up-across kind of steps, basically, correct?
00:03:38.480 | Yeah.
00:03:39.480 | Um, and so that's great for, you know, capturing the audio, um, but it's not so good for modeling
00:03:44.800 | because we now have 32,000 values per second in this one big, one D array.
00:03:51.040 | Um, and so yeah, you can try and find models that can work with that kind of data.
00:03:57.000 | Uh, but what we're going to do is a little hack and we're instead going to use something
00:04:00.640 | called a spectrogram.
00:04:02.720 | So with the original data, the main issue is
00:04:05.200 | that it's too big and slow to work with?
00:04:09.240 | It's too big.
00:04:11.000 | Um, but also you have some, like some sound waves are at a hundred Hertz, right?
00:04:18.400 | So they're, they're going up and down a hundred times a second and some are at a thousand
00:04:21.520 | and some are at 10,000.
00:04:23.080 | And often there's background noise that can have extremely high frequency components.
00:04:27.260 | And so if you're looking just at the waveform, there's lots and lots of change second to
00:04:32.560 | second.
00:04:33.560 | And there's some very long range dependencies of like, Oh, it's generally high here.
00:04:37.480 | It's generally low there.
00:04:39.240 | And so it can be quite hard to capture those patterns.
00:04:41.240 | Um, and so part of it is, it's just a lot of samples to deal with.
00:04:44.880 | Um, but part of it also is that it's not like an image where you can just do a convolution
00:04:51.040 | and things nearby each other tend to be related or something like that.
00:04:54.240 | Um, it's quite tricky to disentangle what's going on.
00:04:57.920 | Um, and so we have this idea of something called a spectrogram.
00:05:01.760 | Uh, this is a fancy 3d visualization, but it's basically just taking that audio and
00:05:07.440 | mapping time on one axis.
00:05:09.120 | So you can see as time goes by, we're moving along the X axis and then on the Y axis is
00:05:13.960 | frequency.
00:05:15.480 | And so the, um, the peaks here show like intensity at different frequencies.
00:05:19.980 | And so if I make a pure note, you can see that that maps in the frequency domain.
00:05:27.640 | Um, but when I'm talking, there's lots and lots of peaks and that's because our voices
00:05:32.320 | tend to produce a lot of overtones.
00:05:33.920 | So if I go, you can see there's a main notes, but there's also the subsequent notes.
00:05:38.880 | And if I play something like a chord, you can see this, you know, maybe three main peaks
00:05:47.440 | and then each of those have these harmonics as well.
00:05:50.360 | Um, so it captures a lot of information about the signal.
00:05:53.560 | Um, and so we're going to turn our audio data into something like this, where even just
00:05:59.720 | visually, if I'm a bird, you can see this really nice spatial pattern.
00:06:04.960 | And the hope is if we can generate that and then if we can find some way to turn it back
00:06:09.160 | into audio and then we'll be off to the races.
00:06:13.720 | And so, yeah, that's what I'm doing in this notebook.
00:06:15.800 | We have, um - I'm leaning on the diffusers pipelines.audio_diffusion Mel class.
00:06:25.240 | And so within the realm of spectrograms, there's a few different ways you can do it.
00:06:29.640 | So this is from the torch audio docs, um, but this notebook is from the hugging face diffusion
00:06:33.880 | models class.
00:06:35.280 | So we had that waveform, that's those raw samples and we'd like to convert that into
00:06:39.840 | what they call the frequency domain, um, which is things like these spectrograms.
00:06:44.520 | Um, and so you can do a normal, normal spectrogram, a power spectrogram or something like that.
00:06:50.960 | Um, but we often use something called a mel spectrogram, which is basically the same idea.
00:06:56.440 | It's actually probably what's being visualized here.
00:06:59.320 | And it's something that's designed to map the, like the frequency ranges into a range
00:07:06.600 | that's, um, like tied to what human hearing is based on.
00:07:12.280 | And so rather than trying to capture all frequencies from, you know, zero Hertz to 40,000 Hertz,
00:07:17.440 | a lot of which we can't even hear, it focuses in on the range of values that we tend to
00:07:23.040 | be interested in as, as humans.
00:07:25.400 | And also it does like a transformation into, into kind of like a log space.
00:07:31.720 | Um, so that the, the intensities like highs and lows correspond to loud and quiet for
00:07:36.320 | human hearing.
00:07:37.320 | So it's very tuned for, um, the types of audio information that we actually might care about
00:07:43.240 | rather than, you know, the tens of thousands of hertz that only bats can hear.
00:07:47.720 | Um, okay.
00:07:48.720 | So we're going to rely on a class to abstract this away, but it's going to basically give
00:07:52.200 | us a transformation from waveform to spectrogram.
00:07:56.320 | And then it's also going to help us go from spectrogram back to waveform.
00:08:00.120 | Um, and so, uh, let me show you my data.
00:08:03.320 | I have this to_image function that's going to take the audio array.
00:08:07.400 | It's going to use the Mel class to handle turning that into spectrograms.
00:08:14.320 | And the class also does things like splitting the audio up into chunks, based on a
00:08:18.400 | desired resolution you can set - I'd like a 128 by 128 spectrogram.
00:08:24.320 | It says, okay, great.
00:08:25.320 | I know you need 128 frequency bins for the frequency axis and
00:08:29.600 | 128 steps on the, on the time axis.
00:08:33.640 | So it kind of handles that converting and resizing.
00:08:36.720 | Um, and then it gives us this audio_slice_to_image function.
00:08:40.520 | So that's taking a chunk of audio and turning it into the spectrogram.
00:08:43.960 | And it also has the inverse.
00:08:45.600 | Um, so our dataset is fairly simple.
00:08:49.200 | We're just referencing our original audio dataset, but we're calling that to_image function and
00:08:55.240 | then turning it into a tensor, and we're mapping it to minus 0.5 to 0.5, similarly
00:09:01.920 | to what we've done with like the grayscale images in the past.
00:09:05.080 | Um, so if you look at a sample from that data, we now have, instead of an audio waveform
00:09:10.200 | of 32,000 or 64,000, if it's two seconds samples, we now have this 128 by 128 pixel spectrogram,
00:09:21.200 | which looks like this.
00:09:22.200 | Um, and it's just, it's grayscale.
00:09:23.800 | So this is just matplotlib's colormap.
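A sketch of the dataset just described, assuming a `to_image` helper that wraps the Mel-based conversion; the names and scaling here are illustrative rather than the notebook's exact code.

```python
# Each item: raw audio clip -> 128x128 grayscale spectrogram image -> tensor in [-0.5, 0.5].
import numpy as np
import torch
from torch.utils.data import Dataset

class SpecDataset(Dataset):
    def __init__(self, audio_ds, to_image):
        self.audio_ds, self.to_image = audio_ds, to_image
    def __len__(self):
        return len(self.audio_ds)
    def __getitem__(self, i):
        img = self.to_image(self.audio_ds[i]["audio"]["array"])   # PIL image
        x = torch.tensor(np.array(img), dtype=torch.float32) / 255
        return x - 0.5                                            # roughly [-0.5, 0.5]
```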
00:09:25.720 | Um, but we can test out going from the spectrogram back to audio using the image_to_audio function
00:09:32.560 | that the Mel class has, and that should give us... now, this isn't perfect, because
00:09:43.080 | the spectrogram shows the intensity at different frequencies, but with audio, you've also got
00:09:47.680 | to worry about something called the phase.
00:09:50.040 | And so this image to audio function is actually behind the scenes doing a kind of iterative
00:09:55.800 | approximation, um, with something called the Griffin-Lim algorithm.
00:09:59.600 | Um, so I'm not going to try and describe that here, but it's just, it's approximating.
00:10:04.600 | It's guessing what the phase should be.
00:10:06.620 | It's creating a spectrogram, it's comparing that to the original, it's updating, it's doing
00:10:10.400 | some sort of like iterative, very similar to like an optimization thing to try and generate
00:10:15.240 | an audio signal that would produce the spectrogram, which we're trying to invert.
00:10:19.480 | So just to clarify, so my understanding, what you're saying is that the spectrogram is a
00:10:26.200 | lossy conversion of the sound into an image.
00:10:30.880 | Um, and specifically it's lossy because it, um, tells you the kind of intensity at each
00:10:38.800 | point, but it's not, it's kind of like, is it like the difference between a sine wave
00:10:42.080 | and a, and a cosine wave?
00:10:43.600 | Like they're just shifted in different ways.
00:10:45.200 | We don't know how much it's shifted.
00:10:47.320 | So coming back to the sound, you do have to get that, that shifting the phase correct.
00:10:53.160 | And so it's trying to guess something and it sounds like it's not doing a great guess
00:10:57.000 | from the thing you showed.
00:10:59.840 | The original audio is also not that amazing.
00:11:01.940 | Um, but yes, the, the spectrogram back to audio task, this, these dotted lines are like
00:11:07.240 | highlighting this is, yeah, it's an approximation and there are deep learning methods now that
00:11:11.480 | can do that better, or at least that sound much higher quality, um, because you can train
00:11:17.360 | a model somehow to go from this image-like representation back to an audio signal.
00:11:24.160 | Um, but we just use the approximation for this notebook.
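For anyone who wants to see the moving parts outside the helper class, here is a rough torchaudio equivalent of the two conversions being discussed: waveform to mel spectrogram (in dB), and an approximate inverse that uses Griffin-Lim to guess the phase. The parameter values are illustrative, not the notebook's exact settings.

```python
import torch
import torchaudio

sr = 32_000
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=1024, hop_length=256, n_mels=128)
to_db = torchaudio.transforms.AmplitudeToDB(top_db=80)

wave = torch.randn(1, sr * 2)             # stand-in for a 2-second clip
spec_db = to_db(to_mel(wave))             # (1, 128 mel bins, time steps)

# Going back: undo the mel scaling, then let Griffin-Lim iteratively guess the phase.
inv_mel = torchaudio.transforms.InverseMelScale(
    n_stft=1024 // 2 + 1, n_mels=128, sample_rate=sr)
griffin = torchaudio.transforms.GriffinLim(n_fft=1024, hop_length=256)
spec_power = torchaudio.functional.DB_to_amplitude(spec_db, ref=1.0, power=1.0)
wave_approx = griffin(inv_mel(spec_power))
```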
00:11:27.200 | Um, okay.
00:11:28.600 | So now that we can represent our data as like a grayscale 128 by 128 pixel image, um, everything
00:11:35.080 | else becomes very much the same as the previous diffusion models examples.
00:11:38.800 | We're going to use this noiseify function to add different amounts of noise.
00:11:43.520 | And so we can see now we have our spectrograms, but with varying amounts of noise added.
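The noisify step follows the earlier notebooks; as a hedged sketch, assuming the simple cosine alpha-bar schedule used previously in the course, it looks roughly like this:

```python
import math
import torch

def abar(t):                       # cumulative signal level for t in [0, 1]
    return (t * math.pi / 2).cos() ** 2

def noisify(x0):
    t = torch.rand(x0.shape[0])                       # a random timestep per batch item
    ab = abar(t).reshape(-1, 1, 1, 1).to(x0)
    noise = torch.randn_like(x0)
    xt = ab.sqrt() * x0 + (1 - ab).sqrt() * noise     # mix of signal and noise
    return (xt, t.to(x0)), noise                      # the model learns to predict the noise
```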
00:11:47.320 | We can then create a simple diffusion model.
00:11:50.920 | I'm just copying and pasting the results, but with one extra layer, um, just with very few
00:11:56.320 | channels, going from 128 down to 64, to 32, to 16, to 8 by 8, um, no attention.
00:12:04.960 | Just I think pretty much copied and pasted from notebook 30, uh, and train it for in
00:12:11.080 | this case, 15 epochs.
00:12:12.360 | It took about, this is interesting.
00:12:14.400 | So you're using simple diffusion.
00:12:19.440 | Um, so specifically, this is the simple diffusion model that you, um, I think I've already introduced.
00:12:25.120 | Maybe not.
00:12:26.120 | Um, yeah.
00:12:27.120 | So briefly looked at it, so let's remind ourselves of what it does here.
00:12:31.200 | Okay.
00:12:32.200 | Yeah.
00:12:33.200 | Um, so we have some number of down blocks with a specified number of channels.
00:12:37.680 | And then the key insight from simple diffusion was that you often want to concentrate the
00:12:41.440 | computes in the sort of middle at the low resolution.
00:12:44.480 | So that's these, these mid blocks and their transformers.
00:12:48.360 | Um, yes.
00:12:49.880 | Um, yeah.
00:12:51.360 | Um, and so we can stack some number of those and then, um, the corresponding up path, and
00:12:57.680 | this is a unit.
00:12:58.680 | So we passing in the features from the, the down path as we go through those up blocks.
00:13:03.640 | Um, and so we're going to go take an, um, image and time step.
00:13:09.000 | We can embed the time step.
00:13:10.880 | We're going to go through our down blocks and saving the results, we're going to go through
00:13:17.800 | the mid blocks.
00:13:18.800 | There we go.
00:13:19.800 | Through the mid blocks.
00:13:20.800 | Yeah.
00:13:21.800 | And before that, you've also got the embedding of the locations - that's the learnable
00:13:30.320 | embeddings, used with scale and shift.
00:13:32.360 | I remember.
00:13:33.360 | Uh, right.
00:13:34.360 | So this is preparing it to go through the transformer blocks by adding some learnable
00:13:38.240 | embeddings.
00:13:39.240 | Mm-hmm.
00:13:40.240 | All right.
00:13:41.240 | And then we're reshaping it to be effectively a sequence, since that's how we had written
00:13:48.920 | our transformer - to expect a 1D sequence of embeddings.
00:13:52.840 | Um, and so once you've gone through those mid blocks, we reshape it back and then we
00:13:57.080 | go through the up blocks passing in and also our saved outputs from the down path.
00:14:02.640 | Um, yeah, so it's a nice, it's a nice model.
00:14:06.320 | And you can really control how many parameters and how much compute you're using just by setting
00:14:11.280 | the number of features or channels at each of those down block stages and how
00:14:16.720 | many mid blocks are you going to stack?
00:14:18.960 | Um, and so if you want to scale it up, it's quite easy to say, oh, let me just add more
00:14:22.520 | mid blocks.
00:14:23.520 | Maybe I'll add more channels, um, to the, to the down and up paths.
00:14:27.360 | Um, and there's a very easy model to tweak to get a larger or smaller model.
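To make the data flow concrete, here is a heavily simplified skeleton of the model being described: down blocks that save their activations, transformer mid blocks operating on a flattened sequence, then up blocks with skip connections. It omits the time-step and positional embeddings and the real block internals; it is a sketch of the shape of the computation, not the course's actual implementation.

```python
import torch
from torch import nn

class TinySimpleDiffusion(nn.Module):
    def __init__(self, nf=(32, 64, 128), n_mid=4):
        super().__init__()
        self.downs = nn.ModuleList(
            [nn.Conv2d(1 if i == 0 else nf[i - 1], nf[i], 3, stride=2, padding=1)
             for i in range(len(nf))])
        d = nf[-1]
        self.mids = nn.ModuleList(
            [nn.TransformerEncoderLayer(d, 4, d * 4, batch_first=True)
             for _ in range(n_mid)])
        self.ups = nn.ModuleList(
            [nn.ConvTranspose2d(nf[i] * 2, 1 if i == 0 else nf[i - 1], 4, stride=2, padding=1)
             for i in reversed(range(len(nf)))])

    def forward(self, x):                       # x: (B, 1, 128, 128) spectrograms
        saved = []
        for down in self.downs:
            x = torch.relu(down(x))
            saved.append(x)
        b, c, h, w = x.shape
        x = x.flatten(2).transpose(1, 2)        # (B, H*W, C): a sequence for the transformer
        for mid in self.mids:
            x = mid(x)
        x = x.transpose(1, 2).reshape(b, c, h, w)
        for up in self.ups:
            x = torch.cat([x, saved.pop()], dim=1)   # UNet-style skip connection
            x = up(x)
        return x
```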
00:14:32.280 | One fun thought, Johnno, is, um, simple diffusion only came out a couple of months ago and I
00:14:38.160 | don't think, I think ours might be the first publicly available code for it because I don't
00:14:44.040 | think the author has released the code.
00:14:45.840 | I suspect this is probably the first time maybe it's ever been used to generate audio
00:14:49.680 | before.
00:14:50.680 | Uh, possibly.
00:14:51.680 | Yeah.
00:14:52.680 | I guess.
00:14:53.680 | Um, I know a couple of people who've at least privately done their implementations when
00:14:57.640 | I asked the author if he was releasing code, he said, oh, but it's simple.
00:15:00.600 | It's just a bunch of transformer blocks.
00:15:02.880 | I'll release it eventually.
00:15:05.360 | Um, no, maybe, maybe not.
00:15:07.160 | I don't know the timeline then, but they were like, oh, you can see the pseudocode.
00:15:11.120 | It's pretty easy.
00:15:12.120 | Yeah.
00:15:13.120 | It is pretty easy.
00:15:14.120 | Yeah.
00:15:15.120 | Cool.
00:15:16.120 | So it trains, the loss goes down as we hope, um, sampling is exactly the same as generating
00:15:21.660 | images normally.
00:15:22.660 | Um, and that's going to give us the spectrograms.
00:15:25.840 | I'm using DDIM with a hundred steps, um, and to actually listen to these samples, we
00:15:31.960 | then are just going to use that, um, image to audio function again, to take our grayscale
00:15:37.840 | image.
00:15:38.840 | Um, and in this case, actually it expects a PIL image.
00:15:41.440 | So I first converted it to PIL, um, and then turn that back into audio.
00:15:47.200 | And so we can play some of the generated samples.
00:15:51.800 | Wow, that's so cool.
00:15:56.680 | I don't know that I could guarantee what bird is making these calls and some of them are
00:16:02.520 | better than others.
00:16:03.520 | Like some of them are better than others.
00:16:06.520 | Yeah.
00:16:07.520 | Some of the original samples sound like that too, right?
00:16:11.000 | Exactly.
00:16:12.000 | Yeah.
00:16:13.000 | So yeah, that's generating fake bird calls with, um, spectrogram diffusion.
00:16:18.120 | There's projects that do this on music.
00:16:19.720 | Um, so the Riffusion project is based on text, and yeah, there's, there's various other like
00:16:28.720 | pre-trained models that do diffusion on spectrograms to produce, um, you know, music clips or voice
00:16:36.920 | or whatever.
00:16:37.920 | Um, I may have frozen.
00:16:40.560 | Riffusion is actually a stable diffusion model that's fine-tuned specifically
00:16:44.840 | for this spectrogram generation, which is, which I find very impressive.
00:16:49.960 | It's like a model that was originally for, you know, text to image is instead can also
00:16:54.320 | generate the spectrograms.
00:16:55.680 | I guess there's still some useful information in, you know, the sort of text image model
00:17:00.400 | that kind of generalizes, or you can still be used for text to audio.
00:17:04.560 | So I found that a very interesting, impressive application. Also, Riffusion is an awesome
00:17:09.320 | name.
00:17:10.320 | Indeed.
00:17:11.320 | It is, yeah.
00:17:15.080 | Cool.
00:17:16.080 | And I guess since it's a latent model that leads us onto the next topic, right?
00:17:18.840 | I was just going to say, we've got a natural segue there.
00:17:22.360 | So, um, if we want to replicate Riffusion, then, um, we'll need latents.
00:17:31.520 | Yeah.
00:17:33.520 | So the, the final non-NLP part of stable diffusion is this, uh, ability to use the more compressed
00:17:41.200 | representation, uh, created by a VAE, called latents, um, instead of pixels.
00:17:48.080 | Um, so we're going to start today by creating a VAE, taking a look at how it works.
00:17:54.520 | Um, so to remind you, as we learned back in the, the first lesson of this part of part
00:18:01.160 | two, um, the VAE model converts the, um, 256 by 256 pixel, three channel image into a, um, is
00:18:14.960 | it 64 by 64 by four?
00:18:17.520 | It'll be 32 if it's 256, uh - it's 512 that goes to 64.
00:18:22.880 | Oh, 512 to 64.
00:18:23.880 | Okay.
00:18:24.880 | So it'd be 32 by 32 by four.
00:18:27.000 | So dramatically smaller, which makes life so much easier, um, which is, which is really
00:18:34.840 | nice.
00:18:35.840 | Um, having said that, you know, simple diffusion does the first, you know, few, in fact, you
00:18:44.680 | know, all the downsampling pretty quickly and, and all the hard work happens, you know,
00:18:49.480 | at a 16 by 16 anyway.
00:18:52.000 | So maybe it's, you know, with simple diffusion, it's not as big a deal as it used to be, but
00:18:56.560 | it's still, you know, it's very handy, particularly because for us folks with more normal amounts
00:19:01.760 | of compute, we can take advantage of all that hard work that the stability.ai computers
00:19:08.960 | did for us by creating the stable diffusion VAE.
00:19:13.000 | Um, so that's what we're going to do today, but first of all, we're going to create our own.
00:19:19.200 | Um, so let's do a VAE using fashion MNIST.
00:19:23.360 | So the first part is just the normal stuff.
00:19:26.500 | One thing I am going to do for this simple example though, is I'm going to flatten the,
00:19:31.120 | um, fashion MNIST pixels into a vector to make it as simple as possible.
00:19:38.160 | Um, okay.
00:19:40.840 | So we're going to end up with vectors of length 784, because 28 by 28 is 784, uh, we're
00:19:48.080 | going to create a single hidden layer MLP with, um, 400, um, hidden and then 200 outputs.
00:20:01.840 | So here's a linear layer.
00:20:03.440 | So it's a sequential containing a linear and then an optional activation function and an
00:20:08.000 | optional normalization.
00:20:10.000 | Um, we'll update init weights so that we initialize linear layers as well.
00:20:16.480 | Um, so before we create a VAE, which is a variational autoencoder, we'll create a normal autoencoder.
00:20:23.800 | We've done this once before and we didn't have any luck.
00:20:27.440 | Um, in fact, we were so unsuccessful that we decided to go back and create a learner
00:20:33.060 | and come back a few weeks later once we knew what we were doing.
00:20:36.400 | So here we are.
00:20:37.400 | We're back.
00:20:38.400 | We think we know what we're doing.
00:20:39.400 | Um, so we're just going to recreate an autoencoder just like we did some lessons ago.
00:20:44.920 | Um, so there's going to be an encoder, which is a sequential, which goes from our 784 inputs
00:20:50.840 | to our 400 hidden and then a linear layer with our 400 hidden and then an output layer from
00:20:56.860 | the 400 hidden to the 200 outputs of the encoder.
00:21:02.320 | So there we got our latents.
00:21:05.400 | And then the decoder will go from those 200 latents to our 400 hidden, have our hidden
00:21:13.160 | layer and then come back to our 784 inputs.
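A minimal sketch of the autoencoder just described, simplifying the activation and normalization details to plain ReLU:

```python
from torch import nn

def lin(ni, nf, act=True):
    layers = [nn.Linear(ni, nf)]
    if act:
        layers.append(nn.ReLU())
    return nn.Sequential(*layers)

autoencoder = nn.Sequential(
    # encoder: 784 flattened pixels -> 400 hidden -> 200 latents
    lin(784, 400), lin(400, 400), lin(400, 200),
    # decoder: 200 latents -> 400 hidden -> 784 pixels (no activation on the output)
    lin(200, 400), lin(400, 400), lin(400, 784, act=False),
)
```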
00:21:19.840 | Um, all right.
00:21:23.800 | So we can optimize that in the usual way using Adam, um, and we'll do it for 20 epochs - it runs
00:21:34.200 | pretty quickly because it's quite a small dataset and quite a small model.
00:21:38.440 | Um, and so what we can then do, um, is we can grab a batch of our X - we actually grabbed
00:21:48.040 | the batch of X earlier, uh, way back here.
00:21:53.440 | So I've got a batch of images, um, and we can put it through our model, um, pop it back
00:22:02.360 | on the CPU and we can then have a look at our original mini-batch and we have to reshape
00:22:09.960 | it to 28 by 28 because we previously had flattened it.
00:22:13.840 | So there's our original and then, um, we can look at the result after putting it through
00:22:19.120 | our model.
00:22:21.760 | And there it is.
00:22:22.760 | And as you can see, it's, you know, very roughly regenerated.
00:22:27.600 | And so this is, um, not a massive compression - it's compressing it from 784 to 200.
00:22:35.080 | And it's also not doing an amazing job of recreating the original details.
00:22:38.660 | Um, but you know, this is the simplest possible auto encoder.
00:22:42.360 | So it's doing, you know, it's a lot better than our previous attempt.
00:22:46.240 | Um, so that's good.
00:22:49.480 | So what we could now do is we could just generate some noise and then we're not even going to
00:22:55.080 | do diffusion.
00:22:56.080 | So we're going to go and say like, okay, we've got a decoder.
00:22:58.160 | So let's just decode that noise and see what it creates.
00:23:03.400 | And the answer is not anything great.
00:23:06.560 | I mean, I could kind of recognize that might be the start of a shoe.
00:23:11.920 | Maybe that's the start of a bag.
00:23:12.920 | I don't know, but it's not doing anything amazing.
00:23:16.140 | So we have not successfully created an image generator here, um, but there's a very simple
00:23:22.480 | step we can do to make something that's more like an image generator.
00:23:26.240 | The problem is that, um, with this vector of length 200, there's no
00:23:35.880 | particular reason that things that are not in the dataset are going to create items of
00:23:43.400 | clothing.
00:23:44.400 | We haven't done anything to try to make that happen.
00:23:47.040 | We've only tried to make this work for things in the dataset, you know, and, um, therefore,
00:23:54.560 | when we just randomly generate a bunch of, you know, a vector of length 200 or 16 vectors
00:24:01.840 | of length 200 in this case, um, and then decode them, there's no particular reason to think
00:24:07.440 | that they're going to create something that's recognizable as clothing.
00:24:13.840 | So the way a VAE tries to fix this is by, we've got the exact same encoder as before, except
00:24:25.440 | it's just missing its final layer.
00:24:29.080 | Its final layer has been moved over to here.
00:24:31.520 | I'll explain why there's two of them in a moment.
00:24:34.020 | So we've got the inputs to hidden, the hidden to hidden, and then the hidden to the latent.
00:24:39.680 | The decoder is identical, okay, latent's to hidden, hidden to hidden, hidden to inputs.
00:24:49.280 | And then just as before, we call the encoder, um, but we do something a little bit weird
00:24:59.240 | next, which is that we actually have two separate final layers.
00:25:05.080 | We've got one called mu for the final of the encoder and one called LV, which stands for
00:25:12.120 | log of variance.
00:25:13.920 | So encoder has two different final layers.
00:25:17.000 | So we're going to call both of them, okay.
00:25:20.460 | So we've now got two encoded 200 long lots of latents.
00:25:26.200 | What do we do with them?
00:25:27.840 | What we do is we use them to generate random numbers and the random numbers have a mean
00:25:39.760 | of mu.
00:25:41.640 | So when you take a random zero one, so this creates zero one random numbers, mean zero
00:25:48.080 | standard deviation one.
00:25:49.780 | So if we add mu to it, they now have a mean of mu or approximately.
00:25:54.760 | And if you multiply the random numbers by half of log of variance, e to the power of
00:26:00.760 | that, right?
00:26:02.560 | So given this log of variance, this is going to give you standard deviation.
00:26:08.720 | So this is going to give you a standard deviation of e to the half LV and a mean of mu.
00:26:16.240 | Why the half?
00:26:17.640 | It doesn't matter too much, but if you think about it, um, standard deviation is the square
00:26:23.040 | root of the variance, so the variance is the standard deviation squared.
00:26:26.460 | So when you take the log, that square root becomes a factor of a half, because of the
00:26:34.800 | log rule that the log of x to the a equals a times the log of x.
00:26:36.680 | That's why we've just got the half here instead of the square root, which would be to the
00:26:41.000 | power of a half.
00:26:44.240 | So this is just, yeah, this is just the standard deviation.
00:26:48.140 | So we've got the standard deviation times normally distributed random noise plus mu.
00:26:52.060 | So we end up with normally distributed numbers - we're going to have 200 of them for each element
00:27:01.160 | of the batch - where they have a mean which is the result of the mu layer, and a variance
00:27:11.880 | whose log is the result of the lv layer.
00:27:16.980 | And then finally we passed that through the decoder as usual.
00:27:21.320 | I'll explain later why we pass back three things, but for now we're just worried about the fact
00:27:24.200 | we passed back the result of the decoder.
00:27:27.440 | So what this is going to do is it's going to generate, um, the, the result of calling,
00:27:34.960 | um, encode is going to be a little bit random.
00:27:40.480 | On average, you know, it's still generating exactly the same as before, which is the result
00:27:45.640 | of a sequential model with, you know, MLP with one hidden layer, but it's also going
00:27:52.480 | to add some randomness around that, right?
00:27:55.860 | So this is, here's the bit, which is exactly the same as before.
00:27:58.320 | This is the same as calling encode before, but then here's the bit that adds some randomness
00:28:03.120 | to it.
00:28:04.120 | And the amount of randomness is also itself random.
00:28:08.400 | Okay.
00:28:09.600 | So then that gets run through the decoder.
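Putting the pieces just described into a minimal sketch: a shared encoder body, two separate heads for mu and log-variance, the sampling step z = mu + eps * exp(lv / 2), and an MLP decoder. Sizes follow the lesson (784 inputs, 400 hidden, 200 latents); the layer details are simplified and this is not the notebook's exact code.

```python
import torch
from torch import nn

class VAE(nn.Module):
    def __init__(self, n_in=784, n_hid=400, n_lat=200):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_in, n_hid), nn.ReLU(),
                                 nn.Linear(n_hid, n_hid), nn.ReLU())
        self.mu = nn.Linear(n_hid, n_lat)    # head 1: mean of the latents
        self.lv = nn.Linear(n_hid, n_lat)    # head 2: log of their variance
        self.dec = nn.Sequential(nn.Linear(n_lat, n_hid), nn.ReLU(),
                                 nn.Linear(n_hid, n_hid), nn.ReLU(),
                                 nn.Linear(n_hid, n_in))

    def forward(self, x):
        h = self.enc(x)
        mu, lv = self.mu(h), self.lv(h)
        # sample z ~ N(mu, exp(lv)); std = exp(lv / 2) because lv is the log of the variance
        z = mu + torch.randn_like(mu) * (lv * 0.5).exp()
        return self.dec(z), mu, lv           # the three things passed back
```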
00:28:11.920 | Um, okay, so if we now just, um, well, you know, trained that, right, using the result
00:28:21.800 | of the decoder and using, um, I think we didn't use MSE loss.
00:28:25.800 | We used a binary cross entropy loss, which we've seen before.
00:28:29.840 | Um, so if you've forgotten, you should definitely go back and rewatch that bit, probably in part one.
00:28:34.840 | Um, or we've done a bit of it in part two as well, binary cross entropy loss, um, with
00:28:41.720 | logits means that you don't have to worry about applying the sigmoid.
00:28:44.720 | It does the sigmoid for you.
00:28:46.800 | Um, so if we just, um, optimize this using BCE now, you would expect - and it would, I
00:28:55.480 | believe, I haven't checked - that it would basically take this lv layer
00:28:59.840 | here and drive the variances to zero, um, as a result of which it would have no randomness
00:29:04.720 | at all. Um, and therefore it would behave exactly the same as the previous autoencoder.
00:29:12.880 | Does that sound reasonable to you guys?
00:29:15.360 | Yeah. Okay. Um, so that wouldn't help at all because what we actually want is we want some
00:29:21.040 | variance and the reason we want some variance is we actually want to have it generate some
00:29:28.560 | latents, which are not exactly our data. They're around our data, but not exactly our data.
00:29:35.000 | And then when it generates latents that are around our data, we want them to decode to
00:29:40.640 | our, to the same thing. We want them to decode to the correct image. And so as a result,
00:29:45.920 | if we can train that, right, something that it does include some variation and still decodes
00:29:53.160 | back to the original image, then we've created a much more robust model.
00:29:58.360 | And then that's something that we would help then, that we would hope then when we say,
00:30:02.400 | okay, well now decode some noise that it's going to decode to something better than this.
00:30:08.320 | So that's the idea of a VAE. So how do we get it to create, um, a log variance, which
00:30:18.520 | doesn't just go to zero? Um, well, we have a second, uh, loss term it's called the KL
00:30:26.040 | divergence loss. We've got a function called kld_loss. And what we're going to do is our VAE
00:30:31.280 | loss is going to take the binary cross entropy between the actual decoded bit. So that's
00:30:40.320 | input zero and the target. Okay. So that's, this is exactly the same as before as this
00:30:46.440 | binary cross entropy. And we're going to add it to this KLD loss, KL divergence. Now KL
00:30:52.720 | divergence, the details don't matter terribly much. What's important is when we look at
00:30:57.560 | the KLD loss, it's getting passed the input and the targets, but if you look, it's not
00:31:03.520 | actually using the targets at all. So if we pull out in the, the input into its three
00:31:10.840 | pieces, which is our predicted image, our mu and our log variance, we don't use this
00:31:17.000 | either. So the BCE loss only uses the predicted image and the actual image. The KL divergence
00:31:23.920 | loss only uses mu and log variance. And all it does is it returns a number, which says,
00:31:35.360 | um, for each item in the batch, um, is mu close to zero and is log variance close to
00:31:41.920 | one. How does it do that? Well, for mu, it's very easy. Mu squared. So if mu is close to
00:31:50.920 | zero, then minimizing mu squared does exactly that, right? Um, if mu is one, then mu squared
00:31:57.360 | is one. If mu is minus one, mu squared is one. If mu is zero, mu squared is zero. That's
00:32:03.320 | the lowest you can get for a squared. Um, okay. So we've got a mu squared piece here, um,
00:32:14.440 | and we've got a dot mean. So we're just taking, that's just basically taking the mean of all
00:32:17.720 | the mus. And then there's another piece, which is we've got log variance minus e to the power
00:32:25.960 | of log variance. So if we look at that, so let's just grab a bunch of numbers between
00:32:33.640 | neg three and three and do number minus e to the power of that number. Um, and I'm just
00:32:40.160 | going to pop in the one plus and the point five times as well. They don't matter much. And
00:32:44.120 | you can see that's got a minimum of zero. So when that's a minimum of zero, e to the
00:32:51.840 | power of that, which is what we're going to be using actually half times e to the power
00:32:57.800 | of that, but that's okay. Is what we're going to be using in our, um, dot forward method.
00:33:05.400 | That's going to be e to the power of zero, which is going to be one. So this is going
00:33:11.840 | to be minimized where, um, log variance exp equals one. So therefore this whole piece
00:33:21.000 | here will be minimized when mu is zero and LV is also zero. Um, and, and so therefore
00:33:31.600 | LV e to the power of LV is one. Now, the reason that it's specifically this form is basically
00:33:40.280 | because, um, there's a specific mathematical thing called the KL divergence, which compares
00:33:48.800 | how similar two distributions are. And so the normal distribution can be fully characterized
00:33:54.640 | by its mean and its variance. And so this is actually more precisely calculating the
00:34:00.460 | similarity that specifically the KL divergence between the actual mu and LV that we have
00:34:09.720 | and a distribution with a mean of zero and a variance of one. Um, um, but you can see
00:34:16.920 | hopefully why conceptually we have this mu.pow(2) and why we have this lv minus
00:34:26.480 | lv.exp() here. Um, so that is our VAE loss. Did you guys have anything to add to any of
00:34:38.720 | that description? So maybe to highlight the, the, the objective of this is to say rather
00:34:44.020 | than having it so that the exact point that an input is encoded to decodes back to that
00:34:50.160 | input, we're saying, number one, the space around that point should also decode to that
00:34:55.080 | input because we're going to try and force some variance. And number two, the overall
00:34:58.700 | variance should be like, yeah, the overall space that it uses should be roughly zero
00:35:05.280 | mean and unit variance, right? So instead of being able to map each input to an
00:35:12.000 | arbitrary point and then decode only that exact point to an input, we're now mapping them
00:35:16.080 | to like a restricted range. And we're saying that not, not just each point, but its surroundings
00:35:20.480 | as well should also decode back to something that looks like that image. Um, and that's
00:35:25.200 | trying to like condition this latent space to be much nicer so that any arbitrary point
00:35:29.960 | within that, um, range will hopefully map to something useful, which is a harder problem
00:35:35.540 | to solve, right? So we would expect given that this is exactly the same architecture,
00:35:41.020 | we would expect its ability to actually decode would be worse than our previous attempt because
00:35:48.960 | it's a harder problem that we're trying to solve - we've got random
00:35:52.200 | numbers in there as well now - but we're hoping that its ability to generate images will
00:35:56.280 | improve. Um, thanks, Jono. Okay. So I actually asked Bing about this, um, which is just,
00:36:07.760 | this is more of an example of, I think, you know, now that we've got GPT-4
00:36:12.480 | and Bing and stuff, I find they're pretty good at answering questions that like I wanted
00:36:17.660 | to explain to students what would happen if the variance of the latents was very low or
00:36:21.960 | what if they were very high? So why do we want them to be one? And I thought like, Oh
00:36:25.760 | gosh, this is hard to explain. So maybe Bing can help. So I actually thought it's pretty
00:36:30.800 | good. So I'll just say what Bing said. So Bing says, if the variance of the latents
00:36:34.480 | is very low, then the encoder distribution would be very peaked and concentrated around
00:36:41.320 | the mean. So that was the thing we were describing earlier. If we had trained this without the
00:36:45.800 | KLD loss at all, right, it would probably make the variance zero. And so therefore the
00:36:51.200 | latent space would be less diverse and expressive and limit the ability of the decoder to reconstruct
00:36:56.280 | the data accurately, make it harder to generate new data that's different from the training
00:37:00.560 | data, which is exactly what we're trying to do. And if the variance is very high, then
00:37:07.000 | the encoder would be very spread out and diffuse. It would be more, the latents would be more
00:37:11.440 | noisy and random, make it easier to generate new data that's unrealistic or nonsensical.
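Putting the loss just discussed into code, as a minimal sketch that assumes the model returns (prediction, mu, lv) as above: the BCE term compares the reconstruction to the target, and the KL term pushes mu towards zero and the variance exp(lv) towards one.

```python
import torch.nn.functional as F

def bce_loss(inp, tgt):
    pred, mu, lv = inp
    return F.binary_cross_entropy_with_logits(pred, tgt)

def kld_loss(inp, tgt):                    # note: tgt is not used at all
    pred, mu, lv = inp
    return -0.5 * (1 + lv - mu.pow(2) - lv.exp()).mean()

def vae_loss(inp, tgt):
    return bce_loss(inp, tgt) + kld_loss(inp, tgt)
```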
00:37:19.560 | Okay. So that's why we want it to be exactly at a particular point. So when we train this,
00:37:27.960 | we can just pass VAE loss as our loss function, but it'd be nice to see how well it's going
00:37:33.280 | at reconstructing the original image and how it's going at creating a zero one distribution
00:37:42.340 | data separately. So what I ended up doing was creating just a really simple thing called
00:37:49.660 | FuncMetric, which I derived from the capital-M Mean class in - just trying to
00:38:02.400 | find it here - torcheval.metrics. So they've already got something that can just
00:38:08.160 | calculate means. So obviously this stuff's all very simple and we've created our own
00:38:11.600 | metrics class ourselves back a while ago. And since we're using torcheval, I thought
00:38:15.440 | this is useful to see how we can create one, a custom metric where you can pass in some
00:38:20.640 | function to call before it calculates the mean. So if you call, so you might remember
00:38:28.340 | that the way torcheval works is it has this thing called update, which gets passed the input
00:38:32.400 | and the targets. So I add to the weighted sum, the result of calling some function on
00:38:39.360 | the input and the targets. So we want two new metrics: one we're going
00:38:50.080 | to print out as KLD, which is a FuncMetric on the KLD loss, and one we're going to print out
00:38:54.800 | as BCE, which is a FuncMetric on the BCE loss. And so when we call the learner,
00:39:02.440 | the loss function we'll use is VAE loss, but we're going to pass in as metrics, this additional
00:39:14.160 | metrics to print out. So it's just going to print them out. And in some ways it's a little
00:39:17.920 | inefficient because it's going to calculate KLD loss twice and BCE loss twice, one to
00:39:23.000 | print it out and one to go into the, you know, actual loss function, but it doesn't take
00:39:27.560 | long for that bit. So I think that's fine. So now when we call learn.fit, you can see
00:39:33.040 | it's printing them all out. So the BCE that we got last time was 0.26. And so this time,
00:39:42.400 | yeah, it's not as good. It's 0.31 because it's a harder problem and it's got randomness
00:39:47.400 | in it. And you can see here that the BCE and KLD are pretty similar scale when it starts.
00:39:56.000 | That's a good sign. If they weren't, you know, I could always in the loss function scale
00:40:01.240 | one of them up or down, but they're pretty similar to start with. So that's fine. So
00:40:06.760 | we train this for a while and then we can use exactly the same code for sampling as
00:40:12.160 | before. And yeah, as we suspected, its ability to decode is worse. So it's actually not capturing
00:40:21.240 | the LE at all, in fact, and the shoes got very blurry. But the hope is that when we
00:40:30.280 | call it on noise called the decoder on random noise, that's much better. We're getting,
00:40:36.920 | it's not amazing, but we are getting some recognizable shapes. So, you know, VAEs are,
00:40:45.080 | you know, not generally going to get you as good a results as diffusion models are, although
00:40:52.360 | actually if you train really good ones for a really long time, they can be pretty impressive.
00:40:57.240 | But yeah, even in this extremely simple, quick case, we've got something that can generate
00:41:01.800 | recognizable items of clothing. Did you guys want to add anything before we move on to
00:41:08.520 | the stable diffusion VAE? Okay. So this VAE is very crappy. And as we mentioned, one of
00:41:24.680 | the key reasons to use a VAE is actually that you can benefit from all the compute time
00:41:30.840 | that somebody else has put into training a good VAE.
00:41:36.240 | Just also like one thing when you say good VAE, the one that we've trained here is good
00:41:42.160 | at generating because it maps down to this one 200-dimensional vector and then
00:41:47.040 | back in a very useful way. And like, if you look at VAEs for generating, they'll often
00:41:52.200 | have a pretty small dimension in the middle and it'll just be like this vector that gets
00:41:57.560 | mapped back up. And so VAE that's good for generating is slightly different to one that's
00:42:01.760 | good for compressing. And like the stable diffusion one, we'll see, has this spatial component
00:42:06.440 | still - it doesn't map it down to a single vector, it maps it down to 64 by 64 or whatever.
00:42:12.840 | And I think that's smaller than the original, but for generating, we can't just put random
00:42:17.160 | noise in there and hope like a cohesive image will come out. So it's less good as a generator,
00:42:23.920 | but it is good because it has this like compression and reconstruction ability.
00:42:27.480 | Cool. Yeah. So let's take a look. Now, to demonstrate this, we want to move to a more
00:42:38.120 | difficult task, because we want to show off how using latents lets us do stuff we couldn't
00:42:44.760 | do well before. So the more difficult task we're going to do is generating bigger images
00:42:53.080 | and specifically generate images of bedrooms using the LSUN bedrooms dataset. So LSUN
00:43:01.960 | is a really nice dataset, which has many, many millions of images across 10 scene categories
00:43:17.920 | and 20 object categories. And so it's very rare for people to use the object categories,
00:43:25.600 | to be honest, but people quite often use the scene categories. The only problem is they
00:43:31.440 | can be extremely slow to download, because the website they come from is very
00:43:35.200 | often down. So what I did was I put a subset of 20% of them onto AWS. They kindly provide
00:43:46.000 | some free dataset hosting for our students. And also the original LSUN is in a slightly
00:43:52.920 | complicated form. It's in something called an LMDB database. And so I turned them into
00:43:56.520 | just normal images in folders. So you can download them directly from the AWS dataset
00:44:04.480 | site that they've provided for us. So I'm just using fastcore to save it and then using
00:44:13.200 | Python's shutil to unpack the gzipped tar file. Okay. So that's given us - once that runs, which
00:44:23.520 | is going to take a long time. And, you know, it might be, you know, even more reliable
00:44:35.080 | just to do this in the shell with wget or aria2c or something than doing it through
00:44:41.240 | Python. So this will work, but if it's taking a long time or whatever, maybe just delete
00:44:44.760 | it and do it in the shell instead. Okay. So then I thought, all right, how do we turn
00:44:54.880 | these into latents? Well, we could create a dataset in the usual way. It's going to
00:45:04.480 | have a length. So we're going to grab all the files. So glob is built into Python,
00:45:11.920 | and we'll search, in this case, for star dot jpeg. And if you've got star star slash,
00:45:19.280 | that's going to search recursively as long as you pass recursive. So we're going to search
00:45:24.480 | for all of the jpeg files inside our data slash bedroom folder. So that's what this is
00:45:36.160 | going to do. It's going to put them all into the files attribute. And so then when we get
00:45:41.000 | an item, the ith item, it will find the ith file. It will read that image. So this is
00:45:48.400 | PyTorch's read image. It's the fastest way to read a jpeg image. People often use PIL,
00:45:58.040 | but it's quite hard to find a really well optimized PIL version that's really compiled
00:46:03.400 | fast, whereas the PyTorch Torch Vision team have created a very, very fast read image.
00:46:11.320 | That's why I'm using theirs. And if you pass in ImageReadMode.RGB, it will automatically
00:46:18.600 | turn any one channel, black and white images, into three channel images for you. Or if there
00:46:23.100 | are four channel images with transparency, it will turn those. So this is a nice way
00:46:26.960 | to make sure they're all the same. And then this turns it into floats from nought to one.
00:46:35.120 | And these images are generally very close to 256 by 256 pixels. So I just crop out
00:46:40.480 | the top 256 by 256 bit, because I didn't really care that much. And we do need them to all
00:46:49.000 | be the same size in order that we can then pass them to the stable diffusion VAE decoder
00:46:55.640 | as a batch. Otherwise it's going to take forever. So I can create a data loader that's going
00:47:01.680 | to go through a bunch of them at a time. So 64 at a time. And use however many CPUs I
00:47:10.320 | have as the number of workers. It's going to do it in parallel. And so the parallel
00:47:16.000 | bit is the bit that's actually reading the JPEGs, which is otherwise going to be pretty
00:47:21.000 | slow. So if we grab a batch, here it is. Here's what it looks like. Generally speaking, they're
00:47:27.000 | just bedrooms, although we've got one pretty risque situation in the bedroom. But on the
00:47:32.280 | whole, they're safe for work. This is the first time I've actually seen an actual
00:47:36.720 | bedroom scene taking place, as it were. All right. So as you can see, this mini batch
00:47:44.320 | of, if I just grab the first 16 images, has three channels and 256 by 256 pixels. So that's
00:47:56.560 | how big that is for 16 images. So that's 3,145,728 - about 3.145 million floats - to represent this.
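A sketch of the dataset and dataloader just described: recursively glob the JPEGs, read each with torchvision's fast JPEG reader forcing three channels, scale to [0, 1] floats, and crop to 256 by 256 so the images can be batched. The path, glob pattern and worker count are illustrative.

```python
from glob import glob
from torch.utils.data import Dataset, DataLoader
from torchvision.io import read_image, ImageReadMode

class BedroomDS(Dataset):
    def __init__(self, path):
        self.files = glob(f"{path}/**/*.jpg", recursive=True)
    def __len__(self):
        return len(self.files)
    def __getitem__(self, i):
        img = read_image(self.files[i], mode=ImageReadMode.RGB).float() / 255
        return img[:, :256, :256]              # crop the top-left 256x256 region

ds = BedroomDS("data/bedroom")
dl = DataLoader(ds, batch_size=64, num_workers=8)   # workers do the JPEG decoding in parallel
```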
00:48:10.120 | Okay. So as we learned in the first lesson of part two, we can grab an autoencoder directly
00:48:20.080 | using diffusers using from pre-trained. We can pop it onto our GPU. And importantly,
00:48:28.320 | we don't have to say with torch.no_grad anymore if we call requires_grad_(False). And remember
00:48:35.720 | this neat trick in PyTorch, if it ends in an underscore, it actually changes the thing
00:48:39.840 | that you're calling in place. So this is going to stop it from computing gradients, which
00:48:45.040 | would take a lot of time and a lot of memory otherwise. So let's test it out. Let's encode
00:48:52.760 | our mini batch. And so just like Johnno was saying, this has now made it much smaller.
00:48:58.920 | In our batch of 16, each image is now a four channel 32 by 32. So if we can
00:49:06.480 | compare the previous size to the new size, it's 48 times smaller. So that's 48 times
00:49:13.960 | less memory it's going to need. And it's also going to be a lot less compute for a convolution
00:49:19.360 | to go across that image. So it's no good unless we can turn it back into the original image.
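As a hedged sketch of the encode/decode round trip being shown here: the model id below is one of the publicly released stable diffusion VAEs, and details such as mapping images to [-1, 1] and the latent scaling factor are omitted.

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema").cuda()
vae.requires_grad_(False)                     # trailing underscore: modifies the model in place

xb = torch.rand(16, 3, 256, 256).cuda()       # stand-in for a batch of bedroom images
latents = vae.encode(xb).latent_dist.mean     # (16, 4, 32, 32): 48 times fewer numbers
recon = vae.decode(latents).sample            # back to (16, 3, 256, 256)
```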
00:49:26.520 | So let's just have a look at what it looks like first. Now it's a four channel image,
00:49:29.540 | so we can't naturally look at it. But what I could do is just grab the first three channels.
00:49:36.600 | And then they're not going to be between 0 and 1. So if I just do dot sigmoid, now they're
00:49:41.320 | between 0 and 1. And so you can see that our risque bedroom scene, you can still recognize
00:49:46.540 | it. Or this bedroom, this bed here, you can still recognize it. So there's still that
00:49:53.400 | kind of like the basic geometry is still clearly there. But it's, yeah, it's clearly changed
00:50:00.840 | it a lot as well. So importantly, we can call decode on this 48 times smaller tensor. And
00:50:13.560 | it's really, I think, absolutely remarkable how good it is. I can't tell the difference
00:50:22.840 | to the original. Maybe if I zoom in a bit. Her face is a bit blurry. Was her face always
00:50:34.760 | a bit blurry? No, it was always a bit blurry. First, second, third. Oh, hang on. Did that
00:50:44.760 | used to look like a proper ND? Yeah, OK. So you can see this used to say that clearly
00:50:49.360 | there's an ND here. And now you can't see those letters. So and this is actually a classic
00:50:56.800 | thing that's known for this particular VAE is it's not able to regenerate writing correctly
00:51:06.260 | at small font sizes. I think it's also - like, here the faces are
00:51:12.360 | already pretty low resolution, but if you were at a higher resolution, the faces also
00:51:16.320 | would probably not be reconstructed appropriately. OK, cool. But overall, yeah, it's done a great
00:51:24.400 | job. A couple of other things I wanted to note: like you mentioned,
00:51:29.440 | it's a factor of 48 decrease. Oftentimes people refer mostly to the spatial resolution.
00:51:37.280 | So it's going from 256 by 256 to 32 by 32 - that's a factor of eight. So
00:51:45.200 | they sometimes will note it as, I think, f8 or something like this. They'll note
00:51:48.720 | the spatial resolution. So sometimes you may see that written out like that. And of course,
00:51:54.760 | it is an eight squared decrease in the number of pixels, which is interesting. Right. Right.
00:52:02.200 | And then the other thing I want to note was that the VAE is also trained with a perceptual
00:52:09.480 | loss objective, as well as, technically, a discriminator - a GAN - objective.
00:52:16.840 | I don't know if you were going to go into that later now. So, yeah, so perceptual loss,
00:52:22.440 | we've we've already discussed. Right. So the VAE is going to you know, when they trained
00:52:28.520 | it. So I think this was trained by CompVis, right - you know, Robin and gang - and
00:52:40.200 | stability.ai donated compute for that. And - to be clear, actually - the VAE
00:52:48.160 | was actually trained separately. And it's actually trained on the Open Images dataset.
00:52:53.320 | And it was just this VAE that they trained by themselves on, you know, a small subset
00:52:58.280 | of data. But because the VAE is so powerful, it's actually able to be applied to all these
00:53:04.360 | other datasets as well. Okay, great. Yeah. So they would have had a KL divergence
00:53:13.960 | loss and they would have either had an MSE or BCE loss. I think it might have been an
00:53:17.480 | MSE loss. They also had a perceptual loss, which is the thing we learned about when we
00:53:22.880 | talked about super resolution, which is where when they compared the the output images to
00:53:30.080 | the original images, they would have run that through a, you know, ImageNet trained or similar
00:53:38.440 | classifier and confirmed that the activations they got through that model were similar. And
00:53:45.560 | then the final bit, as Tanishk was mentioning, is the adversarial loss, which is also known
00:54:06.040 | as a GAN loss. So a GAN is a generative adversarial network. And the GAN loss -
00:54:17.800 | it is actually, more specifically, what's called a patch-wise GAN loss - what
00:54:26.200 | it does is it takes like a little section of an image, right. And what they've done
00:54:32.120 | is they train - let's just simplify it for a moment and imagine that they've pre-trained
00:54:32.120 | a classifier, right, where they've basically got something that you can pass it a real,
00:54:38.680 | you know, patch from a bedroom scene and a and a fake patch from a bedroom scene. And
00:54:53.400 | they both go into the what's called the discriminator. And this is just a normal, you know, ResNet
00:55:08.160 | or whatever, which basically outputs something that either says, yep, the the image is real
00:55:22.040 | or nope, the image is fake. So sorry, I said it passes in two things. You just that was
00:55:26.640 | wrong. You just pass in one thing and it returns either it's real or it's fake. And specifically,
00:55:30.880 | it's going to give you something like the probability that it's real. There is another
00:55:36.440 | version. I don't think it's what they use. You pass in two and it tells you which one's
00:55:40.080 | more real, relatively. Do you remember, Tanishk? Is it a relativistic GAN or a normal GAN? I think it's
00:55:45.160 | a normal one. Yeah. So the relativistic GAN is when you pass in two images and it says which
00:55:49.000 | is more real. The one we think they use, if we remember correctly, is a regular GAN, which
00:55:54.000 | just tells you the probability that it's real. And so you can just train that by passing
00:55:59.600 | in real images and fake images and having it learn to classify which ones are real and
00:56:04.520 | which ones are fake. So now that once you've got that model trained, then as you train
00:56:12.160 | your GAN, you pass in the patches of each image into the discriminator. So let's call
00:56:21.480 | D here, right? And it's going to spit out the probability that that's real. And so if it's
00:56:29.120 | spat out 0.1 or something, then you're like, oh, dear, that's terrible. Our VAE is spitting
00:56:38.580 | out pictures of bedrooms where the patches of it are easily recognized as not real. But
00:56:45.560 | the good news is that's going to generate derivatives, right? And so those derivatives
00:56:51.320 | then is going to tell you how to change the pixels of the original generated image to
00:56:57.480 | make it trick the GAN better. And so what it will do is it will then use those derivatives
00:57:05.280 | as per usual to update our VAE. And the VAE in this case is going to be called a generator,
00:57:16.120 | right? That's the thing that's generating the pixels. And so the generator gets updated
00:57:21.360 | to be better and better at tricking the discriminator. And after a while, what's going to happen
00:57:27.540 | is the generator is going to get so good that the discriminator gets fooled every time,
00:57:32.920 | right? And so then at that point, you can fine-tune the discriminator better by putting in your
00:57:39.880 | better generated images, right? And then once your discriminator learns again how to recognize
00:57:44.960 | the difference between real and fake, you can then use it to train the generator. And
00:57:50.600 | so this is kind of ping-ponging back and forth between the discriminator and the generator.
00:57:56.120 | Like when GANs were first created, people were finding them very difficult to train. And
00:58:04.040 | actually a method we developed at Fast AI, I don't know if we were the first to do it
00:58:08.520 | or not, was this idea of pre-training a generator just using perceptual loss and
00:58:16.160 | then pre-training a discriminator to be able to spot that generator's outputs, and then ping-ponging
00:58:20.320 | backwards and forwards between them. After that, basically whenever the discriminator
00:58:25.480 | got too good, switch to training the generator; any time the generator got too good, switch to training the
00:58:30.360 | discriminator. Nowadays, that's pretty standard, I think, to do it this way. And so, yeah,
00:58:38.560 | this GAN loss, which basically penalizes the generator for failing to fool the discriminator, is called
00:58:47.040 | an adversarial loss. To maybe motivate why you do this: if you
00:58:59.000 | just did it with a mean squared error or even a perceptual loss with such a high compression
00:59:05.880 | ratio, the VAEs tend to produce a fairly blurry output because it's not sure whether there's
00:59:11.200 | texture or not in this image or the edges aren't super well defined where they'll be because
00:59:17.520 | it's going from one four-dimensional thing up to this whole patch of the image. And so
00:59:24.800 | it tends to be a little bit blurry and hazy because it's kind of hedging its bets, whereas
00:59:29.980 | that's something that the discriminator can quite easily pick up: oh, it's blurry, it must
00:59:34.760 | be fake. And so having the discriminator, that is, the adversarial loss, is just kind of
00:59:39.480 | saying: even if you're not sure exactly where this texture goes, rather go with a sharper-
00:59:43.920 | looking texture that looks real than with some blurry thing that's just going to minimize
00:59:49.920 | your MSE. And so it tricks it into kind of faking this higher-resolution-looking, sharper
00:59:56.600 | output. Yeah. And I'm not sure if we're going to come
01:00:02.400 | back and train our own GAN at some point, but if you're interested in training your
01:00:10.660 | own GAN or-- you shouldn't call it a GAN, right? I mean, nowadays, we never really just
01:00:17.080 | use a GAN. We have an adversarial loss as part of a training process. So if you want
01:00:21.160 | to learn how to use adversarial loss in detail and see the code, Lesson 7 of the 2019 FastAI
01:00:28.520 | course part 1 has a walkthrough. So we have sample code there. And maybe given time,
01:00:35.400 | we'll come back to it. OK. So quite often, people will call the VAE
01:00:49.680 | encoder while they're training a model, which to me makes no sense, right? Because the encoded
01:00:55.360 | version of an image never changes, unless you are using data augmentation and want to
01:01:01.360 | encode augmented images. I think it makes a lot more sense
01:01:07.520 | to just do a single run through your whole training set and encode everything once. So
01:01:13.800 | naturally, the question then is, well, where do you save that? Because it's going to be
01:01:17.040 | a lot of RAM if you just leave it in RAM, and also, as soon as you restart your
01:01:22.040 | computer, you've lost all that work. There's a very nifty file format you can use called
01:01:27.680 | a memory mapped numpy file, which is what I'm going to use to save our latents. A memory
01:01:36.000 | mapped numpy file is basically-- what happens is you take the memory in RAM that numpy would
01:01:44.480 | normally be using, and you literally copy it onto the hard disk, basically. That's what
01:01:53.560 | they mean by memory mapped. There's a mapping between the memory in RAM and the memory in
01:01:58.800 | hard disk. And if you change one, it changes the other, and vice versa. They're kind of
01:02:02.200 | two ways of seeing the same thing. And so if you create a memory mapped numpy array,
01:02:10.280 | then when you modify it, it's actually modifying it on disk. But thanks to the magic of your
01:02:16.120 | operating system, it's using all kinds of beautiful caching and stuff to not make that
01:02:22.620 | slower than using a normal numpy array. And it's going to be very clever at-- it doesn't
01:02:31.680 | have to store it all in RAM. It only stores the bits in RAM that you need at the moment
01:02:36.360 | or that you've used recently. It's really nifty at caching and stuff. So it's kind of--
01:02:41.360 | it's like magic, but it's using your operating system to do that magic for you. So we're
01:02:46.760 | going to create a memory mapped file using np.memmap. And so it's going to be stored somewhere
01:02:53.660 | on your disk. So we're just going to put it here. And we're going to say, OK, so create
01:02:58.840 | a memory map file in this place. It's going to contain 32-bit floats. So write the file.
01:03:06.000 | And the shape of this array is going to be the size of our data set, so 303,125 images.
01:03:13.840 | And each one is 4 by 32 by 32. OK. So that's our memory mapped file. And so now we're going
01:03:22.200 | to go through our data loader, one mini batch of 64 at a time. And we're going to VAE encode
01:03:32.120 | that mini batch. And then we're going to grab the means from its latents. We don't want
01:03:40.000 | random numbers; we want the actual midpoints, the means. So this is using the diffusers
01:03:48.200 | version of that VAE. So pop that onto the CPU after we're done. And so that's going
01:03:55.120 | to be a mini batch of size 64 as a PyTorch tensor. Let's turn that into NumPy, because PyTorch doesn't
01:04:01.360 | have a memory mapped thing as far as I'm aware, but NumPy does. And so now that we've
01:04:05.480 | got this memory mapped array called a, then for the first batch, everything from 0 up to 64, not
01:04:18.120 | including the 64, that whole sub-part of the array, is going to be set to the encoded version.
01:04:24.040 | So it looks like we're just changing it in memory. But because this is a magic memory
01:04:30.160 | mapped file, it's actually going to save it to disk as well. So yeah, that's it. Amazingly
01:04:36.840 | enough. That's all you need to create a memory mapped NumPy array of our latents. When you're
01:04:43.120 | done, you actually have to call .flush(). And that's just something that says: anything
01:04:47.200 | that's just in cache at the moment, make sure it's actually written to disk. And then I
01:04:53.760 | delete it, because I just want to make sure that I then read it back correctly. So all of that's
01:04:58.720 | only going to happen once, if the path doesn't exist. And then after that, this whole thing
01:05:04.120 | will be skipped. And instead, we're going to call np.memmap again with the same path, the
01:05:09.640 | same data type and the same shape, but this time we're going to read it: mode='r'
01:05:14.400 | means read it. And so let's check it. Let's just grab the first 16 latents that we read
01:05:24.000 | and decode them. And there they are. OK. So this is like not a very well-known technique,
01:05:34.960 | I would say, sadly. But it's a really good one. You might be wondering like, well, what
01:05:41.240 | about like compression? Like shouldn't you be zipping them or something like that? But
01:05:45.720 | actually remember, these latents are already-- the whole point is they're highly compressed.
01:05:52.200 | So generally speaking, zipping latents from a good VAE doesn't do much. Because they almost
01:06:01.400 | look a bit random-number-ish. OK. So we've now saved our entire LSUN bedroom dataset (that's
01:06:09.560 | the 20% subset, the bit that I've provided) as latents. So we can now run it through--
01:06:18.040 | this is a nice thing. We can use exactly the same process from here on in as usual. OK.
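Putting those memmap steps together, a rough sketch of the whole encode-and-cache pass might look like this, assuming the diffusers `vae` and an image DataLoader `dl` as above; the path and the exact loop details are illustrative:

```python
import os
import numpy as np
import torch

mmap_path = 'data/bedroom_latents.npy'              # illustrative location
n, latent_shape = 303_125, (4, 32, 32)              # dataset size and latent shape from above

if not os.path.exists(mmap_path):
    # mode='w+' creates the file; writes to `a` go through to disk via the OS cache
    a = np.memmap(mmap_path, dtype=np.float32, mode='w+', shape=(n, *latent_shape))
    i = 0
    for xb in dl:                                   # assuming `dl` yields batches of images
        with torch.no_grad():
            latents = vae.encode(xb.cuda()).latent_dist.mean   # the means, not random samples
        a[i:i + len(xb)] = latents.cpu().numpy()
        i += len(xb)
    a.flush()                                       # make sure cached writes hit the disk
    del a

# Re-open read-only with the same dtype and shape
lats = np.memmap(mmap_path, dtype=np.float32, mode='r', shape=(n, *latent_shape))
```

Because the operating system handles the caching, reading slices of `lats` later is about as fast as a normal in-memory array.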
01:06:23.560 | So we've got our noisify function and our usual collate function. Now, the latents have a standard deviation much higher
01:06:34.880 | than 1. So if we divide them by about 5, that takes them back to a standard
01:06:39.080 | deviation of about 1. I think in the paper they use a factor of 0.18 or something, but this
01:06:47.120 | is close enough to give a unit standard deviation. So we can split it into a training
01:06:54.960 | and a validation set: just grab the first 90% of the data for the training set and the last 10% for
01:07:01.080 | the validation set. So those are our data sets. We use a batch size of 128. So now we
01:07:07.620 | can use our data loaders class we created with the getDLs we created. So these are all
01:07:11.520 | things we've created ourselves with the training set, the validation set, the batch size, and
01:07:17.680 | our collation function. So yeah, it's kind of nice. It's amazing how easy it is. A data
01:07:26.360 | set has the same interface as a NumPy array or a list or whatever. So we can literally
01:07:34.260 | just use the NumPy array directly as a data set, which I think is really neat. This is
01:07:40.120 | why it's useful to know about these foundational concepts, because you don't have to start
01:07:45.400 | thinking like, oh, I wonder if there's some torch vision thing to use memmap NumPy files
01:07:50.720 | or something. It's like, oh, wait, they already do provide a data set interface. I don't have
01:07:55.040 | to do anything. I just use them. So that's pretty magical. So we can test that now by
01:08:02.000 | grabbing a batch. And so this is being noisified. And so here we can see our noisified images.
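As a recap of the setup just described, a minimal sketch, assuming the `noisify`, `DataLoaders` and `get_dls` helpers built in the earlier notebooks; all names here are illustrative:

```python
import numpy as np
import torch

split = int(len(lats) * 0.9)
train_ds, valid_ds = lats[:split], lats[split:]        # memmap slices used directly as datasets

def collate_latents(batch):
    # Stack latents, scale to roughly unit std (the paper uses a ~0.18 factor;
    # dividing by 5 is close enough), then noisify as usual.
    x0 = torch.stack([torch.from_numpy(np.array(b)) for b in batch]) / 5.0
    return noisify(x0)

dls = DataLoaders(*get_dls(train_ds, valid_ds, bs=128, collate_fn=collate_latents))
(xt, t), eps = next(iter(dls.train))                   # one noisified batch of latents
```

The memmap slices work as datasets because, as noted above, a NumPy array already provides the same indexing interface a dataset needs.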
01:08:10.040 | And here's something crazy: we can actually decode noisified latents. I guess this one
01:08:19.600 | wasn't noisified much, because it's a recognizable bedroom; this is what
01:08:24.040 | happens when you decode pure random noise; and this one is something in between. So I think that's pretty
01:08:30.320 | fun. Yeah, this next bit is all just copied from our previous notebook, create a model,
01:08:39.360 | organize it, train for a while. So this took me a few hours on a single GPU. Everything
01:08:45.000 | I'm doing is on a single GPU. Literally nothing in this course, other than the stable diffusion
01:08:50.000 | stuff itself, is trained on more than one GPU. The loss is much higher than usual. And that's
01:08:57.600 | not surprising, because it's trying to generate latent pixels, where it's much more
01:09:05.360 | precise as to exactly what value it wants. There aren't lots of pixels where the ones next
01:09:11.200 | to each other are really similar, or where the whole background looks the same or whatever; a lot
01:09:15.200 | of that stuff has been compressed out. It's a more difficult thing to predict latent pixels.
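The sampling step described next is the usual DDIM loop with a VAE decode bolted on the end. A rough sketch, assuming a `sample` helper like the one from the earlier DDIM notebook and the diffusers `vae`; the multiply by 5 undoes the earlier divide by 5:

```python
import torch

with torch.no_grad():
    latents = sample(model, sz=(16, 4, 32, 32))        # sample in latent space as usual
    imgs = vae.decode((latents * 5).cuda()).sample     # then decode latents back to pixels
imgs = ((imgs.clamp(-1, 1) + 1) / 2).cpu()             # roughly map to [0, 1] for display
```

The clamp-and-rescale at the end is just for display; the exact post-processing depends on how the images were scaled when the VAE was trained.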
01:09:24.240 | So now we can sample from it in exactly the same way that we always have, using DDIM. But
01:09:29.920 | now we need to make sure that we decode the result, because the things it samples are latents,
01:09:39.280 | because the thing that we asked it to learn to predict is latents. And so now we can
01:09:45.640 | take a look, and we have bedrooms. And some of them look pretty good. I think this one
01:09:54.860 | looks pretty good. I think this one looks pretty good. This one, I don't have any idea
01:09:59.760 | what it is. And this one, like clearly there's bedroomy bits, but there's something, I don't
01:10:07.480 | know, there's weird bits. So the fact that we're able to create 256 by 256 pixel images
01:10:17.840 | where at least some of them look quite good, in a couple of hours (I can't remember how
01:10:23.340 | long it took to train, but it's a small number of hours on a single GPU) is something that
01:10:26.520 | was not previously possible. And in a sense we're totally cheating, because we're
01:10:33.360 | using the stable diffusion VAE to do a lot of the hard work for us. But that's fine,
01:10:40.720 | you know, because that VAE knows how to create all kinds of natural images and drawings and
01:10:45.600 | portraits and oil paintings or whatever. So you can, I think, work in that latent space
01:10:53.720 | quite comfortably. Yeah. Do you guys have anything you wanted to add about that? Oh, actually,
01:11:01.280 | Tanishk, you've trained this for longer; I only trained it for 25 epochs. How many hours did
01:11:06.160 | you train it for? Because you did a hundred epochs, right?
01:11:10.200 | Yes, I did a hundred epochs. I didn't keep track exactly, but I think it was about 15
01:11:14.400 | hours on an A100. A single A100. Yeah. I mean, the results, yeah, I'll show them. They're,
01:11:23.680 | I guess, maybe slightly better, but you know... I see, maybe. No, it is definitely
01:11:34.360 | slightly better. The good ones are certainly slightly better. Yeah. Yeah. Like the bottom
01:11:38.760 | left one is better than any of mine, I think. So it's possible. Maybe at this point, we
01:11:43.520 | just may need to use more data, I guess, cause I guess we were using a 20% subset. So maybe
01:11:48.560 | having more of that data to provide more diversity or something like that, maybe that might help.
01:11:53.000 | Yeah. Or maybe, have you tried doing the diffusers one for a hundred? No, I'm using this. Okay.
01:12:00.360 | Our code here. Yeah. So I've got, all right. So I'll share my screen if you want to stop
01:12:06.000 | sharing yours. So I do have, if we get around to this, maybe we can add the results back
01:12:17.440 | to this notebook. Cause I do have a version that uses diffusers. So everything else is
01:12:21.840 | identical (25 epochs), except for the model. For the previous one, I was using our
01:12:31.920 | own U-Net model, so I had to change the channels to four, and the number of filters,
01:12:37.760 | I think I might've increased it a bit. So then I tried using, yeah, the diffusers U-Net
01:12:46.400 | with whatever their defaults were. And so I got, what did I get here? 243; with diffusers
01:12:54.240 | I got a little bit better, 239. And yeah, I don't know if they're obviously better or
01:13:07.160 | not. Like, this is a bit weird. I think, actually, another thing we could try maybe
01:13:17.000 | is to do a hundred epochs, but use the number of channels and stuff that they used
01:13:23.040 | for stable diffusion. Because I think the defaults that diffusers uses are actually not the same
01:13:26.800 | as stable diffusion. So maybe we could try a stable-diffusion-matched U-Net for a hundred
01:13:32.080 | epochs. And if we get any nice results, maybe we can paste them into the bottom to show
01:13:36.400 | people. Yeah. Yeah. Cool. Yeah. Do you guys have anything else to add at this point? All
01:13:49.600 | right. So I'll just mention one more thought, in terms of a bit of an interesting project
01:13:56.680 | people could play with. I don't know if this is too crazy. I don't think it's been done
01:14:02.680 | before, but my thought was: do you remember there was a huge difference in our super resolution
01:14:08.720 | results when we used a pre-trained
01:14:12.840 | model and when we used perceptual loss, but particularly when we used a pre-trained model?
01:14:22.560 | I thought we could use a pre-trained model here too, but we would need a pre-trained latent model,
01:14:27.240 | right? We would want something where our, you know, downsampling backbone was a model
01:14:35.120 | pre-trained on latents. And so I just want to show you what I've done, and you guys, you
01:14:40.560 | know, if anybody watching wanted to try taking this further, I've just done the first bit
01:14:45.960 | for you to give you a sense, which is I've pre-trained an image net model, not tiny image
01:14:49.960 | net, but a full image net model on latents as a classifier. And if you use this as a
01:14:56.120 | backbone, you know, and also try maybe some of the other tricks that we found helpful,
01:15:00.400 | like having res nets on the cross connections. These are all things that I don't think anybody's
01:15:04.520 | done before. I don't know, the scientific literature is vast and I might've missed it,
01:15:09.160 | but I've not come across anybody do these tricks before. So obviously like we're, one
01:15:16.880 | of the interesting parts of this, which is designed to be challenging is that we're using
01:15:21.400 | bigger datasets now, but they're datasets that you can absolutely like run on a single
01:15:27.000 | GPU, you know, a few tens of gigabytes, which fits on any modern hard drive easily. So these
01:15:37.400 | like are good tests of your ability to kind of like move things around. And if you're
01:15:43.080 | somewhere that doesn't have access to a decent internet connection or whatever, this might
01:15:47.160 | be out of the question, in which case don't worry about it. But if you can, yes, try this
01:15:52.080 | because it's good practice, I think, to make sure you can use these larger datasets.
01:15:59.500 | So ImageNet itself, you can actually grab from Kaggle nowadays. They call it the
01:16:06.240 | object localization challenge, but actually this contains the full ImageNet dataset, or
01:16:12.040 | the version that's used for the ImageNet competition, so I think people generally call
01:16:17.360 | that ImageNet-1k. You just have to accept the terms, because that has, like, some distribution
01:16:22.740 | terms. Yeah, exactly. So you've got to kind of sign in and then join the competition and
01:16:28.040 | then yeah, accept the terms. So you can then download the dataset or you can also download
01:16:35.520 | it from Hugging Face. It'll be in a somewhat different format, but that'll work as well.
01:16:44.240 | So I think I grabbed my version from Kaggle. So on Kaggle, you know, it's just a zip file,
01:16:49.720 | you unzip it and it creates an ILSVRC directory, which I think is what they called the competition.
01:16:58.440 | Yeah, the ImageNet Large Scale Visual Recognition Challenge. Okay. So then inside there, there
01:17:07.320 | is a Data directory, and inside that there is CLS-LOC, and that's
01:17:11.400 | where everything's actually going to be. So just like before, I wanted to turn these all
01:17:16.680 | into latents. So in that directory I created a Latents subdirectory, and this
01:17:22.360 | time partly just to demonstrate how these things work, I wanted to do it a slightly different
01:17:27.000 | way. Okay. So again, we're going to create our pre-trained VAE, pop it on the GPU, turn
01:17:34.280 | off gradients for it and I'm going to create a dataset. Now, one thing that's a bit weird
01:17:40.200 | about this is that because this is really quite a big dataset, like it's got 1.3 million
01:17:49.520 | files, the thing where we glob '**/*.JPEG' takes a few seconds, you
01:17:57.800 | know, and particularly if you're doing this on, like, an AWS file system or something,
01:18:05.000 | it can take really quite a long time. On mine, it only took like three seconds, but I don't
01:18:09.320 | want to wait three seconds. So, you know, a common trick for these kinds of big things
01:18:13.940 | is to create a cache, which is literally just a list of the files. So that's what
01:18:19.560 | this is. So I decided that 'z pickle' means a gzipped pickle. What I do is: if
01:18:25.680 | the cache exists, we just gzip.open the file list. If it doesn't, we use glob exactly
01:18:32.760 | like before to find all the files, and then we also save a gzip file containing a
01:18:40.680 | pickle.dump of the files. pickle.dump is what we use in Python to take basically any Python
01:18:45.960 | object, a list of dictionaries, a dictionary of lists, whatever you like, and save it.
01:18:52.000 | And it's super fast, right? And I use gzip with compresslevel=1 to
01:18:58.080 | compress it pretty well, but pretty fast. So this is a really nice way to create a little
01:19:05.480 | cache like that. The dataset itself is the same as always, and its get item is going to grab the file.
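A rough sketch of that cached file list, plus the images dataset whose get item is described next; every path, name and helper here is illustrative rather than the notebook's exact code:

```python
import gzip, pickle
from pathlib import Path

import torchvision.transforms.functional as TF
from torchvision.io import read_image

def cached_file_list(root, cache_name='files.zpkl'):
    # Glob once, then reuse a gzipped pickle of the file list on later runs.
    cache = Path(root) / cache_name
    if cache.exists():
        with gzip.open(cache) as f: return pickle.load(f)
    files = list(Path(root).glob('**/*.JPEG'))
    with gzip.open(cache, 'wb', compresslevel=1) as f: pickle.dump(files, f)
    return files

class ImagesDS:
    def __init__(self, files): self.files = files
    def __len__(self): return len(self.files)
    def __getitem__(self, i):
        img = read_image(str(self.files[i])).float() / 255.    # read the JPEG and make it float
        side = min(img.shape[-2:])
        img = TF.center_crop(img, side)                        # lazily centre-crop to a square...
        return TF.resize(img, (256, 256), antialias=True)      # ...then resize to 256 by 256
```

The crop and resize at the end of `__getitem__` correspond to the centre-cropping approach described next.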
01:19:13.200 | It's going to read it in, turn it into a float. And what I did here was, you know, I'm being
01:19:19.040 | a little bit lazy, but I just decided to center crop the middle, you know, so let's say it
01:19:25.160 | was a 300 by 400 file, it's going to center crop the middle 300 by 300 section, and then
01:19:32.080 | resize it to 256 by 256, so they'll all be the same size. So yeah, we can now (oh, I managed
01:19:42.480 | to create the VAE twice) just confirm: I can grab a batch from that data
01:19:48.840 | loader, encode it, and here it is; and then decode it again, and here it is. So the first
01:19:55.560 | category must have been computer or something. So here, as you can see, the VAE is doing
01:20:00.560 | a good job of decoding pictures of computers. So I can do something really very similar
01:20:07.400 | to what we did before. If we haven't got that destination directory yet, create it, go through
01:20:12.200 | our data loader, encode a batch. And this time I'm not using a memmapped file, I'm actually
01:20:17.520 | going to save separate NumPy files for each one. So go through each element of the batch,
01:20:23.920 | each item. So I'm going to save it into the destination directory, which is the Latents
01:20:29.680 | directory. And I'm going to give it exactly the same path as the original one contained,
01:20:34.800 | because it contains the folder of what the label is. Make sure that the directory exists,
01:20:45.640 | that we're saving it to, and save that just as a NumPy file. This is another way to do
01:20:52.080 | it. So this is going to be a separate NumPy file for each item. Does that make sense so
01:20:58.960 | far? Okay, cool. So I could create a thing called a NumPy dataset, which is exactly
01:21:06.680 | the same as our images dataset, but to get an item, we don't have to, you know, open
01:21:13.240 | a JPEG anymore; we just call np.load. So this is a nice way to take something you've
01:21:18.120 | already got and change it slightly. So it's going to return the...
01:21:24.040 | Why did you do this versus the memory map file, Jeremy? Just out of interest?
01:21:29.880 | Sorry?
01:21:30.880 | Why did you do this versus the memory map file? Was it just to show a different way?
01:21:34.160 | Just to show a different way. Yeah. Yeah. Absolutely no particularly good reason, honestly. Yeah,
01:21:43.240 | I like to kind of like demonstrate different approaches. And I think it's good for people's
01:21:47.720 | Python coding if you make sure you understand what all the lines of code do. Yeah, they
01:21:53.080 | both work fine, actually. It's partly also for my own experimental interest. It's like,
01:21:59.560 | oh, which one seems to kind of feel better? Yeah. All right. So create training and validation
01:22:09.420 | data sets by grabbing all the NumPy files inside the training and validation folders.
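A sketch of the numpy-backed dataset and that train/validation split; the directory names here are assumptions about the layout, not taken from the notebook:

```python
from pathlib import Path
import numpy as np

class NumpyDS:
    # Same as the images dataset, except __getitem__ just np.load's a saved latent.
    def __init__(self, files): self.files = list(files)
    def __len__(self): return len(self.files)
    def __getitem__(self, i):
        f = self.files[i]
        return np.load(f), f.parent.name     # the latent plus its folder name (the label)

latents_root = Path('ILSVRC/latents')        # illustrative location of the saved .npy tree
train_ds = NumpyDS(latents_root.glob('train/**/*.npy'))
valid_ds = NumpyDS(latents_root.glob('val/**/*.npy'))
```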
01:22:15.680 | And then I'm going to just create a training data loader for the training data set, just
01:22:20.960 | to see what the mean and standard deviation are on the channel dimension. So I take the mean over every
01:22:26.000 | dimension except the channel dimension. And so there it is. And as you can see there,
01:22:30.800 | the mean and standard deviation are not close to zero and one. So we're going to store away
01:22:35.800 | that mean and standard deviation so that we can then... We've seen the transform dataset before.
01:22:42.520 | This is just applying a transform to a dataset. We're going to apply the normalization
01:22:47.680 | transform. In the past, we've used our own normalization; TorchVision has one
01:22:53.800 | as well. So this is just demonstrating how to use TorchVision's version. But it's
01:22:58.320 | literally just subtracting the mean and dividing by the standard deviation. We're also going
01:23:05.240 | to apply some data augmentation. We're going to use the same trick we've used before for
01:23:10.520 | images that are very small, which is we're going to add a little bit of padding and then
01:23:15.320 | randomly crop our original image size from that. So it's just like shifting it slightly
01:23:21.960 | each time. And we're also going to use our random erasing. And it's nice that we did
01:23:26.360 | it all with broadcasting, because it's going to apply equally well to a four-channel image
01:23:31.400 | as it does to a three-channel one, or one channel, which I think is what we did originally. Now, I don't think anybody, as far
01:23:39.960 | as I know, has built classifiers from latents before. So I didn't even know if this was going
01:23:43.860 | to work. So I visualized it. So we've got a tfm_x and a tfm_y. For tfm_x, you
01:23:51.720 | can optionally add augmentation, and if you do, then it applies the augmentation transforms.
01:23:58.040 | Now, this is going to be applied one image at a time, but some
01:24:01.920 | of our augmentation transforms expect a batch, so we create an extra unit axis on the front to make it a batch of one
01:24:07.480 | and then remove it again. And then tfm_y, very much like we've seen before, is going
01:24:13.680 | to turn those path names into IDs. So there's our validation and training transform datasets.
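A hedged sketch of those transforms: `xmean`/`xstd` stand for the channel statistics computed above, `rand_erase` for the random-erasing helper from earlier, and `vocab` for the list of class folder names; all of these names are assumptions rather than the notebook's exact code:

```python
import torch
import torchvision.transforms as T

norm = T.Normalize(xmean, xstd)                  # subtract the mean, divide by the std
aug = T.Compose([
    T.Pad(2, padding_mode='reflect'),            # a little padding...
    T.RandomCrop(32),                            # ...then crop back to the latent size
])

def tfm_x(x, augment=False):
    x = norm(torch.as_tensor(x))
    if augment:
        x = aug(x)
        x = rand_erase(x[None])[0]               # erasing expects a batch: add then drop a unit axis
    return x

def tfm_y(folder_name):
    return vocab.index(folder_name)              # turn the folder/path name into a class ID
```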
01:24:23.880 | So that we can look at our results, we need a denormalization. So let's create our data
01:24:31.120 | loaders and grab mini batches and show them. And so I was very pleased to see that the
01:24:38.480 | random erasing actually works extremely nicely. So you can see you get these kind of
01:24:42.560 | weird patches, but they're still recognizable. So this is
01:24:53.480 | something I very often do to answer, like, oh, is this thing I'm doing in
01:24:57.680 | computer vision reasonable? It's like, well, can my human brain recognize it? So if I couldn't
01:25:02.160 | recognize that this is a drilling platform myself, then I shouldn't expect a computer to be able
01:25:07.000 | to do it, or that this is a compass, or whatever. I'm so glad they got otters. So cute. And
01:25:13.840 | you can see the cropping it's done has also been fine. Like, it's a little bit of a fuzzy
01:25:18.120 | edge, but basically it's not destroying the image at all. They're still recognizable.
01:25:26.920 | It's also a good example here of how difficult this problem is, like the fact that this
01:25:31.280 | is 'seashore'; I would have called this 'surf', you know, but maybe that's not
01:25:35.080 | one of the categories. Yeah. Okay. This could be food, but actually it's a refrigerator. Okay.
01:25:46.840 | So our augmentation seems to be working well. So then, yeah, basically I've just copied
01:25:52.040 | and pasted, you know, our basic pieces here. And I kind of wanted to have it all in one
01:25:56.360 | place, just to remind myself of exactly what it is. So this is the preactivation version
01:26:00.760 | of convolutions. The reason for that is, if I want this to be a backbone for a diffusion
01:26:05.920 | model or a U-Net, then I remember that we found that preactivation works best for U-Nets. So
01:26:13.880 | therefore our backbone needs to be trained with preactivation. So we've got a preactivation
01:26:17.340 | conv, a res block, and a res blocks model with dropout. This is all just copied from previous notebooks.
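For reference, a pre-activation conv looks roughly like this, with normalisation and activation before the convolution rather than after; this is a simplified sketch of the block from the earlier notebooks, not its exact code:

```python
import torch.nn as nn

def pre_conv(ni, nf, ks=3, stride=1, act=nn.SiLU, norm=nn.BatchNorm2d):
    # Pre-activation ordering: norm -> activation -> conv. We found this ordering
    # works best when the trained backbone later gets reused inside a U-Net.
    layers = []
    if norm: layers.append(norm(ni))
    if act:  layers.append(act())
    layers.append(nn.Conv2d(ni, nf, kernel_size=ks, stride=stride, padding=ks // 2))
    return nn.Sequential(*layers)
```

A residual block then just wraps a couple of these with an identity (or 1x1 conv) shortcut, as in the earlier notebooks.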
01:26:28.640 | So I decided like I wanted to try to, you know, use the basic trick that we learnt about from
01:26:34.760 | simple diffusion of trying to put most of our work in the later layers. So the first
01:26:42.360 | layer just has one block, then two blocks, and then four blocks. And then I figured that
01:26:48.680 | we might then delete these final blocks; maybe they're just going to end up being for
01:26:53.560 | classification. This might end up being our pre-trained backbone, or maybe we keep them.
01:26:57.960 | I don't know. You know, as I said, this hasn't been done before. So anyway, I
01:27:02.960 | tried to design it in a way that we can mess around a little
01:27:07.460 | bit with how many of these we keep. And I also tried to use very few channels in the
01:27:13.440 | first blocks, and then jump up the channels where most of the work is going to be done, a
01:27:20.800 | jump from 128 to 512. So that's why I designed it this way. You know, I haven't even taken
01:27:28.080 | it any further than this. So I don't know if it's going to be a useful backbone or not.
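To make that design concrete, here is a hedged sketch of the stage layout: one, two and four residual blocks per stage as described, with few channels early and the stated jump from 128 to 512 where most of the work happens. The exact widths, and the `ResBlock`/`pre_conv` helpers, are assumptions standing in for the notebook's versions:

```python
import torch.nn as nn

def stage(ni, nf, n_blocks):
    # Downsample once, then stack pre-activation residual blocks at that resolution.
    blocks = [ResBlock(ni, nf, stride=2)]
    blocks += [ResBlock(nf, nf) for _ in range(n_blocks - 1)]
    return nn.Sequential(*blocks)

def latent_classifier(n_classes=1000):
    return nn.Sequential(
        pre_conv(4, 64, stride=1),            # latents come in as 4 x 32 x 32
        stage(64, 128, 1),                    # one block, few channels
        stage(128, 128, 2),                   # two blocks
        stage(128, 512, 4),                   # four blocks: the 128 -> 512 jump does the heavy work
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.BatchNorm1d(512), nn.Linear(512, n_classes),
    )
```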
01:27:31.760 | I didn't even know if this was going to be possible to classify. It seemed very likely
01:27:35.720 | it was possible, based on the fact that you can still kind of recognize
01:27:39.200 | the decoded latents; like, I could probably recognize that it's a computer, maybe. So I thought it was
01:27:44.760 | going to be possible. But yeah, this is all new. So that was the model I created. And
01:27:50.040 | then I trained it for 40 epochs. And you can see after one epoch, it was already 25% accurate.
01:28:03.120 | And that's it recognizing which one of a thousand categories is it. So I thought that was pretty
01:28:07.960 | amazing. And so after 40 epochs, I ended up at 66%, which is really quite fantastic because
01:28:16.280 | a ResNet 34 is kind of like 73% or 74% accuracy when trained for quite a lot longer. You know,
01:28:28.480 | so to me, this is extremely encouraging that, you know, this is a really pretty good ResNet
01:28:35.640 | at recognizing images from their latent representations without any decoding or whatever. So from
01:28:44.560 | here, you know, if you want to, you guys could try, yeah, building a better bedroom diffusion
01:28:54.080 | model or whatever you like; it doesn't have to be bedrooms. Actually, one of our colleagues,
01:29:00.720 | Molly (I'm just going to find it), one of our colleagues Molly actually used, do
01:29:08.640 | you guys remember, was it the celeb faces that she used? So there's a CelebA-HQ data set
01:29:21.200 | that consists of images of faces of celebrities. And so what Molly did was she basically used
01:29:29.240 | this exact notebook, but used this faces data set instead. And this one's really pretty
01:29:36.220 | good, isn't it? You know, this one's really pretty good. They certainly look like celebrities,
01:29:41.760 | that's for sure. So yeah, you could try this data set or whatever, but yeah, try it. Yeah,
01:29:48.280 | maybe try it with the pre-trained backbone, try it with ResNets on the cross connections,
01:29:55.160 | try it with all the tricks we used in SuperRes, try it with perceptual loss. Some folks we
01:30:00.200 | spoke to about the perceptual loss think it won't help with Latents because the underlying
01:30:07.480 | VAE was already trained with perceptual loss, but we should try, you know, or you guys should
01:30:11.680 | try all these things. Yeah, so be sure to check out the forum as well to see what other people
01:30:18.520 | have already tried here because it's a whole new world. But it's just an example of the
01:30:23.520 | kind of like fun research ideas I guess we can play with. Yeah, what do you guys think
01:30:28.760 | about this? Are you like surprised that we're able to quickly get this kind of accuracy
01:30:33.200 | from Latents or do you think this is a useful research path? What are your thoughts? Yeah,
01:30:39.560 | I think it's very interesting. Oh, go ahead. I was going to say the Latents are already
01:30:43.080 | like a slightly compressed, richer representation of an image, right? So it makes sense that
01:30:48.560 | that's a useful thing to train on. And 66%, I think AlexNet is like 63% or something like
01:30:54.800 | that. So, you know, we were already at state of the art, what, eight years ago, whatever.
01:31:01.000 | It might be more like 10 years ago. I know time passes quickly. Yeah, yeah, I guess next
01:31:07.240 | year. Yeah, next year it will be 10 years ago. But yeah, I'm kind of curious about the pre-training.
01:31:13.360 | The whole value for me of using a pre-trained network is that someone
01:31:16.960 | else has done lots and lots of compute on ImageNet to learn some features and I get
01:31:21.360 | to use that, so it's kind of funny to be like, oh, well, let's pre-train for ourselves
01:31:26.320 | and then try and use that. I'm curious how best you'd allocate that compute:
01:31:32.240 | whether, if you've got 10 hours of GPU, you should just do 10 hours of training versus
01:31:37.000 | like five hours of pre-training and five hours of training. I mean, based on our super res
01:31:42.320 | thing, the pre-training like was so much better. So that's why I'm feeling somewhat hopeful
01:31:49.360 | about this direction. Yeah. Yeah, I'm really curious to see how it goes.
01:31:55.360 | I guess I was going to say, yeah, I think there are just a lot of opportunities
01:31:58.760 | for doing stuff in the latents. And like, I guess, yeah,
01:32:04.600 | here you're training a classifier as a backbone, but you could think of
01:32:08.480 | training classifiers on other things for, you know, guidance or things like this. Yeah.
01:32:13.080 | Of course, we've done some experiments with that. I know Jono has his mid-U guidance
01:32:18.040 | approach for some of these sorts of things, but there are different approaches that you
01:32:21.480 | can play around with here. You know, exploring in the latent space can make it computationally
01:32:28.640 | cheaper than having to decode it every time you want to, you know, like you
01:32:32.640 | have to look at the image and then maybe apply a classifier, apply some sort of guidance
01:32:36.360 | on the image. But if you can do it directly in the latent space, a lot of interesting
01:32:40.080 | opportunities there as well. Yeah. And, you know, now we're showing that indeed.
01:32:44.120 | Yeah, yeah, style transfer on latents, everything on latents; you can also distill models. Like, that's
01:32:52.000 | something I've done to make a latent CLIP: just have it try and mirror an image-
01:32:56.520 | space CLIP. And so for classifiers as well, you could distill an ImageNet classifier:
01:33:01.440 | rather than just having the label, you try and like copy the logits. And then that's
01:33:06.480 | like an even richer signal, like you get more value per example. So then you can create
01:33:13.120 | your latent version of some existing image classifier or object detector or multi-modal
01:33:20.280 | model like clip. I feel funny about this because I'm like both excited about simple diffusion
01:33:26.200 | on the basis that it gets rid of latents, but I'm also excited about latents on the
01:33:29.440 | basis of it gets rid of most of the pixels. I don't know how I can be cheering for both,
01:33:34.920 | but somehow I am. I guess may the best method win. So, you know, the folks that are finishing
01:33:47.560 | this course, well, first of all, congratulations, because it's been a journey, particularly
01:33:53.000 | part two; it's a journey which requires a lot of patience and tenacity. You know, if
01:33:58.840 | you've zipped through by binging on the videos, that's totally fine; it's a good approach,
01:34:03.280 | but, you know, maybe go back now and do it more slowly and, you know, build it
01:34:08.960 | yourself and really experiment. But, you know, for folks who have got to the end
01:34:15.320 | of this and feel like, okay, I get it more or less. Yeah. Do you guys have any sense
01:34:21.000 | of like, what kind of things make sense to do now? You know, where would you guys go
01:34:35.520 | from here? I think there are great opportunities in implementing papers as they come along
01:34:44.760 | these days. But also at this stage, I think,
01:34:49.080 | you know, we're already discussing research ideas. And I think, you know, we're in a solid
01:34:53.800 | position to come up with our own research ideas and explore those ideas. So
01:34:59.520 | I think that's a real opportunity that we have here. I think that's best done
01:34:59.520 | often collaboratively. So I'll just, you know, mention that Fast AI has a Discord, which
01:35:06.400 | if you've got to this point, then you're probably somebody who would benefit from being
01:35:11.640 | there. And yeah, just pop your head in and say hello; there's an introductions thread
01:35:16.560 | for that. And, you know, maybe say what you're interested in or whatever,
01:35:21.760 | because it's nice to work with others, I think. I mean, both Jono and Tanishk I only
01:35:26.840 | know because of the Discord and the forums and so forth. So that would be one.
01:35:32.480 | And we also have a generative channel. So anything related to generative models,
01:35:38.800 | that's the place. So for example, Molly was posting some of her experiments in that channel.
01:35:43.760 | I think there are other Fast AI members posting their experiments. So if you're doing anything
01:35:47.640 | generative model related, that's a great way to also get feedback and thoughts from, from
01:35:53.400 | the community. Yeah.
01:35:55.880 | I'd also say that, like, if you're at the stage where you've finished this course,
01:36:00.560 | you actually understand how diffusion models work. You've got a good handle on what the
01:36:04.520 | different components in stable diffusion are. And you know how to wrangle data for
01:36:08.520 | training and all these things. You're like so far ahead of most people who are building
01:36:13.480 | in this space. And I've got lots of, lots of companies and people reaching out to me
01:36:18.440 | to say, do you know anybody who has more than just 'oh, I know how to load stable
01:36:23.040 | diffusion and make an image'; like, you know, someone who knows how to actually tinker
01:36:25.920 | with it and make it better. And if you've got those skills, don't feel like, oh,
01:36:29.400 | I'm definitely not qualified to apply. Like, there's lots of stuff where, yeah,
01:36:34.840 | just taking these ideas now and like just simple, sensible ideas that we've covered
01:36:39.200 | in the course that have come up and saying, oh, actually, maybe I could try that. Maybe
01:36:42.480 | I could play with this, you know, take this experimentalist approach. I feel like there's
01:36:46.360 | actually a lot of people who would love to have you helping them build the million and
01:36:51.200 | one little stable diffusion based apps or whatever that you're working on.
01:36:54.960 | And particularly like the thing we always talk about at Fast AI, which is particularly
01:36:58.520 | if you can combine that with your domain expertise, you know, whether it be from your, your hobbies
01:37:05.320 | or your work in some completely different field or whatever, you know, there'll be lots
01:37:11.080 | of interesting ways to combine, you know, you probably are one of the only people in
01:37:15.400 | the world right now that understand your areas of passion or of vocation as well as these
01:37:24.280 | techniques. So, and again, that's a good place to kind of get on the forum or the discord
01:37:30.680 | or whatever and start having those conversations because it can be, yeah, it can be difficult
01:37:35.600 | when you're at the cutting edge, which you now are by definition.
01:37:41.800 | All right. Well, we better go away and start figuring out how on earth GPT-4 works. I don't
01:37:50.680 | think we're going to necessarily build the whole GPT-4 from scratch, at least not at
01:37:55.320 | that scale, but I'm sure we're going to have some interesting things happening with NLP.
01:38:01.000 | And Jono, Tanishk, thank you so much. It's been a real pleasure. It was nice doing things
01:38:06.160 | with a live audience, but I've got to say, I really enjoyed this experience of
01:38:12.320 | doing stuff with you guys the last few lessons. So thank you so much.
01:38:16.280 | Yeah, thanks for having us. This is really, really fun.
01:38:19.720 | All right. Bye.
01:38:20.720 | Cool.