Lesson 20: Deep Learning Foundations to Stable Diffusion
Chapters
0:00 noisify inside a collation function
2:56 MixedPrecision callback
5:59 Getting the benefits from MixedPrecision
7:27 HuggingFace Accelerator
13:57 Sneaky trick: keep GPUs busy with MultDL
16:53 Homework and experiment ideas
20:33 Style Transfer notebook
24:19 Optimizing an image
30:07 Loss function and Learner
32:33 Viewing progress: ImageLogCB
35:04 Extracting features from a pre-trained network, VGG16
40:36 Normalizing the image
44:21 Intermediate representations, features
46:21 Hooks homework
47:20 Optimizing an image with Content Loss
56:5 Style Loss with Gram Matrix
59:21 “A Neural Algorithm of Artistic Style” paper
65:59 Optimizing to get the final result
67:42 Possible experiments and miniai
74:26 Neural Cellular Automata (NCA) notebook
79:37 Alexander Mordvintsev’s NCA simulation
81:44 Setting up a Neural Network
87:16 Getting into code
97:51 Training
102:50 Preview of what’s possible
00:00:00.000 |
Hi and welcome to lesson 20. In the last lesson we were about to learn about implementing mixed precision training. Let's dive into it. 00:00:10.000 |
And I'm going to fiddle with other things just because I want to really experiment. I just love fiddling around. 00:00:17.000 |
So one thing I wanted to do is I wanted to get rid of the DDPMCB entirely. 00:00:26.000 |
We made it pretty small here, but I wanted to remove it. So, much as Tanishq said, isn't it great that callbacks make everything so cool. 00:00:34.000 |
I wanted to show we can actually make things so cool without callbacks at all. 00:00:38.000 |
And so to do that, I realized what we could do is we could put noisify inside a collation function. 00:00:46.000 |
So the collation function, if you remember back to our datasets notebook, which was going back to notebook five and you've probably forgotten that by now. 00:00:57.000 |
So go and reread that to remind yourself. It's the function that runs to take, you know, basically each row of data. 00:01:10.000 |
It will be a separate tuple, but then it hits the collation function and the collation function turns that into tensors. 00:01:17.000 |
You know, one tensor representing the independent variable, one tensor representing the dependent variable, something like that. 00:01:24.000 |
And the default collation function is called, not surprisingly, default collate. 00:01:28.000 |
So if our collation function calls that on our batch and then grabs the X part, which has always been the same for the last few notebooks. 00:01:39.000 |
That's the image, because Hugging Face Datasets uses dictionaries. 00:01:44.000 |
Then we can call noisify on that collated batch. 00:01:52.000 |
Then that's exactly the same thing as Tanishq's before_batch did, because before_batch is operating on the thing that came out of the default collate function. 00:02:03.000 |
So if we do it here and then we create a DDPM data loader function, which just creates a data loader from some dataset that we pass in with some batch size with that collation function. 00:02:19.000 |
Then we can create our DLs not using DataLoaders.from_dd, but instead of that, the original, you know, the plain init that we created for DataLoaders. 00:02:35.000 |
And again, you should go back and remind yourself of this. 00:02:37.000 |
You just pass in the data loaders for training and test. 00:02:43.000 |
So with that, we don't need a DDPM callback anymore. 00:02:48.000 |
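For reference, a minimal sketch of what that collation approach could look like, assuming the noisify function from the DDPM notebook and that xl holds the name of the image column in the Hugging Face dataset (the names and batch size here are illustrative):

```python
from torch.utils.data import DataLoader, default_collate

def collate_ddpm(b):
    # collate the rows into tensors as usual, grab the image tensor, then apply noisify right here
    return noisify(default_collate(b)[xl])

def dl_ddpm(ds, bs=128):
    # a plain DataLoader whose batches come out already noisified
    return DataLoader(ds, batch_size=bs, collate_fn=collate_ddpm, num_workers=4)
```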
All right. So now that we've, you know... and again, this is not required for mixed precision. 00:02:52.000 |
This is just because I wanted to experiment and flex our muscles a little bit by trying things out. 00:03:03.000 |
And basically, if you Google for PyTorch mixed precision, you'll see that the docs show the typical mixed precision code, which basically says: with torch.autocast(device_type='cuda', dtype=torch.float16), get your predictions and calculate your loss. 00:03:30.000 |
So, again, remind yourself if you've forgotten that this is called a context manager, and context managers, when they start, call something called __enter__. 00:03:40.000 |
And when they finish, they call something called __exit__. 00:03:43.000 |
So we could therefore put the torch.autocast object into an attribute and call __enter__ before the batch begins. 00:03:55.000 |
And then after we've calculated the loss, we want to finish that context manager. 00:04:03.000 |
So after loss, we call the autocast object's __exit__. 00:04:10.000 |
So you'll find now in the 09 learner notebook that there's a section called "updated version since the lesson", where I've added an after_predict, an after_loss, an after_backward and an after_step. 00:04:32.000 |
And that means that a callback can now insert code at any point of the training loop. 00:04:39.000 |
And so we haven't used all of those different things here, but we certainly do want some of them. 00:04:50.000 |
And then, yeah, this is just code that has to be run according to the PyTorch docs. 00:04:59.000 |
So instead of calling loss.backward(), you have to call scaler.scale(loss).backward(). 00:05:04.000 |
So we replace our backward in the train callback with scaler.scale(loss).backward(). 00:05:11.000 |
And then it says that finally, when you do the step, you don't call optimizer.step(); you call scaler.step(optimizer) and then scaler.update(). 00:05:20.000 |
So we've replaced step with scaler.step and scaler.update. 00:05:29.000 |
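Putting those pieces together, a sketch of the kind of MixedPrecision callback being described might look like this (assuming the miniai TrainCB base class and the learn.loss / learn.opt attributes from the 09 learner notebook; treat the details as illustrative):

```python
import torch
from torch.cuda.amp import GradScaler

class MixedPrecision(TrainCB):
    def before_fit(self, learn):
        self.scaler = GradScaler()

    def before_batch(self, learn):
        # start the autocast context manager by hand
        self.autocast = torch.autocast("cuda", dtype=torch.float16)
        self.autocast.__enter__()

    def after_loss(self, learn):
        # leave autocast once the loss has been calculated
        self.autocast.__exit__(None, None, None)

    def backward(self, learn):
        # scale the loss before calling backward, as per the PyTorch AMP docs
        self.scaler.scale(learn.loss).backward()

    def step(self, learn):
        self.scaler.step(learn.opt)   # unscales the gradients and calls opt.step()
        self.scaler.update()
```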
And the nice thing is now that this exists, we don't have to think about any of that. 00:05:34.000 |
We can add mixed precision to anything, which is really nice. 00:05:41.000 |
And so now, as you'll see, our list of callbacks no longer has a DDPMCB, but it does have the MixedPrecision callback. 00:05:52.000 |
So we just need a normal learner, not a train learner. 00:05:57.000 |
Now, to get the benefit from mixed precision, you need to be doing quite a bit of work at a time. 00:06:07.000 |
And on something as small as fashion MNIST, it's not easy to keep a GPU busy. 00:06:11.000 |
So that's why I've increased the batch size by four times. 00:06:16.000 |
Now, that means that each epoch, it's going to have four times less batches because they're bigger. 00:06:24.000 |
That means it's got four times less opportunities to update. 00:06:27.000 |
And that's going to be a problem because if I want to have as good a result as Tanishq had, and as I had here, in less time, 00:06:35.000 |
that's the whole purpose of this, to do it in less time, then I'm going to need to, you know, increase the learning rate. 00:06:47.000 |
So I increased the epochs up to eight from five and I increased the learning rate up to 1e-2. 00:06:54.000 |
And yeah, I've found I could train it fine with that once I use the proper initialization and most importantly, 00:07:02.000 |
use the optimization function that has an epsilon of 1e-5. 00:07:07.000 |
And so this trains, even though it's doing more epochs, this trains about twice as fast. 00:07:26.000 |
Cool. Now, the good news is actually we don't even need to write all this because there's a nice library from Hugging Face, 00:07:37.000 |
originally created by Sylvain, who used to work with me at fast.ai and went to Hugging Face and kept on doing awesome work. 00:07:44.000 |
And he started this project called Accelerate, which he now works on with another fast.ai alum named Zach Mueller, 00:07:51.000 |
and Accelerate is a library that provides this class called Accelerator that does things to accelerate your training loops. 00:07:58.000 |
And one of the things it does is mixed precision training, and it basically handles these things for you. 00:08:11.000 |
So by adding a train_cb subclass that will allow us to use Accelerate, 00:08:17.000 |
that means we can now hopefully use TPUs and multi-GPU training and all that kind of thing. 00:08:24.000 |
So the Accelerate docs show that what you have to do to use Accelerate is to create an accelerator, 00:08:35.000 |
tell it what kind of mixed precision you want to use. So we're going to use 16-bit floating point FP16. 00:08:43.000 |
And then you have to basically call accelerator.prepare, and you pass in your model, your optimizer, 00:08:53.000 |
and your training and validation data loaders. 00:08:56.000 |
And it returns you back a model, an optimizer, and training and validation data loaders, 00:09:01.000 |
but they've been wrapped up in Accelerate. And Accelerate is going to now do all the things we saw you have to do automatically. 00:09:11.000 |
And that's why that's almost all the code we need. 00:09:14.000 |
The only other thing we need is we didn't tell it how to change our loss function to use Accelerate, 00:09:21.000 |
so we actually have to change backward. That's why we inherit from train_cb. 00:09:25.000 |
We have to change backward to not call loss.backward, but self.accelerate.backward and pass in loss. 00:09:34.000 |
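A rough sketch of that AccelerateCB idea, again assuming the miniai TrainCB and the learn.model / learn.opt / learn.dls attributes (treat the details as illustrative rather than the exact notebook code):

```python
from accelerate import Accelerator

class AccelerateCB(TrainCB):
    def __init__(self, mixed_precision="fp16"):
        super().__init__()
        self.acc = Accelerator(mixed_precision=mixed_precision)

    def before_fit(self, learn):
        # prepare() wraps the model, optimizer and dataloaders so mixed precision,
        # device placement, multi-GPU and so on are handled for us
        learn.model, learn.opt, learn.dls.train, learn.dls.valid = self.acc.prepare(
            learn.model, learn.opt, learn.dls.train, learn.dls.valid)

    def backward(self, learn):
        # the one piece we change ourselves: let Accelerate do the backward pass
        self.acc.backward(learn.loss)
```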
OK. And then I had another idea of something I wanted to do, which is I like the idea that Noisify, 00:09:41.000 |
I've copied Noisify here, but rather than returning a tuple of tuples, I just return a tuple with three things. 00:09:49.000 |
I think this is neater to me. I would like to just have three things in the tuple. 00:09:54.000 |
I don't want to have to modify my model. I don't want to have to modify my training callback. 00:10:00.000 |
I don't want to do anything tricky. I don't even want to have a custom collation function. 00:10:12.000 |
Sorry, I want to have a custom collation function, but I don't want to have a modified model. 00:10:16.000 |
So I'm going to go back to using a UNet2DModel. 00:10:20.000 |
So how can we use a UNet2DModel when we've now got three things? 00:10:25.000 |
And what I did in my modified learner, just underneath it, sorry, actually what I did is I modified train_cb 00:10:39.000 |
to add one parameter, which is number of inputs. 00:10:43.000 |
And so this tells you how many inputs are there to the model. 00:10:51.000 |
Normally you would expect one input, but our model has two inputs. 00:10:57.000 |
So here we say, okay, so accelerate_cb is a train_cb. So when we call it, we say we're going to have two inputs. 00:11:13.000 |
And so what that's going to do is it's just going to remember how many you asked for. 00:11:18.000 |
And so when you call predict, it's not going to pass learn.batch[0]; it's going to pass *learn.batch[:self.n_inp]. 00:11:30.000 |
And ditto, when you call the loss function, it's going to be the rest, so it's *learn.batch[self.n_inp:] onwards. 00:11:37.000 |
So this way you can have one, two, three, four, five inputs, one, two, three, four, five outputs, whatever you like. 00:11:43.000 |
And it's just up to you then to make sure that your model and your loss function take the number of parameters. 00:11:49.000 |
So the loss function is going to first of all take your preds and then, yeah, however many non-inputs you have. 00:12:00.000 |
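Roughly, the modified train_cb with that n_inp parameter might look like this (a sketch, assuming the miniai Callback base class; names follow the course conventions):

```python
class TrainCB(Callback):
    def __init__(self, n_inp=1): self.n_inp = n_inp   # how many leading batch items feed the model

    def predict(self, learn):
        # the first n_inp elements of the batch are the model inputs...
        learn.preds = learn.model(*learn.batch[:self.n_inp])

    def get_loss(self, learn):
        # ...and everything after them goes to the loss function, after the predictions
        learn.loss = learn.loss_func(learn.preds, *learn.batch[self.n_inp:])

    def backward(self, learn): learn.loss.backward()
    def step(self, learn): learn.opt.step()
    def zero_grad(self, learn): learn.opt.zero_grad()
```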
So that way, yeah, we now don't need to replace anything except that we did need to do the thing to make sure that we get the dot sample out. 00:12:14.000 |
So I just had a little, this is the whole DDPMCB callback now, DDPMCB2. 00:12:20.000 |
So after the predictions are done, replace them with the dot sample. 00:12:31.000 |
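That tiny callback might be no more than something like this (a sketch; the diffusers UNet2DModel returns an object whose .sample attribute holds the prediction tensor):

```python
class DDPMCB2(Callback):
    def after_predict(self, learn):
        # unwrap the diffusers output into a plain tensor for the loss function
        learn.preds = learn.preds.sample
```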
So we end up with quite a few pieces, but they're all very decoupled, you know. 00:12:36.000 |
So with mini AI, you know, and with AccelerateCB and whatever, I guess we should export these actually into a nice module. 00:12:49.000 |
Well, if you had all those exported, then, yeah, you wouldn't have to write AccelerateCB yourself. The only thing you would need would be the noisify and the collation function and this tiny callback. 00:13:01.000 |
And then, yeah, use our learner and fit and we get the same result as usual. 00:13:10.000 |
And this takes basically an identical amount of time because at this point I'm not using all the GPU or TPU or whatever. I'm just using mixed precision. 00:13:19.000 |
So this is just a shortcut for this. It's not a huge shortcut. 00:13:25.000 |
The main purpose of it really is to allow us to use other types of accelerators or multiple accelerators or whatever. 00:13:32.000 |
So we'll look at those later. Does that make sense so far? 00:13:37.000 |
Yeah. Yeah. Accelerator is really powerful and pretty amazing. 00:13:42.000 |
Yeah, it is. And I know like a lot of, yeah, like I know Katherine Crowson uses it in all her k-diffusion code, for example. 00:13:52.000 |
Yeah, it's used a lot out there in the real world. 00:13:56.000 |
I've got one more thing I just want to mention briefly. Just a sneaky trick. I haven't even bothered training anything with it because it's just a sneaky trick. 00:14:06.000 |
But sometimes thinking about speed, loading the data is the slow bit. 00:14:17.000 |
And so particularly if you use Kaggle, for example, on Kaggle, you get two GPUs, which is amazing, but you only get two CPUs, which is crazy. 00:14:26.000 |
So it's really hard to take advantage of them because the amount of time it takes to open a PNG or a JPEG, your GPU is sitting there waiting for you. 00:14:36.000 |
So if your data loading and transformation process is slow and it's difficult to keep your GPUs busy, there's a trick you can do, 00:14:50.000 |
which is you could create a new data loader class, which wraps your existing data loader and replaces __iter__. 00:15:00.000 |
Now, __iter__ is the thing that gets called when you use a for loop, right? Or when you call next(iter(...)), it calls this. 00:15:08.000 |
And when you call this, you just go through the data loader as per usual. 00:15:12.000 |
So that's what __iter__ would normally do. But then you also loop i from zero to, by default, two, and then you spit out the batch. 00:15:26.000 |
And what this is going to do is it's going to go through the data loader and spit out the batch twice. 00:15:33.000 |
Why is that interesting? Because it means every epoch is going to be twice as long, but it's going to only load and augment the data as often as one epoch, but it's going to give you two epochs worth of updates. 00:15:48.000 |
And basically, there's no reason to have a whole new batch every time. 00:15:54.000 |
You know, looking at the same batch two or three or four times at a row is totally fine. 00:16:00.000 |
And what happens in practice is you look at that batch, you do an update, get to a new part of the weight space, look at exactly the same batch and find out now where to go in the weight space. 00:16:17.000 |
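A sketch of that wrapper idea; the course calls its version MultDL, but treat the exact name and signature here as illustrative:

```python
class MultDL:
    def __init__(self, dl, mult=2):
        self.dl, self.mult = dl, mult

    def __len__(self):
        return len(self.dl) * self.mult

    def __iter__(self):
        for batch in self.dl:           # load and augment each batch only once...
            for _ in range(self.mult):  # ...but hand it to the training loop `mult` times
                yield batch
```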
So I just wanted to add this little sneaky trick here, particularly because if we start doing more stuff on Kaggle, we'll probably want to surprise all the Kagglers with how fast our mini AI solutions are. 00:16:31.000 |
And they'll be like, how is that possible? We'll be like, oh, we're, you know, using Accelerate and thinking about how we use the two GPUs, and, oh, we're, you know, flying through the data loading using MultDL. 00:16:56.000 |
Yeah, it's great to see the various different ways that we can use mini AI to do the same thing, I guess, or, you know, however you feel like doing it or whatever works best for you. 00:17:10.000 |
And I'll be curious to see if other people find other ways to, you know... I'm sure there's so many different ways to handle this problem. I think it's an interesting problem to solve. 00:17:20.000 |
And I think for the homework, it'd be useful for people to run some of their own experiments, maybe either use these techniques on other data sets or see if you can come up with other variants of these approaches or come up with some different noise schedules to try. 00:17:43.000 |
It would all be useful. Any other thoughts of exercises people could try? 00:17:51.000 |
Yeah, I mean, getting away with fewer than a thousand steps. Yeah, most of the interesting stuff is happening in the final 200 steps, so why not just train with only 200 steps? Yeah, fewer steps would be good. 00:18:03.000 |
Yeah, because the sampling is actually pretty slow. So that's a good point. 00:18:10.000 |
Yeah, yeah, I was gonna say something similar in terms of like, yeah, maybe, I guess, working with a smaller number of steps. You know, you have to adjust the noise schedule appropriately and, I guess, there's maybe a little bit more thought into some of these things. 00:18:26.000 |
Or, you know, another aspect is like, when you're selecting the time step during training, right now we select it randomly, kind of uniformly, each time step has equal probability of being selected. 00:18:41.000 |
Maybe different probabilities are better and some papers do analyze that more carefully. So that's another thing to play around with as well. 00:18:50.000 |
That's almost kind of like, I guess there are almost two ways of doing the same thing in a sense, right? If you change that mapping from t to beta, then you could reduce t and have different betas would kind of give you a similar result as changing the probabilities of the t's, I think. 00:19:14.000 |
Yeah, I think there's definitely, they're kind of similar, but potentially something complementary happening there as well. And I think those could be some interesting experiments to study that and also the sort of noise levels that you do choose affect the sort of behavior of the sampling process. 00:19:41.000 |
And of course, what features you focus on. And so maybe as people play around with that, maybe they'll start to notice how using different noise levels or different noise schedules affect maybe some of the features that you see in the final image. 00:19:58.000 |
And that could be something very interesting to study as well. 00:20:04.000 |
Right. Well, let me also say it's been really fun doing something a bit different, which is doing a lesson with you guys rather than all on my lonesome. I hope we can do this again because I've really enjoyed it. 00:20:17.000 |
So, of course, now we're, you know, strictly speaking in the recording, we'll next up see Johno, who's actually already recorded his part thanks to the Zoom mess-up, but stick around. I've already seen it; Johno's thing is amazing, so you definitely don't want to miss that. 00:20:33.000 |
Hello, everyone. So today, depending on the order that this ends up happening, you've probably seen Tanishq's DDPM implementation, where we're taking the default training callback and doing some more interesting things with preparing the data for the learner or interpreting the results. 00:20:49.000 |
So in these two notebooks that I'm going to show, we're going to be doing something similar, just exploring like what else can we do besides just the classic kind of classification model or where we have some inputs and a label. 00:21:01.000 |
And what else can we do with this mini AI setup that we have. 00:21:05.000 |
And so in the first one, we're going to approach a kind of classic AI art approach called style transfer. And so the idea here is that we're going to want to somehow create an artistic combination of two images where we have the structure and layout of one image and the style of another. 00:21:26.000 |
So we'll look at how we do that. And then we'll also talk along the way in terms of like, why is this actually useful beyond just making pretty pictures. 00:21:34.000 |
So to start with, I've got a couple of URLs for images. You're welcome to go and slot in your own as well, and I definitely recommend trying this notebook with some different ones just to see what effects you can get. 00:21:45.000 |
And we're going to download the image and load it up as a tensor. So we have here a three channel image 256 by 256 pixels. 00:21:54.000 |
And so this is the kind of base image that we're going to start working with. So before we talk about styles or anything, let's just think what is our goal here. 00:22:03.000 |
We'd like to do some sort of training or optimization. We'd like to get to a point where we can match some aspect of this image. 00:22:11.000 |
And so maybe a good place to start is to just try and do what can we start from a random image and optimize it until it matches pixel for pixel exactly. 00:22:23.000 |
I think what might be helpful is if you type "style transfer deep learning" into Google Images, you could maybe show some examples so that people will see what their goal is. 00:22:37.000 |
Yeah, that's a very good point. So let's see. This is a good one here. We've got the Mona Lisa as our base. 00:22:44.000 |
But we've managed to apply some how some different artistic styles to that same base structure. 00:22:50.000 |
So we have The Great Wave by Hokusai. We have Starry Night by Vincent van Gogh. This is some sort of Kandinsky or something. 00:22:57.000 |
Yeah. So this is our end goal to be able to take the overall structure and layout of one image and the style from some different reference image. 00:23:05.000 |
And in fact, I think the first-ever fast.ai generative modeling lesson looked at style transfer. 00:23:15.000 |
It's it's been around for a few years. It's kind of a classic technique. 00:23:19.000 |
And it's really, I think, a lot of the students when we first did it found it an extremely useful way of better understanding, like, you know, flexing their deep learning muscles, understanding what's going on. 00:23:34.000 |
And also created some really interesting new approaches. So hopefully we'll see the same thing again. 00:23:40.000 |
Maybe some students will be able to show some really interesting results from this. 00:23:47.000 |
Yeah. And I mean, today we're going to focus on kind of the classic approach. 00:23:51.000 |
But I know one of the previous students from fast AI did a whole different way of doing that style loss that we'll maybe post in the forums or, you know, I've got some comparisons that we can look at. 00:24:02.000 |
So, yeah, definitely a fruitful field still. And I think after the initial hype of like everyone was excited about style transfer apps and things five years ago. 00:24:11.000 |
I feel like there's still some things to explore there. 00:24:14.000 |
A very creative and fun little diversion in the deep learning world. 00:24:20.000 |
OK, so our first step in getting to that point is being able to optimize an image. 00:24:25.000 |
And so up until now, we've been optimizing like the weights of a neural network. 00:24:29.000 |
But now we want to go to something a bit simpler, and we just want to optimize the raw pixels of an image. 00:24:33.000 |
Do you mind if you scroll up a bit to the previous code just so we can have a look at it? 00:24:37.000 |
So there's a couple of interesting points about this code here. You know, we're not cheating. 00:24:43.000 |
Well, not really. So we're so yeah, we've seen how to download things in the network before. 00:24:50.000 |
So we're using fastcore's urlread because we're allowed to. 00:24:53.000 |
And then I think we decided we weren't going to write our own JPEG parser. 00:24:57.000 |
So TorchVision actually has a pretty good one, which a lot of people don't realize exists. 00:25:03.000 |
And a lot of people tend to use PIL, but actually TorchVision has a more performant option. 00:25:11.000 |
And it's actually quite difficult to find any examples of how to use it like this. 00:25:22.000 |
Yeah, actually, if you Google "load image from a URL in PyTorch", all of the examples are going to use PIL. 00:25:27.000 |
And that's what I've done historically is use the requests library to download the URL and then feed that into PIL's image.open function. 00:25:37.000 |
So yeah, that was fun when I was working with Jeremy on this notebook. 00:25:40.000 |
Like that's how I was doing it, and that was me breaking the rules. 00:25:43.000 |
And let's see if we can do this directly into a tensor without this intermediate step of loading it with pillow. 00:25:51.000 |
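For the curious, a sketch of loading an image from a URL straight into a tensor along these lines (the URL is a placeholder, and fastcore plus torchvision are assumed to be installed):

```python
import torch
import torchvision.io as tvio
from fastcore.net import urlread

url = "https://example.com/some_image.jpg"             # placeholder URL
raw = urlread(url, decode=False)                        # raw JPEG bytes
t = torch.tensor(bytearray(raw), dtype=torch.uint8)     # bytes -> 1-D uint8 tensor
img = tvio.decode_image(t).float() / 255.               # C x H x W tensor with values in 0..1
```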
Cool. Okay. So how are we going to do this image optimization? 00:25:56.000 |
Well, first thing is we don't really have a data set of lots of training examples. 00:26:00.000 |
We just have a single target and a single thing we're optimizing. 00:26:04.000 |
And so we built this length dataset here, which is just going to follow the PyTorch dataset standard. 00:26:09.000 |
We're going to tell it how to get a particular item and what our length is. 00:26:14.000 |
But in this case, we're just always going to return zero zero. 00:26:18.000 |
We're not actually going to care about the results from this data set. 00:26:21.000 |
We just want something that we can pass to the learner to do some number of training iterations. 00:26:25.000 |
So we create like a quick dummy dataset with a hundred items and then we create our data loaders from that. 00:26:32.000 |
And that's going to give us a way to train for some number of steps without really caring about what this data is. 00:26:41.000 |
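A minimal sketch of that dummy dataset idea; DataLoaders here is assumed to be the miniai wrapper holding a train and a valid loader, and the names are illustrative:

```python
from torch.utils.data import Dataset, DataLoader

class LengthDataset(Dataset):
    def __init__(self, length=100): self.length = length
    def __len__(self): return self.length
    def __getitem__(self, idx): return 0, 0   # the contents never matter, only the length does

def get_dummy_dls(length=100):
    # one "epoch" of the training loader is just `length` optimization steps
    return DataLoaders(DataLoader(LengthDataset(length), batch_size=1),
                       DataLoader(LengthDataset(1), batch_size=1))
```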
Yeah. So just to clarify the reason we're doing this. 00:26:44.000 |
So basically, the idea is we're going to start with that photo you downloaded and I guess you're going to be downloading another photo. 00:26:54.000 |
So that photo is going to be like the content. 00:26:56.000 |
We're going to try to make it continue to look like that lady. 00:26:59.000 |
And then we're going to try to change the style so that the style looks like the style of some other picture. 00:27:04.000 |
And the way we're going to be doing that is by doing an optimization loop with like SGD or whatever. 00:27:13.000 |
So the idea is that each step of that, we're going to be moving the style somehow of the image closer and closer to one of those images you downloaded. 00:27:22.000 |
So it's not that we're going to be looping through lots of different images; we're just going to be looping through steps of an optimization loop. 00:27:34.000 |
Exactly. And so, yeah, we can create this data loader. 00:27:39.000 |
And then in terms of the actual model that we're optimizing and passing to the learner, we created this tensor model class, which just has whatever tensor we pass in as its parameter. 00:27:50.000 |
So there's no actual neural network necessarily. 00:27:52.000 |
We're just going to pass in a random image or some image-shaped thing, a set of numbers that we can then optimize. 00:27:59.000 |
So just in case people have forgotten that, so to remind people, when you put something in an nn.Parameter, it doesn't change it in any way. 00:28:06.000 |
It's just a normal tensor, but it's stored inside the module as being a tensor to optimize. 00:28:14.000 |
So what you're doing here, Jono, I guess, is to say I'm not actually optimizing a model at all. 00:28:20.000 |
I'm optimizing an image, the pixels of an image directly. 00:28:26.000 |
Exactly. And because it's in a parameter, if we look at our model, we can see that, for example, model.t, it does require grad, right? 00:28:40.000 |
Because that's already set up, because this nn.Module is going to look for any parameters. 00:28:44.000 |
And if our optimizer is looking at, let's look at the shape of the parameters. 00:28:54.000 |
So this is the shape of the parameters that we're optimizing. 00:28:57.000 |
This is just that tensor that we passed in, the same shape as our image. 00:29:01.000 |
And this is what's going to be optimized if we pass this into any sort of learner fit method. 00:29:06.000 |
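A sketch of the tensor-as-model idea just described; the notebook calls something like this TensorModel, and content_im is assumed to be the target image tensor:

```python
import torch
from torch import nn

class TensorModel(nn.Module):
    def __init__(self, t):
        super().__init__()
        # wrapping in nn.Parameter registers the raw pixels as the thing to optimize
        self.t = nn.Parameter(t.clone())

    def forward(self, x=0):
        # the learner passes a batch in, but we ignore it and just return the pixels
        return self.t

# model = TensorModel(torch.rand_like(content_im))   # start from random noise of the same shape
```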
OK, so this model does have a thing being passed to forward, which is x, which we're ignoring. 00:29:12.000 |
And I guess that's just because our learner passes something in. 00:29:15.000 |
So we're making life a bit easier for ourselves by making the model look the way our learner expects. 00:29:21.000 |
Yeah. And we could do that using like train_cb or something if we wanted to. 00:29:25.000 |
But this seems like a nice, nice, easy way to do it. 00:29:30.000 |
Yeah, so I mean, this is the way I've done it. 00:29:32.000 |
If you do want to use train_cb, you can set it up with a custom predict method that is just going to call the model forward method with no parameters. 00:29:43.000 |
And if you want, likewise, just calling the loss function on just the predictions. 00:29:48.000 |
But if you want to skip this because we take this argument x equals zero and never use it, that should also work without this callback. 00:29:57.000 |
This is a nice approach if you have something that you're using an existing model, which expects some number of parameters or something. 00:30:03.000 |
Yeah, you can just modify that training callback, but we almost don't need to in this case. 00:30:14.000 |
Oh, just to clarify, I get it. So the get_loss you had to change because normally we pass a target to the loss function. 00:30:20.000 |
Yes, it's learn.preds and then learn.batch. 00:30:23.000 |
And again, we could avoid, we could remove that as well if we wanted to by having our loss function take a target that we then ignore. 00:30:36.000 |
So both other approaches, I like this because we're going to kind of be building on this idea of modifying the training callback in the DDPM example and the other examples. 00:30:46.000 |
But in this case, it's just these two lines change. 00:30:49.000 |
We just call the forward method, which returns this image that we're optimizing. 00:30:54.000 |
And we're going to evaluate this according to some loss function that just takes in an image. 00:31:00.000 |
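The modified training callback being described might look roughly like this (a sketch, assuming the miniai TrainCB):

```python
class ImageOptCB(TrainCB):
    def predict(self, learn):
        # no inputs needed: the "model" just returns the current pixels
        learn.preds = learn.model()

    def get_loss(self, learn):
        # the loss function only needs the predicted image, not a target from the batch
        learn.loss = learn.loss_func(learn.preds)
```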
So for our first loss function, we're just going to use the mean squared error between the image that we are generating, like this output of our model and that content image that's our target. 00:31:12.000 |
Right. So we're going to set up our model, start it out with a random image like this above. 00:31:16.000 |
We're going to create a learner with a dummy data loader for 100 steps. 00:31:22.000 |
Our loss function is going to be this mean squared error loss function, set a learning rate and an optimizer function. 00:31:32.000 |
And if we run this, something's going to happen. 00:31:34.000 |
Our loss is going to go from a non-zero number to close to zero. 00:31:39.000 |
And we can look at the final result, like if we call learn.model and show that as an image versus the actual image, we'll see that they look pretty much identical. 00:31:48.000 |
So just to clarify, this is like a pointless example, but what we did, we started with that noisy image she showed above. 00:31:55.000 |
And then we used SGD to make those pixels get closer and closer to the lady in the sunglasses. 00:32:03.000 |
Not for any particular purpose, but just to show that we can turn noisy pixels into something else by having it follow a loss function. 00:32:12.000 |
Make the pixels look as much as possible like that lady in the sunglasses. 00:32:18.000 |
Exactly. And so in this case, it's a very simple loss. 00:32:20.000 |
There's like a one direction that you update, so it's almost trivial to solve, but it still helps us get the framework in place. 00:32:27.000 |
But just seeing this final result is not very instructive because you almost think, well, did I get a bug in my code? 00:32:33.000 |
How do I know this is actually doing what we expect? 00:32:36.000 |
And so before we even move on to any more complicated loss functions, I thought it was important to have some sort of more obvious way of seeing progress. 00:32:45.000 |
So I've created a little logging callback here that is just after every batch, it's going to store the output as an image. 00:32:57.000 |
I guess after every 10 batches here by default. 00:32:59.000 |
Oh, yes. Yeah. Sorry. So we can set how often it's going to update, and then every 10 iterations or 50 iterations, whatever we set the log_every argument to, it's going to store that in a list. 00:33:10.000 |
And then after the training is done, after that, we're just going to show those images. 00:33:15.000 |
And so everything else the same as before, but passing in this extra logging callback, it's going to give us the kind of progress. 00:33:23.000 |
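A sketch of that logging callback; to_cpu and show_images are assumed to be the miniai/fastcore helpers, Callback is the miniai base class, and the default interval is illustrative:

```python
class ImageLogCB(Callback):
    def __init__(self, log_every=10):
        self.log_every, self.images, self.i = log_every, [], 0

    def after_batch(self, learn):
        if self.i % self.log_every == 0:
            # snapshot the current state of the image being optimized
            self.images.append(to_cpu(learn.preds.clip(0, 1)))
        self.i += 1

    def after_fit(self, learn):
        show_images(self.images)   # show the whole progression once training finishes
```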
And so now you can see, OK, there is actually something happening. 00:33:26.000 |
We're starting from this noise after a few iterations. 00:33:31.000 |
And by the end of this process, it looks exactly like the content image. 00:33:35.000 |
So I really like this because what you've basically done here is you've now already got all the tooling and infrastructure in place you need to basically create a really wide variety of interesting outputs that could either be artistic 00:33:54.000 |
or like, you know, there could be more like image reconstruction, super resolution, colorization, whatever. 00:34:05.000 |
And you just have to modify the loss function. 00:34:09.000 |
And I really like the way you've created the absolute easiest possible version first and fully checked it. 00:34:18.000 |
And before you start doing the fancy stuff and now you kind of, I guess, feel really comfortable doing the fancy stuff because you know that's all in place. 00:34:30.000 |
And we know that we're going to see some tracking. 00:34:32.000 |
Hopefully it'll be visually obvious if things are going wrong and we know exactly what we need to modify. 00:34:36.000 |
If we can now express some desired property that's more interesting than just like mean squared error to a target image, then we could have everything in place to optimize. 00:34:45.000 |
And so this is now really fun to like, OK, let's think about what other loss functions we could do. 00:34:50.000 |
Maybe we wanted to match an image, but also have a particular overall color. 00:34:54.000 |
Maybe we want some more complicated thing. 00:34:57.000 |
And so towards that, towards starting to get a richer, like, measure of what this output image looks like, we're going to talk about extracting features from a pre-trained network. 00:35:07.000 |
And this is kind of like the core idea of this notebook is that we have these big convolutional neural networks. 00:35:17.000 |
And it's relatively simple compared to some of the big, you know, dense nets and so on used today. 00:35:21.000 |
It's actually a model like our pre-ResNet Fashion-MNIST model. 00:35:33.000 |
And so we're feeding in an image and then we have these like convolutional layers, downsampling, convolution, you know, downsampling with max pooling up until some final prediction. 00:35:45.000 |
There's one big difference here, which is that 7 by 7 by 512, if you can point at that. 00:35:51.000 |
Normally nowadays, and in our models, we tried, you know, using an adaptive or global pooling to get down to a 1 by 1 by 512. 00:36:01.000 |
VGG16 does something which is very unusual by today's standards, which is it just flattens that out into a 1 by 1 by 4096. 00:36:10.000 |
Which actually might be a really interesting feature of VGG. 00:36:16.000 |
And I've always felt like people might want to consider training, you know, ResNets and stuff without the global pooling and instead do the flattening. 00:36:25.000 |
The reason we don't do the flattening nowadays is that that very last linear layer that goes from 1 by 1 by 4096 to 1 by 1 by 1000, because this is an ImageNet model, 00:36:35.000 |
is going to need an awfully big weight matrix. 00:36:40.000 |
You've got a 4096 by 1000 weight matrix, as a result of which this is actually horrifically memory-intensive for a reasonably poor-performing model by modern standards. 00:36:51.000 |
But yeah, I think that doing that actually also has some some benefits potentially as well. 00:36:58.000 |
Yeah. And in this case, we are not even really interested in the classification side. 00:37:04.000 |
We're more excited about the capacity of this to extract different features. 00:37:10.000 |
And so the idea here and maybe I should pull up this classic article looking at like what do neural networks learn and trying to visualize some of these features. 00:37:25.000 |
And this is something we've mentioned before with these big pre-trained networks is that the early layers tend to pick up on very simple features, edges and shapes and textures and those get mixed together into more complicated textures. 00:37:37.000 |
And by the way, this is just trying to visualize what kind of input maximally activates a particular, like, output on each of these layers. 00:37:45.000 |
And so it's a great way to see like what kinds of things that's learning. And so you can see as we move deeper and deeper into the into the network, we're getting more and more complicated, like hierarchical features. 00:37:56.000 |
And so we've looked at the Zeiler and Fergus paper before that which is an earlier version doing something like this to see what kind of features were available. 00:38:05.000 |
So we'll link to this distill paper from the forum and the course lesson page because it's actually a more modern and fancy version kind of the same thing. 00:38:19.000 |
Yeah. Also note the names here. All of these people are worth following. Chris Olah does amazing work on interpretability, and Alexander Mordvintsev we'll see in the second notebook that I look at today, doing all sorts of other cool stuff as well. 00:38:32.000 |
And anyway, so we want to think about like let's extract the outputs of these layers in the hope that they give us a representation of our image that's richer than just the raw pixels. 00:38:43.000 |
So the idea being there that if we were able to change our image to have the same features at those various layers that you were just showing us, then it would, like, have similar textures or similar kind of higher-level concepts or whatever. 00:39:05.000 |
Exactly. So if you think of this like 14 by 14 feature map over here, maybe it's capturing that there's an eye in the top left and some hair on the top right, these kind of abstract things. 00:39:16.000 |
And if you change the brightness of the image, it's unlikely that it's going to change what features are stored there because the networks learned to be somewhat invariant to these like rough transformations, a bit of noise, a bit of changing texture early on. 00:39:29.000 |
It's not going to affect the fact that it still thinks this looks like a dog and a few layers before that, that it still thinks that part looks like a nose and that part looks like an ear. 00:39:37.000 |
Maybe the more interesting bits then for what you're doing are those earlier layers where it's going to be like there's a whole bunch of kind of diagonal lines here or there's a kind of a loopy bit here because then if you replicate those, you're going to get similar textures without changing the semantics. 00:39:55.000 |
Exactly. Yeah. So, I mean, I guess let's load the model and look at what the layers are. And then in the next section, we can try and like see what kinds of images work when we optimize towards different layers in there. 00:40:07.000 |
And so this is the network we have: convolutions, ReLU, max pooling, all of this we should be familiar with by now. 00:40:15.000 |
And it's all just in one big nn.Sequential. This doesn't have the head, so we said .features. If you did this without that, you'd have the head too; this is like the features sub-network. 00:40:27.000 |
That's everything up until some point. And then you have the flattening and the classification, which we are kind of just throwing away. 00:40:33.000 |
So this is the body of the network. And we're going to try and tag into various layers here and extract the outputs. 00:40:40.000 |
But before we do that, there's one more bit of admin we need to handle. And this was trained on a normalized version of ImageNet, right, where you took the dataset mean and the dataset standard deviation and used those to normalize your images. 00:40:52.000 |
So if we want to match what the data looked like during training, we need to match that normalization step. And we've done this on grayscale images where we just subtract the mean divide by the standard deviation. 00:41:02.000 |
But with three channel images, these RGB images, we can't get away with just saying, let's subtract our mean from our image and divide by the standard deviation. 00:41:11.000 |
You're going to get an error that's going to pop up. And this is because we now need to think about broadcasting and these shapes a little bit more carefully than we can with just a scalar value. 00:41:21.000 |
So if we look at the mean here, we just have three values, right, one for each channel, the red, green and blue channels, whereas our content image has three channels and then 256 by 256 for the spatial dimensions. 00:41:35.000 |
So if we try and say content image divided by the mean or minus the mean, it's going to go from right to left and find the first non-unit axis. So anything with a size greater than one. 00:41:51.000 |
And it's going to try and line those up. And in this case, the three and the 256, those aren't going to match. And so we're going to get an error. 00:41:57.000 |
More perniciously, if the shape did happen to match, that might still not be what you intended. 00:42:02.000 |
So what we'd like is to have these three channels map to the three channels of our image and then somehow expand those values out across the two other dimensions. 00:42:12.000 |
And the way we do that is we just add two additional dimensions on the right for our image.net.mean. 00:42:18.000 |
And you could also do unsqueeze(-1), unsqueeze(-1). 00:42:22.000 |
But this is the kind of syntax that we're using in this course. 00:42:26.000 |
And now our shapes are going to match because we're going to go from right to left if it's a unit dimension size one, we're going to expand it out to match the other tensor. 00:42:36.000 |
And if it's a non-unit dimension, then the shapes have to match. And that looks like it's the case. 00:42:40.000 |
And so now with this reshaping operation, we can write a little normalized function, which we can then apply to our content image. 00:42:48.000 |
And I'm just checking the min and the max to make sure that this roughly makes sense. 00:42:53.000 |
And we could check the mean as well to make sure that the mean is somewhat close to zero. 00:43:02.000 |
OK, in this case, less maybe because it's a darker image than average, but at least we are doing the operation. 00:43:08.000 |
It seems like the math is correct. And now, showing the channel-wise mean would be interesting. 00:43:14.000 |
Oh, yes. So that would be the mean over the dimensions. 00:43:20.000 |
One and two, I think. I think you have to type it as a tuple, (1, 2). 00:43:29.000 |
I wasn't sure which way around. Yeah, I always forget too. 00:43:35.000 |
OK, so our blue channel is brighter than the others. And if you go back and look at our image, maybe you'd believe that the image is going to be blue and red and the face is going to be just blue. 00:43:49.000 |
Yeah, OK, so that seems to be working. We can double check because now that we've implemented ourselves, 00:43:55.000 |
torchvision.transforms has a Normalize transform that you can pass the mean and standard deviation to. 00:43:59.000 |
And it's going to handle making sure that the devices match, that the shapes match, et cetera. 00:44:04.000 |
And you can see if we check them in a max, it's exactly the same. 00:44:07.000 |
Just a little bit of reassurance that our function is doing the same thing as this normalized transform. 00:44:13.000 |
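A sketch of that normalization step, using the standard ImageNet statistics; content_im is assumed to be a 3 x H x W tensor with values between 0 and 1:

```python
import torch

imagenet_mean = torch.tensor([0.485, 0.456, 0.406])
imagenet_std  = torch.tensor([0.229, 0.224, 0.225])

def normalize(im):
    # reshape (3,) -> (3, 1, 1) so broadcasting lines the channels up with the channels
    return (im - imagenet_mean[:, None, None]) / imagenet_std[:, None, None]

# the equivalent check with torchvision:
# from torchvision import transforms
# normed = transforms.Normalize(imagenet_mean, imagenet_std)(content_im)
```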
I appreciate you not cheating by implementing that, Jono. Thank you. 00:44:18.000 |
You're welcome. Got to follow the rules. Got to follow that. OK. 00:44:21.000 |
So with that bit of admin out the way, we can finally say, how do we extract the features from this network? 00:44:27.000 |
Now, if you remember the previous lesson on hooks, that might be something that springs to mind. 00:44:33.000 |
I'm going to leave that as an exercise for the reader. 00:44:36.000 |
And what we're going to do is we're just going to normalize our input and then we're going to run through the layers one by one in this sequential stack. 00:44:45.000 |
We're going to pass our X through that layer. 00:44:48.000 |
And then if we're in one of the target layers, which we can specify, we're going to store the outputs of that layer. 00:44:54.000 |
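A sketch of the feature-extraction loop being described; vgg16 is assumed to be the pretrained feature stack (for example torchvision.models.vgg16 with pretrained weights, taking its .features), normalize is the function from earlier, and the names are illustrative:

```python
def calc_features(im, target_layers=(18, 25)):
    x = normalize(im)                          # match the ImageNet training statistics
    feats = []
    for i, layer in enumerate(vgg16):
        x = layer(x)                           # run through the sequential stack one layer at a time
        if i in target_layers:
            feats.append(x.clone())            # keep the activations ("features") of the layers we care about
        if i >= max(target_layers): break      # no need to go deeper than the last target layer
    return feats
```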
And I can't remember if I've used the term features before or not. 00:44:58.000 |
So apologies if I have, but just to clarify here, when we say features, we just mean the activations of a layer. 00:45:06.000 |
And in this case, Jono has picked out two particular layers, 18 and 25. 00:45:12.000 |
I mean, I'm not sure it matters in this particular case, but there's a bit of a gotcha you've got here, Jono, 00:45:17.000 |
which is you should change that default 18, 25 from a list to a tuple. 00:45:24.000 |
And the reason for that is that when you use a mutable type like a list in a Python default parameter, 00:45:34.000 |
it does this really weird thing where it actually keeps it around. 00:45:38.000 |
And if you change it at all later, then it actually kind of modifies your function. 00:45:42.000 |
So I would suggest, yeah, never using a list as a default parameter because at some point it will create the weirdest bug you've ever had. 00:45:53.000 |
Yeah, that sounds like something that was hard-won. 00:45:58.000 |
And by the time you see this notebook, that change should be there. 00:46:01.000 |
All right. So this is one way to do it, just manually running through the layers one by one up until whatever the latest layer we're interested in. 00:46:07.000 |
But you could do this just as easily by adding hooks to the specific layers 00:46:11.000 |
and then just feeding your data through the whole network at once and relying on the hooks to store those intermediates. 00:46:17.000 |
Yeah. So let's make that homework, actually, not just an exercise you can do. 00:46:21.000 |
But yeah, I want let's make sure everybody does that. 00:46:24.000 |
Use one of the hooks callbacks we had or the hooks context managers we had, 00:46:30.000 |
or you can use the register forward hook PyTorch directly. 00:46:36.000 |
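For that homework, one possible shape of a solution using register_forward_hook directly (a sketch with illustrative names; vgg16, normalize and content_im as above):

```python
feats = {}

def save_hook(idx):
    def hook(module, inputs, output):
        feats[idx] = output.detach().clone()   # stash this layer's activations
    return hook

handles = [vgg16[i].register_forward_hook(save_hook(i)) for i in (18, 25)]
_ = vgg16(normalize(content_im)[None])          # one forward pass fills `feats`
for h in handles: h.remove()                    # always clean up hooks afterwards
```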
Yeah. And so what we get out here, we're feeding in an image that's 256 by 256. 00:46:42.000 |
And the first layer that we're looking at is this one here. 00:46:50.000 |
These shapes are just different because it's a different starting size, so we end up with 32 by 32 by 512. 00:46:56.000 |
And so those are the features that we're talking about for that layer 18. 00:47:00.000 |
It's this thing of shape 512 by 32 by 32. For every kind of spatial location in that 32 by 32 grid, 00:47:07.000 |
we have the output from 512 different filters. 00:47:11.000 |
And so those are going to be the features that we're talking about. 00:47:14.000 |
So the features being the channels of a single convolutional layer? 00:47:24.000 |
Well, like I said, we're hoping that we can capture different things at different layers. 00:47:29.000 |
And so to kind of first get a feel for this, like what if we just compared these feature maps, 00:47:35.000 |
we can institute what I'm calling a content loss, or you might see it as a perceptual loss. 00:47:41.000 |
And we're going to focus on a couple of later layers. 00:47:44.000 |
Again, make sure that this is as I've learned. 00:47:49.000 |
And what we're going to do is we're going to pass in a target image, in this case our content image. 00:47:54.000 |
And we're going to calculate those features in those target layers. 00:48:00.000 |
And then in the forward method when we're comparing to our inputs, 00:48:10.000 |
we're going to calculate the features of our inputs. 00:48:13.000 |
And we're going to do the mean squared error between those and our target features. 00:48:18.000 |
So maybe that's a bad way of explaining it. But, okay, let me maybe read it back to you to make sure I understand. 00:48:26.000 |
Okay. So this is a loss function you've created. 00:48:29.000 |
It has a dunder call (__call__) method, which means you can pretend that it's a function. 00:48:38.000 |
Your forward, so yeah, an nn.Module would call it forward, but in normal Python, we just use __call__. 00:48:46.000 |
It's taking one input, which is the way you set up your image training callback earlier. 00:48:55.000 |
It's just going to pass in the input, which is this is the image as it's been optimized to so far. 00:49:00.000 |
So initially it's going to be that random noise. 00:49:03.000 |
And then the loss you're calculating is the mean squared error of how far away is this input image from the target image, 00:49:17.000 |
the mean squared error for each of the layers by default 18 and 25. 00:49:26.000 |
And so you're literally actually it's a bit weird. You're actually calling a different neural network. 00:49:33.000 |
Calculating the features actually calls the neural network, but not because that's the model we're optimizing, 00:49:38.000 |
but because the loss function is actually how far away are we. 00:49:48.000 |
And so if we with SGD optimize that loss function, you're not going to get the same pixels. 00:49:56.000 |
You're going to get I don't even know what this is going to look like. 00:49:59.000 |
You're going to get some pixels which have the same activations of those features. 00:50:07.000 |
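A sketch of the content (perceptual) loss class being discussed, built on the calc_features-style function from earlier; the layer choice and names are illustrative:

```python
import torch
import torch.nn.functional as F

class ContentLossToTarget:
    def __init__(self, target_im, target_layers=(18, 25)):
        self.target_layers = target_layers
        with torch.no_grad():
            # pre-compute the features of the content image once
            self.target_features = calc_features(target_im, target_layers)

    def __call__(self, input_im):
        # compare the candidate image's features to the stored target features
        feats = calc_features(input_im, self.target_layers)
        return sum(F.mse_loss(f, t) for f, t in zip(feats, self.target_features))
```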
Yeah. And so if we run that, we see you can see the sort of shape of our person there, 00:50:13.000 |
but it definitely doesn't match on like a color and style basis. 00:50:20.000 |
So 18 and 25, remind us how deep they are in the scheme of things? 00:50:29.000 |
OK. So I guess color often doesn't have much of a semantic kind of property. 00:50:35.000 |
So that's probably why it doesn't care much about color because it's still going to be an eyeball, 00:50:43.000 |
Yeah. There's something else I should mention, which is we aren't constraining our tensor 00:50:50.000 |
that we're optimizing to be in the same bounds as a normal image. 00:50:54.000 |
And so some of these will also be less than zero or greater than one, 00:50:58.000 |
as kind of like almost hacking the neural network to get the same features at those deep layers 00:51:03.000 |
by passing in something that it's never seen during training. 00:51:06.000 |
And so for display, we're clipping it to the same bounds as an image, 00:51:09.000 |
but you might want to have either some sort of sigmoid function 00:51:12.000 |
or some other way that you'd clamp your tensor model 00:51:16.000 |
and to have outputs that are within the allowed range for an image. 00:51:20.000 |
Oh, good point. Also, it's interesting to note the background hasn't changed much. 00:51:25.000 |
And I guess the reason for that would be that the VGG model you were using in the loss function was trained on ImageNet, 00:51:31.000 |
and ImageNet is specifically about recognizing, generally, a single big object, 00:51:40.000 |
So it's not going to care about the background, 00:51:43.000 |
and the background probably isn't going to have much in the way of features at all, 00:51:47.000 |
which is why it hasn't really changed the background. 00:51:52.000 |
Yeah, exactly. And so, I mean, this is kind of interesting to see how little it looks like the image, 00:51:57.000 |
while at the same time still being like, if you squint, you can recognize it. 00:52:01.000 |
But we can also try passing in earlier layers 00:52:08.000 |
and see that we get a completely different result 00:52:10.000 |
because now we're optimizing to some image that is a lot closer to the original. 00:52:19.000 |
And so there's a few things that I thought were worth noting, 00:52:23.000 |
One is that we're looking into these ReLU layers, 00:52:27.000 |
which might mean, for example, that if you're looking at the very early layers, 00:52:31.000 |
you're missing out on some kinds of features. 00:52:33.000 |
That was one of my guesses as to why this didn't have as dark darks as the input image. 00:52:39.000 |
And then also we still have this thing where we might be going out of bounds 00:52:45.000 |
So yeah, you can see how by looking at really deep layers, 00:52:48.000 |
we really don't care about the color or texture at all. 00:52:50.000 |
We're just getting some glasses-y bits and nose-y bits there. 00:52:56.000 |
we have much more rigid adherence to the lower-level features as well. 00:53:02.000 |
It gives you a very tunable way to compare two images. 00:53:05.000 |
You can say, do I care that they match exactly on pixels? 00:53:09.000 |
But do I care quite a lot about the exact match? 00:53:14.000 |
But do I only care about the overall semantics? 00:53:17.000 |
In that case, I can go to some deeper layers. 00:53:21.000 |
If I remember correctly, this is also something like the kind of technique 00:53:25.000 |
that Zeiler and Fergus and the Distill.pub papers used to just identify 00:53:30.000 |
what do filters look at, which is you can optimize an image to maximally activate a particular filter. 00:53:37.000 |
For example, that would be a similar loss function to the one you've built here. 00:53:40.000 |
And that would show you what they're looking at. 00:53:45.000 |
Yeah, and that would be a really fun little project, actually. 00:53:47.000 |
So do it where you calculate these feature maps 00:53:55.000 |
and optimize the image to maximize that activation. 00:54:00.000 |
By default, you might get quite a noisy, weird result, 00:54:05.000 |
And so what these feature visualization people do is add augmentations and regularization, 00:54:10.000 |
so that you're optimizing an image that even under some augmentations still strongly activates that feature. 00:54:16.000 |
But yeah, that might be a good one to play with. 00:54:19.000 |
Cool, OK. So we have a lot of our infrastructure in place. 00:54:24.000 |
We know how to extract features from this neural network. 00:54:28.000 |
at these different types of feature, how similar two images are. 00:54:33.000 |
The final piece that we need for our full style transfer artistic application 00:54:38.000 |
is to say I'd like to keep the structure of this image, 00:54:41.000 |
but I'd like to have the style come from a different image. 00:54:46.000 |
Can't we just look at the early layers, like you've shown us? 00:54:49.000 |
But there's a problem, which is that these feature maps, by default, 00:54:52.000 |
we feed in our image and we get these feature maps. 00:54:58.000 |
We said we had a 32 by 32 by 512 feature map out. 00:55:03.000 |
And each of those locations in that 32 by 32 grid 00:55:07.000 |
are going to correspond to some part of the input image. 00:55:10.000 |
And so if we just said let's do mean squared error 00:55:15.000 |
what we'd be saying is I want the same types of feature, 00:55:23.000 |
And so we can't just get like Van Gogh brush strokes. 00:55:26.000 |
We're going to try and have the same colors in the same place 00:55:31.000 |
And so we're going to get something that just matches our image. 00:55:34.000 |
What we'd like is something that has the same colors and textures, 00:55:37.000 |
but they might be in different parts of the image. 00:55:39.000 |
So we want to get rid of this spatial aspect. 00:55:43.000 |
Just to clarify, when we're saying to it, for example, 00:55:46.000 |
give it to us in the style of Van Gogh's Starry Night, 00:56:07.000 |
And so the solution that Zeiler and Fergus proposed... 00:56:13.000 |
So what we want is some measure of what kinds of styles are present 00:56:19.000 |
And so there's always trouble trying to represent 00:56:22.000 |
more than two-dimensional things on a 2D grid. 00:56:25.000 |
But what I've done here is I've made our feature map, 00:56:28.000 |
where we have our height and our width that might be 32 by 32 00:56:32.000 |
but instead of having those be like a third dimension, 00:56:35.000 |
I've just represented those features as these little colored dots. 00:56:38.000 |
And so what we're going to do with the Gram Matrix 00:56:41.000 |
is we're going to flatten out our spatial dimension. 00:56:49.000 |
so that like the spatial location on one axis 00:56:54.000 |
So each of these rows is like this is the location here. 00:57:05.000 |
So we've kind of flattened out this feature map into a 2D thing. 00:57:09.000 |
And then instead of caring about the spatial dimension at all, 00:57:13.000 |
all we care about is which features do we have in general, 00:57:16.000 |
which types of features, and do they occur with each other. 00:57:20.000 |
And so we're going to get effectively the dot products 00:57:27.000 |
and then this row with the next row and this row with the next row. 00:57:45.000 |
but I just want to make sure I got the citation here. 00:57:51.000 |
it was first invented in the Gatys et al. paper. 00:57:56.000 |
So, "A Neural Algorithm of Artistic Style"? 00:58:00.000 |
Yeah, I mean Gatys, that's the style transfer one. 00:58:04.000 |
Zeiler and Fergus is the feature visualization one. 00:58:06.000 |
Yeah, sorry, I got them switched. Thanks, Jeremy. 00:58:09.000 |
Okay, so we are ending up with this kind of like, 00:58:12.000 |
this Gram matrix, this correlation of features. 00:58:15.000 |
And the way you can read this in this example is to say, 00:58:24.000 |
And if you go and count them, there's seven there. 00:58:26.000 |
And then if I look at any other one in this row, 00:58:29.000 |
like here, there's only one red that occurs alongside a green, right? 00:58:33.000 |
This is the only location where there's one red cell and one green cell. 00:58:37.000 |
There's three reds that occur with the yellow, there, there, and there. 00:58:41.000 |
And so this Gram matrix here has no spatial component at all. 00:58:45.000 |
It's just the feature dimension by the feature dimension. 00:58:48.000 |
But it has a measure of how common these features are, 00:58:57.000 |
Yeah, maybe there's only three greens in total, right? 00:59:06.000 |
One of them occurs alongside a red, one of them occurs alongside a blue. 00:59:10.000 |
This is some measure of what features are present, 00:59:13.000 |
where if they occur together with other features often, 00:59:21.000 |
This is the first clear explanation I've ever seen of how a Gram matrix works. 00:59:31.000 |
I also want to maybe you can open up the original paper, 00:59:34.000 |
because I'd also like to encourage people to look at the original paper, 00:59:37.000 |
because this is something we're trying to practice at this point, 00:59:42.000 |
And so hopefully you can take Jono's fantastic explanation and bring it back to the paper's notation. 01:00:01.000 |
Oh, yeah, it's a different search engine that I'm trying out that has some AI magic, 01:00:05.000 |
but they use Bing for their actual searching, which... 01:00:15.000 |
I don't know if I've actually read this paper as horrific as that sounds. 01:00:22.000 |
But I think it's got some nice pictures, and I'm going to zoom in a bit. 01:00:44.000 |
Okay, yeah, Gram matrix, inner product between the vectorized feature maps. 01:00:48.000 |
And so those kinds of wordings kind of put me off. 01:00:51.000 |
For a while, the way I explained Gram matrices when I had to deal with them at all 01:00:54.000 |
was to say it's magic that measures what features are there without worrying about where they are. 01:01:05.000 |
They talk about which layers they're looking into. 01:01:12.000 |
Okay, yeah, so it doesn't really explain how the Gram matrix works, 01:01:15.000 |
but it's something that people use historically in some other contexts as well 01:01:22.000 |
Nowadays, actually PyTorch has named parameters, 01:01:29.000 |
but you can name layers of a sequential model as well. 01:01:37.000 |
Okay, so just quickly, I wanted to implement this diagram in code. 01:01:41.000 |
I should mention these are like zero or one for simplicity, 01:01:44.000 |
but you could have obviously different size activations and things. 01:01:47.000 |
The correlation idea is still going to be there, 01:01:54.000 |
because it makes it easy to add the batch dimension later and so on. 01:01:59.000 |
But I wanted to also highlight that this is just this matrix 01:02:02.000 |
multiplied with its own transpose, and you're going to get the same result. 01:02:07.000 |
So, yeah, that's our Gram matrix calculation. 01:02:11.000 |
There's no magic involved there as much as it might seem like it. 01:02:15.000 |
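To make that concrete, here is a toy sketch of the idea in plain PyTorch — an illustration with made-up shapes, not the notebook's code: flatten the spatial dimensions of a feature map, then multiply the result by its own transpose.

```python
import torch

# Toy illustration: flatten the spatial dimensions, then multiply by the transpose.
f = torch.randint(0, 2, (4, 8, 8)).float()  # 4 features on an 8x8 grid, values 0 or 1
flat = f.reshape(f.shape[0], -1)            # (features, height*width)
gram = flat @ flat.T                        # (features, features): feature co-occurrence
print(gram.shape)                           # torch.Size([4, 4])
```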
And so we can now use this, like can we create this measure and then... 01:02:22.000 |
I think it's got some similarities, this idea of kind of co-occurrence of features. 01:02:25.000 |
And it also reminds me of the CLIP loss, a similar idea of basically a dot product, 01:02:35.000 |
I mean, we've seen how covariance is basically that as well. 01:02:40.000 |
So this idea of kind of like multiplying with your own transpose 01:02:46.000 |
We've come across three or four times already in this course. 01:02:53.000 |
Even, yeah, you'll see that in like protein folding stuff as well. 01:02:59.000 |
They have a big covariance matrix for like... 01:03:03.000 |
So the difference in each case is like, yeah, 01:03:06.000 |
the difference in each case is the matrix that we're multiplying by its own transpose. 01:03:09.000 |
So for covariance, the matrix is the matrix of differences to the mean, for example. 01:03:15.000 |
And yeah, in this case, the matrix is this flattened picture thing. 01:03:23.000 |
Cool. So I have here the calculate grams function 01:03:27.000 |
that's going to do exactly that operation we did above, 01:03:32.000 |
And the reason we're adding the scaling is that we have this feature map 01:03:35.000 |
and we might pass in images of different sizes. 01:03:37.000 |
And so what this gives us is the absolute like... 01:03:40.000 |
You can see there's a relation to the number of spatial locations here. 01:03:44.000 |
And so by scaling by this width times height, 01:03:47.000 |
we're going to get like a relative measure as opposed to like an absolute measure. 01:03:51.000 |
It just means that the comparisons are going to be valid across different image sizes. 01:03:56.000 |
And so that's the only extra complexity here, 01:03:59.000 |
but we have channels by height by width, image in, 01:04:06.000 |
So this is like channels being the number of features. 01:04:09.000 |
We're going to pass it in two versions of that, right? 01:04:18.000 |
But we're going to map this down to just the features by features, 01:04:33.000 |
you can see I'm targeting five different layers. 01:04:36.000 |
And for each one, the first layer has 64 features. 01:04:47.000 |
So this is doing, it seems like, what we want. 01:04:52.000 |
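A sketch of that scaled calculation might look like the following. It takes a list of feature maps, one per targeted layer, as extracted earlier in the lesson; the function name, layer channel counts, and the einsum string are assumptions, not necessarily the notebook's exact code.

```python
import torch

def calc_grams(features):
    "Sketch: one Gram matrix per (c, h, w) feature map, scaled by the number of spatial locations."
    return [torch.einsum('chw,dhw->cd', f, f) / (f.shape[-2] * f.shape[-1])
            for f in features]

# Pretend feature maps from five layers with 64, 128, ... channels on a 32x32 grid.
feats = [torch.randn(c, 32, 32) for c in (64, 128, 256, 512, 512)]
print([g.shape for g in calc_grams(feats)])  # torch.Size([64, 64]), torch.Size([128, 128]), ...
```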
Because this is a list, we can use this attrgot method, which I... 01:04:57.000 |
Well, actually, it's a fastcore capital L, not a list. 01:05:19.000 |
We're going to calculate these gram matrices for that. 01:05:21.000 |
And then when we get in an input to our loss function, 01:05:26.000 |
we're going to calculate the gram matrices for that 01:05:29.000 |
and do the mean squared error between the gram matrices. 01:05:37.000 |
comparing the two to make sure that they ideally have the same kinds of features present 01:05:41.000 |
and the same kinds of correlations between features. 01:05:48.000 |
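Putting that together, a style loss along these lines might be sketched as follows. It reuses the calc_grams sketch above; feature_fn stands in for the feature-extraction helper built earlier with hooks, and while the class name mirrors the notebook's StyleLossToTarget, this is only a sketch of the idea, not the notebook's exact code.

```python
import torch.nn.functional as F

class StyleLossToTarget:
    "Sketch: MSE between the Gram matrices of an input image and a fixed style target."
    def __init__(self, target_im, feature_fn):
        # feature_fn is assumed: it should return a list of (c, h, w) feature maps
        # from the chosen VGG layers, as set up earlier in the lesson.
        self.feature_fn = feature_fn
        self.target_grams = calc_grams(feature_fn(target_im))   # pre-compute once

    def __call__(self, input_im):
        grams = calc_grams(self.feature_fn(input_im))
        return sum(F.mse_loss(g, t) for g, t in zip(grams, self.target_grams))
```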
So our content image at the moment has quite a high loss 01:05:54.000 |
And that means that the content image doesn't look anything 01:05:56.000 |
like a spider web in terms of its textures and whatever. 01:06:01.000 |
So we're going to set up an optimization thing here. 01:06:13.000 |
For style transfer, it's quite nice to use the content image as the starting point, so that 01:06:23.000 |
we maintain the structure because we're still using 01:06:27.000 |
the content loss as one component of our loss function. 01:06:31.000 |
But now we also have more and more of the style 01:06:34.000 |
because, at the early layers, we're evaluating that style loss. 01:06:38.000 |
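As a rough sketch of that optimization (content_im, content_loss, and style_loss are assumed to have been set up already, e.g. the content loss from earlier in the lesson and the style loss sketch above; the 0.1 weighting, learning rate, and step count are made-up knobs to play with):

```python
import torch

image = content_im.clone().requires_grad_()     # start from the content image
opt = torch.optim.Adam([image], lr=0.05)

for step in range(300):
    opt.zero_grad()
    # Content keeps the structure; style pulls the textures toward the style image.
    loss = content_loss(image) + 0.1 * style_loss(image)
    loss.backward()
    opt.step()
```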
And you can see this doesn't have the same layout 01:06:41.000 |
as our spider web, but it has the same kinds of textures 01:06:50.000 |
And you can see it's done ostensibly what our goal is. 01:06:53.000 |
It's taken one image and it's done it in the style of another. 01:06:59.000 |
And it's actually done it in a particularly clever way 01:07:04.000 |
Her arm has the spider web nicely laid out on it. 01:07:07.000 |
And she's almost picking it out with her fingers. 01:07:13.000 |
And her face, which is quite important or very important 01:07:20.000 |
the model didn't want to mess with the face much at all. 01:07:27.000 |
I think the more you look at it, the more impressive it is 01:07:31.000 |
in how it's managed to find a way to add spider webs 01:07:35.000 |
without messing up the overall semantics of the image. 01:07:46.000 |
If you've been running the notebook with the demo images, 01:07:48.000 |
please right now go and find your own pictures. 01:07:51.000 |
Make sure you're not stealing someone's licensed work, 01:07:53.000 |
but there's lots of Creative Commons images out there. 01:08:03.000 |
And then there's so much that you can experiment with. 01:08:05.000 |
So for example, you can change the content loss 01:08:20.000 |
You can change how you scale these two components 01:08:28.000 |
All of this is up for grabs in terms of what you can optimize 01:08:35.000 |
And you get different results with different scalings 01:08:40.000 |
So there's a whole lot of fun experimentation to be done 01:08:45.000 |
that gives you a pleasing result for a given style content pair 01:08:49.000 |
and for a given effect that you want on the output. 01:08:55.000 |
one of the really interesting things about this 01:09:07.000 |
I think there's definitely some special properties of VGG 01:09:11.000 |
that allow for it to do well for style transfer. 01:09:18.000 |
how we can use maybe other networks for style transfer 01:09:21.000 |
that maintain maybe some of these nice properties of VGG. 01:09:29.000 |
And of course, we have this very nice framework 01:09:31.000 |
that allows us to easily plug and play different networks 01:09:54.000 |
one of the things that when we were developing this, 01:09:56.000 |
I said to Jeremy, was like, ah, we're doing all this work 01:10:00.000 |
Isn't it nicer to just have like, here's my image 01:10:07.800 |
And the answer is that it is theoretically easier 01:10:14.000 |
And that's why you see in a tutorial or something, 01:10:20.400 |
But as soon as you say, OK, I'd like to try this again 01:10:30.080 |
And then you say, oh, let me add some progress stuff, so images. 01:10:35.880 |
As soon as you want to save images for a video 01:10:40.600 |
and you want to do some sort of annealing on your learning 01:10:43.520 |
rate, each of these things is going to grow this loop 01:10:51.360 |
able to experiment with a completely new version 01:10:59.400 |
and having everything in its own piece, like the image logging. 01:11:03.200 |
Or if you wanted to make a little movie showing the progress, 01:11:09.480 |
You're just tweaking one thing, but all the other infrastructure is still there for you. 01:11:25.080 |
Something I actually think that people do too much of when 01:11:32.480 |
they use the fastai library is jumping straight into data blocks, when they 01:11:42.240 |
might be working on a slightly more custom thing 01:11:44.520 |
where there isn't a data block already written for them. 01:11:49.080 |
And so then step one is like, oh, write a data block. 01:11:53.440 |
And you actually want to be focusing on building your model. 01:11:57.360 |
So I say to people, I will go down a layer of abstraction. 01:12:07.000 |
start at the very lowest level of abstraction. 01:12:08.920 |
So something like the very last thing that you showed, Jono, 01:12:11.380 |
just because in my experience, I'm not good enough to get a training loop right from scratch. 01:12:17.000 |
And so most of the time, yeah, I'll forget zero_grad, 01:12:26.680 |
by using like, you know, FP16, mixed precision, 01:12:34.400 |
to think about how to put metrics in so that I can 01:12:43.760 |
but I do quite often start at a reasonably low level. 01:12:51.160 |
have this tool where we fully understand all the layers, 01:12:58.200 |
And yeah, you could like write your own train cb or whatever. 01:13:02.920 |
And at least you've got something that makes sure, 01:13:04.920 |
for example, that, oh, OK, you remember to use torch.no_grad 01:13:07.720 |
here, and you remember to put it in, you know, 01:13:15.880 |
And you'll be able to easily run a learning rate 01:13:17.980 |
finder and easily have it run on CUDA or whatever device 01:13:23.320 |
So I think hopefully this is a good place for people 01:13:27.400 |
now to have a framework that they can call their own, 01:13:32.760 |
you know, and use as much or as little as it makes sense. 01:13:38.560 |
like there are multiple ways of doing the same thing. 01:13:41.120 |
And it's like, whatever way maybe works better for you, 01:13:45.200 |
Like, for example, Jonathan showed with the image opt 01:13:48.020 |
callback, you could implement that in different ways, 01:13:50.600 |
and whichever one, I guess, is easier for you to understand 01:13:55.040 |
or easier for you to work with, you can implement it that way. 01:13:58.480 |
And yeah, Mini-AI is flexible in multiple ways. 01:14:06.980 |
Yeah, and this is one extreme of like weirdness, 01:14:09.320 |
I think, which is like, Jono is like using Mini-AI for something 01:14:13.120 |
that we never really considered making it for, 01:14:16.080 |
which is like it's not even looping through data. 01:14:23.480 |
So, you know, this is about as weird as it's going to get, 01:14:26.800 |
Yeah, well, the next notebook is about as weird 01:14:34.920 |
what we're going to do next is use this kind of style loss 01:14:37.200 |
in an even funkier way to train a different kind of thing. 01:14:42.480 |
to just call out like using these pre-trained networks 01:14:45.040 |
as very smart feature extractors is pretty powerful. 01:14:54.040 |
So if you're doing like a super resolution or even something 01:15:04.360 |
We've played around with using perceptual loss for diffusion. 01:15:07.520 |
Or even during like say you want to generate an image that 01:15:10.840 |
matches effects in some kind of image-to-image thing 01:15:13.040 |
with stable diffusion, maybe you have an extra guidance 01:15:16.160 |
function that makes sure that structurally it matches, 01:15:26.040 |
to say in the style of it's a Van Gogh starry night. 01:15:30.240 |
And for all sorts of like image-to-image tasks, 01:15:33.960 |
this idea of using the features from a network like VGG, 01:15:47.080 |
going to look at something a little bit more niche now 01:15:51.440 |
And so try and spend about half an hour on this 01:16:06.840 |
And so you may be familiar with classic cellular automata. 01:16:19.240 |
But you've probably seen this kind of classic Conway's Game of Life. 01:16:26.160 |
I was still not like Edward Lively there when I was a kid. 01:16:40.560 |
it's going to remain the same state for the next one. 01:16:49.920 |
It's a classic example of a distributed system, a self-organizing system, 01:16:54.800 |
where there's no global communication or anything. 01:16:58.000 |
Each cell can only look at its immediate neighbors. 01:17:00.160 |
And typically, the rules are really small and simple. 01:17:03.640 |
And so we can use these to model these complex systems. 01:17:08.320 |
where we actually do have huge arrangements of cells, each 01:17:13.880 |
like sensing chemicals in the bloodstream next to it 01:17:17.120 |
And yet somehow, they're able to coordinate together. 01:17:19.720 |
I watched a really cool Kurzgesagt video the other day about ants. 01:17:29.520 |
were organized by having little chemical signals 01:17:35.680 |
And yeah, it can organize the entire massive ant colony 01:17:42.080 |
I thought it was crazy, but it sounds really similar. 01:17:53.360 |
You could do very similar things where you have-- 01:18:05.600 |
But this is exactly that kind of system, right? 01:18:07.640 |
Each little tiny dot, which are almost too small to see, 01:18:14.320 |
The difference between this and what we're going to do today 01:18:17.880 |
Just to clarify, I think you've told me before 01:18:19.760 |
that actual slime molds kind of do this, right? 01:18:29.320 |
And then after that, that signal is going to propagate. 01:18:31.680 |
And anything that's moving is going to follow. 01:18:34.360 |
And so yeah, if you play with this kind of simulation, 01:18:36.600 |
you often get patterns that look exactly like emergent 01:18:39.680 |
patterns in nature, like ants moving to food or corals 01:19:02.160 |
It's just each individual cell looking at its neighbors 01:19:11.120 |
it's kind of like a single kind of if statement. 01:19:25.000 |
or there's no one there, you with zero or one neighbors, 01:19:31.400 |
replace that hard-coded if statement with a neural network, 01:19:35.880 |
and in particular, a very small neural network. 01:19:49.120 |
And they built these neural cellular automata. 01:19:57.360 |
And they can't see the global structure at all. 01:19:59.160 |
And it starts out with a single black pixel in the middle. 01:20:04.160 |
can see it builds this little lizard structure, 01:20:07.920 |
So that's wild to me that a bunch of pixels that only 01:20:22.520 |
And what's more, the way that they train them, 01:20:32.880 |
No little agent here knows what the full picture looks like. 01:20:37.560 |
All it knows is that its neighbors have certain values. 01:20:40.480 |
And so it's going to update itself to match those values. 01:20:47.820 |
that ought to have a lot of use in the real world with-- 01:20:54.000 |
I don't know-- having a bunch of drones working together 01:20:58.000 |
when they can't contact some kind of central base. 01:21:00.400 |
So I'm thinking about work that some Australian folks have 01:21:04.000 |
been involved in, where they were doing subterranean, 01:21:11.640 |
through thousands of meters of rock, stuff like that. 01:21:16.880 |
Yeah, so this idea of self-organizing systems, 01:21:23.120 |
and things like that that can do pretty amazing things. 01:21:32.320 |
And definitely, yeah, that's a pattern that's 01:21:37.600 |
Lots of loosely coordinated cells coming together 01:21:40.080 |
and talking about deep learning is quite a miracle. 01:21:43.720 |
And so I think, yeah, that's an interesting pattern to explore. 01:21:54.920 |
so that you can get something that not only builds out 01:21:57.420 |
an image or builds out something like a texture, 01:22:07.760 |
going to set up a neural network with some learnable weights 01:22:11.000 |
that's going to apply our little update rule. 01:22:14.720 |
And this is just going to be a little dense MLP. 01:22:26.360 |
that aren't shown that the agents can use as communication 01:22:34.920 |
We'll be able to get our neighbors using maybe convolution 01:22:40.080 |
and feed them through a little MLP, and take our outputs 01:22:44.880 |
And just to clarify something that I missed originally 01:22:54.080 |
You're only allowed to see the little things right next to you. 01:23:05.280 |
But we're going to do one that doesn't even have that. 01:23:16.940 |
look at the final output and compare it to our targets, 01:23:21.080 |
And you might think, OK, well, that's pretty cool. 01:23:27.840 |
get something that after some number of steps 01:23:34.220 |
But there's this problem, which is that you're 01:23:36.560 |
applying some number of steps, and then you're 01:23:40.840 |
But that doesn't guarantee that it's going to be stable-- 01:23:43.960 |
sorry, I think that's my phone-- stable longer term. 01:23:47.600 |
And so we need some additional way to say, OK, 01:23:51.600 |
I'd like to then maintain that shape once I have it. 01:24:10.560 |
We'll apply our loss function and update our network. 01:24:13.440 |
And then most of the time, we'll take that final output, 01:24:23.880 |
might see the initial state and have to produce the lizard. 01:24:27.360 |
Or it might see a lizard that's already been produced. 01:24:33.000 |
And so this is adding an additional constraint that 01:24:42.160 |
And so, yeah, it's also nice because, like I mentioned here, 01:24:46.360 |
initially, the model ends up in various incorrect states 01:24:51.480 |
And it then has to learn to correct those as well. 01:24:58.160 |
And you can see here, now they have a thing that 01:25:00.120 |
is able to grow into the lizard and then maintain 01:25:08.380 |
where they sometimes chop off half of the image 01:25:16.200 |
or something that can only see the ones nearby. 01:25:22.200 |
And a gust of wind could come along and set them off path. 01:25:30.460 |
Half of them go offline and run out of battery. 01:25:37.360 |
is a little bit more complicated than we just 01:25:40.000 |
have a network and some target outputs and we optimize it. 01:25:43.680 |
So we're not going to follow that paper exactly, 01:25:45.600 |
although it should be fairly easy to tweak what 01:25:49.800 |
We're instead going to go for a slightly different one 01:25:53.320 |
by the same authors where they train even smaller 01:26:01.600 |
We'd like to produce a texture without necessarily worrying 01:26:08.680 |
And so the same sort of idea, the same sort of training, 01:26:14.400 |
we'd like it to look like our target style image. 01:26:21.000 |
And something that makes a texture a texture in this case, 01:26:28.520 |
And so that tiling is going to come almost for free. 01:26:35.800 |
And every cell is going to do the same rule, which 01:26:46.480 |
like, say, a convolution to get our neighbors, 01:26:48.240 |
we need to think about what happens for the neighbors 01:26:53.720 |
And by default, those will just be padding of 0. 01:26:59.300 |
And B, they won't necessarily have any communication 01:27:03.440 |
If we want this to tile, what we're going to do 01:27:05.320 |
is we're going to set our padding mode to circular. 01:27:08.280 |
Or in other words, the neighbors of this top right cell 01:27:10.560 |
are going to be these cells next to it here and these cells 01:27:14.920 |
And then for free, we're going to get tiling. 01:27:30.440 |
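A tiny demonstration of what circular padding does, in plain PyTorch (nothing from the notebook is assumed here):

```python
import torch
import torch.nn.functional as F

x = torch.arange(9.).reshape(1, 1, 3, 3)           # a tiny 3x3 one-channel "grid"
wrapped = F.pad(x, (1, 1, 1, 1), mode='circular')  # borders copied from the opposite side
print(wrapped[0, 0])
# nn.Conv2d also accepts padding_mode='circular', which is all that's needed
# for the world to wrap around so the learned texture tiles seamlessly.
```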
And again, feel free to experiment with your own, please. 01:27:41.040 |
to this calculate grams function, which I didn't do 01:27:44.760 |
in the style transfer example because you're always 01:27:48.560 |
Everything else is going to be pretty much the same. 01:27:51.040 |
So we can set up our style loss with the target image. 01:27:53.920 |
And then we can feed in a new image, or in this case, 01:28:13.960 |
And our number of hidden neurons in the brain 01:28:22.000 |
One thing people might want to play with in StyleLossToTarget 01:28:25.320 |
is that you're giving all the layers the same weight. 01:28:28.440 |
A nice addition would be to have a vector of weights, one per layer. 01:28:39.260 |
So the world in which these cellular automata live 01:28:45.060 |
is just a grid of zeros that we get if we call this function with a number of channels and a size. 01:28:48.800 |
You could make it non-square if you cared about that. 01:28:51.920 |
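That setup is tiny; something like this sketch (the function name and default sizes are assumptions):

```python
import torch

def make_grids(n, sz=128, nc=4):
    "Sketch: a batch of n worlds, each nc channels of sz x sz cells, starting at zero."
    return torch.zeros(n, nc, sz, sz)

pool = make_grids(256)   # e.g. the pool of 256 grids used later in training
print(pool.shape)        # torch.Size([256, 4, 128, 128])
```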
For our perception, in this little diagram here, 01:28:59.920 |
If these filters were learned, there would be additional weights in the neural network. 01:29:02.760 |
The reason they're hard-coded is because the people who wrote the paper 01:29:07.760 |
wanted to keep the parameter counts really low. 01:29:10.840 |
Literally like a few hundred parameters total. 01:29:25.280 |
one of the coolest things about these systems 01:29:27.360 |
is they really can do a lot with very little parameters. 01:29:30.840 |
And so these filters that we're just going to hard-code 01:29:33.040 |
are going to be the identity, just looking at itself. 01:29:36.720 |
And then a couple that are looking at gradients. 01:29:40.160 |
Again, inspired by biology, where even simple cells 01:29:42.760 |
can sense gradients of chemical concentration. 01:29:47.240 |
We're going to have a way to apply these filters 01:29:50.840 |
Just to help people understand that that first one, 01:30:06.800 |
This one is going to sense a horizontal gradient. 01:30:09.960 |
This one is going to sense a vertical gradient. 01:30:21.360 |
And rather than having a kernel that has separate weights for every channel, we apply the same filter to each channel. 01:30:32.640 |
I didn't know circular was a padding mode before. 01:30:36.320 |
where it's basically going to circle around and kind of copy 01:30:40.960 |
in the thing from the other side when you reach the edge. 01:30:48.480 |
You'll see a lot of implementations just deal with the fact 01:30:51.320 |
that they have slightly weird pixels around the edge 01:30:59.520 |
We can apply our filters to get our model inputs. 01:31:05.320 |
because we have four channels and four filters. 01:31:07.840 |
16 inputs, that's going to be the input to our little brain. 01:31:10.560 |
And we have this for every location in the grid. 01:31:13.640 |
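A sketch of that perception step might look like the following. The identity, Sobel, and Laplacian kernels and their scaling follow the usual NCA formulations rather than this exact notebook, and the function and variable names are assumptions; the grouped convolution applies each filter to every channel separately, with circular padding so the world wraps around.

```python
import torch
import torch.nn.functional as F

# Hard-coded perception filters (a sketch; the notebook may scale these differently).
ident   = torch.tensor([[0., 0, 0], [0, 1, 0], [0, 0, 0]])
sobel_x = torch.tensor([[-1., 0, 1], [-2, 0, 2], [-1, 0, 1]]) / 8
sobel_y = sobel_x.T
lap     = torch.tensor([[1., 2, 1], [2, -12, 2], [1, 2, 1]]) / 16
filters = torch.stack([ident, sobel_x, sobel_y, lap])          # (4, 3, 3)

def perceive(x, nc=4):
    "Apply each filter to every channel separately: (b, nc, h, w) -> (b, nc*4, h, w)."
    k = filters.repeat(nc, 1, 1).unsqueeze(1)                  # (nc*4, 1, 3, 3)
    x = F.pad(x, (1, 1, 1, 1), mode='circular')                # wrap-around neighbours
    return F.conv2d(x, k, groups=nc)                           # grouped = per-channel filtering

grid = torch.zeros(1, 4, 32, 32)
print(perceive(grid).shape)                                    # torch.Size([1, 16, 32, 32])
```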
So now, how do we implement that little neural network 01:31:23.720 |
We have a linear layer with number of channels 01:31:28.960 |
as its number of inputs, some hidden number of neurons. 01:31:31.720 |
We have a relu, and then we have a second linear layer 01:31:34.880 |
that's outputting one output per channel as the update. 01:31:39.440 |
And so if we wanted to use this as our brain, what we'd have to do 01:31:42.000 |
is we'd have to deal with these extra dimensions. 01:31:46.000 |
So we take our batch by channel by height and width. 01:31:48.760 |
We're going to map the batch and the height and the width 01:31:51.040 |
all to one dimension and the channels to the second dimension. 01:31:54.280 |
So now we have a big grid of 16 inputs and lots of examples. 01:32:04.520 |
to teach people about that, and then maybe in the next case. 01:32:14.560 |
that has just 16 features, feed that through the linear layer, 01:32:25.080 |
can see what parameters we have on our brain. 01:32:26.920 |
We have an 8 by 16 weight matrix and 8 biases for the first layer. 01:32:36.200 |
And I've said bias equals false, because we're 01:32:43.840 |
the update is usually going to be 0 or close to it. 01:32:53.680 |
And so that's why we're setting bias equals false. 01:33:01.080 |
feeding these examples through the linear layer. 01:33:06.960 |
So this might seem like, wait, but isn't this a linear layer? 01:33:19.440 |
But we can do that by having a filter size of 1, 01:33:23.080 |
a kernel size of 1 in our convolutional layer. 01:33:25.640 |
So I have 16 input channels in my model inputs here. 01:33:43.600 |
And so we can see this gives me the right shape output. 01:33:46.000 |
And if I look at the parameters, my first convolutional layer 01:33:49.360 |
has 8 by 16 by 1 by 1 parameters in its filters. 01:33:54.600 |
And so maybe spend a little bit of time convincing yourself 01:34:09.840 |
that this is basically the same idea as applying that same linear layer at every spatial location. 01:34:16.560 |
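A quick way to convince yourself of that equivalence — the sizes here (16 inputs, 8 hidden) just mirror the numbers being discussed and are otherwise arbitrary:

```python
import torch
import torch.nn as nn

lin  = nn.Linear(16, 8, bias=False)
conv = nn.Conv2d(16, 8, kernel_size=1, bias=False)
conv.weight.data = lin.weight.data.view(8, 16, 1, 1)    # same weights, reshaped

x = torch.randn(1, 16, 32, 32)                          # 16 perception inputs per cell
out_conv = conv(x)                                      # (1, 8, 32, 32)

# The same thing via the linear layer: channels last, flatten the grid, apply, restore.
out_lin = lin(x.permute(0, 2, 3, 1).reshape(-1, 16)).reshape(1, 32, 32, 8).permute(0, 3, 1, 2)
print(torch.allclose(out_conv, out_lin, atol=1e-5))     # True: a 1x1 conv is a per-pixel linear layer
```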
And I should mention that convolutions are very efficient. 01:34:23.680 |
And what makes neural cellular automata quite exciting 01:34:26.200 |
is that because we're doing this convolution, 01:34:29.560 |
you have an operation for every pixel that we're applying. 01:34:35.720 |
There's no global thing that we need to handle. 01:34:38.760 |
And so this is actually exactly what GPUs were designed for. 01:34:41.960 |
They're designed for running some operation for every pixel 01:34:46.200 |
or show you your video game or make your website 01:34:50.160 |
And so we can take advantage of that kind of built-in bias 01:34:52.840 |
of the hardware from doing lots of little operations in parallel 01:34:57.960 |
And I'll show you just now we can run these real time 01:35:03.640 |
OK, so now that we have all that infrastructure in place, 01:35:10.120 |
We have our little brain, two convolutional layers 01:35:14.120 |
Optionally, we can set the weights of the second layer to zero initially. 01:35:22.560 |
Not strictly necessary, but it does help the training. 01:35:28.080 |
I don't know if it matters, but I'd be inclined to put that-- 01:35:30.600 |
I would be inclined to do nn.init.constant_ to zero 01:35:37.680 |
or put that in a no_grad, like often initializing things 01:35:49.000 |
In the forward method, we're going to apply our filters 01:35:59.320 |
Oh, it's the .data, which is the thing that makes it-- 01:36:01.640 |
yeah, you don't need torch.no_grad because you've got .data. 01:36:01.640 |
It's feeding it through the first convolutional layer, 01:36:16.800 |
And then it's doing this final step, which again goes back 01:36:26.280 |
And one thing that you don't have in a biological system 01:36:28.860 |
is some sort of global clock where everything updates 01:36:41.320 |
And if you go in-- let's just actually write this out. 01:36:43.440 |
Let's just make a cell and check that this is what we're doing. 01:36:50.360 |
So I'm going to just go the hw1, just to visualize. 01:37:10.160 |
And this is going to determine whether we apply this update 01:37:24.440 |
and then every cell is running the exact same rule, 01:37:26.920 |
after one update, we'll still have a perfectly uniform grid. 01:37:29.840 |
There's no way for there to be any randomness. 01:37:36.760 |
only a subset of cells are going to be updated. 01:37:39.760 |
They have different neighborhoods and things. 01:37:44.600 |
And this is very much like in a biological system, 01:37:49.160 |
So that's a little bit of additional complexity. 01:37:51.240 |
But again, inspired by nature and inspired by the paper. 01:37:55.680 |
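Pulling the pieces together, the update rule being described might be sketched like this. It reuses the perceive sketch above; the class name, hidden size, zero-initialised second layer, and the update_rate value are all assumptions rather than the notebook's exact code.

```python
import torch
import torch.nn as nn

class SimpleCA(nn.Module):
    "Sketch of the cellular automaton 'brain': two 1x1 convs plus a stochastic update mask."
    def __init__(self, nc=4, hidden=8):
        super().__init__()
        self.nc = nc
        self.w1 = nn.Conv2d(nc * 4, hidden, 1)           # 16 perception inputs -> hidden
        self.relu = nn.ReLU()
        self.w2 = nn.Conv2d(hidden, nc, 1, bias=False)   # one update per channel
        self.w2.weight.data.zero_()                      # start with updates of zero

    def forward(self, x, update_rate=0.5):
        y = self.w2(self.relu(self.w1(perceive(x, self.nc))))
        # Stochastic update: each cell only applies its update some of the time,
        # which breaks the symmetry of a perfectly synchronised global clock.
        b, _, h, w = x.shape
        mask = (torch.rand(b, 1, h, w, device=x.device) < update_rate).float()
        return x + y * mask
```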
With all of this in place, we can do our training. 01:37:58.520 |
We're going to use the same dummy data set idea as before. 01:38:03.280 |
We are going to have a progress callback, which is a lot of code. 01:38:11.040 |
And so I'm not going to spend too much time on that. 01:38:13.800 |
And then the fun stuff is going to happen in our training 01:38:16.720 |
And so now we are actually getting deep into the weeds. 01:38:21.480 |
This is much more complicated than just feeding a batch 01:38:25.520 |
We are setting up a pool of grids, 256 examples. 01:38:31.720 |
And these are all going to start out as just uniform zeros. 01:38:37.840 |
going to pick some random samples from that pool. 01:38:41.960 |
We are occasionally going to reset those samples 01:38:46.720 |
And then we're going to apply the model a number of times. 01:38:56.760 |
this is like a 50-layer deep model all of a sudden. 01:39:00.840 |
Is that going to be a learn.model rather than self.learn.model? 01:39:12.840 |
that by applying this a large number of times, 01:39:15.000 |
we could get something like gradient exploding 01:39:16.880 |
and things like that, which we'll deal with a little bit 01:39:20.360 |
But we apply the model a large number of steps. 01:39:28.960 |
These are the outputs after we've applied a number of steps. 01:39:32.720 |
And in the loss, we're going to use a style loss saying, 01:39:36.240 |
does this match the style of my target image? 01:39:39.000 |
And we're going to add an overflow loss that penalizes it 01:39:41.440 |
if the values are out of bounds, just to try and-- 01:39:54.160 |
One more self.learn.pred.clamp and the overflow loss one. 01:40:09.680 |
Again, something that's quite likely to happen 01:40:12.120 |
when you're applying a large number of steps, 01:40:23.040 |
going to be quite useful in some other places as well, 01:40:27.680 |
And so we're just running through the parameters and normalizing each one's gradient. 01:40:33.320 |
And this means that even if the gradients are really, really tiny or 01:40:35.600 |
really, really large at the end of that multiple number of steps, they get rescaled to a consistent size. 01:40:44.540 |
Now, let's put a bookmark to come back to that as well 01:40:49.320 |
And I guess that before_fit, maybe we don't need anymore. 01:40:56.640 |
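Written as a plain loop rather than a miniai callback, the training step being described might be sketched roughly as follows. It leans on the make_grids and SimpleCA sketches above; model, style_loss (assumed here to accept a batch of RGB images), the reset schedule, the 50-step rollout, and all the constants are assumptions to experiment with, not the notebook's values.

```python
import torch

pool = make_grids(256)                        # pool of persistent grids
model = SimpleCA()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(1000):
    idx = torch.randint(0, len(pool), (8,))   # pick a random batch from the pool
    x = pool[idx]
    if step % 2 == 0:
        x[:1] = make_grids(1)                 # occasionally reset a sample to zeros

    for _ in range(50):                       # apply the update rule many times
        x = model(x)

    rgb = x[:, :3]                            # treat the first 3 channels as the visible image
    overflow = (x - x.clamp(-1, 1)).abs().mean()   # penalise values drifting out of bounds
    loss = style_loss(rgb) + overflow

    opt.zero_grad()
    loss.backward()
    for p in model.parameters():              # normalize each gradient so its scale
        if p.grad is not None:                # stays sane after 50 chained steps
            p.grad /= p.grad.norm() + 1e-8
    opt.step()

    pool[idx] = x.detach()                    # put the results back into the pool
```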
So I should have set this running before we started talking. 01:41:05.920 |
And the reason I'm-- you'll see in the callback here, 01:41:17.820 |
It's just because the overflow loss is sometimes 01:41:19.820 |
so much larger than the rest of the loss ones 01:41:26.000 |
So using a log scaling and clipping the bounds 01:41:28.580 |
tends to help just visualize what's actually important, 01:41:48.720 |
But you can take a look at them and kind of compare them 01:41:54.840 |
After some training, we get some definite webby tendencies. 01:42:02.080 |
to a random grid and log the images every 100 steps 01:42:08.600 |
And you can see that starting from this random position, 01:42:16.240 |
But in its defense, this model has 168 parameters. 01:42:26.960 |
they're able to do something pretty impressive. 01:42:31.680 |
to where we define number of channels and number of layers. 01:42:35.240 |
If you give it more channels to work with, 8 or 16, 01:42:54.000 |
And just to give you a little preview of what's possible, 01:43:05.520 |
But what I did was I logged the cellular automata. 01:43:11.320 |
We-- this is way outside of the bounds for this course. 01:43:16.720 |
But you can write something called a fragment shader 01:43:22.920 |
It's a little program that runs once for every pixel. 01:43:29.720 |
I have sampling the neighborhood of each cell. 01:43:39.560 |
We're running through the layers of our network 01:43:49.880 |
and optimized with a slightly different loss function. 01:43:52.000 |
So it was a style loss plus clip to the prompt, 01:43:54.640 |
I think, dragon scales or glowing dragon scales. 01:43:57.920 |
And you can see this is running in real time or near real time 01:44:08.840 |
And so in a similar way, in this weights and biases report, 01:44:17.000 |
And just logging the grids from the different things. 01:44:21.440 |
And so you can see these are still pretty small 01:44:31.120 |
But quite fun to see what you can do with these. 01:44:35.280 |
and train for a bit longer and use a few more channels, 01:44:39.520 |
And you can get really creative applying them 01:44:42.120 |
Or I did some messing around with video, which, again, 01:44:47.040 |
is just like messing with the inputs to different cells 01:44:52.040 |
So yeah, to me, this is a really exciting, fun-- 01:44:57.640 |
Yeah, I don't know if there's too many practical applications 01:45:03.000 |
cellular automata and stylizing or image restoration 01:45:07.840 |
And you can really have a lot of fun with this structure. 01:45:10.880 |
And I also thought it was just a good demo of how far can we 01:45:14.240 |
push what you can do with the training callback 01:45:17.000 |
to have this pool training and the gradient normalization 01:45:24.160 |
Very, very different from here's a batch of images 01:45:30.360 |
And then, Jeremy, if you have any questions or follow-ups? 01:45:37.160 |
But that's just one of the coolest things I've seen.