Lesson 13: Deep Learning Part 2 2018 - Image Enhancement
Chapters
0:00
40:07 Language
40:37 Machine Learning can amplify bias
46:11 Runaway feedback loops
48:08 Bias in AI: offensive to tragic
57:15 A Neural Algorithm of Artistic Style
00:00:00.000 |
Welcome to Lesson 13, where we're going to be talking about image enhancement. 00:00:11.400 |
Image enhancement would cover things like this painting that you might be familiar with. 00:00:15.880 |
However, you might not have noticed before that this painting actually has a picture added into it. 00:00:22.760 |
The reason you may not have noticed that before is that this painting actually didn't use to have it. 00:00:28.280 |
By the same token, actually, on that first page, this painting did not use to have Captain America's shield in it. 00:00:34.600 |
This painting did not use to have a clock in it either. 00:00:37.960 |
This is a cool new paper that just came out a couple of days ago called "Deep Painterly Harmonization". 00:00:44.200 |
It uses almost exactly the technique we're going to learn in this lesson, with some minor tweaks. 00:00:51.200 |
But you can see the basic idea is to take one picture, paste it on top of another picture, 00:00:57.040 |
and then use some kind of approach to combine the two. 00:01:01.320 |
And the basic approach is something called style transfer. 00:01:07.920 |
Before we talk about that, though, I wanted to mention this really cool contribution by 00:01:14.160 |
William Horton, who added this stochastic weight averaging technique to the FastAI library 00:01:24.160 |
And he's written a whole post about that which I strongly recommend you check out, not just 00:01:27.560 |
because stochastic weight averaging actually lets you get higher performance from your existing 00:01:33.680 |
neural networks with basically no extra work. 00:01:37.440 |
It's as simple as adding these two parameters to your fit function. 00:01:41.480 |
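As a concrete sketch of what that looks like -- assuming an existing learn object and the 0.7-era fastai fit signature, with use_swa and swa_start as the parameter names from William's contribution as I recall them (check his post for the exact details):

```python
# A minimal sketch, assuming the 0.7-era fastai API and an existing `learn` object.
learn.fit(
    1e-2, 3,          # learning rate and number of cycles, as usual
    use_swa=True,     # keep a running average of the weights during training
    swa_start=1,      # epoch at which to start averaging
)
```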
But also he's described his process of building this, and how he tested it, and how he contributed it to the library. 00:01:47.960 |
So I think it's interesting if you're interested in doing something like this, because I think 00:01:55.080 |
William had not built this kind of library before, so he describes how he did it. 00:02:02.140 |
Another very cool contribution to the FastAI library is a new train phase API. 00:02:10.320 |
And I'm going to do something I've never done before, which is I'm actually going to present somebody else's notebook. 00:02:16.400 |
And the reason I haven't done it before is because I haven't liked any notebooks enough 00:02:21.280 |
to think they're worth presenting, but Sylvain has done a fantastic job here of not just 00:02:25.720 |
creating this new API, but also creating a beautiful notebook describing what it is and how it works. 00:02:33.440 |
And the background here is, as you guys know, we've been trying to train networks faster 00:02:41.880 |
partly as part of this DawnBench competition, and also for a reason that you'll learn about later in this lesson. 00:02:50.000 |
And I mentioned on the forums last week, it would be really handy for our experiments 00:02:56.160 |
if we had an easier way to try out different learning rate schedules and stuff, and I basically 00:03:04.280 |
said it would be really cool if somebody could write this, because I'm going to bed now. 00:03:12.640 |
And Sylvain replied on the forum, "Well, that sounds like a good challenge." 00:03:23.120 |
I want to take you through it because it's going to allow you to do research into things that nobody has tried before. 00:03:32.160 |
So it's called the train phase API, and the easiest way to show it is to show an example 00:03:41.880 |
Here is a chart of learning rate against iterations, of the kind you're familiar with seeing. 00:03:48.560 |
And this is one where we train for a while at a learning rate of 0.01, and then we train for a while at a lower learning rate. 00:03:58.240 |
I actually wanted to create something very much like that learning rate chart because 00:04:02.120 |
most people that train ImageNet use this stepwise approach, and it's actually not something 00:04:09.920 |
that's built into fast AI because it's not generally something we recommend. 00:04:14.040 |
But in order to replicate existing papers, I wanted to do it the same way. 00:04:18.400 |
And so rather than writing a number of fit calls with different learning rates, it would 00:04:23.280 |
be nice to be able to basically say train for n epochs at this learning rate, and then m epochs at that learning rate. 00:04:32.420 |
And that's basically what this API gives you. A phase is a period of training with particular optimizer parameters, and 00:04:39.920 |
a schedule consists of a number of training phase objects. 00:04:43.040 |
A training phase object specifies how many epochs to train for, what optimization function to 00:04:49.360 |
use, and what learning rate, amongst other things that we'll see. 00:04:54.080 |
And so here you'll see the two training phases that you just saw on that graph. 00:05:00.240 |
So now, rather than calling learn.fit, you say learn.fit_opt_sched and pass in your phases. 00:05:12.480 |
And then from there, most of the things you pass in can just get sent across to the fit 00:05:17.040 |
function as per usual, so most of the usual parameters will work fine. 00:05:22.560 |
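To make that concrete, here's a minimal sketch of the two-phase schedule just described, assuming the fastai 0.7-era names (TrainingPhase, fit_opt_sched) from Sylvain's notebook -- treat the exact signatures as approximate:

```python
from fastai.conv_learner import *  # fastai 0.7-era star import; brings in torch's optim too

# Two phases: one epoch at 0.01, then two epochs at 0.001, with plain SGD.
phases = [
    TrainingPhase(epochs=1, opt_fn=optim.SGD, lr=1e-2),
    TrainingPhase(epochs=2, opt_fn=optim.SGD, lr=1e-3),
]
learn.fit_opt_sched(phases)  # instead of learn.fit(...); `learn` is an existing Learner
learn.sched.plot_lr()        # plot the resulting learning rate (and momentum) schedule
```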
But in this case, generally speaking, actually we can just use these training phases and 00:05:33.520 |
plot the resulting schedule. And not only does it plot the learning rate, it also plots momentum, and for each phase 00:05:42.880 |
it shows the optimizer used. You can turn off the printing of the optimizer details, you can turn off the plotting of momentums, 00:05:49.000 |
and you can do other little things like a training phase could have an LR decay parameter. 00:05:54.840 |
So here's a fixed learning rate, and then a linear decay learning rate, and then a fixed learning rate again. 00:06:03.960 |
And this might be quite a good way to train actually, because we know at high learning 00:06:09.640 |
rates you get to explore better, and at low learning rates you get to fine-tune better, 00:06:16.400 |
and it's probably better to gradually slide between the two. 00:06:20.000 |
So this actually isn't a bad approach, I suspect. 00:06:26.260 |
You can use other decay types, not just linear: cosine, which probably makes even more 00:06:30.960 |
sense as a genuinely useful learning rate annealing shape; exponential; and 00:06:43.240 |
polynomial, which isn't terribly popular, but actually in the literature works better than 00:06:48.560 |
just about anything else, yet seems to have been largely ignored, so polynomial is good to be aware of. 00:06:54.200 |
And what Sylvain's done is he's given us the formula for each of these curves. 00:06:59.520 |
And so with a polynomial you get to pick what polynomial to use. 00:07:15.640 |
And I believe a p of 0.9 is the one that I've seen really good results for, FYI. 00:07:27.120 |
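A sketch of what those decay types might look like in this API -- the DecayType member names and the tuple form for passing the polynomial power are my recollection of the notebook, so treat them as assumptions:

```python
# When lr is a (start, end) tuple, the phase anneals between the two values
# using the given decay shape.
phases = [
    TrainingPhase(epochs=1, opt_fn=optim.SGD, lr=(1e-2, 1e-3),
                  lr_decay=DecayType.COSINE),
    TrainingPhase(epochs=1, opt_fn=optim.SGD, lr=(1e-3, 0),
                  lr_decay=(DecayType.POLYNOMIAL, 0.9)),  # polynomial with p = 0.9
]
learn.fit_opt_sched(phases)
```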
If you don't give a tuple of learning rates when there's an LR decay, then it will decay all the way down to zero. 00:07:34.360 |
And as you can see, you can happily start the next cycle at a different point. 00:07:44.240 |
So the cool thing is now we can replicate all of our existing schedules using nothing 00:07:52.120 |
So here's a function called phases_sgdr, which does SGDR using the new training phase API. 00:08:00.100 |
And so you can see, if you run this schedule, then here's what it looks like. 00:08:05.580 |
He's even done the little trick I use where you train at a really low learning rate 00:08:08.920 |
just for a little bit, and then pop up and do a few cycles, where the cycles are increasing 00:08:12.400 |
in length, and that's all done in a single function. 00:08:20.720 |
So the new one cycle we can now implement with, again, a single little function. 00:08:28.920 |
And so if we fit with that, we get this triangle followed by a little flatter bit, and the momentum has a matching inverse curve, going down and then back up. 00:08:45.400 |
And then here we've got a fixed momentum at the end. 00:08:48.840 |
So it's doing the momentum and the learning rate at the same time. 00:08:54.360 |
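Here's a hedged sketch of 1cycle expressed as phases -- learning rate up then down with momentum moving inversely, then a final low-LR phase. The momentum and momentum_decay parameter names are assumptions based on my reading of the notebook:

```python
def phases_1cycle(cyc_len, lr, div, pct, max_mom, min_mom):
    # Two linear ramps for the triangle, then a short annihilation phase.
    tri_len = cyc_len * (1 - pct) / 2
    return [
        TrainingPhase(epochs=tri_len, opt_fn=optim.SGD,
                      lr=(lr / div, lr), lr_decay=DecayType.LINEAR,
                      momentum=(max_mom, min_mom), momentum_decay=DecayType.LINEAR),
        TrainingPhase(epochs=tri_len, opt_fn=optim.SGD,
                      lr=(lr, lr / div), lr_decay=DecayType.LINEAR,
                      momentum=(min_mom, max_mom), momentum_decay=DecayType.LINEAR),
        TrainingPhase(epochs=cyc_len * pct, opt_fn=optim.SGD,
                      lr=(lr / div, lr / (div * 100)), lr_decay=DecayType.LINEAR,
                      momentum=max_mom),  # fixed momentum at the end
    ]

learn.fit_opt_sched(phases_1cycle(3, 1e-2, 10, 0.1, 0.95, 0.85))
```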
So something that I haven't tried yet that I think would be really interesting is to 00:09:02.160 |
combine this with different learning rates for different layers -- what we used to call differential learning rates; we've changed the name now to discriminative learning rates. 00:09:15.660 |
So a combination of discriminative learning rates and one cycle, no one's tried yet. 00:09:25.440 |
The only paper I've come across which has discriminative learning rates uses something 00:09:29.680 |
called LARS (L-A-R-S), and it was used to train ImageNet with very, very large batch sizes 00:09:38.440 |
by basically looking at the ratio between the norm of the gradients and the norm of the weights at each layer, and 00:09:46.600 |
using that to change the learning rate of each layer automatically; and they found that it worked really well. 00:09:54.920 |
That's the only other place I've seen this kind of approach used, but there's lots of 00:09:59.320 |
interesting things you could try with combining discriminative learning rates and different schedules. 00:10:06.680 |
So you can now write your own LR finder of different types, specifically because there's 00:10:11.440 |
now this stop_div parameter, which basically means that it'll use whatever schedule you 00:10:17.560 |
asked for, but when the loss gets too bad it'll stop training. 00:10:22.680 |
So here's one with learning rate versus loss and you can see it stops itself automatically. 00:10:37.700 |
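A sketch of rolling your own LR finder this way -- one phase sweeping the learning rate across a range, with stop_div halting on divergence. Whether an "exponential decay" can sweep upward like this depends on the implementation, so treat this as an assumption:

```python
# Sweep the learning rate from tiny to huge over one epoch; stop_div stops
# training as soon as the loss blows up, like the built-in LR finder.
phases = [TrainingPhase(epochs=1, opt_fn=optim.SGD, lr=(1e-5, 10),
                        lr_decay=DecayType.EXPONENTIAL)]
learn.fit_opt_sched(phases, stop_div=True)
learn.sched.plot()  # loss against learning rate
```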
One useful thing that's been added is the linear parameter to the plot function. 00:10:44.480 |
If you use a linear schedule rather than an exponential schedule in your learning rate 00:10:50.280 |
finder, which is a good idea if you've already fine-tuned into roughly the right area, then you can 00:10:56.520 |
use linear to find exactly the right area, and then you probably want to plot it with a linear scale. 00:11:01.880 |
So that's why you can also pass linear to plot now as well. 00:11:07.460 |
You can change the optimizer each phase, and that's more important than you might imagine, because 00:11:15.560 |
actually the current state-of-the-art for training on really large batch sizes really 00:11:22.200 |
quickly for ImageNet actually starts with RMSprop for the first bit, and then they switch to SGD. 00:11:31.680 |
And so that could be something interesting to experiment more with, because at least one paper has found it helpful. 00:11:41.720 |
And again it's something that isn't well appreciated as yet. 00:11:50.080 |
And then the bit I find most interesting is you can change your data. 00:11:56.080 |
Because you remember from lessons 1 and 2, you could use smaller images at the start of training. 00:12:02.720 |
And the theory is that you could use that to train the first bit more quickly with smaller images. 00:12:15.280 |
And remember, if you have half the height and half the width, you've got a quarter of 00:12:19.240 |
the activations in basically every layer, so it can be a lot faster. 00:12:28.360 |
So you can now create a couple of different datasets -- for example, in this case he's got 28 and 32 pixel versions. 00:12:34.440 |
This is just CIFAR-10, so there's only so much you can do. 00:12:37.460 |
And then if you pass in an array of data in this data_list parameter, when you call fit_opt_sched, 00:12:43.720 |
it'll use a different data set for each phase. 00:12:49.000 |
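A sketch of that, where get_data is a hypothetical helper returning a ModelData object at a given image size:

```python
# One dataset per phase: train on 28-pixel images first, then 32-pixel ones.
data_small = get_data(28, bs)  # get_data is hypothetical: a ModelData at this size
data_big   = get_data(32, bs)

phases = [
    TrainingPhase(epochs=1, opt_fn=optim.SGD, lr=1e-2),
    TrainingPhase(epochs=2, opt_fn=optim.SGD, lr=1e-3),
]
learn.fit_opt_sched(phases, data_list=[data_small, data_big])
```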
So that's really cool, because we can use that now, like we could use that in our DawnBench 00:12:53.080 |
entries and see what happens when we actually increase the size with very little code. 00:13:06.920 |
Well, the answer is here, in DawnBench training on ImageNet. 00:13:13.320 |
And you can see here that Google won this with half an hour on a cluster of TPUs. 00:13:23.320 |
The best non-TPU-cluster result is fast.ai plus students, under three hours, beating out Intel 00:13:34.440 |
who ran on 128 computers, whereas we ran on a single computer. 00:13:47.600 |
So using this approach we've shown the fastest GPU result, the fastest single machine result, 00:13:55.880 |
the fastest publicly available infrastructure result -- these TPU pods, you can't use unless you're Google. 00:14:04.320 |
And the cost is tiny: this Intel one cost them $1200 worth of compute, and they haven't even quoted a cost for the TPU pods. 00:14:12.200 |
That's what you get if you use 128 computers in parallel, each one with 36 cores, each 00:14:19.840 |
one with 140 GB compared to our single AWS instance. 00:14:26.080 |
So this is kind of a breakthrough in what we can do -- the idea that we can train ImageNet quickly on a single machine for about $72. 00:14:39.040 |
And this $72, by the way, it was actually $25 because we used a spot instance, so one 00:14:45.200 |
of our students, Andrew Shaw, built this whole system to allow us to throw a whole bunch 00:14:49.480 |
of spot instance experiments up and run them simultaneously, and pretty much automatically. 00:14:55.480 |
But Dawn Bench doesn't quote the actual number we used, so it's actually $25, not $72. 00:15:03.160 |
So this data list idea is super important and helpful. 00:15:16.000 |
And so our CIFAR-10 results are also now up there officially, and you might remember the 00:15:22.560 |
previous best was a bit over an hour, and the trick here was using one cycle, basically. 00:15:28.040 |
So all this stuff that's in Sylvain's training phase API is really all the stuff that we used to get these results. 00:15:35.400 |
And really cool, another fast.ai student, who goes by the name bkj here, has taken that and built on it. 00:15:47.360 |
He took ResNet18 and added the concat pooling that you might remember that we learned about 00:15:51.880 |
on top, and used Leslie Smith's one cycle, and so he's got on the leaderboard. 00:15:59.560 |
So the top three are all fast.ai students, which is wonderful. 00:16:11.160 |
So Brett ran this on Paperspace and got the cheapest result, just ahead of bkj. 00:16:26.640 |
So I think you can see a lot of the interesting opportunities at the moment for training stuff 00:16:34.480 |
more quickly and cheaply are all about the learning rate annealing, and size annealing, 00:16:39.560 |
like training with different parameters at different times, and I still think everybody's just scratching the surface. 00:16:44.720 |
I think we can go a lot faster and a lot cheaper. 00:16:48.520 |
And that's really helpful for people in resource-constrained environments, which is basically everybody outside the biggest companies. 00:17:04.000 |
And one of the things we looked at last week was just like creating a simpler architecture, 00:17:08.120 |
which is basically state of the art, like the really basic kind of DarkNet architecture. 00:17:13.680 |
But there's a piece of architecture we haven't talked about, which is necessary to understand the Inception network. 00:17:23.360 |
And the inception network is actually pretty interesting because they use some tricks to 00:17:31.080 |
actually make things more efficient, and we're not currently using these tricks, and I kind of think maybe we should be. 00:17:38.040 |
And so this is -- the most interesting, most successful Inception network is their Inception-ResNet-v2 00:17:42.880 |
network, and most of the blocks in that look something like this. 00:17:48.280 |
And it looks a lot like a standard ResNet block: there's an identity connection here, 00:17:53.120 |
and then there's a conv path here, and then we add them up together. But there are a couple of important differences. 00:18:06.100 |
The first is that this path is a 1 by 1 conv -- not just any old conv, but a 1 by 1 conv. 00:18:18.100 |
And so it's worth thinking about what a 1 by 1 conv actually is. 00:18:24.040 |
So a 1 by 1 conv is simply saying: for each grid cell in your input, you've got basically 00:18:31.960 |
a vector -- a 1 by 1 by number-of-filters tensor is basically a vector. 00:18:38.140 |
So for each grid cell in your input, you're just doing a dot product with that tensor. 00:18:45.520 |
And then of course it's going to be one of those vectors for each of the 192 activations 00:18:52.160 |
So you basically do 192 dot products with grid cell 1, 1, and then 192 with grid cell 00:18:57.800 |
1, 2 and 1, 3 and so forth, and so you'll end up with something which has got the same grid 00:19:03.640 |
size as the input and 192 channels in the output. 00:19:09.160 |
So that's a really good way to either reduce the dimensionality or increase the dimensionality of the channels without changing the grid size. 00:19:20.260 |
That's normally what we use 1 by 1 convs for. 00:19:25.020 |
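To see this concretely, here's a 1 by 1 conv in plain PyTorch:

```python
import torch
import torch.nn as nn

# For each grid cell, a 1x1 conv takes a dot product of the 192 input channels
# with each of 64 filters: a per-pixel linear map over channels that leaves
# the spatial grid untouched.
x = torch.randn(1, 192, 28, 28)             # batch, channels, height, width
conv1x1 = nn.Conv2d(192, 64, kernel_size=1)
print(conv1x1(x).shape)                     # torch.Size([1, 64, 28, 28])
```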
So here we've got a 1 by 1 conv, and then we've got another 1 by 1 conv, and then they're added together. 00:19:30.760 |
And then there's a third path and this third path is not added. 00:19:35.920 |
It's not explicitly labeled, but this third path is concatenated. 00:19:41.260 |
And so actually there is a form of ResNet which is basically identical to ResNet, but 00:19:51.920 |
where we concatenate instead of add. So it's just a ResNet where we do concat instead of plus. 00:19:56.300 |
And that's an interesting approach, because then the identity path is literally copied through to the output. 00:20:06.040 |
So you kind of get that flow through all the way through and so as we'll see next week 00:20:11.680 |
that tends to be good for like segmentation and stuff like that where you really want 00:20:16.200 |
to kind of keep the original pixels, and the first layer of features, and the second layer of features. 00:20:23.320 |
So concatenating rather than adding branches is a very useful thing to do. 00:20:31.300 |
And so here we're concatenating this branch and this branch is doing something interesting 00:20:36.000 |
which is it's doing first of all the 1 by 1 conv and then a 1 by 7 and then a 7 by 1. 00:20:46.440 |
So what's going on there is basically what we really want to do is do a 7 by 7 conv. 00:20:54.280 |
The reason we want to do a 7 by 7 conv is that if you've got multiple paths, each of 00:20:59.200 |
which has different kernel sizes, then it's able to look at different amounts of the image. 00:21:07.040 |
And so like the original inception network had like a 1 by 1, a 3 by 3, a 5 by 5, 7 by 00:21:12.160 |
7 kind of getting concatenated in together, something like that. 00:21:17.360 |
And so if we can have a 7 by 7 filter then we get to kind of look at a lot of the image 00:21:22.320 |
at once and create a really rich representation. 00:21:25.880 |
And so actually the stem of the inception network, that is the first few layers of the 00:21:33.080 |
inception network actually also use this kind of 7 by 7 conv because you start out with 00:21:40.280 |
224 by 224 by 3 and you want to turn it into something that's like 112 by 112 by 64. 00:21:48.880 |
And so by using a 7 by 7 conv you can get a lot of information in each one of those outputs 00:21:57.200 |
But the problem is that 7 by 7 conv is a lot of work. 00:22:05.000 |
We've got 49 kernel values to multiply with 49 inputs for every output pixel, across every channel. 00:22:19.480 |
You can kind of get away with it maybe for the very first layer and in fact the very 00:22:23.880 |
first layer, the very first conv of ResNet is a 7 by 7 conv. 00:22:31.400 |
But not so for Inception, for Inception they don't do a 7 by 7 conv. 00:22:37.000 |
Instead they do a 1 by 7 followed by a 7 by 1. 00:22:42.520 |
And so to explain, the basic idea of the inception networks, all of the different versions of 00:22:48.360 |
it, is that you have a number of separate paths which have different convolution widths. 00:22:55.200 |
In this case conceptually the idea is this is a 1 by 1 convolution width and this is 00:23:02.360 |
And so they're looking at different amounts of data and then we combine them together. 00:23:09.840 |
But we don't want to have a 7 by 7 conv throughout the network because it's just too computationally 00:23:19.440 |
But if you think about it, if we've got some input coming in and we have some big filter 00:23:28.920 |
that we want and it's too big to deal with, what could we do? 00:23:33.760 |
To make it a little bit easier to draw, let's do 5 by 5. 00:23:38.440 |
What we can do is to create two filters, one which is 1 by 5 and one which is 5 by 1 (or 1 by 7 and 7 by 1). 00:23:55.000 |
So we take our activations, the previous layer, and we put it through the 1 by 5. 00:24:03.040 |
We take the activations out of that and put them through the 5 by 1, and something comes out the other end. 00:24:12.520 |
Now, rather than thinking of it as two separate steps -- first we take the activations, then we put them 00:24:18.640 |
through the 1 by 5, then we put that through the 5 by 1 -- 00:24:25.200 |
what if instead we think of these two operations together, and ask what a 1 by 5 followed by a 5 by 1 does as a single combined operation? 00:24:40.800 |
And effectively, you could take a 1 by 5 and a 5 by 1, and the outer product of that is a 5 by 5 matrix. 00:24:56.980 |
You can't create every possible 5 by 5 matrix by taking that product, but there's a lot of 5 by 5 matrices that you can create that way. 00:25:07.620 |
And so the basic idea here is when you think about the order of operations, and I'm not 00:25:12.320 |
going to go into the detail of this, if you're interested in more of the theory here, you 00:25:16.640 |
should check out Rachel's Numerical Linear Algebra course, which is basically a whole course about this kind of thing. 00:25:25.000 |
But conceptually the idea is that very often the computation you want to do is actually 00:25:33.140 |
more simple than an entire 5 by 5 convolution. 00:25:39.320 |
Very often the term we use in linear algebra is that there's some lower rank approximation. 00:25:46.180 |
In other words, the 1 by 5 and 5 by 1 combined together, that 5 by 5 matrix is nearly as 00:25:52.520 |
good as the 5 by 5 matrix you ideally would have computed if you were able to. 00:26:00.080 |
And so this is very often the case in practice, just because the nature of the real world 00:26:07.760 |
is that the real world tends to have more structure than randomness. 00:26:17.400 |
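A quick numerical check of that outer-product idea:

```python
import torch

# The outer product of a 5x1 and a 1x5 kernel is a full 5x5 matrix, but one
# with only 10 free parameters instead of 25 -- a rank-1 approximation.
col = torch.randn(5, 1)
row = torch.randn(1, 5)
k = col @ row
print(k.shape)                          # torch.Size([5, 5])
print(torch.linalg.matrix_rank(k))      # tensor(1)
```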
So the cool thing is, if we replace our 7 by 7 conv with a 1 by 7 and a 7 by 1, then this 00:26:32.560 |
has 14 dot products to do, whereas this one has 49 to do. 00:26:49.440 |
So it's just going to be a lot faster, and we have to hope that it's going to be nearly 00:26:55.200 |
It's certainly capturing as much width of information by definition. 00:27:01.000 |
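Here's a small sketch comparing the two in PyTorch; the channel count of 64 is just illustrative:

```python
import torch.nn as nn

# A 7x7 conv factored into 1x7 then 7x1: 7 + 7 = 14 weights per position per
# filter instead of 49, while covering the same 7x7 receptive field.
factored = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=(1, 7), padding=(0, 3)),
    nn.Conv2d(64, 64, kernel_size=(7, 1), padding=(3, 0)),
)
full = nn.Conv2d(64, 64, kernel_size=7, padding=3)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(factored), count(full))  # roughly 14/49 the weights (plus biases)
```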
So if you're interested in learning more about this, specifically in the deep learning area, 00:27:07.680 |
you can look up factored convolutions. The idea came up three or four years ago now -- it's probably been around for 00:27:12.960 |
longer, but that was when I first saw it -- and it turned out to work really well, and the Inception network uses it throughout. 00:27:27.040 |
It's interesting actually, we've talked before about how we tend to say like there's this 00:27:34.640 |
main backbone, like when we have ResNet34 for example, we say there's this main backbone 00:27:43.400 |
And then we've talked about how we can add on to it a custom head. 00:27:47.860 |
And that tends to be like a max pooling layer and a fully connected layer. 00:27:55.500 |
It's actually kind of better to talk about the backbone as containing kind of two pieces. 00:28:01.560 |
One is the stem, and then the other is kind of the main backbone. 00:28:11.080 |
And the reason is that the thing that's coming in, remember it's only got three channels, 00:28:17.220 |
and so we want some sequence of operations that's going to expand that out into something 00:28:22.280 |
richer, generally something like 64 channels. 00:28:24.920 |
And so in ResNet, the stem is just super simple. 00:28:29.460 |
It's a 7x7 stride-2 conv, followed by a stride-2 max pool. 00:28:36.960 |
I think that's it, if memory serves correctly. 00:28:40.000 |
In Inception they have a much more complex stem with multiple paths, getting combined 00:28:45.320 |
and concatenated, including factored convs, 1x7 and 7x1. 00:28:56.080 |
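For reference, the ResNet stem as it appears in torchvision (from memory -- it also includes a batch norm and ReLU between the conv and the pool):

```python
import torch.nn as nn

# The ResNet stem: a stride-2 7x7 conv, batch norm, ReLU, then a stride-2
# max pool, taking a 3x224x224 input down to 64x56x56.
resnet_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)
```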
What would happen if you stuck a standard ResNet on top of an Inception stem, for instance? 00:29:03.880 |
I think that would be a really interesting thing to try, because an Inception stem is quite 00:29:11.600 |
a carefully engineered thing, and this question of how you take your three-channel input and turn it into something richer seems 00:29:17.840 |
really important, and all of that work seems to have been thrown away for ResNet. 00:29:24.420 |
But what if we put a DenseNet backbone on top of an Inception stem? 00:29:31.320 |
Or what if we replaced the 7x7 conv with a 1x7 7x1 factored conv in a standard ResNet? 00:29:44.080 |
So there's some more thoughts about potential research directions. 00:29:53.160 |
So that was kind of my little bunch of random stuff section. 00:30:00.560 |
Moving a little bit closer to the actual main topic of this, which is image enhancement. 00:30:11.840 |
I'm going to talk about a new paper briefly because it really connects what I just discussed 00:30:20.800 |
And the new paper -- well, it's not that new, maybe it's a year old. 00:30:26.120 |
It's a paper on progressive GANs, which came from NVIDIA. 00:30:32.880 |
And the progressive GANs paper is really neat. 00:30:41.760 |
Somebody asked whether a one-by-one conv is the same as what's called "network in network" in the literature. 00:30:49.520 |
Network in Network is more than just a one-by-one conv, 00:30:56.680 |
and I don't think there's any particular reason to look at that, that I'm aware of. 00:31:12.120 |
So the progressive GAN basically takes this idea of gradually increasing the image size. 00:31:21.320 |
It's the only other place I'm aware of where people have gradually increased the image size during training. 00:31:28.520 |
And it kind of surprises me because this paper is actually very popular and very well-known 00:31:34.560 |
And yet people haven't taken the basic idea of gradually increasing the image size and 00:31:37.880 |
use it anywhere else, which shows you the general level of creativity you can expect 00:31:42.880 |
to find in the deep learning research community, perhaps. 00:31:51.600 |
They start with a 4x4 GAN -- literally, they're trying to replicate 4x4 pixels -- and then 8x8. 00:32:02.720 |
So we're trying to recreate pictures of celebrities. 00:32:05.280 |
And then they go 16x16, and then 32, and then 64, and then 128, and then 256. 00:32:12.960 |
And one of the really nifty things they do is that as they increase size, they also add 00:32:19.000 |
more layers to the network, which kind of makes sense, because if you're doing a more 00:32:24.200 |
of a resnetty type thing, then you're spitting out something which hopefully makes sense 00:32:29.600 |
in each grid cell size, and so you should be able to layer stuff on top. 00:32:34.560 |
And they do another nifty thing where they add a skip connection when they do that, and 00:32:41.000 |
they gradually change a linear interpolation parameter that moves it more and more away 00:32:45.640 |
from the old 4x4 network and towards the new 8x8 network. 00:32:51.320 |
And then once they've totally moved it across, they throw away that extra connection. 00:32:56.080 |
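A minimal sketch of that fade-in trick (not the paper's exact code, just the idea):

```python
import torch.nn.functional as F

def fade_in(old_out, new_out, alpha):
    # Blend the upsampled output of the old, lower-resolution head with the
    # output of the newly added block; alpha ramps from 0 to 1 over training,
    # after which the old path is discarded.
    old_up = F.interpolate(old_out, scale_factor=2, mode='nearest')
    return (1 - alpha) * old_up + alpha * new_out
```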
So the details don't matter too much, but it uses the basic ideas we've talked about 00:33:00.840 |
gradually increasing the image size, skip connections and stuff. 00:33:05.200 |
But it's a great paper to study because it's one of these rare things where good engineers 00:33:12.320 |
actually built something that just works in a really sensible way. 00:33:15.320 |
It's not surprising, this actually comes from Nvidia themselves. 00:33:18.640 |
So Nvidia don't do a lot of papers, but it's interesting that when they do, they build 00:33:22.400 |
something that's so thoroughly practical and sensible. 00:33:26.120 |
And so I think it's a great paper to study if you want to put together lots of the different techniques we've learned. 00:33:36.220 |
And there aren't many re-implementations of this, so it's an interesting thing to try as a project, 00:33:43.320 |
and maybe you could build on it and find something else. 00:33:47.840 |
We eventually go up to 1024x1024, and you'll see that the images are not only getting higher resolution but better looking. 00:33:54.440 |
And so at 1024x1024, I'm going to see if you can guess which of the images on the next page are fake. 00:34:09.920 |
You go up, up, up, up, up, up, up, and then boom. 00:34:16.600 |
So like, GANs and stuff are getting crazy, and some of you may have seen this during the week. 00:34:29.200 |
Yeah, so this video just came out, and it's a speech by Barack Obama, so let's check it out. 00:34:40.080 |
So like Jordan Peele, this is a dangerous time. 00:34:48.040 |
Moving forward, we need to be more vigilant with what we trust from the internet. 00:34:51.960 |
It's a time when we need to rely on trusted news sources. 00:34:56.480 |
It may sound basic, but how do we move forward? 00:35:02.740 |
So as you can see, they've used this kind of technology to literally move Obama's face 00:35:13.200 |
in the way that Jordan Peele's face was moving. 00:35:17.800 |
You basically have all the techniques you need now to do that. 00:35:31.740 |
So this is the bit where we talk about what's most important, which is like, now that we 00:35:37.160 |
can do all this stuff, what should we be doing, and how do we think about that? 00:35:49.400 |
And the TL;DR version is: I actually don't know. 00:35:55.640 |
Actually, a lot of you saw the spaCy and Prodigy folks, the founders of Explosion 00:36:02.320 |
AI -- I did a talk with Matthew, and I went to dinner with them afterwards, and we basically 00:36:08.000 |
spent the entire evening talking, debating, arguing about what does it mean that companies 00:36:15.920 |
like ours are building tools that are democratizing access to tools that can be used in harmful 00:36:25.600 |
They're incredibly thoughtful people, and I wouldn't say we didn't agree -- we just couldn't come to any conclusions. 00:36:35.680 |
So I'm just going to lay out some of the questions and point to some of the research. 00:36:42.640 |
And when I say research, most of the actual literature review and putting this together was done by Rachel. 00:36:53.040 |
We start by saying the models we build are often pretty shitty in ways which are not 00:37:02.360 |
immediately apparent, and you won't know how shitty they are unless the people that are 00:37:07.900 |
building them with you are a range of people, and the people that are using them with you are a range of people too. 00:37:14.120 |
So for example, take the work of a couple of wonderful researchers, Joy Buolamwini and Timnit Gebru, who's at Stanford. 00:37:29.440 |
So Joy and Timnit did this really interesting research where they looked at some basically 00:37:36.120 |
off-the-shelf face recognizers, one from Face++ which is a huge Chinese company, IBM's and 00:37:42.960 |
Microsoft's, and they looked for a range of different face types. 00:37:48.960 |
And generally speaking, the Microsoft one in particular was incredibly accurate unless 00:37:53.680 |
the face type happened to be dark-skinned, when suddenly it did 25 times worse, getting it wrong much of the time. 00:38:05.800 |
And for somebody to, a big company like this, to release a product that for a very, very 00:38:15.360 |
large percentage of the world basically doesn't work, it's more than a technical failure, right? 00:38:22.520 |
It's a really deep failure of understanding what kind of team needs to be used to create 00:38:29.160 |
such a technology and to test such a technology, or even an understanding of who your customers are. 00:38:39.200 |
I was also going to add that the classifiers all did worse on women than on men. 00:38:47.520 |
It's funny, actually, Rachel tweeted about something like this the other day, and some 00:38:54.520 |
guy was like, "What's this all about, what are you saying, don't you know about, people 00:39:02.080 |
made cars for a long time, you're saying you don't need women to make cars too?" 00:39:05.800 |
And Rachel pointed out, "Well, actually, yes, for most of the history of car safety, women 00:39:12.880 |
in cars have been far, far more at risk of death than men in cars, because the men created 00:39:19.920 |
male-sized crash test dummies. 00:39:24.120 |
And so car safety was literally not tested on women-sized bodies. 00:39:28.540 |
So this kind of shitty product management, with a total failure of diversity and understanding, is nothing new. 00:39:36.280 |
And I was just going to say, that was comparing impacts of similar strength for men and women. 00:39:45.360 |
Whenever you say something on Twitter, Rachel has to say this, because any time you say 00:39:49.080 |
something like this on Twitter, there's like 10 people who'll be like, "Oh, you have to 00:39:52.120 |
compare all these other things," as if we didn't know that. 00:39:55.800 |
Other things our very best, most famous systems do, like Microsoft's Face Recognizer or Google's 00:40:08.920 |
Language Translator, you turn "she is a doctor, he is a nurse" into Turkish, and quite correctly, 00:40:15.320 |
both the pronouns become "o", because there are no gendered pronouns in Turkish. 00:40:20.320 |
Then go the other direction, translating those ungendered sentences back into English, and you get "he is a doctor, she is a nurse" -- the gender stereotypes come straight back. 00:40:32.160 |
So we've got these kind of biases built into tools that we're all using every day. 00:40:39.000 |
And again, people are like, "Oh, it's just showing us what's in the world, and well, 00:40:42.480 |
okay, there's lots of problems with that basic assertion," but as you know, machine learning 00:40:50.120 |
And so because they love to generalize -- this is one of the cool things about you guys knowing 00:40:54.420 |
the technical details now -- because they love to generalize, when you see something like 00:40:59.680 |
67% of people cooking are women in the pictures they used to build this model, and then you 00:41:04.620 |
actually run the model on a separate set of pictures, then 84% of the people they choose 00:41:10.840 |
as cooking are women rather than the correct 67%, which is like a really understandable 00:41:18.020 |
thing for an algorithm to do, is it took a biased input and created a more biased output 00:41:26.040 |
because for this particular loss function, that's kind of where it ended up. 00:41:31.280 |
And this is a really common kind of bias amplification. 00:41:44.960 |
It matters in ways more than just awkward translations, or like black people's photos 00:41:55.080 |
not being classified correctly -- or maybe there are some wins as well, like horrifying surveillance 00:42:02.260 |
everywhere maybe won't work on black people, I don't know. 00:42:05.880 |
Or it'll be even worse, because it's horrifying surveillance that's flat-out racist and wrong. 00:42:20.880 |
For all we say about human failings, there's a long history of civilization and societies 00:42:30.840 |
creating layers of human judgment which avoid hopefully the most horrible things happening. 00:42:38.080 |
And sometimes companies which love technology think, "let's throw away the humans and replace them with algorithms". 00:42:47.080 |
So two or three years ago, Facebook literally got rid of their human editors, like this 00:42:52.760 |
was in the news at the time, and they were replaced with algorithms. 00:42:56.720 |
And so now it's algorithms that put all the stuff on your newsfeed, with the human editors out of the loop. And a number of things happened as a result, 00:43:06.440 |
one of which was a massive, horrifying genocide in Myanmar. 00:43:13.240 |
Babies getting torn out of their mother's arms and thrown onto fires, mass rape, murder, 00:43:19.880 |
and an entire people exiled from their homeland. 00:43:26.240 |
I'm not going to say that was because Facebook did this, but what I will say is that when 00:43:33.360 |
the leaders of this horrifying project are interviewed, they regularly talk about how 00:43:41.280 |
everything they learned about the disgusting animal behaviors of Rohingya that need to 00:43:46.880 |
be thrown off the earth, they learned from Facebook. 00:43:50.760 |
Because the algorithms just want to feed you more stuff that gets you clicking. 00:43:56.000 |
And so if you get told these people that don't look like you and you don't know are bad people 00:44:01.040 |
and here's lots of stories about the bad people, and then you start clicking on them and then 00:44:04.720 |
they feed you more of those things, the next thing you know you have this extraordinary 00:44:12.520 |
So for example, we've been told a few times people click on our fast AI videos and then 00:44:18.840 |
the next thing recommended to them is conspiracy theory videos from Alex Jones, and then it goes on from there. 00:44:26.680 |
Because humans click on things that shock us and surprise us and horrify us. 00:44:34.240 |
And so at so many levels, this decision has had extraordinary consequences which we're only just starting to understand. 00:44:46.400 |
And again, this is not to say this particular consequence is because of this one thing, 00:44:51.840 |
but to say it's entirely unrelated would be clearly ignoring all of the evidence and information we have. 00:45:01.880 |
So this is really kind of the key takeaway: think about what you're building and how it could be used. 00:45:13.960 |
So lots and lots of effort now being put into face detection, including in our course, we've 00:45:24.000 |
been spending a lot of time thinking about how to recognize stuff and where it is. 00:45:29.120 |
And there's lots of good reasons to want to be good at that, for improving crop yields 00:45:34.680 |
in agriculture, for improving diagnostic and treatment planning in medicine, and so forth. 00:45:47.760 |
But it's also being widely used in surveillance and propaganda and disinformation, and again, 00:46:02.640 |
I don't exactly know what the answer is, but it's definitely at least important to be thinking about it, 00:46:09.000 |
talking about it; and sometimes you can do really good things. 00:46:14.520 |
For example, meetup.com did something which I would put in the category of really good 00:46:20.280 |
thing, which is they recognized early a potential problem, which is that more men were tending to go to their tech meetups. 00:46:32.200 |
And that was causing their collaborative filtering systems, which you're all familiar with building 00:46:38.280 |
now, to recommend more technical content to men. 00:46:44.760 |
And that was causing more men to go to more technical content, which is causing the recommendation 00:46:49.320 |
systems to suggest more technical content to men. 00:46:53.960 |
And this kind of runaway feedback loop is extremely common when we interface the algorithm 00:47:03.400 |
So what did meetup do? They intentionally made the decision to recommend more technical 00:47:10.600 |
content to women -- not because of some highfalutin idea about how the world should be, but just because that made sense. 00:47:26.760 |
There are women that want to go to tech meetups, but when you turn up to a tech meetup and 00:47:30.160 |
it's all men, then you don't go and it recommends more men, and so on and so forth. 00:47:36.520 |
So Meetup made a really strong product management decision here, which was to not just do what the algorithm said. 00:47:48.320 |
Most of these runaway feedback loops don't get handled this well -- for example in predictive policing, where algorithms 00:47:53.680 |
tell policemen where to go, which very often is more black neighborhoods, which end up 00:47:58.600 |
crawling with more policemen, which leads to more arrests, which causes the systems to tell 00:48:02.560 |
more policemen to go to more black neighborhoods, and so forth. 00:48:09.560 |
So this problem of algorithmic bias is now very widespread, and as algorithms become 00:48:20.960 |
more and more widely used for specific policy decisions, judicial decisions, day-to-day decisions 00:48:30.000 |
about who to give what offer to, this just keeps becoming a bigger problem. 00:48:40.960 |
And some of them are really things that the people involved in the product management 00:48:47.480 |
decision should have seen at the very start as not making sense and being unreasonable under any definition. 00:48:55.440 |
For example, this stuff that Rachel has pointed out -- these were questions that were 00:49:01.900 |
used to decide... Rachel, is this the sentencing guidelines? 00:49:08.360 |
This software is used for both pretrial -- so who is required to post bail, so these 00:49:13.760 |
are people that haven't even been convicted, as well as for sentencing and for who gets 00:49:18.200 |
parole, and this was upheld by the Wisconsin Supreme Court last year, despite all the flaws that had been pointed out. 00:49:25.480 |
So whether you have to stay in jail because you can't pay the bail and how long your sentence 00:49:31.920 |
is for and how long you stay in jail for depends on what your father did, whether your parents 00:49:38.680 |
stayed married, who your friends are, and where you live. 00:49:43.920 |
Now it turns out these algorithms are actually terribly, terribly bad, so some recent analysis 00:49:51.740 |
showed that they're basically worse than chance, but even if the companies building them were 00:49:56.480 |
confident and these were statistically accurate correlations, does anybody imagine there's 00:50:03.440 |
a world where it makes sense to decide what happens to you based on what your dad did? 00:50:14.480 |
So a lot of this stuff at the basic level is obviously unreasonable, and a lot of it 00:50:23.800 |
just fails in these ways, but you can see empirically that these runaway feedback loops 00:50:28.760 |
must have happened, and these overgeneralizations must have happened. 00:50:31.920 |
For example, these are the kind of cross tabs that anybody working in these fields, in any 00:50:37.760 |
field that's using algorithms, should be preparing. 00:50:41.160 |
So for prediction of likelihood of reoffending, for black versus white defendants, we can look at the error rates. 00:50:54.480 |
Of the people that were labeled high risk but didn't reoffend, 23.5% were white, 00:51:04.760 |
but about twice that proportion were African American; whereas of those that were labeled lower risk but did 00:51:11.640 |
reoffend, it was about half of the white people but only 28% of the African Americans. 00:51:19.600 |
So this is the kind of stuff where at least if you're taking the technologies we've been 00:51:25.240 |
talking about and putting them into production in any way, or building an API for other people, 00:51:33.240 |
or providing training for people, or whatever, then at least make sure that what you're doing 00:51:42.360 |
can be tracked in a way that people know what's going on, so at least they're informed. 00:51:49.480 |
I think it's a mistake in my opinion to assume that people are evil and trying to break society. 00:52:00.960 |
I prefer to start with an assumption of if people are doing dumb stuff it's because they 00:52:07.600 |
don't know better, so at least make sure that they have this information. 00:52:12.240 |
And I find very few ML practitioners thinking about what the information is that they should be providing alongside their work. 00:52:21.560 |
And then often I'll talk to data scientists who will say, "Oh, the stuff I'm working on doesn't really impact people". 00:52:30.680 |
Really? Like, are there a number of people who think that what they're doing is entirely pointless? 00:52:37.400 |
People are paying you to do it for a reason, it's going to impact people in some way. 00:52:46.360 |
The other thing I know is a lot of people involved here are hiring people. 00:52:50.480 |
And so if you're hiring people, I guess you're all very familiar with the fast.ai philosophy by now. 00:52:57.600 |
And I think it comes back to this idea that I don't think people on the whole are evil, 00:53:02.800 |
I think they need to be informed and have tools. 00:53:07.440 |
So we're trying to give as many people the tools as possible that they need, and particularly 00:53:12.960 |
we're trying to put those tools in the hands of a more diverse range of people. 00:53:18.520 |
So if you're involved in hiring decisions, perhaps you can keep this kind of philosophy 00:53:25.280 |
If you're not just hiring a wider range of people, but also promoting a wider range of 00:53:33.360 |
people and providing really appropriate career management for a wider range of people, apart 00:53:39.600 |
from anything else, your company will do better. 00:53:44.780 |
It actually turns out that more diverse teams are more creative and tend to solve problems 00:53:50.160 |
more quickly and better than less diverse teams. 00:53:53.480 |
But also you might avoid these awful screw-ups which at one level are bad for the world, 00:54:02.600 |
and at another level if you ever get found out they can also destroy your company. 00:54:09.160 |
Also they can destroy you, or at least make you look pretty bad in history. Here are a couple of examples. 00:54:18.660 |
One goes right back to the Second World War: IBM basically provided all of the infrastructure the Nazis used to track the Holocaust. 00:54:33.720 |
These were the forms that they used, and they had different codes -- Jews were 8 and Gypsies 00:55:40.240 |
were 12, death in the gas chambers was 6 -- and they all went on these punch cards. 00:55:44.880 |
You can go and look at these punch cards in museums now. 00:54:49.320 |
This has actually been reviewed by a Swiss judge who said that IBM's technical assistance 00:54:55.920 |
facilitated the task of the Nazis in the commission of the crimes against humanity. 00:55:03.600 |
It's interesting to read back the history from these times to see what was going through people's minds. 00:55:14.160 |
What was clearly going through their minds was the opportunity to show technical superiority, 00:55:18.720 |
the opportunity to test out their new systems, and of course the extraordinary amount of money to be made. 00:55:32.680 |
When you do something which at some point down the line turns out to be a problem, even 00:55:39.200 |
if you were told to do it, that can turn out to be a problem for you personally. 00:55:44.280 |
For example, you'll remember the diesel emissions scandal at VW. Who was the one guy that went 00:55:56.480 |
to jail? It was the engineer, who was just doing what he was told. So if all of this stuff about actually not fucking up the world isn't enough to convince you, then consider this. 00:56:05.880 |
So if you do something that turns out to cause problems, even though somebody told you to 00:56:11.800 |
do it, you can absolutely be held criminally responsible. 00:56:17.640 |
And you'll certainly know about Kogan -- I think a lot of people now know the name Alexander 00:56:23.480 |
Kogan, he was the guy that handed over the Cambridge Analytica data. 00:56:29.080 |
He's a Cambridge academic, now a very famous Cambridge academic the world over for doing 00:56:36.680 |
his part to destroy the foundations of democracy. 00:56:39.960 |
So this is probably not how we want to go down in history. 00:56:54.360 |
In one of your tweets, you said dropout is patented. 00:56:57.120 |
I think this is about WaveNet patent from Google. 00:57:01.560 |
Can you please share more insight on this subject? 00:57:03.880 |
Does it mean that we'll have to pay to use dropout in the future? 00:57:15.920 |
The question before the break was about patents. 00:57:26.760 |
So I guess the reason it's coming up was because I wrote a tweet this week, which I think was 00:57:33.160 |
like three words, and said dropout is patented. 00:57:45.720 |
You know, invention is all about patents, blah blah blah, right? 00:57:55.360 |
The amount of things that are patentable that we talk about every week would be dozens. 00:58:01.440 |
Like it's so easy to come up with a little tweak and then if you turn that into a patent 00:58:09.000 |
you stop everybody from using that little tweak for the next 14 years and you end up 00:58:12.400 |
with a situation we have now where everything is patented in 50 different ways and so then 00:58:19.360 |
you get these patent trolls who have made a very very good business out of basically 00:58:24.480 |
buying lots of shitty little patents and then suing anybody who accidentally turned out 00:58:30.560 |
did that thing, like putting rounded corners on buttons. 00:58:36.360 |
So what does it mean for us that a lot of the stuff we use every day is patented? 00:58:58.680 |
One of the main people doing this is Google, and people from Google who reply to this point 00:59:05.720 |
tend to assume that Google is doing it because they want to have it defensively, so if somebody 00:59:11.280 |
sues them they'll be like don't sue us, we'll sue you back because we have all these patents. 00:59:17.600 |
The problem is that, as far as I know, they haven't signed what's called a defensive patent pledge. 00:59:23.280 |
So basically you can sign a legally binding document that says our patent portfolio will 00:59:28.040 |
only be used in defense and not offense, and even if you believe all the management of 00:59:33.040 |
Google would never turn into a patent troll, you've got to remember that management changes. 00:59:41.240 |
To give a specific example, I know the somewhat recent CFO of Google has a much more aggressive 00:59:50.400 |
stance towards the P&L and I don't know, maybe she might decide that they should start monetizing 00:59:57.000 |
their patents, or maybe the group that made that patent might get spun off and then sold 01:00:03.280 |
to another company that might end up in private equity hands and decide to monetize the patents. 01:00:12.780 |
There has been a big shift legally recently away from software patents actually having 01:00:19.880 |
any legal standing, so it's possible that these all end up thrown out of court, but 01:00:25.680 |
the reality is that anything but a big company is unlikely to have the financial ability 01:00:30.920 |
to defend themselves against one of these huge patent trolls. 01:00:42.680 |
You can't avoid using patented stuff if you write code. 01:00:48.800 |
I wouldn't be surprised if most lines of code you write have patents on them. 01:00:54.280 |
So actually, funnily enough, the best thing to do is not to study the patents, because 01:01:00.800 |
if you do and you infringe knowingly, the penalties are worse. 01:01:09.080 |
The best thing to do is to put your hands in your ears, sing a song, and get back to 01:01:17.120 |
So that thing I said about dropout being patented? Forget I said that. You skipped that bit. 01:01:33.720 |
We're going to go a bit retro here, because this is actually the original artistic style paper. 01:01:41.160 |
There's been a lot of updates to it, a lot of different approaches. 01:01:45.680 |
And I actually think, in many ways, the original is the best. 01:01:50.520 |
We're going to look at some of the newer approaches as well, but I actually think the original 01:01:56.200 |
is a terrific way to do it, even with everything that's gone since. 01:02:13.680 |
So the idea here is that we want to take a photo of this bird, and we want to create 01:02:24.940 |
a painting that looks like Van Gogh painted the picture of the bird. 01:02:37.640 |
Quite a bit of the stuff that I'm doing, by the way, uses ImageNet. 01:02:41.200 |
You don't have to download the whole of ImageNet for any of the things I'm doing. 01:02:44.400 |
There's an ImageNet sample on files.fast.ai/data, which has a couple of gigs, and it should be 01:02:52.320 |
plenty good enough for everything we're doing. 01:02:54.680 |
If you want to get really great results, you can grab ImageNet. 01:03:00.160 |
On Kaggle, the ImageNet localization competition actually contains all of the classification data as well. 01:03:11.080 |
So if you've got room, it's good to have a copy of ImageNet, because it comes in handy for all sorts of things. 01:03:18.680 |
So I just grabbed a bird out of my ImageNet folder, and there is my bird. 01:03:28.480 |
What I'm going to do is I'm going to start with this picture, and I'm going to try and 01:03:35.160 |
make it more and more like a picture of this bird painted by Van Gogh. 01:03:45.200 |
And the way I do that is actually very simple. 01:03:53.160 |
We will create a loss function, which we'll call f, and the loss function is going to 01:04:04.480 |
take as input a picture, and spit out as output a value, and the value will be lower if the 01:04:18.440 |
image looks more like the bird photo painted by Van Gogh. 01:04:30.260 |
Having written that loss function, we will then use PyTorch's gradients and optimizers as usual -- 01:04:36.640 |
gradient times the learning rate -- but we're not going to update any weights; we're going 01:04:50.040 |
to update the pixels of the input image to make it a little bit more like a picture of this bird painted by Van Gogh. 01:05:01.800 |
And we'll stick it through the loss function again to get more gradients, and do it again and again. 01:05:10.320 |
It's identical to how we solve every problem. 01:05:19.540 |
Create a loss function, use it to get some gradients, multiply them by a learning rate to 01:05:23.240 |
update something. Before, we've always updated the weights in a model, but today we're not going to do that. 01:05:31.240 |
We're going to update the pixels of the input, but it's no different at all. 01:05:38.720 |
We're just taking the gradient with respect to the input, rather than with respect to 01:05:51.800 |
Let's mention here that there's going to be two more inputs to our loss function. 01:05:58.820 |
One is the picture of the bird, which looks like this. 01:06:07.400 |
And the second is an artwork by Van Gogh, which looks like this. 01:06:16.720 |
By having those as inputs as well, that means we'll be able to re-run the function later 01:06:31.520 |
to make it look like a bird painted by Monet or a jumbo jet painted by Van Gogh or whatever. 01:06:45.280 |
And so initially, as we discussed, our input here is going to be random -- this is the first time I've started with a random input. 01:06:54.760 |
So we start with some random noise, use the loss function, get the gradients, make it 01:07:05.320 |
a little bit more like a bird painted by Van Gogh and so forth. 01:07:09.680 |
So the only outstanding question which I guess we can talk about briefly is how we calculate 01:07:18.320 |
how much our image looks like a bird, this bird, painted by Van Gogh. 01:07:28.880 |
Let's put part of it into a function called the content_loss, and that's going to return a value that's 01:07:47.520 |
lower if the image looks more like the bird -- not just any bird, the specific bird that we have coming in. 01:07:56.140 |
And then let's also create something called the style_loss, and that's going to be a lower 01:08:02.440 |
number if the image is more like Van Gogh's style. 01:08:22.300 |
So there's one way to do the content_loss which is very simple. 01:08:27.320 |
We could look at the pixels of the output, compare them to the pixels of the bird, and 01:08:36.940 |
do a mean squared error. If we did that and ran it for a while, eventually our image would turn into an image of the bird. 01:08:47.740 |
You should try this as an exercise: try to use an optimizer in PyTorch to start with 01:08:53.640 |
a random image, and turn it into another image by using mean squared error pixel loss. 01:09:00.400 |
Not terribly exciting, but that would be step 1. 01:09:06.180 |
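A minimal sketch of that exercise (using Adam and a random stand-in target; the lesson's own code may differ):

```python
import torch
import torch.nn.functional as F

target = torch.rand(1, 3, 288, 288)                    # stand-in for the bird photo
x = torch.randn(1, 3, 288, 288, requires_grad=True)    # the pixels we optimize

opt = torch.optim.Adam([x], lr=0.05)                   # optimizing pixels, not weights
for i in range(500):
    opt.zero_grad()
    loss = F.mse_loss(x, target)
    loss.backward()                                    # gradients w.r.t. the pixels
    opt.step()
# x converges toward an exact copy of the target image
```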
The problem is, even if we already had a style_loss function working beautifully, and then presumably 01:09:14.000 |
what we're going to do is we're going to add these two together, and then one of them will 01:09:27.160 |
be multiplied by some lambda -- some number we'll pick to adjust how much style versus how much content. 01:09:31.320 |
So assuming we had a style_loss, or we had picked some sensible lambda, if we used a 01:09:35.600 |
pixel-wise content_loss, then anything that makes it look more like Van Gogh and less 01:09:41.580 |
like the exact photo -- the exact background, the exact contrast, lighting, everything -- will 01:09:48.280 |
increase the content loss, which is not what we want. 01:09:53.000 |
We want it to look like the bird, but not in the same way. 01:10:00.320 |
It's still going to have the same two eyes in the same place, and be the same kind of 01:10:03.960 |
shape and so forth, but not the same representation. 01:10:09.480 |
So what we're going to do is -- this is going to shock you -- we're going to use a neural 01:10:14.920 |
network. 01:10:19.760 |
I totally meant that to be black and it came out green. 01:10:28.320 |
And we're going to use the VGG network, because that's what I used last year and I 01:10:33.320 |
didn't have time to see if other things worked, so you can try that yourself during the week. 01:10:40.000 |
And the VGG network is something which takes in an input and sticks it through a number of layers. 01:10:52.760 |
And I'm just going to treat these as just the convolutional layers. 01:10:55.500 |
There's obviously ReLUs there, and if it's a VGG with batch norm, which most are today, there are batch norm layers too. 01:11:05.160 |
And there's max pooling and so forth, but that's fine. 01:11:09.320 |
What we could do is we could take one of these convolutional activations, and then rather 01:11:20.260 |
than comparing the pixels of this bird, we could instead compare the VGG layer 5 activations 01:11:32.760 |
of this to the VGG layer 5 activations of our original bird, or layer 6, layer 7 or whatever. 01:11:44.160 |
Well for one thing, it wouldn't be the same bird. 01:11:47.760 |
It wouldn't be exactly the same, because we're not checking the pixels, we're checking some 01:11:53.480 |
And so what do those later sets of activations contain? 01:11:57.280 |
Well assuming that after some max pooling they contain a smaller grid, so it's less specific 01:12:03.760 |
about where things are, and rather than containing pixel color values, they're more like semantic 01:12:09.680 |
things like, is this kind of like an eyeball, or is this kind of furry, or is this kind 01:12:15.380 |
of bright, or is this kind of reflective, or is this laying flat, or whatever. 01:12:22.000 |
So we would hope that there's some level of semantic features through those layers, where 01:12:29.660 |
any picture that matches those activations 01:12:38.040 |
would look like the bird, but wouldn't be the same representation of the bird. 01:12:49.340 |
People generally call this a perceptual loss, because it's really important in deep learning 01:12:56.000 |
that you always create a new name for every obvious thing you do. 01:13:00.280 |
So if you compare two activations together, you're doing a perceptual loss. 01:13:09.780 |
Our content loss is going to be a perceptual loss, and then we'll do the style loss later. 01:13:13.200 |
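So the content loss, sketched out -- vgg_features here is an assumed name for whatever truncated VGG you choose, and this is not the notebook's exact code:

```python
import torch
import torch.nn.functional as F

def content_loss(x, target_img, vgg_features):
    # Compare mid-layer VGG activations of the candidate image against those
    # of the target photo; vgg_features is VGG cut at some mid-to-late layer.
    with torch.no_grad():
        target_acts = vgg_features(target_img)  # fixed: the target never changes
    return F.mse_loss(vgg_features(x), target_acts)
```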
So let's start by trying to create a bird that initially is random noise, and we're 01:13:22.080 |
going to use perceptual loss to create something that is bird-like, but it's not this bird. 01:13:31.380 |
So let's start by saying we're going to do 288 by 288. 01:13:36.880 |
Because we're only going to do one bird, there's going to be no GPU memory problems. 01:13:42.120 |
So I was actually a bit disappointed when I realized that I had picked a rather small input image. 01:13:45.920 |
It would be fun to try this with something much bigger to create a really grand scale piece. 01:13:52.440 |
The other thing to remember is if you were productionizing this, you could do a whole batch at a time. 01:13:59.780 |
People sometimes complain about this approach (Gatys is the lead author, so it's called the Gatys style transfer) as being too slow. 01:14:07.680 |
But it takes a few seconds, and you can do a whole batch in those few seconds. 01:14:13.760 |
So we're going to stick it through some transforms as per usual, transforms for a VGG16 model. 01:14:18.820 |
And so remember, the transform class has a dunder call method, so we can treat it as 01:14:29.680 |
So if you pass an image into that, then we get the transformed image. 01:14:35.280 |
Try not to treat the fastai and PyTorch infrastructure as a black box, because it's all designed 01:14:45.800 |
So this idea that transforms are just callables, i.e. things that you can call with parentheses, 01:14:52.360 |
comes from PyTorch, and we totally plagiarized the idea. 01:14:56.360 |
So with TorchVision or with fastai, your transforms are just callables. 01:15:02.900 |
The whole pipeline of transforms is just a callable. 01:15:06.000 |
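Here's the same idea in plain torchvision, as a hedged sketch; the file name is a placeholder, and the mean/std are the standard ImageNet stats.

```python
from torchvision import transforms
from PIL import Image

# a Compose pipeline is itself a callable, just like each individual transform
tfm = transforms.Compose([
    transforms.Resize(288),
    transforms.CenterCrop(288),
    transforms.ToTensor(),                       # HWC uint8 -> CHW float in [0,1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
img_t = tfm(Image.open('bird.jpg'))              # calling it runs __call__
```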
So now we have something of 3x288x288, because PyTorch likes the channel to be first, and 01:15:12.400 |
as you can see it's been turned into a square for us, it's been normalized to (0,1), all that usual stuff. 01:15:28.400 |
Trying to turn this into a picture of anything is actually really hard. 01:15:33.160 |
I found it very difficult to actually get an optimizer to get reasonable gradients that went anywhere at all. 01:15:40.560 |
And just as I thought I was going to run out of time for this class and really embarrass 01:15:44.360 |
myself, I realized the key issue is that pictures don't look like this, they have more smoothness. 01:15:54.580 |
So I turned this into this by just blurring it a little bit. 01:15:59.480 |
I used a median filter, basically it's like a median pooling effectively. 01:16:08.240 |
As soon as I changed it from this to this, it immediately started training really well. 01:16:12.680 |
So the number of little tweaks you have to do to get these things to work is kind of amazing. 01:16:21.220 |
So we start with a random image which is at least somewhat smooth. 01:16:32.800 |
I found that my bird image had a standard deviation of pixels that was about half of 01:16:38.320 |
this one, so I divided it by 2, just trying to make it a little bit easier for it to match. 01:16:46.760 |
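A sketch of that starting image, assuming scipy is available; the filter size of 8 is the kind of value used here, not anything principled.

```python
import numpy as np
from scipy.ndimage import median_filter

opt_img = np.random.uniform(0, 1, (288, 288, 3)).astype(np.float32)
opt_img = median_filter(opt_img, size=(8, 8, 1))  # blur so it's at least somewhat smooth
opt_img = opt_img / 2                             # halve it to make it easier to match
```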
Turn that into a variable because this image, remember, we're going to be modifying those 01:16:55.160 |
So anything that's involved in the loss function needs to be a variable, and specifically it 01:17:00.040 |
requires a gradient because we're actually updating the image. 01:17:07.760 |
So we now have a mini-batch of 1, 3 channels, 288 by 288, random noise. 01:17:17.040 |
We're going to use for no particular reason the 37th layer of VGG. 01:17:21.720 |
If you print out the VGG network, you can just type in m_VGG and print it out. 01:17:27.260 |
You'll see that this is a mid to late stage layer. 01:17:32.880 |
So we can just grab the first 37 layers and turn it into a sequential model, so now we've 01:17:39.460 |
got a subset of VGG that will spit out some mid-layer activations. 01:17:48.860 |
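A sketch of that slicing, assuming the batch-norm variant of VGG-16 from torchvision (the plain variant has fewer layers, so index 37 only makes sense with batch norm):

```python
import torch.nn as nn
from torchvision import models

vgg = models.vgg16_bn(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)          # we'll optimize pixels, not these weights

m_vgg = nn.Sequential(*list(vgg.children())[:37])   # spits out mid-layer activations
```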
So we can take our actual bird image, and we want to create a mini-batch of 1. 01:17:56.380 |
So remember if you slice in numpy with None, also known as np.newaxis, it introduces a new unit axis. 01:18:10.060 |
So here I want to create an axis of size 1 to say this is a mini-batch of size 1, so 01:18:15.920 |
slicing with None, just like I did here, to get this 1 unit axis at the front. 01:18:28.360 |
And this one doesn't need to be updated, so we use VV to say you don't need gradients for it. 01:18:37.400 |
And so that's going to give us our target activations. 01:18:42.600 |
So we've basically taken our bird image, turned it into a variable, stuck it through our model 01:18:49.280 |
to grab the 37th layer activations, and that's our target; that's what we want our content loss to match. 01:19:03.520 |
We'll go back to the details of this in a moment, but we're going to create an optimizer, 01:19:07.280 |
and we're going to step a bunch of times, zeroing the gradients, calling some loss function, and stepping the optimizer. 01:19:20.680 |
So that's the high-level version, and I'm going to come back to the details in a moment. 01:19:29.360 |
But the key thing is that to the loss function we're passing in that randomly generated image, 01:19:35.600 |
the optimization image, or actually the variable of it. 01:19:43.560 |
And so it's going to update this image using the loss function, and the loss function is the 01:19:48.760 |
mean squared error loss, comparing our current optimization image, passed through our VGG to 01:19:56.360 |
get the intermediate activations, and comparing it to our target activations. 01:20:04.120 |
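As a sketch, that loss function is just a couple of lines, assuming m_vgg is the truncated model above and targ_v holds the saved 37th-layer target activations (the reason for the factor of a thousand comes up shortly):

```python
import torch.nn.functional as F

def actn_loss(x):
    # MSE between the current image's 37th-layer activations and the target's
    return F.mse_loss(m_vgg(x), targ_v) * 1000
```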
And we'll run that a bunch of times, and we'll print it out, and we have our bird, but not 01:20:12.160 |
the representation of the bird, so there it is. 01:20:27.920 |
Anybody who's done certain parts of math 01:20:34.280 |
and computer science courses comes into deep learning, discovers we use all this stuff 01:20:41.080 |
like Adam and SGD, and always assumes that nobody in the field knows the first thing about computer 01:20:48.560 |
science, and immediately says, "Oh, have any of you guys tried using BFGS?" 01:20:55.360 |
There's basically a long history of a totally different kind of algorithm for optimization that deep learning largely doesn't use. 01:21:03.480 |
And of course the answer is actually the people who have spent decades studying neural networks 01:21:07.560 |
do know a thing or two about computer science, and it turns out these techniques don't work very well. 01:21:13.280 |
But it's actually going to work well for this, and it's a good opportunity to talk about 01:21:16.800 |
an interesting algorithm for those of you that haven't studied this type of optimization before. 01:21:31.520 |
I can't remember exactly what the letters stand for; anyway, the initials are four different people (Broyden, Fletcher, Goldfarb, and Shanno). 01:21:37.160 |
The L stands for limited memory, so without the L it's really just called BFGS. 01:21:45.160 |
So as an optimizer, it means that there's some loss function, and it's going to use 01:21:49.440 |
some gradients to -- not all optimizers use gradients, but all the ones we use do -- use 01:21:54.520 |
gradients to find a direction to go and try to make the loss function go lower and lower 01:22:05.760 |
But it's an interesting kind of optimizer because it does a bit more work than the ones we're used to. 01:22:36.840 |
So the way it works is it starts the same way that we're used to, which is we just kind 01:22:41.000 |
of pick somewhere to get started, and in this case we pick a random image, as you saw. 01:22:59.720 |
But we don't just take a step; what we actually do is, as well as finding the gradient, we also try to find the second derivative. 01:23:08.600 |
So the second derivative says how fast does the gradient change? 01:23:12.920 |
So the gradient is how fast does the function change, the second derivative is how fast 01:23:18.880 |
And the basic idea is that if you know that it's not very curvy, then you can probably jump further. 01:23:31.760 |
But if it is very curvy, then you probably don't want to jump as far. 01:23:36.520 |
And so in higher dimensions, the gradient's called the Jacobian, and the second derivative's called the Hessian. 01:23:42.040 |
You'll see those words all the time, but that's all they mean. 01:23:45.240 |
Again, mathematicians have to invent new words for everything as well. 01:23:48.880 |
They're just like deep learning researchers, except maybe a bit more snooty. 01:23:56.160 |
So with BFGS, we're going to try and calculate the second derivative, and then we're going 01:24:03.880 |
to use that to figure out what direction to go and how far to go. 01:24:11.200 |
So it's less of a wild jump into the unknown. 01:24:16.260 |
Now the problem is that actually calculating the Hessian, the second derivative, is almost 01:24:22.240 |
certainly not a good idea, because in each possible direction that you can head, for 01:24:28.120 |
each direction that you're measuring the gradient in, you also have to calculate the Hessian in every one of those directions, which is enormous. 01:24:38.640 |
So rather than actually calculating it, we take a few steps and we basically look at 01:24:44.800 |
how much the gradient's changing as we do each step, and we approximate the Hessian 01:24:54.880 |
And again, this seems like a really obvious thing to do, but nobody thought of it until surprisingly recently. 01:25:04.920 |
Keeping track of every single step you take takes a lot of memory. 01:25:09.960 |
So don't keep track of every step you take, just keep the last 10 or 20. 01:25:16.000 |
And the second bit there, that's the L in L-BFGS. 01:25:20.240 |
So a limited memory BFGS means keep the last 10 or 20 gradients, use that to approximate 01:25:27.960 |
the amount of curvature, and then use the curvature and gradient to estimate what direction to travel and how far. 01:25:38.520 |
And so that's normally not a good idea in deep learning for a number of reasons. 01:25:43.120 |
It's obviously more work to do than an Adam or an SGD update, and obviously more memory. 01:25:51.000 |
Memory is much more of a big issue when you've got a GPU to store it on and hundreds of millions of weights. 01:25:56.440 |
But more importantly, the mini-batches are super bumpy. 01:26:00.760 |
So figuring out curvature to decide exactly how far to travel is kind of polishing turds 01:26:07.840 |
Is that an American expression or just an Australian thing? 01:26:20.320 |
But also, interestingly, using the second derivative information, it turns out it's like a magnet for saddle points. 01:26:29.100 |
So there's some interesting theoretical results that basically say it actually sends you towards 01:26:34.640 |
nasty flat areas of the function if you use second derivative information. 01:26:40.660 |
But in this case, we're not optimizing weights. 01:26:43.240 |
We're optimizing pixels, so all the rules change. 01:26:46.800 |
And actually it turns out L-BFGS does make sense. 01:26:51.200 |
And because it does more work each time, it's a different kind of optimizer, and the API is a bit different. 01:26:58.360 |
As you can see here, when you say optimizer.step, you actually pass in the loss function. 01:27:06.900 |
And so my loss function is to call step with a particular loss function, which is my activation loss. 01:27:16.920 |
And as you can see, inside the loop, you don't say step, step, step, but rather it looks like this. 01:27:25.760 |
And you're welcome to try and rewrite this to use SGD, it'll still work, it'll just take longer. 01:27:31.880 |
I haven't tried it with SGD, I'd be interested to know how much longer it takes. 01:27:39.400 |
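A sketch of that loop, assuming opt_img_v is the image variable created with requires_grad=True and actn_loss is the loss above; the learning rate and iteration count are illustrative:

```python
import torch

optimizer = torch.optim.LBFGS([opt_img_v], lr=0.5)   # the pixels are the parameters

def step(loss_fn):
    def closure():
        optimizer.zero_grad()
        loss = loss_fn(opt_img_v)
        loss.backward()
        return loss
    # L-BFGS may re-evaluate the loss several times per step, hence the closure
    return optimizer.step(closure)

for i in range(10):
    step(actn_loss)
```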
So you can see the loss function going down, the mean squared error between the activations 01:27:46.560 |
at layer 37 of our VGG model for our optimized image versus the target activations, and remember 01:27:55.900 |
the target activations were the VGG applied to our bird. 01:28:13.320 |
Now one thing I'll say about this content loss is we don't know which layer is going 01:28:20.320 |
to work best, so it would be nice if we were able to experiment a little bit more easily. 01:28:30.480 |
So rather than like lopping off all of the layers after the one we want, wouldn't it 01:28:37.000 |
be nice if we could somehow grab the activations of a few layers as it calculates? 01:28:47.560 |
Back when we did SSD, we actually wrote our own network which had a number of outputs. 01:28:56.360 |
From the different convolutional layers, we spat out a different OutConv thing. 01:29:02.400 |
But I don't really want to go and add that to the TorchVision ResNet model, especially 01:29:08.240 |
not if later on I want to try the TorchVision VGG model, and then I want to try a NASNet model. 01:29:14.760 |
I don't want to go into all of them and change their outputs, besides which I'd like to easily 01:29:20.180 |
be able to turn certain activations on and off on demand. 01:29:24.320 |
So we've briefly touched before on this idea that PyTorch has these fantastic things called hooks. 01:29:30.520 |
You can have forward hooks that let you plug anything you like into the forward path of 01:29:37.560 |
a calculation, or a backward hook that lets you plug anything you like into the backward 01:29:43.080 |
So we're going to create the world's simplest forward hook. 01:29:47.680 |
This is one of these things that almost nobody knows about, so like almost any code you find 01:29:52.920 |
on the internet that implements style transfer will have all kinds of horrible hacks rather than using hooks. 01:30:05.440 |
So to create a forward hook, you just create a class, and the class has to have something called a hook function. 01:30:15.320 |
And your hook function is going to receive the module that you've hooked, it's going 01:30:21.000 |
to receive the input for the forward pass, and it's going to receive the output. 01:30:28.040 |
So what I'm going to do is I'm just going to store the output of this module in some attribute. 01:30:42.680 |
So this can actually be called anything you like, but hook function seems to be the standard. 01:30:46.240 |
You can see what happens here in the constructor is I store inside some attribute the result 01:30:52.320 |
of -- this is going to be the layer that I'm going to hook -- you go module.register_forward_hook 01:30:59.480 |
and pass in the function that you want to be called when this module's forward method is called. 01:31:08.040 |
So when its forward method is called, it will call self.hook_fn, which will store the output. 01:31:22.200 |
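The class looks roughly like this (this matches the course notebook's SaveFeatures):

```python
class SaveFeatures():
    features = None
    def __init__(self, m):
        # ask PyTorch to call hook_fn every time module m's forward runs
        self.hook = m.register_forward_hook(self.hook_fn)
    def hook_fn(self, module, input, output):
        self.features = output           # stash this layer's activations
    def close(self):
        self.hook.remove()               # detach the hook when we're done
```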
So now what we can do is we can create our VGG as before, and set it to not trainable 01:31:31.760 |
so we don't waste time and memory calculating gradients for it. 01:31:36.540 |
And let's go through and find all of the MaxPool layers. 01:31:41.800 |
So let's go through all of the children of this module, and if it's a MaxPool layer, let's note the index just before it. 01:31:51.140 |
So that's going to give me the layer before the MaxPool. 01:31:54.200 |
And so in general the layer before a MaxPool or the layer before a stride-2 conv is a very 01:31:59.600 |
interesting layer because it's the most complete representation we have at that grid cell size. 01:32:10.680 |
Because the very next layer is changing the grid. 01:32:14.040 |
So that seems to me like a good place to grab the content loss from is the best, most semantic, 01:32:22.520 |
most interesting content we have at that grid size. 01:32:26.040 |
So that's why I'm going to pick those indexes. 01:32:30.680 |
Those are the indexes of the last layer before each MaxPool in VGG. 01:32:37.820 |
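As a sketch, assuming m_vgg is the full VGG conv stack from earlier (before any slicing):

```python
block_ends = [i - 1 for i, layer in enumerate(m_vgg.children())
              if isinstance(layer, nn.MaxPool2d)]
# for VGG-16 with batch norm this comes out as [5, 12, 22, 32, 42]
sf = SaveFeatures(list(m_vgg.children())[block_ends[3]])   # hook the 32nd layer
```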
So I'm going to grab this one here, 32, for no particular reason, just to try something 01:32:44.840 |
So I'm going to say block_ends[3], and that's going to be 32. 01:32:50.940 |
So children of VGG indexed at block_ends[3] will give me the 32nd layer of VGG as a module. 01:33:02.700 |
And then if I call the SaveFeatures constructor, it's going to register a forward hook on that 32nd layer and store it in self.hook. 01:33:14.480 |
So now every time I do a forward pass on this VGG model, it's going to store the 32nd layer's activations. 01:33:29.760 |
So we can now say, see here I'm calling my VGG network, but I'm not storing it anywhere. 01:33:38.120 |
I'm not saying activations equals VGG of my image. 01:33:43.320 |
I'm calling it, throwing away the answer, and then grabbing the features that we stored in our SaveFeatures object. 01:33:56.640 |
So that way, this is now going to contain our activations, because this is a forward pass, and that's how you do a forward pass. 01:34:04.320 |
You don't say .forward, you just use it as a callable. 01:34:07.600 |
And using it as a callable on an nn.module automatically calls forward. 01:34:16.080 |
So we call it as a callable, that ends up calling our forward hook. 01:34:19.960 |
That forward hook stores the activations in sf.features. 01:34:24.640 |
And so now we have our target variable, just like before, but in a much more flexible way. 01:34:35.040 |
These are the same four lines of code we had earlier, I've just stuck them into a function. 01:34:39.240 |
And so it's just giving me my random image to optimize, and an optimizer to optimize that image. 01:34:46.920 |
This is exactly the same code as before, so that gives me these. 01:34:50.440 |
And so now I can go ahead and do exactly the same thing. 01:34:54.600 |
But now I'm going to use a different loss function, activation_loss_2, which doesn't require lopping the model off at layer 37. 01:35:01.200 |
Again, it calls m_vgg to do a forward pass, throws away the results, and grabs sf.features. 01:35:11.600 |
And so that's now my 32nd-layer activations, which I can then do my MSE loss on. 01:35:19.440 |
You might have noticed the last loss function and this one are both multiplied by a thousand. 01:35:26.520 |
Again, this was one of those things that was trying to stop this lesson from working correctly. 01:35:31.680 |
I didn't used to have the thousand there, and it wasn't training. 01:35:36.000 |
Lunchtime today, nothing was working, after days of trying to get this thing to work. 01:35:43.360 |
And finally, I just randomly noticed that the loss function numbers were really small. 01:35:53.800 |
And I just thought, what if they weren't so low? 01:35:56.240 |
So I multiplied them by a thousand and it started working. 01:36:01.400 |
Because we're doing single precision floating point, and single precision floating point isn't that accurate. 01:36:07.360 |
And particularly once you're getting gradients that are kind of small and then you're multiplying 01:36:10.840 |
the learning rate, it can be kind of small and you end up with a small number. 01:36:14.760 |
And if it's so small, it can get rounded to zero; that's what was happening, and my model just wasn't training. 01:36:22.720 |
So I'm sure there are better ways than multiplying by a thousand, but whatever, it works fine. 01:36:27.960 |
It doesn't matter what you multiply a loss function by, because all you care about is minimizing it, and scaling doesn't change where the minimum is. 01:36:37.560 |
And interestingly, this is actually something similar for when we were training ImageNet, 01:36:41.800 |
we were using half-precision floating point because the Volta tensor cores require that. 01:36:47.760 |
And it's actually a standard practice if you want to get the half-precision floating point 01:36:53.920 |
to train, you actually have to multiply the loss function by a scaling factor. 01:37:03.040 |
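The simplest form of that trick is static loss scaling; here's a hedged sketch with illustrative names (model, criterion, and so on), leaving out the fp32 master weights that a full implementation such as fastai's also keeps:

```python
scale = 512.0                             # any reasonably large power of two
loss = criterion(output, target) * scale  # scale up so fp16 gradients don't round to zero
loss.backward()
for p in model.parameters():
    if p.grad is not None:
        p.grad.data.div_(scale)           # undo the scaling before the update
optimizer.step()
```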
And I think FastAI is now the first library that has all of the tricks necessary to train in half precision. 01:37:11.240 |
So if you have a Volta or you can pay for a P3, if you've got a learner object, you can 01:37:18.800 |
just say "learn.half" and it'll now just magically train correctly in half-precision floating point. 01:37:27.600 |
It's built into the model data objects as well, it's all automatic, and I'm pretty sure no other library does it. 01:37:36.960 |
So this is just doing the same thing on a slightly earlier layer. 01:37:40.640 |
And you can see that the later layer doesn't look very bird-like at all, but you can kind 01:37:47.760 |
of tell it's a bird, slightly earlier layer, more bird-like. 01:37:51.760 |
And hopefully that makes sense to you that earlier layers are getting closer to the pixels. 01:38:01.720 |
There's a finer grid, well, there's more grid cells, each cell is smaller, a smaller 01:38:09.760 |
receptive field, less complex semantic features. 01:38:14.260 |
So the earlier we get, the more it's going to look like a bird. 01:38:18.840 |
And in fact, the paper has a nice picture of that showing various different layers and 01:38:26.200 |
kind of zooming into this house, they're trying to make this house look like this picture. 01:38:30.360 |
And you can see that later on it's pretty messy and earlier on it looks like this. 01:38:40.080 |
And I will say one of the things I've noticed in our study group is anytime I answer somebody's 01:38:45.960 |
question by saying "read the paper, there's a thing in the paper that tells 01:38:51.880 |
you the answer to that question", there's always this shocked look. 01:38:56.160 |
"Read the paper? Me? The paper?" But seriously, the papers have done these experiments and 01:39:04.380 |
drawn the pictures, like there's all this stuff in the papers. 01:39:08.800 |
It doesn't mean you have to read every part of the paper, but at least look at the pictures. 01:39:14.320 |
So check out the Gatys paper, it's got nice pictures. 01:39:20.320 |
So they've done the experiment for us, they basically did this experiment, but it looks 01:39:25.320 |
like they didn't go as deep, they just got some earlier ones. 01:39:30.080 |
The next thing we need to do is to create style loss. 01:39:33.920 |
So we've already got the loss, which is how much like the bird is it. 01:39:39.520 |
Now we need how much like this painting style is it. 01:39:47.480 |
We're going to grab the activations of some layer. 01:39:51.040 |
Now the problem is that the activations of some layer, let's say it was a 5x5 layer. 01:40:02.120 |
Of course there are no 5x5 layers at 224x224, but we'll pretend: 5 by 5, by 19 channels, totally unrealistic but easier to draw. 01:40:20.960 |
So here's some activations, and we could get these activations both for the image we're optimizing and for our Van Gogh painting. 01:40:39.480 |
I downloaded this from Wikipedia, and I was wondering why it was taking so long to load. 01:40:44.240 |
It turns out that the Wikipedia version I downloaded was 30,000 by 30,000 pixels. 01:40:49.840 |
It's pretty cool, they've got this like serious gallery-quality archive stuff there, I didn't 01:40:56.200 |
know it existed, so don't try and run a neural net on that. 01:41:07.740 |
So we can do that for our Van Gogh image and we can do that for our optimized image. 01:41:14.880 |
And then we could compare the two, but we would end up creating an image that has the content 01:41:20.520 |
of the painting, not just its style. 01:41:23.440 |
We want something with the same style, but that isn't the painting and doesn't have its content. 01:41:27.680 |
So we actually want to throw away all of the spatial information. 01:41:33.160 |
We're not trying to create something that has a moon here and stars here, and a church there. 01:41:44.920 |
So how do we throw away all the spatial information? 01:41:48.720 |
What we do is let's grab, in this case there are like 19 faces on this, like 19 slices. 01:41:58.720 |
So let's grab this top slice, so that's going to be a 5x5 matrix. 01:42:27.880 |
Now in one stroke, we've thrown away the bulk of the spatial information by flattening it. 01:42:37.320 |
Now let's grab a second slice, another channel, and do the same thing. 01:42:55.740 |
So here's channel 1, flattened, here's channel 2, flattened, and they've both got 25 elements. 01:43:04.000 |
And now let's take the dot product, which we can do with @, and the dot product's going to give us one number. 01:43:21.920 |
Well, assuming this is somewhere around the middle layer of the VGG network, we might 01:43:31.480 |
expect some of these activations to be like how textured is the brush stroke, and some 01:43:36.800 |
of them to be like how bright is this area, and some of them to be like is this part of 01:43:41.960 |
a house or part of a circular thing, or other parts to be how dark is this part of the painting. 01:43:51.540 |
And so a dot product, remember, is basically a correlation. 01:43:58.920 |
If this element and this element are both highly positive or both highly negative, it 01:44:06.000 |
gives us a big result, whereas if they're the opposite, it gives us a big negative result. 01:44:11.120 |
If they're both close to zero, it gives no result. 01:44:13.640 |
So it's basically a dot product as a measure of how similar these two things are. 01:44:19.240 |
And so if the activations of channel 1 and channel 2 are similar, let's give an example. 01:44:29.600 |
Let's say this first one was like how textured are the brush strokes, and this one here was 01:44:37.560 |
like how diagonally oriented are the brush strokes. 01:44:43.320 |
And if both of these were high together and both of these were high together, then it's 01:44:47.800 |
basically saying anywhere that there's more textured brush strokes, they tend to be diagonal. 01:44:55.160 |
Another interesting one is what would be the dot product of C1 with C1? 01:45:03.440 |
So that would be basically the 2-norm squared, the sum of the squares of that channel. 01:45:11.640 |
Which in other words is basically just, let's go back, I screwed this up. 01:45:24.200 |
Channel 1 might be texture, and channel 2 might be diagonal, and this one here would 01:45:33.200 |
be cell 1,1, and this cell here would be cell 4,2. 01:45:41.400 |
What I should have been saying is if these are both high at the same time, and these 01:45:46.400 |
are both high at the same time, then it's saying grid cells that have texture tend to 01:45:57.160 |
The idea was right, I just drew it all wrong. 01:46:01.400 |
So this number is going to be high when grid cells that have texture also have diagonal brush strokes. 01:46:17.240 |
Whereas C1 dot product C1 is basically the 2-norm effectively, or the sum of the squares of channel 1. 01:46:38.680 |
And this is basically saying in how many grid cells the textured channel is active, and how strongly. 01:46:51.380 |
So in other words, C1 dot product C1 tells us how much textured painting is going on, 01:46:59.560 |
and C2 dot product C2 tells us how many diagonal paint strokes are going on. 01:47:10.800 |
So C3 dot product C3 would be how often do we have bright colored cells. 01:47:17.960 |
So what we could do then is we could create a matrix containing every pairing of channels: channel 01:47:28.120 |
1, channel 2, channel 3, and so on, against channel 1, channel 2, channel 3, and so on -- man, 01:47:38.860 |
it's been a long day -- there are 19 channels, so it's a 19 by 19 matrix. 01:47:48.760 |
Channel 1, channel 2, channel 3, up to channel 19, down one side, and channel 1, channel 2, channel 3, up to channel 19, across the other. 01:47:59.280 |
And so this would be the dot product of channel 1 with channel 1, this would be the dot product 01:48:04.400 |
of channel 2 with channel 2, and so forth, after flattening. 01:48:11.920 |
And like we've discussed, mathematicians have to give everything a name. 01:48:17.040 |
So this particular matrix, where you flatten something out and then do all the dot products, has a name. 01:48:29.900 |
And I'll tell you a secret, most deep learning practitioners either don't know or don't remember 01:48:37.400 |
all these things, like what a Gram Matrix is; if they ever did study it at university, they 01:48:41.920 |
probably forgot it because they had a big night afterwards. 01:48:44.880 |
And the way it works in practice is like you realize, oh, I could create a kind of non-spatial 01:48:51.120 |
representation of how the channels correlate with each other, and then when I write up 01:48:57.100 |
the paper I have to go and ask around and say, does this thing have a name? 01:49:01.520 |
And somebody would be like, isn't it a Gram Matrix? 01:49:06.240 |
So don't think you have to go and study all of math first. 01:49:09.920 |
You use your intuition and common sense and then you worry about what the math is called later. 01:49:17.720 |
Sometimes it works the other way, not with me, because I can't do math. 01:49:23.400 |
So this is called the Gram Matrix, and of course if you're a real mathematician it's 01:49:26.460 |
very important that you say this as if you always knew it was a Gram Matrix and you kind 01:49:32.200 |
of just go, oh yes, we just calculate the Gram Matrix, that's really important. 01:49:38.900 |
So the Gram Matrix then is this kind of map, and the diagonal is perhaps the most interesting bit. 01:49:51.720 |
The diagonal is like which channels are the most active, and then the off-diagonal is which channels tend to be active together. 01:50:01.600 |
And overall, if two pictures have the same style, then we're expecting that some layer 01:50:09.800 |
of activations, they will have similar Gram Matrices. 01:50:14.580 |
Because if we found the level of activations that capture a lot of stuff about paint strokes 01:50:19.560 |
and colors and stuff, the diagonal alone might even be enough. 01:50:25.960 |
And that's another interesting homework assignment if somebody wants to take it, is try doing 01:50:31.260 |
Gatys style transfer, not using the Gram Matrix, but just using the diagonal of the Gram Matrix. 01:50:38.120 |
And that would be like a single line of code to change, but I haven't seen it tried. 01:50:43.040 |
I don't know if it would work at all, but it might work fine. 01:50:52.960 |
I was going to say I have tried that, and it works most of the time except when you 01:50:56.800 |
have funny pictures where you need two styles to appear in the same spot. 01:51:00.880 |
So if you have grass in one half and a crowd in one half, and you need the two styles. 01:51:07.200 |
You still want to do your homework, but Christine says she'll do it for you. 01:51:27.280 |
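If you do take up that homework, here's a hypothetical sketch of the one-line variant, assuming the gram function we'll write in a moment:

```python
def gram_diag_mse_loss(input, target):
    # compare only per-channel magnitudes, throwing away cross-channel correlations
    return F.mse_loss(gram(input).diagonal(), gram(target).diagonal())
```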
I've tried to resize the painting so it's the same size as my bird picture. 01:51:42.120 |
It doesn't matter too much which bit I use as long as it's got a nice style in it. 01:51:48.680 |
I grab my optimizer and my random image just like before. 01:51:53.760 |
And this time I call save features for all of my blockends, and that's going to give 01:51:59.320 |
me an array of SaveFeatures objects, one for each module that appears just before a MaxPool layer. 01:52:09.480 |
Because this time I want to play around with different activation layer styles, or more 01:52:17.160 |
specifically I want to let you play around with it. 01:52:24.160 |
So now I call my VGG module on my image again. 01:52:44.520 |
So I take my style image, put it through my transformations to create my transformed style image. 01:52:49.240 |
I turn that into a variable, put it through the forward pass of my VGG module, and now 01:52:55.560 |
I can go through all of my save features objects and grab each set of features. 01:53:01.720 |
And notice I call clone, because later on if I call my VGG object again, it's going to replace those stored features. 01:53:11.200 |
I haven't quite thought about whether this is necessary. 01:53:13.360 |
If you take it away, it's fine, but I was just being careful. 01:53:18.000 |
So here's now an array of the activations at every block and layer. 01:53:30.840 |
And being able to whip up a list comprehension really quickly is really 01:53:35.200 |
important in your Jupyter fiddling around, because you want to be able to immediately 01:53:39.720 |
see the grid size halving, as we would expect, because all of these appear just before a MaxPool. 01:53:53.440 |
So to do a gram MSE loss, it's going to be the MSE loss on the gram matrix of the input 01:54:05.080 |
And the gram matrix is just the matrix multiply of x with x transpose, where x is simply equal 01:54:15.140 |
to my input, where I've flattened the batch and channel axes all down together. 01:54:23.680 |
And I've already got one image, so you can kind of ignore the batch part, basically channel, 01:54:29.400 |
and then everything else, which in this case is the height and width, is the other dimension. 01:54:33.600 |
So this is now going to be channel by height and width, and then as we discussed we can 01:54:39.400 |
then just do the matrix multiply of that by its transpose. 01:54:44.400 |
And just to normalize it, we'll divide that by the number of elements. 01:54:49.680 |
It would actually be more elegant if I had said "divided by input.numel()". 01:55:04.880 |
And then again, this kind of gave me tiny numbers, so I multiply it by a big number to get it into a reasonable range. 01:55:14.000 |
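Putting those few lines together, the Gram loss looks roughly like this; the 1e6 is the same kind of scaling fudge as the thousand earlier:

```python
import torch
import torch.nn.functional as F

def gram(input):
    b, c, h, w = input.size()
    x = input.view(b * c, -1)            # flatten batch*channel against height*width
    return torch.mm(x, x.t()) / input.numel() * 1e6

def gram_mse_loss(input, target):
    return F.mse_loss(gram(input), gram(target))
```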
So now my style loss is to take my image to optimize, throw it through the VGG forward pass, 01:55:21.160 |
grab an array of the features from all of the SaveFeatures objects, and then call my gram_mse_loss on every one. 01:55:37.720 |
Now you could add them up with different weightings, you could add up a subset, whatever, in this 01:55:46.880 |
case I'm just grabbing all of them, pass that into my optimizer as before, and here we have 01:55:56.560 |
a random image in the style of Van Gogh, which I think is kind of cool. 01:56:06.880 |
Here is different layers of random image in the style of Van Gogh. 01:56:13.360 |
And so the first one, as you can see, the activations are simple geometric things, not really an interesting style at all. 01:56:23.300 |
So we kind of have a suspicion that we probably want to use later layers largely for our style loss. 01:56:38.280 |
I added this SaveFeatures.close; remember I stored the hook in the constructor, and 01:56:50.880 |
so hook.remove gets rid of it, and it's a good idea to get rid of it because otherwise it keeps using memory. 01:57:01.560 |
So at the end I go through each of my save_features objects and close it. 01:57:08.760 |
So style_transfer is adding the two together with some weight. 01:57:21.080 |
Grab my optimizer, grab my image, and now my combined loss is the MSE loss at one particular 01:57:28.280 |
layer, plus my style loss at all of my layers; sum up the style losses, add them to the content loss, and return the total. 01:57:38.960 |
Actually the style loss I scaled already by 1e6, and counting the zeros on this one: 1, 2, 3, 4, 5, 6, so 1e6 as well. 01:57:47.480 |
So actually they're both scaled exactly the same, add them together, and again you could 01:57:53.000 |
try weighting the different style_losses, or you could remove some of them, whatever. 01:58:02.160 |
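A sketch of that combined loss, with illustrative names: sfs for the list of SaveFeatures objects, targ_styles for the saved style activations, and targ_content for the saved content activations.

```python
def comb_loss(x):
    m_vgg(x)                              # forward pass; the hooks fill in sfs
    outs = [sf.features for sf in sfs]
    style_losses = [gram_mse_loss(o, s) for o, s in zip(outs, targ_styles)]
    content_loss = F.mse_loss(outs[3], targ_content) * 1e6   # six zeros, as above
    return content_loss + sum(style_losses)
```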
Train that, and holy shit, it actually looks good. 01:58:20.040 |
The main takeaway here is if you want to solve something with a neural network, all you've 01:58:31.180 |
got to do is set up a loss function and then optimize something. 01:58:39.000 |
The loss function is something where a lower number means you're happier with the result. 01:58:44.880 |
Because then when you optimize it, it's going to make that number as low as it can, and it will have done the thing you wanted. 01:58:51.880 |
So here we came up with a loss function that does a good job of being a smaller number 01:59:02.680 |
when it looks like the thing we want it to look like, and it looks like the style of 01:59:09.480 |
When it actually comes to it, apart from implementing gram_mse_loss, which was like 6 lines of code, 01:59:17.400 |
that's our loss function; pass it to our optimizer, wait about 5 seconds, and we're done. 01:59:27.380 |
And remember, we could do a batch of these at a time. 01:59:29.360 |
So we could wait 5 seconds and 64 of these will be done. 01:59:38.920 |
Once this paper came out, it's really inspired a lot of interesting work. 01:59:47.400 |
To me though, most of the interesting work hasn't happened yet, because to me the interesting 01:59:51.640 |
work is the work where you combine human creativity with these kinds of tools. 01:59:59.880 |
I haven't seen much in the way of tools that you can download or use where the artist is really in the loop. 02:00:10.960 |
It's interesting, talking to the guys at the Google Magenta project, which is their Creative 02:00:17.160 |
AI project, all of the stuff they're doing with music is specifically about this. 02:00:22.720 |
It's building tools that musicians can use to perform in real time. 02:00:27.300 |
And so you'll see much more of that on the music space thanks to Magenta. 02:00:30.880 |
If you go to their website, there's all kinds of things where you can press the buttons 02:00:34.520 |
to change the drum beats or melodies or keys or whatever. 02:00:40.040 |
You can definitely see Adobe and Nvidia starting to release little prototypes that have started to do this kind of thing. 02:00:49.040 |
This kind of creative AI explosion hasn't happened yet. 02:00:55.160 |
I think we have pretty much all the technology we need, but no one's put it together into 02:00:59.760 |
a thing and said look at the thing I built and look at the stuff that people built with it. 02:01:16.600 |
The paper that I mentioned at the start of class in passing, the one where we can add 02:01:23.200 |
Captain America's shield to arbitrary paintings, basically used this technique. 02:01:31.760 |
The trick was some minor tweaks to make the pasted Captain America shield blend in nicely. 02:01:42.520 |
That paper's only a couple of days old, so that would be an interesting project to try. 02:01:49.040 |
You can use all this code, it really does leverage this approach. 02:01:56.380 |
You could start by making the content image be like the painting with the shield, and 02:02:04.720 |
then the style image could be the painting without the shield. 02:02:08.800 |
That would be a good start, and then you could kind of see what specific problems they're 02:02:12.160 |
trying to solve in this paper to make it better. 02:02:24.280 |
Let's make a quick start on the next bit, which is, yes, Rachel. 02:02:37.160 |
Earlier there were a number of people that expressed interest in your thoughts on Pyro and probabilistic programming. 02:02:49.360 |
So TensorFlow's now got this TensorFlow Probability or something. 02:02:54.800 |
There's a bunch of probabilistic programming frameworks out there. 02:03:01.760 |
I think they're intriguing, but as yet unproven in the sense that I haven't seen anything 02:03:15.480 |
done with any probabilistic programming system which hasn't been done better without them. 02:03:22.720 |
The basic premise is that it allows you to create more of a model of how you think the world works, and then plug your data into it. 02:03:34.760 |
Back when I used to work in management consulting 20 years ago, we used to do a lot of stuff 02:03:39.000 |
where we would use a spreadsheet and then we would have these Monte Carlo simulation 02:03:45.440 |
There's one called @RISK and one called Crystal Ball, I don't know if they still exist 02:03:48.780 |
decades later, but basically they would let you change a spreadsheet cell to say this 02:03:54.200 |
is not a specific value, but it actually represents a distribution of values with this mean and 02:03:59.360 |
the standard deviation, or it's got this distribution. 02:04:02.460 |
And then you would hit a button and the spreadsheet would recalculate a thousand times pulling 02:04:07.520 |
random numbers from the distributions and show you the distribution of your outcome 02:04:11.400 |
that might be some profit or market share or whatever, and we used them all the time 02:04:20.920 |
I partly think that a spreadsheet is a more obvious place to do that kind of work because 02:04:26.200 |
you can see it all much more naturally, but at this stage I hope it turns out to be useful 02:04:39.280 |
because I find it very appealing and it kind of appeals to, as I say, the kind of work I used to do. 02:04:46.400 |
There's actually whole practices around this stuff that you used to call systems dynamics, 02:04:50.080 |
which really was built on top of this kind of stuff, but I don't know, it's not quite taken off. 02:04:56.880 |
Then there was a question about pre-training for a generic style transfer. 02:05:09.600 |
I don't think you can pre-train for a generic style, but you can pre-train for a generic 02:05:16.840 |
photo for a particular style, which is where we're going to get to, although it may end 02:05:27.200 |
up being homework, I haven't decided, but I'm going to do all the pieces. 02:05:32.020 |
One more question is, "Please ask him to talk about multi-GPU." 02:05:49.520 |
Before we do, just another interesting picture from the Gatys paper; they've got a few 02:05:54.800 |
more that didn't fit in my slide here, but different convolutional layers for the style, 02:06:00.680 |
different style to content ratios, and here's the different images. 02:06:05.840 |
Obviously this isn't Van Gogh anymore, this is a different combination. 02:06:10.160 |
You can see if you just do all style, you don't see any of the content; if you do lots of 02:06:16.160 |
content but use a low enough convolutional layer, it looks okay, but the background's 02:06:22.080 |
kind of dumb, so you kind of want somewhere around here or here. 02:06:27.880 |
You can play around and experiment, but also use the paper to help guide you. 02:06:34.160 |
I think I might talk through the math now, and we'll cover multi-GPU and super-resolution a bit later. 02:06:44.000 |
I think this is from the paper, and one of the things I really do want you to do after 02:06:49.280 |
we talk about a paper is to read the paper and then ask questions on the forum about anything that isn't clear. 02:06:56.240 |
But there's kind of a key part of this paper which I wanted to talk about and discuss how 02:07:03.680 |
So the paper says we're going to be given an input image, x, and this little thing means 02:07:11.240 |
it's a vector, but this one's a matrix, I guess it could mean either. 02:07:28.760 |
So normally small-letter bold means vector, or small-letter with doobie on top means vector, 02:07:36.440 |
they can both mean vector, and normally big-letter means matrix, or small-letter with two doobies on top means matrix. 02:07:46.720 |
We are going to basically treat it as a vector, so maybe we're just getting ahead of ourselves. 02:07:52.000 |
So we've got an input image, x, and it can be encoded in a particular layer of the CNN by the filter responses. 02:07:59.800 |
So the activations, filter responses are activations. 02:08:03.700 |
So hopefully that's something you all understand, that's basically what a CNN does, is it produces activations. 02:08:11.480 |
A layer has a bunch of filters which produce a number of channels, and so this here says 02:08:17.600 |
that layer number L has capital NL filters, and again this capital does not mean matrix. 02:08:26.720 |
So I don't know, math notation is so inconsistent. 02:08:30.580 |
So capital N_l distinct filters at layer l, which means it has also that many feature maps. 02:08:39.600 |
So make sure you can see that this letter is the same as this letter. 02:08:42.360 |
So you've got to be very careful to read the letters and recognize it's like snap, that's the same letter. 02:08:49.560 |
So obviously N_l filters create N_l feature maps or channels, and each one is 02:08:57.440 |
of size M_l; okay, so I can see this is where the unrolling is happening, each feature map is of size 02:09:04.600 |
M little l, so that's like m[l] in numpy notation, it's for the lth layer. 02:09:15.960 |
And the size M_l is height times width, so we've flattened it out. 02:09:22.840 |
So the responses of that layer l can be stored in a matrix F, and now the l goes at the top: F^l. 02:09:31.680 |
So this is not F to the power of l, this is just another index; we're just moving it up there. 02:09:38.660 |
And this thing here where we say it's an element of R, this is a special R meaning the real 02:09:42.640 |
numbers, to the N_l times M_l; this is saying that the dimensions of this are N_l by M_l. 02:09:48.840 |
So this is really important, you don't move on, it's just like with PyTorch, making sure 02:09:53.360 |
that you understand the rank and size of your dimensions first. 02:09:57.160 |
Same with math, these are the bits where you stop and think, why is it n by m? 02:10:03.640 |
So n is the number of filters, m is height by width, so do you remember that thing where 02:10:09.120 |
we did view batch times channel comma minus 1? 02:10:31.640 |
If I was nicer to you, I would have used the same letters as the paper, but I was too busy 02:10:36.440 |
getting this damn thing working to do that carefully. 02:10:39.920 |
So you can go back and rename it as capital F. 02:10:44.160 |
This is why we moved the l to the top, because we're now going to have some more indexing. 02:10:48.560 |
So like where else in NumPy or PyTorch we index things by square brackets and then lots 02:10:53.220 |
of things with commas between, the approach in math is to surround your letter by little 02:10:59.320 |
letters all around it, and just throw them up there everywhere. 02:11:03.460 |
So here F^l is the lth layer's matrix, and then ij means the activation of the i-th filter at position j in layer l. 02:11:14.600 |
So position j goes up to size M_l, which is height times width. 02:11:20.640 |
This is the kind of thing that would be easy to get confused. 02:11:22.640 |
Like often you'd see an ij and assume that's like indexing into a position of an image 02:11:27.400 |
like height by width, but it's totally not, is it? 02:11:31.400 |
It's indexing into channel by flattened image, and it even tells you it's the i-th filter, 02:11:40.240 |
the i-th channel in the jth position in the flattened out image in layer l. 02:11:47.960 |
So you're not going to be able to get any further in the paper unless you understand this indexing. 02:11:56.420 |
So that's why these are the bits where you stop and make sure you're comfortable. 02:12:04.000 |
So now the content loss I'm not going to spend much time on, but basically we're going to 02:12:10.920 |
just take the difference between the activations and the target activations, squared. 02:12:21.440 |
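For reference, the paper writes that content loss as

$$\mathcal{L}_{\text{content}}(\vec{p}, \vec{x}, l) = \frac{1}{2} \sum_{i,j} \left( F^{l}_{ij} - P^{l}_{ij} \right)^{2}$$

where P^l holds the activations of the original photo p and F^l those of the image x being generated.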
So there's our content loss, and the style loss will be much the same thing, but using the Gram matrix. 02:12:29.280 |
And I really wanted to show you this one because sometimes I really like things you can do 02:12:34.080 |
in math notation, and they're things you can also generally do in J and APL, which is kind of cool. 02:12:43.360 |
What this is saying is there's a whole bunch of values of i and a whole bunch of values 02:12:48.000 |
of j, and I've got to define g for all of them. 02:12:52.280 |
And there's a whole bunch of values of l as well, and I've got to define g for all of 02:12:56.840 |
And so for all of my g at every l, at every i, at every j, it's going to be equal to something. 02:13:03.200 |
And you can see that something has an i and a j and an l, so matching these, and it also has a k. 02:13:17.160 |
Well it's saying that my Gram matrix in layer l for the i-th channel, well these aren't channels 02:13:26.720 |
anymore, in the i-th position in one axis, in the j-th position in another axis, is equal 02:13:33.080 |
to my F matrix, so my flattened out matrix, for the i-th channel in that layer versus the j-th channel. 02:13:49.900 |
And then I'm going to sum over, see this k and this k, they're the same letter. 02:13:55.720 |
So we're going to take the k-th position and multiply them together and then add them all up. 02:14:01.800 |
So that's exactly what we just did before when we calculated our Gram matrix. 02:14:06.400 |
So there's a lot going on because of some very neat notation, which is there are three 02:14:15.360 |
implicit loops all going on at the same time, plus one explicit loop in the sum, and then 02:14:22.040 |
they all work together to create this Gram matrix for every layer. 02:14:26.960 |
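Written out, that definition is

$$G^{l}_{ij} = \sum_{k} F^{l}_{ik} \, F^{l}_{jk}$$

so the implicit loops run over every layer l and every pair of channels i and j, while the explicit sum runs over the flattened positions k.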
So let's go back and see if you can match this. 02:14:38.000 |
So all that's kind of happening all at once, which I think is pretty great. 02:14:47.000 |
So next week we're going to be looking at a very similar approach, basically doing style 02:14:52.180 |
transfer all over again, but in a way where we're actually going to train a neural network 02:14:56.880 |
to do it for us rather than having to do the optimization. 02:15:00.480 |
We'll also see that you can do the same thing to do super resolution, and we're also going 02:15:05.440 |
to go back and revisit some of that SSD stuff as well as doing some segmentation. 02:15:15.280 |
So if you've forgotten SSD, it might be worth doing a little bit of revision this week.